Chapter 01
The Cost of Bigger
It's tempting to think a model just gets smarter the bigger you make it. It does, but along a curve that flattens. Capability grows roughly with the logarithm of scale: each time you multiply the compute by ten, you buy about the same modest bump in quality, so every bump costs ten times more than the last. And as the model approaches the limit of what the training data can teach, even those bumps shrink.
Meanwhile, the cost of running a model grows almost linearly with its size — and you pay that cost on every single query, forever. A model twice as big costs roughly twice as much per answer while being only a little better. That mismatch is the whole story.
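To make the mismatch concrete, here is a toy sketch in Python. It assumes the two shapes described above and nothing more: cost per query exactly proportional to size, and a "capability score" that rises by a fixed amount for every tenfold increase in size. The function names, the score scale, and the constants are illustrative assumptions, not measurements of any real model.

```python
import math


def toy_cost_per_query(params: float, cost_per_param: float = 1.0) -> float:
    """Assumed linear cost: twice the parameters, twice the cost per answer."""
    return cost_per_param * params


def toy_capability(params: float, gain_per_tenfold: float = 1.0) -> float:
    """Assumed logarithmic capability: a fixed gain for every 10x in size."""
    return gain_per_tenfold * math.log10(params)


def compare(small: float, big: float) -> tuple[float, float]:
    """Return (how many times more the big model costs, how much score it gains)."""
    cost_ratio = toy_cost_per_query(big) / toy_cost_per_query(small)
    capability_gain = toy_capability(big) - toy_capability(small)
    return cost_ratio, capability_gain


if __name__ == "__main__":
    # Hypothetical sizes: 1 billion parameters versus 2, 10, and 100 billion.
    for big in (2e9, 1e10, 1e11):
        cost_ratio, gain = compare(1e9, big)
        print(f"{cost_ratio:>5.0f}x the cost buys {gain:.2f} extra score points")
```

Under these assumptions, doubling the model doubles the bill for about a third of a point, and a hundredfold bill buys only two points. The exact numbers are invented; the shape of the trade is what matters.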
There's also a supply problem. To justify a bigger model you must feed it proportionally more text: a rough rule of thumb is about 20 tokens of training data per parameter, where a token is a word piece a bit shorter than a typical word. Push far enough and you simply run out of good writing: a model with trillions of parameters would "want" more clean text than humanity has ever produced. This is the data wall.
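The arithmetic behind the data wall is short enough to write down. The sketch below takes the rule of thumb of about 20 units of training text per parameter from the paragraph above and asks two questions: how much text does a given model want, and how big a model can a given stock of text support? The stock figure in the usage example is a placeholder chosen only to show the mechanics, not an estimate of how much clean text exists.

```python
TRAINING_TEXT_PER_PARAM = 20  # rule of thumb quoted in the text


def training_text_wanted(params: float) -> float:
    """Amount of training text a model of this size 'wants' under the rule of thumb."""
    return TRAINING_TEXT_PER_PARAM * params


def largest_supported_model(available_text: float) -> float:
    """Largest parameter count a given stock of text can feed under the same rule."""
    return available_text / TRAINING_TEXT_PER_PARAM


def hits_data_wall(params: float, available_text: float) -> bool:
    """True when the model wants more text than the stock contains."""
    return training_text_wanted(params) > available_text


if __name__ == "__main__":
    placeholder_stock = 1e13  # hypothetical stock of clean text, for illustration only
    two_trillion_params = 2e12

    print(training_text_wanted(two_trillion_params))
    print(largest_supported_model(placeholder_stock))
    print(hits_data_wall(two_trillion_params, placeholder_stock))
```

Because the requirement is proportional to size, every tenfold jump in parameters asks for ten times more text, while the supply of good writing grows far more slowly.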
Cost grows about linearly with size; capability grows about logarithmically. There's no cliff — just a point past which you keep paying a lot more for a little more.
That's why the field stopped relying on scale alone and got cleverer. Mixture-of-experts models activate only a slice of their parameters for each token (roughly, each word piece), so you get capacity without paying the full cost on every query. Reasoning models spend extra compute at answer time instead of only making the base model larger.
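The mixture-of-experts idea can be sketched with a little counting. The Python below assumes a simplified layout: some parameters are shared and always run, the rest are split into equal-sized experts, and a router picks the few highest-scoring experts for each piece of text. All names and numbers are hypothetical, and real systems differ in many details.

```python
def pick_experts(router_scores: list[float], experts_per_token: int) -> list[int]:
    """Return the indices of the highest-scoring experts (simple top-k routing)."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:experts_per_token]


def moe_param_counts(shared_params: float, params_per_expert: float,
                     num_experts: int, experts_per_token: int) -> tuple[float, float]:
    """Return (total parameters stored, parameters actually used per token)."""
    if not 1 <= experts_per_token <= num_experts:
        raise ValueError("experts_per_token must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


if __name__ == "__main__":
    # Hypothetical model, sizes in billions of parameters:
    # 10 shared, plus 8 experts of 5 each, with 2 experts used per token.
    total, active = moe_param_counts(10, 5, num_experts=8, experts_per_token=2)
    print(total, active)  # 50 20

    # Made-up router scores for one token across 4 experts.
    print(pick_experts([0.1, 0.7, 0.05, 0.15], experts_per_token=2))  # [1, 3]
```

In this made-up example the model stores 50 billion parameters but runs only 20 billion for any one token, which is the sense in which you get capacity without paying the full cost.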