How a Language Model Actually Works

Chapter 01

The Cost of Bigger

It's tempting to think a model just gets smarter the bigger you make it. It does, but along a curve that flattens. Capability grows roughly with the logarithm of scale: each time you multiply the compute by ten, you buy about the same modest bump in quality, and even that bump shrinks as you approach the limit of what the training data can teach.

Meanwhile, the cost of running a model grows almost linearly with its size — and you pay that cost on every single query, forever. A model twice as big costs roughly twice as much per answer while being only a little better. That mismatch is the whole story.

Fig. 1 — Capability (teal) bends and flattens; cost (amber) keeps climbing in a straight line. The gap between them is why "just make it bigger" stops paying off.

There's also a supply problem. To justify a bigger model you must feed it proportionally more text; a rough rule of thumb is about 20 tokens (words or word-pieces) of training data per parameter. Push far enough and you simply run out of good writing: a model with trillions of parameters would "want" more clean text than humanity has ever produced. This is the data wall.
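To put numbers on the data wall, here is a minimal sketch that applies the rule of thumb above (roughly 20 units of training text per parameter) to a few model sizes. The constant and function names are illustrative, and the rule itself is a rough guide rather than an exact law.

```python
# Rough rule of thumb from the text: about 20 units of training text per parameter.
TEXT_UNITS_PER_PARAMETER = 20

def training_text_wanted(parameter_count):
    """Return how much training text a model of this size would 'want'."""
    return TEXT_UNITS_PER_PARAMETER * parameter_count

for parameter_count in (1_000_000_000, 100_000_000_000, 1_000_000_000_000):
    wanted = training_text_wanted(parameter_count)
    print(f"{parameter_count:>17,} parameters -> {wanted:>21,} units of text")
```

The demand grows in lockstep with model size, so a thousandfold bigger model asks for a thousandfold more text.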

Key idea

Cost grows about linearly with size; capability grows about logarithmically. There's no cliff — just a point past which you keep paying a lot more for a little more.
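The mismatch can be sketched with two toy curves. This is only an illustration under stated assumptions: a single made-up "scale" number stands in for both model size and compute, capability is taken to be exactly log10 of scale, and cost exactly equal to scale, all in arbitrary units.

```python
import math

def toy_capability(scale):
    """Toy capability score: grows with the logarithm of scale (arbitrary units)."""
    return math.log10(scale)

def toy_cost(scale):
    """Toy per-query cost: grows linearly with scale (arbitrary units)."""
    return float(scale)

for scale in (1, 10, 100, 1000):
    print(f"scale {scale:>5}x   cost {toy_cost(scale):>6.0f}   capability {toy_capability(scale):.1f}")
```

Each tenfold step multiplies the toy cost by ten while adding the same small amount of toy capability, which is the gap Fig. 1 describes.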

That's why the field stopped only scaling up and got cleverer: mixture-of-experts (only a slice of the parameters fire per word, so you get capacity without paying full cost), and spending compute at answer time (reasoning models) instead of only making the base model larger.

Chapter 02

What a Parameter Is

A parameter is not a fact the model stores. It's an adjustable number it multiplies by — one knob in a gigantic arithmetic pipeline. They're floating-point numbers (usually 16-bit, often squeezed to 8 or 4 bits to save memory). "A trillion parameters" means a trillion of these little multipliers, all dialed in during training.
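The memory arithmetic behind "squeezed to 8 or 4 bits" is simple enough to sketch. The function name below is illustrative; the calculation counts only the parameters themselves and ignores everything else a running model keeps in memory.

```python
def parameter_memory_bytes(parameter_count, bits_per_parameter):
    """Bytes needed to store the parameters alone (8 bits per byte)."""
    return parameter_count * bits_per_parameter / 8

ONE_TRILLION = 1_000_000_000_000

for bits in (16, 8, 4):
    terabytes = parameter_memory_bytes(ONE_TRILLION, bits) / 1e12
    print(f"{bits:>2}-bit parameters: {terabytes:.1f} TB")
```

Halving the bits per parameter halves the storage, which is why lower precision is attractive when memory is the bottleneck.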

The first thing the model does with your text is turn each word into a vector of numbers — an embedding. And here's the magic: directions in that space carry meaning. The classic demonstration is arithmetic on words.

Fig. 2 — Meaning lives in directions. Moving "up" adds female-ness; moving "right" adds royalty. So king − man + woman ≈ queen.
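The word arithmetic can be tried on a toy scale. The two-dimensional vectors below are hand-picked for illustration (first coordinate royalty, second female-ness, matching the figure's axes); real embeddings are learned and have thousands of coordinates, so the match there is approximate rather than exact.

```python
import math

# Hypothetical 2-D embeddings: (royalty, female-ness). Made up for illustration.
toy_embeddings = {
    "man": (0.0, 0.0),
    "woman": (0.0, 1.0),
    "king": (1.0, 0.0),
    "queen": (1.0, 1.0),
}

def nearest_word(target, embeddings):
    """Return the word whose vector lies closest to the target point."""
    return min(embeddings, key=lambda word: math.dist(embeddings[word], target))

king, man, woman = (toy_embeddings[w] for w in ("king", "man", "woman"))

# king - man + woman, computed coordinate by coordinate.
result = tuple(k - m + w for k, m, w in zip(king, man, woman))
answer = nearest_word(result, toy_embeddings)
print(result, answer)
```

Subtracting "man" and adding "woman" moves the point one step along the female-ness axis while leaving royalty untouched.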

Those word-vectors are learned numbers, so they count as parameters too — but they're a tiny slice. If the vocabulary is ~100,000 words and each gets a few thousand coordinates, all the word-vectors combined are well under a billion parameters. A model with hundreds of billions spends the other ~99% on the shared engine: the machinery that takes those vectors and transforms them, the same way for every word. "cat", "democracy", and "the" all flow through the identical knobs.
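A quick back-of-the-envelope check of that "tiny slice" claim, as a sketch: the vocabulary size comes from the text, while the vector width of 4,096 and the 100-billion-parameter total are assumed values chosen to match "a few thousand coordinates" and "hundreds of billions" conservatively.

```python
VOCABULARY_SIZE = 100_000            # "~100,000" from the text
EMBEDDING_WIDTH = 4_096              # assumed value for "a few thousand coordinates"
TOTAL_PARAMETERS = 100_000_000_000   # assumed model size for illustration

embedding_parameters = VOCABULARY_SIZE * EMBEDDING_WIDTH
share = embedding_parameters / TOTAL_PARAMETERS

print(f"{embedding_parameters:,} embedding parameters")
print(f"{share:.2%} of the whole model")
```

Even with these modest assumptions the word-vectors come in under one percent of the total.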

Key idea

You don't spend parameters per word. Each word is a small vector; the trillion parameters are mostly the reusable engine that manipulates those vectors. And no single knob means anything on its own — meaning lives in vast combinations, which is exactly why these models are so hard to interpret.

Chapter 03

Inside the Neuron

Everything is built from one tiny operation, repeated billions of times. A neuron takes several input numbers, multiplies each by its own weight, adds them up, adds a bias, and passes the result through a "squish" function. Those weights and the bias are the parameters:

output = squish( w₁·x₁ + w₂·x₂ + w₃·x₃ + … + b )

A concrete pass: inputs [1.0, 2.0, 0.5], weights [0.3, −0.1, 0.8], bias 0.2 →

(1.0×0.3) + (2.0×−0.1) + (0.5×0.8) + 0.2 = 0.7
squish (keep if positive) → 0.7

So this neuron emits 0.7. The weights are what make it care a lot about input #3 and slightly suppress input #2. Training is just the search for good values for all these weights.
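The worked pass above can be reproduced in a few lines. This sketch takes "keep if positive" to mean the function commonly called ReLU (negative sums become zero); the function names are illustrative.

```python
def squish(x):
    """'Keep if positive': pass positive values through, turn negatives into 0."""
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """Multiply each input by its weight, add them up, add the bias, then squish."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return squish(weighted_sum)

output = neuron(inputs=[1.0, 2.0, 0.5], weights=[0.3, -0.1, 0.8], bias=0.2)
print(round(output, 6))
```

Floating-point arithmetic may leave the result a hair away from 0.7, which is why the print rounds it.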

Fig. 3 — Every line is one parameter. Three inputs into four neurons is already 3×4 = 12 weights. Real layers have thousands of each, so a single layer holds millions.

A layer is just many neurons side by side, each with its own private weights. That bulk "multiply everything by everything and sum" is exactly a matrix multiplication — which is why GPUs, built for matrix math, run AI. Stack dozens of layers (the output of one feeds the next) and you get depth.

One precise point that's easy to get wrong: a single neuron outputs one number, not a vector. The vector handed to the next layer is built by the whole layer together — one number contributed by each neuron.

Fig. 4 — Neuron = one number. Layer = the vector. Each neuron reads the entire input vector and contributes exactly one slot of the output.
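A layer of four neurons reading three inputs can be sketched the same way. The first neuron reuses the weights from the worked pass earlier in this chapter; the other three rows are made-up numbers. Each row of the weight grid is one neuron's private weights, and the loop over rows is exactly a matrix-times-vector product.

```python
def squish(x):
    """'Keep if positive' activation."""
    return max(0.0, x)

def layer(input_vector, weight_rows, biases):
    """Each neuron reads the whole input vector and contributes one output slot."""
    return [
        squish(sum(w * x for w, x in zip(row, input_vector)) + bias)
        for row, bias in zip(weight_rows, biases)
    ]

input_vector = [1.0, 2.0, 0.5]
weight_rows = [
    [0.3, -0.1, 0.8],   # the neuron from the worked pass
    [0.5, 0.5, 0.0],    # the remaining rows are illustrative values
    [-1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
biases = [0.2, 0.0, 0.0, 0.0]

output_vector = layer(input_vector, weight_rows, biases)
weight_count = sum(len(row) for row in weight_rows)
print(len(output_vector), "outputs from", weight_count, "weights")
```

Three inputs and four neurons give twelve weights and a four-number output vector, one number per neuron.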

Chapter 04

The Sentence Is a Grid

Here's the assumption that trips everyone up: the input to the model is not one vector. It's one vector per token, sitting side by side. With six tokens you have six separate vectors. They never merge, and they never average into one.

Picture it as a grid: rows are tokens, and each row is that token's vector. In matrix terms, six tokens with a 4096-wide vector each is a 6 × 4096 grid. Every layer transforms all the rows in parallel.

Fig. 5 — Rows are tokens, columns are the ~4096 features. "Appending" a word just adds a new row (amber). The grid grows taller; nothing merges.
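The grid picture maps directly onto a list of rows. In this sketch the vectors are only four wide instead of ~4096, and toy_embed is a hypothetical stand-in for the real embedding lookup that returns made-up numbers.

```python
FEATURES = 4  # tiny stand-in for the ~4096 features in the text

def toy_embed(token):
    """Hypothetical embedding lookup: a made-up vector for each token."""
    return [float(len(token))] + [0.0] * (FEATURES - 1)

def shape(grid):
    """Return (number of token rows, features per row)."""
    return (len(grid), len(grid[0]))

tokens = ["the", "boat", "reached", "the", "river", "bank"]
grid = [toy_embed(token) for token in tokens]   # one row per token
shape_before = shape(grid)

grid.append(toy_embed("today"))                 # appending a token adds a row
shape_after = shape(grid)

print(shape_before, "->", shape_after)
```

The grid gets one row taller and stays the same width; no existing row is touched.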

So why did "add the vectors up and average them" feel right? Because summing does happen — just in narrow places inside the machinery, never to the sequence as a whole. A neuron sums its weighted inputs; a token (in the next chapter) pulls a weighted sum of others' information into its own row. But the grid always stays a grid: N tokens means N separate vectors, start to finish.

Chapter 05

Attention: How Tokens Talk

A word's meaning depends on its neighbors — bank means different things next to river versus money. The mechanism that lets tokens share context is attention, and it's the soul of the whole architecture.

It starts by giving every token three new vectors, each made by multiplying that token's vector by a learned matrix. These are not retrieved from a database — they're computed on the spot:

Query = vector × W_Q    "what am I looking for?"
Key = vector × W_K    "what do I advertise?"
Value = vector × W_V    "what do I hand over?"

Fig. 6 — The only lookup in the whole model is the embedding table. Everything after, including Q/K/V, is arithmetic. The "knowledge" lives in the matrices, not in retrievable vectors.

To process bank: take its Query and dot-product it against every token's Key. A big dot product means a strong match. Softmax turns those scores into weights that sum to 100%, then the token pulls in a weighted blend of every token's Value — dominated by whoever it matched.
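Here is a minimal sketch of the scoring step for the token bank: compute its Query and every token's Key with a matrix multiply, take dot products, and apply softmax. All vectors and matrices are tiny made-up values. Two details of real models are left out for simplicity: the scores are normally divided by the square root of the Key width before softmax, and text-generating models mask out later tokens so each token only scores itself and earlier ones.

```python
import math

def vector_times_matrix(vector, matrix):
    """Row vector times matrix: one output number per matrix column."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() numerically safe
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-D token vectors and learned matrices (made-up numbers).
token_vectors = {"the": [0.1, 0.0], "river": [1.0, 0.2], "bank": [0.4, 1.0]}
W_Q = [[0.0, 1.0], [1.0, 0.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]

bank_query = vector_times_matrix(token_vectors["bank"], W_Q)
keys = {t: vector_times_matrix(v, W_K) for t, v in token_vectors.items()}

scores = {t: dot(bank_query, k) for t, k in keys.items()}
weights = dict(zip(scores, softmax(list(scores.values()))))
best_match = max(weights, key=weights.get)
print(best_match, weights)
```

With these numbers river gets the largest score, so it takes the largest share of bank's attention.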

Fig. 7 — bank asks, river answers, bank absorbs the answer. Its vector shifts toward "riverbank." Every token attends to all tokens, itself included.

Do that for every token at once and you get the full web of connections — the attention matrix. Read each row as "this token's attention, spread across all tokens" (the rows sum to 100%).

Fig. 8 — The whole grid is "how tokens connect." The bright bank → river cell is our disambiguation; the warm diagonal is tokens partly attending to themselves.

Now zoom into just the bank row and watch what it does with those weights: multiply each token's Value by its weight, sum them, and add the blend to bank's own vector.

Fig. 9 — This is the weighted average you might have guessed at — but the weights are learned and content-dependent (river .69, not a flat .33), it's the Value vectors being mixed, and it updates one token's own slot.
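The blending step for the bank row can be sketched as follows. The weight of .69 for river comes from the figure caption; the other two weights, the Value vectors, and bank's starting vector are assumed numbers chosen so the weights sum to 1.

```python
# Attention weights for the "bank" row. Only river's .69 is from the caption.
attention_weights = {"the": 0.10, "river": 0.69, "bank": 0.21}

# Hypothetical 2-D Value vectors and bank's current vector (made-up numbers).
values = {"the": [0.0, 0.0], "river": [1.0, 0.0], "bank": [0.0, 1.0]}
bank_vector = [0.2, 0.5]

width = len(bank_vector)

# Multiply each token's Value by its weight and sum, coordinate by coordinate.
blend = [
    sum(attention_weights[token] * values[token][i] for token in values)
    for i in range(width)
]

# Add the blend to bank's own vector; the slot is updated, not replaced.
updated_bank_vector = [own + mixed for own, mixed in zip(bank_vector, blend)]
print(blend, updated_bank_vector)
```

Because river carries most of the weight, the blend is dominated by river's Value, and bank's vector shifts in that direction while keeping its own starting content.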

Every token does this simultaneously, each with its own weights — every row of the matrix is a different token enriching itself. And it repeats at every layer. This is what makes meaning context-dependent: before attention, bank's vector is the generic dictionary entry; attention specializes it based on the company it keeps.

Key idea

Nothing is replaced or averaged away. Each token keeps its slot and gets nudged by context — which is why bank can absorb river without the two ever merging into one vector.

Chapter 06

The Residual Stream

The natural mental model of depth is a relay: layer 1 hands a new vector to layer 2, which hands a newer one to layer 3, each replacing what came before. Modern transformers don't work that way — and the difference is one of the most important ideas in the whole design.

Each layer reads the running vector, computes its contribution, and adds it back. This is the skip connection (or residual connection). Picture a single vector flowing straight through the entire stack — the residual stream — with each layer writing an increment into it rather than overwriting it.

Fig. 10 — A layer has two halves: attention (tokens mix), then feed-forward (each token alone). Both results are added back. The vector at layer 30 is the original embedding plus every layer's contribution.

Two payoffs fall out of this. Information from early layers survives all the way to the end instead of being overwritten — the model can always "see back" to the raw meaning. And because each layer only has to learn a small adjustment to the stream, very deep networks stay trainable. The grid shape never changes: N tokens × ~4096 features, layer after layer.

Key idea

The "original" embedding only feeds layer 1 directly. Every layer after that reads the latest running total — the sum of the embedding and all contributions so far — and adds its own.
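The read, compute, add-back pattern can be sketched for a single token's vector. The two sublayer functions are hypothetical stand-ins with made-up arithmetic (real attention mixes information across tokens, which a one-vector toy cannot show); the point is only the bookkeeping of the running total.

```python
def toy_attention(vector):
    """Stand-in for the attention half of a layer (made-up arithmetic)."""
    return [0.1 * x for x in vector]

def toy_feed_forward(vector):
    """Stand-in for the feed-forward half of a layer (made-up arithmetic)."""
    return [0.5 * max(0.0, x) for x in vector]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def run_layers(embedding, layer_count):
    """Pass one vector through the stack, adding each contribution to the stream."""
    stream = list(embedding)
    contributions = []
    for _ in range(layer_count):
        for sublayer in (toy_attention, toy_feed_forward):
            update = sublayer(stream)      # read the latest running total
            contributions.append(update)
            stream = add(stream, update)   # add back, never overwrite
    return stream, contributions

embedding = [1.0, -2.0, 0.5]
final_stream, contributions = run_layers(embedding, layer_count=3)

# The final vector equals the embedding plus every contribution.
rebuilt = list(embedding)
for update in contributions:
    rebuilt = add(rebuilt, update)
print(final_stream == rebuilt)
```

Each sublayer sees the sum of everything written so far, and the original embedding is still present as one term of the final total.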

Chapter 07

Writing One Word at a Time

Now we can assemble the whole machine. A language model writes the way you'd autocomplete a sentence out loud: one word at a time, each new word chosen with all the previous ones in view. This is called autoregressive generation.

Run the entire grid through every layer. Then — and this is the surprising part — only the last row's final vector is turned back into a word. That one new token gets appended as a fresh row, and the whole grid runs through all the layers again. Repeat until the model emits a "stop."

Fig. 11 — The loop. Every other position is computed too, but only the last one's prediction is kept. The new word becomes part of the input for the next pass.

Here's the same thing zoomed out into the full pipeline, so you can see where each earlier chapter lives: embed text into vectors, run N layers (each one attention then feed-forward), then unembed the last position into one token — and loop.

Fig. 12 — The complete picture. Re-running every layer for every new word sounds wasteful, so the earlier tokens' Key and Value vectors are cached (the "KV cache") and only the new token is fully processed each step.
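The generation loop itself is short. In this sketch the whole model is replaced by toy_next_token, a hypothetical lookup table keyed on the entire sequence so far; a real model would run the grid through every layer at that point. The KV cache is an optimization inside the model and is not shown.

```python
STOP = "<stop>"

# Hypothetical stand-in for the model: full sequence so far -> next token.
TOY_CONTINUATIONS = {
    ("the", "cat"): "sat",
    ("the", "cat", "sat"): "down",
    ("the", "cat", "sat", "down"): STOP,
}

def toy_next_token(tokens):
    """Pretend to run the model: look at the whole sequence, return one token."""
    return TOY_CONTINUATIONS.get(tuple(tokens), STOP)

def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict one token, append it, run again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        new_token = toy_next_token(tokens)   # conditioned on everything so far
        if new_token == STOP:
            break
        tokens.append(new_token)             # the new token becomes input
    return tokens

generated = generate(["the", "cat"])
print(generated)
```

The max_new_tokens limit is a safety bound for the case where the model never emits a stop.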

Chapter 08

From Vector Back to Word

One question closes the loop. If the last layer already produced a perfectly good vector, why turn it into a word and then immediately turn that word back into a vector to continue? Isn't the round-trip wasteful?

It isn't, and the reason is the deepest idea in the chapter. That final vector is not a word. Unembedding turns it into a probability distribution over the entire vocabulary: "70% yes, 12% sure, 8% absolutely…", a blur of many possible next words at once. Picking from that distribution is the moment the model commits to a single definite word.

Fig. 13 — You can't continue from a blur — every position in the sequence has to be a definite token. Picking is also the only place randomness (temperature, creativity) can enter.
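The unembed-and-pick step can be sketched as: score every vocabulary word against the final vector, turn the scores into probabilities with softmax, then choose. The four-word vocabulary, the unembedding matrix, and the final vector are all made-up values, and dividing scores by a temperature before softmax is shown as one common way randomness is tuned.

```python
import math
import random

vocabulary = ["yes", "sure", "absolutely", "no"]

# Hypothetical unembedding matrix: one row of weights per vocabulary word.
unembedding = [[2.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.0]]
final_vector = [1.0, 0.2]   # made-up final vector for the last position

def softmax_with_temperature(scores, temperature):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# One score per vocabulary word.
scores = [sum(w * x for w, x in zip(row, final_vector)) for row in unembedding]
probabilities = softmax_with_temperature(scores, temperature=1.0)

# Greedy pick: always take the most likely word.
greedy_word = vocabulary[probabilities.index(max(probabilities))]

# Sampled pick: draw a word in proportion to its probability.
rng = random.Random(0)
sampled_word = rng.choices(vocabulary, weights=probabilities, k=1)[0]
print(greedy_word, sampled_word)
```

Greedy picking is deterministic, while sampling can return any word with nonzero probability, so the same blur can lead to different sentences on different runs.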

Three reasons it isn't wasteful. First, the picking is the generation — feeding the blurry "70/12/8" vector back would condition the next step on a smear of all words at once, and the errors would compound into nonsense. Second, the output vector and the input embedding live in different spaces; the model was trained to read embeddings and to emit prediction vectors, and was never trained to consume its own output as input. Third, re-embedding is a table lookup — practically free next to the dozens of layers you run anyway.

And the lovely twist: that instinct — "why not feed the vector back?" — is a real, active research direction. Latent or continuous reasoning lets a model "think" in vector space across steps without collapsing to words each time. It has real trade-offs (you lose the error-correcting re-grounding, and you can't read what it's thinking), which is why standard text generation keeps the discrete round-trip — but the question points straight at a genuine frontier.

The whole loop, in one breath

Tokens become a grid of vectors → attention lets them share context → feed-forward refines each → skip connections keep a running stream → after all layers, the last position's vector is unembedded into a distribution over words → pick one → append it → run the whole thing again. That's a language model writing a sentence.

Reference

Glossary

Parameter — one adjustable weight the model multiplies by. Set during training; a model "has a trillion parameters" the way an engine has a trillion tiny dials.

Token — a chunk of text (a word or word-piece) the model treats as one unit.

Embedding — the vector a token is turned into. Directions in this space carry meaning.

Vector — an ordered list of numbers. Every token is one; the model transforms them.

Neuron — the basic unit: multiply inputs by weights, add a bias, squish. Outputs one number.

Layer — many neurons side by side; together they turn one vector into the next. Mathematically a matrix multiply.

Attention — the step where tokens share context, each pulling in a weighted blend of the others.

Query / Key / Value — three vectors each token computes (via learned matrices). Query·Key sets the attention weights; Values are what gets blended.

Softmax — turns raw scores into positive weights that sum to 100%.

Residual stream — the running vector that flows through every layer, with each layer adding to it rather than replacing it.

Feed-forward — the half of a layer that processes each token on its own, after attention has mixed them.

Unembed — turning the final vector into scores over the vocabulary (a distribution), then a word.

Autoregressive — generating one token at a time, each conditioned on all the previous ones.

KV cache — the stored Key/Value vectors of earlier tokens, so they aren't recomputed every step.