The Transformer
Inside the engine — self-attention, multi-head attention, and the layer stack that makes LLMs possible.
Self-Attention
Self-attention is the mechanism that lets a transformer weigh the relevance of every token in a sequence when processing any given token. Without it, the model would process words in isolation — it could never resolve pronouns, track entities across paragraphs, or understand long-range dependencies.
Each token produces three vectors through learned linear projections: a Query ("what am I looking for?"), a Key ("what do I represent?"), and a Value ("what information do I carry?"). A token's attention scores are the dot products of its Query with every Key, scaled by 1/√dₖ. Softmax turns those scores into weights that sum to 1, and the token's output is the weighted sum of the Values.
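The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the Query, Key, and Value vectors are passed in directly (in a real model they come from learned projections), no causal mask is applied, and the toy numbers are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries, keys: one d_k-dimensional vector per token.
    values: one d_v-dimensional vector per token.
    Returns (outputs, weights).
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # non-negative, sums to 1
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 3 tokens, d_k = d_v = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, w = attention(Q, K, V)
```

Each row of `w` shows how strongly one token attends to every token in the sequence, and each row of `out` is the corresponding blend of Value vectors.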
In multi-head attention, the model runs several attention operations in parallel, each on its own lower-dimensional projection of the Queries, Keys, and Values. Each "head" can learn to attend to different features: one head might track syntactic structure, another coreference, a third sentiment. The head outputs are concatenated and projected back to the embedding dimension.
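A rough sketch of the split, attend, concatenate, and project steps follows. The weight matrices `w_q`, `w_k`, `w_v`, and `w_o` are hypothetical placeholders; the usage example sets them to the identity purely to check shapes, whereas a trained model would learn them. Slicing one large projection into per-head chunks is one common way to implement the heads and is assumed here.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len x d_model); w_q, w_k, w_v, w_o: (d_model x d_model)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        # Each head works on its own slice of the projected Q, K, V.
        head_outputs.append(attend([r[lo:hi] for r in q],
                                   [r[lo:hi] for r in k],
                                   [r[lo:hi] for r in v]))
    # Concatenate the heads per token, then project back to d_model.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

# Shape check with identity projections (placeholder weights, not trained ones).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
y = multi_head_attention(x, identity, identity, identity, identity, num_heads=2)
```

The output `y` has the same shape as `x`, which is what allows attention layers to be stacked.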
Why is the attention score scaled by 1/√dₖ?
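One way to explore this question empirically is to measure how the spread of raw dot products grows with dₖ. The sketch below assumes Query and Key components are independent with zero mean and unit variance, which is an idealization rather than a property of any trained model. Under that assumption the variance of q·k is about dₖ, so dividing the scores by √dₖ brings it back to roughly 1 and keeps softmax out of its saturated, small-gradient regime.

```python
import math
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q.k when all components are independent N(0, 1)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((d - mean) ** 2 for d in dots) / trials

d_k = 64
raw_var = dot_product_variance(d_k)  # expected to be close to d_k
# Dividing every score by sqrt(d_k) divides the variance by d_k.
scaled_var = raw_var / d_k           # expected to be close to 1
```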
The transformer stack
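A transformer is built from a stack of structurally identical blocks, each with its own weights. Every block contains two sub-layers: a multi-head self-attention layer followed by a position-wise feed-forward network (MLP). A residual connection wraps each sub-layer, and layer normalization, applied either before each sub-layer (pre-norm) or after the residual addition (post-norm), stabilizes training.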
The MLP block is deceptively simple: two linear projections with an activation function (typically GELU) in between. Gated variants such as SwiGLU add a third projection that acts as a gate. The MLP operates on each position independently: after attention mixes information across positions, the MLP transforms the representation at each position.
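As a minimal sketch of the two-projection form, the code below applies the same small MLP to every position. The weights are arbitrary toy values, the GELU uses the common tanh approximation, and the hidden width of 4 is chosen only to keep the example short (real models typically use a hidden width several times the embedding size).

```python
import math

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, weight, bias):
    """weight: (d_in x d_out) as a list of rows; returns a d_out vector."""
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weight), bias)]

def mlp(x, w1, b1, w2, b2):
    """Apply the same two-layer MLP to every position in x independently."""
    return [linear([gelu(h) for h in linear(tok, w1, b1)], w2, b2) for tok in x]

# Toy weights: d_model = 2, hidden width = 4 (illustrative values only).
w1 = [[0.5, -0.5, 1.0, 0.0], [0.25, 1.0, -1.0, 0.5]]
b1 = [0.0, 0.1, -0.1, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, 0.25]]
b2 = [0.0, 0.0]

x = [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]]
y = mlp(x, w1, b1, w2, b2)
```

Because no position looks at any other, reordering the rows of `x` simply reorders the rows of `y`. Mixing across positions is left entirely to the attention sub-layer.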
Attention sub-layer
Mixes information across positions — 'which other tokens matter for this one?'
MLP sub-layer
Transforms each position independently — 'given all the mixed info, what's the best representation?'
Residual connection
Adds the input back to the output (x = x + f(x)) — enables training very deep networks by preserving the gradient path.
Layer norm
Normalizes activations across the feature dimension — stabilizes training and accelerates convergence.
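The four pieces above fit together roughly as follows. This sketch assumes a pre-norm arrangement (normalize, apply the sub-layer, then add the input back), which is one common choice and not the only one. The `attention_fn` and `mlp_fn` arguments are hypothetical stand-ins for the real sub-layers, and the learned scale and shift of layer norm are omitted for brevity.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add(a, b):
    """Element-wise sum of two (seq_len x d_model) matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, attention_fn, mlp_fn):
    """Pre-norm block: x = x + attn(norm(x)), then x = x + mlp(norm(x))."""
    x = add(x, attention_fn([layer_norm(t) for t in x]))
    x = add(x, mlp_fn([layer_norm(t) for t in x]))
    return x

def zero_sublayer(x):
    """Stand-in sub-layer that contributes nothing."""
    return [[0.0] * len(t) for t in x]

# With sub-layers that output zeros, the residual path carries the input
# through a stack of blocks unchanged.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = x
for _ in range(4):
    y = transformer_block(y, zero_sublayer, zero_sublayer)
```

The zero sub-layers make the role of the residual connection visible: each block only adds an update on top of its input, so signals and gradients always have a direct path through the stack.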
What is the primary purpose of residual connections in a transformer stack?
Context windows
The context window is the maximum number of tokens a model can attend to at once, essentially its working memory. Early models like GPT-2 had 1,024 tokens; long-context models such as Gemini 2.5 and Claude advertise windows from hundreds of thousands up to about a million tokens or more, depending on the version. More context means the model can process entire codebases, books, or multi-hour conversations in a single pass.
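Self-attention by itself is order-agnostic: permuting the input tokens simply permutes the outputs, so attention alone cannot tell "dog bites man" from "man bites dog". To fix this, position information must be injected. The original Transformer adds sinusoidal positional encodings to the token embeddings before the first layer. Many modern models instead use rotary position embeddings (RoPE), which are applied inside each attention layer by rotating the Q and K vectors so that their dot product depends on relative position.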
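For the classic sinusoidal scheme, a minimal sketch looks like the following. It uses the formulation from the original Transformer paper, with sine on even dimensions and cosine on odd dimensions; the base of 10000 is that paper's value, and the function name is just illustrative.

```python
import math

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """Return a (seq_len x d_model) table of fixed positional encodings."""
    assert d_model % 2 == 0, "d_model must be even"
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions oscillates at its own frequency.
            angle = pos / (base ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

def add_positions(embeddings, table):
    """Add the encoding for position p to the embedding at position p."""
    return [[e + p for e, p in zip(tok, enc)] for tok, enc in zip(embeddings, table)]

pe = sinusoidal_encoding(seq_len=8, d_model=4)
embedded = add_positions([[0.0] * 4 for _ in range(8)], pe)
```

Since the table depends only on the position index and not on any learned parameter, it can be computed for any sequence length.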
| Technique | How it works | Strength | Limitation |
|---|---|---|---|
| Sinusoidal | Fixed sine/cosine waves at different frequencies per dimension | No learned parameters; defined for any position, so it can in principle extend beyond training length | Absolute position, so relative offsets are harder to generalize; extrapolation is often weak in practice |
| Learned | Trainable embedding per position index | Flexible; model learns position patterns | Doesn't extrapolate beyond max training length |
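| RoPE | Rotates Q and K vectors by an angle proportional to position | Encodes relative position naturally; context length can be extended with frequency-scaling methods | Extrapolates poorly past training length without such adjustments; slight extra compute per attention head |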
| ALiBi | Subtracts a bias from attention scores based on token distance | Simple; strong length extrapolation | Less expressive than RoPE for complex position tasks |
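To make the RoPE row more concrete, here is a small sketch of the rotation and of the relative-position property it gives. It assumes the interleaved pairing of dimensions (0 with 1, 2 with 3, and so on) and a base of 10000; real implementations differ in how they pair dimensions, and the vectors below are arbitrary toy values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to pos."""
    d = len(vec)
    assert d % 2 == 0, "vector length must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3) between query and key at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, because rotating both vectors changes their dot product only through the difference of the two positions.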
Why do transformers need positional encoding?