Module 2 · Architecture

The Transformer

Inside the engine — self-attention, multi-head attention, and the layer stack that makes LLMs possible.


Self-Attention

Learn

Self-attention is the mechanism that lets a transformer weigh the relevance of every token in a sequence when processing any given token. Without it, the model would process words in isolation — it could never resolve pronouns, track entities across paragraphs, or understand long-range dependencies.

Each token produces three vectors: a Query ("what am I looking for?"), a Key ("what do I represent?"), and a Value ("what information do I carry?"). Attention scores are the dot products of a token's Query with every Key. The scores are scaled and passed through softmax to give weights, and the token's output is the weighted sum of the Values.

Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V

Q = Query: what this token is looking for
K = Key: what each token offers as a match
V = Value: the actual content to aggregate
√dₖ: scaling factor (prevents large dot products)
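The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the projection matrices W_q, W_k, and W_v, the toy sizes, and the random inputs are assumptions made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) query-key similarities, scaled
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted sum of Values

# Toy input: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
# Hypothetical projection matrices; in a real model these are learned.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # every row of weights sums to 1
```

Each row of `weights` says how much one token draws from every token in the sequence, including itself.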

In multi-head attention, the model runs several attention operations in parallel, each on its own lower-dimensional projection of the input. Each "head" can learn to attend to different features: one head might track syntactic structure, another coreference, a third sentiment. The results are concatenated and projected back to the embedding dimension.
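A rough sketch of the split, attend, concatenate, and project steps follows. It assumes the common layout in which d_model is divided evenly across heads; the weight matrices (W_q, W_k, W_v, W_o) and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    weights = softmax(scores, axis=-1)                   # separate weights per head
    heads = weights @ V                                  # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                  # project back to d_model

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(Y.shape)
```

Because every head has its own slice of the Q, K, and V projections, each head computes its own attention pattern over the same tokens.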

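Self-attention is quadratic in sequence length: doubling the context roughly quadruples the attention compute and the size of the score matrix. This is why long-context models rely on optimizations such as FlashAttention (which avoids materializing the full score matrix in slow memory), grouped-query attention (which shrinks the key/value cache), or sliding windows (which limit each token to nearby keys).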
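To make the scaling concrete, the sketch below counts how many query-key pairs are scored under full causal attention and under a sliding window. The window formulation used here (each token sees itself and the preceding window − 1 tokens) is one common variant and is an assumption of this example, as is the window size of 256.

```python
import numpy as np

def causal_mask(n):
    # True where query i may attend to key j (only j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Causal mask restricted to the most recent `window` keys, including the token itself.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

for n in (1024, 2048):
    full_pairs = int(causal_mask(n).sum())
    local_pairs = int(sliding_window_mask(n, window=256).sum())
    print(n, full_pairs, local_pairs)
```

Doubling n roughly quadruples the full-attention pair count, while the windowed count grows roughly linearly once n is well past the window size.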
Recall

Why is the attention score scaled by 1/√dₖ?

The transformer stack

Learn

A transformer is built from stacked identical blocks, each containing two sub-layers: a multi-head self-attention layer followed by a position-wise feed-forward network (MLP). Residual connections wrap each sub-layer, and layer normalization stabilizes training.

The MLP block is deceptively simple: two linear projections with an activation function (typically GELU) in between. Gated variants such as SwiGLU add a third projection. The MLP operates on each position independently: after attention mixes information across positions, the MLP transforms the representation at each position.

For each transformer layer (× N):
    x = x + MultiHeadAttention(LayerNorm(x))
    x = x + FFN(LayerNorm(x))

FFN(x) = W₂ × GELU(W₁ × x)

Typical sizes: N = 12–96 layers, d_model = 768–8192
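The layer recipe above can be sketched as follows. Several simplifications are assumptions of this example rather than part of the lesson: attention is single-head for brevity, layer norm has no learned gain or bias, GELU uses the tanh approximation, the hidden width is 4 × d_model, and all names (init_params, the parameter keys) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position over the feature dimension (no learned gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, p):
    # Single-head stand-in for MultiHeadAttention, to keep the sketch short.
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return (weights @ V) @ p["W_o"]

def ffn(x, p):
    # Row-vector convention: x @ W1 plays the role of W₁ × x in the formula.
    return gelu(x @ p["W1"]) @ p["W2"]

def transformer_block(x, p):
    x = x + self_attention(layer_norm(x), p)  # residual around attention
    x = x + ffn(layer_norm(x), p)             # residual around the MLP
    return x

def init_params(rng, d_model, d_ff):
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W_o": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n_tokens, d_model, num_layers = 6, 16, 4
x0 = rng.normal(size=(n_tokens, d_model))
layers = [init_params(rng, d_model, 4 * d_model) for _ in range(num_layers)]

x = x0
for p in layers:
    x = transformer_block(x, p)
print(x.shape)  # the stack preserves the (tokens, d_model) shape
```

Because each sub-layer only adds to x, every block maps a (tokens, d_model) array to another of the same shape, which is what lets identical blocks be stacked N times.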

Attention sub-layer

Mixes information across positions — 'which other tokens matter for this one?'

MLP sub-layer

Transforms each position independently — 'given all the mixed info, what's the best representation?'

Residual connection

Adds the input back to the output (x = x + f(x)) — enables training very deep networks by preserving the gradient path.

Layer norm

Normalizes activations across the feature dimension — stabilizes training and accelerates convergence.

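The residual connections are a major reason transformers can be stacked so deep. Each one gives gradients a direct identity path back to earlier layers. Without that path, gradients tend to vanish before reaching the early layers, and stacks beyond a handful of layers become very hard to train.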
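A toy calculation can illustrate the gradient-path argument. It assumes a deliberately simplified scalar model in which every layer has the same small local slope; real layers are nonlinear and matrix-valued, so this shows only the trend, and the function name and numbers are hypothetical.

```python
def gradient_through_stack(local_slope, num_layers, residual):
    """Chain-rule product of per-layer derivatives for a toy scalar stack.

    Plain layer:    y = f(x)      -> dy/dx = local_slope
    Residual layer: y = x + f(x)  -> dy/dx = 1 + local_slope
    """
    grad = 1.0
    for _ in range(num_layers):
        grad *= local_slope + (1.0 if residual else 0.0)
    return grad

plain = gradient_through_stack(local_slope=0.1, num_layers=24, residual=False)
with_residual = gradient_through_stack(local_slope=0.1, num_layers=24, residual=True)
print(plain)          # shrinks toward zero as layers are added
print(with_residual)  # stays away from zero thanks to the identity term
```

The "+ 1" from the identity path is what keeps the product from collapsing. In practice layer normalization and careful initialization help keep it from growing too large instead.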
Recall

What is the primary purpose of residual connections in a transformer stack?

Context windows

Learn

The context window is the maximum number of tokens a model can attend to at once, essentially its working memory. Early models like GPT-2 had 1,024 tokens; modern models like Gemini 2.5 and Claude accept from hundreds of thousands up to around a million tokens, depending on the model and version. More context means the model can process entire codebases, books, or multi-hour conversations.

Transformers are inherently position-agnostic: the attention computation itself has no notion of token order, so without extra information a sentence looks like an unordered set of tokens. Position information therefore has to be injected. Classic transformers add sinusoidal positional encodings to the token embeddings before the first layer; modern models often use rotary position embeddings (RoPE), which instead rotate the Q and K vectors inside each attention layer so that attention scores depend on relative position.
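The relative-position property of RoPE can be checked numerically. The sketch below assumes the commonly used formulation: consecutive feature pairs are rotated, with frequencies base^(−2i/d) and base = 10000. The function name and test vectors are made up for illustration.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin     # standard 2-D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3 positions apart) at two very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_near, score_far))
```

The two scores agree because rotating q by one angle and k by another leaves a dot product that depends only on the difference between the angles, that is, on the distance between the two tokens.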

| Technique | How it works | Strength | Limitation |
| --- | --- | --- | --- |
| Sinusoidal | Fixed sine/cosine waves at different frequencies per dimension | No learned parameters; can be computed for any position, including ones beyond the training length | Absolute position, so relative positions are harder to generalize |
| Learned | Trainable embedding per position index | Flexible; model learns position patterns | Doesn't extrapolate beyond max training length |
| RoPE | Rotates Q and K vectors by an angle proportional to position | Encodes relative position naturally; context can be extended with frequency-scaling methods | Increases compute slightly per attention head; extrapolates poorly beyond the training length without such scaling |
| ALiBi | Subtracts a bias from attention scores proportional to token distance | Simple; strong length extrapolation | Less expressive than RoPE for complex position tasks |
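As an illustration of the ALiBi row, the bias it adds to the attention scores can be sketched as below. The per-head slopes follow the geometric sequence described in the ALiBi paper for a power-of-two number of heads; the function names are hypothetical, and causal masking is assumed to be applied separately.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n = num_heads.
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias(n, num_heads):
    """Returns a (num_heads, n, n) bias to add to the scores before softmax."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    # Distance back to earlier keys; future keys (j > i) get 0 here
    # because a causal mask is assumed to remove them anyway.
    distance = np.maximum(i - j, 0)
    return -alibi_slopes(num_heads)[:, None, None] * distance

bias = alibi_bias(n=6, num_heads=8)
print(bias.shape)
print(bias[0, 5])  # head 0: the penalty grows linearly with distance from token 5
```

Heads with large slopes focus on nearby tokens, while heads with small slopes can still look far back. No position-specific parameters are learned.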
Longer context isn't free. Because attention is O(n²) in sequence length, a 128k context window needs roughly 16,000× (128² = 16,384) the attention compute of a 1k window. This is why efficient attention techniques such as FlashAttention and ring attention are critical for frontier models.
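The 128k versus 1k comparison can be checked with a couple of lines of arithmetic. The sketch assumes "128k" means 128 × 1,024 tokens and that one attention head's score matrix is stored at 2 bytes per entry (for example fp16); both are assumptions of this example, and the function names are hypothetical.

```python
def attention_cost_ratio(n_long, n_short):
    # The score matrix has n * n entries, so cost scales with the square of the length ratio.
    return (n_long / n_short) ** 2

def score_matrix_gib(n, bytes_per_entry=2):
    # Memory needed to materialize one head's full n x n score matrix, in GiB.
    return n * n * bytes_per_entry / 2**30

print(attention_cost_ratio(128 * 1024, 1024))  # 16384.0
print(score_matrix_gib(128 * 1024))            # 32.0
print(score_matrix_gib(1024))                  # 0.001953125
```

At 128k tokens a single head's score matrix would occupy about 32 GiB if it were stored in full, which is why implementations that avoid materializing it matter so much at long context lengths.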
Recall

Why do transformers need positional encoding?