The Transformer · AI Fundamentals

Self-Attention

Learn

Self-attention is the mechanism that lets a transformer weigh the relevance of every token in a sequence when processing any given token. Without it, the model would process words in isolation — it could never resolve pronouns, track entities across paragraphs, or understand long-range dependencies.

Each token produces three vectors: a Query ("what am I looking for?"), a Key ("what do I represent?"), and a Value ("what information do I carry?"). Attention scores are the dot products of a token's Query with every Key, scaled by 1/√dₖ and passed through softmax to give attention weights. The output for that token is the sum of the Value vectors weighted by those attention weights.

Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V

Q = Query: what this token is looking for
K = Key: what each token offers as a match
V = Value: the actual content to aggregate
√dₖ = scaling factor (prevents large dot products)
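The formula can be traced step by step in a few lines of plain Python. This is a minimal sketch using nested lists rather than a tensor library; the helper names (`softmax`, `attention`) and the toy Q, K, V values are illustrative assumptions, not part of the lesson.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    Returns (output, weights).
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # One score per key: scaled dot product of this query with each key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is the weighted sum of the Value rows.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy example: 3 tokens, d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1, so every output row is a convex combination of the Value rows.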

In multi-head attention, the model runs multiple attention operations in parallel — each "head" learns to attend to different linguistic features: one head might track syntactic structure, another coreference, a third sentiment. The results are concatenated and projected back to the embedding dimension.
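The split, attend, concatenate, and project flow might look like the sketch below. It assumes random, untrained weight matrices (`W_q`, `W_k`, `W_v`, `W_o`) and tiny dimensions chosen only for illustration; real implementations use a tensor library and learned weights.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns the weighted sum of V rows."""
    scale = 1.0 / math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([scale * s for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def split_heads(X, n_heads):
    """Split an n x d_model matrix into n_heads matrices of width d_model / n_heads."""
    d_head = len(X[0]) // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(n_heads)]

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    per_head = zip(split_heads(Q, n_heads), split_heads(K, n_heads), split_heads(V, n_heads))
    # Each head runs attention on its own slice of the features.
    heads = [attention(q, k, v) for q, k, v in per_head]
    # Concatenate head outputs along the feature dimension, then project.
    concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)  # fixed seed so the toy example is repeatable
n_tokens, d_model, n_heads = 4, 8, 2
X = random_matrix(n_tokens, d_model, rng)
W_q, W_k, W_v, W_o = (random_matrix(d_model, d_model, rng) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

The output `Y` has the same shape as `X` (4 tokens by 8 features), which is what lets the result feed into the next layer.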

Self-attention is quadratic in sequence length — doubling the context roughly quadruples the compute. This is why long-context models require optimizations like FlashAttention, grouped-query attention, or sliding windows.
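To make the quadratic growth concrete, the sketch below counts query-key scores for full attention and for a causal sliding window. The window size of 256 and the function names are assumptions for illustration only.

```python
def full_attention_pairs(n):
    """Number of query-key scores in full self-attention over n tokens."""
    return n * n

def sliding_window_pairs(n, window):
    """Scores when each token attends to itself and the previous window - 1 tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_000, 2_000, 4_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=256))
```

Doubling `n` quadruples the full-attention count, while the sliding-window count grows by only `window` scores per added token.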

Practice

Put in order

First step → Last step

Compute attention scores: Q × Kᵀ
Scale scores by 1/√dₖ
Apply softmax to get attention weights
Compute weighted sum of Values
Project input into Q, K, V matrices

Recall

Why is the attention score scaled by 1/√dₖ?

The transformer stack

Learn

A transformer is built from stacked identical blocks, each containing two sub-layers: a multi-head self-attention layer followed by a position-wise feed-forward network (MLP). Residual connections wrap each sub-layer, and layer normalization stabilizes training.

The MLP block is deceptively simple: two linear projections with an activation function (typically GELU) in between. Gated variants such as SwiGLU add a third projection that gates the hidden activations. The MLP operates on each position independently: after attention mixes information across positions, the MLP transforms the representation at each position.

For each transformer layer (× N):
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

FFN(x) = W₂ × GELU(W₁ × x)

Typical sizes: N = 12–96 layers, d_model = 768–8192
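The two update lines above can be sketched in plain Python as follows. This is a simplified illustration: `self_attention` is a single-head stand-in for MultiHeadAttention with no learned projections, layer norm omits its learned gain and bias, and the weights and sizes are made up for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's features to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(W, x):
    """Compute W × x for a weight matrix W (one row per output) and vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(x, W1, W2):
    """FFN(x) = W2 × GELU(W1 × x), applied to a single position."""
    return linear(W2, [gelu(h) for h in linear(W1, x)])

def self_attention(X):
    """Stand-in for MultiHeadAttention: one head, no learned projections."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, X)) for j in range(d)])
    return out

def transformer_block(X, W1, W2):
    # x = x + MultiHeadAttention(LayerNorm(x))
    attn_out = self_attention([layer_norm(x) for x in X])
    X = [[a + b for a, b in zip(x, y)] for x, y in zip(X, attn_out)]
    # x = x + FFN(LayerNorm(x)), applied to each position independently
    return [[a + b for a, b in zip(x, ffn(layer_norm(x), W1, W2))] for x in X]

d_model, d_ff = 4, 8  # toy sizes, far smaller than real models
W1 = [[0.1 * ((i + j) % 3 - 1) for j in range(d_model)] for i in range(d_ff)]
W2 = [[0.1 * ((i * j) % 3 - 1) for j in range(d_ff)] for i in range(d_model)]
X = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5], [2.0, 0.0, -2.0, 1.0]]
Y = transformer_block(X, W1, W2)
```

Because the block's output has the same shape as its input, stacking N layers is just a matter of calling `transformer_block` N times with different weights.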

Attention sub-layer

Mixes information across positions — 'which other tokens matter for this one?'

MLP sub-layer

Transforms each position independently — 'given all the mixed info, what's the best representation?'

Residual connection

Adds the input back to the output (x = x + f(x)) — enables training very deep networks by preserving the gradient path.

Layer norm

Normalizes activations across the feature dimension — stabilizes training and accelerates convergence.

The residual connections are a key reason transformers can be stacked so deep. Without them, gradients tend to vanish before reaching early layers, and models beyond a handful of layers become very hard to train.
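A toy scalar example can illustrate the gradient argument. The sketch below assumes each "layer" is f(x) = tanh(w · x) with a made-up weight w = 0.5, and multiplies the per-layer derivatives by the chain rule. It shows the mechanism only and says nothing quantitative about real networks.

```python
import math

def gradient_through_stack(n_layers, w=0.5, x=0.3, residual=True):
    """Return d(output)/d(input) for a stack of scalar layers f(x) = tanh(w * x)."""
    grad = 1.0
    for _ in range(n_layers):
        pre = w * x
        local = w * (1.0 - math.tanh(pre) ** 2)  # f'(x)
        if residual:
            x = x + math.tanh(pre)  # layer computes x + f(x)
            grad *= 1.0 + local     # derivative of x + f(x)
        else:
            x = math.tanh(pre)      # layer computes f(x) only
            grad *= local
    return grad

plain = gradient_through_stack(24, residual=False)
with_residual = gradient_through_stack(24, residual=True)
print(plain, with_residual)
```

In this toy setup every plain-layer factor is at most 0.5, so the product shrinks toward zero over 24 layers. With the residual path every factor is at least 1, so the gradient reaching the input does not shrink.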

Practice

Recall

What is the primary purpose of residual connections in a transformer stack?

Context windows

Learn

The context window is the maximum number of tokens a model can attend to at once, essentially its working memory. Early models like GPT-2 had 1,024 tokens; modern models like Gemini 2.5 and Claude handle hundreds of thousands to roughly a million tokens, depending on the model and configuration. More context means the model can process entire codebases, books, or multi-hour conversations.

Transformers are inherently position-agnostic: attention on its own has no notion of token order. To fix this, position information must be injected. Classic transformers add sinusoidal positional encodings to the token embeddings before the first layer; modern models often use rotary position embeddings (RoPE), which instead rotate the Q and K vectors inside each attention layer so that attention scores depend on relative position.
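As an illustration of the classic approach, the sketch below builds sinusoidal encodings and adds them to placeholder token embeddings. The base of 10000 follows the original Transformer formulation; the function name, the tiny `d_model`, and the constant embeddings are assumptions for the example.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal encoding for one position (d_model assumed even).

    Even indices hold sin, odd indices hold cos, with the wavelength
    growing geometrically across feature pairs.
    """
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

d_model = 8
# Three identical placeholder embeddings: without position info they are indistinguishable.
embeddings = [[0.1] * d_model for _ in range(3)]
with_position = [
    [e + p for e, p in zip(emb, sinusoidal_encoding(pos, d_model))]
    for pos, emb in enumerate(embeddings)
]
```

After the addition, the three rows differ even though the underlying token embeddings are identical, which is what gives attention access to token order.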

Technique	How it works	Strength	Limitation
Sinusoidal	Fixed sine/cosine waves at different frequencies per dimension	No learned parameters; extrapolates beyond training length	Absolute position — harder to generalize relative positions
Learned	Trainable embedding per position index	Flexible; model learns position patterns	Doesn't extrapolate beyond max training length
RoPE	Rotates Q and K vectors by an angle proportional to position	Encodes relative position naturally; context length can be extended with frequency-scaling methods	Increases compute slightly per attention head
ALiBi	Subtracts a bias from attention scores based on token distance	Simple; strong length extrapolation	Less expressive than RoPE for complex position tasks
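The RoPE row can be checked numerically: rotating Q and K by position-dependent angles should make their dot product depend only on the distance between the two positions. The sketch below pairs adjacent features and uses a base of 10000; the function names and the sample vectors are assumptions for illustration.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to position."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

The two scores agree up to floating-point error, which is the sense in which RoPE encodes relative position.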

Longer context isn't free: because attention is O(n²) in sequence length, a 128k context window needs about 16,000× (128² = 16,384) more attention compute than a 1k window. This is why efficient attention variants (FlashAttention, ring attention) are critical for frontier models.
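The arithmetic behind that comparison, plus a rough sense of what the score matrix would cost in memory if it were stored in full, can be sketched as below. The 2 bytes per score (16-bit floats) and the single-head, single-layer framing are assumptions; real memory use depends on the implementation.

```python
def attention_compute_ratio(n_long, n_short):
    """Ratio of attention score counts between two sequence lengths."""
    return (n_long / n_short) ** 2

def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Memory to materialize one n x n score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_score

ratio = attention_compute_ratio(128_000, 1_000)
gib = score_matrix_bytes(128_000) / 2**30
print(f"compute ratio: {ratio:.0f}x, score matrix: {gib:.1f} GiB")
```

Storing a full score matrix at this length would take tens of GiB per head under these assumptions, which is why methods that avoid materializing it matter at long context.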

Practice

[Interactive slider: sequence length vs. attention cost. Example reading at 8K tokens: attention ops (relative) = 64 (quadratic growth); rough VRAM ≈ 0.3 GB (approximate).]

O(n²) attention: double the sequence = quadruple the compute.

Recall

Why do transformers need positional encoding?