How LLMs think · AI Fundamentals

What is an LLM?

Learn

A large language model (LLM) is a neural network trained to predict the next token in a sequence. When you type a prompt, it doesn't "understand" in the human sense; it computes which continuations are probable, based on patterns learned from trillions of training tokens.

This is called autoregressive generation: the model predicts token₁, feeds it back into its own input, predicts token₂, and repeats. Each prediction is a probability distribution over the entire vocabulary — typically 50,000–250,000 possible tokens. The model picks one (via sampling) and continues.
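The loop described above can be sketched in a few lines. This is a minimal illustration, not how any production model is implemented: `VOCAB`, `toy_logits`, and `generate` are hypothetical names, and `toy_logits` is a hand-written rule standing in for a real neural network.

```python
import math
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of tokens.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]

def toy_logits(context):
    """Stand-in for a neural network: score every vocabulary entry given the context.

    This toy rule simply favors the entry that follows the last token in VOCAB order.
    """
    preferred = (context[-1] + 1) % len(VOCAB)
    return [3.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=10, seed=0):
    """Autoregressive loop: predict a distribution, sample one token, append, repeat."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        ids.append(next_id)  # feed the prediction back into the input
        if next_id == 0:     # stop at the end-of-sequence token
            break
    return ids

generated = generate([1])
print([VOCAB[i] for i in generated])
```

Replacing the sampling step with "take the index of the largest probability" would give greedy decoding, which always returns the same output for the same prompt.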

On its own, an LLM has no database, no internet access, and no persistent memory between requests. Every response is generated from scratch, one token at a time, conditioned only on what is in its context: the prompt, plus any results that connected external tools have placed there.

LLMs are trained on a self-supervised task: given a massive corpus of text, the model learns to fill in missing or subsequent tokens. This pretraining produces a "base model" that can complete text but doesn't follow instructions well. Post-training then teaches the model to be helpful, harmless, and to follow user intent: supervised instruction tuning on example prompts and responses, typically followed by preference optimization such as RLHF or DPO.
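To make "self-supervised" concrete: the training label at each position is simply the token that actually came next in the corpus, and a common way to score a prediction is cross-entropy. The sketch below assumes that loss; the probability tables and the name `next_token_loss` are made up for illustration.

```python
import math

def next_token_loss(probs_per_position, target_ids):
    """Average cross-entropy: -log of the probability assigned to each true next token."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(probs_per_position, target_ids)
    ]
    return sum(losses) / len(losses)

# Hypothetical predictions over a 3-token vocabulary at two positions.
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
targets = [0, 1]  # the tokens that actually came next in the text

loss_confident = next_token_loss(confident, targets)
loss_unsure = next_token_loss(unsure, targets)
print(loss_confident, loss_unsure)
```

The loss is lower when the model puts more probability on the token that really followed, so lowering it over the corpus pushes the model toward better next-token predictions without any human-written labels.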

Practice

Recall

What is the fundamental task an LLM performs at inference time?

Tokenization

Learn

An LLM cannot read raw text — it operates on numbers. Tokenization splits text into discrete units (tokens) and maps each to an integer ID. The model's vocabulary is a fixed dictionary of these token → ID mappings, built before training begins.

Many modern LLMs use Byte-Pair Encoding (BPE) or a close variant to build the vocabulary. BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pairs. This means common words get their own token (`"the"` → `[1037]`, an illustrative ID), while rare words split into subword pieces (`"tokenization"` → `["token", "ization"]`).
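The merge procedure can be sketched as follows. This is a simplified, character-level toy that learns merges from a short word list; the function name `learn_bpe_merges` and the sample words are assumptions for illustration, and real tokenizers add details (byte-level handling, pre-tokenization, special tokens) that are omitted here.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, fusing each occurrence of the best pair into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe_merges(words, num_merges=4)
print(merges)
print(list(corpus))
```

After a few merges, the frequent word "low" has become a single symbol, while the rarer "lower" is still split into pieces, which mirrors the common-word versus rare-word behavior described above.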

Text: "I love LLMs"
Tokens: ["I", " love", " L", "L", "Ms"]
IDs: [40, 1842, 321, 63, 8763]

Each token has ONE vocabulary entry. Tokens ≠ words: capitalization, spaces, and punctuation all matter.

Tokenization is model-specific: the same text can tokenize differently across models. GPT-4, Claude, and Llama each have their own tokenizer, which is why the same prompt can produce different token counts and behavior.
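A small sketch of why token counts differ between models. The two vocabularies below are invented, and `greedy_tokenize` uses simple longest-match splitting rather than the actual algorithm of any named model; the point is only that a different vocabulary yields a different split of the same text.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches.

    Falls back to a single character when nothing in the vocabulary matches.
    """
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                match = text[i:end]
                break
        tokens.append(match)
        i += len(match)
    return tokens

# Two hypothetical vocabularies standing in for two different models' tokenizers.
vocab_a = {"I", " love", " L", "L", "Ms"}
vocab_b = {"I", " love", " LLM", "s"}

text = "I love LLMs"
tokens_a = greedy_tokenize(text, vocab_a)
tokens_b = greedy_tokenize(text, vocab_b)
print(len(tokens_a), tokens_a)
print(len(tokens_b), tokens_b)
```

Both splits cover exactly the same characters, yet one uses more tokens than the other, so the same prompt costs a different number of tokens depending on the tokenizer.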

Practice

Fill in the blanks

Raw text → ___ into characters → Merge frequent ___ → Build ___ → Map text to token ___

Recall

Why does tokenization sometimes split a single word into multiple tokens?

Embeddings

Learn

After tokenization, each token ID is converted into an embedding — a dense vector of floating-point numbers (e.g., 768 to 4096 dimensions). These vectors aren't random: during training, the model learns to position tokens with similar meanings close together in this high-dimensional space.

How close two embedding vectors are, commonly measured by cosine similarity (the cosine of the angle between them), indicates how semantically related the tokens are. `king − man + woman ≈ queen` is the famous example: vector arithmetic captures analogies because embeddings encode relationships as directions in space.
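The analogy can be reproduced with hand-picked toy vectors. The numbers below are invented so that the three axes loosely mean "royalty", "maleness", and "femaleness"; learned embeddings have hundreds or thousands of dimensions with no such tidy labels, and the analogy holds only approximately in practice.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Search the remaining words for the one most similar to the target vector.
candidates = [w for w in embeddings if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))
print(best)
```

Excluding the three input words from the candidates is the usual convention for this kind of analogy lookup; without it, one of the inputs is often the nearest vector.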

One-hot encoding

Size: 50k dimensions (vocabulary size)
Relationships: None — every token is equally distant from every other
Efficiency: Extremely sparse (mostly zeros)

Learned embeddings

Size: 768–4096 dimensions (configurable)
Relationships: Semantic — similar meanings cluster together
Efficiency: Dense — every dimension carries signal

Embeddings are the model's internal 'language.' Every LLM capability — translation, reasoning, coding — ultimately operates on these vectors. The transformer's job is to transform and contextualize them through attention.
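The two representations compared above are closely linked: looking up a token's embedding gives the same result as multiplying its one-hot vector by the embedding table. The sketch below shows that equivalence with a made-up table of 4 tokens and 3 dimensions; the values and function names are illustrative only.

```python
# Hypothetical embedding table: one row per token ID (real models use far more of both).
embedding_table = [
    [0.2, -0.1, 0.7],   # token ID 0
    [0.5, 0.3, -0.4],   # token ID 1
    [-0.6, 0.8, 0.1],   # token ID 2
    [0.0, 0.9, 0.9],    # token ID 3
]

def one_hot(token_id, vocab_size):
    """Sparse vector: 1.0 at the token's index, 0.0 everywhere else."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def lookup(token_id):
    """Dense path: index the row directly."""
    return embedding_table[token_id]

def one_hot_matmul(token_id):
    """Sparse path: multiply the one-hot vector by the embedding table."""
    vec = one_hot(token_id, len(embedding_table))
    dims = len(embedding_table[0])
    return [
        sum(vec[row] * embedding_table[row][d] for row in range(len(embedding_table)))
        for d in range(dims)
    ]

for token_id in range(len(embedding_table)):
    print(token_id, lookup(token_id), one_hot_matmul(token_id))
```

The direct lookup skips all the multiplications by zero, which is why an embedding layer is typically implemented as a table lookup rather than a matrix product.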

Practice

cos_sim = dot(A, B) / (|A| × |B|)
Range: −1 (opposite direction) to +1 (same direction)

Example: a cosine similarity of 0.85 means the tokens are highly related; the two vectors point in nearly the same direction.
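The formula above translates directly into code. The sketch below checks the ends and the middle of the range with small made-up vectors; note that a scaled copy of a vector still scores +1, because cosine similarity ignores magnitude and compares direction only.

```python
import math

def cosine_similarity(a, b):
    """cos_sim = dot(A, B) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

a = [1.0, 2.0]
same_direction = cosine_similarity(a, [2.0, 4.0])  # scaled copy of a
orthogonal = cosine_similarity(a, [-2.0, 1.0])     # at a right angle to a
opposite = cosine_similarity(a, [-1.0, -2.0])      # a with every sign flipped

print(same_direction, orthogonal, opposite)
```

This version divides by zero if either vector is all zeros; a real implementation would need to decide how to handle that case.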

Recall

What does cosine similarity between two embedding vectors measure?