Module 1 · Foundation

How LLMs think

What large language models actually do — prediction, tokenization, and how meaning gets encoded as vectors.

What is an LLM?

Learn

A large language model (LLM) is a neural network trained to predict the next token in a sequence. When you type a prompt, it doesn't "understand" in the human sense; it computes which continuations are most probable, based on patterns learned from up to trillions of training tokens.

This is called autoregressive generation: the model predicts token₁, feeds it back into its own input, predicts token₂, and repeats. Each prediction is a probability distribution over the entire vocabulary — typically 50,000–250,000 possible tokens. The model picks one (via sampling) and continues.
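The loop described above can be sketched in a few lines. This is a minimal illustration, not how a real LLM is implemented: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, the vocabulary has six entries instead of tens of thousands, and the toy "model" looks only at the last token rather than the full context.

```python
import random

# Hypothetical toy "model": maps the last token to a probability
# distribution over a tiny vocabulary. A real LLM conditions on the
# whole context and scores every token in a much larger vocabulary.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}


def next_token_distribution(context):
    """Return the probability distribution for the next token."""
    return TOY_MODEL[context[-1]]


def generate(max_tokens=10, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    context = ["<start>"]
    for _ in range(max_tokens):
        dist = next_token_distribution(context)
        tokens = list(dist.keys())
        weights = list(dist.values())
        # Sample one token according to its probability.
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<end>":
            break
        # Feed the sampled token back in as part of the next input.
        context.append(token)
    return context[1:]


print(generate(seed=0))
```

Changing `seed` can change the output, because each step samples from the distribution instead of always taking the single most probable token.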

On its own, an LLM has no database, no internet access, and no persistent memory between requests. Every response is generated from scratch, one token at a time, conditioned only on what is in its context window: the prompt, plus any results that connected external tools have placed there.

LLMs are pretrained on a self-supervised task: given a massive corpus of text, the model learns to predict each next token from the tokens before it. Pretraining produces a "base model" that can complete text but doesn't follow instructions well. Post-training then adapts it: supervised instruction tuning on example prompts and responses, usually followed by preference optimization such as RLHF or DPO, trains the model to be helpful and harmless and to follow user intent.
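As a rough illustration of the pretraining signal, next-token prediction is commonly scored with cross-entropy: the loss at each position is the negative log of the probability the model assigned to the token that actually came next. The function name and the probabilities below are made up for the example.

```python
import math


def next_token_loss(predicted_probs, target_token):
    """Cross-entropy for one position: -log p(actual next token)."""
    return -math.log(predicted_probs[target_token])


# Hypothetical model output for the context "the cat".
probs = {"sat": 0.7, "ran": 0.2, "the": 0.1}

confident = next_token_loss(probs, "sat")  # the corpus continued with "sat"
surprised = next_token_loss(probs, "the")  # the corpus continued with "the"

print(round(confident, 3), round(surprised, 3))
```

The loss is small when the model put high probability on the true next token and large when it did not; training adjusts the weights to reduce this loss averaged over the corpus.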

Practice
Recall

What is the fundamental task an LLM performs at inference time?

Tokenization

Learn

An LLM cannot read raw text — it operates on numbers. Tokenization splits text into discrete units (tokens) and maps each to an integer ID. The model's vocabulary is a fixed dictionary of these token → ID mappings, built before training begins.

Most modern LLMs build their vocabulary with Byte-Pair Encoding (BPE) or a close variant. BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token. As a result, common words get their own token (`"the"` → `[1037]`), while rare words split into subword pieces (`"tokenization"` → `["token", "ization"]`).
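The merge procedure can be sketched on a toy corpus. This is a simplified, character-level version under stated assumptions: the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the corpus are invented for illustration, and production tokenizers add details such as byte-level fallback and pre-tokenization rules.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, segmented = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(segmented)
```

After two merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still split into "low" plus leftover characters, which mirrors the common-word versus rare-word behavior described above.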

Text: "I love LLMs"
Tokens: ["I", " love", " L", "L", "Ms"]
IDs: [40, 1842, 321, 63, 8763]

Each token has ONE vocabulary entry. Tokens ≠ words: capitalization, spaces, and punctuation all matter.

Tokenization is model-specific: the same text can tokenize differently across models. GPT-4, Claude, and Llama each have their own tokenizer, which is why the same prompt can produce different token counts and behavior.

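To see why token counts differ between tokenizers, here is a small sketch that splits the same text against two different vocabularies. It uses greedy longest-match lookup as a stand-in; real BPE tokenizers apply their learned merge rules instead, and both vocabularies (`VOCAB_A`, `VOCAB_B`) are hypothetical.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary.

    A simplification for illustration: unknown text falls back to
    single characters.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens


# Two hypothetical vocabularies standing in for two different models.
VOCAB_A = {"I", " love", " L", "L", "Ms"}
VOCAB_B = {"I", " love", " LLM", "s"}

text = "I love LLMs"
tokens_a = tokenize(text, VOCAB_A)
tokens_b = tokenize(text, VOCAB_B)

print(len(tokens_a), tokens_a)
print(len(tokens_b), tokens_b)
```

In this sketch, joining either token list reproduces the original text exactly; what changes between vocabularies is the number and boundaries of the tokens.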
Practice
Recall

Why does tokenization sometimes split a single word into multiple tokens?

Embeddings

Learn

After tokenization, each token ID is converted into an embedding — a dense vector of floating-point numbers (e.g., 768 to 4096 dimensions). These vectors aren't random: during training, the model learns to position tokens with similar meanings close together in this high-dimensional space.
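Mechanically, the conversion is a table lookup: the token ID is a row index into an embedding matrix. The sketch below assumes a tiny table filled with random placeholder values; in a trained model those values are learned weights, and `make_embedding_table` and `embed` are illustrative names only.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """One row of `dim` floats per token ID (random placeholders here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up the vector for each token ID."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 7, 3], table)

print(len(vectors), len(vectors[0]))
```

The same token ID always maps to the same row, so the two occurrences of ID 3 get identical vectors at this stage; later layers are what make representations depend on context.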

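How close two embedding vectors are, commonly measured by cosine similarity (the cosine of the angle between them), indicates how semantically related the tokens are: values near 1 mean closely related, values near 0 mean unrelated. `king − man + woman ≈ queen` is the famous example, popularized by word2vec word embeddings: vector arithmetic captures analogies because embeddings encode relationships as directions in space.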
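A small sketch of both ideas, using hand-made 2-D vectors rather than real embeddings. The values in `EMB` are invented so that one axis loosely stands for "royalty" and the other for "gender"; learned embeddings have hundreds or thousands of dimensions and no such tidy axes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: [royalty, gender].
EMB = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.1],
}


def analogy(a, b, c):
    """Return the word whose vector is closest to EMB[a] - EMB[b] + EMB[c]."""
    target = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [word for word in EMB if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, EMB[word]))


print(analogy("king", "man", "woman"))
print(cosine_similarity(EMB["king"], EMB["prince"]))
```

Because cosine similarity depends only on direction, scaling a vector by a positive constant does not change its similarity to any other vector.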

One-hot encoding

Size: 50k dimensions (vocabulary size)
Relationships: None; every token is equally distant from every other
Efficiency: Extremely sparse (mostly zeros)

Learned embeddings

Size: 768–4096 dimensions (configurable)
Relationships: Semantic; similar meanings cluster together
Efficiency: Dense; every dimension carries signal

Embeddings are the model's internal 'language.' Every LLM capability (translation, reasoning, coding) ultimately operates on these vectors. The transformer's job is to transform and contextualize them through attention.
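The one-hot versus learned-embedding comparison can be checked numerically. The sketch below assumes a five-token vocabulary and a made-up 2-D embedding table (`TABLE`); it shows that distinct one-hot vectors share no signal, and that multiplying a one-hot vector by the embedding matrix simply selects one row.

```python
def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1 at the token's index."""
    vec = [0.0] * vocab_size
    vec[token_id] = 1.0
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(one_hot_vec, table):
    """Multiply a one-hot row vector by the embedding matrix."""
    dim = len(table[0])
    return [
        sum(one_hot_vec[i] * table[i][d] for i in range(len(table)))
        for d in range(dim)
    ]


VOCAB_SIZE = 5
# Hypothetical embedding table: one 2-D row per token ID.
TABLE = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

# Any two different one-hot vectors are orthogonal.
print(dot(one_hot(1, VOCAB_SIZE), one_hot(3, VOCAB_SIZE)))

# One-hot times the table is the same as looking up a row.
print(project(one_hot(2, VOCAB_SIZE), TABLE))
```

This is why implementations typically index into the table directly instead of materializing sparse one-hot vectors.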
Practice
Recall

What does cosine similarity between two embedding vectors measure?