How LLMs think
What large language models actually do — prediction, tokenization, and how meaning gets encoded as vectors.
What is an LLM?
A large language model (LLM) is a neural network trained to predict the next token in a sequence. When you type a prompt, it doesn't "understand" in the human sense: it computes how probable each possible next token is, based on patterns learned from trillions of training tokens, and builds its response one token at a time.
This is called autoregressive generation: the model predicts token₁, feeds it back into its own input, predicts token₂, and repeats until it emits a stop token or reaches a length limit. Each prediction is a probability distribution over the entire vocabulary, typically 50,000–250,000 possible tokens. The model picks one (usually by sampling from the distribution, sometimes by taking the most likely token) and continues.
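The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a real model: `toy_next_token_probs` is a hypothetical stand-in for the neural network, and the four-token vocabulary is invented for the example.

```python
import random

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_next_token_probs(context):
    """Stand-in for the network: returns one probability per vocabulary entry.

    The rule is hand-written for illustration: it strongly favours the token
    that follows the last context token in VOCAB order.
    """
    last = context[-1] if context else None
    favoured = (VOCAB.index(last) + 1) % len(VOCAB) if last in VOCAB else 0
    probs = [0.05] * len(VOCAB)
    probs[favoured] = 1.0 - 0.05 * (len(VOCAB) - 1)
    return probs

def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)                    # distribution over VOCAB
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample one token
        tokens.append(next_token)                               # feed it back as input
        if next_token == "<eos>":                               # stop token ends generation
            break
    return tokens

print(generate(["the"]))
```

Because the next token is sampled rather than always taken as the most likely one, different seeds can produce different continuations from the same prompt.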
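LLMs are trained on a self-supervised task: given a massive corpus of text, the model learns to predict each next token from the text that precedes it, so the text itself supplies the labels. This pretraining produces a "base model" that can complete text but doesn't follow instructions well. Post-training then adapts it: supervised instruction tuning on example prompts and responses, usually followed by preference optimization such as RLHF or DPO, trains the model to be helpful and harmless and to follow user intent.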
What is the fundamental task an LLM performs at inference time?
Tokenization
An LLM cannot read raw text — it operates on numbers. Tokenization splits text into discrete units (tokens) and maps each to an integer ID. The model's vocabulary is a fixed dictionary of these token → ID mappings, built before training begins.
Most modern LLMs build their vocabulary with Byte-Pair Encoding (BPE) or a close variant. BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. As a result, common words get their own token (for example, `"the"` → `[1037]`; the exact ID depends on the tokenizer), while rare words split into subword pieces (`"tokenization"` → `["token", "ization"]`).
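The merge procedure can be sketched as follows. This is a simplified, character-level toy, assuming a tiny invented corpus and no end-of-word markers or byte-level fallback; the function names (`merge_pair`, `learn_bpe_merges`, `encode`) are hypothetical and do not come from any particular tokenizer library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Apply the new merge everywhere before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Split a word into subword pieces by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Invented toy corpus: frequent words and suffixes should earn their own tokens.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(words, num_merges=4)
print(encode("lowest", merges))
print(encode("newer", merges))
```

With this toy corpus and four merges, `encode("lowest", merges)` yields `['low', 'est']`, while `newer`, whose character pairs were never frequent enough to be merged, stays split into single characters.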
Why does tokenization sometimes split a single word into multiple tokens?
Embeddings
After tokenization, each token ID is converted into an embedding — a dense vector of floating-point numbers (e.g., 768 to 4096 dimensions). These vectors aren't random: during training, the model learns to position tokens with similar meanings close together in this high-dimensional space.
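Mechanically, the conversion is a table lookup: the embedding layer stores one row per vocabulary entry, and a token ID selects its row. A minimal sketch, assuming a hypothetical 10-token vocabulary and 4 dimensions; the values here are random and untrained, so they only show the lookup, not learned meaning.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random floats.

    In a real model these values are adjusted during training;
    here nothing is trained, so they stay random.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one row of the table per token ID."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 3, 7], table)
print(len(vectors), len(vectors[0]))
```

The same token ID always maps to the same vector, so the two occurrences of ID 3 above receive identical embeddings.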
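How related two tokens are is commonly measured with cosine similarity, the cosine of the angle between their embedding vectors: values near 1 mean the vectors point in nearly the same direction, which indicates the tokens are semantically related, while values near 0 indicate little relation. `king − man + woman ≈ queen` is the famous example: vector arithmetic can capture analogies because embeddings encode relationships as directions in space.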
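Both ideas fit in a short sketch. The two-dimensional vectors below are hand-picked for illustration (one axis loosely standing for royalty, the other for gender) and are not taken from any trained model; `nearest` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 2-D embeddings: (royalty, gender). Real embeddings have hundreds
# or thousands of dimensions with no such tidy interpretation.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.5, 0.0],
}

def nearest(target, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# king - man + woman, computed component by component.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, exclude=("king", "man", "woman")))
```

With these hand-picked vectors the script prints `queen`. The input words are excluded from the search, a common convention when evaluating analogies, since the result vector often stays close to one of them.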
One-hot encoding
- Size: 50k dimensions (one per vocabulary entry)
- Relationships: none; every token is equally distant from every other
- Efficiency: extremely sparse (mostly zeros)
Learned embeddings
- Size: 768–4096 dimensions (configurable)
- Relationships: semantic; similar meanings cluster together
- Efficiency: dense; every dimension carries signal
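The "equally distant" property of one-hot vectors can be checked directly. A small sketch, assuming a hypothetical six-token vocabulary in place of a real one with tens of thousands of entries:

```python
import math
from itertools import combinations

def one_hot(token_id, vocab_size):
    """Vector of zeros with a single 1.0 at position token_id."""
    vec = [0.0] * vocab_size
    vec[token_id] = 1.0
    return vec

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab_size = 6
vectors = [one_hot(i, vocab_size) for i in range(vocab_size)]

# Every pair of distinct one-hot vectors differs in exactly two positions,
# so all pairwise distances come out the same.
distances = {round(euclidean(a, b), 9) for a, b in combinations(vectors, 2)}
print(distances)
```

The set contains a single value, the square root of 2, whichever two tokens are compared, so one-hot vectors carry no information about which tokens are related. Learned embeddings differ precisely in that their distances vary with meaning.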
What does cosine similarity between two embedding vectors measure?