Controlling output
Temperature, sampling strategies, and prompt engineering — how to steer the model toward the response you want.
Temperature & randomness
Temperature is a scaling factor applied to the model's output logits before softmax converts them to probabilities. It controls the sharpness of the probability distribution — how concentrated or spread out the next-token probabilities are.
At temperature = 1, the model samples from the unmodified softmax distribution. Below 1, the distribution sharpens toward the top tokens; above 1, it flattens, so low-probability tokens get a chance, producing more surprising, "creative" output. The formula is `softmax(logits / temperature)`. Because that expression is undefined at temperature = 0, implementations treat 0 as a special case: greedy decoding, which always picks the highest-probability token (deterministic).
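To make the scaling concrete, here is a minimal sketch of `softmax(logits / temperature)` in plain Python. The function name `softmax_with_temperature` and the three example logits are made up for illustration, and treating temperature 0 as an argmax is an assumption about how the limit is usually handled, not a description of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed limit case: all probability mass on the arg-max token (greedy).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as the temperature rises, while the shares of the other two tokens grow.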
T = 0 (greedy)
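Always picks the most likely token. Deterministic in principle (hardware and batching effects can still cause small run-to-run differences in practice), so it suits tasks with a single correct answer, such as code generation or math.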
T = 0.2–0.5
Mostly picks top candidates, rarely explores. Good for structured writing where you want consistency with slight variation.
T = 0.7–1.0
Balanced sampling: the model explores reasonable alternatives. A common default range for conversational use cases.
T > 1.0
Flattens probabilities significantly. Output becomes unpredictable — useful for brainstorming, but risks incoherence.
What happens to the next-token probability distribution as temperature increases?
Top-p & top-k sampling
Temperature alone is a blunt instrument: it rescales every logit by the same factor, so the long tail of unlikely tokens always keeps some probability mass. Top-k and top-p (nucleus) sampling add a truncation step: instead of sampling from the full vocabulary (often tens of thousands of tokens or more), the model samples only from the most likely candidates.
Top-k keeps the k highest-probability tokens and renormalizes. It is simple but brittle: the right k depends on the distribution shape, which varies from one generation step to the next. Top-p (nucleus) keeps the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9), then renormalizes. It adapts dynamically: when the model is confident, the set is narrow; when it is uncertain, the set widens.
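The two filters can be sketched as follows. This is an illustrative implementation over a plain list of probabilities, not the code of any specific inference library; the function names and the two example distributions are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; the rest get 0."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def surviving(filtered):
    """Count tokens that still have non-zero probability."""
    return sum(1 for q in filtered if q > 0)

# Made-up distributions: one peaked, one nearly flat.
confident = [0.95, 0.02, 0.01, 0.01, 0.01]
uncertain = [0.28, 0.24, 0.20, 0.16, 0.12]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name,
          "top-k=2 keeps", surviving(top_k_filter(dist, 2)),
          "| top-p=0.9 keeps", surviving(top_p_filter(dist, 0.9)))
```

With these inputs, top-k keeps two tokens in both cases, while top-p keeps one token for the peaked distribution and all five for the flat one, which is the adaptive behavior described above.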
Top-k
- How it works: Keep the top k tokens by probability
- Advantage: Simple, predictable cutoff
- Weakness: Fixed k ignores distribution shape
- Best for: When you want a simple diversity cap
Top-p (nucleus)
- How it works: Keep the smallest set of tokens whose cumulative probability reaches p
- Advantage: Adapts to distribution confidence
- Weakness: Can be too permissive in flat distributions
- Best for: General-purpose text generation
A model is very confident (top token has 95% probability). What's the key difference between top-k=40 and top-p=0.9 in this case?
Prompt engineering
Prompt engineering is the practice of designing inputs that reliably produce desired outputs, not by changing the model, but by changing what you put in. With the weights and sampling settings held fixed, the output distribution depends only on the prompt, so small changes in wording, structure, or examples can noticeably shift results.
A widely used and often reliable pattern is few-shot prompting: include 2–5 input→output examples before your actual query. The model tends to pick up the pattern and continue in the same format. For reasoning-heavy tasks, chain-of-thought (CoT) prompting, which adds "Let's think step by step" or a worked reasoning example, encourages the model to generate intermediate steps before the final answer. This often improves accuracy on math, logic, and multi-step problems, though the size of the gain depends on the model and the task.
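As a rough sketch, a few-shot prompt is just the examples and the query joined into one string. The helper `build_few_shot_prompt`, the `Input:`/`Output:` labels, and the sentiment examples below are all hypothetical choices for illustration; real prompts may use whatever format suits the task and the model.

```python
def build_few_shot_prompt(examples, query, chain_of_thought=False):
    """Format (input, output) pairs, then the real query with an open output slot."""
    parts = []
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    final = f"Input: {query}\nOutput:"
    if chain_of_thought:
        # Zero-shot CoT cue placed where the model's answer begins.
        final += " Let's think step by step."
    parts.append(final)
    return "\n\n".join(parts)

# Made-up sentiment-labelling examples.
examples = [
    ("The movie was wonderful.", "positive"),
    ("I want my money back.", "negative"),
]

prompt = build_few_shot_prompt(examples, "Best purchase this year.")
print(prompt)

cot_prompt = build_few_shot_prompt([], "What is 17 * 24?", chain_of_thought=True)
print(cot_prompt)
```

The resulting string would then be sent to the model as its input. Leaving the final `Output:` slot empty is what invites the model to continue the pattern.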
Why does chain-of-thought prompting improve accuracy on reasoning tasks?