Controlling output · AI Fundamentals

Temperature & randomness

Learn

Temperature is a scaling factor applied to the model's output logits before softmax converts them to probabilities. It controls the sharpness of the probability distribution — how concentrated or spread out the next-token probabilities are.

Temperature = 0 is treated as greedy decoding: dividing by zero is undefined, so implementations simply pick the highest-probability token, which makes the output deterministic. Below 1, the distribution sharpens and the top tokens dominate. At temperature = 1, the model samples from its unscaled probabilities. Above 1, the distribution flattens: low-probability tokens get a real chance, producing more surprising, "creative" output. The formula: `softmax(logits / temperature)`.
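The formula can be tried out on a handful of numbers. The following is a minimal sketch, not any particular library's implementation; the function name `softmax_with_temperature` and the four logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities for the given logits at a temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as the temperature rises, while the ranking of the tokens stays the same.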

T = 0 (greedy)

Always picks the most likely token. Deterministic, reproducible, best for factual tasks like code generation or math.

T = 0.2–0.5

Mostly picks top candidates, rarely explores. Good for structured writing where you want consistency with slight variation.

T = 0.7–1.0

Balanced sampling — the model explores reasonable alternatives. Default for most conversational use cases.

T > 1.0

Flattens probabilities significantly. Output becomes unpredictable — useful for brainstorming, but risks incoherence.
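One way to put numbers on these ranges is to measure how flat the distribution is at each setting. The sketch below uses Shannon entropy as the flatness measure, which is a choice made here for illustration; the logits are hypothetical and the helper names are not from any library.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax for temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means a flatter distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for a five-token vocabulary.
logits = [3.0, 2.0, 1.0, 0.0, -1.0]

for t in (0.3, 0.7, 1.0, 1.5):
    probs = softmax(logits, t)
    print(f"T={t}: top token share={max(probs):.2f}, "
          f"entropy={entropy_bits(probs):.2f} bits")
```

As the temperature rises, the top token's share falls and the entropy climbs toward its ceiling of log2(5) bits for a five-token vocabulary.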

High temperature plus long generation is a recipe for drift. Each unlikely token becomes part of the context for every later prediction, so one off-topic choice makes the next one more likely. Plain sampling has no mechanism to steer the text back toward the original intent; the model just keeps predicting the next token.

Practice


Recall

What happens to the next-token probability distribution as temperature increases?

Top-p & top-k sampling

Learn

Temperature alone is a blunt instrument: it rescales every logit by the same factor, so even at moderate settings the long tail of unlikely tokens keeps some probability. Top-k and top-p (nucleus) sampling add a filtering step: instead of sampling from the full vocabulary (often 50,000+ tokens), the sampler considers only the most likely candidates.

Top-k keeps the k highest-probability tokens and renormalizes their probabilities to sum to 1. Simple, but brittle: the right k depends on the distribution shape, which changes at every generation step. Top-p (nucleus) keeps the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9), then renormalizes. It adapts dynamically: when the model is confident, it picks from a narrow set; when uncertain, it considers more options.

Top-k

How it works: Keep top k tokens by probability
Advantage: Simple, predictable cutoff
Weakness: Fixed k — ignores distribution shape
Best for: When you want a simple diversity cap

Top-p (nucleus)

How it works: Keep the smallest set of tokens whose cumulative probability reaches p
Advantage: Adapts to distribution confidence
Weakness: Can be too permissive in flat distributions
Best for: General-purpose text generation
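The two filters compared above can be sketched in a few lines. This is an illustrative version working on plain probability lists, not how any specific inference library implements them; the function names and both example distributions are invented for the demonstration.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; returns {token_index: prob}."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical next-token distributions over a six-token vocabulary.
confident = [0.95, 0.02, 0.01, 0.01, 0.005, 0.005]
uncertain = [0.30, 0.20, 0.18, 0.14, 0.10, 0.08]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name,
          "top-k=3 keeps", len(top_k_filter(dist, 3)),
          "| top-p=0.9 keeps", len(top_p_filter(dist, 0.9)))
```

With these numbers, top-k always keeps three tokens regardless of shape, while top-p keeps a single token for the confident distribution and five for the uncertain one.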

Many LLM APIs expose both settings, and a common starting point is T ≈ 0.7 with top_p ≈ 0.9, which tends to produce diverse but coherent output. At T = 0 decoding is greedy, so top_p has no effect. T = 1.5 with top_p = 0.95 sits at the high-variety end, where incoherence becomes a real risk.
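To see how the two settings interact, here is a rough sketch of a single sampling step. It assumes temperature is applied first and the nucleus filter second, which is a common ordering but not guaranteed for every API; `sample_next_token` and the logits are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index: temperature scaling, then nucleus filtering, then a draw."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if temperature == 0:
        # Greedy decoding: top_p is irrelevant because only the argmax is used.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Step 1: temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 2: nucleus (top-p) filter on the scaled probabilities.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Step 3: draw from the survivors; random.choices normalizes the weights.
    return rng.choices(keep, weights=[probs[i] for i in keep], k=1)[0]

logits = [2.0, 1.5, 0.3, -1.0, -3.0]  # hypothetical logits
rng = random.Random(0)                # seeded so runs are repeatable
print([sample_next_token(logits, 0.7, 0.9, rng) for _ in range(10)])
print(sample_next_token(logits, temperature=0))
```

Passing a seeded `random.Random` keeps the demonstration repeatable; a real decoding loop would call a function like this once per generated token.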

Practice


Recall

A model is very confident (top token has 95% probability). What's the key difference between top-k=40 and top-p=0.9 in this case?

Prompt engineering

Learn

Prompt engineering is the practice of designing inputs that reliably produce desired outputs, not by changing the model but by changing what you put in. For a fixed model and fixed sampling settings, the output distribution is conditioned entirely on the prompt, so small changes in wording, structure, or examples can dramatically shift results.

A dependable starting pattern is few-shot prompting: include 2–5 input→output examples before your actual query. The model picks up the pattern and continues the format. For reasoning-heavy tasks, chain-of-thought (CoT) prompting, which adds "Let's think step by step" or a worked reasoning example, gets the model to generate intermediate steps before its answer. This often substantially improves accuracy on math, logic, and multi-step problems.

Poor prompt: "Summarize this."

Better: "You are a technical editor. Summarize the following article in 3 bullet points, each under 15 words, focusing on methodology and results. Use active voice. [ARTICLE]"

Few-shot: "Input: Article about fusion energy Output: • Fusion achieved net energy gain…" …[2 more examples]… "Input: [YOUR ACTUAL ARTICLE]"
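A few-shot prompt like the one above is usually assembled programmatically. The helper below is one possible way to do it, a sketch rather than a standard API: `build_few_shot_prompt` is a hypothetical name, and the angle-bracket strings are placeholders standing in for real articles and summaries.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the real query into one prompt."""
    parts = [instruction.strip(), ""]
    for source, summary in examples:
        parts += [f"Input: {source}", f"Output: {summary}", ""]
    # End on an open "Output:" so the model continues in the demonstrated format.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

# Placeholder examples; real ones would come from your own task data.
examples = [
    ("<example article 1>", "• <bullet 1>\n• <bullet 2>\n• <bullet 3>"),
    ("<example article 2>", "• <bullet 1>\n• <bullet 2>\n• <bullet 3>"),
]

prompt = build_few_shot_prompt(
    "You are a technical editor. Summarize each article in 3 bullet points, "
    "each under 15 words, focusing on methodology and results. Use active voice.",
    examples,
    "<your actual article>",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap them or change their number without touching the instruction. A chain-of-thought variant would put worked reasoning into each example's output.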

Prompt engineering is not a substitute for fine-tuning when the task requires domain-specific knowledge or consistent output formats at scale. It's a spectrum — prompt → few-shot → fine-tune — with increasing investment and reliability.

Practice


Recall

Why does chain-of-thought prompting improve accuracy on reasoning tasks?