Module 3 · Architecture

Controlling output

Temperature, sampling strategies, and prompt engineering — how to steer the model toward the response you want.

Temperature & randomness

Learn

Temperature is a scaling factor applied to the model's output logits before softmax converts them to probabilities. It controls the sharpness of the probability distribution — how concentrated or spread out the next-token probabilities are.

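The formula is `softmax(logits / temperature)`. At temperature = 1, the logits are unchanged, so the model samples in proportion to its raw probabilities. Below 1, the distribution sharpens toward the top token. Temperature = 0 cannot be plugged into the formula (it would divide by zero), so it is treated as the limiting case, greedy decoding: the model always picks the highest-probability token. Above 1, the distribution flattens: low-probability tokens get a real chance, producing more surprising, "creative" output.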
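The scaling described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function name `softmax_with_temperature` and the three example logits are assumptions made up for this sketch, and T = 0 is handled as greedy decoding by convention.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities for the given logits and temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Division by zero is undefined, so T = 0 is treated as greedy decoding:
        # all probability mass goes to the highest logit (the first one on ties).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as the temperature rises, while the ranking of the tokens never changes.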

T = 0 (greedy)

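Always picks the most likely token. Deterministic in principle (hosted APIs can still show small run-to-run differences) and the most reproducible setting. Best for tasks with a single correct answer, like code generation or math.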

T = 0.2–0.5

Mostly picks top candidates, rarely explores. Good for structured writing where you want consistency with slight variation.

T = 0.7–1.0

Balanced sampling — the model explores reasonable alternatives. Default for most conversational use cases.

T > 1.0

Flattens probabilities significantly. Output becomes unpredictable — useful for brainstorming, but risks incoherence.

High temperature plus long generation is a recipe for drift. Each unlikely token pushes the model further from the original intent — the model can't "course-correct" because it doesn't plan ahead; it just keeps predicting the next token.
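A back-of-the-envelope sketch of why length matters. It assumes, purely for illustration, that each generated token has a fixed and independent chance `per_token_off_track` of being an off-track token. Both the function name and the rates are hypothetical, and real generation is not independent (each token conditions on the ones before it), so this understates how drift compounds.

```python
def on_track_probability(per_token_off_track, num_tokens):
    """Chance that none of num_tokens independent draws is an off-track token."""
    return (1.0 - per_token_off_track) ** num_tokens

for eps in (0.01, 0.05):   # hypothetical per-token off-track rates
    for n in (20, 200):    # short vs. long generation
        print(f"eps={eps}, n={n}: {on_track_probability(eps, n):.4f}")
```

Even a small per-token rate leaves little chance of a long generation containing no off-track token at all, which is why high temperature is safer for short outputs.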
Recall

What happens to the next-token probability distribution as temperature increases?

Top-p & top-k sampling

Learn

Temperature alone is a blunt instrument: it rescales the entire vocabulary uniformly. Top-k and top-p (nucleus) sampling add a filtering step: instead of sampling from the full vocabulary (often 50,000+ tokens), the sampler only considers the most likely candidates.

Top-k keeps the k highest-probability tokens and renormalizes their probabilities. It is simple but brittle: the right k depends on the shape of the distribution, which changes at every generation step. Top-p (nucleus) keeps the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9), then renormalizes. It adapts dynamically: when the model is confident, the set is narrow; when the model is uncertain, the set widens.
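The two filters can be sketched over a plain list of probabilities. This is a minimal illustration under stated assumptions: the function names and both example distributions are invented for this sketch, and the nucleus cutoff uses "reaches p" (`>=`), which is one common convention.

```python
def _renormalize(probs, keep):
    """Zero out every token not in keep and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, whatever the distribution looks like."""
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

confident = [0.97, 0.01, 0.01, 0.01]  # hypothetical: one dominant token
uncertain = [0.30, 0.25, 0.25, 0.20]  # hypothetical: nearly flat

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name, "top-k=3:", [round(x, 3) for x in top_k_filter(dist, 3)])
    print(name, "top-p=0.9:", [round(x, 3) for x in top_p_filter(dist, 0.9)])
```

With k fixed at 3, both distributions keep exactly three tokens. The nucleus filter keeps one token for the confident distribution and all four for the nearly flat one.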

Top-k

How it works: Keep the top k tokens by probability
Advantage: Simple, predictable cutoff
Weakness: Fixed k ignores distribution shape
Best for: When you want a simple diversity cap

Top-p (nucleus)

How it works: Keep the smallest set of tokens whose cumulative probability reaches p
Advantage: Adapts to distribution confidence
Weakness: Can be too permissive in flat distributions
Best for: General-purpose text generation
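Many LLM APIs expose both settings, and a common combination is T ≈ 0.7 with top_p ≈ 0.9, which tends to produce diverse but coherent output. T = 0 is greedy whatever top_p is set to, so T = 0 with top_p = 1 is effectively deterministic. T = 1.5 with top_p = 0.95 sits at the high-randomness end and risks incoherence.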
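A minimal sketch of how the two settings might be combined in a single sampling step. The ordering (temperature first, then nucleus filtering) is an assumption: it is one common arrangement, and real APIs may differ. The name `sample_next_token` and the example logits are hypothetical.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index: temperature scaling first, then nucleus filtering."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: top_p has no effect because only the top token remains.
        return max(range(len(logits)), key=lambda i: logits[i])
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Build the nucleus: smallest set of top tokens whose probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices treats the weights as relative, so no explicit renormalizing.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

rng = random.Random(0)          # seeded so repeated runs give the same draws
logits = [3.0, 2.5, 1.0, -1.0]  # hypothetical logits for four tokens
print([sample_next_token(logits, 0.7, 0.9, rng) for _ in range(10)])
print(sample_next_token(logits, temperature=0, top_p=1.0))
```

With these logits, the nucleus at T = 0.7 and top_p = 0.9 contains only the first two tokens, so the two unlikely tokens can never be drawn.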
Recall

A model is very confident (top token has 95% probability). What's the key difference between top-k=40 and top-p=0.9 in this case?

Prompt engineering

Learn

Prompt engineering is the practice of designing inputs that reliably produce desired outputs, not by changing the model but by changing what you put in. With the weights and sampling settings fixed, the model's behavior is conditioned entirely on the prompt, so small changes in wording, structure, or examples can dramatically shift results.

A widely used and dependable pattern is few-shot prompting: include 2–5 input→output examples before your actual query. The model pattern-matches and continues the format. For reasoning-heavy tasks, chain-of-thought (CoT) prompting adds "Let's think step by step" or a worked reasoning example. This leads the model to generate intermediate steps before its answer, which often improves accuracy substantially on math, logic, and multi-step problems.

Poor prompt: "Summarize this."

Better: "You are a technical editor. Summarize the following article in 3 bullet points, each under 15 words, focusing on methodology and results. Use active voice. [ARTICLE]"

Few-shot: "Input: Article about fusion energy Output: • Fusion achieved net energy gain…" …[2 more examples]… "Input: [YOUR ACTUAL ARTICLE]"
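The few-shot layout above can be assembled programmatically. This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and the example pairs are placeholders rather than real data. It only builds the prompt string and does not call any model.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the real query into one prompt."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # End on an open "Output:" so the model continues in the demonstrated format.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

examples = [  # hypothetical placeholder pairs; replace with real ones from your task
    ("Article about topic A", "• Key finding of A"),
    ("Article about topic B", "• Key finding of B"),
]
prompt = build_few_shot_prompt(
    "Summarize each article in 3 bullet points, each under 15 words.",
    examples,
    "[YOUR ACTUAL ARTICLE]",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap them in and out and compare which ones steer the output best.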
Prompt engineering is not a substitute for fine-tuning when the task requires domain-specific knowledge or consistent output formats at scale. The options form a spectrum, prompt → few-shot → fine-tune, with increasing investment and reliability.
Recall

Why does chain-of-thought prompting improve accuracy on reasoning tasks?