Module 5 · Application

AI in practice

Choosing the right model, understanding cost-performance tradeoffs, and evaluating model quality with benchmarks.

Choosing the right model

Learn

Not every problem needs the biggest model. Model selection is about matching capability to requirements: a 7B-parameter model might solve your classification task perfectly at 1/100th the cost of a frontier model. The key dimensions are capability (can it do the task?), latency (how fast?), cost (per-token pricing), and control (can you host it? fine-tune it?).

Frontier models (GPT-4o, Claude Opus, Gemini Ultra) excel at complex reasoning, long-context tasks, and following nuanced instructions, but you pay a premium. Smaller open models (Llama 3, Mistral, Phi) are roughly 10–100× cheaper per token and can run locally, which makes them a good fit for high-volume, narrowly scoped tasks.

Model tier	Parameters	Best for	Approx. price (USD per 1M tokens)	Hosting
Frontier (GPT-4o, Claude)	Undisclosed	Complex reasoning, agents, multi-step	$2.50–15	API only
Mid (Llama 3 70B, Mixtral)	~70B	Summarization, RAG, code gen	$0.50–2	Cloud or self-host
Small (Llama 3 8B, Phi-3)	3–8B	Classification, extraction, routing	$0.05–0.30	Edge / on-device
Tiny (Gemma 2B)	~2B	On-device, simple Q&A, specs	Free (local)	Phone / laptop

Start small and escalate: use a small model for the roughly 80% of queries it handles well, and route the remaining 20% to a frontier model. This 'model cascading' pattern can cut total cost by 60–80% with little loss in quality, provided the router reliably detects the queries the small model cannot handle.
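The cascading idea can be sketched in a few lines. This is a minimal illustration, not a production router: the `Model` interface (a callable returning an answer plus a confidence score), the 0.8 threshold, and the toy stand-in models are all assumptions made for the example. The prices are picked from inside the ranges in the table above, and the cost model assumes every query hits the small model first.

```python
from typing import Callable, Tuple

# Hypothetical interface: a model maps a query to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, small: Model, frontier: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Answer with the small model; escalate when its confidence is too low."""
    answer, confidence = small(query)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = frontier(query)
    return answer, "frontier"


def blended_price(small_price: float, frontier_price: float,
                  escalation_rate: float) -> float:
    """Average USD per 1M tokens when every query runs on the small model
    and a fraction `escalation_rate` is re-run on the frontier model."""
    return small_price + escalation_rate * frontier_price


# Toy stand-ins for illustration only; real models would call an API or a
# local runtime and derive confidence from logprobs or a separate classifier.
def toy_small(query: str) -> Tuple[str, float]:
    return ("short answer", 0.9 if len(query) < 40 else 0.3)


def toy_frontier(query: str) -> Tuple[str, float]:
    return ("detailed answer", 0.99)


print(cascade("Is this email spam?", toy_small, toy_frontier))

price = blended_price(small_price=0.20, frontier_price=10.0, escalation_rate=0.20)
savings = 1 - price / 10.0
print(f"${price:.2f} per 1M tokens, {savings:.0%} cheaper than frontier-only")
```

With these assumed numbers the blended price is $2.20 per 1M tokens, about 78% below frontier-only. The savings shrink quickly as the escalation rate rises, so the share of escalated queries is worth measuring on real traffic.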

Practice

Monthly cost = tokens per request × requests per day × 30 × price per token

Inputs: tokens per request (input + output), requests per day, and price in USD per 1M tokens (divide by 1,000,000 to get the price per token).
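The formula translates directly into a small function. The sketch below assumes the price is quoted per 1M tokens, as in the tier table; the function name and the example inputs are illustrative and not taken from the lesson.

```python
def monthly_cost(tokens_per_request: int, requests_per_day: int,
                 price_per_million_tokens: float, days: int = 30) -> float:
    """Estimated monthly spend in USD, given a price quoted per 1M tokens."""
    price_per_token = price_per_million_tokens / 1_000_000
    return tokens_per_request * requests_per_day * days * price_per_token


# Illustrative inputs (assumed): 1,000 tokens per request,
# 5,000 requests per day, $5 per 1M tokens.
cost = monthly_cost(tokens_per_request=1_000, requests_per_day=5_000,
                    price_per_million_tokens=5.0)
print(f"${cost:,.2f} per month")
```

Because cost is linear in every input, halving the tokens per request (for example with shorter prompts) halves the bill, as does moving to a model at half the price.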

Recall

What is the 'model cascading' strategy and why does it reduce costs?

Evaluation & benchmarks

Learn

Benchmarks provide standardized tests for comparing LLMs, but they have limits: models can be trained on benchmark data (contamination), benchmarks may not reflect your actual use case, and aggregate scores hide important failure modes. The key benchmarks are MMLU (57-subject knowledge test), HumanEval (code generation), GSM8K (grade-school math), and Chatbot Arena (human preference via blind A/B voting).

For production, you need your own eval suite — a set of representative inputs with expected outputs, scored by an automated metric (exact match, BLEU, ROUGE) or, increasingly, by another LLM acting as judge (LLM-as-judge). This catches regressions when you swap models, update prompts, or add RAG pipelines.
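A custom eval suite can start very small. The sketch below scores two models by exact match on a handful of cases and flags a regression when the candidate scores lower than the baseline. The test cases, the keyword-based stand-in models, and the helper names are all hypothetical; in practice the cases would come from real user queries and the models would be real API or local calls.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(model: Callable[[str], str],
                         cases: List[Dict[str, str]]) -> float:
    """Fraction of cases where the model output equals the expected output."""
    hits = sum(normalize(model(c["input"])) == normalize(c["expected"])
               for c in cases)
    return hits / len(cases)


def regressed(baseline_score: float, candidate_score: float,
              tolerance: float = 0.0) -> bool:
    """True when the candidate is worse than the baseline beyond the tolerance."""
    return candidate_score < baseline_score - tolerance


# Hypothetical eval cases for a support-ticket classifier.
EVAL_CASES = [
    {"input": "Refund request for order 1182", "expected": "billing"},
    {"input": "App crashes on login", "expected": "bug"},
    {"input": "How do I export my data?", "expected": "how-to"},
    {"input": "Charged twice this month", "expected": "billing"},
]


# Stand-in models (keyword rules) used only to make the example runnable.
def baseline_model(text: str) -> str:
    lowered = text.lower()
    if "refund" in lowered or "charged" in lowered:
        return "Billing"
    return "bug" if "crash" in lowered else "how-to"


def candidate_model(text: str) -> str:
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    return "bug" if "crash" in lowered else "how-to"


baseline_score = exact_match_accuracy(baseline_model, EVAL_CASES)
candidate_score = exact_match_accuracy(candidate_model, EVAL_CASES)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f} "
      f"regressed={regressed(baseline_score, candidate_score)}")
```

Exact match suits classification and extraction tasks. For free-form output, the same harness could swap in an overlap metric or an LLM judge as the scoring function.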

Public benchmarks

Scope: Broad, general-purpose tasks
Relevance: Varies — may not match your domain
Contamination risk: High — models are trained on public data
Use case: Initial model comparison and trend tracking

Custom eval suite

Scope: Narrow, your exact use case
Relevance: Direct — built from real user queries
Contamination risk: Low — eval data is private
Use case: Production monitoring and regression detection

A single benchmark score tells you almost nothing about whether a model works for your specific task. Always validate with your own data — at minimum, spot-check 20–50 representative queries against each candidate model.

Practice

Recall

Why are custom eval suites more valuable than public benchmarks for production decisions?