Module 5 · Application

AI in practice

Choosing the right model, understanding cost-performance tradeoffs, and evaluating model quality with benchmarks.

Choosing the right model

Learn

Not every problem needs the biggest model. Model selection is about matching capability to requirements: a 7B-parameter model may handle your classification task as well as a frontier model at roughly 1/100th the cost. The key dimensions are capability (can it do the task?), latency (how fast does it respond?), cost (per-token pricing), and control (can you host it or fine-tune it?).
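One way to make these four dimensions concrete is to treat them as hard requirements and pick the cheapest model that satisfies all of them. The sketch below is illustrative only: the `Candidate` fields, the `pick_model` helper, the model names, and every number are hypothetical placeholders, and in practice the accuracy and latency values would come from your own measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical label, not a real product
    accuracy: float           # capability: measured on your own task, 0.0 to 1.0
    p95_latency_ms: float     # latency: measured 95th-percentile response time
    usd_per_1m_tokens: float  # cost: blended per-token price
    self_hostable: bool       # control: can you run and fine-tune it yourself?


def pick_model(candidates, min_accuracy, max_latency_ms,
               need_self_host=False) -> Optional[Candidate]:
    """Return the cheapest candidate that meets every requirement, or None."""
    eligible = [
        c for c in candidates
        if c.accuracy >= min_accuracy
        and c.p95_latency_ms <= max_latency_ms
        and (c.self_hostable or not need_self_host)
    ]
    return min(eligible, key=lambda c: c.usd_per_1m_tokens, default=None)


# Made-up numbers for illustration only.
CANDIDATES = [
    Candidate("small-8b", 0.93, 120, 0.10, True),
    Candidate("mid-70b", 0.95, 400, 1.00, True),
    Candidate("frontier", 0.97, 900, 10.00, False),
]

choice = pick_model(CANDIDATES, min_accuracy=0.92, max_latency_ms=500)
print(choice.name if choice else "no candidate meets the requirements")
```

With these placeholder numbers the small model is selected, because it clears the accuracy and latency bars at the lowest price. Raising `min_accuracy` is what pushes the choice toward a larger tier.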

Frontier models (GPT-4o, Claude Opus, Gemini Ultra) excel at complex reasoning, long-context tasks, and following nuanced instructions — but you pay a premium. Smaller open models (Llama 3, Mistral, Phi) are 10–100× cheaper per token and can run locally, making them ideal for high-volume, narrowly-scoped tasks.

| Model tier | Params | Best for | Approx. cost (USD per 1M tokens) | Hosting |
| --- | --- | --- | --- | --- |
| Frontier (GPT-4o, Claude) | Undisclosed | Complex reasoning, agents, multi-step | $2.50–15 | API only |
| Mid (Llama 3 70B, Mixtral) | ~70B | Summarization, RAG, code gen | $0.50–2 | Cloud or self-host |
| Small (Llama 3 8B, Phi-3) | 3–8B | Classification, extraction, routing | $0.05–0.30 | Edge / on-device |
| Tiny (Gemma 2B) | ~2B | On-device, simple Q&A, specs | Free (local) | Phone / laptop |

Start small and escalate: use a small model for the roughly 80% of queries it handles well, and route the remaining 20% to a frontier model. This 'model cascading' pattern often cuts total cost by 60–80% with little loss in quality, provided escalation reliably catches the queries the small model gets wrong.
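The cascading pattern and its cost arithmetic can be sketched as follows. This is a minimal illustration, not a production router: `cascade`, `blended_cost_per_1m`, the confidence threshold, and the two toy models are all hypothetical stand-ins, and the cost formula assumes both models process the same number of tokens per query.

```python
def cascade(query, small_model, frontier_model, min_confidence=0.8):
    """Try the small model first; escalate only when it is unsure."""
    answer, confidence = small_model(query)
    if confidence >= min_confidence:
        return answer, "small"
    return frontier_model(query), "frontier"


def blended_cost_per_1m(small_price, frontier_price, escalation_rate):
    """Every query pays the small model; escalated ones also pay the frontier model."""
    return small_price + escalation_rate * frontier_price


# Stand-in models so the sketch runs without any API (assumption: a real
# small model would expose some confidence signal, e.g. a calibrated score).
def toy_small(query):
    confident = len(query.split()) <= 5  # pretend short queries are easy
    confidence = 0.95 if confident else 0.40
    return f"small:{query}", confidence


def toy_frontier(query):
    return f"frontier:{query}"


print(cascade("reset my password", toy_small, toy_frontier))
print(cascade("compare these two contracts clause by clause please",
              toy_small, toy_frontier))

# Illustrative prices (USD per 1M tokens) taken from within the table's ranges.
cost = blended_cost_per_1m(small_price=0.20, frontier_price=10.00,
                           escalation_rate=0.20)
savings = 1 - cost / 10.00
print(f"blended: ${cost:.2f} per 1M tokens, savings vs frontier-only: {savings:.0%}")
```

With a 20% escalation rate and these example prices, the blended cost is $2.20 per 1M tokens against $10.00 for sending everything to the frontier model, a saving of about 78%. The saving shrinks quickly as the escalation rate rises, so the quality of the confidence signal matters as much as the price gap.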
Recall

What is the 'model cascading' strategy and why does it reduce costs?

Evaluation & benchmarks

Learn

Benchmarks provide standardized tests for comparing LLMs, but they have limits: models can be trained on benchmark data (contamination), benchmarks may not reflect your actual use case, and aggregate scores hide important failure modes. The key benchmarks are MMLU (57-subject knowledge test), HumanEval (code generation), GSM8K (grade-school math), and Chatbot Arena (human preference via blind A/B voting).
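To see how an aggregate score can hide a failure mode, consider this small sketch. The per-item results are invented for illustration and do not come from any real benchmark or model; `accuracy_by_category` and `overall_accuracy` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical per-item results: (category, answered_correctly).
RESULTS = {
    "model_a": [("math", True)] * 9 + [("math", False)] * 1
             + [("code", True)] * 5 + [("code", False)] * 5,
    "model_b": [("math", True)] * 7 + [("math", False)] * 3
             + [("code", True)] * 7 + [("code", False)] * 3,
}


def overall_accuracy(items):
    """Fraction of all items answered correctly."""
    return sum(ok for _, ok in items) / len(items)


def accuracy_by_category(items):
    """Fraction answered correctly within each category."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in items:
        totals[category] += 1
        correct[category] += ok  # True counts as 1, False as 0
    return {c: correct[c] / totals[c] for c in totals}


for name, items in RESULTS.items():
    print(name, overall_accuracy(items), accuracy_by_category(items))
```

Both toy models score 0.7 overall, yet one of them gets only half of the code items right. If your workload is mostly code, the identical headline numbers would be misleading.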

For production, you need your own eval suite: a set of representative inputs with expected outputs, scored by an automated metric (exact match, BLEU, ROUGE) or, increasingly, by another LLM acting as a judge (LLM-as-judge). Running it on every change catches regressions when you swap models, update prompts, or add a RAG pipeline.
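A custom eval suite can start very small. The sketch below scores a model with exact match and flags a regression when a candidate falls below the baseline; the eval cases, the two stand-in models, the `run_eval` and `has_regressed` helpers, and the 0.02 tolerance are all hypothetical. A real suite would call actual models and might replace `exact_match` with BLEU, ROUGE, or an LLM judge.

```python
def exact_match(prediction: str, expected: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == expected.strip().lower()


def run_eval(model, cases):
    """Score a callable model over (input, expected) pairs.

    Returns (score, failures), where failures lists the cases to inspect.
    """
    failures = []
    for prompt, expected in cases:
        prediction = model(prompt)
        if not exact_match(prediction, expected):
            failures.append((prompt, expected, prediction))
    score = 1 - len(failures) / len(cases)
    return score, failures


def has_regressed(baseline_score, candidate_score, tolerance=0.02):
    """True when the candidate is worse than the baseline by more than tolerance."""
    return candidate_score < baseline_score - tolerance


# Hypothetical eval cases; a real suite would be built from actual user queries.
EVAL_CASES = [
    ("Is 'refund not received' a billing or technical issue?", "billing"),
    ("Is 'app crashes on launch' a billing or technical issue?", "technical"),
    ("Is 'charged twice' a billing or technical issue?", "billing"),
    ("Is 'cannot log in' a billing or technical issue?", "technical"),
]


# Stand-in models so the sketch runs offline.
def current_model(prompt):
    billing_terms = ("refund", "charged")
    return "Billing" if any(t in prompt for t in billing_terms) else "technical"


def candidate_model(prompt):
    return "billing" if "refund" in prompt else "technical"


baseline_score, _ = run_eval(current_model, EVAL_CASES)
candidate_score, candidate_failures = run_eval(candidate_model, EVAL_CASES)
print(baseline_score, candidate_score, has_regressed(baseline_score, candidate_score))
for prompt, expected, got in candidate_failures:
    print(f"FAIL: {prompt!r} expected {expected!r}, got {got!r}")
```

Keeping the failing cases, not just the score, is the useful part: they show which kind of query broke when the model or prompt changed.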

| Aspect | Public benchmarks | Custom eval suite |
| --- | --- | --- |
| Scope | Broad, general-purpose tasks | Narrow, your exact use case |
| Relevance | Varies; may not match your domain | Direct; built from real user queries |
| Contamination risk | High; models are trained on public data | Low; eval data is private |
| Use case | Initial model comparison and trend tracking | Production monitoring and regression detection |

A single benchmark score tells you almost nothing about whether a model works for your specific task. Always validate with your own data — at minimum, spot-check 20–50 representative queries against each candidate model.
Recall

Why are custom eval suites more valuable than public benchmarks for production decisions?