AI in practice
Choosing the right model, understanding cost-performance tradeoffs, and evaluating model quality with benchmarks.
Choosing the right model
Not every problem needs the biggest model. Model selection is about matching capability to requirements: a 7B-parameter model might solve your classification task perfectly at 1/100th the cost of a frontier model. The key dimensions are capability (can it do the task?), latency (how fast?), cost (per-token pricing), and control (can you host it? fine-tune it?).
Frontier models (GPT-4o, Claude Opus, Gemini Ultra) excel at complex reasoning, long-context tasks, and following nuanced instructions, but you pay a premium. Smaller open models (Llama 3, Mistral, Phi) are roughly 10–100× cheaper per token and can run locally, which makes them well suited to high-volume, narrowly scoped tasks.
| Model tier | Params | Best for | Cost per 1M tokens (approx.) | Hosting |
|---|---|---|---|---|
| Frontier (GPT-4o, Claude) | Undisclosed | Complex reasoning, agents, multi-step | $2.50–15 / 1M tokens | API only |
| Mid (Llama 3 70B, Mixtral) | ~70B | Summarization, RAG, code gen | $0.50–2 / 1M tokens | Cloud or self-host |
| Small (Llama 3 8B, Phi-3) | 3–8B | Classification, extraction, routing | $0.05–0.30 / 1M tokens | Edge / on-device |
| Tiny (Gemma 2B) | ~2B | On-device, simple Q&A, specs | Free (local) | Phone / laptop |
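To make the cost gap concrete, here is a rough back-of-the-envelope sketch. The prices are assumed midpoints of the ranges in the table above, not quotes from any provider, and the workload (100,000 requests per day at 500 tokens each) is hypothetical. The sketch also ignores the usual split between input and output token pricing.

```python
# Assumed prices in dollars per 1M tokens: midpoints of the table's ranges.
PRICE_PER_1M_TOKENS = {
    "frontier": 8.75,   # midpoint of $2.50-15
    "mid": 1.25,        # midpoint of $0.50-2
    "small": 0.175,     # midpoint of $0.05-0.30
}


def monthly_cost(tier, requests_per_day, tokens_per_request, days=30):
    """Estimate monthly spend in dollars for one model tier."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[tier]


# Hypothetical workload: 100,000 requests per day, 500 tokens per request.
for tier in PRICE_PER_1M_TOKENS:
    print(f"{tier:>8}: ${monthly_cost(tier, 100_000, 500):,.2f} per month")
```

Under these assumptions the frontier tier comes to about $13,125 per month and the small tier to about $262, a 50× difference that sits inside the 10–100× range quoted earlier.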
What is the 'model cascading' strategy and why does it reduce costs?
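One common reading of model cascading is to send every request to a cheap model first and escalate to an expensive model only when the cheap one is not confident. The sketch below assumes that reading. The names `cascade`, `toy_small`, and `toy_large`, the confidence scores, and the 0.8 threshold are all hypothetical stand-ins, not any provider's API.

```python
from typing import Callable, Tuple

# A "model" here is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, small: Model, large: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, tier_used), escalating only on low confidence."""
    answer, confidence = small(query)
    if confidence >= threshold:
        return answer, "small"
    # The small model was unsure, so pay for the large model on this query.
    answer, _ = large(query)
    return answer, "large"


def toy_small(query: str) -> Tuple[str, float]:
    # Stand-in: pretend short queries are easy and long ones are hard.
    if len(query.split()) <= 5:
        return "small-answer", 0.95
    return "small-guess", 0.40


def toy_large(query: str) -> Tuple[str, float]:
    # Stand-in for an expensive frontier model.
    return "large-answer", 0.99


print(cascade("reset my password", toy_small, toy_large))
print(cascade("compare these two contracts clause by clause and flag conflicts",
              toy_small, toy_large))
```

If most traffic clears the threshold, the expensive model is billed for only a small share of requests, which is where the savings would come from. How to obtain a trustworthy confidence signal is the hard part and is left open here.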
Evaluation & benchmarks
Benchmarks provide standardized tests for comparing LLMs, but they have limits: models can be trained on benchmark data (contamination), benchmarks may not reflect your actual use case, and aggregate scores hide important failure modes. The key benchmarks are MMLU (57-subject knowledge test), HumanEval (code generation), GSM8K (grade-school math), and Chatbot Arena (human preference via blind A/B voting).
For production, you need your own eval suite: a set of representative inputs with expected outputs, scored by an automated metric (exact match, BLEU, ROUGE) or, increasingly, by another LLM acting as a judge (LLM-as-judge). Re-running the suite catches regressions when you swap models, update prompts, or add or change a RAG pipeline.
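A minimal sketch of such a suite, using exact match as the metric. The test cases, the two toy "models", and the 0.02 regression tolerance are invented for illustration. In practice the cases would come from real user queries and the callables would wrap actual model calls.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)


def exact_match_score(model: Callable[[str], str],
                      cases: List[EvalCase]) -> float:
    """Fraction of cases whose normalized output equals the expected label."""
    if not cases:
        raise ValueError("eval suite is empty")
    hits = sum(
        model(text).strip().lower() == expected.strip().lower()
        for text, expected in cases
    )
    return hits / len(cases)


def has_regressed(baseline: float, candidate: float,
                  tolerance: float = 0.02) -> bool:
    """Flag a candidate that scores meaningfully below the baseline."""
    return candidate < baseline - tolerance


# Hypothetical support-ticket routing cases.
CASES: List[EvalCase] = [
    ("I want my money back", "refund"),
    ("Where is my package?", "shipping"),
    ("The app crashes on login", "bug"),
    ("Please refund my order", "refund"),
]


def current_model(text: str) -> str:
    # Stand-in for the model currently in production.
    t = text.lower()
    if "refund" in t or "money back" in t:
        return "refund"
    return "shipping" if "package" in t else "bug"


def candidate_model(text: str) -> str:
    # Stand-in for a cheaper replacement that misses the "money back" phrasing.
    t = text.lower()
    if "refund" in t:
        return "Refund"
    return "shipping" if "package" in t else "bug"


baseline = exact_match_score(current_model, CASES)
candidate = exact_match_score(candidate_model, CASES)
print(baseline, candidate, has_regressed(baseline, candidate))
```

Here the candidate drops one of four cases, so the check flags a regression before the swap reaches users. Exact match suits label-style outputs; free-form text would need a softer metric or an LLM judge instead.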
| Aspect | Public benchmarks | Custom eval suite |
|---|---|---|
| Scope | Broad, general-purpose tasks | Narrow, your exact use case |
| Relevance | Varies; may not match your domain | Direct; built from real user queries |
| Contamination risk | High; models are trained on public data | Low; eval data is private |
| Use case | Initial model comparison and trend tracking | Production monitoring and regression detection |
Why are custom eval suites more valuable than public benchmarks for production decisions?