Advanced techniques
RAG, fine-tuning, and understanding hallucination — when prompting isn't enough and how to ground model outputs.
RAG: Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) addresses two core LLM limitations: stale knowledge (the model knows only what was in its training data) and hallucination on niche topics. Instead of relying solely on the model's parameters, RAG retrieves relevant documents from an external knowledge base and injects them into the prompt before generation.
The pipeline has three stages: (1) Indexing — chunk your documents, embed each chunk into vectors, store in a vector database. (2) Retrieval — embed the user's query, find the k nearest chunks by cosine similarity. (3) Generation — prepend the retrieved chunks to the prompt with instructions like "Answer based only on the provided context."
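The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into chunks of at most max_words words (naive fixed-size chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Stage 1, indexing: chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Stage 2, retrieval: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(chunks: list[str], question: str) -> str:
    """Stage 3, generation: prepend retrieved chunks with a grounding instruction."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer based only on the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


docs = [
    "The refund window is 30 days from the date of delivery.",
    "Support is available by email on weekdays.",
]
question = "How many days is the refund window?"
index = build_index(docs)
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)
print(prompt)
```

The resulting `prompt` string is what would be sent to the model. Numbering the chunks makes it easy to ask the model to cite the chunk that supports each claim.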
When RAG shines
Proprietary documents, frequently-updated knowledge bases, compliance where every claim must cite a source.
When RAG struggles
Poor chunking splits concepts mid-thought. Irrelevant retrieval pollutes the context. Multi-hop questions, which need facts spread across several chunks, can exceed what a single similarity search retrieves.
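One common way to reduce mid-thought splits is to chunk on sentence boundaries and repeat a little text between neighbouring chunks. The sketch below assumes a simple punctuation-based sentence splitter and a hypothetical `chunk_by_sentence` helper; real text (abbreviations, lists, tables) usually needs a more careful splitter.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one, so a thought that spans a boundary appears intact somewhere.
    A chunk can exceed max_chars when a single sentence is longer than the limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the trailing sentences over as shared context.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Refunds take 30 days. They start at delivery. "
    "Support is by email. Phones are not staffed."
)
chunks = chunk_by_sentence(text, max_chars=50, overlap=1)
for c in chunks:
    print(repr(c))
```

Overlap trades index size for recall: each repeated sentence is embedded and stored twice, but a query about the boundary sentence can match either chunk.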
RAG vs. fine-tuning
RAG is usually cheaper, faster to update, and more transparent (sources are visible). Fine-tuning bakes knowledge into the weights, which makes it better suited to style, tone, and implicit patterns.
What is the primary advantage of RAG over relying solely on the model's parametric knowledge?
Fine-tuning vs. prompting
Fine-tuning continues training a pretrained model on a smaller, domain-specific dataset, adjusting the model's weights to internalize new patterns, styles, or knowledge. Unlike prompting, which influences only the request that carries the prompt, fine-tuning changes the model's behavior on every subsequent request.
Fine-tuning is the right call when: you have hundreds to thousands of high-quality examples, the output format is highly structured and must be consistent, or the task requires implicit pattern recognition that would take too many tokens to describe in a prompt. It's the wrong call when: you're working with rapidly changing data (use RAG), you have fewer than ~50 examples (use few-shot prompting), or the base model already handles the task well.
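The rules of thumb above can be written down as a small decision function. This is only a sketch: the function name and return labels are hypothetical, the thresholds are the rough figures from the text rather than hard limits, and the handling of the 50 to 99 example range is an assumption the text does not cover.

```python
def choose_approach(
    num_examples: int,
    data_changes_rapidly: bool,
    base_model_sufficient: bool,
) -> str:
    """Encode the heuristics from the text; treat the result as a starting point."""
    if base_model_sufficient:
        # The base model already handles the task well.
        return "prompting"
    if data_changes_rapidly:
        # Retraining cannot keep up with fast-moving data.
        return "rag"
    if num_examples < 50:
        return "few-shot prompting"
    if num_examples >= 100:
        # "Hundreds to thousands" of high-quality examples.
        return "fine-tuning"
    # Assumption: the 50-99 range is a judgment call.
    return "few-shot prompting or a small fine-tuning pilot"


print(choose_approach(num_examples=20, data_changes_rapidly=False, base_model_sufficient=False))
print(choose_approach(num_examples=2000, data_changes_rapidly=False, base_model_sufficient=False))
```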
Prompt engineering
- Cost to start: Zero (just write text)
- Update speed: Instant (change the prompt)
- Data needed: None to a few examples
- Consistency: Variable (prompt-sensitive)
- Knowledge injection: Limited to context window
Fine-tuning
- Cost to start: Compute + curated dataset
- Update speed: Hours to days (retrain)
- Data needed: 50–10,000+ examples
- Consistency: High (weights encode the pattern)
- Knowledge injection: Baked into parameters
What is the defining trade-off between fine-tuning and RAG for knowledge-intensive tasks?
Hallucination & grounding
Hallucination is when an LLM generates text that sounds plausible but is factually incorrect, nonsensical, or unsupported by the provided context. It's not a bug; it's a direct consequence of how LLMs work: they are trained to produce probable next tokens, not factually verified ones.
Hallucination has several root causes: (1) the model's training data was incomplete or contradictory for this topic, (2) the model generalizes incorrectly from superficially similar patterns, (3) the sampling process picks an unlikely token that sends the rest of the generation off course, or (4) the prompt asks for information the model couldn't possibly know (like future events or private data).
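Cause (3) is easy to see numerically. The sketch below applies a temperature-scaled softmax to made-up logits for three candidate next tokens. The tokens and logit values are invented for illustration and do not come from any real model.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates after the prefix "The capital of Australia is".
tokens = ["Canberra", "Sydney", "Melbourne"]
logits = [3.0, 1.5, 0.5]  # invented values

for t in (0.5, 1.0, 1.5):
    probs = softmax(logits, t)
    p_wrong = 1.0 - probs[0]
    print(f"temperature={t}: P(token other than {tokens[0]}) = {p_wrong:.2f}")
```

Even when the correct token is the most likely one, sampling leaves a nonzero chance of picking another, and that chance grows with temperature. Once a wrong token is emitted, every later token is conditioned on it.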
RAG grounding
Provide authoritative context in the prompt. The model can still ignore it, but well-structured context with explicit instructions dramatically reduces fabrication.
Constrained decoding
Force the output to follow a grammar or schema. If decoding is constrained to a JSON schema, the model can't invent fields, although the values it puts in those fields can still be wrong.
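At its core, constrained decoding filters the model's candidate tokens down to those the grammar allows before one is chosen. The toy sketch below does this for a single step with invented scores. Real implementations apply the same idea to the full logit vector at every decoding step, and `constrained_pick` is a hypothetical helper, not a library function.

```python
def constrained_pick(scores: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the grammar allows at this step."""
    candidates = {tok: s for tok, s in scores.items() if tok in allowed}
    if not candidates:
        raise ValueError("the grammar allows no token the model can produce")
    return max(candidates, key=candidates.get)


# Invented scores for the next JSON key; the model prefers a field the schema lacks.
scores = {'"nickname"': 2.1, '"name"': 1.7, '"age"': 0.4}
schema_keys = {'"name"', '"age"'}  # hypothetical schema: only these keys are valid

unconstrained = max(scores, key=scores.get)
constrained = constrained_pick(scores, schema_keys)
print("unconstrained:", unconstrained)
print("constrained:  ", constrained)
```

The constraint guarantees structure only: the chosen key is valid, but nothing here checks that the value generated for it is true.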
Self-verification
Ask the model to check its own answer against provided facts or to cite specific sources. Not foolproof, but catches obvious fabrications.
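A cheap programmatic complement, assuming the model has been asked to quote the passages that support its answer, is to check that each quote actually appears in the provided context. The helper name and sample data below are hypothetical. Verbatim matching catches invented quotes but not accurate quotes that fail to support the claim.

```python
def unsupported_quotes(context: str, quotes: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in the context.

    Comparison ignores case and collapses runs of whitespace.
    """
    normalized = " ".join(context.lower().split())
    return [q for q in quotes if " ".join(q.lower().split()) not in normalized]


context = "The refund window is 30 days from the date of delivery."
quotes = [
    "refund window is 30 days",               # present in the context
    "refunds are processed within 24 hours",  # fabricated
]
flagged = unsupported_quotes(context, quotes)
print(flagged)
```

Any flagged quote is a signal to regenerate the answer or route it to review.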
Human-in-the-loop
For high-stakes applications (medical, legal, financial), always have a human verify before acting on LLM output. No technique eliminates hallucination entirely.
Why can't hallucination be completely eliminated in current LLM architectures?