Advanced techniques · AI Fundamentals

RAG: Retrieval-Augmented Generation

Learn

Retrieval-Augmented Generation (RAG) solves two core LLM limitations: stale knowledge (the model only knows its training data) and hallucination on niche topics. Instead of relying solely on the model's parameters, RAG retrieves relevant documents from an external knowledge base and injects them into the prompt before generation.

The pipeline has three stages: (1) Indexing — chunk your documents, embed each chunk into vectors, store in a vector database. (2) Retrieval — embed the user's query, find the k nearest chunks by cosine similarity. (3) Generation — prepend the retrieved chunks to the prompt with instructions like "Answer based only on the provided context."

RAG Pipeline: Documents → Chunk → Embed → Vector DB ↓ User Query → Embed ──→ Similarity Search → Top-k chunks ↓ LLM ← "Context: [chunks]\n\nQuestion: [query]\n\nAnswer:"

When RAG shines

Proprietary documents, frequently-updated knowledge bases, compliance where every claim must cite a source.

When RAG struggles

Poor chunking splits concepts mid-thought. Irrelevant retrieval pollutes the context. Multi-hop reasoning across chunks exceeds the retriever's capability.

RAG vs. fine-tuning

RAG is cheaper, faster to update, and more transparent (sources are visible). Fine-tuning bakes knowledge into weights — better for style, tone, and implicit patterns.

RAG doesn't eliminate hallucination — it reduces it by grounding the model in provided facts. But if retrieval returns irrelevant chunks, or the model ignores the context, hallucination still happens.

Practice

Put in order

First step → Last step

1Embed chunks into vector representations
2Embed the user query
3Store embeddings in a vector database
4Retrieve top-k most similar chunks
5Chunk documents into segments
6Prepends chunks to prompt and generate answer

Loading…

Recall0/1

Recall

What is the primary advantage of RAG over relying solely on the model's parametric knowledge?

Fine-tuning vs. prompting

Learn

Fine-tuning continues training a pretrained model on a smaller, domain-specific dataset — adjusting the model's weights to internalize new patterns, styles, or knowledge. Unlike prompting (which only influences a single inference), fine-tuning permanently changes the model's behavior.

Fine-tuning is the right call when: you have hundreds to thousands of high-quality examples, the output format is highly structured and must be consistent, or the task requires implicit pattern recognition that would take too many tokens to describe in a prompt. It's the wrong call when: you're working with rapidly changing data (use RAG), you have fewer than ~50 examples (use few-shot prompting), or the base model already handles the task well.

Prompt engineering

Cost to start: Zero (just write text)
Update speed: Instant (change the prompt)
Data needed: None to a few examples
Consistency: Variable — prompt-sensitive
Knowledge injection: Limited to context window

Fine-tuning

Cost to start: Compute + curated dataset
Update speed: Hours to days (retrain)
Data needed: 50–10,000+ examples
Consistency: High — weights encode the pattern
Knowledge injection: Baked into parameters

The most cost-effective approach is often hybrid: use RAG for factual grounding, a well-crafted system prompt + few-shot examples for instruction, and fine-tune only when the model consistently fails the same way despite your best prompts.

Practice

Flashcards0/5 reviewed

Try to recall the answer before flipping.

Loading…

Recall0/1

Recall

What is the defining trade-off between fine-tuning and RAG for knowledge-intensive tasks?

Hallucination & grounding

Learn

Hallucination is when an LLM generates text that is factually incorrect, nonsensical, or unsupported by the provided context — but sounds plausible. It's not a bug; it's a direct consequence of how LLMs work: they predict the most probable token, not the most factually correct one.

Hallucination has several root causes: (1) the model's training data was incomplete or contradictory for this topic, (2) the model generalizes incorrectly from superficially similar patterns, (3) the sampling process picks an unlikely token that drifts the generation, or (4) the prompt asks for information the model couldn't possibly know (like future events or private data).

RAG grounding

Provide authoritative context in the prompt. The model can still ignore it, but well-structured context with explicit instructions dramatically reduces fabrication.

Constrained decoding

Force the output to follow a grammar or schema — if the model can only produce valid JSON, it can't invent fields.

Self-verification

Ask the model to check its own answer against provided facts or to cite specific sources. Not foolproof, but catches obvious fabrications.

Human-in-the-loop

For high-stakes applications (medical, legal, financial), always have a human verify before acting on LLM output. No technique eliminates hallucination entirely.

Hallucination cannot be fully eliminated with current architectures — it's inherent to autoregressive generation. Every mitigation reduces the rate, but zero-hallucination guarantees require external verification systems.

Practice

Match the pairs0/4

Tap a left item, then its match on the right.

Loading…

Recall0/1

Recall

Why can't hallucination be completely eliminated in current LLM architectures?