Retrieval-Augmented Generation (RAG)

Intermediate

RAG lets a model answer questions about your data (docs, a knowledge base, a codebase) that it was never trained on. The idea is simple: retrieve the relevant pieces, augment the prompt with them, then generate an answer grounded in those pieces.

The loop

Index your data: split into chunks, embed them, store in a vector (and/or keyword) index.
Retrieve the top chunks most relevant to the question.
Augment: put those chunks in the prompt with an instruction like "Answer only from the context below; if it's not there, say so."
Generate: produce the answer from the augmented prompt, and ideally cite which chunk each claim came from.
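
The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, the "index" is a plain list, the sample chunks are made up, and the final generation call is left out because it depends on whichever model client you use. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    # Assumption: a word-count vector stands in for a real embedding model.
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[int, str, Counter]]:
    # Step 1: embed each chunk and store it with an id for later citation.
    return [(i, chunk, embed(chunk)) for i, chunk in enumerate(chunks)]


def retrieve(index, question: str, k: int = 2) -> list[tuple[int, str]]:
    # Step 2: rank chunks by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(i, text) for i, text, _ in ranked[:k]]


def build_prompt(question: str, hits: list[tuple[int, str]]) -> str:
    # Step 3: put the chunks in the prompt with a grounding instruction.
    context = "\n".join(f"[{i}] {text}" for i, text in hits)
    return (
        "Answer only from the context below; if it's not there, say so. "
        "Cite the chunk id in brackets after each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up sample data for illustration only.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Support is available by email on weekdays.",
    "Shipping to Europe takes five to seven business days.",
]
index = build_index(chunks)
question = "How long do refunds take?"
hits = retrieve(index, question, k=2)
prompt = build_prompt(question, hits)
# Step 4 (generate) would send `prompt` to your model client of choice.
```

Labelling each chunk with an id in the prompt is what makes citations possible: the model can refer to `[0]` and you can map that back to the source chunk.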

Why RAG instead of fine-tuning?

RAG keeps knowledge fresh (update the data, not the model), provides citations, and is far cheaper than retraining. For most "answer about my documents" needs, it's the right first tool — see Fine-tuning vs Prompting vs RAG.

The failure modes (where RAG quality dies)

Bad retrieval = bad answer. If the right chunk isn't retrieved, the model can't use it. Most "RAG is wrong" problems are retrieval problems.
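Chunking too coarse or too fine: either one wrecks relevance. Chunks that are too large bury the relevant passage in unrelated text; chunks that are too small lose the context needed to interpret them. See Embeddings.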
No grounding instruction — the model blends retrieved facts with its own guesses. Tell it to answer only from context and to admit gaps.
Stuffing too much — irrelevant chunks dilute the signal and cost tokens. Retrieve few, high-quality chunks.
No citations — you can't verify, so you can't trust.
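
Chunk size is one of the easier failure modes to experiment with. The sketch below splits text into fixed-size word windows with some overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The function name, the word-based splitting, and the default sizes are illustrative assumptions; real pipelines often split on tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny synthetic input so the windows are easy to inspect.
text = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(text, size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Trying a couple of `size` and `overlap` settings against the same set of test questions is usually a quicker way to find a workable value than reasoning about it in the abstract.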

:::tip Evaluate retrieval separately
Measure "did we retrieve the right chunk?" apart from "did the model answer well?" It localizes the problem fast. See Evals.
:::
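
One way to measure retrieval on its own is to label, for a handful of questions, which chunk should come back, then score the ranked results without calling the model at all. The sketch below computes hit rate at k and mean reciprocal rank. The questions, chunk ids, and rankings are invented for illustration, and it assumes exactly one correct chunk per question.

```python
def hit_rate_at_k(results: dict[str, list[int]], gold: dict[str, int], k: int) -> float:
    # Fraction of questions whose correct chunk appears in the top k results.
    if not results:
        return 0.0
    hits = sum(1 for q, ranked in results.items() if gold[q] in ranked[:k])
    return hits / len(results)


def mean_reciprocal_rank(results: dict[str, list[int]], gold: dict[str, int]) -> float:
    # Average of 1/rank of the correct chunk (0 when it was not retrieved).
    if not results:
        return 0.0
    total = 0.0
    for q, ranked in results.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(results)


# Hypothetical retriever output: question -> ranked chunk ids.
results = {
    "refund window?": [0, 2, 1],
    "support hours?": [2, 1, 0],
    "shipping time?": [0, 1, 2],
}
# Hypothetical labels: question -> id of the chunk that contains the answer.
gold = {"refund window?": 0, "support hours?": 1, "shipping time?": 2}

print("hit@1:", hit_rate_at_k(results, gold, k=1))
print("hit@2:", hit_rate_at_k(results, gold, k=2))
print("MRR:  ", mean_reciprocal_rank(results, gold))
```

If these numbers are low, the fix belongs in chunking or retrieval; if they are high and answers are still wrong, look at the prompt and the generation step.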
