
Retrieval-Augmented Generation (RAG)

Intermediate

RAG makes a model answer questions about your data — docs, a knowledge base, a codebase — that it was never trained on. The idea is simple: retrieve the relevant pieces, augment the prompt with them, then generate an answer grounded in those pieces.

The loop

  1. Index your data: split into chunks, embed them, store in a vector (and/or keyword) index.
  2. Retrieve the top chunks most relevant to the question.
  3. Augment: put those chunks in the prompt with an instruction like "Answer only from the context below; if it's not there, say so."
  4. Generate — and ideally cite which chunk each claim came from.
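The first three steps of the loop can be sketched end to end in a few lines. This is a minimal illustration, not a production recipe: `embed` is a toy bag-of-words stand-in for a real embedding model, the "index" is a plain in-memory list rather than a vector store, and the function names and sample chunks are invented for the example. The generate step is left out because it depends on whichever model API you use.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[int, str, Counter]]:
    """Step 1: embed every chunk and keep (id, text, vector) together."""
    return [(i, chunk, embed(chunk)) for i, chunk in enumerate(chunks)]


def retrieve(index, question: str, k: int = 2) -> list[tuple[int, str]]:
    """Step 2: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(i, text) for i, text, _ in ranked[:k]]


def build_prompt(question: str, hits: list[tuple[int, str]]) -> str:
    """Step 3: put the retrieved chunks in the prompt with a grounding instruction."""
    context = "\n".join(f"[{i}] {text}" for i, text in hits)
    return (
        "Answer only from the context below; if it's not there, say so. "
        "Cite the chunk number for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up sample data, purely for illustration.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes 5 to 7 business days.",
]
question = "How long does shipping to Canada take?"

index = build_index(chunks)
hits = retrieve(index, question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Numbering each chunk in the context (`[2] ...`) is what makes step 4's citations possible: the model has an identifier to point at, and you can check it against what was actually retrieved.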

Why RAG instead of fine-tuning?

RAG keeps knowledge fresh (update the data, not the model), provides citations, and is far cheaper than retraining. For most "answer about my documents" needs, it's the right first tool — see Fine-tuning vs Prompting vs RAG.

The failure modes (where RAG quality dies)

  • Bad retrieval = bad answer. If the right chunk isn't retrieved, the model can't use it. Most "RAG is wrong" problems are retrieval problems.
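  • Chunking too coarse or too fine wrecks relevance. Oversized chunks bury the relevant passage among unrelated text; undersized chunks lose the surrounding context that gives the passage its meaning (see Embeddings).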
  • No grounding instruction — the model blends retrieved facts with its own guesses. Tell it to answer only from context and to admit gaps.
  • Stuffing too much — irrelevant chunks dilute the signal and cost tokens. Retrieve few, high-quality chunks.
  • No citations — you can't verify, so you can't trust.
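Chunk size is the most direct knob among the failure modes above, so it helps to make it explicit. The sketch below is one simple way to do it, assuming word-based splitting with a fixed overlap; `chunk_words` and its defaults are illustrative, and real pipelines often split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so a passage cut at a boundary still appears whole
    in at least one chunk (as long as it is shorter than the overlap)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten placeholder words, chunked four at a time with one word of overlap.
text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(text, size=4, overlap=1):
    print(chunk)
```

Treat `size` and `overlap` as parameters to tune against a retrieval metric on your own questions rather than as values to copy.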

:::tip Evaluate retrieval separately
Measure "did we retrieve the right chunk?" apart from "did the model answer well?" It localizes the problem fast. See Evals.
:::
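One way to measure "did we retrieve the right chunk?" on its own is a hit rate at k: for a small set of questions where you know which chunk holds the answer, count how often that chunk shows up in the top k results. The sketch below assumes one known-correct chunk per question; the function name, the question IDs, and the chunk IDs are all hypothetical.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, str],
    k: int,
) -> float:
    """Fraction of questions whose known-relevant chunk ID appears
    among the first k retrieved chunk IDs."""
    if not relevant:
        return 0.0
    hits = sum(
        1 for question, gold in relevant.items()
        if gold in retrieved.get(question, [])[:k]
    )
    return hits / len(relevant)


# Hypothetical labels: the chunk that actually answers each question.
relevant = {"q1": "c7", "q2": "c2", "q3": "c9"}

# Hypothetical retriever output: ranked chunk IDs per question.
retrieved = {
    "q1": ["c7", "c1", "c4"],
    "q2": ["c5", "c2", "c8"],
    "q3": ["c3", "c6", "c1"],
}

for k in (1, 2, 3):
    print(k, round(hit_rate_at_k(retrieved, relevant, k), 2))
```

No model call is involved, so this number moves only when retrieval changes. If it is low, fix chunking or the retriever before touching the prompt.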
