AI & GenAI

Building Production RAG Systems

May 20, 2025 · 4 min read · Pikessoft Engineering

RAG · LLM · Vector Database · Knowledge Base

The Promise and Reality of RAG

Retrieval-Augmented Generation (RAG) is the most practical way to give AI systems access to your proprietary data. Instead of fine-tuning a model on your documents (expensive, slow, and hard to update), RAG retrieves relevant chunks at query time and feeds them to the LLM as context.

The promise: AI that speaks your truth, grounded in your actual data, with far fewer hallucinations.

The reality: Getting RAG right in production is harder than the tutorials suggest. Every decision — from how you chunk your documents to how you rank retrieval results — has a measurable impact on answer quality.

The RAG Pipeline

A production RAG system has five stages:

1. Document Ingestion

Your source documents — PDFs, web pages, internal wikis, database records — need to be parsed into clean text. This sounds trivial but it's where most quality issues originate. Poor PDF extraction, lost table formatting, and missing metadata will degrade everything downstream.

What works: Apache Tika for diverse formats, dedicated parsers for PDFs (PyMuPDF handles tables well), and always preserving document structure and metadata.
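Whatever parser you use, the goal of ingestion is a normalized record: clean text plus the metadata (source, title, section) that retrieval and citation depend on later. A minimal sketch of that target shape — the field names here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ParsedDocument:
    """Normalized output of the ingestion stage: clean text plus metadata."""
    text: str
    source: str            # original file path or URL
    title: str = ""
    section: str = ""      # heading the text appeared under
    extra: dict = field(default_factory=dict)

def normalize_whitespace(raw: str) -> str:
    """Collapse the runs of spaces and stray newlines PDF extraction produces."""
    return " ".join(raw.split())

doc = ParsedDocument(
    text=normalize_whitespace("Quarterly  revenue \n grew   12%."),
    source="reports/q1.pdf",
    title="Q1 Report",
    section="Financials",
)
```

Keeping `source` and `section` on every record is what makes citation possible in stage 5.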

2. Chunking Strategy

How you split documents into retrievable pieces is the single most impactful decision. Too large and you dilute the signal. Too small and you lose context.

What works: Semantic chunking that respects document structure — split on headers, paragraphs, and logical boundaries rather than fixed character counts. Overlap between chunks (10-15%) helps with edge cases. For code and technical docs, syntax-aware chunking is essential.
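A rough sketch of that idea — split on paragraph boundaries, pack paragraphs up to a size target, and carry a tail of the previous chunk forward as overlap (the sizes here are arbitrary defaults, not tuned values):

```python
def chunk_by_paragraphs(text: str, max_chars: int = 500, overlap_chars: int = 60) -> list[str]:
    """Greedy semantic chunking: split on blank lines, pack paragraphs into
    chunks of up to max_chars, and prepend an overlap tail from the previous
    chunk so boundary-straddling facts survive retrieval."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap_chars:] + "\n\n" + para  # overlap tail
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Production chunkers also respect headers and sentence boundaries; this shows only the packing-with-overlap mechanic.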

3. Embedding & Indexing

Convert chunks into vector embeddings and store them in a vector database. The choice of embedding model matters more than most teams realize.

What works: OpenAI's text-embedding-3-large for general use, Cohere's embed-v3 for multilingual, or open-source alternatives like BGE and E5 for on-premise deployments. For the vector store, Pinecone for managed simplicity, Weaviate for hybrid search, or pgvector if you want to stay in PostgreSQL.
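Whichever store you choose, the core operation is the same: upsert (id, vector) pairs and query by cosine similarity. A toy in-memory stand-in (brute-force search; the vectors below are hand-made stand-ins for real embeddings) makes the contract concrete:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryIndex:
    """Toy stand-in for a vector store: linear scan, cosine scoring."""
    def __init__(self):
        self._items = []  # (chunk_id, vector)

    def upsert(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def query(self, vector: list[float], top_k: int = 3) -> list[str]:
        scored = [(cosine(vector, v), cid) for cid, v in self._items]
        scored.sort(reverse=True)
        return [cid for _, cid in scored[:top_k]]
```

Real vector databases replace the linear scan with approximate nearest-neighbor indexes (HNSW, IVF), but expose essentially this interface.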

4. Retrieval & Re-Ranking

When a query comes in, you retrieve the top-K most relevant chunks. But similarity search alone isn't enough — you need re-ranking to push the truly relevant results to the top.

What works: Hybrid search (combining vector similarity with BM25 keyword search) consistently outperforms either approach alone. A cross-encoder re-ranker on the top 20-50 results dramatically improves precision. Cohere Rerank and ColBERT are strong options.
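One common way to combine the two ranked lists before re-ranking is reciprocal rank fusion (RRF) — a sketch, with the conventional k=60 smoothing constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. vector search and BM25) by summing
    1 / (k + rank) per list; documents ranked well in either list rise."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]   # ranked by embedding similarity
bm25_hits = ["b", "c", "d"]     # ranked by keyword match
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

The fused list then goes to the cross-encoder re-ranker, which scores each (query, chunk) pair directly.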

5. Generation & Citation

Feed the retrieved chunks to the LLM with a carefully crafted prompt. The model generates an answer grounded in the provided context, ideally with citations pointing back to source documents.

What works: Structured prompts that explicitly instruct the model to cite sources, refuse to answer when context is insufficient, and distinguish between what the data says vs. general knowledge. Include confidence indicators for the end user.
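A minimal sketch of such a prompt builder — chunk ids double as citation markers, and the instructions shown are illustrative, not a tuned production prompt:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a citation-oriented prompt from retrieved chunks.
    Each chunk is a dict with 'id' (citation marker) and 'text'."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using ONLY the context below. Cite sources as [id]. "
        "If the context is insufficient, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What did revenue do?",
    [{"id": "q1-p3", "text": "Revenue grew 12%."}],
)
```

Because each chunk carries the metadata preserved at ingestion, the `[id]` markers in the answer can be resolved back to source documents for the user.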

Common Failure Modes

The "close but wrong" problem

The retriever returns chunks that are topically related but don't actually answer the question. Solution: fine-tune your embedding model on domain-specific query-document pairs, or use a re-ranker trained on your data.

Stale data

Your knowledge base is only as current as your last ingestion run. Solution: real-time or near-real-time ingestion pipelines with webhook triggers when source documents change.

Context window overload

Stuffing too many chunks into the prompt degrades answer quality — the model loses focus. Solution: aggressive re-ranking to select only the 3-5 most relevant chunks, and smart context compression techniques.
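A simple guard is to walk the re-ranked list and stop at a token budget. The sketch below approximates token counts as characters divided by four — a rough heuristic; production systems should use the model's actual tokenizer:

```python
def select_within_budget(ranked_chunks: list[str],
                         max_tokens: int = 1500,
                         max_chunks: int = 5) -> list[str]:
    """Keep only the top re-ranked chunks that fit both a token budget
    and a hard cap on chunk count."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk) // 4  # rough chars-to-tokens heuristic
        if len(selected) == max_chunks or used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected
```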

Evaluation: The Missing Piece

Most RAG tutorials skip evaluation entirely. In production, you need:

  • Retrieval metrics: Are the right chunks being retrieved? (Recall@K, MRR)
  • Answer quality: Is the generated answer correct, complete, and well-cited? (Human evaluation + LLM-as-judge)
  • Faithfulness: Does the answer stay grounded in the context, or does the model hallucinate? (Automated faithfulness scoring)
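The two retrieval metrics above take only a few lines to compute — a sketch, where `retrieved` is the ranked list of chunk ids your system returned and `relevant` is the ground-truth set:

```python
def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved."""
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant) if relevant else 0.0

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank over (retrieved_list, relevant_set) pairs:
    averages 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Answer quality and faithfulness need labeled examples or an LLM judge, but retrieval metrics like these can run on every commit.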

We build evaluation pipelines that run automatically on every change to the RAG system — new embedding models, different chunking strategies, prompt modifications — so you always know whether a change helped or hurt.

Our Stack for Production RAG

After building RAG systems across healthcare, legal, and enterprise SaaS, here's the stack we've converged on:

  • Ingestion: Custom parsers + Apache Tika + metadata extraction
  • Chunking: Semantic splitting with configurable overlap
  • Embeddings: OpenAI or domain-fine-tuned models
  • Vector Store: Pinecone (managed) or Weaviate (self-hosted)
  • Retrieval: Hybrid search + cross-encoder re-ranking
  • Generation: Claude or GPT-4 with structured citation prompts
  • Evaluation: Automated pipeline with RAGAS metrics

The specifics vary by use case, but this architecture has proven reliable at scale.


Building a RAG system? We can help you get it right the first time. Schedule a consultation.
