Retrieval-Augmented Generation has become the go-to architecture for building AI applications that need to work with your organisation's proprietary data. But the gap between a RAG demo and a production system is vast. This guide covers what it actually takes to build RAG systems that work reliably at scale.
Why RAG Matters
Large Language Models are trained on public data with a knowledge cutoff date. They don't know about your internal documentation, your specific products, or yesterday's meeting notes. RAG bridges this gap by retrieving relevant context from your data and providing it to the LLM alongside the user's question.
The basic flow looks simple:
- User asks a question
- System retrieves relevant documents from a vector database
- Retrieved context + question goes to the LLM
- LLM generates an answer grounded in your data
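In code, the whole loop fits in a few lines. The sketch below is purely illustrative: vector_db and llm are placeholder clients, and the document objects are assumed to be LangChain-style with a page_content attribute.
# Sketch: the basic RAG flow, end to end (placeholder clients)
def answer_question(question, vector_db, llm, k=5):
    # 1. Retrieve the most relevant chunks for the question
    chunks = vector_db.similarity_search(question, k=k)
    # 2. Assemble retrieved context and the question into one prompt
    context = "\n\n".join(chunk.page_content for chunk in chunks)
    prompt = f"CONTEXT:\n{context}\n\nQuestion: {question}"
    # 3. Generate an answer grounded in the retrieved context
    return llm.invoke(prompt)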
Simple in concept, complex in execution. Let's examine each component.
Document Processing: The Foundation
Your RAG system is only as good as the data you put into it. This phase is often underestimated and is where most production issues originate.
Chunking Strategy
How you split documents into chunks dramatically affects retrieval quality. The naive approach—splitting by character count—rarely works well.
Chunking Best Practices
- Semantic chunking: Split at natural boundaries (paragraphs, sections, headers)
- Overlap: Include 10-20% overlap between chunks to preserve context
- Size matters: 256-512 tokens works well for most use cases
- Preserve metadata: Keep document titles, section headers, dates
# Example: Semantic chunking with LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default (len), not tokens
    chunk_overlap=50,    # ~10% overlap to preserve context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # try paragraphs, lines, then sentences
)
chunks = splitter.split_documents(documents)
Document Types
Real enterprise data comes in many formats. Each requires specific handling:
- PDFs: Use extraction libraries that preserve structure (tables, headers). Consider OCR for scanned documents.
- HTML/Web: Strip navigation, footers, and boilerplate. Preserve semantic structure.
- Code: Chunk by functions/classes, not arbitrary line counts. Include docstrings and comments. (See the sketch after this list.)
- Spreadsheets: Convert to text with row/column context preserved.
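To make the code-chunking point concrete, here is a hypothetical sketch that splits Python source at top-level functions and classes using the standard library ast module; the same idea applies to other languages with a suitable parser.
# Hypothetical example: chunk Python source by top-level functions/classes
import ast

def chunk_python_source(source: str):
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment keeps the node's full source text,
            # including its docstring and any comments inside its span
            chunks.append(ast.get_source_segment(source, node))
    return [c for c in chunks if c]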
Embedding Models: Choose Wisely
The embedding model converts your text chunks into vectors. This choice is effectively permanent: switching models means re-embedding and re-indexing your entire corpus, because vectors produced by different models are not comparable.
| Model | Dimensions | Context | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | General purpose, high quality |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, compression |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open source, self-hosted |
| E5-mistral-7b-instruct | 4096 | 32k tokens | Long documents, instructions |
For privacy-sensitive deployments, BGE or E5 models can run entirely on your infrastructure.
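As a rough illustration, a self-hosted embedding pass with the sentence-transformers library might look like the following; the model name and normalization choice are assumptions, and chunks refers to the output of the splitting step above.
# Sketch: self-hosted embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Normalized vectors make cosine similarity equivalent to a dot product
embeddings = model.encode(
    [chunk.page_content for chunk in chunks],
    normalize_embeddings=True,
    batch_size=32,
)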
Vector Database Selection
Your vector database stores embeddings and handles similarity search. The choice depends on scale and operational requirements.
For Getting Started
- Chroma: Simple, embedded, great for prototypes
- Qdrant: Docker-friendly, good performance, open source
- Weaviate: Feature-rich, hybrid search built-in
For Production Scale
- Pinecone: Managed, scales effortlessly, but vendor lock-in
- Milvus: Open source, handles billions of vectors
- pgvector: If you're already on PostgreSQL, add vector search
# Example: Qdrant with local deployment
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)  # size must match your embedding model's dimensions
)
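Once the collection exists, chunks are indexed by upserting id/vector/payload triples. A hedged sketch, assuming the chunks and embeddings variables from the earlier steps:
from qdrant_client.models import PointStruct

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=embeddings[i].tolist(),
            # Payload carries the text and metadata needed for citations later
            payload={"text": chunks[i].page_content, "source": chunks[i].metadata.get("source")},
        )
        for i in range(len(chunks))
    ],
)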
Retrieval: Beyond Basic Similarity
Naive vector similarity search often returns irrelevant results. Production systems need smarter retrieval strategies.
Hybrid Search
Combine dense vectors (semantic similarity) with sparse vectors (keyword matching). This catches both conceptual matches and exact term matches.
# Hybrid search with reciprocal rank fusion
def hybrid_search(query, k=10):
    # Dense retrieval: semantic similarity over embeddings
    dense_results = vector_db.similarity_search(query, k=k * 2)
    # Sparse retrieval: exact keyword matching via BM25
    sparse_results = bm25_index.search(query, k=k * 2)
    # Merge the two ranked lists with reciprocal rank fusion
    combined = reciprocal_rank_fusion([dense_results, sparse_results], k=k)
    return combined
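The reciprocal_rank_fusion helper above is not a library call. A minimal sketch of it, assuming each result object exposes a stable id attribute to deduplicate on:
def reciprocal_rank_fusion(result_lists, k=10, c=60):
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            # Each list contributes 1 / (c + rank); c=60 is the constant from
            # the original RRF paper and damps the influence of any single list
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (c + rank + 1)
            docs[doc.id] = doc
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked[:k]]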
Query Transformation
User queries are often ambiguous or poorly phrased. Transform them before retrieval:
- Query expansion: Add synonyms and related terms
- HyDE: Have the LLM generate a hypothetical answer, then search for documents similar to it (sketched after this list)
- Multi-query: Generate multiple query variations and merge results
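A minimal HyDE sketch, assuming a generic llm client and the vector_db used earlier; the prompt wording is illustrative only.
def hyde_search(question, llm, vector_db, k=10):
    # Ask the model for a plausible (possibly imperfect) answer to the question
    hypothetical = llm.invoke(f"Write a short passage that answers: {question}")
    # Search with the hypothetical passage, which often lands closer to the
    # relevant documents in embedding space than the raw question does
    return vector_db.similarity_search(hypothetical, k=k)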
Re-ranking
After initial retrieval, use a more expensive model to re-rank results. Cross-encoder models like Cohere Rerank or BGE-reranker score the query and each candidate passage together rather than comparing pre-computed vectors, which is slower per document but significantly improves precision.
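A minimal re-ranking sketch with a cross-encoder loaded through sentence-transformers; the model choice and the page_content attribute on candidates are assumptions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query, candidates, top_k=5):
    # Score each (query, passage) pair jointly; slower than vector comparison
    # but considerably more precise
    scores = reranker.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]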
Context Assembly
You've retrieved relevant chunks. Now you need to assemble them into an effective prompt.
Context Assembly Tips
- Order chunks by relevance (most relevant first)
- Include source metadata for citations
- Limit total context to avoid diluting signal with noise
- Use clear delimiters between different sources
# Example prompt structure
CONTEXT:
[Source: Q4 Financial Report, Page 12]
Revenue increased 23% year-over-year to $4.2B...
[Source: CEO Letter to Shareholders]
Our strategic investments in AI infrastructure...
---
Based on the above context, answer the user's question.
If the answer cannot be found in the context, say so.
Question: {user_question}
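A sketch of assembling that prompt programmatically from retrieved chunks, assuming LangChain-style objects with page_content and a metadata dict, and a simple character budget standing in for a proper token count.
def build_prompt(question, chunks, max_chars=8000):
    sections, used = [], 0
    for chunk in chunks:  # assumed to arrive sorted by relevance, best first
        block = f"[Source: {chunk.metadata.get('source', 'unknown')}]\n{chunk.page_content}"
        if used + len(block) > max_chars:
            break  # stop before marginal chunks dilute the signal
        sections.append(block)
        used += len(block)
    context = "\n\n".join(sections)
    return (
        f"CONTEXT:\n{context}\n---\n"
        "Based on the above context, answer the user's question.\n"
        "If the answer cannot be found in the context, say so.\n"
        f"Question: {question}"
    )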
Evaluation: Measuring What Matters
You can't improve what you don't measure. Set up proper evaluation before deploying to production.
Key Metrics
- Retrieval Recall@K: What percentage of relevant documents appear in the top K results? (See the sketch after this list.)
- Answer Correctness: Does the generated answer match ground truth?
- Faithfulness: Is the answer grounded in retrieved context (no hallucinations)?
- Latency: End-to-end response time
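For example, Recall@K can be computed over a labeled evaluation set. The eval_set format and the id attribute on retrieved documents are assumptions, not a fixed schema.
def recall_at_k(eval_set, retrieve, k=10):
    recalls = []
    for example in eval_set:
        relevant = set(example["relevant_ids"])
        retrieved = {doc.id for doc in retrieve(example["question"], k=k)}
        # Fraction of this question's relevant documents found in the top K
        recalls.append(len(retrieved & relevant) / len(relevant))
    return sum(recalls) / len(recalls)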
Building an Evaluation Set
Create a set of 50-100 question-answer pairs with known correct answers. Include edge cases:
- Questions with no answer in corpus
- Questions requiring multi-document synthesis
- Questions with ambiguous wording
- Questions about recent data updates
Production Hardening
Caching
Cache at multiple levels:
- Embedding cache for repeated queries (sketched after this list)
- Retrieval results cache for common questions
- LLM response cache for identical inputs
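The first level can be as simple as a hash-keyed lookup in front of the embedding call. A sketch; a production system would back this with Redis or similar rather than an in-process dict.
import hashlib

_embedding_cache = {}

def embed_query_cached(query, embed_fn):
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(query)  # only pay for unseen queries
    return _embedding_cache[key]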
Fallback Strategies
What happens when retrieval returns nothing relevant?
- Expand search to broader scope
- Fall back to LLM general knowledge with disclaimer
- Route to human support
Observability
Log everything you'll need for debugging:
- Original query and any transformations
- Retrieved documents and scores
- Final prompt sent to LLM
- Generated response
- Latency breakdown by component
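One way to capture all of this is a single structured log entry per query; the field names below are illustrative, and the id/score attributes on retrieved documents are assumptions.
import json
import logging
import time

logger = logging.getLogger("rag")

def log_query_trace(query, transformed_query, docs, prompt, response, timings):
    # Emit one JSON record per query so traces are easy to search and aggregate
    logger.info(json.dumps({
        "timestamp": time.time(),
        "query": query,
        "transformed_query": transformed_query,
        "retrieved": [{"id": d.id, "score": d.score} for d in docs],
        "prompt": prompt,
        "response": response,
        "latency_ms": timings,  # e.g. {"retrieval": 42, "rerank": 18, "llm": 950}
    }))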
Common Pitfalls
Mistakes We See Repeatedly
- Ignoring chunk boundaries: Key information split across chunks becomes unretrievable
- Too much context: Stuffing the prompt with marginally relevant data confuses the LLM
- Stale indexes: Forgetting to update embeddings when source documents change
- Missing metadata: Without source info, users can't verify answers
- Evaluation afterthought: Building without measuring leads to unknown quality
Architecture for Scale
A production RAG system isn't a single script—it's a set of services:
- Ingestion Pipeline: Watches for new/updated documents, chunks, embeds, indexes
- Query Service: Handles user queries, retrieval, context assembly
- LLM Gateway: Routes to appropriate models, handles rate limiting, fallbacks
- Feedback Loop: Captures user ratings, enables continuous improvement
Getting Started
If you're building your first production RAG system:
- Start with a focused corpus (one document type, one use case)
- Build evaluation set before building the system
- Use managed services initially to reduce operational burden
- Invest in observability from day one
- Plan for iteration—your first version won't be your last
Need Help Building RAG Systems?
Acumen Labs has implemented RAG architectures across industries—from legal document search to technical support automation. We can help you navigate the complexities and avoid common pitfalls.
Schedule a Consultation