Retrieval-Augmented Generation has become the go-to architecture for building AI applications that need to work with your organisation's proprietary data. But the gap between a RAG demo and a production system is vast. This guide covers what it actually takes to build RAG systems that work reliably at scale.
Why RAG Matters
Large Language Models are trained on public data with a knowledge cutoff date. They don't know about your internal documentation, your specific products, or yesterday's meeting notes. RAG bridges this gap by retrieving relevant context from your data and providing it to the LLM alongside the user's question.
The basic flow looks simple:
- User asks a question
- System retrieves relevant documents from a vector database
- Retrieved context + question goes to the LLM
- LLM generates an answer grounded in your data
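In code, the whole loop fits in a few lines. The sketch below is purely illustrative: vector_db and llm are placeholder clients, and the document objects are assumed to be LangChain-style with a page_content attribute.
# Sketch: the basic RAG flow, end to end (placeholder clients)
def answer_question(question, vector_db, llm, k=5):
    # 1. Retrieve the most relevant chunks for the question
    chunks = vector_db.similarity_search(question, k=k)
    # 2. Assemble retrieved context and the question into one prompt
    context = "\n\n".join(chunk.page_content for chunk in chunks)
    prompt = f"CONTEXT:\n{context}\n\nQuestion: {question}"
    # 3. Generate an answer grounded in the retrieved context
    return llm.invoke(prompt)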
Simple in concept, complex in execution. Let's examine each component.
Document Processing: The Foundation
Your RAG system is only as good as the data you put into it. This phase is often underestimated and is where most production issues originate.
Chunking Strategy
How you split documents into chunks dramatically affects retrieval quality. The naive approach—splitting by character count—rarely works well.
Chunking Best Practices
- Semantic chunking: Split at natural boundaries (paragraphs, sections, headers)
- Overlap: Include 10-20% overlap between chunks to preserve context
- Size matters: 256-512 tokens works well for most use cases
- Preserve metadata: Keep document titles, section headers, dates
# Example: Semantic chunking with LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default (len), not tokens
    chunk_overlap=50,    # ~10% overlap to preserve context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # try paragraphs, lines, then sentences
)
chunks = splitter.split_documents(documents)
Document Types
Real enterprise data comes in many formats. Each requires specific handling:
- PDFs: Use extraction libraries that preserve structure (tables, headers). Consider OCR for scanned documents.
- HTML/Web: Strip navigation, footers, and boilerplate. Preserve semantic structure.
- Code: Chunk by functions/classes, not arbitrary line counts. Include docstrings and comments. (See the sketch after this list.)
- Spreadsheets: Convert to text with row/column context preserved.
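To make the code-chunking point concrete, here is a hypothetical sketch that splits Python source at top-level functions and classes using the standard library ast module; the same idea applies to other languages with a suitable parser.
# Hypothetical example: chunk Python source by top-level functions/classes
import ast

def chunk_python_source(source: str):
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment keeps the node's full source text,
            # including its docstring and any comments inside its span
            chunks.append(ast.get_source_segment(source, node))
    return [c for c in chunks if c]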
Embedding Models: Choose Wisely
The embedding model converts your text chunks into vectors. This choice is effectively permanent: switching models means re-embedding and re-indexing your entire corpus, because vectors produced by different models are not comparable.
| Model | Dimensions | Context | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | General purpose, high quality |
| Cohere embed-v3 | 1024 | 512 tokens | Multilingual, compression |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open source, self-hosted |
| E5-mistral-7b-instruct | 4096 | 32k tokens | Long documents, instructions |
For privacy-sensitive deployments, BGE or E5 models can run entirely on your infrastructure.
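As a rough illustration, a self-hosted embedding pass with the sentence-transformers library might look like the following; the model name and normalization choice are assumptions, and chunks refers to the output of the splitting step above.
# Sketch: self-hosted embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Normalized vectors make cosine similarity equivalent to a dot product
embeddings = model.encode(
    [chunk.page_content for chunk in chunks],
    normalize_embeddings=True,
    batch_size=32,
)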
Vector Database Selection
Your vector database stores embeddings and handles similarity search. The choice depends on scale and operational requirements.
For Getting Started
- Chroma: Simple, embedded, great for prototypes
- Qdrant: Docker-friendly, good performance, open source
- Weaviate: Feature-rich, hybrid search built-in
For Production Scale
- Pinecone: Managed, scales effortlessly, but vendor lock-in
- Milvus: Open source, handles billions of vectors
- pgvector: If you're already on PostgreSQL, add vector search
# Example: Qdrant with local deployment
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)  # size must match your embedding model's dimensions
)
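Once the collection exists, chunks are indexed by upserting id/vector/payload triples. A hedged sketch, assuming the chunks and embeddings variables from the earlier steps:
from qdrant_client.models import PointStruct

client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=i,
            vector=embeddings[i].tolist(),
            # Payload carries the text and metadata needed for citations later
            payload={"text": chunks[i].page_content, "source": chunks[i].metadata.get("source")},
        )
        for i in range(len(chunks))
    ],
)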
Retrieval: Beyond Basic Similarity
Naive vector similarity search often returns irrelevant results. Production systems need smarter retrieval strategies.
Hybrid Search
Combine dense vectors (semantic similarity) with sparse vectors (keyword matching). This catches both conceptual matches and exact term matches.
# Hybrid search with reciprocal rank fusion
def hybrid_search(query, k=10):
    # Dense retrieval: semantic similarity over embeddings
    dense_results = vector_db.similarity_search(query, k=k * 2)
    # Sparse retrieval: exact keyword matching via BM25
    sparse_results = bm25_index.search(query, k=k * 2)
    # Merge the two ranked lists with reciprocal rank fusion
    combined = reciprocal_rank_fusion([dense_results, sparse_results], k=k)
    return combined
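The reciprocal_rank_fusion helper above is not a library call. A minimal sketch of it, assuming each result object exposes a stable id attribute to deduplicate on:
def reciprocal_rank_fusion(result_lists, k=10, c=60):
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            # Each list contributes 1 / (c + rank); c=60 is the constant from
            # the original RRF paper and damps the influence of any single list
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (c + rank + 1)
            docs[doc.id] = doc
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked[:k]]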
Query Transformation
User queries are often ambiguous or poorly phrased. Transform them before retrieval:
- Query expansion: Add synonyms and related terms
- HyDE: Have the LLM generate a hypothetical answer, then search for documents similar to it (sketched after this list)
- Multi-query: Generate multiple query variations and merge results
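A minimal HyDE sketch, assuming a generic llm client and the vector_db used earlier; the prompt wording is illustrative only.
def hyde_search(question, llm, vector_db, k=10):
    # Ask the model for a plausible (possibly imperfect) answer to the question
    hypothetical = llm.invoke(f"Write a short passage that answers: {question}")
    # Search with the hypothetical passage, which often lands closer to the
    # relevant documents in embedding space than the raw question does
    return vector_db.similarity_search(hypothetical, k=k)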
Re-ranking
After initial retrieval, use a more expensive model to re-rank results. Cross-encoder models like Cohere Rerank or BGE-reranker score the query and each candidate passage together rather than comparing pre-computed vectors, which is slower per document but significantly improves precision.
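A minimal re-ranking sketch with a cross-encoder loaded through sentence-transformers; the model choice and the page_content attribute on candidates are assumptions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query, candidates, top_k=5):
    # Score each (query, passage) pair jointly; slower than vector comparison
    # but considerably more precise
    scores = reranker.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]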
Context Assembly
You've retrieved relevant chunks. Now you need to assemble them into an effective prompt.
Context Assembly Tips
- Order chunks by relevance (most relevant first)
- Include source metadata for citations
- Limit total context to avoid diluting signal with noise
- Use clear delimiters between different sources
# Example prompt structure
CONTEXT:
[Source: Q4 Financial Report, Page 12]
Revenue increased 23% year-over-year to $4.2B...
[Source: CEO Letter to Shareholders]
Our strategic investments in AI infrastructure...
---
Based on the above context, answer the user's question.
If the answer cannot be found in the context, say so.
Question: {user_question}
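A sketch of assembling that prompt programmatically from retrieved chunks, assuming LangChain-style objects with page_content and a metadata dict, and a simple character budget standing in for a proper token count.
def build_prompt(question, chunks, max_chars=8000):
    sections, used = [], 0
    for chunk in chunks:  # assumed to arrive sorted by relevance, best first
        block = f"[Source: {chunk.metadata.get('source', 'unknown')}]\n{chunk.page_content}"
        if used + len(block) > max_chars:
            break  # stop before marginal chunks dilute the signal
        sections.append(block)
        used += len(block)
    context = "\n\n".join(sections)
    return (
        f"CONTEXT:\n{context}\n---\n"
        "Based on the above context, answer the user's question.\n"
        "If the answer cannot be found in the context, say so.\n"
        f"Question: {question}"
    )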
Evaluation: Measuring What Matters
You can't improve what you don't measure. Set up proper evaluation before deploying to production.
Key Metrics
- Retrieval Recall@K: What percentage of relevant documents appear in the top K results? (See the sketch after this list.)
- Answer Correctness: Does the generated answer match ground truth?
- Faithfulness: Is the answer grounded in retrieved context (no hallucinations)?
- Latency: End-to-end response time
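For example, Recall@K can be computed over a labeled evaluation set. The eval_set format and the id attribute on retrieved documents are assumptions, not a fixed schema.
def recall_at_k(eval_set, retrieve, k=10):
    recalls = []
    for example in eval_set:
        relevant = set(example["relevant_ids"])
        retrieved = {doc.id for doc in retrieve(example["question"], k=k)}
        # Fraction of this question's relevant documents found in the top K
        recalls.append(len(retrieved & relevant) / len(relevant))
    return sum(recalls) / len(recalls)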
Building an Evaluation Set
Create a set of 50-100 question-answer pairs with known correct answers. Include edge cases:
- Questions with no answer in corpus
- Questions requiring multi-document synthesis
- Questions with ambiguous wording
- Questions about recent data updates
Production Hardening
Caching
Cache at multiple levels:
- Embedding cache for repeated queries (sketched after this list)
- Retrieval results cache for common questions
- LLM response cache for identical inputs
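The first level can be as simple as a hash-keyed lookup in front of the embedding call. A sketch; a production system would back this with Redis or similar rather than an in-process dict.
import hashlib

_embedding_cache = {}

def embed_query_cached(query, embed_fn):
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(query)  # only pay for unseen queries
    return _embedding_cache[key]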
Fallback Strategies
What happens when retrieval returns nothing relevant?
- Expand search to broader scope
- Fall back to LLM general knowledge with disclaimer
- Route to human support
Observability
Log everything you'll need for debugging:
- Original query and any transformations
- Retrieved documents and scores
- Final prompt sent to LLM
- Generated response
- Latency breakdown by component
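One way to capture all of this is a single structured log entry per query; the field names below are illustrative, and the id/score attributes on retrieved documents are assumptions.
import json
import logging
import time

logger = logging.getLogger("rag")

def log_query_trace(query, transformed_query, docs, prompt, response, timings):
    # Emit one JSON record per query so traces are easy to search and aggregate
    logger.info(json.dumps({
        "timestamp": time.time(),
        "query": query,
        "transformed_query": transformed_query,
        "retrieved": [{"id": d.id, "score": d.score} for d in docs],
        "prompt": prompt,
        "response": response,
        "latency_ms": timings,  # e.g. {"retrieval": 42, "rerank": 18, "llm": 950}
    }))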
Common Pitfalls
Mistakes We See Repeatedly
- Ignoring chunk boundaries: Key information split across chunks becomes unretrievable
- Too much context: Stuffing the prompt with marginally relevant data confuses the LLM
- Stale indexes: Forgetting to update embeddings when source documents change
- Missing metadata: Without source info, users can't verify answers
- Evaluation afterthought: Building without measuring leads to unknown quality
Architecture for Scale
A production RAG system isn't a single script—it's a set of services:
- Ingestion Pipeline: Watches for new/updated documents, chunks, embeds, indexes
- Query Service: Handles user queries, retrieval, context assembly
- LLM Gateway: Routes to appropriate models, handles rate limiting, fallbacks
- Feedback Loop: Captures user ratings, enables continuous improvement
Getting Started
If you're building your first production RAG system:
- Start with a focused corpus (one document type, one use case)
- Build evaluation set before building the system
- Use managed services initially to reduce operational burden
- Invest in observability from day one
- Plan for iteration—your first version won't be your last
Need Help Building RAG Systems?
Acumen Labs has implemented RAG architectures across industries—from legal document search to technical support automation. We can help you navigate the complexities and avoid common pitfalls.
Schedule a Consultation