After deploying retrieval-augmented generation systems for 30 different clients, we have learned which patterns consistently work and which look great in demos but collapse under real load.
Vikram Patel
Head of AI Research
Retrieval-augmented generation has become the default architecture for grounding large language models in proprietary data, but the gap between a working prototype and a production-grade system is enormous. Over the past two years, we have deployed RAG pipelines for 30 different clients across industries ranging from legal tech to healthcare, and the failure modes are remarkably consistent. The most common mistake teams make is treating retrieval as a solved problem. They plug in a vector database, embed their documents, and assume cosine similarity will handle the rest. It does not.
The first lesson is that chunking strategy matters far more than embedding model choice. We have seen teams agonize over whether to use OpenAI embeddings or Cohere embeddings while splitting their documents into fixed 512-token chunks with no overlap. In practice, semantic chunking that respects document structure — headings, paragraphs, tables — outperforms naive fixed-size splitting by 20 to 35 percent on retrieval recall. We build custom chunking pipelines for every client that understand their specific document formats, whether those are legal contracts, medical records, or engineering specifications.
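As a rough illustration, here is a minimal sketch of structure-aware chunking for Markdown-like documents. The function name, the heading regex, and the four-characters-per-token estimate are illustrative assumptions, not our production pipeline, which is tailored per client and per document format.

```python
# Minimal sketch of structure-aware chunking for Markdown-like text.
# Names and the token estimate are illustrative, not from any specific library.
import re

def semantic_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Split on headings first, then paragraphs, so chunks never cross sections."""
    # Split into sections at Markdown headings, keeping each heading with its body.
    sections = re.split(r"\n(?=#{1,6}\s)", text)
    chunks = []
    for section in sections:
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current, current_len = [], 0
        for para in paragraphs:
            # Crude token estimate (~4 chars per token); use a real tokenizer in practice.
            para_len = len(para) // 4
            if current and current_len + para_len > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(para)
            current_len += para_len
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Tables and other block elements get their own handling in practice; the point is that chunk boundaries should follow the document's structure rather than an arbitrary token count.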
The second lesson involves hybrid retrieval. Pure vector search struggles with exact-match queries like part numbers, case IDs, or specific dates. We now default to a hybrid approach that combines dense vector search with sparse keyword search using BM25, with a reciprocal rank fusion step to merge results. This dual-path retrieval consistently outperforms either method alone, especially when users mix natural language questions with precise lookups. The infrastructure overhead is minimal — most vector databases now support hybrid modes natively.
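For readers unfamiliar with reciprocal rank fusion, here is a minimal sketch of the merge step. The document IDs are made up, and the constant of 60 is the value suggested in the original RRF paper rather than anything we tune per client.

```python
# Minimal sketch of reciprocal rank fusion (RRF) over two ranked result lists.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: dense vector hits and BM25 keyword hits for the same query.
dense_hits = ["doc_12", "doc_7", "doc_3"]
bm25_hits = ["doc_7", "doc_42", "doc_12"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because RRF only needs rank positions, it sidesteps the problem of dense and sparse scores living on incompatible scales.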
Evaluation is where most teams fall apart entirely. Without a robust evaluation framework, you are flying blind. We build golden test sets with every client before writing a single line of pipeline code: 50 to 100 question-answer pairs grounded in their actual documents. We measure retrieval recall at k, answer faithfulness, and hallucination rate on every pipeline change. This investment in evaluation infrastructure pays for itself within the first month by catching regressions that would otherwise reach production and erode user trust.
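A minimal sketch of the recall-at-k half of that harness might look like the following; the golden-set schema and the `retrieve` callable are assumptions for illustration, not our actual framework.

```python
# Minimal sketch of recall@k over a golden test set.
# `retrieve` stands in for the pipeline's retrieval call; the record format is illustrative.
def recall_at_k(golden_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of golden questions whose expected chunk appears in the top-k results."""
    hits = 0
    for example in golden_set:
        retrieved_ids = [chunk["id"] for chunk in retrieve(example["question"], top_k=k)]
        if any(chunk_id in retrieved_ids for chunk_id in example["expected_chunk_ids"]):
            hits += 1
    return hits / len(golden_set)
```

Faithfulness and hallucination rate need an LLM-as-judge or human review step on top of this, but even the retrieval metric alone catches a surprising share of regressions.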
Latency and cost optimization become critical once you move past the proof-of-concept stage. A naive RAG pipeline that retrieves 10 chunks and sends them all to GPT-4 on every query will cost a fortune and respond in 5 to 8 seconds. We use a re-ranking step with a lightweight cross-encoder to prune retrieved chunks down to the 3 most relevant before sending them to the generation model. Combined with streaming responses and aggressive caching of repeated queries, we typically achieve sub-2-second time-to-first-token at a fraction of the naive cost.
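A hedged sketch of that re-ranking step, using the CrossEncoder class from the sentence-transformers library; the model name shown is one common public checkpoint, not necessarily what we deploy for any given client.

```python
# Minimal sketch of cross-encoder re-ranking to prune retrieved chunks.
from sentence_transformers import CrossEncoder

# A small public cross-encoder checkpoint; swap in whatever fits your latency budget.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Score every (query, chunk) pair and keep only the top_n highest-scoring chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Sending 3 well-chosen chunks instead of 10 marginal ones cuts both the prompt cost and the chance that the model anchors on an irrelevant passage.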
Finally, the operational side of RAG is something nobody talks about at conferences. Documents change. New ones arrive daily. Stale embeddings lead to stale answers. We build incremental ingestion pipelines that watch for document changes, re-chunk and re-embed only the affected sections, and run regression tests against our golden set after every update. Without this operational backbone, even the best RAG architecture degrades over time. Production RAG is not a model problem — it is a systems engineering problem.
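To make the incremental idea concrete, here is a minimal sketch of hash-based change detection; the storage layout, the `reembed` and `run_golden_regression` callables, and the per-document granularity are illustrative assumptions rather than a description of our pipeline.

```python
# Minimal sketch of change detection for incremental re-ingestion.
import hashlib

def ingest_changed(documents: dict[str, str], stored_hashes: dict[str, str],
                   reembed, run_golden_regression) -> dict[str, str]:
    """Re-chunk and re-embed only documents whose content hash has changed."""
    changed = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            stored_hashes[doc_id] = digest
    for doc_id in changed:
        reembed(doc_id, documents[doc_id])   # re-chunk and re-embed just this document
    if changed:
        run_golden_regression()              # guard against retrieval regressions after updates
    return stored_hashes
```

In practice the same idea applies at the section level, so a one-paragraph edit to a 200-page contract does not trigger a full re-embedding of the document.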
Vikram Patel
Head of AI Research at LUMorion
Writes about AI & ML, engineering best practices, and building production systems at scale.