We've shipped RAG systems for pharmaceutical intelligence, consumer product reviews, legal document management, engineering documentation, and enterprise knowledge bases. Not prototypes — production systems that real people depend on every day. This article documents what we have learned optimizing these systems for production: what to tune, in what order, and why the fundamentals matter more than the frontier.

Here's what we've actually learned. No theory, no hype, just the patterns and pitfalls from building retrieval-augmented generation at scale.

The three RAG architectures we deploy at enterprise scale

Before the lessons, the map. Across ten enterprise deployments, three distinct RAG architectures have emerged as the viable options at production scale. Each makes different trade-offs between retrieval quality, latency, cost, and operational complexity.

Naive RAG — a single retrieval pass feeding directly into generation. Fast, cheap, and sufficient for simple corpora with clean prose and low query complexity. We use this as the baseline in every new engagement. It almost never survives contact with real enterprise data.

Hybrid RAG — dense vector search combined with sparse keyword search (BM25 or similar), with a reranker sitting between retrieval and generation. This is where most production systems land. The dual retrieval approach handles both semantic understanding and exact-match requirements, which real enterprise queries almost always need simultaneously. The 70/30 dense/sparse weighting is a starting point, not a rule — domain-specific jargon and regulatory terminology push the balance toward keywords.

Agentic RAG — multi-step retrieval where an orchestration layer decides what to retrieve, evaluates whether the retrieved context is sufficient, and retrieves again if not. Higher latency, higher cost, meaningfully better quality for complex cross-document queries. We deploy this for pharmaceutical intelligence and legal document systems where query complexity is high and answer quality is non-negotiable.
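The orchestration loop can be sketched as a bounded retrieve-assess-refine cycle. This is an illustrative skeleton, not our production orchestrator — `retrieve`, `is_sufficient`, `refine_query`, and `generate` are hypothetical stand-ins you would wire to your own retriever and LLM calls:

```python
def agentic_answer(query, retrieve, is_sufficient, refine_query, generate, max_rounds=3):
    """Bounded multi-step retrieval: retrieve, assess sufficiency, refine, repeat."""
    context = []
    q = query
    for _ in range(max_rounds):
        context.extend(retrieve(q))          # pull more chunks for the current sub-query
        if is_sufficient(query, context):    # e.g. an LLM judge or a coverage heuristic
            break
        q = refine_query(query, context)     # e.g. ask the LLM what is still missing
    return generate(query, context)
```

The `max_rounds` cap is what keeps latency and cost bounded — without it, a strict sufficiency judge can loop indefinitely on queries the corpus simply can't answer.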

The lessons below apply across all three architectures, but the trade-offs differ. Where a lesson applies differently depending on architecture, we note it.

1. Your chunking strategy matters more than your model choice

Everyone obsesses over which LLM to use. GPT-4 vs Claude vs Gemini. In practice, the difference between a good and bad RAG system is almost always in the retrieval layer — and that starts with how you chunk your documents.

We've tried fixed-size chunks, sentence-based splitting, page-level chunks, and hybrid approaches. The winner depends on your corpus. For regulatory pharmaceutical documents, page-level chunking with table preservation worked best. For consumer product reviews, smaller semantic chunks (512 tokens) with minimal overlap outperformed everything else.

The lesson: Run chunking experiments before touching the prompt. We built an automated sweep tool that tests chunk sizes, overlap ratios, and splitting strategies against a fixed eval set. It runs overnight and we wake up to results.
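In outline, the sweep is just a grid search over a fixed eval set. A minimal sketch, assuming hypothetical `chunk_fn` (your splitter) and `recall_at_k` (your retrieval-eval harness) callables:

```python
from itertools import product

def sweep_chunking(corpus, eval_set, chunk_fn, recall_at_k):
    """Grid-search chunk size and overlap against a fixed eval set.

    chunk_fn(corpus, size, overlap) -> chunks; recall_at_k(chunks, eval_set) -> float.
    Both are stand-ins for your own splitter and evaluation harness.
    """
    results = []
    for size, overlap in product([256, 512, 1024], [0, 32, 64]):
        chunks = chunk_fn(corpus, size, overlap)
        results.append(((size, overlap), recall_at_k(chunks, eval_set)))
    # return the best (size, overlap) configuration and its score
    return max(results, key=lambda r: r[1])
```

The key property is that the eval set stays fixed across configurations — otherwise the sweep measures noise, not chunking quality.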

2. Hybrid search beats pure vector search. Every time.

Pure semantic search sounds elegant — embed everything, find the closest vectors, done. In practice, it misses things. A user searching for "PCI DSS compliance requirements" needs exact keyword matching alongside semantic understanding.

In every deployment, a weighted combination of dense (semantic) and sparse (keyword) search outperformed either alone. The typical sweet spot is 70% dense, 30% sparse — but this varies by corpus. Domain-specific jargon pushes the balance toward keywords.

We use Pinecone, OpenSearch, or Milvus for dense vectors, combined with BM25 or sparse vectors from BGE-M3. The hybrid approach consistently delivers 15-25% better recall than pure vector search.
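The weighted fusion itself is simple. A minimal sketch using min-max normalisation before combining (reciprocal rank fusion is a common alternative that skips normalisation entirely); the 0.7/0.3 defaults mirror the starting weights discussed above:

```python
def hybrid_rank(dense, sparse, w_dense=0.7, w_sparse=0.3):
    """Fuse dense and sparse retrieval scores into one ranking.

    dense, sparse: dicts mapping doc_id -> raw score from each retriever.
    Scores are min-max normalised per retriever so the weights are meaningful.
    """
    def normalise(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalise(dense), normalise(sparse)
    combined = {doc: w_dense * d.get(doc, 0.0) + w_sparse * s.get(doc, 0.0)
                for doc in set(d) | set(s)}
    return sorted(combined, key=combined.get, reverse=True)
```

Normalisation matters because cosine similarities and BM25 scores live on completely different scales; fusing raw scores silently lets one retriever dominate.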

3. Citation is non-negotiable in regulated industries

For our pharmaceutical client, every AI-generated answer needs to cite its sources — document name, page number, and the specific passage. Without this, the system is useless. Doctors and regulatory affairs teams won't trust an answer they can't verify.

We built a grounding system that tracks exactly which chunks contributed to each answer, preserves the source metadata through the entire pipeline, and presents citations inline. It's not glamorous engineering, but it's the difference between a demo and a system people actually use.
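The core of such a system is refusing to ever separate a chunk's text from its source metadata. A minimal sketch (the `Chunk` fields and numbered-citation format are illustrative, not our exact schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    doc_name: str
    page: int

def build_prompt(chunks):
    """Number each chunk and keep a citation map, so the model can cite
    [1], [2], ... and we can resolve those markers back to sources."""
    lines, citations = [], {}
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] {chunk.text}")
        citations[i] = (chunk.doc_name, chunk.page)
    return "\n".join(lines), citations
```

After generation, any `[n]` marker in the answer resolves through `citations` to a document name and page number the user can verify.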

4. Evaluation is the hardest part

You can't improve what you can't measure. And measuring RAG quality is genuinely hard.

For retrieval, we track MRR (Mean Reciprocal Rank), Recall@K, and NDCG. But building the ground truth eval set — queries paired with the documents that should be retrieved — requires domain expertise and manual labeling. There's no shortcut.
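The metrics themselves are the easy part; per-query MRR and Recall@K are a few lines each (NDCG follows the same shape with log-discounted gains):

```python
def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result for one query."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)
```

Average each over the eval set to get the headline numbers. The expensive part, as noted, is producing `relevant_ids` for every query by hand.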

We start with 20-30 hand-curated queries across different categories (factual lookups, technical specs, cross-document questions, table-heavy queries). Then we expand semi-automatically as we discover edge cases. Holding out a test set to detect overfitting is critical.

Our rule: If you don't have an eval set, you don't have a RAG system. You have a chatbot with a search bar.

5. Tables and structured data need special treatment

Most RAG tutorials assume your documents are clean prose. Real enterprise documents are full of tables, forms, headers, footers, and weird formatting. Standard text extraction butchers tables.

Our approach: extract tables separately, generate natural-language summaries using a local LLM, and embed both the raw table data and the summary. The summary gives the embedding model something semantic to work with. This single technique improved retrieval quality measurably in every deployment with table-heavy documents.
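Concretely, each table produces two index entries that share one payload. A minimal sketch, assuming hypothetical `summarise` (the local-LLM call), `embed`, and a list-like `index` — stand-ins for your own model calls and vector store:

```python
def index_table(table_markdown, summarise, embed, index):
    """Embed both the raw table and an LLM-written summary of it.

    Both entries point at the same payload, so whichever representation
    is retrieved, the generator sees the real numbers from the raw table.
    """
    summary = summarise(table_markdown)
    payload = {"raw_table": table_markdown, "summary": summary}
    index.append((embed(summary), payload))         # semantic entry point
    index.append((embed(table_markdown), payload))  # exact-value entry point
    return payload
```

The summary entry is what rescues queries like "which product had the highest Q3 margin?" — pure table text rarely lands near that in embedding space.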

6. Start with the retrieval, not the generation

A common mistake: teams spend weeks crafting the perfect system prompt and few-shot examples before making sure retrieval actually works. If the right documents don't appear in the context window, no amount of prompt engineering will save you.

Our process: get retrieval working first. Measure it. Iterate on chunking, embeddings, and search until the eval metrics are solid. Only then optimize the generation prompt. We've seen teams waste months on generation when the real problem was retrieval returning irrelevant chunks.

7. Cost management is an architecture decision

At scale, RAG costs add up fast. Embedding 100K+ documents, running vector search on every query, and sending large context windows to an LLM — each step has a cost.

  • Cache aggressively. If the same query hits the same documents, cache the response.
  • Use cheaper models for filtering and reranking, expensive models only for final generation.
  • Context caching (supported by Claude and GPT) can reduce costs by 50-80% for repeated system prompts.
  • Consider on-premise vector databases (Zvec, FAISS) for sensitive data — no per-query cloud costs.
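The first bullet above — cache when the same query hits the same documents — can be sketched as a cache keyed on the query plus the retrieved chunk ids, so re-indexing invalidates entries automatically. An illustrative in-memory version (production would use Redis or similar with TTLs):

```python
import hashlib

class RAGCache:
    """Cache final answers keyed by (query, retrieved chunk ids)."""

    def __init__(self):
        self._store = {}

    def _key(self, query, chunk_ids):
        # sorted ids: the same chunk set in any order yields the same key
        payload = query + "|" + ",".join(sorted(chunk_ids))
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_generate(self, query, chunk_ids, generate):
        key = self._key(query, chunk_ids)
        if key not in self._store:
            self._store[key] = generate()  # pay for the LLM call only once
        return self._store[key]
```

Keying on chunk ids rather than the query alone is the important design choice: when the index changes underneath a query, the stale answer is never served.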
8. The "overnight experiment" pattern changed how we tune

Inspired by Karpathy's autoresearch, we built a system that lets an AI agent sweep RAG parameters autonomously — we've written up the full approach in AI-Optimised RAG Pipeline: The Overnight Autoresearch Pattern. Define the parameter ranges, set a fixed compute budget per experiment, point it at an eval set, and let it run overnight.

For one client, 24 experiments ran in six hours. The winning configuration (512-token chunks, 32-token overlap, 0.7/0.3 dense/sparse weighting) achieved 82% Recall@5 — significantly better than our hand-tuned baseline. The agent found parameter interactions we wouldn't have tested manually.

9. Production RAG needs monitoring, not just deployment

Shipping a RAG system is the beginning, not the end. Document corpora change. New content gets added. Query patterns evolve. A system that worked at launch can degrade silently.

  • Track retrieval quality metrics in production (not just latency and error rates)
  • Log queries that return no relevant results — these are your improvement opportunities
  • Re-embed and re-index when significant new content is added
  • Monitor for hallucination patterns — queries where the model confidently cites non-existent information
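The second check above — logging queries with no relevant results — can be a one-function hook on the retrieval path. A minimal sketch; the score threshold is an illustrative value, not a universal constant:

```python
def monitor_query(query, results, score_threshold=0.4, log=print):
    """Flag queries whose best retrieval score is weak.

    results: list of (doc_id, score) pairs from the retriever.
    Returns True if retrieval looks healthy, False if the query
    was logged as a likely corpus gap.
    """
    best = max((score for _, score in results), default=0.0)
    if best < score_threshold:
        log(f"LOW_RECALL query={query!r} best_score={best:.2f}")
        return False
    return True
```

Reviewing that log weekly is one of the cheapest improvement loops we know: each flagged query is either missing content, a chunking failure, or a vocabulary mismatch to fix.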
10. The model matters less than you think

We've deployed RAG with GPT-4, Claude Sonnet, Gemini Flash, and open-source models. The quality difference between top-tier models is smaller than the quality difference between good and bad retrieval.

Choose your model based on practical constraints: latency requirements, cost at your query volume, data residency rules, and context window size. For most enterprise use cases, Claude Sonnet or GPT-4o with good retrieval will outperform GPT-5 with mediocre retrieval.

The bottom line

RAG in production is 20% AI and 80% engineering. The sexy part — picking models, writing prompts — is the smallest piece. The hard work is in document extraction, chunking strategy, evaluation methodology, citation tracking, and operational monitoring.

Every deployment has taught us that the fundamentals matter more than the frontier. Get the retrieval right, measure everything, and iterate relentlessly. The overnight experiment dream isn't a dream anymore — it's how we work.