Production RAG Pipeline: Zero Hallucination Guide

Most engineers learn RAG like this:

Chunk → Embed → Store → Retrieve → LLM → Done.

That works for a weekend project. It fails at 10M+ documents in production.

Here's what production RAG actually looks like — every stage, every package, every decision explained.

Stage 1 — Ingest + Normalize

Before any chunking or embedding happens, documents must be cleaned and standardized.

What it covers:

Deduplication — hash content, skip duplicates
Format standardization — PDF, DOCX, HTML, TXT all converted to plain text
Metadata extraction — title, author, date, source tagged on every document
Language detection — route to correct embedding model
Versioning — when document updates, mark old chunks stale

Example: Two files — "policy_2024.pdf" and "policy_2024_copy.pdf" with identical content. Without deduplication → both indexed → retrieval returns duplicate chunks → wrong answer repeated twice. With deduplication → hash matches → second file skipped → clean index.

Packages: pdfplumber, python-docx, beautifulsoup4, hashlib, langdetect

Stage 2 — Chunking Strategy

Chunking happens once during ingestion — offline, not at query time.

5 sub-steps:

Document Partitioning — detect headings, sections, preserve hierarchy
Chunk Size + Overlap — 300–800 tokens, 10–20% overlap prevents boundary loss
Semantic Chunking — embed sentences, split where cosine similarity drops
Special Handling — tables stay together with header, code splits at function boundaries
Metadata Enrichment — every chunk gets doc_id, section, page, timestamp

Example — why overlap matters: Without overlap: Chunk 1: "The request was rejected because..." Chunk 2: "...the customer failed to submit documents." → Answer is split across two chunks. If only Chunk 1 is retrieved → LLM sees incomplete reason.

With overlap: Chunk 1: "The request was rejected because the customer failed..." Chunk 2: "...the customer failed to submit documents on time." → Answer is complete in both chunks. Retrieval always finds the full context.

Packages: nltk, tiktoken, sentence-transformers, unstructured, ast, pandas

Embedding

After chunking, each chunk is embedded using SBERT and stored. This happens once offline. At query time — only the query gets embedded using the same model.

Rule: Same model for chunk embedding AND query embedding. Different models = broken retrieval.

Example: Chunk embedded with: all-MiniLM-L6-v2 → vector [0.23, 0.87, 0.11 ...] Query embedded with: different model → vector [0.91, 0.02, 0.76 ...] → Cosine similarity is meaningless — vectors are in different spaces → retrieval completely breaks.

Packages: sentence-transformers (all-MiniLM-L6-v2), faiss, Azure AI Search

Stage 3 — Hybrid Retrieval (BM25 + Vector)

Neither BM25 nor vector search alone is enough.

BM25 — keyword scorer. Finds exact terms, invoice numbers, product codes. No embeddings.
Vector — semantic scorer. Finds "cardiac arrest" when you search "heart attack".
RRF (Reciprocal Rank Fusion) — merges both ranked lists by position, not score.

Example: Query: "invoice #INV-2024-991 status" → Vector search misses it — semantic model doesn't memorize exact invoice numbers. → BM25 finds it instantly — exact string match.

Query: "heart attack coverage" → BM25 scores ZERO on chunk saying "cardiac arrest" — different words. → Vector finds it — same meaning, different words.

Hybrid covers both cases. RRF merges by rank position because BM25 scores (0–10) and cosine scores (0–1) cannot be directly compared.

Packages: rank_bm25, nltk, faiss, sentence-transformers

Stage 4 — ANN + Cross-Encoder Reranking

Two-stage funnel:

ANN (Approximate Nearest Neighbor) — 10M chunks → top 50 in 2ms using HNSW graph index
Cross-Encoder Reranker — reads query + chunk together, deep relevance score, returns top 10

Why not run reranker on all 10M? At 200ms per pair → 200ms × 10M = 23 days per query. Impossible. ANN gets you to 50 fast. Reranker gets you to 10 accurately.

Example — before vs after reranking: Query: "What covers cardiac arrest?"

Before reranking (cosine order):

Fire coverage 0.81
Flood coverage 0.79
Cardiac arrest 0.76

After reranking (cross-encoder order):

Cardiac arrest 0.97 ← correctly moved to top
Fire coverage 0.41 ← correctly pushed down
Flood coverage 0.38 ← correctly pushed down

Cross-encoder reads query + chunk together — understands fire is not relevant to cardiac arrest.

Packages: faiss (IndexHNSWFlat), sentence-transformers (CrossEncoder), cross-encoder/ms-marco-MiniLM-L-6-v2

Stage 5 — Confidence Scoring

Before the LLM sees anything, every chunk gets scored on:

Freshness — how old is the document
Authority — official source vs random wiki
Overlap consistency — do other chunks agree or contradict
Retrieval consistency — was it strong in both BM25 and vector

Example: Chunk A: "Fire coverage limit is $500,000" — from official 2024 document → score: 0.93 → PASS Chunk B: "Fire coverage limit is $200,000" — from 2017 wiki page → score: 0.31 → FILTERED OUT

Low confidence chunks are filtered out. If zero chunks pass → system returns "Insufficient Evidence" instead of hallucinating.

No external package needed. Uses metadata + embeddings already computed in earlier stages.

Stage 6 — Constrained Generation

LLM answers only from trusted chunks. System prompt enforces this:

"Answer ONLY using the context below. Do NOT use outside knowledge. If answer not in context, say I don't know."

Temperature = 0 for deterministic output.

Example: Context given: "Fire coverage limit is $500,000 per incident." Query: "What is the fire coverage limit?" LLM: "Based on the provided document, fire coverage limit is $500,000 per incident." → No guessing. No training data used. Pure context answer.

Packages: LangChain, LangGraph, OpenAI API

Stage 7 — Citation-Backed Responses

Every sentence linked to source document, section, and page. Chunk metadata (doc_id, title, section, page) flows from Stage 1 all the way to the final response.

Example: "Fire coverage limit is $500,000 [1]" [1] policy_2024.pdf — Section: Coverage Details — Page 12

Auditors open page 12 and verify the answer directly. Fully auditable.

Supporting Layers

Continuous Evals — Precision@K, Recall@K, Faithfulness, Hallucination Rate tracked automatically Packages: RAGAS, DeepEval, MLflow

Caching — Redis stores frequent query results. Same query never runs full pipeline twice. Redis = standalone in-memory database server. redis-py = Python package to connect to it. Example: 1000 users ask same query → only first runs full pipeline → rest get cached answer instantly.

Observability — every query traced end-to-end. Latency, chunk scores, token cost, failures all logged. Packages: OpenTelemetry, MLflow, Prometheus, Grafana

Key Outcomes

Zero Hallucination — Confidence gate + constrained prompt Low Latency — ANN 2ms + Redis cache High Retrieval Accuracy — Hybrid BM25 + Vector + Reranker Fully Auditable — Citations on every answer Production Ready — Observability + Evals + Versioning

Final Thought

RAG done right is a system design problem, not a model problem.

The LLM is just the last step. Everything before it — ingestion, chunking, retrieval, scoring, constraints — determines whether your system gives correct answers or confident hallucinations.

If you found this useful, follow for more production AI engineering content. Questions or feedback — drop a comment below.
📧 devathilokesh2001@gmail.com
💼 Connect on LinkedIn — https://www.linkedin.com/in/sailokesh-datascience-aiml/

RAG is Not Just Chunking + Embedding + Retrieval — Here's What Production Actually Looks Like

Stage 1 — Ingest + Normalize

Stage 2 — Chunking Strategy

Embedding

Stage 3 — Hybrid Retrieval (BM25 + Vector)

Stage 4 — ANN + Cross-Encoder Reranking

Stage 5 — Confidence Scoring

Stage 6 — Constrained Generation

Stage 7 — Citation-Backed Responses

Supporting Layers

Key Outcomes

Final Thought

Comments

AI in Production

AI Agents in Production — What Actually Breaks

More from this blog

The 5 Layers of Agent Memory — What Every Production Agent Needs

# Not Every RAG System Needs a Vector Database

AI Agents in Production — What Actually Breaks

Command Palette

Stage 1 — Ingest + Normalize

Stage 2 — Chunking Strategy

Embedding

Stage 3 — Hybrid Retrieval (BM25 + Vector)

Stage 4 — ANN + Cross-Encoder Reranking

Stage 5 — Confidence Scoring

Stage 6 — Constrained Generation

Stage 7 — Citation-Backed Responses

Supporting Layers

Key Outcomes

Final Thought

Comments

AI in Production

AI Agents in Production — What Actually Breaks

More from this blog