RAG is Not Just Chunking + Embedding + Retrieval — Here's What Production Actually Looks Like
A complete breakdown of enterprise-grade RAG pipeline with packages, architecture, and real engineering decisions

Most engineers learn RAG like this:
Chunk → Embed → Store → Retrieve → LLM → Done.
That works for a weekend project. It fails at 10M+ documents in production.
Here's what production RAG actually looks like — every stage, every package, every decision explained.
Stage 1 — Ingest + Normalize
Before any chunking or embedding happens, documents must be cleaned and standardized.
What it covers:
Deduplication — hash content, skip duplicates
Format standardization — PDF, DOCX, HTML, TXT all converted to plain text
Metadata extraction — title, author, date, source tagged on every document
Language detection — route to correct embedding model
Versioning — when document updates, mark old chunks stale
Example: Two files — "policy_2024.pdf" and "policy_2024_copy.pdf" with identical content. Without deduplication → both indexed → retrieval returns duplicate chunks → wrong answer repeated twice. With deduplication → hash matches → second file skipped → clean index.
Packages: pdfplumber, python-docx, beautifulsoup4, hashlib, langdetect
Stage 2 — Chunking Strategy
Chunking happens once during ingestion — offline, not at query time.
5 sub-steps:
Document Partitioning — detect headings, sections, preserve hierarchy
Chunk Size + Overlap — 300–800 tokens, 10–20% overlap prevents boundary loss
Semantic Chunking — embed sentences, split where cosine similarity drops
Special Handling — tables stay together with header, code splits at function boundaries
Metadata Enrichment — every chunk gets doc_id, section, page, timestamp
Example — why overlap matters: Without overlap: Chunk 1: "The request was rejected because..." Chunk 2: "...the customer failed to submit documents." → Answer is split across two chunks. If only Chunk 1 is retrieved → LLM sees incomplete reason.
With overlap: Chunk 1: "The request was rejected because the customer failed..." Chunk 2: "...the customer failed to submit documents on time." → Answer is complete in both chunks. Retrieval always finds the full context.
Packages: nltk, tiktoken, sentence-transformers, unstructured, ast, pandas
Embedding
After chunking, each chunk is embedded using SBERT and stored. This happens once offline. At query time — only the query gets embedded using the same model.
Rule: Same model for chunk embedding AND query embedding. Different models = broken retrieval.
Example: Chunk embedded with: all-MiniLM-L6-v2 → vector [0.23, 0.87, 0.11 ...] Query embedded with: different model → vector [0.91, 0.02, 0.76 ...] → Cosine similarity is meaningless — vectors are in different spaces → retrieval completely breaks.
Packages: sentence-transformers (all-MiniLM-L6-v2), faiss, Azure AI Search
Stage 3 — Hybrid Retrieval (BM25 + Vector)
Neither BM25 nor vector search alone is enough.
BM25 — keyword scorer. Finds exact terms, invoice numbers, product codes. No embeddings.
Vector — semantic scorer. Finds "cardiac arrest" when you search "heart attack".
RRF (Reciprocal Rank Fusion) — merges both ranked lists by position, not score.
Example: Query: "invoice #INV-2024-991 status" → Vector search misses it — semantic model doesn't memorize exact invoice numbers. → BM25 finds it instantly — exact string match.
Query: "heart attack coverage" → BM25 scores ZERO on chunk saying "cardiac arrest" — different words. → Vector finds it — same meaning, different words.
Hybrid covers both cases. RRF merges by rank position because BM25 scores (0–10) and cosine scores (0–1) cannot be directly compared.
Packages: rank_bm25, nltk, faiss, sentence-transformers
Stage 4 — ANN + Cross-Encoder Reranking
Two-stage funnel:
ANN (Approximate Nearest Neighbor) — 10M chunks → top 50 in 2ms using HNSW graph index
Cross-Encoder Reranker — reads query + chunk together, deep relevance score, returns top 10
Why not run reranker on all 10M? At 200ms per pair → 200ms × 10M = 23 days per query. Impossible. ANN gets you to 50 fast. Reranker gets you to 10 accurately.
Example — before vs after reranking: Query: "What covers cardiac arrest?"
Before reranking (cosine order):
Fire coverage 0.81
Flood coverage 0.79
Cardiac arrest 0.76
After reranking (cross-encoder order):
Cardiac arrest 0.97 ← correctly moved to top
Fire coverage 0.41 ← correctly pushed down
Flood coverage 0.38 ← correctly pushed down
Cross-encoder reads query + chunk together — understands fire is not relevant to cardiac arrest.
Packages: faiss (IndexHNSWFlat), sentence-transformers (CrossEncoder), cross-encoder/ms-marco-MiniLM-L-6-v2
Stage 5 — Confidence Scoring
Before the LLM sees anything, every chunk gets scored on:
Freshness — how old is the document
Authority — official source vs random wiki
Overlap consistency — do other chunks agree or contradict
Retrieval consistency — was it strong in both BM25 and vector
Example: Chunk A: "Fire coverage limit is $500,000" — from official 2024 document → score: 0.93 → PASS Chunk B: "Fire coverage limit is $200,000" — from 2017 wiki page → score: 0.31 → FILTERED OUT
Low confidence chunks are filtered out. If zero chunks pass → system returns "Insufficient Evidence" instead of hallucinating.
No external package needed. Uses metadata + embeddings already computed in earlier stages.
Stage 6 — Constrained Generation
LLM answers only from trusted chunks. System prompt enforces this:
"Answer ONLY using the context below. Do NOT use outside knowledge. If answer not in context, say I don't know."
Temperature = 0 for deterministic output.
Example: Context given: "Fire coverage limit is $500,000 per incident." Query: "What is the fire coverage limit?" LLM: "Based on the provided document, fire coverage limit is $500,000 per incident." → No guessing. No training data used. Pure context answer.
Packages: LangChain, LangGraph, OpenAI API
Stage 7 — Citation-Backed Responses
Every sentence linked to source document, section, and page. Chunk metadata (doc_id, title, section, page) flows from Stage 1 all the way to the final response.
Example: "Fire coverage limit is $500,000 [1]" [1] policy_2024.pdf — Section: Coverage Details — Page 12
Auditors open page 12 and verify the answer directly. Fully auditable.
Supporting Layers
Continuous Evals — Precision@K, Recall@K, Faithfulness, Hallucination Rate tracked automatically Packages: RAGAS, DeepEval, MLflow
Caching — Redis stores frequent query results. Same query never runs full pipeline twice. Redis = standalone in-memory database server. redis-py = Python package to connect to it. Example: 1000 users ask same query → only first runs full pipeline → rest get cached answer instantly.
Observability — every query traced end-to-end. Latency, chunk scores, token cost, failures all logged. Packages: OpenTelemetry, MLflow, Prometheus, Grafana
Key Outcomes
Zero Hallucination — Confidence gate + constrained prompt Low Latency — ANN 2ms + Redis cache High Retrieval Accuracy — Hybrid BM25 + Vector + Reranker Fully Auditable — Citations on every answer Production Ready — Observability + Evals + Versioning
Final Thought
RAG done right is a system design problem, not a model problem.
The LLM is just the last step. Everything before it — ingestion, chunking, retrieval, scoring, constraints — determines whether your system gives correct answers or confident hallucinations.
If you found this useful, follow for more production AI engineering content. Questions or feedback — drop a comment below.
📧 devathilokesh2001@gmail.com
💼 Connect on LinkedIn — https://www.linkedin.com/in/sailokesh-datascience-aiml/



