Skip to main content

Command Palette

Search for a command to run...

RAG is Not Just Chunking + Embedding + Retrieval — Here's What Production Actually Looks Like

A complete breakdown of enterprise-grade RAG pipeline with packages, architecture, and real engineering decisions

Updated
6 min read
RAG is Not Just Chunking + Embedding + Retrieval — Here's What Production Actually Looks Like
D
AI/ML & MLOps Engineer. I build production pipelines and LLM systems. Writing about real-world AI engineering.

Most engineers learn RAG like this:

Chunk → Embed → Store → Retrieve → LLM → Done.

That works for a weekend project. It fails at 10M+ documents in production.

Here's what production RAG actually looks like — every stage, every package, every decision explained.


Stage 1 — Ingest + Normalize

Before any chunking or embedding happens, documents must be cleaned and standardized.

What it covers:

  • Deduplication — hash content, skip duplicates

  • Format standardization — PDF, DOCX, HTML, TXT all converted to plain text

  • Metadata extraction — title, author, date, source tagged on every document

  • Language detection — route to correct embedding model

  • Versioning — when document updates, mark old chunks stale

Example: Two files — "policy_2024.pdf" and "policy_2024_copy.pdf" with identical content. Without deduplication → both indexed → retrieval returns duplicate chunks → wrong answer repeated twice. With deduplication → hash matches → second file skipped → clean index.

Packages: pdfplumber, python-docx, beautifulsoup4, hashlib, langdetect


Stage 2 — Chunking Strategy

Chunking happens once during ingestion — offline, not at query time.

5 sub-steps:

  • Document Partitioning — detect headings, sections, preserve hierarchy

  • Chunk Size + Overlap — 300–800 tokens, 10–20% overlap prevents boundary loss

  • Semantic Chunking — embed sentences, split where cosine similarity drops

  • Special Handling — tables stay together with header, code splits at function boundaries

  • Metadata Enrichment — every chunk gets doc_id, section, page, timestamp

Example — why overlap matters: Without overlap: Chunk 1: "The request was rejected because..." Chunk 2: "...the customer failed to submit documents." → Answer is split across two chunks. If only Chunk 1 is retrieved → LLM sees incomplete reason.

With overlap: Chunk 1: "The request was rejected because the customer failed..." Chunk 2: "...the customer failed to submit documents on time." → Answer is complete in both chunks. Retrieval always finds the full context.

Packages: nltk, tiktoken, sentence-transformers, unstructured, ast, pandas


Embedding

After chunking, each chunk is embedded using SBERT and stored. This happens once offline. At query time — only the query gets embedded using the same model.

Rule: Same model for chunk embedding AND query embedding. Different models = broken retrieval.

Example: Chunk embedded with: all-MiniLM-L6-v2 → vector [0.23, 0.87, 0.11 ...] Query embedded with: different model → vector [0.91, 0.02, 0.76 ...] → Cosine similarity is meaningless — vectors are in different spaces → retrieval completely breaks.

Packages: sentence-transformers (all-MiniLM-L6-v2), faiss, Azure AI Search


Stage 3 — Hybrid Retrieval (BM25 + Vector)

Neither BM25 nor vector search alone is enough.

  • BM25 — keyword scorer. Finds exact terms, invoice numbers, product codes. No embeddings.

  • Vector — semantic scorer. Finds "cardiac arrest" when you search "heart attack".

  • RRF (Reciprocal Rank Fusion) — merges both ranked lists by position, not score.

Example: Query: "invoice #INV-2024-991 status" → Vector search misses it — semantic model doesn't memorize exact invoice numbers. → BM25 finds it instantly — exact string match.

Query: "heart attack coverage" → BM25 scores ZERO on chunk saying "cardiac arrest" — different words. → Vector finds it — same meaning, different words.

Hybrid covers both cases. RRF merges by rank position because BM25 scores (0–10) and cosine scores (0–1) cannot be directly compared.

Packages: rank_bm25, nltk, faiss, sentence-transformers


Stage 4 — ANN + Cross-Encoder Reranking

Two-stage funnel:

  • ANN (Approximate Nearest Neighbor) — 10M chunks → top 50 in 2ms using HNSW graph index

  • Cross-Encoder Reranker — reads query + chunk together, deep relevance score, returns top 10

Why not run reranker on all 10M? At 200ms per pair → 200ms × 10M = 23 days per query. Impossible. ANN gets you to 50 fast. Reranker gets you to 10 accurately.

Example — before vs after reranking: Query: "What covers cardiac arrest?"

Before reranking (cosine order):

  1. Fire coverage 0.81

  2. Flood coverage 0.79

  3. Cardiac arrest 0.76

After reranking (cross-encoder order):

  1. Cardiac arrest 0.97 ← correctly moved to top

  2. Fire coverage 0.41 ← correctly pushed down

  3. Flood coverage 0.38 ← correctly pushed down

Cross-encoder reads query + chunk together — understands fire is not relevant to cardiac arrest.

Packages: faiss (IndexHNSWFlat), sentence-transformers (CrossEncoder), cross-encoder/ms-marco-MiniLM-L-6-v2


Stage 5 — Confidence Scoring

Before the LLM sees anything, every chunk gets scored on:

  • Freshness — how old is the document

  • Authority — official source vs random wiki

  • Overlap consistency — do other chunks agree or contradict

  • Retrieval consistency — was it strong in both BM25 and vector

Example: Chunk A: "Fire coverage limit is $500,000" — from official 2024 document → score: 0.93 → PASS Chunk B: "Fire coverage limit is $200,000" — from 2017 wiki page → score: 0.31 → FILTERED OUT

Low confidence chunks are filtered out. If zero chunks pass → system returns "Insufficient Evidence" instead of hallucinating.

No external package needed. Uses metadata + embeddings already computed in earlier stages.


Stage 6 — Constrained Generation

LLM answers only from trusted chunks. System prompt enforces this:

"Answer ONLY using the context below. Do NOT use outside knowledge. If answer not in context, say I don't know."

Temperature = 0 for deterministic output.

Example: Context given: "Fire coverage limit is $500,000 per incident." Query: "What is the fire coverage limit?" LLM: "Based on the provided document, fire coverage limit is $500,000 per incident." → No guessing. No training data used. Pure context answer.

Packages: LangChain, LangGraph, OpenAI API


Stage 7 — Citation-Backed Responses

Every sentence linked to source document, section, and page. Chunk metadata (doc_id, title, section, page) flows from Stage 1 all the way to the final response.

Example: "Fire coverage limit is $500,000 [1]" [1] policy_2024.pdf — Section: Coverage Details — Page 12

Auditors open page 12 and verify the answer directly. Fully auditable.


Supporting Layers

Continuous Evals — Precision@K, Recall@K, Faithfulness, Hallucination Rate tracked automatically Packages: RAGAS, DeepEval, MLflow

Caching — Redis stores frequent query results. Same query never runs full pipeline twice. Redis = standalone in-memory database server. redis-py = Python package to connect to it. Example: 1000 users ask same query → only first runs full pipeline → rest get cached answer instantly.

Observability — every query traced end-to-end. Latency, chunk scores, token cost, failures all logged. Packages: OpenTelemetry, MLflow, Prometheus, Grafana


Key Outcomes

Zero Hallucination — Confidence gate + constrained prompt Low Latency — ANN 2ms + Redis cache High Retrieval Accuracy — Hybrid BM25 + Vector + Reranker Fully Auditable — Citations on every answer Production Ready — Observability + Evals + Versioning


Final Thought

RAG done right is a system design problem, not a model problem.

The LLM is just the last step. Everything before it — ingestion, chunking, retrieval, scoring, constraints — determines whether your system gives correct answers or confident hallucinations.


If you found this useful, follow for more production AI engineering content. Questions or feedback — drop a comment below.
📧 devathilokesh2001@gmail.com
💼 Connect on LinkedIn — https://www.linkedin.com/in/sailokesh-datascience-aiml/

AI in Production

Part 1 of 4

A practical series on building and shipping AI systems that actually work — RAG pipelines, agents, observability, and MLOps. No theory, no toy examples. Real patterns, real failures, real fixes.

Up next

AI Agents in Production — What Actually Breaks

After studying production AI systems, reading real post-mortems, and building pipelines on enterprise data — one pattern stands out. Everyone talks about building agents. Nobody talks about what break