Problem:
LLM = Large Language Model
> Limitations of the base architecture:
> Knowledge is static — frozen at training cutoff
> No access to private or proprietary data
> Context window is finite (though modern models have very large windows)
> Can hallucinate facts it doesn't actually know
Solution:
RAG = Retrieval-Augmented Generation
= a pattern that addresses those limitations
by giving the LLM access to an external knowledge source
at inference time,
without retraining.
Here's the flow:
User Query
│
▼
[Retriever] ──── searches ────► Vector DB / Document Store
│ │
│◄──── returns top-k chunks ────────┘
│
▼
[Augmented Prompt]
= System prompt + retrieved chunks + user query
│
▼
[LLM] ──► generates grounded response
How it works in practice:
Indexing (offline):
Documents are chunked, embedded into vectors using an embedding model, and
stored in a vector database (e.g., Pinecone, pgvector, Weaviate).
Retrieval (at query time):
The user's query is embedded, and a similarity search finds the most relevant chunks.
Augmentation:
Those chunks are injected into the LLM's context window alongside the query.
Generation:
The LLM generates a response grounded in the retrieved content.
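The four steps above can be sketched in a few lines of Python. The hand-written toy vectors stand in for a real embedding model, and the chunk texts and index structure are invented for illustration; in production the vectors would come from an embedding model and live in a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the top-k chunk texts by cosine similarity to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def build_prompt(system, chunks, query):
    """Augmented prompt = system prompt + retrieved chunks + user query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"

# Toy index: in a real system these vectors come from an embedding model
# and are stored in a vector DB (Pinecone, pgvector, Weaviate, ...).
index = [
    {"text": "Refunds are processed within 5 business days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.",      "vec": [0.0, 0.2, 0.9]},
    {"text": "Refund requests require an order number.",      "vec": [0.8, 0.3, 0.1]},
]

chunks = retrieve([1.0, 0.2, 0.0], index, k=2)  # embedded user query (toy vector)
prompt = build_prompt("Answer using only the context.", chunks, "How do refunds work?")
```

The query vector is closest to the two refund-related chunks, so only those land in the prompt; the unrelated holiday chunk is never shown to the LLM.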
Why RAG instead of just fine-tuning?
| | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Easy — just re-index | Requires retraining |
| Private/dynamic data | ✅ Natural fit | ❌ Poor fit |
| Factual grounding | High (citable sources) | Lower |
| Cost | Low (inference only) | High (training compute) |
| Reasoning style | Unchanged | Can be adapted |
Architecturally, RAG is
essentially a retrieval layer
that sits between the user and the LLM;
it doesn't change the model's weights or architecture at all.
It simply makes the context window smarter by
filling it with relevant, up-to-date information on demand.
In production systems,
RAG is often combined with other techniques like
query rewriting,
re-ranking (a second model scores retrieved chunks for relevance), and
agentic tool use where the LLM can decide when to retrieve.
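A minimal sketch of such a production pipeline stage: a trivial normaliser stands in for LLM-based query rewriting, and a keyword-overlap scorer stands in for the second-stage re-ranking model (in practice this would be a cross-encoder). All names and texts here are illustrative.

```python
def rewrite_query(query):
    # Stand-in for an LLM-based query rewriter; here we just normalise.
    return query.lower().strip("?").strip()

def rerank(query, chunks, top_n=2):
    """Second-stage scoring: reorder retrieved chunks by relevance to the query.
    Word-overlap is a toy stand-in for a cross-encoder relevance model."""
    q_words = set(query.split())
    def score(chunk):
        return len(q_words & set(chunk.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:top_n]

# Candidates as returned by first-stage retrieval (illustrative texts)
candidates = [
    "shipping times vary by region",
    "refunds are issued to the original payment method",
    "refunds for digital goods are not available",
]
q = rewrite_query("Refunds?")
top = rerank(q, candidates, top_n=2)
```

The two-stage split is the key design point: first-stage retrieval is cheap and recall-oriented, while the re-ranker is more expensive but only scores a small candidate set.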
Ecosystem note: RAG pipelines are commonly built with frameworks such as LangChain and LlamaIndex; document retrieval is the typical application.
RAG (Retrieval Augmented Generation)
bridges the gap between
what an LLM was trained on and
what a business actually needs — real-time, proprietary, up-to-date knowledge.
Core working principle:
1. Ingestion:
Source documents (PDFs, databases, web pages) are chunked and
converted into vector embeddings, then
stored in a vector database (e.g. Pinecone, Weaviate, ChromaDB).
2. Retrieval:
When a user submits a query, it is also embedded and
used to perform a semantic similarity search against the vector store
— returning the most relevant document chunks.
3. Augmentation:
These retrieved chunks are injected into the LLM's prompt as context.
4. Generation:
The LLM generates a response grounded in the retrieved context, reducing hallucination.
Key technical challenges in enterprise deployment:
• Chunking strategy:
Chunk size and overlap significantly affect retrieval quality.
Too small loses context; too large dilutes relevance.
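The trade-off above can be made concrete with a sliding-window chunker; the fixed word-level windows below are a simplification (real chunkers usually split on tokens or sentence boundaries):

```python
def chunk_words(words, size, overlap):
    """Split a word list into fixed-size chunks with overlapping windows.
    Overlap preserves context that would otherwise be cut at chunk boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step) if words[i:i + size]]

words = list("abcdefghij")  # single letters stand in for words/tokens
chunks = chunk_words(words, size=4, overlap=1)
```

With `size=4, overlap=1`, each chunk repeats the last word of the previous one, so a sentence straddling a boundary is still partially visible in both chunks.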
• Embedding model selection:
The embedding model must match the domain
— general-purpose embeddings underperform on financial or legal documents.
• Retrieval quality:
Hybrid search (combining dense vector search with sparse BM25)
typically outperforms
pure vector search
in enterprise settings.
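One common way to combine the two rankings is Reciprocal Rank Fusion (RRF), which merges ranked lists without having to calibrate the raw dense and BM25 scores against each other. A sketch, with invented document ids:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge multiple ranked lists of doc ids.
    Each doc scores sum(1 / (k + rank)) over every list it appears in;
    k=60 is the conventional smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d2", "d1", "d3"]   # dense vector search ranking
sparse = ["d1", "d2", "d4"]   # sparse BM25 ranking
fused = rrf([dense, sparse])
```

Documents that rank well in both lists (`d1`, `d2`) float to the top, while documents seen by only one retriever are demoted rather than dropped.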
• Latency and scalability:
Production RAG must handle concurrent queries with low latency
— requiring careful indexing and caching strategies.
• Data freshness:
Enterprise data changes.
A re-indexing pipeline must be designed to keep the vector store current.
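One cheap way to keep the store current is content hashing: re-embed only documents whose content has actually changed. The `plan_reindex` helper and doc ids below are illustrative, not a specific vector-DB API.

```python
import hashlib

def doc_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(docs, indexed_hashes):
    """Decide which docs need (re-)embedding.
    `docs` maps doc id -> current text; `indexed_hashes` maps doc id -> hash
    currently recorded alongside the vectors in the store."""
    to_upsert = [doc_id for doc_id, text in docs.items()
                 if indexed_hashes.get(doc_id) != doc_hash(text)]
    stale = [doc_id for doc_id in indexed_hashes if doc_id not in docs]
    return to_upsert, stale

docs = {"a": "v2 of policy", "b": "unchanged text"}
indexed = {"a": doc_hash("v1 of policy"),
           "b": doc_hash("unchanged text"),
           "c": doc_hash("deleted upstream")}
to_upsert, stale = plan_reindex(docs, indexed)
```

Only `"a"` (changed) is re-embedded and `"c"` (removed at the source) is flagged for deletion, so embedding cost scales with churn rather than corpus size.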
• Security and access control:
In regulated industries (e.g. finance),
retrieved chunks must respect document-level permissions
— users should not retrieve data they are not authorised to see.
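A simple way to express this is to attach an ACL to each chunk at indexing time and filter at retrieval time; the chunk shape and group names below are illustrative:

```python
def filter_by_permission(chunks, user_groups):
    """Drop retrieved chunks the user is not authorised to see.
    Each chunk carries an ACL (set of allowed groups) inherited from
    its source document at indexing time."""
    return [c for c in chunks if c["acl"] & user_groups]

retrieved = [
    {"text": "Q3 revenue was up 12%", "acl": {"finance"}},
    {"text": "Office wifi password",  "acl": {"all-staff"}},
]
visible = filter_by_permission(retrieved, user_groups={"all-staff"})
```

In practice the filter is usually pushed down into the vector store as a metadata filter on the query itself, so the top-k budget is not wasted on chunks the user could never see.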

