Problem: 

   LLM = Large Language Model
   > Base architecture
     > Limitations
       > Knowledge is static — frozen at training cutoff
       > No access to private or proprietary data
       > Context window is finite (though modern models have very large windows)
       > Can hallucinate facts it doesn't actually know

Solution:

   RAG = Retrieval-Augmented Generation

RAG = Retrieval-Augmented Generation
    = a pattern that addresses those limitations 
      by giving the LLM access to an external knowledge source 
         at inference time, 
         without retraining.

Here's the flow:

User Query
    │
    ▼
[Retriever] ──── searches ────► Vector DB / Document Store
    │                                   │
    │◄──── returns top-k chunks ────────┘
    │
    ▼
[Augmented Prompt]
= System prompt + retrieved chunks + user query
    │
    ▼
[LLM] ──► generates grounded response

How it works in practice:

Indexing (offline): 
   Documents are chunked, embedded into vectors using an embedding model, and 
   stored in a vector database (e.g., Pinecone, pgvector, Weaviate).

Retrieval (at query time):
   The user's query is embedded, and a similarity search finds the most relevant chunks.

Augmentation:
   Those chunks are injected into the LLM's context window alongside the query.

Generation:
   The LLM generates a response grounded in the retrieved content.
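The four stages above can be sketched end to end in plain Python. This is a toy: word counts stand in for a real embedding model, an in-memory list stands in for the vector database, and the chunk texts and prompt template are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": lower-cased word counts. A real system would
    # call an embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing (offline): chunk, embed, store.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Our headquarters are located in Berlin.",
    "Support is available by email around the clock.",
]
index = [(c, embed(c)) for c in chunks]

# 2. Retrieval (query time): embed the query, keep the top-k chunks.
def retrieve(query: str, k: int = 2) -> list:
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# 3. Augmentation: inject the retrieved chunks into the prompt.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# 4. Generation: this prompt would now be sent to the LLM.
prompt = build_prompt("What is the refund policy?")
```

In a real pipeline, `embed` becomes a call to an embedding model and `index` becomes a vector database client; the control flow stays the same.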

Why RAG instead of just fine-tuning?

|                      | RAG                    | Fine-tuning             |
|----------------------|------------------------|-------------------------|
| Knowledge updates    | Easy — just re-index   | Requires retraining     |
| Private/dynamic data | ✅ Natural fit         | ❌ Poor fit             |
| Factual grounding    | High (citable sources) | Lower                   |
| Cost                 | Low (inference only)   | High (training compute) |
| Reasoning style      | Unchanged              | Can be adapted          |

Architecturally, RAG is 
   essentially a retrieval layer 
   that sits between the user and the LLM 
   — 
   it doesn't change the model's weights or architecture at all. 
   It simply makes the context window smarter by 
      filling it with relevant, up-to-date information on demand.

In production systems, 
RAG is often combined with other techniques like 
   query rewriting, 
   re-ranking (a second model scores retrieved chunks for relevance), and 
   agentic tool use where the LLM can decide when to retrieve.
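The retrieve-then-rerank idea can be sketched in a few lines. The scoring function below is a deliberately crude stand-in (fraction of query terms present in the chunk); a production re-ranker would run a cross-encoder model over each (query, chunk) pair. The candidate texts are invented for illustration.

```python
def rerank(query: str, candidates: list, top_n: int = 3) -> list:
    q_terms = set(query.lower().split())
    def score(chunk: str) -> float:
        # Stand-in scorer: fraction of query terms present in the chunk.
        # A real re-ranker would score (query, chunk) with a cross-encoder.
        return len(q_terms & set(chunk.lower().split())) / len(q_terms)
    return sorted(candidates, key=score, reverse=True)[:top_n]

# Candidates as they might come back from a first-stage vector search.
candidates = [
    "Invoices are archived for seven years.",
    "Refunds are processed within five business days.",
    "Refunds for annual plans are prorated.",
]
top = rerank("how are refunds processed", candidates, top_n=2)
```

The two-stage design matters because the first-stage retriever is optimised for speed over millions of chunks, while the re-ranker can afford a more expensive model on just the handful of candidates.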

RAG = Retrieval-Augmented Generation
> pipelines 
  > LangChain 
  > LlamaIndex 

Applications
> document retrieval

RAG 
   bridges the gap between 
      what an LLM was trained on and 
      what a business actually needs — real-time, proprietary, up-to-date knowledge.

Core working principle:

   1. Ingestion: 
      Source documents (PDFs, databases, web pages) are chunked and 
      converted into vector embeddings, then 
      stored in a vector database (e.g. Pinecone, Weaviate, ChromaDB).

   2. Retrieval: 
      When a user submits a query, it is also embedded and 
      used to perform a semantic similarity search against the vector store 
      — returning the most relevant document chunks.

   3. Augmentation: 
      These retrieved chunks are injected into the LLM's prompt as context.

   4. Generation: 
      The LLM generates a response grounded in the retrieved context, reducing hallucination.

Key technical challenges in enterprise deployment:

   • Chunking strategy: 
     Chunk size and overlap significantly affect retrieval quality. 
     Too small loses context; too large dilutes relevance.
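A minimal sketch of fixed-size chunking with overlap, character-based for simplicity (production systems usually split on tokens or on semantic boundaries such as headings and sentences); the sizes and sample text are illustrative.

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list:
    # Slide a window of `size` characters forward by `size - overlap`,
    # so each chunk repeats the tail of the previous one. That repeated
    # tail is what preserves context across chunk boundaries.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "RAG quality depends heavily on how source documents are split into chunks."
pieces = chunk(doc, size=30, overlap=8)
```

Tuning `size` and `overlap` against a retrieval-quality benchmark on your own documents is usually worth the effort, since the sweet spot is domain-dependent.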

   • Embedding model selection: 
     The embedding model must match the domain 
     — general-purpose embeddings underperform on financial or legal documents.

   • Retrieval quality: 
     Hybrid search (combining dense vector search with sparse BM25) 
     typically outperforms 
     pure vector search 
     in enterprise settings.
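The blending idea can be sketched as scoring every document both ways and combining the scores. Everything here is simplified on purpose: the dense side fakes vector similarity with word-count cosine, the sparse side is plain TF-IDF rather than true BM25, and the 50/50 weight is an arbitrary assumption.

```python
import math
import re
from collections import Counter

DOCS = [
    "Quarterly revenue grew 12 percent year over year.",
    "The API rate limit is 100 requests per minute.",
    "Employees accrue 25 days of annual leave.",
]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def dense_score(query, doc):
    # Stand-in for embedding similarity: cosine over word counts.
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(q[w] * d[w] for w in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def sparse_score(query, doc):
    # Simplified TF-IDF: rare terms across the corpus count for more.
    n = len(DOCS)
    d = Counter(tokens(doc))
    score = 0.0
    for w in set(tokens(query)):
        df = sum(1 for other in DOCS if w in tokens(other))
        if df:
            score += d[w] * math.log(1 + n / df)
    return score

def hybrid(query, alpha=0.5):
    def combined(doc):
        return alpha * dense_score(query, doc) + (1 - alpha) * sparse_score(query, doc)
    return max(DOCS, key=combined)
```

Note that raw dense and sparse scores live on different scales, so real systems normalise the two distributions (or use reciprocal rank fusion) before blending.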

   • Latency and scalability: 
     Production RAG must handle concurrent queries with low latency 
     — requiring careful indexing and caching strategies.

   • Data freshness: 
     Enterprise data changes. 
     A re-indexing pipeline must be designed to keep the vector store current.
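One common shape for that pipeline is an incremental check: hash each document's content and re-embed only documents whose hash changed since the last run. The document ids and texts below are invented for illustration, and the actual embed-and-upsert step is left as a placeholder.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current_docs: dict, seen_hashes: dict) -> list:
    """Return the ids of documents that are new or whose content changed."""
    return [
        doc_id for doc_id, text in current_docs.items()
        if seen_hashes.get(doc_id) != content_hash(text)
    ]

# Hashes recorded during the previous indexing run.
seen = {"policy.md": content_hash("Returns accepted within 30 days.")}
current = {
    "policy.md": "Returns accepted within 14 days.",  # changed
    "faq.md": "Shipping is free over 50 EUR.",        # new
}
stale = docs_to_reindex(current, seen)
# `stale` now lists what to re-embed and upsert into the vector store.
```

Deletions need the symmetric check (ids present in `seen_hashes` but absent from `current_docs`), so stale vectors don't linger in the store.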

   • Security and access control: 
     In regulated industries (e.g. finance), 
     retrieved chunks must respect document-level permissions 
     — users should not retrieve data they are not authorised to see.
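A minimal sketch of that filter: each chunk carries an allowed-groups label, and retrieval results are intersected with the querying user's groups before anything reaches the prompt. The group names and chunk metadata shape are illustrative assumptions.

```python
def filter_by_permission(results: list, user_groups: set) -> list:
    # Keep only chunks whose allowed groups intersect the user's groups.
    return [r for r in results if r["allowed_groups"] & user_groups]

results = [
    {"text": "Q3 board minutes ...", "allowed_groups": {"executives"}},
    {"text": "Public pricing page ...", "allowed_groups": {"everyone"}},
]
visible = filter_by_permission(results, user_groups={"everyone", "engineering"})
# Only the public chunk survives for this user.
```

In practice the permission check is usually pushed down into the vector store query as a metadata filter, so unauthorised chunks are never scored or returned at all.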