Overview
While Large Language Models (LLMs) have revolutionized natural language processing with their ability to generate coherent text and reason across domains, they face fundamental limitations. LLMs can only draw on knowledge that was encoded in their parameters during training, which can lead to hallucinations, outdated information, and an inability to access domain-specific knowledge.
Retrieval-Augmented Generation (RAG) addresses these limitations by combining the generative power of LLMs with the ability to retrieve and leverage external knowledge sources. By dynamically accessing relevant information during inference, RAG systems enhance model outputs with accuracy, currency, and verifiability that pure LLMs cannot achieve alone.
This lesson explores the foundations of RAG, its components, implementation approaches, and practical applications. We'll build intuitive understanding through analogies and visualizations, then gradually introduce more technical depth and hands-on implementation.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the motivation and principles behind Retrieval-Augmented Generation
- Describe the core components of RAG systems: embedding generation, chunking, vector storage, retrieval, and generation
- Implement a basic RAG system using popular libraries and tools
- Evaluate and improve RAG performance through rerankers and other optimization techniques
- Apply RAG to specific use cases and domains
- Compare different RAG architectures and understand their trade-offs
Why RAG? Understanding the Need for External Knowledge
The Knowledge Access Problem
Large Language Models face several key limitations regarding knowledge:
- Static Knowledge: LLMs only "know" what they learned during training
- Knowledge Cutoff: Information after the training cutoff is inaccessible
- Hallucinations: Models may generate plausible but factually incorrect information
- Lack of Citations: Difficult to verify the source of generated information
- Domain Knowledge Gaps: Limited expertise in specialized domains
Analogy: The Expert Consultant with a Library
Think of an LLM as an expert consultant who has read many books but:
- Cannot access any books published after completing their education
- Must rely solely on memory for all facts and details
- Has no way to verify their recollection against original sources
- Cannot easily expand knowledge into new specialized domains
RAG transforms this consultant by providing:
- A vast, current library that can be instantly searched
- The ability to read specific sources before responding
- Citations to verify information
- Domain-specific resources that can be added on demand
From Memory-Only to Memory+Retrieval
| Aspect | LLM Only | LLM + RAG |
|---|---|---|
| Knowledge Source | Parameters (frozen at training) | Parameters + External documents |
| Information Currency | Training cutoff date | As current as the knowledge base |
| Factual Accuracy | Varies, prone to hallucination | Higher, based on retrieved context |
| Verifiability | Low, no citations | High, can cite sources |
| Domain Adaptation | Requires fine-tuning | Add domain documents to knowledge base |
| Computation | Lower (generation only) | Higher (retrieval + generation) |
| Memory Usage | Fixed model size | Model + vector database |
The RAG Architecture: A High-Level View
Core Components
RAG systems consist of two main phases:
- Indexing Phase: Prepare documents for efficient retrieval
- Query Phase: Retrieve relevant information and augment LLM generation
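Before diving into each component, here is a framework-agnostic sketch of the two phases; `embed`, `vector_store`, `chunker`, and `llm` are hypothetical placeholders for the concrete components introduced throughout this lesson:

```python
# High-level sketch of the two RAG phases. `embed`, `vector_store`, `chunker`,
# and `llm` are hypothetical placeholders, not a specific library's API.

def index_documents(documents, chunker, embed, vector_store):
    """Indexing phase: chunk each document, embed the chunks, store the vectors."""
    for doc in documents:
        for chunk in chunker(doc):
            vector_store.add(vector=embed(chunk), payload=chunk)

def answer_query(query, embed, vector_store, llm, top_k=4):
    """Query phase: retrieve the most relevant chunks, then generate a grounded answer."""
    context_chunks = vector_store.search(embed(query), k=top_k)
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(context_chunks)
        + f"\n\nQuestion: {query}"
    )
    return llm(prompt)
```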
Document Processing and Embedding Generation
Document Chunking: The Art of Segmentation
Effective RAG requires breaking down documents into appropriately sized pieces (chunks) that:
- Are small enough to be processed efficiently
- Are large enough to retain meaningful context
- Preserve semantic coherence of the content
Interactive Visualization: Explore how tokenization affects chunking strategies.
Common Chunking Strategies
- Fixed-Size Chunking: Split by character or token count
  - Simple, but may break semantic units
- Semantic Chunking: Split based on document structure
  - Paragraphs, sections, or headings
  - Preserves natural document organization
- Recursive Chunking: Split hierarchically
  - Preserves relationships between chunks
  - Handles nested document structures
- Sliding Window Chunking: Create overlapping chunks
  - Ensures context is preserved across chunk boundaries
  - Increases storage requirements (see the sketch after this list)
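To make fixed-size and sliding-window chunking concrete, here is a minimal character-based chunker; the `chunk_size` and `overlap` values are illustrative defaults, not recommendations:

```python
# Minimal sketch of fixed-size chunking with overlap (sliding window).
# chunk_size and overlap are illustrative; tune them for your embedding model.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` characters of context
    return chunks

sample = "RAG combines retrieval with generation to ground answers in sources. " * 40
print(len(chunk_text(sample)))  # number of chunks produced for this sample
```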
Embedding Generation: Turning Text into Vectors
Embeddings are numerical representations of text in a high-dimensional vector space, where semantic similarity is captured by vector proximity.
Understanding Vector Similarity in RAG
The core of RAG retrieval is finding documents whose embeddings are close to the query embedding in vector space; the sketch below makes this concrete.
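As a minimal example (assuming the sentence-transformers package is installed; "all-MiniLM-L6-v2" is just one commonly used small embedding model), the code below embeds a query and a few passages and ranks the passages by cosine similarity:

```python
# Minimal sketch: embed a query and a few passages, then rank the passages by
# cosine similarity to the query. "all-MiniLM-L6-v2" is one common small model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "RAG retrieves external documents to ground LLM answers.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Vector databases index embeddings for similarity search.",
]
query = "How does retrieval-augmented generation reduce hallucinations?"

passage_vecs = model.encode(passages)   # one vector per passage
query_vec = model.encode(query)         # single query vector

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```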
Choosing the Right Embedding Model
| Model | Dimensions | Context Length | Performance | Speed | Use Case |
|---|---|---|---|---|---|
| OpenAI ada-002 | 1536 | 8192 | High | Medium | General purpose |
| BERT | 768 | 512 | Medium | Fast | Domain-specific |
| E5-large | 1024 | 512 | High | Medium | Retrieval-optimized |
| Sentence-T5 | 768 | 512 | High | Fast | Multilingual |
| GTE-large | 1024 | 512 | Very High | Medium | MTEB leader |
| INSTRUCTOR | 768 | 512 | High | Medium | Instruction-tuned |
| BGE | 1024 | 512 | Very High | Medium | Chinese + English |
Analogy: Library Catalog System
Think of embeddings like a modern library catalog system:
- Each document is assigned coordinates in a multidimensional space
- Similar documents are placed near each other
- When someone asks a question, the system finds documents at coordinates similar to the question
- This allows quick retrieval without having to read through all documents
Vector Storage and Indexing
Vector databases store and index embeddings for efficient similarity search:
- Exact Nearest Neighbor Search:
  - Computes distances between the query and every stored vector
  - Accurate but slow for large collections
- Approximate Nearest Neighbor (ANN) Search:
  - Uses algorithms like HNSW, IVF, or LSH
  - Trades perfect accuracy for speed
  - Enables scalable similarity search (contrasted with exact search in the sketch after this list)
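The sketch below contrasts the two approaches using FAISS, assuming the faiss package is installed; the vector dimension, corpus size, and HNSW parameter are illustrative only:

```python
# Minimal sketch contrasting exact and approximate nearest-neighbor search
# with FAISS. Vector dimension and index parameters are illustrative only.
import numpy as np
import faiss

d = 128                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)
vectors = rng.random((10_000, d)).astype("float32")
query = rng.random((1, d)).astype("float32")

# Exact search: compares the query against every stored vector.
flat_index = faiss.IndexFlatL2(d)
flat_index.add(vectors)
exact_dist, exact_ids = flat_index.search(query, 5)

# Approximate search: an HNSW graph trades a little accuracy for speed at scale.
hnsw_index = faiss.IndexHNSWFlat(d, 32)   # 32 = graph connectivity parameter
hnsw_index.add(vectors)
approx_dist, approx_ids = hnsw_index.search(query, 5)

print("exact top-5:  ", exact_ids[0])
print("approx top-5: ", approx_ids[0])
```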
Common Vector Database Options
| Database | Type | ANN Algorithms | Hosting Options | Features | Use Case |
|---|---|---|---|---|---|
| Pinecone | Managed | HNSW | Cloud-only | Metadata filtering, namespaces | Production ready |
| Weaviate | Full-featured | HNSW | Self-host/Cloud | Multi-modal, classes, schema | Complex data models |
| Chroma | Lightweight | HNSW | Self-host/Embedded | Simple API, Python-native | Development |
| FAISS | Library | Multiple | Self-host | High performance, customizable | Research |
| Qdrant | Full-featured | HNSW | Self-host/Cloud | Payload filtering, clustering | Production |
| Milvus | Full-featured | Multiple | Self-host/Cloud | Hybrid search, sharding | Large scale |
| pgvector | Database extension | IVF | Self-host | PostgreSQL integration | Existing PostgreSQL users |
Retrieval Mechanisms: Finding the Right Context
Vector Search: Similarity Metrics
Several distance measures can be used to compare query and document vectors (all three are computed numerically in the sketch after this list):
- Cosine Similarity:
  - Measures the angle between vectors
  - Scale-invariant
  - Most common for text embeddings
  - Formula: cos(A, B) = (A · B) / (‖A‖ ‖B‖)
- Euclidean Distance:
  - Measures straight-line distance between vectors
  - Affected by vector magnitude
  - Formula: d(A, B) = √(Σᵢ (Aᵢ − Bᵢ)²)
- Dot Product:
  - Sum of the element-wise products of the vectors
  - Not normalized, so vector magnitude matters
  - Formula: A · B = Σᵢ Aᵢ Bᵢ
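To ground the formulas, here is a small NumPy example comparing two vectors that point in the same direction but differ in magnitude: cosine similarity ignores the scale difference, while Euclidean distance and the raw dot product do not:

```python
# Numeric comparison of the three similarity/distance measures with plain NumPy.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

dot = float(np.dot(a, b))                                  # 28.0
euclidean = float(np.linalg.norm(a - b))                   # ≈ 3.742
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))     # 1.0 (identical direction)

print(f"dot product:        {dot:.3f}")
print(f"euclidean distance: {euclidean:.3f}")
print(f"cosine similarity:  {cosine:.3f}")
```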
Beyond Simple Retrieval: Advanced Techniques
1. Hybrid Search
Combines semantic search with keyword-based (sparse) search:
- Semantic search captures meaning
- Keyword search captures specific terms
- Combined for better precision and recall
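One common way to fuse the two result lists is reciprocal rank fusion (RRF). The sketch below fuses two hypothetical rankings; k = 60 is a conventional constant, not a tuned value:

```python
# Minimal sketch of reciprocal rank fusion (RRF) for hybrid search.
# The two input rankings are hypothetical; k = 60 is a conventional constant.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc3", "doc1", "doc7", "doc2"]      # from vector (semantic) search
keyword_ranking = ["doc1", "doc9", "doc3", "doc4"]    # from BM25 / keyword search

print(reciprocal_rank_fusion([dense_ranking, keyword_ranking]))
# doc1 and doc3 rise to the top because both retrievers rank them highly
```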
2. Reranking
Reranking applies a second, more computationally intensive model to improve retrieval quality:
- Initial retrieval fetches candidate documents (often 20-100)
- Reranker evaluates each candidate more thoroughly
- Documents are reordered based on relevance scores
Popular rerankers:
- Cohere Rerank
- BGE Reranker
- uniCOIL
- MonoT5
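As one minimal reranking sketch (assuming the sentence-transformers package; the checkpoint name is a widely used public cross-encoder, not the only choice), a cross-encoder can rescore first-stage candidates against the query:

```python
# Minimal reranking sketch with a cross-encoder from sentence-transformers.
# The model name is one widely used public checkpoint, not the only option.
from sentence_transformers import CrossEncoder

query = "How does RAG reduce hallucinations?"
candidates = [  # e.g. the top results from first-stage vector search
    "RAG grounds generation in retrieved documents, reducing fabricated facts.",
    "The Eiffel Tower was completed in 1889.",
    "Retrieval augmentation lets the model cite external sources.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Reorder candidates by the cross-encoder's relevance scores
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.2f}  {doc}")
```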
3. Query Transformation
Techniques to improve the query before retrieval:
- Query Expansion:
  - Add related terms to the query
  - Example: "car" → "car automobile vehicle"
- HyDE (Hypothetical Document Embeddings):
  - Use an LLM to generate a hypothetical "perfect" document for the query
  - Embed this document and use it as the retrieval query
- Multi-Query Retrieval:
  - Generate multiple perspectives on the query
  - Combine the retrieval results (sketched below)
  - Increases recall at the cost of more processing
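A framework-agnostic sketch of multi-query retrieval is shown below; `generate_variants` and `search` are hypothetical stand-ins for an LLM call and a vector-store query, and the fan-out/merge logic is the part that matters:

```python
# Framework-agnostic sketch of multi-query retrieval. `generate_variants` and
# `search` are hypothetical stand-ins for an LLM call and a vector-store query.
from typing import Callable

def multi_query_retrieve(
    query: str,
    generate_variants: Callable[[str], list[str]],  # e.g. an LLM prompted for paraphrases
    search: Callable[[str], list[str]],             # returns document IDs for one query
    top_k: int = 5,
) -> list[str]:
    variants = [query] + generate_variants(query)
    seen: set[str] = set()
    merged: list[str] = []
    for variant in variants:
        for doc_id in search(variant):
            if doc_id not in seen:   # deduplicate across the variant result lists
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]

# Toy usage with canned results
docs = multi_query_retrieve(
    "benefits of RAG",
    generate_variants=lambda q: ["advantages of retrieval augmentation"],
    search=lambda q: ["doc1", "doc2"] if "RAG" in q else ["doc2", "doc3"],
)
print(docs)  # ['doc1', 'doc2', 'doc3']
```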
Prompt Engineering for RAG
Constructing Effective Prompts
The prompt structure for RAG typically includes:
- System Instructions: Define the role and behavior of the assistant
- Retrieved Context: External knowledge from vector search
- User Query: The original question or instruction
- Response Format: Structure for the model's output
Example RAG Prompt Template
```python
def create_rag_prompt(query, context_docs, system_instruction=None):
    """
    Create a RAG prompt with retrieved context.

    Args:
        query: User's query
        context_docs: Retrieved documents/passages
        system_instruction: Optional system instruction

    Returns:
        Formatted prompt for the LLM
    """
    # Default system instruction if not provided
    if system_instruction is None:
        system_instruction = (
            "You are a helpful, accurate assistant. "
            "Use the provided context to answer the user's question. "
            "If the answer cannot be determined from the context, say "
            "'I don't have enough information to answer this question.' "
            "Always cite your sources by referring to the document ID [docX] "
            "for any information you use."
        )

    # Format retrieved documents, separating passages with blank lines
    formatted_context = "\n\n".join(
        f"[doc{i}] {doc.text}" for i, doc in enumerate(context_docs)
    )

    # Construct the full prompt
    prompt = f"""{system_instruction}

Context information:
{formatted_context}

User Question: {query}

Your response:"""
    return prompt
```
Implementing a Basic RAG System
Setting Up a RAG Pipeline
Let's implement a complete RAG system using popular libraries:
```python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.llms import OpenAI

# Set up environment
os.environ["OPENAI_API_KEY"] = "sk-your-api-key"  # Replace with your API key

# 1. Document Loading
def load_documents(directory_path):
    loader = DirectoryLoader(directory_path, glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
    return documents

# 2. Document Chunking
def chunk_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    return chunks

# 3. Create Vector Store
def create_vector_store(chunks):
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
    print("Vector store created")
    return vector_store

# 4. Set up Retrieval QA Chain
def setup_qa_chain(vector_store):
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=retriever
    )
    return qa_chain

# 5. Answer Questions
def answer_question(qa_chain, query):
    response = qa_chain({"query": query})
    return response["result"]

# Main Pipeline
def run_rag_pipeline(directory_path, query):
    documents = load_documents(directory_path)
    chunks = chunk_documents(documents)
    vector_store = create_vector_store(chunks)
    qa_chain = setup_qa_chain(vector_store)
    answer = answer_question(qa_chain, query)
    return answer

# Example usage
if __name__ == "__main__":
    result = run_rag_pipeline("./documents", "What are the key features of RAG systems?")
    print(result)
```
More Sophisticated RAG Implementation
Here's a more advanced implementation with reranking:
```python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Set up environment (replace with your API keys)
os.environ["OPENAI_API_KEY"] = "sk-your-api-key"
os.environ["COHERE_API_KEY"] = "your-cohere-key"

# Define document loaders for different file types
def get_loader_for_file(file_path):
    if file_path.endswith(".pdf"):
        return PDFMinerLoader(file_path)
    else:
        return TextLoader(file_path)

# 1. Enhanced Document Loading
def load_documents(directory_path):
    loader = DirectoryLoader(
        directory_path,
        glob="**/*.*",
        loader_cls=lambda file_path: get_loader_for_file(file_path)
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
    return documents

# 2. Improved Document Chunking
def chunk_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""],
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    return chunks

# 3. Create Vector Store
def create_vector_store(chunks):
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
    print("Vector store created")
    return vector_store

# 4. Set up Advanced Retrieval with Reranking
def setup_advanced_retriever(vector_store):
    # First-stage retrieval (over-retrieve)
    base_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

    # Add reranking for improved precision
    compressor = CohereRerank()
    reranking_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )

    # Add query expansion for improved recall
    llm = ChatOpenAI(temperature=0)
    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=reranking_retriever,
        llm=llm
    )
    return multi_query_retriever

# 5. Custom RAG Prompt Template
def create_rag_prompt_template():
    template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}

Context: {context}

Answer:"""
    return PromptTemplate(
        template=template,
        input_variables=["context", "question"]
    )

# 6. Setup QA Chain with Custom Prompt
def setup_qa_chain(retriever):
    prompt = create_rag_prompt_template()
    qa_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(temperature=0, model="gpt-4"),
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt}
    )
    return qa_chain

# 7. Answer Questions
def answer_question(qa_chain, query):
    response = qa_chain({"query": query})
    return response["result"]

# Main Pipeline
def run_advanced_rag_pipeline(directory_path, query):
    documents = load_documents(directory_path)
    chunks = chunk_documents(documents)
    vector_store = create_vector_store(chunks)
    retriever = setup_advanced_retriever(vector_store)
    qa_chain = setup_qa_chain(retriever)
    answer = answer_question(qa_chain, query)
    return answer
```
RAG Evaluation and Optimization
Evaluating RAG System Performance
Effective RAG evaluation should consider multiple dimensions:
- Retrieval Metrics:
  - Precision: Are the retrieved documents relevant?
  - Recall: Are all relevant documents retrieved?
  - Mean Average Precision (MAP): How good is the ranking?
- Generation Quality Metrics:
  - Faithfulness: Does the output align with the retrieved information?
  - Answer Relevance: Does the output address the query?
  - Groundedness: Is the output supported by evidence?
- End-to-End Metrics:
  - Correctness: Is the final answer factually correct?
  - Helpfulness: Does it solve the user's problem?
  - Latency: Is the retrieval + generation time acceptable?
```python
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)
from ragas.langchain import RagasEvaluatorChain
from datasets import Dataset

# Example evaluation data
eval_data = [
    {
        "query": "What are the key components of RAG?",
        "contexts": [
            "RAG consists of retrieval and generation components...",
            "..."
        ],
        "answer": "The key components of RAG are the retriever, which finds relevant documents, and the generator, which produces answers based on retrieved context.",
        "ground_truth": "RAG systems comprise retrieval mechanisms that find documents and generation components that produce answers."
    },
    # More evaluation examples
]

# Convert to HuggingFace dataset
dataset = Dataset.from_list(eval_data)

# Create evaluation chains
faithfulness_chain = RagasEvaluatorChain(metric=faithfulness)
answer_relevancy_chain = RagasEvaluatorChain(metric=answer_relevancy)
context_relevancy_chain = RagasEvaluatorChain(metric=context_relevancy)
context_recall_chain = RagasEvaluatorChain(metric=context_recall)
context_precision_chain = RagasEvaluatorChain(metric=context_precision)

# Example evaluation function
def evaluate_rag_system(dataset):
    results = {
        "faithfulness": [],
        "answer_relevancy": [],
        "context_relevancy": [],
        "context_recall": [],
        "context_precision": []
    }

    for item in dataset:
        # Evaluate each metric
        faith_result = faithfulness_chain.run(
            query=item["query"],
            answer=item["answer"],
            contexts=item["contexts"]
        )
        results["faithfulness"].append(faith_result["score"])

        # Similar for other metrics...

    # Calculate average scores (skip metrics that have no scores yet)
    avg_results = {k: sum(v) / len(v) for k, v in results.items() if v}
    return avg_results
```
Optimizing RAG Performance
Interactive Chunking Strategy Optimization
Chunk size strongly affects RAG performance: chunks that are too small lose context, while chunks that are too large dilute relevance and slow down retrieval. The sketch below shows one way to compare chunk sizes empirically against retrieval precision, answer quality, and processing speed.
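As a starting point, here is a framework-agnostic sketch of a chunk-size sweep; `build_rag_pipeline` and `evaluate` are hypothetical stand-ins for your own indexing and evaluation code (for example, the pipeline functions and RAGAS metrics shown earlier):

```python
# Hypothetical sketch of a chunk-size sweep. `build_rag_pipeline` and
# `evaluate` are placeholders for your own indexing and evaluation code.
import time

def sweep_chunk_sizes(documents, eval_questions, build_rag_pipeline, evaluate,
                      chunk_sizes=(256, 512, 1024, 2048)):
    """Index the same corpus at several chunk sizes and compare quality and latency."""
    results = []
    for size in chunk_sizes:
        pipeline = build_rag_pipeline(documents, chunk_size=size, chunk_overlap=size // 10)
        start = time.perf_counter()
        quality = evaluate(pipeline, eval_questions)   # e.g. average faithfulness score
        latency = (time.perf_counter() - start) / len(eval_questions)
        results.append({"chunk_size": size, "quality": quality, "avg_latency_s": latency})
    # Sort by quality so the best-performing configuration comes first
    return sorted(results, key=lambda r: r["quality"], reverse=True)
```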