Overview
While Large Language Models (LLMs) have revolutionized natural language processing with their ability to generate coherent text and reason across domains, they face fundamental limitations. LLMs can only draw on knowledge that was encoded in their parameters during training, which can lead to hallucinations, outdated information, and an inability to access domain-specific knowledge.
Retrieval-Augmented Generation (RAG) addresses these limitations by combining the generative power of LLMs with the ability to retrieve and leverage external knowledge sources. By dynamically accessing relevant information during inference, RAG systems enhance model outputs with accuracy, currency, and verifiability that pure LLMs cannot achieve alone.
This lesson explores the foundations of RAG, its components, implementation approaches, and practical applications. We'll build intuitive understanding through analogies and visualizations, then gradually introduce more technical depth and hands-on implementation.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the motivation and principles behind Retrieval-Augmented Generation
- Describe the core components of RAG systems: embedding generation, chunking, vector storage, retrieval, and generation
- Implement a basic RAG system using popular libraries and tools
- Evaluate and improve RAG performance through rerankers and other optimization techniques
- Apply RAG to specific use cases and domains
- Compare different RAG architectures and understand their trade-offs
Why RAG? Understanding the Need for External Knowledge
The Knowledge Access Problem
Large Language Models face several key limitations regarding knowledge:
- Static Knowledge: LLMs only "know" what they learned during training
- Knowledge Cutoff: Information after the training cutoff is inaccessible
- Hallucinations: Models may generate plausible but factually incorrect information
- Lack of Citations: Difficult to verify the source of generated information
- Domain Knowledge Gaps: Limited expertise in specialized domains
Analogy: The Expert Consultant with a Library
Think of an LLM as an expert consultant who has read many books but:
- Cannot access any books published after completing their education
- Must rely solely on memory for all facts and details
- Has no way to verify their recollection against original sources
- Cannot easily expand knowledge into new specialized domains
RAG transforms this consultant by providing:
- A vast, current library that can be instantly searched
- The ability to read specific sources before responding
- Citations to verify information
- Domain-specific resources that can be added on demand
From Memory-Only to Memory+Retrieval
| Aspect | LLM Only | LLM + RAG |
|---|---|---|
| Knowledge Source | Parameters (frozen at training) | Parameters + External documents |
| Information Currency | Training cutoff date | As current as the knowledge base |
| Factual Accuracy | Varies, prone to hallucination | Higher, based on retrieved context |
| Verifiability | Low, no citations | High, can cite sources |
| Domain Adaptation | Requires fine-tuning | Add domain documents to knowledge base |
| Computation | Lower (generation only) | Higher (retrieval + generation) |
| Memory Usage | Fixed model size | Model + vector database |
The RAG Architecture: A High-Level View
Core Components
RAG systems consist of two main phases:
- Indexing Phase: Prepare documents for efficient retrieval
- Query Phase: Retrieve relevant information and augment LLM generation
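Before diving into each component, here is a framework-agnostic sketch of the two phases; `embed`, `vector_store`, `chunker`, and `llm` are hypothetical placeholders for the concrete components introduced throughout this lesson:

```python
# High-level sketch of the two RAG phases. `embed`, `vector_store`, `chunker`,
# and `llm` are hypothetical placeholders, not a specific library's API.

def index_documents(documents, chunker, embed, vector_store):
    """Indexing phase: chunk each document, embed the chunks, store the vectors."""
    for doc in documents:
        for chunk in chunker(doc):
            vector_store.add(vector=embed(chunk), payload=chunk)

def answer_query(query, embed, vector_store, llm, top_k=4):
    """Query phase: retrieve the most relevant chunks, then generate a grounded answer."""
    context_chunks = vector_store.search(embed(query), k=top_k)
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(context_chunks)
        + f"\n\nQuestion: {query}"
    )
    return llm(prompt)
```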
Document Processing and Embedding Generation
Document Chunking: The Art of Segmentation
Effective RAG requires breaking down documents into appropriately sized pieces (chunks) that:
- Are small enough to be processed efficiently
- Are large enough to retain meaningful context
- Preserve semantic coherence of the content
Interactive Visualization: Explore how tokenization affects chunking strategies.
Common Chunking Strategies
- Fixed-Size Chunking: Split by character or token count
  - Simple, but may break semantic units
- Semantic Chunking: Split based on document structure
  - Paragraphs, sections, or headings
  - Preserves natural document organization
- Recursive Chunking: Split hierarchically
  - Preserves relationships between chunks
  - Handles nested document structures
- Sliding Window Chunking: Create overlapping chunks
  - Ensures context is preserved across chunk boundaries
  - Increases storage requirements (see the sketch after this list)
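To make fixed-size and sliding-window chunking concrete, here is a minimal character-based chunker; the `chunk_size` and `overlap` values are illustrative defaults, not recommendations:

```python
# Minimal sketch of fixed-size chunking with overlap (sliding window).
# chunk_size and overlap are illustrative; tune them for your embedding model.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` characters of context
    return chunks

sample = "RAG combines retrieval with generation to ground answers in sources. " * 40
print(len(chunk_text(sample)))  # number of chunks produced for this sample
```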
Embedding Generation: Turning Text into Vectors
Embeddings are numerical representations of text in a high-dimensional vector space, where semantic similarity is captured by vector proximity.
Understanding Vector Similarity in RAG
The core of RAG retrieval is finding documents whose embeddings are close to the query embedding in vector space; the sketch below makes this concrete.
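As a minimal example (assuming the sentence-transformers package is installed; "all-MiniLM-L6-v2" is just one commonly used small embedding model), the code below embeds a query and a few passages and ranks the passages by cosine similarity:

```python
# Minimal sketch: embed a query and a few passages, then rank the passages by
# cosine similarity to the query. "all-MiniLM-L6-v2" is one common small model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "RAG retrieves external documents to ground LLM answers.",
    "Photosynthesis converts sunlight into chemical energy.",
    "Vector databases index embeddings for similarity search.",
]
query = "How does retrieval-augmented generation reduce hallucinations?"

passage_vecs = model.encode(passages)   # one vector per passage
query_vec = model.encode(query)         # single query vector

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```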
Choosing the Right Embedding Model
| Model | Dimensions | Context Length | Performance | Speed | Use Case |
|---|---|---|---|---|---|
| OpenAI ada-002 | 1536 | 8192 | High | Medium | General purpose |
| BERT | 768 | 512 | Medium | Fast | Domain-specific |
| E5-large | 1024 | 512 | High | Medium | Retrieval-optimized |
| Sentence-T5 | 768 | 512 | High | Fast | Multilingual |
| GTE-large | 1024 | 512 | Very High | Medium | MTEB leader |
| INSTRUCTOR | 768 | 512 | High | Medium | Instruction-tuned |
| BGE | 1024 | 512 | Very High | Medium | Chinese + English |
Analogy: Library Catalog System
Think of embeddings like a modern library catalog system:
- Each document is assigned coordinates in a multidimensional space
- Similar documents are placed near each other
- When someone asks a question, the system finds documents at coordinates similar to the question
- This allows quick retrieval without having to read through all documents
Vector Storage and Indexing
Vector databases store and index embeddings for efficient similarity search:
- Exact Nearest Neighbor Search:
  - Computes distances between the query and every stored vector
  - Accurate but slow for large collections
- Approximate Nearest Neighbor (ANN) Search:
  - Uses algorithms like HNSW, IVF, or LSH
  - Trades perfect accuracy for speed
  - Enables scalable similarity search (contrasted with exact search in the sketch after this list)
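The sketch below contrasts the two approaches using FAISS, assuming the faiss package is installed; the vector dimension, corpus size, and HNSW parameter are illustrative only:

```python
# Minimal sketch contrasting exact and approximate nearest-neighbor search
# with FAISS. Vector dimension and index parameters are illustrative only.
import numpy as np
import faiss

d = 128                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)
vectors = rng.random((10_000, d)).astype("float32")
query = rng.random((1, d)).astype("float32")

# Exact search: compares the query against every stored vector.
flat_index = faiss.IndexFlatL2(d)
flat_index.add(vectors)
exact_dist, exact_ids = flat_index.search(query, 5)

# Approximate search: an HNSW graph trades a little accuracy for speed at scale.
hnsw_index = faiss.IndexHNSWFlat(d, 32)   # 32 = graph connectivity parameter
hnsw_index.add(vectors)
approx_dist, approx_ids = hnsw_index.search(query, 5)

print("exact top-5:  ", exact_ids[0])
print("approx top-5: ", approx_ids[0])
```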
Common Vector Database Options
| Database | Type | ANN Algorithms | Hosting Options | Features | Use Case |
|---|---|---|---|---|---|
| Pinecone | Managed | HNSW | Cloud-only | Metadata filtering, namespaces | Production ready |
| Weaviate | Full-featured | HNSW | Self-host/Cloud | Multi-modal, classes, schema | Complex data models |
| Chroma | Lightweight | HNSW | Self-host/Embedded | Simple API, Python-native | Development |
| FAISS | Library | Multiple | Self-host | High performance, customizable | Research |
| Qdrant | Full-featured | HNSW | Self-host/Cloud | Payload filtering, clustering | Production |
| Milvus | Full-featured | Multiple | Self-host/Cloud | Hybrid search, sharding | Large scale |
| pgvector | Database extension | IVF | Self-host | PostgreSQL integration | Existing PostgreSQL users |
Retrieval Mechanisms: Finding the Right Context
Vector Search: Similarity Metrics
Several distance measures can be used to compare query and document vectors (all three are computed numerically in the sketch after this list):
- Cosine Similarity:
  - Measures the angle between vectors
  - Scale-invariant
  - Most common for text embeddings
  - Formula: cos(A, B) = (A · B) / (‖A‖ ‖B‖)
- Euclidean Distance:
  - Measures straight-line distance between vectors
  - Affected by vector magnitude
  - Formula: d(A, B) = √(Σᵢ (Aᵢ − Bᵢ)²)
- Dot Product:
  - Sum of the element-wise products of the vectors
  - Not normalized, so vector magnitude matters
  - Formula: A · B = Σᵢ Aᵢ Bᵢ
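To ground the formulas, here is a small NumPy example comparing two vectors that point in the same direction but differ in magnitude: cosine similarity ignores the scale difference, while Euclidean distance and the raw dot product do not:

```python
# Numeric comparison of the three similarity/distance measures with plain NumPy.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude

dot = float(np.dot(a, b))                                  # 28.0
euclidean = float(np.linalg.norm(a - b))                   # ≈ 3.742
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))     # 1.0 (identical direction)

print(f"dot product:        {dot:.3f}")
print(f"euclidean distance: {euclidean:.3f}")
print(f"cosine similarity:  {cosine:.3f}")
```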
Beyond Simple Retrieval: Advanced Techniques
1. Hybrid Search
Combines semantic search with keyword-based (sparse) search:
- Semantic search captures meaning
- Keyword search captures specific terms
- Combined for better precision and recall
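One common way to fuse the two result lists is reciprocal rank fusion (RRF). The sketch below fuses two hypothetical rankings; k = 60 is a conventional constant, not a tuned value:

```python
# Minimal sketch of reciprocal rank fusion (RRF) for hybrid search.
# The two input rankings are hypothetical; k = 60 is a conventional constant.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc3", "doc1", "doc7", "doc2"]      # from vector (semantic) search
keyword_ranking = ["doc1", "doc9", "doc3", "doc4"]    # from BM25 / keyword search

print(reciprocal_rank_fusion([dense_ranking, keyword_ranking]))
# doc1 and doc3 rise to the top because both retrievers rank them highly
```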
2. Reranking
Reranking applies a second, more computationally intensive model to improve retrieval quality:
- Initial retrieval fetches candidate documents (often 20-100)
- Reranker evaluates each candidate more thoroughly
- Documents are reordered based on relevance scores
Popular rerankers:
- Cohere Rerank
- BGE Reranker
- uniCOIL
- MonoT5
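As one minimal reranking sketch (assuming the sentence-transformers package; the checkpoint name is a widely used public cross-encoder, not the only choice), a cross-encoder can rescore first-stage candidates against the query:

```python
# Minimal reranking sketch with a cross-encoder from sentence-transformers.
# The model name is one widely used public checkpoint, not the only option.
from sentence_transformers import CrossEncoder

query = "How does RAG reduce hallucinations?"
candidates = [  # e.g. the top results from first-stage vector search
    "RAG grounds generation in retrieved documents, reducing fabricated facts.",
    "The Eiffel Tower was completed in 1889.",
    "Retrieval augmentation lets the model cite external sources.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Reorder candidates by the cross-encoder's relevance scores
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.2f}  {doc}")
```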
3. Query Transformation
Techniques to improve the query before retrieval:
- Query Expansion:
  - Add related terms to the query
  - Example: "car" → "car automobile vehicle"
- HyDE (Hypothetical Document Embeddings):
  - Use an LLM to generate a hypothetical "perfect" document for the query
  - Embed this document and use it as the retrieval query
- Multi-Query Retrieval:
  - Generate multiple perspectives on the query
  - Combine the retrieval results (sketched below)
  - Increases recall at the cost of more processing
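A framework-agnostic sketch of multi-query retrieval is shown below; `generate_variants` and `search` are hypothetical stand-ins for an LLM call and a vector-store query, and the fan-out/merge logic is the part that matters:

```python
# Framework-agnostic sketch of multi-query retrieval. `generate_variants` and
# `search` are hypothetical stand-ins for an LLM call and a vector-store query.
from typing import Callable

def multi_query_retrieve(
    query: str,
    generate_variants: Callable[[str], list[str]],  # e.g. an LLM prompted for paraphrases
    search: Callable[[str], list[str]],             # returns document IDs for one query
    top_k: int = 5,
) -> list[str]:
    variants = [query] + generate_variants(query)
    seen: set[str] = set()
    merged: list[str] = []
    for variant in variants:
        for doc_id in search(variant):
            if doc_id not in seen:   # deduplicate across the variant result lists
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]

# Toy usage with canned results
docs = multi_query_retrieve(
    "benefits of RAG",
    generate_variants=lambda q: ["advantages of retrieval augmentation"],
    search=lambda q: ["doc1", "doc2"] if "RAG" in q else ["doc2", "doc3"],
)
print(docs)  # ['doc1', 'doc2', 'doc3']
```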
Prompt Engineering for RAG
Constructing Effective Prompts
The prompt structure for RAG typically includes:
- System Instructions: Define the role and behavior of the assistant
- Retrieved Context: External knowledge from vector search
- User Query: The original question or instruction
- Response Format: Structure for the model's output
Example RAG Prompt Template
```python
def create_rag_prompt(query, context_docs, system_instruction=None):
    """
    Create a RAG prompt with retrieved context.

    Args:
        query: User's query
        context_docs: Retrieved documents/passages
        system_instruction: Optional system instruction

    Returns:
        Formatted prompt for the LLM
    """
    # Default system instruction if not provided
    if system_instruction is None:
        system_instruction = (
            "You are a helpful, accurate assistant. "
            "Use the provided context to answer the user's question. "
            "If the answer cannot be determined from the context, say "
            "'I don't have enough information to answer this question.' "
            "Always cite your sources by referring to the document ID [docX] "
            "for any information you use."
        )

    # Format retrieved documents, separating passages with blank lines
    formatted_context = "\n\n".join(
        f"[doc{i}] {doc.text}" for i, doc in enumerate(context_docs)
    )

    # Construct the full prompt
    prompt = f"""{system_instruction}

Context information:
{formatted_context}

User Question: {query}

Your response:"""
    return prompt
```
Implementing a Basic RAG System
Setting Up a RAG Pipeline
Let's implement a complete RAG system using popular libraries:
```python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.llms import OpenAI

# Set up environment
os.environ["OPENAI_API_KEY"] = "sk-your-api-key"  # Replace with your API key

# 1. Document Loading
def load_documents(directory_path):
    loader = DirectoryLoader(directory_path, glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
    return documents

# 2. Document Chunking
def chunk_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    return chunks

# 3. Create Vector Store
def create_vector_store(chunks):
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
    print("Vector store created")
    return vector_store

# 4. Set up Retrieval QA Chain
def setup_qa_chain(vector_store):
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=retriever
    )
    return qa_chain

# 5. Answer Questions
def answer_question(qa_chain, query):
    response = qa_chain({"query": query})
    return response["result"]

# Main Pipeline
def run_rag_pipeline(directory_path, query):
    documents = load_documents(directory_path)
    chunks = chunk_documents(documents)
    vector_store = create_vector_store(chunks)
    qa_chain = setup_qa_chain(vector_store)
    answer = answer_question(qa_chain, query)
    return answer

# Example usage
if __name__ == "__main__":
    result = run_rag_pipeline("./documents", "What are the key features of RAG systems?")
    print(result)
```
More Sophisticated RAG Implementation
Here's a more advanced implementation with reranking:
```python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Set up environment (replace with your API keys)
os.environ["OPENAI_API_KEY"] = "sk-your-api-key"
os.environ["COHERE_API_KEY"] = "your-cohere-key"

# Define document loaders for different file types
def get_loader_for_file(file_path):
    if file_path.endswith(".pdf"):
        return PDFMinerLoader(file_path)
    else:
        return TextLoader(file_path)

# 1. Enhanced Document Loading
def load_documents(directory_path):
    loader = DirectoryLoader(
        directory_path,
        glob="**/*.*",
        loader_cls=lambda file_path: get_loader_for_file(file_path)
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
    return documents

# 2. Improved Document Chunking
def chunk_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""],
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    return chunks

# 3. Create Vector Store
def create_vector_store(chunks):
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
    print("Vector store created")
    return vector_store

# 4. Set up Advanced Retrieval with Reranking
def setup_advanced_retriever(vector_store):
    # First-stage retrieval (over-retrieve)
    base_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

    # Add reranking for improved precision
    compressor = CohereRerank()
    reranking_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )

    # Add query expansion for improved recall
    llm = ChatOpenAI(temperature=0)
    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=reranking_retriever,
        llm=llm
    )
    return multi_query_retriever

# 5. Custom RAG Prompt Template
def create_rag_prompt_template():
    template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}

Context: {context}

Answer:"""
    return PromptTemplate(
        template=template,
        input_variables=["context", "question"]
    )

# 6. Setup QA Chain with Custom Prompt
def setup_qa_chain(retriever):
    prompt = create_rag_prompt_template()
    qa_chain = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(temperature=0, model="gpt-4"),
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt}
    )
    return qa_chain

# 7. Answer Questions
def answer_question(qa_chain, query):
    response = qa_chain({"query": query})
    return response["result"]

# Main Pipeline
def run_advanced_rag_pipeline(directory_path, query):
    documents = load_documents(directory_path)
    chunks = chunk_documents(documents)
    vector_store = create_vector_store(chunks)
    retriever = setup_advanced_retriever(vector_store)
    qa_chain = setup_qa_chain(retriever)
    answer = answer_question(qa_chain, query)
    return answer
```
RAG Evaluation and Optimization
Evaluating RAG System Performance
Effective RAG evaluation should consider multiple dimensions:
- Retrieval Metrics:
  - Precision: Are the retrieved documents relevant?
  - Recall: Are all relevant documents retrieved?
  - Mean Average Precision (MAP): How good is the ranking?
- Generation Quality Metrics:
  - Faithfulness: Does the output align with the retrieved information?
  - Answer Relevance: Does the output address the query?
  - Groundedness: Is the output supported by evidence?
- End-to-End Metrics:
  - Correctness: Is the final answer factually correct?
  - Helpfulness: Does it solve the user's problem?
  - Latency: Is the retrieval + generation time acceptable?
```python
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)
from ragas.langchain import RagasEvaluatorChain
from datasets import Dataset

# Example evaluation data
eval_data = [
    {
        "query": "What are the key components of RAG?",
        "contexts": [
            "RAG consists of retrieval and generation components...",
            "..."
        ],
        "answer": "The key components of RAG are the retriever, which finds relevant documents, and the generator, which produces answers based on retrieved context.",
        "ground_truth": "RAG systems comprise retrieval mechanisms that find documents and generation components that produce answers."
    },
    # More evaluation examples
]

# Convert to HuggingFace dataset
dataset = Dataset.from_list(eval_data)

# Create evaluation chains
faithfulness_chain = RagasEvaluatorChain(metric=faithfulness)
answer_relevancy_chain = RagasEvaluatorChain(metric=answer_relevancy)
context_relevancy_chain = RagasEvaluatorChain(metric=context_relevancy)
context_recall_chain = RagasEvaluatorChain(metric=context_recall)
context_precision_chain = RagasEvaluatorChain(metric=context_precision)

# Example evaluation function
def evaluate_rag_system(dataset):
    results = {
        "faithfulness": [],
        "answer_relevancy": [],
        "context_relevancy": [],
        "context_recall": [],
        "context_precision": []
    }

    for item in dataset:
        # Evaluate each metric
        faith_result = faithfulness_chain.run(
            query=item["query"],
            answer=item["answer"],
            contexts=item["contexts"]
        )
        results["faithfulness"].append(faith_result["score"])

        # Similar for other metrics...

    # Calculate average scores (skip metrics that have no scores yet)
    avg_results = {k: sum(v) / len(v) for k, v in results.items() if v}
    return avg_results
```
Optimizing RAG Performance
Interactive Chunking Strategy Optimization
Chunk size strongly affects RAG performance: chunks that are too small lose context, while chunks that are too large dilute relevance and slow down retrieval. The sketch below shows one way to compare chunk sizes empirically against retrieval precision, answer quality, and processing speed.
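As a starting point, here is a framework-agnostic sketch of a chunk-size sweep; `build_rag_pipeline` and `evaluate` are hypothetical stand-ins for your own indexing and evaluation code (for example, the pipeline functions and RAGAS metrics shown earlier):

```python
# Hypothetical sketch of a chunk-size sweep. `build_rag_pipeline` and
# `evaluate` are placeholders for your own indexing and evaluation code.
import time

def sweep_chunk_sizes(documents, eval_questions, build_rag_pipeline, evaluate,
                      chunk_sizes=(256, 512, 1024, 2048)):
    """Index the same corpus at several chunk sizes and compare quality and latency."""
    results = []
    for size in chunk_sizes:
        pipeline = build_rag_pipeline(documents, chunk_size=size, chunk_overlap=size // 10)
        start = time.perf_counter()
        quality = evaluate(pipeline, eval_questions)   # e.g. average faithfulness score
        latency = (time.perf_counter() - start) / len(eval_questions)
        results.append({"chunk_size": size, "quality": quality, "avg_latency_s": latency})
    # Sort by quality so the best-performing configuration comes first
    return sorted(results, key=lambda r: r["quality"], reverse=True)
```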