Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an AI framework that enhances the accuracy and reliability of Large Language Models (LLMs) by integrating external knowledge sources during the generation process.
Retrieval-Augmented Generation (RAG) is an AI architecture designed to enhance the capabilities of Large Language Models (LLMs) by grounding their responses in external, up-to-date knowledge. Instead of relying solely on the knowledge encoded in the LLM's parameters during training, RAG systems first retrieve relevant information from a specified knowledge base (e.g., a vector [database](/en/terms/vector-database), document repository, or API) before generating a response.

The process typically involves several key stages:

1. Query Transformation: The user's input query is processed and potentially rephrased to optimize retrieval.
2. Information Retrieval: The transformed query is used to search the external knowledge base for the most relevant documents or data chunks. This often employs vector embeddings and similarity search algorithms.
3. Context Augmentation: The retrieved information is combined with the original user query to form an augmented prompt.
4. LLM Generation: The augmented prompt is fed into the LLM, which uses both its internal knowledge and the provided context to generate a more accurate, informed, and contextually relevant response.

RAG addresses the limitations of LLMs, such as knowledge cutoffs, hallucination, and the inability to access real-time information. It allows LLMs to leverage dynamic and specific data without requiring costly retraining. Trade-offs include increased latency due to the retrieval step and the complexity of managing and indexing the external knowledge base. The effectiveness of RAG heavily depends on the quality of the retrieval system and the relevance of the retrieved documents.
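The four stages above can be sketched in a few dozen lines. This is a minimal, self-contained toy: the "embedding" is a bag-of-words term vector with cosine similarity (real systems use neural embedding models and a vector database), the query transformation is a pass-through, and the LLM call is stubbed out. All names (`embed`, `retrieve`, `rag_answer`, the sample knowledge base) are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words term counts.
    # Real RAG systems use a neural embedding model here.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in knowledge base: in practice, chunked documents in a vector database.
KNOWLEDGE_BASE = [
    "RAG combines retrieval with generation to ground LLM answers.",
    "Vector databases store embeddings for similarity search.",
    "Photosynthesis converts sunlight into chemical energy.",
]

def retrieve(query, top_k=2):
    # Stage 2: rank knowledge-base chunks by similarity to the query.
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda d: cosine_similarity(q, embed(d)),
                    reverse=True)
    return ranked[:top_k]

def rag_answer(query, llm=lambda prompt: prompt):
    # Stage 1 (trivial here): the query is used as-is; production systems
    # may rephrase or decompose it first.
    # Stage 3: augment the prompt with the retrieved context.
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # Stage 4: hand the augmented prompt to the LLM (stubbed out here).
    return llm(prompt)
```

Swapping the stub `llm` for a real model call and `embed` for a real embedding model turns this skeleton into the standard RAG loop; the control flow stays the same.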
graph LR
Center["Retrieval-Augmented Generation (RAG)"]:::main
Pre_large_language_model["large-language-model"]:::pre --> Center
click Pre_large_language_model "/terms/large-language-model"
Pre_vector_database["vector-database"]:::pre --> Center
click Pre_vector_database "/terms/vector-database"
Pre_semantic_search["semantic-search"]:::pre --> Center
click Pre_semantic_search "/terms/semantic-search"
Center --> Child_rag_pipeline["rag-pipeline"]:::child
click Child_rag_pipeline "/terms/rag-pipeline"
Center --> Child_context_window["context-window"]:::child
click Child_context_window "/terms/context-window"
Rel_prompt_engineering["prompt-engineering"]:::related -.-> Center
click Rel_prompt_engineering "/terms/prompt-engineering"
Rel_generative_ai_agents["generative-ai-agents"]:::related -.-> Center
click Rel_generative_ai_agents "/terms/generative-ai-agents"
Rel_rlhf["rlhf"]:::related -.-> Center
click Rel_rlhf "/terms/rlhf"
classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
linkStyle default stroke:#4b5563,stroke-width:2px;
🧒 Explain Like I'm 5
Imagine you're taking a test. A regular AI tries to answer from memory alone. An AI using [RAG](/en/terms/rag) is allowed to look at an open book (the [database](/en/terms/database)) specifically related to the question before it writes down the answer. This makes it much less likely to make things up.
🤓 Expert Deep Dive
RAG architectures fundamentally shift LLM interaction from pure parametric recall to a hybrid parametric-retrieval paradigm. The core challenge lies in optimizing the retrieval-augmentation loop for relevance and efficiency. Techniques like dense passage retrieval (DPR) using bi-encoders, or hybrid search combining keyword and semantic matching, are crucial. Advanced RAG implementations explore iterative retrieval, query decomposition for complex questions, and re-ranking mechanisms to refine retrieved context.

The choice of vector [database](/en/terms/vector-database), embedding model, chunking strategy, and retrieval parameters (e.g., top-k) significantly impacts performance. Potential vulnerabilities include retrieval poisoning, where malicious data injected into the knowledge base can lead to biased or incorrect LLM outputs. Furthermore, the computational overhead of retrieval can introduce latency, a critical trade-off for real-time applications.

Evaluating RAG systems requires metrics beyond standard LLM benchmarks, focusing on retrieval precision/recall and the factual consistency of the generated output with the retrieved context.