Retrieval Augmented Generation: What It Is and How to Start Using It

Sean Moriarity

Machine Learning Advisor

Introduction

If you’ve spent any time working with Large Language Models (LLMs) over the past two years, you’ve almost certainly heard of Retrieval Augmented Generation (RAG). RAG combines the strengths of information retrieval with the language capabilities of LLMs. RAG as a concept is relatively simple to understand: you retrieve relevant context from some corpus, inject the context into the prompt to your LLM, and have the LLM generate output based on your prompt. Despite its simplicity, designing good RAG solutions remains a challenge. In this post, you’ll learn about RAG in detail and discover some solutions to level up your RAG game.

Understanding Retrieval Augmented Generation

The purpose of RAG is to ground a language model in factual information by taking advantage of the LLM’s ability to learn information in context. Generally speaking, there are two approaches to “specializing” LLMs for specific tasks:

  1. Fine-tuning
  2. In-context learning

Fine-tuning involves actually updating a model’s weights by training on domain-specific data. As LLMs have improved over time, fine-tuning has started to make less and less sense. There are a few drawbacks to fine-tuning:

  • It is data and compute hungry. Getting enough of both can be very expensive.
  • It can lead to model collapse. The model will almost certainly lose some generality.
  • It can be difficult, and it might not lead to better results!

It pains me to admit that fine-tuning might not be worth it. Training models is fun, and I think it’s a skill everybody should at least familiarize themselves with. However, as LLMs continue to improve at in-context learning, it makes more sense to invest your resources elsewhere than in training specialized models.

This brings us to in-context learning. In-context learning refers to the ability of LLMs to learn and adapt to new tasks or information solely based on the input provided during inference time, without the need for explicit fine-tuning or retraining. This is achieved by leveraging the vast amount of knowledge and patterns that LLMs acquired during their pre-training phase on massive text corpora. In-context learning generally manifests itself in the form of prompt engineering, or by providing few-shot examples to the LLM at inference time.

RAG is a form of in-context learning in which a model uses relevant context provided in its prompt to generate a response. RAG is arguably the most popular LLM use case; there are numerous companies whose sole value proposition is RAG over proprietary data. RAG is necessary because of the limited context length of LLMs. The RAG process typically looks something like this:

  1. Query pre-processing: A typical RAG pipeline will convert an input query into a vector representation using an embedding model to perform vector search. This step might also include more sophisticated techniques such as query expansion.
  2. Retrieval: Retrieval is the process of taking the input query and finding relevant documents. The actual retrieval step is typically a vector search, though it may also be a hybrid approach that uses metadata extracted from the input query to perform metadata filtering. For simplicity, I’m also folding more sophisticated pipelines, which may make use of re-rankers, into this step.
  3. Prompting: After the relevant context is retrieved, it is turned into a prompt for LLM consumption.
  4. Generation: The prompt is then passed to an LLM to generate a completion.

This process is simple. However, implementing an effective RAG solution is challenging. Everybody who’s worked with LLMs has implemented a RAG pipeline. But not everybody has implemented a good RAG pipeline.

A Simple RAG Implementation

A simple RAG implementation typically consists of:

  1. An embedding model
  2. A vector database
  3. An LLM

Embedding Model

In Elixir, a typical RAG pipeline might use an OSS embedding model with Bumblebee:

defmodule RAG.Embedding do
  
  def serving() do
    {:ok, model} = Bumblebee.load_model({:hf, "intfloat/e5-large-v2"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "intfloat/e5-large-v2"})

    Bumblebee.Text.text_embedding(model, tokenizer,
      embedding_processor: :l2_norm,
      defn_options: [compiler: EXLA]
    )
  end

  def predict(text) do
    # Assumes the serving returned by serving/0 is started under your
    # supervision tree with name: __MODULE__
    %{embedding: vector} = Nx.Serving.batched_run(__MODULE__, text)
    vector
  end
end

The purpose of an embedding model is to turn a query into a vector representation that captures the semantic meaning of the input query. These models learn to map words, sentences, or even entire documents to fixed-size vectors that capture the semantic and syntactic information present in the text. By representing text as vectors, embedding models allow for efficient similarity comparisons, clustering, and retrieval of textual data.

In a vector-based RAG solution, you will typically collect and process a large corpus of data into vectors ahead of time and store them in a vector database. If you’ve implemented a RAG solution before, you’ll know this is usually the most time-consuming part of the process. Oftentimes, proprietary data lives in non-plain-text formats such as HTML, Word documents, PDFs, audio, and more. Additionally, these documents are unstructured, and breaking them up into consumable chunks is challenging. The old machine learning cliché holds true here: 99% of your time doing machine learning is spent working on data.
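
As a sketch of what chunking might look like, here is a naive fixed-size chunker with overlap. The module name, sizes, and character-based measurement are all illustrative assumptions; real pipelines usually split on sentence, paragraph, or section boundaries and measure chunk size in tokens:

defmodule RAG.Chunker do
  # Naive fixed-size chunking with overlap, measured in characters for
  # simplicity. Overlapping chunks help preserve context that would
  # otherwise be cut in half at a chunk boundary.
  def chunk(text, chunk_size \\ 1000, overlap \\ 200) do
    step = chunk_size - overlap

    text
    |> String.graphemes()
    |> Enum.chunk_every(chunk_size, step, [])
    |> Enum.map(&Enum.join/1)
  end
end

You would then embed each chunk, for example with RAG.Chunker.chunk(document_text) |> Enum.map(&RAG.Embedding.predict/1), and store the resulting vectors alongside their source text.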

One benefit of using an Elixir embedding model rather than an online solution is latency. If you’re working with Elixir and hitting an external endpoint, you will typically get embeddings back as JSON. Encoding/decoding large lists of floats from a JSON payload in Elixir can be a bottleneck. If you keep your embedding model in Elixir, you skip this process and can see some performance boosts.

Vector Database

While there are specialized vector databases that promise ultra-fast and high-recall retrieval, I typically prefer solutions that make use of Postgres and pgvector. pgvector is a Postgres extension that adds support for vector columns and approximate indexes for performing vector search. It can be slower than specialized vector databases, but it simplifies your stack a bit as you can store your documents alongside the rest of your data. In Elixir, the process of using pgvector is as simple as installing the extension and Elixir library, and adding a vector column to your database:

## Migration
create table(:items) do
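  # size: 3 is illustrative; it must match your embedding model's output dimension (1024 for e5-large-v2)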
  add :embedding, :vector, size: 3
end

create index("items", ["embedding vector_l2_ops"], using: :ivfflat, options: "lists = 100")

## Schema
schema "items" do
  field :embedding, Pgvector.Ecto.Vector
end

And then you can perform vector search using traditional Ecto queries:

import Ecto.Query
import Pgvector.Ecto.Query

Repo.all(from i in Item, order_by: l2_distance(i.embedding, ^Pgvector.new([1, 2, 3])), limit: 5)

There are other approaches you might consider using in Elixir, such as FAISS or HNSW-based libraries. Additionally, there are interesting projects to watch in this space, such as pgvecto.rs, which supports 1-bit vector fields that can significantly reduce database size and speed up search with limited loss in retrieval quality.
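
As an aside, recent versions of pgvector also ship an HNSW index type, which generally trades slower index builds for better query speed and recall than IVFFlat. A minimal sketch of the alternative migration, assuming a recent pgvector and the Elixir pgvector library:

## Migration (HNSW instead of IVFFlat)
create index("items", ["embedding vector_l2_ops"], using: :hnsw)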

Large Language Model

The final component of a typical RAG pipeline is the language model. There are several options in Elixir here. You can use an open-source model with Bumblebee:

defmodule RAG.Generation do

  def serving() do
    repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.2"}

    {:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
    {:ok, generation_config} = Bumblebee.load_generation_config(repo)

    Bumblebee.Text.generation(model_info, tokenizer, generation_config,
      compile: [batch_size: 1, sequence_length: 1028],
      stream: true,
      defn_options: [compiler: EXLA]
    )
  end

  def predict(prompt) do
    Nx.Serving.batched_run(__MODULE__, prompt)
  end
end

Or you can use one of the LLM providers such as OpenAI:

defmodule RAG.Generation do
  
  def predict(prompt) do
    OpenAI.chat_completion(
      model: "gpt-3.5-turbo",
      messages: [%{role: "user", content: prompt}]
    )
  end
end

Regardless of the model you use, your RAG pipeline will end up looking something like:

defmodule RAG do
  
  def generate(query) do
    embedding = RAG.Embedding.predict(query)
    context = RAG.Retrieval.retrieve(embedding)

    prompt = format_prompt(query, context)
    RAG.Generation.predict(prompt)
  end

  defp format_prompt(query, context) do
    """
    Use the following context to respond to the following query.
    Context:
    #{Enum.join(context, "\n")}
    Query: #{query}
    """
  end
end

This is the simplest RAG pipeline you will find, and it works pretty well for simple use cases. However, if you have a more advanced use case, you need to get more sophisticated.

Improving your RAG Pipelines

Query Expansion

One drawback of using a pure vector-based search is that your users might not provide enough context in their query to extract relevant semantic information. If you think about how people typically search for things on Google, it’s very keyword-based. When interacting with LLMs, people are more verbose, but their input queries may be terse and lacking key context. To address this missing context, some pipelines opt to use query expansion. Query expansion is a powerful technique for improving the relevance of the retrieved passages. The goal of query expansion is to reformulate the original query by adding related terms or phrases, thereby increasing the chances of finding relevant information in the knowledge base.

Query expansion can be done in several ways. One technique is to use synonym expansion. This approach uses a word list to expand an input query based on synonyms. A more common approach in the world of LLMs is to do query expansion using an LLM. One way to do this is by using your LLM to generate a hypothetical answer before performing retrieval. This increases the semantic surface area of the input before providing it to an embedding model. This might look something like:

defmodule RAG do

  def generate(query) do
    expanded_query = expand(query)
    embedding = RAG.Embedding.predict(expanded_query)
    context = RAG.Retrieval.retrieve(embedding)

    prompt = format_prompt(query, context)
    RAG.Generation.predict(prompt)
  end

  defp expand(query) do
    answer = RAG.Generation.predict("Provide a hypothetical response to this: #{query}")
    "Query: #{query}\nHypothetical Answer: #{answer}"
  end

  defp format_prompt(query, context) do
    """
    Use the following context to respond to the following query.
    Context:
    #{Enum.join(context, "\n")}
    Query: #{query}
    """
  end
end

A drawback to query expansion with LLMs is that it introduces additional latency, which may be unacceptable. If the added latency is acceptable for your use case, however, it may be worth it.

Re-ranking

For most serious RAG use cases, re-ranking is an absolute must. Re-ranking aims to improve the quality and relevance of the generated responses by refining the order of the retrieved passages. After the initial retrieval step, where a set of candidate passages is obtained based on their similarity to the query, re-ranking techniques are applied to further prioritize the most informative and contextually relevant passages. The main objective of re-ranking is to assign higher scores to passages that are more likely to contribute to generating accurate and coherent responses, while demoting less relevant or noisy passages.

In a “simple” RAG implementation, you may attempt to use an embedding model to do retrieval in a single stage by retrieving the most relevant documents. Re-ranking introduces an additional stage. With re-ranking, you would typically retrieve more documents than before, and then use a re-ranking model to order documents by relevancy again. Then, you can choose to keep the most relevant re-ranked documents and discard the rest. Re-ranking can be done in several ways. The most popular way is using dedicated re-ranking models such as Cohere’s re-ranker or bge-reranker.

Another approach to re-ranking is to use an LLM to perform re-ranking for you. You can do this by providing an explicit prompt to the LLM, asking it to order retrieved chunks in order of relevancy to the input query. Large models are extremely capable at this task, but it may introduce additional overhead and latency to your generation pipeline.
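
Here is a rough sketch of what LLM-based re-ranking might look like, reusing the RAG.Generation.predict/1 function from earlier. The module name, the prompt, the numbered-list format, and the parsing are illustrative assumptions, and this assumes predict/1 returns the completion as a plain string:

defmodule RAG.Reranker do
  # Illustrative LLM-based re-ranking. Production code would need to handle
  # malformed model output more defensively.
  def rerank(query, chunks, top_k \\ 5) do
    numbered =
      chunks
      |> Enum.with_index(1)
      |> Enum.map_join("\n", fn {chunk, i} -> "#{i}. #{chunk}" end)

    prompt = """
    Rank the following passages by relevance to the query.
    Respond with only the passage numbers, most relevant first, separated by commas.

    Query: #{query}

    Passages:
    #{numbered}
    """

    prompt
    |> RAG.Generation.predict()
    |> String.split(~r/[^\d]+/, trim: true)
    |> Enum.map(&String.to_integer/1)
    |> Enum.map(&Enum.at(chunks, &1 - 1))
    |> Enum.reject(&is_nil/1)
    |> Enum.take(top_k)
  end
end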

Function Calling

In some cases, the best RAG is actually no RAG. Some more sophisticated pipelines use smaller “orchestration” models that determine whether or not to perform retrieval at all before generating a response. This can be helpful if a query does not require any specialized or proprietary knowledge to answer. Some pipelines may also make use of LLMs’ ability to extract structured information from an input query in order to generate a more advanced search. For example, you may use an LLM’s ability to extract information as a means of extracting metadata to perform filtering prior to performing a vector search. This can dramatically improve the performance of your retrieval pipeline.
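
As a sketch of the metadata-filtering idea, you might ask the model to extract a filter from the query and apply it before the vector search. Everything here is illustrative: the module name, the :category column on Item, the category list, and the assumption that the embedding is a plain list of floats and that RAG.Generation.predict/1 returns a string.

defmodule RAG.FilteredRetrieval do
  import Ecto.Query
  import Pgvector.Ecto.Query

  @categories ~w(billing support product)

  # Extract a metadata filter with the LLM, then combine it with vector search.
  def retrieve(query, embedding, limit \\ 5) do
    base =
      from i in Item,
        order_by: l2_distance(i.embedding, ^Pgvector.new(embedding)),
        limit: ^limit

    case extract_category(query) do
      nil -> Repo.all(base)
      category -> Repo.all(from i in base, where: i.category == ^category)
    end
  end

  defp extract_category(query) do
    answer =
      """
      If the following query clearly refers to one of these categories: #{Enum.join(@categories, ", ")},
      respond with just that category. Otherwise respond with "none".

      Query: #{query}
      """
      |> RAG.Generation.predict()
      |> String.trim()
      |> String.downcase()

    if answer in @categories, do: answer, else: nil
  end
end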

Document Averaging

One unique approach I’ve seen to improve RAG performance is to make use of “document averaging.” Given a chunked document, you compute the mean of all chunks in the document. This gives you a vector that is an aggregate of the entire document. It’s a bit lossy; however, it gives you a decent idea of whether or not a document contains relevant information to a given query. Using the “average” vectors, you can quickly determine if a document might contain relevant information, and then perform an additional search on the chunks in the document. Depending on your data, this can be a useful way to quickly retrieve relevant documents before performing a more intensive search on the filtered contents.
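
With Nx, computing the document-level vector is straightforward. A minimal sketch, assuming chunk_embeddings is a non-empty list of equal-shape embedding tensors for a single document:

# Average all chunk embeddings into one document-level vector.
doc_vector =
  chunk_embeddings
  |> Nx.stack()
  |> Nx.mean(axes: [0])

If your chunk embeddings are L2-normalized (as in the e5 serving above), you may want to re-normalize the averaged vector before storing it alongside your other vectors.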

Generating Citations

The most common issue with LLMs is hallucination, a term for when an LLM “makes up” information out of thin air. RAG should ground your LLM in factual content; however, it does not completely solve the hallucination problem. One way to mitigate this is to do a final pass over your model’s output to generate citations for the generated response based on your input context. The idea is that you use an LLM to look at the response and add citations after the fact, given the context you provided the model. This additional pass adds confidence that a given response is rooted in factual or real content. There’s a great write-up of doing this in Elixir in the Instructor repository.
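
As a rough sketch of the idea using the instructor_ex library, you might define a schema for the cited answer and ask the model to fill it in. The schema, prompt, and model choice here are illustrative; see the Instructor repository’s citation example for a fuller, validated version:

defmodule RAG.CitedAnswer do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field :answer, :string
    # Exact quotes from the provided context that support the answer
    field :citations, {:array, :string}
  end
end

defmodule RAG.Citations do
  # Illustrative: ask the model to restate the answer along with supporting
  # quotes drawn verbatim from the retrieved context.
  def cite(query, context, response) do
    Instructor.chat_completion(
      model: "gpt-3.5-turbo",
      response_model: RAG.CitedAnswer,
      messages: [
        %{
          role: "user",
          content: """
          Given the context, query, and draft response below, return the response
          along with citations: exact quotes from the context that support it.

          Context: #{context}
          Query: #{query}
          Response: #{response}
          """
        }
      ]
    )
  end
end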

The Importance of Evals

One thing that’s true of any software is that it’s never perfect in your first implementation. In traditional software development, we improve software over time. Typically, we add new features that are backed by requirements, and to verify those features work, we add tests. Due to the non-deterministic nature of a RAG pipeline, it can be difficult or impossible to add deterministic tests. Additionally, small changes in the pipeline can have a drastic impact on its performance. Because of this, it is very important to introduce specific evaluations into your process as soon as possible. These evaluations should capture the use cases you expect to see in production and will significantly help you iteratively improve your RAG pipelines.
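
A minimal sketch of what a retrieval eval might look like, assuming you maintain a small set of hand-labeled query/expected-document pairs. The dataset shape, the recall@k metric, and the assumption that RAG.Retrieval.retrieve/1 returns structs with an :id field are all illustrative choices:

defmodule RAG.Evals do
  # Illustrative retrieval eval: for each labeled example, check whether the
  # expected document shows up in the top-k retrieved results (recall@k).
  # Assumes eval_set is a list of %{query: ..., expected_doc_id: ...} maps.
  def recall_at_k(eval_set, k \\ 5) do
    hits =
      Enum.count(eval_set, fn %{query: query, expected_doc_id: doc_id} ->
        retrieved_ids =
          query
          |> RAG.Embedding.predict()
          |> RAG.Retrieval.retrieve()
          |> Enum.take(k)
          |> Enum.map(& &1.id)

        doc_id in retrieved_ids
      end)

    hits / length(eval_set)
  end
end

Running this against a fixed eval set before and after each pipeline change gives you a concrete signal about whether a tweak to chunking, embedding, or re-ranking actually helped.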

Conclusion

RAG is the bread and butter of LLM-based applications. In this post, we covered what RAG is, and what it generally looks like in Elixir. We also discussed some advanced techniques for improving traditional RAG pipelines. Elixir makes it easy to build RAG applications without needing to drift outside of the ecosystem. If you’re planning on building an LLM-based application, I strongly recommend you check out Elixir. Until next time!
