Building and Evaluating RAG Systems
Retrieval-Augmented Generation (RAG) systems are among the simplest and most powerful techniques in agent-based AI. Although widely used in production, their performance in terms of relevance, groundedness, and truthfulness is rarely assessed once they are deployed.
In this workshop, you will learn a set of practical evaluation techniques for RAG systems, ranging from baseline offline metrics, to reference-free evaluation for cases where no ground truth is known in advance, and finally to production monitoring and A/B testing.
We will start with a simple RAG system and evaluate it using frameworks such as Haystack, toolkits such as RAGAS, and concepts such as the RAG Triad.
We will then progress to evaluating with synthetic datasets. You will learn how to generate synthetic question/answer pairs using LLM calls, how to evaluate them with the help of rankers, and how to compare different embedding models and tune retrieval cut-off parameters.
Next, we will evaluate the generation step of RAG. You will learn how to set up and apply LLM-as-judge methods.
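To give a flavour of the idea, an LLM judge can be as simple as a prompt that asks a model to score an answer against its retrieved context. A minimal, hypothetical Elixir sketch (the `call_llm` argument is a placeholder for whatever LLM client you use in the workshop; the module and function names are illustrative, not part of any library):

```elixir
defmodule JudgeSketch do
  @doc """
  Hypothetical LLM-as-judge for groundedness: asks the model to rate,
  on a 1-5 scale, how well `answer` is supported by `context`.
  `call_llm` is a function that sends a prompt string to an LLM and
  returns its reply as a string.
  """
  def groundedness(answer, context, call_llm) do
    prompt = """
    Rate from 1 to 5 how well the ANSWER is supported by the CONTEXT.
    Reply with a single integer and nothing else.
    CONTEXT: #{context}
    ANSWER: #{answer}
    """

    prompt
    |> call_llm.()
    |> String.trim()
    |> String.to_integer()
  end
end
```

In practice you would average such scores over an evaluation set and guard against non-numeric replies; the workshop covers these refinements.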
Finally, we will explore data structures that enable the constant monitoring and A/B testing of production systems.
Learning outcomes: By the end, you will be able to:
- Diagnose RAG performance in terms of relevance, groundedness, and truthfulness
- Build an evaluation harness for a simple RAG pipeline
- Use common toolkits (e.g. Haystack, RAGAS) and concepts (e.g. the RAG triad) to measure and debug failures
- Create and evaluate synthetic datasets when labelled data is scarce
- Compare embedding models, tune retrieval cut-offs and evaluate ranking choices
- Evaluate the generation step using LLM-as-judge methods
- Design data structures and workflows for continuous monitoring, regression testing and A/B experimentation in production
Prerequisites: A Linux or macOS laptop with Elixir and Postgres installed.
Target Audience
Developers interested in building and evaluating RAG systems in Elixir.