How I Set Up DeepEval for Fast, Easy, and Powerful LLM Evaluations

The Simple Tutorial for LLM Evals Using DeepEval

Today, I want to walk you through one of the most crucial (and often overlooked) aspects of deploying LLMs in production: evaluations. I'll show you how to set up DeepEval, a popular open-source evaluation framework for large language models.

Why DeepEval?

DeepEval is my go-to framework for LLM evaluations because:

  • It's open-source and highly customizable

  • It comes with a variety of built-in evaluation methods

  • You can easily write your own evaluation metrics (there's a small custom-metric sketch right after this list)

  • It offers a web UI for in-depth analysis (optional, but super useful)

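To back up that last bullet, here's a minimal sketch of what a custom metric can look like. I'm assuming DeepEval's BaseMetric base class (subclass it, set self.score and self.success in measure(), and implement is_successful()); the exact hooks can vary a bit between versions, and the LengthMetric name plus the 200-character budget are purely illustrative, not anything DeepEval ships.

    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase

    # Illustrative custom metric: passes when the output stays under a length budget
    class LengthMetric(BaseMetric):
        def __init__(self, max_chars: int = 200, threshold: float = 0.5):
            self.max_chars = max_chars
            self.threshold = threshold

        def measure(self, test_case: LLMTestCase) -> float:
            # Score 1.0 when the answer is concise enough, 0.0 otherwise
            self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
            self.success = self.score >= self.threshold
            return self.score

        async def a_measure(self, test_case: LLMTestCase) -> float:
            return self.measure(test_case)

        def is_successful(self) -> bool:
            return self.success

        @property
        def __name__(self):
            return "Length"

You'd use it exactly like the built-in metrics, for example by passing it to assert_test alongside a test case.
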
Let me walk you through how I set it up and ran my first evaluation.

Setting Up the Environment

Here's how I got started:

  1. Created a new folder for the project: mkdir DeepEvalTest && cd DeepEvalTest

  2. Set up a Python virtual environment:

    python -m venv venv
    source venv/bin/activate
  3. Installed DeepEval: pip install deepeval

  4. (Optional) Logged into the DeepEval web UI: deepeval login

Creating Our First Evaluation

I created a simple Python script (test_example.py) to run our first eval:

    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    def test_answer_relevancy():
        # A canned user question and the LLM's answer we want to judge
        test_case = LLMTestCase(
            input="What if these shoes don't fit?",
            actual_output="We offer a 30-day full refund on all shoe purchases at no extra cost."
        )
        # Passes when the relevancy score is at least 0.5
        metric = AnswerRelevancyMetric(threshold=0.5)
        assert_test(test_case, [metric])

This script defines a pytest-style test that uses the AnswerRelevancyMetric to check whether the LLM's response is relevant to the user's question. Pretty neat, right?
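By the way, if you want to poke at a metric outside of a full test run, you can score a test case directly. Treat this as a sketch based on how I've used the library: I'm assuming AnswerRelevancyMetric exposes measure(), score, and reason, which matches the DeepEval docs, but double-check against your installed version.

    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund on all shoe purchases at no extra cost."
    )

    metric = AnswerRelevancyMetric(threshold=0.5)
    metric.measure(test_case)   # calls the judge LLM under the hood
    print(metric.score)         # a float between 0 and 1
    print(metric.reason)        # a short natural-language explanation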

The Crucial Step You Might Miss

Here's a pro tip: before running your eval, make sure your OpenAI API key is set as an environment variable, because DeepEval's built-in metrics use an OpenAI model as the judge by default. I did this by running:

export OPENAI_API_KEY=your_api_key_here

Trust me, this step will save you from a headache later!
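If you'd rather have the script fail fast with a readable error instead of a cryptic one from the OpenAI client, a tiny guard like this (plain Python, nothing DeepEval-specific) does the trick:

    import os

    # DeepEval's default metrics call OpenAI, so check the key before anything else
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before running your evals")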

Running the Evaluation

With everything set up, I ran the evaluation using:

deepeval test run test_example.py

And voila! The eval ran successfully, giving me a score and even telling me how many tokens it used.
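If you'd rather stay inside Python than shell out to the CLI, DeepEval also exposes an evaluate() helper. Again, this is a sketch: the keyword test_cases and metrics arguments below match the versions I've used, but the exact signature may differ in yours.

    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund on all shoe purchases at no extra cost."
    )

    # Runs the metric against the test case and prints a pass/fail summary
    evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.5)])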

Diving Deeper with the Web UI

Remember that optional login step? Here's where it pays off. By logging into confident.ai.com, I got access to a dashboard that shows:

  • Detailed test results

  • Cost per test

  • Input and output for each test case

  • Time taken for each evaluation

This feature is especially handy when you're running bulk tests or need to do more in-depth analysis.

Why This Matters

Setting up a robust evaluation framework like DeepEval is crucial when you're deploying LLMs at scale. It allows you to:

  • Consistently measure the performance of your models

  • Quickly identify areas for improvement in your prompts

  • Ensure the quality of your LLM outputs in production

Wrapping Up

Getting started with DeepEval might seem a bit tricky at first, but trust me, it's worth it. The insights you gain from these evaluations can be game-changing for your LLM applications.

I hope this walkthrough helps you get up and running with DeepEval. Remember, the key steps are:

  1. Set up your environment

  2. Install DeepEval

  3. Create your evaluation script

  4. Set your OpenAI API key

  5. Run the evaluation

  6. Analyze the results (bonus points for using the web UI!)

If you want to dive deeper into LLM evaluations or have any questions about setting up DeepEval, feel free to reach out. You can find me on Twitter or check out my YouTube channel for more tutorials like this one.

Happy evaluating!
