How I Set Up DeepEval for Fast, Easy, and Powerful LLM Evaluations

The Simple Tutorial for LLM Evals Using DeepEval

Today, I want to walk you through one of the most crucial (and often overlooked) aspects of deploying LLMs in production: evaluations. I'll show you how to set up DeepEval, a popular open-source evaluation framework for large language models.

Why DeepEval?

DeepEval is my go-to framework for LLM evaluations because:

  • It's open-source and highly customizable

  • It comes with a variety of built-in evaluation methods

  • You can easily write your own evaluation metrics (there's a small custom-metric sketch right after this list)

  • It offers a web UI for in-depth analysis (optional, but super useful)

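To back up that last bullet, here's a minimal sketch of what a custom metric can look like. I'm assuming DeepEval's BaseMetric base class (subclass it, set self.score and self.success in measure(), and implement is_successful()); the exact hooks can vary a bit between versions, and the LengthMetric name plus the 200-character budget are purely illustrative, not anything DeepEval ships.

    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase

    # Illustrative custom metric: passes when the output stays under a length budget
    class LengthMetric(BaseMetric):
        def __init__(self, max_chars: int = 200, threshold: float = 0.5):
            self.max_chars = max_chars
            self.threshold = threshold

        def measure(self, test_case: LLMTestCase) -> float:
            # Score 1.0 when the answer is concise enough, 0.0 otherwise
            self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
            self.success = self.score >= self.threshold
            return self.score

        async def a_measure(self, test_case: LLMTestCase) -> float:
            return self.measure(test_case)

        def is_successful(self) -> bool:
            return self.success

        @property
        def __name__(self):
            return "Length"

You'd use it exactly like the built-in metrics, for example by passing it to assert_test alongside a test case.
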
Let me walk you through how I set it up and ran my first evaluation.

Setting Up the Environment

Here's how I got started:

  1. Created a new folder for the project: mkdir DeepEvalTest && cd DeepEvalTest

  2. Set up a Python virtual environment:

    python -m venv venv
    source venv/bin/activate
  3. Installed DeepEval: pip install deepeval

  4. (Optional) Logged into the DeepEval web UI: deepeval login

Creating Our First Evaluation

I created a simple Python script (test_example.py) to run our first eval:

    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    def test_answer_relevancy():
        # A canned user question and the LLM's answer we want to judge
        test_case = LLMTestCase(
            input="What if these shoes don't fit?",
            actual_output="We offer a 30-day full refund on all shoe purchases at no extra cost."
        )
        # Passes when the relevancy score is at least 0.5
        metric = AnswerRelevancyMetric(threshold=0.5)
        assert_test(test_case, [metric])

This script defines a pytest-style test that uses the AnswerRelevancyMetric to check whether the LLM's response is relevant to the user's question. Pretty neat, right?
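By the way, if you want to poke at a metric outside of a full test run, you can score a test case directly. Treat this as a sketch based on how I've used the library: I'm assuming AnswerRelevancyMetric exposes measure(), score, and reason, which matches the DeepEval docs, but double-check against your installed version.

    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund on all shoe purchases at no extra cost."
    )

    metric = AnswerRelevancyMetric(threshold=0.5)
    metric.measure(test_case)   # calls the judge LLM under the hood
    print(metric.score)         # a float between 0 and 1
    print(metric.reason)        # a short natural-language explanation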

The Crucial Step You Might Miss

Here's a pro tip: before running your eval, make sure your OpenAI API key is set as an environment variable, because DeepEval's built-in metrics use an OpenAI model as the judge by default. I did this by running:

export OPENAI_API_KEY=your_api_key_here

Trust me, this step will save you from a headache later!
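If you'd rather have the script fail fast with a readable error instead of a cryptic one from the OpenAI client, a tiny guard like this (plain Python, nothing DeepEval-specific) does the trick:

    import os

    # DeepEval's default metrics call OpenAI, so check the key before anything else
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before running your evals")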

Running the Evaluation

With everything set up, I ran the evaluation using:

deepeval test run test_example.py

And voila! The eval ran successfully, giving me a score and even telling me how many tokens it used.
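If you'd rather stay inside Python than shell out to the CLI, DeepEval also exposes an evaluate() helper. Again, this is a sketch: the keyword test_cases and metrics arguments below match the versions I've used, but the exact signature may differ in yours.

    from deepeval import evaluate
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund on all shoe purchases at no extra cost."
    )

    # Runs the metric against the test case and prints a pass/fail summary
    evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.5)])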

Diving Deeper with the Web UI

Remember that optional login step? Here's where it pays off. By logging into confident.ai.com, I got access to a dashboard that shows:

  • Detailed test results

  • Cost per test

  • Input and output for each test case

  • Time taken for each evaluation

This feature is especially handy when you're running bulk tests or need to do more in-depth analysis.

Why This Matters

Setting up a robust evaluation framework like DeepEval is crucial when you're deploying LLMs at scale. It allows you to:

  • Consistently measure the performance of your models

  • Quickly identify areas for improvement in your prompts

  • Ensure the quality of your LLM outputs in production

Wrapping Up

Getting started with DeepEval might seem a bit tricky at first, but trust me, it's worth it. The insights you gain from these evaluations can be game-changing for your LLM applications.

I hope this walkthrough helps you get up and running with DeepEval. Remember, the key steps are:

  1. Set up your environment

  2. Install DeepEval

  3. Create your evaluation script

  4. Set your OpenAI API key

  5. Run the evaluation

  6. Analyze the results (bonus points for using the web UI!)

If you want to dive deeper into LLM evaluations or have any questions about setting up DeepEval, feel free to reach out. You can find me on Twitter or check out my YouTube channel for more tutorials like this one.

Happy evaluating!
