Custom Metrics
deepeval allows you to implement your own evaluator (for example, your own GPT evaluator) by creating a custom metric. All custom metrics are automatically integrated with the deepeval ecosystem, which includes Confident AI.
Required Arguments
To use a custom metric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
You'll also need to supply any additional arguments, such as expected_output and context, if your custom metric's measure() method is dependent on these parameters.
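For example, a test case that supplies both required arguments plus the optional expected_output and context might look like this (the values below are just placeholders):

from deepeval.test_case import LLMTestCase

# input and actual_output are always required; expected_output and context
# are only needed if your custom metric's measure() method reads them
test_case = LLMTestCase(
    input="Who wrote 'Pride and Prejudice'?",
    actual_output="Jane Austen wrote 'Pride and Prejudice'.",
    expected_output="Jane Austen",
    context=["'Pride and Prejudice' is an 1813 novel by Jane Austen."],
)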
Implementation
To create a custom metric, you'll need to inherit deepeval's BaseMetric class and implement abstract methods and properties such as measure(), a_measure(), is_successful(), and __name__. Here's an example:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


# Inherit BaseMetric
class LatencyMetric(BaseMetric):
    # By default, this metric fails if the latency exceeds 10 seconds
    def __init__(self, max_seconds: int = 10):
        self.threshold = max_seconds

    def measure(self, test_case: LLMTestCase):
        # Set self.success and self.score in the "measure" method
        self.success = test_case.additional_metadata["latency"] <= self.threshold
        if self.success:
            self.score = 1
        else:
            self.score = 0

        # You can also optionally set a reason for the score returned.
        # This is particularly useful for a score computed using LLMs
        self.reason = "Too slow!"
        return self.score

    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Latency"
Notice that a few things have happened:
- self.threshold was set in __init__(), and it can act as either a minimum or a maximum threshold
- self.success, self.score, and self.reason were set in measure()
- measure() takes in an LLMTestCase
- is_successful() simply returns the success status
- __name__ simply returns a string representing the metric name
To create a custom metric without unexpected errors, we recommend you set the appropriate class variables in the appropriate methods as outlined above. You should also note that self.reason is optional: it should be a string representing the rationale behind an LLM-computed score, and is only applicable if you're using an LLM as the evaluator in the measure() method and have implemented a way to generate a reason for the score.
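As a rough illustration, a measure() method that delegates scoring to an LLM could set self.reason from the judge's rationale. The sketch below assumes a hypothetical judge_with_llm() helper (not part of deepeval) that returns a (score, reason) tuple:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class LLMJudgedMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase):
        # judge_with_llm() is a hypothetical helper that prompts an LLM to
        # grade actual_output against input and return a score with a rationale
        score, reason = judge_with_llm(test_case.input, test_case.actual_output)
        self.score = score
        self.reason = reason
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "LLM Judged"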
After creating a custom metric, you can use it in the same way as all of deepeval's metrics:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
...
# Note that we pass in latency via additional_metadata since the measure method requires it for evaluation
latency_metric = LatencyMetric(max_seconds=10.0)
test_case = LLMTestCase(input="...", actual_output="...", additional_metadata={"latency": 8.5})
evaluate([test_case], [latency_metric])
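If you prefer unit-test-style evaluation, the same metric should also work with deepeval's assert_test inside a test file (a minimal sketch, run with the deepeval test run command):

from deepeval import assert_test
from deepeval.test_case import LLMTestCase


def test_latency():
    test_case = LLMTestCase(
        input="...",
        actual_output="...",
        additional_metadata={"latency": 8.5},
    )
    # Fails the test if measure() sets self.success to False
    assert_test(test_case, [LatencyMetric(max_seconds=10.0)])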