Summarization
The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within deepeval, the original text refers to the input, while the summary is the actual_output.
The SummarizationMetric is the only default metric in deepeval that is not cacheable.
Required Arguments
To use the SummarizationMetric, you'll have to provide the following arguments when creating an LLMTestCase:

- input
- actual_output
Example
Let's take this input and actual_output as an example:
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""
# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""
You can use the SummarizationMetric as follows:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
threshold=0.5,
model="gpt-4",
assessment_questions=[
"Is the coverage score based on a percentage of 'yes' answers?",
"Does the score ensure the summary's accuracy with the source?",
"Does a higher score mean a more comprehensive summary?"
]
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
# or evaluate test cases in bulk
evaluate([test_case], [metric])
There are seven optional parameters when instantiating a SummarizationMetric class (see the configuration sketch after this list):

- [Optional] threshold: the passing threshold, defaulted to 0.5.
- [Optional] assessment_questions: a list of close-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you would ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If assessment_questions is not provided, we will generate a set of assessment_questions for you at evaluation time. The assessment_questions are used to calculate the coverage_score.
- [Optional] n: the number of assessment questions to generate when assessment_questions is not provided. Defaulted to 5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
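For instance, a SummarizationMetric that sets several of these parameters explicitly might look like the following. This is a minimal sketch: the parameter names come from the list above, but the specific values (a threshold of 0.6, three generated questions, and so on) are illustrative rather than recommendations.
from deepeval.metrics import SummarizationMetric

# Illustrative configuration only; pick values that suit your own use case.
metric = SummarizationMetric(
    threshold=0.6,        # passing threshold for the final score
    model="gpt-4o",       # OpenAI model name, or a custom DeepEvalBaseLLM instance
    n=3,                  # number of assessment questions to generate if none are provided
    include_reason=True,  # attach a reason to the evaluation score
    strict_mode=False,    # keep the continuous (non-binary) score
    async_mode=True,      # run evaluation steps concurrently inside measure()
)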
How Is It Calculated?
The SummarizationMetric score is calculated according to the following equation:

Summarization Score = min(Alignment Score, Coverage Score)

To break it down, the:

- alignment_score determines whether the summary contains hallucinated or contradictory information relative to the original text.
- coverage_score determines whether the summary contains the necessary information from the original text.
While the alignment_score is similar to that of the HallucinationMetric, the coverage_score is first calculated by generating n closed-ended questions that can only be answered with either a 'yes' or a 'no', before calculating the ratio of questions for which the original text and the summary yield the same answer. Here is a great article on how deepeval's summarization metric was built.
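To make the arithmetic concrete, here is a standalone sketch of how the two components combine. It is not deepeval's internal code: the yes/no answers are hard-coded stand-ins for what an LLM judge would produce, and the alignment_score is simply an illustrative value.
# Standalone sketch of the scoring arithmetic (not deepeval's internals).
# Assume each assessment question has already been answered with 'yes'/'no'
# against both the original text and the summary.
original_answers = ["yes", "yes", "yes", "yes", "yes"]
summary_answers = ["yes", "yes", "no", "yes", "yes"]

# coverage_score: ratio of questions where the summary agrees with the original text
coverage_score = sum(
    o == s for o, s in zip(original_answers, summary_answers)
) / len(original_answers)  # 4/5 = 0.8

# alignment_score comes from a separate hallucination/contradiction check
alignment_score = 0.9  # illustrative value

# The final summarization score is the minimum of the two components
summarization_score = min(alignment_score, coverage_score)  # 0.8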
You can access the alignment_score and coverage_score from a SummarizationMetric as follows:
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(...)
metric = SummarizationMetric(...)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.score_breakdown)
Since the summarization score is the minimum of the alignment_score and coverage_score, a 0 value for either one of these scores will result in a final summarization score of 0.