BIG-Bench Hard
The BIG-Bench Hard (BBH) benchmark comprises 23 challenging BIG-Bench tasks where prior language model evaluations have not outperformed the average human rater. BBH evaluates models using both few-shot and chain-of-thought (CoT) prompting techniques. For more details, you can visit the BIG-Bench Hard GitHub page.
Arguments
There are three optional arguments when using the `BigBenchHard` benchmark:

- [Optional] `tasks`: a list of tasks (`BigBenchHardTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `BigBenchHardTask` enums can be found here.
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This number ranges strictly from 0-3, and is set to 3 by default.
- [Optional] `enable_cot`: a boolean that determines whether CoT prompting is used for evaluation. This is set to `True` by default.
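As a quick sketch of how these defaults play out, creating a benchmark with no arguments is equivalent to explicitly passing all tasks, 3 shots, and CoT enabled:

```python
from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# All defaults: every task, 3-shot prompting, CoT enabled
benchmark_default = BigBenchHard()

# Equivalent to spelling the defaults out explicitly
benchmark_explicit = BigBenchHard(
    tasks=list(BigBenchHardTask),  # all 23 tasks
    n_shots=3,
    enable_cot=True,
)
```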
Chain-of-Thought (CoT) prompting is an approach where the model is prompted to articulate its reasoning process to arrive at an answer. Meanwhile, few-shot prompting is a method where the model is provided with a few examples (or "shots") to learn from before making predictions. When combined, few-shot prompting and CoT can significantly enhance performance. You can learn more about CoT here.
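As a rough illustration (not deepeval's exact prompt template), a 3-shot CoT prompt for a Boolean Expressions question might be assembled like this:

```python
# Hypothetical illustration of a few-shot CoT prompt; the exact wording and
# formatting deepeval uses internally may differ.
shots = [
    (
        "not ( True ) and ( True ) is",
        "Let's think step by step. not ( True ) is False. False and True is False. "
        "So the answer is False.",
    ),
    # ...two more worked examples would follow for 3-shot prompting...
]

question = "True and not not ( not False ) is"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
prompt += f"\n\nQ: {question}\nA: Let's think step by step."
print(prompt)
```

The worked examples teach the model both the reasoning style and the expected answer format, while the trailing "Let's think step by step." nudges it to reason before answering.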
Example
The code below assesses a custom `mistral_7b` model (click here to learn how to use ANY custom LLM) on Boolean Expressions and Causal Judgement in `BigBenchHard` using 3-shot CoT prompting.
```python
from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Define benchmark with specific tasks and shots
benchmark = BigBenchHard(
    tasks=[BigBenchHardTask.BOOLEAN_EXPRESSIONS, BigBenchHardTask.CAUSAL_JUDGEMENT],
    n_shots=3,
    enable_cot=True,
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
```
The `overall_score` for this benchmark ranges from 0 to 1, and is computed as the proportion of total correct predictions according to the target labels for each respective task. The exact match scorer is used for BIG-Bench Hard.
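Conceptually, exact match reduces to a strict comparison between the model's answer and the target label. The sketch below is a simplified stand-in for deepeval's internal scorer, not its actual implementation:

```python
def exact_match(prediction: str, target: str) -> int:
    # Simplified: deepeval's scorer may normalize answers differently.
    return int(prediction.strip() == target.strip())

# The overall score is the fraction of exact matches across all evaluated questions
predictions = ["False", "True", "8"]
targets = ["False", "False", "8"]
overall_score = sum(exact_match(p, t) for p, t in zip(predictions, targets)) / len(targets)
print(overall_score)  # 0.666...
```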
BBH answers exhibit a greater variety of formats than benchmarks built on multiple-choice questions, since different tasks in BBH require different types of outputs (for example, boolean values in Boolean Expressions versus numbers in multistep arithmetic). Employing CoT prompting can therefore substantially improve benchmark performance, and utilizing more few-shot examples (`n_shots`) can greatly improve the model's robustness in generating answers in the exact correct format, boosting the overall score.
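One way to see this effect, reusing the same custom `mistral_7b` stand-in model from the example above, is to run the benchmark twice with different prompting settings and compare the resulting scores:

```python
from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Compare 0-shot without CoT against 3-shot with CoT on the same task
for n_shots, enable_cot in [(0, False), (3, True)]:
    benchmark = BigBenchHard(
        tasks=[BigBenchHardTask.BOOLEAN_EXPRESSIONS],
        n_shots=n_shots,
        enable_cot=enable_cot,
    )
    benchmark.evaluate(model=mistral_7b)
    print(f"n_shots={n_shots}, enable_cot={enable_cot}: {benchmark.overall_score}")
```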
BIG-Bench Hard Tasks
The `BigBenchHardTask` enum classifies the diverse range of tasks covered in the BIG-Bench Hard benchmark.

```python
from deepeval.benchmarks.tasks import BigBenchHardTask

big_tasks = [BigBenchHardTask.BOOLEAN_EXPRESSIONS]
```
Below is the comprehensive list of available tasks:
BOOLEAN_EXPRESSIONS
CAUSAL_JUDGEMENT
DATE_UNDERSTANDING
DISAMBIGUATION_QA
DYCK_LANGUAGES
FORMAL_FALLACIES
GEOMETRIC_SHAPES
HYPERBATON
LOGICAL_DEDUCTION_FIVE_OBJECTS
LOGICAL_DEDUCTION_SEVEN_OBJECTS
LOGICAL_DEDUCTION_THREE_OBJECTS
MOVIE_RECOMMENDATION
MULTISTEP_ARITHMETIC_TWO
NAVIGATE
OBJECT_COUNTING
PENGUINS_IN_A_TABLE
REASONING_ABOUT_COLORED_OBJECTS
RUIN_NAMES
SALIENT_TRANSLATION_ERROR_DETECTION
SNARKS
SPORTS_UNDERSTANDING
TEMPORAL_SEQUENCES
TRACKING_SHUFFLED_OBJECTS_FIVE_OBJECTS
TRACKING_SHUFFLED_OBJECTS_SEVEN_OBJECTS
TRACKING_SHUFFLED_OBJECTS_THREE_OBJECTS
WEB_OF_LIES
WORD_SORTING
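Assuming `BigBenchHardTask` behaves like a standard Python `Enum` (which its usage above suggests), you can select all tasks, or any subset, programmatically rather than listing each member by hand:

```python
from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Iterate over the enum to get every task, then filter down to a subset
all_tasks = list(BigBenchHardTask)
logic_tasks = [task for task in all_tasks if "LOGICAL_DEDUCTION" in task.name]

benchmark = BigBenchHard(tasks=logic_tasks)  # n_shots and enable_cot keep their defaults
```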