Sep 7, 2020 · We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, ...
The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced ...
Jun 3, 2024 · This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning- ...
This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, ...
The current state-of-the-art on MMLU is Claude 3.5 Sonnet (5-shot). See a full comparison of 116 papers with code.
Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) is a multiple-choice question answering test that covers 57 tasks including ...
In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language ...
May 1, 2024 · A multiple-choice question answering test that covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
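
The snippets above all describe the same evaluation protocol: MMLU is scored as plain accuracy over four-way multiple-choice questions, reported per subject and averaged across the 57 subjects. As a concrete illustration, the following is a minimal Python sketch of that loop; the "cais/mmlu" Hugging Face dataset ID and the model_predict stub are assumptions made for this example, not details given in the sources above.

# Minimal sketch of MMLU-style accuracy scoring.
# Assumptions: the community "cais/mmlu" mirror on the Hugging Face Hub,
# and a hypothetical model_predict stub standing in for a real model.
from datasets import load_dataset

def model_predict(question: str, choices: list[str]) -> int:
    # Hypothetical stand-in: a real evaluation would query a language
    # model and return the index of its chosen answer choice.
    return 0

def evaluate_subject(subject: str = "elementary_mathematics") -> float:
    # Each of the 57 subjects is a separate dataset config; the "answer"
    # field holds the gold choice index (0-3).
    test_set = load_dataset("cais/mmlu", subject, split="test")
    correct = sum(
        model_predict(ex["question"], ex["choices"]) == ex["answer"]
        for ex in test_set
    )
    return correct / len(test_set)

print(f"elementary_mathematics accuracy: {evaluate_subject():.3f}")

Repeating this per subject and averaging the per-subject accuracies yields the single MMLU score that leaderboards such as the one cited above report.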