Students take standardized tests in school to evaluate their performance relative to their classmates. This is the same principle behind artificial intelligence (AI) benchmarks, which test each model’s abilities and show companies where they need to improve. While there’s no perfect test, benchmarks can help AI companies create safer, more reliable models for personal and business use.
An AI system is a model that uses machine learning to analyze data, generate content, and make judgments. Before companies release AI systems to the public, the models undergo training and testing. Yet, even after rigorous tests, models can provide false information or generate harmful content.
This is where AI benchmarks come into play. A benchmark is a test that evaluates the model’s performance and compares it to other systems or a standardized set of answers. When companies get the results, they can spot areas that need improvement and assess how their models compare to other AI software on the market.
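To make that comparison concrete, here is a minimal sketch in Python of a scoring loop that checks a model's answers against a standardized answer key. The toy dataset and `dummy_model` are hypothetical placeholders, not a real benchmark or API.

```python
def evaluate(model, dataset):
    """Return the fraction of questions the model answers exactly right."""
    correct = 0
    for question, gold_answer in dataset:
        prediction = model(question)
        if prediction.strip().lower() == gold_answer.strip().lower():
            correct += 1
    return correct / len(dataset)

# Hypothetical stand-ins for a real benchmark dataset and model.
toy_dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

def dummy_model(question):
    return "Paris" if "France" in question else "4"

print(f"Score: {evaluate(dummy_model, toy_dataset):.0%}")  # Score: 100%
```

Real benchmarks work the same way at a much larger scale, with thousands of test items and more forgiving answer matching.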
How do AI benchmarks work?

During the test, AI models complete a set of tasks that may include pulling information from a question or image dataset. Common tasks include translation, object recognition, code generation, language comprehension, and reasoning. Afterward, the developers receive a score that helps them gauge their model's usefulness. That score is usually built from a handful of standard metrics (sketched in code after the list below):
- Accuracy: The share of all responses the model gets right.
- Precision: Of the items the model flags as positive, the fraction that are actually correct.
- Recall: Of all the true positives in the test set, the fraction the model actually finds.
- F1 score: The harmonic mean of precision and recall, balancing the two in a single number.
- Latency: How long the model takes to return a response.
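As a rough illustration, not tied to any particular benchmark suite, the classification metrics above can be computed from paired predictions and ground-truth labels, and latency can be measured by timing a model call. The `model` argument in `measure_latency` is a hypothetical callable.

```python
import time

def precision_recall_f1(predictions, labels):
    """Compute precision, recall, and F1 from paired binary predictions and labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def measure_latency(model, prompt):
    """Time a single call to `model`, a hypothetical callable that returns a response."""
    start = time.perf_counter()
    model(prompt)
    return time.perf_counter() - start

# Toy example: four predictions compared against four ground-truth labels.
preds = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
p, r, f1 = precision_recall_f1(preds, labels)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.67 recall=1.00 f1=0.80
```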
Prominent AI benchmarks and their creators
Some of the most famous AI benchmarks include:
- ImageNet: Stanford University created ImageNet to test image recognition.
- GLUE/SuperGLUE: Researchers from New York University, the University of Washington's Allen School, and DeepMind developed GLUE to evaluate language comprehension. SuperGLUE is its more challenging successor.
- COCO: Sponsored by Facebook, Microsoft, and others, COCO uses a large dataset to test object recognition.
- SQuAD: Stanford University developed SQuAD as a benchmark for reading comprehension.
- BLEU: IBM created this classic benchmark in 2002 to evaluate machine translation quality (a simplified version is sketched after this list).
- LiveBench: Created by Abacus.AI, LiveBench regularly refreshes its questions so scores aren't inflated by test-data contamination.
- WinoGrande: The University of Washington and the Allen Institute for AI developed WinoGrande as a benchmark for commonsense reasoning.
- HumanEval: OpenAI built HumanEval to test a model’s code-generation abilities.
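To give a feel for how one of these benchmarks scores output, here is a simplified, single-sentence version of BLEU's core idea: modified n-gram precision combined with a brevity penalty. Real BLEU is computed over a whole corpus, usually up to 4-grams and with smoothing, so treat this as a sketch rather than the official metric.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Toy BLEU: geometric mean of modified n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count to how often it appears in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0  # real BLEU smooths this case instead of returning zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates that are shorter than the reference.
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * geo_mean

print(round(simple_bleu("the cat sat on the mat", "the cat is on the mat"), 3))  # 0.707
```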
What benchmark results can tell us
Benchmark results tell developers how well their AI performs compared to other models. Once they've analyzed their model's strengths and weaknesses, they may release it to the public or continue fine-tuning the software to improve its score. This helps developers set goals and meet their deadlines.
Additionally, benchmarks provide a look at AI’s overall progress. As models pass these exams, developers create more advanced benchmarks that challenge AI to reach new heights. These exams may contain larger datasets, higher standards, increasingly difficult tasks, and more resources for creators.
With this information, businesses can choose the best model for their tasks. Depending on their needs, they may want software that excels in translation, text analysis, image detection, mathematics, predictions, or writing. High benchmark scores may give companies confidence that they’re investing in the right software.
Shortcomings and limitations of AI benchmarks
Benchmarks can encourage developers to "teach to the test," building models that ace the exams but fail to meet real-world challenges, such as summarizing documents or generating poetry. When businesses train their models on a limited dataset, the software might also stumble when it encounters a prompt it has never seen before.
These datasets can also carry biases that skew the model's scores or push it to produce harmful material. When scrapers collect text and pictures for these datasets, for example, they may include racist or sexist content that later surfaces in the model's output as microaggressions or other biased responses.
Many benchmarks are also outdated, which leads to flawed results. A model might earn a perfect score even though the exam pulled its material from old or broken websites that hosted inaccurate data. This forces companies to retest their models with newer, more demanding benchmarks, such as the ultra-challenging "Humanity's Last Exam."
Additionally, some tasks are simply difficult to evaluate. A benchmark can check a model on basic facts, but it is much harder to grade open-ended work, such as writing an original song or answering trick questions. Without rigorous training, an AI might underwhelm users by failing to show the human traits they expect, such as common sense.