Reasoning models have achieved significant advances on complex tasks, often surpassing traditional large language models. However, overthinking remains a common challenge that significantly hinders computational efficiency: models produce an excess of redundant tokens that contribute little to answer accuracy, especially on simpler tasks, wasting considerable resources.
To address this issue systematically, we introduce Think-Bench, a benchmark designed to evaluate the thinking efficiency of large reasoning models (LRMs). We propose a new efficiency metric and analyze LRMs from multiple perspectives, including the reasoning process and chain-of-thought (CoT) characteristics.
Leveraging Think-Bench and a novel evaluation strategy, our comprehensive analysis of LRMs uncovers several key insights: (1) most LRMs tend to overthink on simple tasks, generating unnecessarily long reasoning chains, while showing higher efficiency on hard problems; (2) there is a significant trade-off between efficiency and CoT quality across models: Grok-3-mini-beta achieves the highest efficiency score, while models such as Qwen3-235b-a22b and Ernie-x1-turbo-32k stand out in CoT quality; (3) models exhibit task heterogeneity across disciplines: mathematical tasks generally incur high token consumption and low reasoning efficiency, whereas chemistry and physics tasks show higher reasoning efficiency and lower token occupancy rates. We hope Think-Bench serves as an important benchmark for optimizing the performance of large reasoning models in the future.
The performance of various LRMs on Think-Bench (thought counts and token counts are per-question averages).
Model name | Efficiency | Recall | Precision | Accuracy | Reflection Quality | Thought Num | Tokens | Useful Tokens | Reflection Tokens |
---|---|---|---|---|---|---|---|---|---|
Claude-3.7-sonnet | 49.61% | 81.29% | 86.26% | 94.25% | 76.49% | 0.28 | 942.82 | 446.09 | 496.73 |
Deepseek-r1-distill-qwen-1.5b | 37.14% | 47.10% | 59.61% | 62.91% | 61.88% | 8.00 | 3734.49 | 1268.36 | 2466.13 |
Deepseek-r1-distill-qwen-7b | 49.53% | 63.65% | 77.29% | 68.51% | 77.70% | 9.42 | 3504.76 | 1641.91 | 1862.85 |
Deepseek-r1-distill-qwen-14b | 50.70% | 61.04% | 79.97% | 70.18% | 82.40% | 7.04 | 2814.75 | 1413.09 | 1401.66 |
Deepseek-r1-distill-qwen-32b | 52.62% | 64.17% | 83.76% | 75.93% | 84.46% | 6.27 | 2697.70 | 1352.93 | 1344.77 |
Deepseek-r1 | 48.96% | 80.80% | 88.33% | 88.80% | 90.92% | 9.17 | 3795.19 | 1912.12 | 1883.07 |
Ernie-x1-turbo-32k | 47.02% | 82.03% | 88.67% | 89.89% | 90.97% | 12.75 | 4692.21 | 2221.32 | 2470.89 |
Grok-3-mini-beta | 61.69% | 81.56% | 86.51% | 91.85% | 88.20% | 0.38 | 1891.34 | 1169.05 | 722.29 |
Qwen3-235b-a22b | 46.14% | 85.80% | 86.97% | 94.91% | 92.16% | 13.35 | 4969.05 | 2448.29 | 2520.76 |
Qwq-plus | 44.58% | 80.40% | 85.08% | 89.60% | 89.67% | 22.63 | 5738.37 | 2646.73 | 3091.64 |
Glm-z1-air | 47.41% | 80.16% | 83.18% | 88.06% | 89.17% | 9.80 | 3678.68 | 1775.07 | 1903.61 |
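The token columns decompose cleanly: Useful Tokens plus Reflection Tokens equals Tokens (e.g., for Claude-3.7-sonnet, 446.09 + 496.73 = 942.82). A natural reading of the Efficiency column is a per-question average of the useful-to-total token ratio; the sketch below illustrates that bookkeeping under this assumption (names and definitions are ours, not the paper's exact implementation). Note that a mean of per-question ratios generally differs from the ratio of the per-model means above, which is why 446.09 / 942.82 ≈ 47.3% need not equal the reported 49.61%.

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    """Token bookkeeping for one question (field names are illustrative)."""
    useful_tokens: int      # tokens contributing to reaching the answer (assumed definition)
    reflection_tokens: int  # remaining tokens spent re-checking / re-deriving

    @property
    def total_tokens(self) -> int:
        return self.useful_tokens + self.reflection_tokens

def efficiency(traces: list[ReasoningTrace]) -> float:
    """Mean per-question ratio of useful to total tokens.

    A mean of ratios, so it generally differs from
    sum(useful) / sum(total) computed from per-model averages.
    """
    return sum(t.useful_tokens / t.total_tokens for t in traces) / len(traces)

# Toy check: two questions with very different reflection overhead.
traces = [ReasoningTrace(400, 100), ReasoningTrace(500, 2500)]
print(f"{efficiency(traces):.2%}")  # (0.80 + 0.1667) / 2 ≈ 48.33%
```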
Think-Bench is a pioneering benchmark designed to systematically evaluate thinking efficiency and Chain-of-Thought (CoT) quality in Large Reasoning Models (LRMs). Comprising 1,375 questions across mathematics, physics, and chemistry with a balanced mix of simple and difficult tasks, the benchmark features 13,311 manually curated key-reasoning-step annotations for analyzing model behavior in structured, multi-step reasoning processes.
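To make the dataset description concrete, a single item plausibly bundles a question with its discipline, difficulty label, and curated key reasoning steps. The record below is purely illustrative; every field name is an assumption rather than the released Think-Bench format.

```python
# Illustrative only: field names are assumptions, not the released Think-Bench schema.
example_item = {
    "question": "A ball is thrown upward at 10 m/s; how long until it returns?",
    "subject": "physics",      # mathematics | physics | chemistry
    "difficulty": "easy",      # simple vs. difficult split
    "answer": "about 2.0 s (t = 2v/g with g = 9.8 m/s^2)",
    "key_steps": [             # manually curated key reasoning steps (13,311 in total)
        "Use symmetry: time up equals time down",
        "Time to apex: t_up = v / g = 10 / 9.8",
        "Total time: t = 2 * t_up ≈ 2.0 s",
    ],
}
```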
Overview of Think-Bench.
Distribution of Think-Bench.
Key statistics of Think-Bench.
Think-Bench introduces a comprehensive framework to evaluate thinking efficiency and Chain-of-Thought (CoT) quality through six efficiency metrics (Efficiency, Reflection Quality, Thought Num, Tokens, Useful Tokens, and Reflection Tokens, as reported in the table above), two CoT quality metrics (Recall and Precision), and Accuracy.
Pipeline of Thinking Efficiency and CoT Quality Evaluation.
Comparative Performance of Models in Chemistry, Physics, and Math.
Evaluation Results of CoT and Efficiency in Think-Bench Classified by Difficulty Levels.
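As a rough illustration of how CoT quality could be scored against the curated key steps: Recall can be read as the fraction of annotated key steps that the model's chain covers, and Precision as the fraction of the model's steps that match some annotated key step. The sketch below assumes a matching oracle (in practice likely a semantic matcher such as an LLM judge); all names are hypothetical, and this is not the paper's exact procedure.

```python
def cot_quality(model_steps: list[str], key_steps: list[str],
                matches) -> tuple[float, float]:
    """Return (recall, precision) of a model's chain of thought.

    `matches(step, key)` decides whether a model step covers a key step;
    it is a placeholder for a real semantic matcher.
    """
    covered = [k for k in key_steps if any(matches(s, k) for s in model_steps)]
    relevant = [s for s in model_steps if any(matches(s, k) for k in key_steps)]
    recall = len(covered) / len(key_steps) if key_steps else 0.0
    precision = len(relevant) / len(model_steps) if model_steps else 0.0
    return recall, precision

# Toy usage with a trivial substring matcher standing in for a real judge.
matcher = lambda step, key: key.lower() in step.lower()
r, p = cot_quality(
    ["First, t_up = v / g.", "So total time t = 2 * t_up.", "Double-check units."],
    ["t_up = v / g", "t = 2 * t_up"],
    matcher,
)
print(f"recall={r:.0%}, precision={p:.0%}")  # recall=100%, precision=67%
```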
@misc{li2025thinkbench,
title={THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models},
author={Zhiyuan Li and Yi Chang and Yuan Wu},
year={2025},
eprint={2505.22113},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.22113},
}