THINK-Bench

Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

1School of Artificial Intelligence, Jilin University,
2Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University,
3International Center of Future Science, Jilin University

Introduction

Reasoning models have achieved significant advances on complex tasks, often surpassing traditional large language models. However, overthinking remains a common problem that significantly hinders computational efficiency: models produce many redundant tokens that contribute little to answer accuracy, especially on simpler tasks, wasting considerable computational resources.

To address this issue systematically, we introduce Think-Bench, a benchmark designed to evaluate the thinking efficiency of large reasoning models (LRMs). We propose a new efficiency metric and conduct a comprehensive analysis of LRMs from multiple aspects, including the reasoning process and chain-of-thought (CoT) characteristics.

Leveraging Think-Bench and a novel evaluation strategy, we conduct a comprehensive analysis of LRMs and uncover several key insights: (1) most LRMs tend to overthink on simple tasks, generating unnecessarily long reasoning chains, while they are more efficient on hard problems; (2) there is a significant trade-off between efficiency and CoT quality across models: Grok-3-mini-beta achieves the highest efficiency score, while models such as Qwen3-235b-a22b and Ernie-x1-turbo-32k stand out in CoT quality; (3) models show task heterogeneity across disciplines: mathematical tasks generally have high token consumption and low reasoning efficiency, whereas chemistry and physics tasks show higher reasoning efficiency and a lower token occupancy rate. We hope Think-Bench serves as an important benchmark for optimizing the performance of large reasoning models in the future.

The performance of various LRMs on Think-Bench.

Leaderboard

Model name | Efficiency | Recall | Precision | Accuracy | Reflection Quality | Thought Num | Tokens | Useful Tokens | Reflection Tokens
Claude-3.7-sonnet | 49.61% | 81.29% | 86.26% | 94.25% | 76.49% | 0.28 | 942.82 | 446.09 | 496.73
Deepseek-r1-distill-qwen-1.5b | 37.14% | 47.10% | 59.61% | 62.91% | 61.88% | 8.00 | 3734.49 | 1268.36 | 2466.13
Deepseek-r1-distill-qwen-7b | 49.53% | 63.65% | 77.29% | 68.51% | 77.70% | 9.42 | 3504.76 | 1641.91 | 1862.85
Deepseek-r1-distill-qwen-14b | 50.70% | 61.04% | 79.97% | 70.18% | 82.40% | 7.04 | 2814.75 | 1413.09 | 1401.66
Deepseek-r1-distill-qwen-32b | 52.62% | 64.17% | 83.76% | 75.93% | 84.46% | 6.27 | 2697.70 | 1352.93 | 1344.77
Deepseek-r1 | 48.96% | 80.80% | 88.33% | 88.80% | 90.92% | 9.17 | 3795.19 | 1912.12 | 1883.07
Ernie-x1-turbo-32k | 47.02% | 82.03% | 88.67% | 89.89% | 90.97% | 12.75 | 4692.21 | 2221.32 | 2470.89
Grok-3-mini-beta | 61.69% | 81.56% | 86.51% | 91.85% | 88.20% | 0.38 | 1891.34 | 1169.05 | 722.29
Qwen3-235b-a22b | 46.14% | 85.80% | 86.97% | 94.91% | 92.16% | 13.35 | 4969.05 | 2448.29 | 2520.76
Qwq-plus | 44.58% | 80.40% | 85.08% | 89.60% | 89.67% | 22.63 | 5738.37 | 2646.73 | 3091.64
Glm-z1-air | 47.41% | 80.16% | 83.18% | 88.06% | 89.17% | 9.80 | 3678.68 | 1775.07 | 1903.61

Think-Bench Dataset

Overview

Think-Bench is a pioneering benchmark designed to systematically evaluate thinking efficiency and Chain-of-Thought (CoT) quality in Large Reasoning Models (LRMs). It comprises 1,375 questions across mathematics, physics, and chemistry, with a balanced mix of simple and difficult tasks, and features 13,311 manually curated key reasoning step annotations for analyzing model behavior in structured, multi-step reasoning processes.
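For illustration only, one annotated Think-Bench item might look like the following record; the field names are hypothetical and simply show the kind of structure implied by the per-question key reasoning step annotations.

```python
# Hypothetical layout of a single Think-Bench item (field names are
# illustrative; the released dataset may use different keys).
example_item = {
    "subject": "mathematics",      # one of: mathematics, physics, chemistry
    "difficulty": "simple",        # simple or difficult
    "question": "...",
    "answer": "...",
    "key_steps": [                 # manually curated key reasoning steps
        "Translate the problem statement into an equation.",
        "Solve the equation for the unknown.",
        "Verify the solution against the original conditions.",
    ],
}
```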


Overview of Think-Bench.


Distribution of Think-Bench.


Key statistics of Think-Bench.

Evaluation Strategy

Think-Bench introduces a comprehensive framework to evaluate thinking efficiency and Chain-of-Thought (CoT) quality through six efficiency metrics, two CoT quality metrics, and Accuracy:

Efficiency Metrics:
  • Tokens: Total token count to measure thinking cost.
  • First Correct Tokens: Tokens used to reach the first correct answer, assessing reasoning speed.
  • Efficiency Score: Ratio of first correct tokens to total tokens, quantifying resource utilization.
  • Reflection Quality: Proportion of valid self-verifications (e.g., error detection, new insights) among reflective steps.
  • Reflection Tokens: Tokens used for post-answer verification, indicating redundancy.
  • Thought Num: Frequency of reasoning path changes, reflecting stability.
CoT Quality Metrics:
  • Recall: Proportion of essential reasoning steps captured in the model's output.
  • Precision: Ratio of correct and relevant steps to total generated steps, penalizing logical errors.
Accuracy: Proportion of correct answers.
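
To make the definitions above concrete, here is a minimal Python sketch of how the ratio-style metrics could be computed from per-response counts. The function and argument names (e.g. first_correct_tokens, valid_reflections, matched_key_steps) are illustrative assumptions, not the official Think-Bench implementation.

```python
# Minimal sketch of the ratio-style Think-Bench metrics described above.
# All counts are assumed to be extracted per response beforehand; the names
# are illustrative, not the released evaluation code.

def efficiency_score(first_correct_tokens: int, total_tokens: int) -> float:
    """Efficiency: share of the thinking budget spent before the first correct answer."""
    return first_correct_tokens / total_tokens if total_tokens else 0.0

def reflection_quality(valid_reflections: int, total_reflections: int) -> float:
    """Reflection Quality: proportion of reflective steps that detect errors or add insight."""
    return valid_reflections / total_reflections if total_reflections else 0.0

def cot_recall(matched_key_steps: int, annotated_key_steps: int) -> float:
    """Recall: share of annotated key reasoning steps recovered in the model's CoT."""
    return matched_key_steps / annotated_key_steps if annotated_key_steps else 0.0

def cot_precision(correct_steps: int, generated_steps: int) -> float:
    """Precision: share of generated steps that are correct and relevant."""
    return correct_steps / generated_steps if generated_steps else 0.0

# Example: a response whose first correct answer appears after 1,169 of 1,891
# reasoning tokens has an efficiency score of about 0.618 (61.8%).
print(round(efficiency_score(1169, 1891), 3))
```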


Pipeline of Thinking Efficiency and CoT Quality Evaluation.

Experiment Results

More Results

Efficiency Evaluation Example

Error Example

BibTeX

@misc{li2025thinkbench,
      title={THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models}, 
      author={Zhiyuan Li and Yi Chang and Yuan Wu},
      year={2025},
      eprint={2505.22113},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22113}, 
}