THINK-Bench

Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

1School of Artificial Intelligence, Jilin University,
2Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University,
3International Center of Future Science, Jilin University

Introduction

Reasoning models have achieved significant advances on complex tasks, often surpassing traditional large language models. However, overthinking remains a common problem that significantly hinders computational efficiency: models produce many redundant tokens that contribute little to answer accuracy, especially on simpler tasks, wasting considerable computational resources.

To address this issue systematically, we introduce Think-Bench, a benchmark designed to evaluate the thinking efficiency of large reasoning models (LRMs). We propose a new efficiency metric and conduct a comprehensive analysis of LRMs from multiple aspects, including the reasoning process and chain-of-thought (CoT) characteristics.

Leveraging Think-Bench and a novel evaluation strategy, we conduct a comprehensive analysis of LRMs and uncover several key insights: (1) most LRMs tend to overthink on simple tasks, generating unnecessarily long reasoning chains, while they are more efficient on hard problems; (2) there is a significant trade-off between efficiency and CoT quality across models: Grok-3-mini-beta achieves the highest efficiency score, while models such as Qwen3-235b-a22b and Ernie-x1-turbo-32k stand out in CoT quality; (3) models show task heterogeneity across disciplines: mathematical tasks generally have high token consumption and low reasoning efficiency, whereas chemistry and physics tasks show higher reasoning efficiency and a lower token occupancy rate. We hope Think-Bench serves as an important benchmark for optimizing the performance of large reasoning models in the future.

The performance of various LRMs on Think-Bench.

Leaderboard

Model name | Efficiency | Recall | Precision | Accuracy | Reflection Quality | Thought Num | Tokens | Useful Tokens | Reflection Tokens
Claude-3.7-sonnet | 49.61% | 81.29% | 86.26% | 94.25% | 76.49% | 0.28 | 942.82 | 446.09 | 496.73
Deepseek-r1-distill-qwen-1.5b | 37.14% | 47.10% | 59.61% | 62.91% | 61.88% | 8.00 | 3734.49 | 1268.36 | 2466.13
Deepseek-r1-distill-qwen-7b | 49.53% | 63.65% | 77.29% | 68.51% | 77.70% | 9.42 | 3504.76 | 1641.91 | 1862.85
Deepseek-r1-distill-qwen-14b | 50.70% | 61.04% | 79.97% | 70.18% | 82.40% | 7.04 | 2814.75 | 1413.09 | 1401.66
Deepseek-r1-distill-qwen-32b | 52.62% | 64.17% | 83.76% | 75.93% | 84.46% | 6.27 | 2697.70 | 1352.93 | 1344.77
Deepseek-r1 | 48.96% | 80.80% | 88.33% | 88.80% | 90.92% | 9.17 | 3795.19 | 1912.12 | 1883.07
Ernie-x1-turbo-32k | 47.02% | 82.03% | 88.67% | 89.89% | 90.97% | 12.75 | 4692.21 | 2221.32 | 2470.89
Grok-3-mini-beta | 61.69% | 81.56% | 86.51% | 91.85% | 88.20% | 0.38 | 1891.34 | 1169.05 | 722.29
Qwen3-235b-a22b | 46.14% | 85.80% | 86.97% | 94.91% | 92.16% | 13.35 | 4969.05 | 2448.29 | 2520.76
Qwq-plus | 44.58% | 80.40% | 85.08% | 89.60% | 89.67% | 22.63 | 5738.37 | 2646.73 | 3091.64
Glm-z1-air | 47.41% | 80.16% | 83.18% | 88.06% | 89.17% | 9.80 | 3678.68 | 1775.07 | 1903.61

Think-Bench Dataset

Overview

Think-Bench is a pioneering benchmark designed to systematically evaluate thinking efficiency and Chain-of-Thought (CoT) quality in Large Reasoning Models (LRMs). It comprises 1,375 questions across mathematics, physics, and chemistry, with a balanced mix of simple and difficult tasks, and features 13,311 manually curated key reasoning step annotations for analyzing model behavior in structured, multi-step reasoning processes.
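For illustration only, one annotated Think-Bench item might look like the following record; the field names are hypothetical and simply show the kind of structure implied by the per-question key reasoning step annotations.

```python
# Hypothetical layout of a single Think-Bench item (field names are
# illustrative; the released dataset may use different keys).
example_item = {
    "subject": "mathematics",      # one of: mathematics, physics, chemistry
    "difficulty": "simple",        # simple or difficult
    "question": "...",
    "answer": "...",
    "key_steps": [                 # manually curated key reasoning steps
        "Translate the problem statement into an equation.",
        "Solve the equation for the unknown.",
        "Verify the solution against the original conditions.",
    ],
}
```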


Overview of Think-Bench.


Distribution of Think-Bench.


Key statistics of Think-Bench.

Evaluation Strategy

Think-Bench introduces a comprehensive framework to evaluate thinking efficiency and Chain-of-Thought (CoT) quality through six efficiency metrics, two CoT quality metrics, and Accuracy:

Efficiency Metrics:
  • Tokens: Total token count to measure thinking cost.
  • First Correct Tokens: Tokens used to reach the first correct answer, assessing reasoning speed.
  • Efficiency Score: Ratio of first correct tokens to total tokens, quantifying resource utilization.
  • Reflection Quality: Proportion of valid self-verifications (e.g., error detection, new insights) among reflective steps.
  • Reflection Tokens: Tokens used for post-answer verification, indicating redundancy.
  • Thought Num: Frequency of reasoning path changes, reflecting stability.
CoT Quality Metrics:
  • Recall: Proportion of essential reasoning steps captured in the model's output.
  • Precision: Ratio of correct and relevant steps to total generated steps, penalizing logical errors.
Accuracy: Proportion of correct answers.
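
To make the definitions above concrete, here is a minimal Python sketch of how the ratio-style metrics could be computed from per-response counts. The function and argument names (e.g. first_correct_tokens, valid_reflections, matched_key_steps) are illustrative assumptions, not the official Think-Bench implementation.

```python
# Minimal sketch of the ratio-style Think-Bench metrics described above.
# All counts are assumed to be extracted per response beforehand; the names
# are illustrative, not the released evaluation code.

def efficiency_score(first_correct_tokens: int, total_tokens: int) -> float:
    """Efficiency: share of the thinking budget spent before the first correct answer."""
    return first_correct_tokens / total_tokens if total_tokens else 0.0

def reflection_quality(valid_reflections: int, total_reflections: int) -> float:
    """Reflection Quality: proportion of reflective steps that detect errors or add insight."""
    return valid_reflections / total_reflections if total_reflections else 0.0

def cot_recall(matched_key_steps: int, annotated_key_steps: int) -> float:
    """Recall: share of annotated key reasoning steps recovered in the model's CoT."""
    return matched_key_steps / annotated_key_steps if annotated_key_steps else 0.0

def cot_precision(correct_steps: int, generated_steps: int) -> float:
    """Precision: share of generated steps that are correct and relevant."""
    return correct_steps / generated_steps if generated_steps else 0.0

# Example: a response whose first correct answer appears after 1,169 of 1,891
# reasoning tokens has an efficiency score of about 0.618 (61.8%).
print(round(efficiency_score(1169, 1891), 3))
```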


Pipeline of Thinking Efficiency and CoT Quality Evaluation.

Experiment Results

More Results

Efficiency Evaluation Example

Error Example

BibTeX

@misc{li2025thinkbench,
      title={THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models}, 
      author={Zhiyuan Li and Yi Chang and Yuan Wu},
      year={2025},
      eprint={2505.22113},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22113}, 
}