Enough Thinking: Efficient Reasoning via GRPO + SEAL + MCP

🏭 Overview

Large Reasoning Models (LRMs) frequently over-generate chain-of-thought, even for simple problems, leading to unnecessary latency and cost. In this project, we study reasoning efficiency as a first-class optimization objective.

📊 Contributions

We present a two-stage reinforcement learning framework:

  • Phase-1 (GRPO): induces structured reasoning behavior.
  • Phase-2 (SEAL): internalizes recurring reasoning patterns to reduce token usage without sacrificing correctness.
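To make the Phase-1 objective concrete, here is a minimal sketch of GRPO's group-relative advantage computation combined with a length-penalized reward. The penalty shape (`length_penalized_reward`, `alpha`, `max_tokens`) is a hypothetical illustration of treating token usage as part of the reward, not the project's exact reward function:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each sampled completion is scored
    against the mean/std of its own sample group (no learned critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def length_penalized_reward(correct, n_tokens, max_tokens=1024, alpha=0.5):
    """Hypothetical efficiency-shaped reward: correctness bonus minus a
    penalty proportional to the fraction of the token budget consumed."""
    base = 1.0 if correct else 0.0
    return base - alpha * (n_tokens / max_tokens)

# One group of 4 sampled completions for the same prompt:
# (was the answer correct, how many reasoning tokens were emitted)
samples = [(True, 200), (True, 800), (False, 150), (True, 400)]
rewards = [length_penalized_reward(c, t) for c, t in samples]
advs = grpo_advantages(rewards)
```

Under this shaping, the shortest correct completion in the group receives the largest advantage, so the policy gradient pushes toward concise-but-correct reasoning.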

📈 Impact

  • Phase-2 achieves a 35–45% reduction in generated tokens with only minor accuracy degradation