Qwen2.5-7B-Intuitor-MATH-1EPOCH

This model is a fine-tuned version of Qwen/Qwen2.5-7B trained on the MATH dataset for one epoch using the Intuitor method.

Description

Intuitor is an implementation of Reinforcement Learning from Internal Feedback (RLIF), presented in the paper Learning to Reason without External Rewards. RLIF enables large language models to learn from intrinsic signals—specifically "self-certainty"—without relying on external rewards, gold labels, or verifiers. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, allowing for effective unsupervised learning across reasoning domains like mathematics and code generation.
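The reward substitution described above can be sketched concretely. The following is a minimal illustration, assuming self-certainty is measured as the average KL divergence from the uniform distribution to the model's next-token distributions, and that group-relative advantages are computed by normalizing scores within a sampled group as in GRPO; the function names are illustrative, not the paper's reference implementation:

```python
import numpy as np

def self_certainty(token_probs: np.ndarray) -> float:
    """Average KL(U || p_i) over the generated tokens.

    token_probs: (n_tokens, vocab_size) array, each row a next-token
    distribution produced while generating a response. A peaked
    (confident) distribution is far from uniform, so higher values
    indicate higher self-certainty. For the exact uniform distribution
    the score is 0.
    """
    _, vocab_size = token_probs.shape
    u = 1.0 / vocab_size  # uniform probability over the vocabulary
    # KL(U || p) = sum_j u * log(u / p_j), averaged over token positions
    kl_per_token = (u * np.log(u / token_probs)).sum(axis=1)
    return float(kl_per_token.mean())

def group_relative_advantages(scores: list[float]) -> np.ndarray:
    """GRPO-style advantage: normalize scores within one sampled group.

    In Intuitor, `scores` would be the self-certainty of each response
    in the group rather than an external reward.
    """
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-8)
```

A response whose tokens are predicted confidently (e.g. rows like `[0.97, 0.01, 0.01, 0.01]`) scores higher than one with near-uniform rows, and the group normalization turns those raw scores into zero-mean advantages for the policy update.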

Citation

@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
Model details

Model size: 8B params
Tensor type: BF16
Weights format: Safetensors