Qwen2.5-7B-Intuitor-MATH-1EPOCH

This model is a fine-tuned version of Qwen/Qwen2.5-7B trained on the MATH dataset for one epoch using the Intuitor method.

Description

Intuitor is an implementation of Reinforcement Learning from Internal Feedback (RLIF), presented in the paper Learning to Reason without External Rewards. RLIF enables large language models to learn from intrinsic signals—specifically "self-certainty"—without relying on external rewards, gold labels, or verifiers. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, allowing for effective unsupervised learning across reasoning domains like mathematics and code generation.
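The reward substitution described above can be sketched concretely. The following is a minimal illustration, assuming self-certainty is measured as the average KL divergence from the uniform distribution to the model's next-token distributions, and that group-relative advantages are computed by normalizing scores within a sampled group as in GRPO; the function names are illustrative, not the paper's reference implementation:

```python
import numpy as np

def self_certainty(token_probs: np.ndarray) -> float:
    """Average KL(U || p_i) over the generated tokens.

    token_probs: (n_tokens, vocab_size) array, each row a next-token
    distribution produced while generating a response. A peaked
    (confident) distribution is far from uniform, so higher values
    indicate higher self-certainty. For the exact uniform distribution
    the score is 0.
    """
    _, vocab_size = token_probs.shape
    u = 1.0 / vocab_size  # uniform probability over the vocabulary
    # KL(U || p) = sum_j u * log(u / p_j), averaged over token positions
    kl_per_token = (u * np.log(u / token_probs)).sum(axis=1)
    return float(kl_per_token.mean())

def group_relative_advantages(scores: list[float]) -> np.ndarray:
    """GRPO-style advantage: normalize scores within one sampled group.

    In Intuitor, `scores` would be the self-certainty of each response
    in the group rather than an external reward.
    """
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-8)
```

A response whose tokens are predicted confidently (e.g. rows like `[0.97, 0.01, 0.01, 0.01]`) scores higher than one with near-uniform rows, and the group normalization turns those raw scores into zero-mean advantages for the policy update.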

Citation

@article{zhao2025learning,
  title   = {Learning to Reason without External Rewards},
  author  = {Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal = {arXiv preprint arXiv:2505.19590},
  year    = {2025}
}
Model details

Model size: 8B params
Tensor type: BF16
Weights format: Safetensors