Abstract
Sequence-Level PPO addresses instability in long chain-of-thought reasoning by reformulating the process as a contextual bandit problem with a decoupled scalar value function for improved efficiency.
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
Community
We introduce SPPO (Sequence-Level PPO), a scalable RL algorithm for aligning reasoning LLMs that resolves the fundamental tension between PPO's unstable credit assignment and GRPO's costly multi-sampling.
Standard token-level PPO struggles in long Chain-of-Thought (CoT) reasoning due to the "Tail Effect": the critic overfits positional cues and fails to propagate sparse rewards across thousands of tokens. While GRPO sidesteps this with group-based baselines, it demands N>1 samples per prompt, severely bottlenecking training throughput.
Our key insight: GRPO's success stems from implicitly treating reasoning as a Sequence-Level Contextual Bandit. SPPO makes this explicit by collapsing the entire reasoning chain into a single atomic action and employing a learned scalar value function V(s_p) to estimate prompt solvability, enabling stable single-sample (N=1) updates.
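To make the bandit formulation concrete, here is a minimal sketch of the sequence-level update described above. This is not the paper's implementation; the function names are hypothetical, and it assumes a verifiable outcome reward, a scalar critic prediction V(s_p), and summed token log-probabilities for the whole response under the new and old policies.

```python
import math

def sppo_advantage(reward, value_pred):
    # Sequence-level advantage: outcome reward minus the critic's scalar
    # estimate of prompt solvability V(s_p). Because the whole chain is
    # one atomic action, no per-token credit assignment is needed.
    return reward - value_pred

def sppo_policy_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    # Importance ratio over the entire sequence: exp of the difference of
    # summed token log-probs under the new vs. old policy.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Standard PPO clipped surrogate (maximized), negated to form a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)

def critic_loss(reward, value_pred):
    # The decoupled scalar critic is regressed directly on the outcome
    # reward, so it can be much smaller than the policy model.
    return (value_pred - reward) ** 2
```

For example, a correct answer (reward 1.0) on a prompt the critic rates at 0.4 solvability yields an advantage of 0.6 from a single sample, where GRPO would need a group of N>1 rollouts to form a baseline.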
Highlights:
- Outperforms standard PPO and matches GRPO (N=8) on AIME24/25, AMC23, MATH500, and Minerva Math at both 1.5B and 7B scales
- 5.9× training speedup over GRPO with single-sample efficiency
- Decoupled Critic: a lightweight 1.5B critic successfully aligns a 7B policy, reducing VRAM by 12.8% while achieving the highest average score (58.56)
- Validated beyond LLMs on classic control tasks (CartPole, Hopper, MountainCar, LunarLander, Pendulum) under the RLVR framework
Paper (ACL 2026 Main): https://arxiv.org/abs/2604.08865
Code: https://github.com/sustech-nlp/SPPO
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization (2026)
- Hindsight Credit Assignment for Long-Horizon LLM Agents (2026)
- ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models (2026)
- Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/sppo-sequence-level-ppo-for-long-horizon-reasoning-tasks-7765-e8c183eb
Covers the executive summary, detailed methodology, and practical applications.