Train a 7B model that outperforms GPT-4o?
How to unlock advanced reasoning via scalable RL?
A Tsinghua team proposed a new approach: PRIME (Process Reinforcement through Implicit Rewards), along with Eurus-2, a 7B model trained from a base model that surpasses Qwen2.5-Math-Instruct using only 1/10 of the data.
The open-source community has relied heavily on data-driven imitation learning for reasoning capabilities. While RL is widely seen as the way forward, two key challenges have held it back:
- Precise and scalable dense rewards
- RL algorithms that can fully utilize these rewards
Their solution: implicit process reward modeling.
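
The core idea behind implicit process reward modeling is that a model trained only on outcome labels implicitly defines dense token-level rewards as the log-ratio between the learned model and a frozen reference: r_t = β · log(π(y_t | y_<t) / π_ref(y_t | y_<t)). Below is a minimal sketch of computing such rewards with Hugging Face transformers, assuming the checkpoint paths and the β value are placeholders for illustration, not the actual Eurus-2 release:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint paths, for illustration only.
POLICY_PATH = "path/to/implicit-prm"   # model fine-tuned with outcome labels
REF_PATH = "path/to/reference-model"   # frozen reference (e.g., the SFT model)

tokenizer = AutoTokenizer.from_pretrained(REF_PATH)
policy = AutoModelForCausalLM.from_pretrained(POLICY_PATH).eval()
ref = AutoModelForCausalLM.from_pretrained(REF_PATH).eval()
BETA = 0.05  # assumed KL coefficient, a free hyperparameter

@torch.no_grad()
def implicit_process_rewards(prompt: str, response: str) -> torch.Tensor:
    """Token-level rewards r_t = beta * log(pi(y_t|y_<t) / pi_ref(y_t|y_<t))."""
    ids = tokenizer(prompt + response, return_tensors="pt")["input_ids"]
    n_prompt = len(tokenizer(prompt)["input_ids"])

    def token_logprobs(model: AutoModelForCausalLM) -> torch.Tensor:
        # Shift so position i scores the token at position i+1.
        logits = model(ids).logits[:, :-1]
        logp = logits.log_softmax(dim=-1)
        return logp.gather(-1, ids[:, 1:, None]).squeeze(-1)

    log_ratio = token_logprobs(policy) - token_logprobs(ref)
    # Keep only the rewards assigned to response tokens.
    return BETA * log_ratio[0, n_prompt - 1:]
```

Each response token gets its own reward without any step-level human annotation, which is what makes the dense reward signal scalable: outcome labels are cheap, while per-step labels are not.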