Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
GRPO
Learning GRPO with Claude's help.
General outline
- prompt -> llm -> completion -> reward (number measuring "goodness")
- the reward is computed on the output text
- REINFORCE algorithm: backpropagate through the log-probability of the chosen tokens, scaled by the reward
- better: scale by the advantage instead, where advantage = reward - baseline
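The REINFORCE-with-baseline idea above can be sketched numerically (the token probabilities, reward, and baseline here are made-up illustration values, not from any real model):

```python
import math

# Toy REINFORCE loss: minimize -(reward - baseline) * log P(completion),
# so gradient descent raises the probability of the sampled tokens when
# the advantage is positive and lowers it when negative.
def reinforce_loss(token_probs, reward, baseline):
    advantage = reward - baseline
    # log-prob of the whole completion = sum of per-token log-probs
    log_prob = sum(math.log(p) for p in token_probs)
    return -advantage * log_prob

# A completion whose tokens were sampled with these probabilities:
loss = reinforce_loss([0.5, 0.25, 0.8], reward=1.0, baseline=0.4)
```

In a real setup the log-probs come from the model's softmax over the vocabulary and the gradient flows through them; this scalar version just shows the shape of the objective.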
- in PPO, the baseline is learned by a "critic" (value function) that predicts the expected reward
- prompt -> critic network -> expected reward
- prompt -> policy network (LLM) -> completion -> actual reward
- hard in practice: a second model to train, and the critic must be kept updated alongside the policy
- GRPO sidesteps the learned critic and estimates the baseline empirically
- generate a group of completions per prompt
- baseline = average reward of the group
- advantage = each completion's delta from the group average
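A sketch of the group-relative advantage with made-up rewards (note: as published, GRPO also divides by the group's standard deviation to normalize, which the notes above don't mention):

```python
# GRPO-style empirical baseline: sample a group of completions for one
# prompt, score each, and use the group mean as the baseline.
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # advantage = delta from the group mean, normalized by group std
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four completions for the same prompt; two scored as correct:
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get positive advantage, the rest get negative, with no value network involved.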
- PPO clipping
- PPO clips the probability ratio between the new and old policy so that a single update can't move the policy too far from the policy that generated the data
- KL Penalty
- penalize the policy for drifting too far from a frozen reference model
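A minimal sketch of the penalty term, assuming a simple per-token KL estimator (log-prob difference between the policy and a frozen reference model, averaged over the completion; the probabilities here are illustrative):

```python
import math

# Per-token KL penalty: log pi(t) - log pi_ref(t) for each sampled
# token, averaged over the completion. Subtracting this (times a
# coefficient) from the reward discourages drift from the reference.
def kl_penalty(policy_probs, ref_probs):
    terms = [math.log(p) - math.log(q)
             for p, q in zip(policy_probs, ref_probs)]
    return sum(terms) / len(terms)

penalty = kl_penalty([0.5, 0.4], [0.45, 0.42])
```

Implementations differ in the exact estimator (some use an unbiased k3 variant), but this captures the idea: the penalty grows as the policy's token probabilities diverge from the reference model's.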