Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
GRPO
Learning GRPO with Claude's help.
General outline
- prompt -> llm -> completion -> reward (number measuring "goodness")
- the reward is computed on the output text
- REINFORCE algorithm: backpropagate through the log-probability of the chosen tokens, scaled by the reward
- better: scale by the advantage instead, where advantage = reward - baseline
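The REINFORCE-with-baseline idea above can be sketched numerically (the token probabilities, reward, and baseline here are made-up illustration values, not from any real model):

```python
import math

# Toy REINFORCE loss: minimize -(reward - baseline) * log P(completion),
# so gradient descent raises the probability of the sampled tokens when
# the advantage is positive and lowers it when negative.
def reinforce_loss(token_probs, reward, baseline):
    advantage = reward - baseline
    # log-prob of the whole completion = sum of per-token log-probs
    log_prob = sum(math.log(p) for p in token_probs)
    return -advantage * log_prob

# A completion whose tokens were sampled with these probabilities:
loss = reinforce_loss([0.5, 0.25, 0.8], reward=1.0, baseline=0.4)
```

In a real setup the log-probs come from the model's softmax over the vocabulary and the gradient flows through them; this scalar version just shows the shape of the objective.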
- in PPO, the baseline is learned by a "critic" (value function) that predicts the expected reward
- prompt -> critic network -> expected reward
- prompt -> policy network (LLM) -> completion -> actual reward
- hard in practice: a second model to train, and the critic must be kept updated alongside the policy
- GRPO sidesteps the learned critic and estimates the baseline empirically
- generate a group of completions per prompt
- baseline = average reward of the group
- advantage = each completion's delta from the group average
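A sketch of the group-relative advantage with made-up rewards (note: as published, GRPO also divides by the group's standard deviation to normalize, which the notes above don't mention):

```python
# GRPO-style empirical baseline: sample a group of completions for one
# prompt, score each, and use the group mean as the baseline.
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # advantage = delta from the group mean, normalized by group std
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four completions for the same prompt; two scored as correct:
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get positive advantage, the rest get negative, with no value network involved.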
- PPO clipping
- PPO clips the probability ratio between the new and old policy so that a single update can't move the policy too far from the policy that generated the data
- KL Penalty
- penalize the policy for drifting too far from a frozen reference model
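A minimal sketch of the penalty term, assuming a simple per-token KL estimator (log-prob difference between the policy and a frozen reference model, averaged over the completion; the probabilities here are illustrative):

```python
import math

# Per-token KL penalty: log pi(t) - log pi_ref(t) for each sampled
# token, averaged over the completion. Subtracting this (times a
# coefficient) from the reward discourages drift from the reference.
def kl_penalty(policy_probs, ref_probs):
    terms = [math.log(p) - math.log(q)
             for p, q in zip(policy_probs, ref_probs)]
    return sum(terms) / len(terms)

penalty = kl_penalty([0.5, 0.4], [0.45, 0.42])
```

Implementations differ in the exact estimator (some use an unbiased k3 variant), but this captures the idea: the penalty grows as the policy's token probabilities diverge from the reference model's.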