Categories

RLHF reward hacking

Reward hacking

Reward hacking is a situation where a model learns to optimise the reward signal while missing the real purpose of the task. The system technically does what it is rewarded ...

Reward hacking is a situation where a model learns to optimise the reward signal while missing the real purpose of the task. The system technically does what it is rewarded Read article

RLHF

RLHF, short for reinforcement learning from human feedback, is a training approach in which human preferences are used to help shape model behaviour. Instead of telling the model only what ...

RLHF, short for reinforcement learning from human feedback, is a training approach in which human preferences are used to help shape model behaviour. Instead of telling the model only what Read article