Categories

reward model overfitting

Reward hacking

Reward hacking is a situation where a model learns to optimise the reward signal while missing the real purpose of the task. The system technically does what it is rewarded ...

RLHF

RLHF, short for reinforcement learning from human feedback, is a training approach in which human preferences are used to help shape model behaviour. Instead of telling the model only what ...

Overfitting

Overfitting is a situation where a machine learning model learns the training data too closely and performs poorly on new data. The model may look very accurate during training, but ...
