【Coursera GenAI with LLM】 Week 3 Reinforcement Learning from Human Feedback Class Notes

Posted by MiraMira on 2024-03-15

Helpful? Honest? Harmless? Make sure the AI responds in those three ways (the HHH criteria).

If it doesn't, we can use RLHF to reduce the toxicity of the LLM.

Reinforcement learning is a type of machine learning in which an agent learns to make decisions related to a specific goal by taking actions in an environment, with the objective of maximizing some notion of cumulative reward. RLHF can also help build personalized LLMs.
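To make the agent-environment loop concrete, here is a toy multi-armed-bandit sketch in Python (the actions, reward function, and epsilon-greedy rule are illustrative assumptions, not course material): the agent repeatedly picks an action, observes a reward, and updates its value estimates so that cumulative reward grows over time.

```python
# Toy illustration (not from the course): an agent picks actions to maximize
# cumulative reward, updating its value estimates from observed rewards.
import random

actions = ["a", "b", "c"]                    # hypothetical action space
values = {a: 0.0 for a in actions}           # running estimate of each action's value
counts = {a: 0 for a in actions}
epsilon = 0.1                                # exploration rate

def get_reward(action):
    """Hypothetical environment: action 'b' secretly yields the highest reward."""
    means = {"a": 0.1, "b": 1.0, "c": 0.3}
    return random.gauss(means[action], 0.5)

total_reward = 0.0
for step in range(1000):
    # epsilon-greedy: mostly exploit the best-known action, sometimes explore
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: values[a])
    r = get_reward(action)
    total_reward += r
    counts[action] += 1
    # incremental average: move the estimate toward the observed reward
    values[action] += (r - values[action]) / counts[action]

print(values, total_reward)
```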

RLHF cycle: iterate until the reward score is high enough:

  1. Select an instruct model and define your model alignment criterion (e.g., helpfulness)
  2. Obtain human feedback through a labeler workforce to rank the completions
  3. Convert rankings into pairwise training data for the reward model
  4. Train reward model to predict preferred completion from {y_j, y_k} for prompt x
  5. Use the reward model as a binary classifier to automatically provide a reward value for each prompt-completion pair (see the sketch after this list)
    The lower the reward score, the worse the completion satisfies the alignment criterion
    softmax(logits) = probabilities: the reward model's logits are converted into preference probabilities via softmax
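A minimal sketch of how such a reward model can be trained on the pairwise data from step 3, assuming a toy setup with pre-computed embeddings instead of a full LLM (the class, shapes, and optimizer settings below are assumptions for illustration):

```python
# Toy sketch (assumed setup, not the course's code): a reward model scores a
# (prompt, completion) pair; softmax over the scores of the preferred (y_j) and
# rejected (y_k) completions gives the probability that y_j wins.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    """Stand-in for an LLM with a scalar reward head; inputs are pre-computed embeddings."""
    def __init__(self, dim=16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb):                    # emb: (batch, dim)
        return self.head(emb).squeeze(-1)      # one scalar reward (logit) per pair

reward_model = ToyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake batch: embeddings of the preferred and rejected completions for the same prompts
emb_preferred = torch.randn(8, 16)
emb_rejected = torch.randn(8, 16)

r_j = reward_model(emb_preferred)
r_k = reward_model(emb_rejected)

# softmax(logits) = probabilities: probability that the preferred completion wins;
# minimizing -log of that probability trains the model to rank y_j above y_k
logits = torch.stack([r_j, r_k], dim=-1)       # (batch, 2)
probs = torch.softmax(logits, dim=-1)
loss = -torch.log(probs[:, 0]).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note how minimizing the negative log of the softmax probability of the preferred completion is exactly the "binary classifier" view described in step 5.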

RL Algorithm

  • The RL algorithm updates the weights of the LLM based on the reward assigned to the completions generated by the current version of the LLM
  • ex. Q-Learning, PPO (Proximal Policy Optimization, the most popular method)
  • PPO optimizes the LLM to be more aligned with human preferences (a sketch of its clipped objective follows below)
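A minimal sketch of PPO's clipped surrogate objective, the piece that keeps each policy update close to the previous policy (the tensors below are toy values; in a real RLHF loop the log-probs and advantages come from the LLM and the reward signal):

```python
# Toy sketch of PPO's clipped surrogate objective (illustrative tensors only;
# a real RLHF trainer computes per-token log-probs and advantages from the LLM).
import torch

def ppo_policy_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss: large policy jumps away from the old policy are clipped."""
    ratio = torch.exp(logprobs_new - logprobs_old)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize -> minimize negative

logprobs_new = torch.randn(8, requires_grad=True)    # current policy's token log-probs
logprobs_old = torch.randn(8)                        # frozen copy from before the update
advantages = torch.randn(8)                          # how much better than expected each token was
loss = ppo_policy_loss(logprobs_new, logprobs_old, advantages)
loss.backward()
print(loss.item())
```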

Reward hacking: the model achieves a high reward score but doesn't actually align with the criterion, so the quality is not improved (e.g., it pads completions with exaggerated wording that the reward model happens to score well)

  • To avoid this, we can use the initial instruct model as a frozen reference model: during training, we pass the prompt dataset to both the reference model and the RL-updated LLM

  • Then, we calculate a KL divergence shift penalty between the two models' output distributions (KL divergence is a statistical measure of how different two probability distributions are)

  • Add the penalty to the reward, then continue the PPO update as usual; with PEFT, only the adapter weights are updated, so the reference model and the RL-updated model can share the same frozen base weights (see the sketch after this list)
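A minimal sketch of the KL shift penalty being folded into the reward, assuming we have per-token log-probabilities of the sampled completion under both the RL-updated model and the reference model (the function name, shapes, and kl_coef value are illustrative):

```python
# Toy sketch of the KL shift penalty being subtracted from the reward
# (names, shapes, and kl_coef are illustrative; libraries such as TRL apply this internally).
import torch

def kl_penalized_reward(reward, logprobs_rl, logprobs_ref, kl_coef=0.1):
    """Penalize the RL-updated model for drifting away from the frozen reference model."""
    kl_estimate = logprobs_rl - logprobs_ref   # per-token KL estimate along the sampled completion
    return reward - kl_coef * kl_estimate.sum()

# log-probs of the sampled completion's tokens under each model, plus the reward model's score
logprobs_rl = torch.tensor([-1.2, -0.8, -2.0])
logprobs_ref = torch.tensor([-1.5, -0.9, -1.8])
print(kl_penalized_reward(torch.tensor(0.7), logprobs_rl, logprobs_ref))
```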

Constitutional AI

  • First proposed in 2022 by researchers at Anthropic
  • A method for training models using a set of rules and principles that govern the model's behavior

Red Teaming: deliberately prompt the model to generate harmful responses; the model then critiques and revises those responses against the constitutional principles, so the harmful content is removed from the data used for fine-tuning
