#human-preference

ICPL Baseline Methods: Disagreement Sampling and PrefPPO for Reward Learning | HackerNoon

The disagreement sampling scheme improves reward learning by selecting the trajectory pairs whose predicted preferences have the highest variance across the reward-model ensemble.
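To make the variance-driven selection concrete, here is a minimal sketch of how disagreement sampling is commonly set up. It is not taken from the article: it assumes an ensemble of reward models, each mapping a trajectory segment to a scalar return, and a Bradley-Terry preference probability. The names `Segment`, `RewardModel`, `preference_prob`, and `select_by_disagreement` are hypothetical and chosen only for illustration.

```python
import math
import statistics
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration: a segment is a sequence of floats,
# and a reward model maps a whole segment to its predicted return.
Segment = Sequence[float]
RewardModel = Callable[[Segment], float]
Pair = Tuple[Segment, Segment]


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_prob(model: RewardModel, seg_a: Segment, seg_b: Segment) -> float:
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    return _sigmoid(model(seg_a) - model(seg_b))


def select_by_disagreement(
    pairs: List[Pair], ensemble: List[RewardModel], k: int
) -> List[Pair]:
    """Return the k candidate pairs with the highest ensemble variance."""
    scored = []
    for pair in pairs:
        # One preference probability per ensemble member for this pair.
        probs = [preference_prob(m, pair[0], pair[1]) for m in ensemble]
        scored.append((statistics.pvariance(probs), pair))
    # Sort by variance only, largest first; ties keep their input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]
```

In this sketch, the selected pairs would be the ones sent out for preference labels, on the assumption that pairs the ensemble disagrees about are the most informative for the reward model.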
