ICPL Baseline Methods: Disagreement Sampling and PrefPPO for Reward Learning | HackerNoon

The disagreement sampling scheme enhances reward learning by using variance-driven selection of trajectory pairs.
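
To make "variance-driven selection" concrete, here is a minimal sketch of one common way disagreement sampling is set up. It assumes an ensemble of reward models, each of which scores both segments of a candidate pair. A Bradley-Terry model turns each member's two scores into a preference probability, and the pairs whose probabilities vary most across members are chosen for labeling. The ensemble, the function names (`preference_prob`, `select_by_disagreement`), and the toy numbers are illustrative assumptions, not details taken from the article.

```python
import math
from statistics import pvariance


def preference_prob(return_a, return_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def select_by_disagreement(ensemble_returns, num_queries):
    """Rank candidate pairs by how much the ensemble disagrees on them.

    ensemble_returns[m][i] is a (return_a, return_b) tuple: the summed reward
    that (hypothetical) ensemble member m assigns to the two segments of pair i.
    Returns the indices of the num_queries highest-variance pairs and all scores.
    """
    num_pairs = len(ensemble_returns[0])
    scores = []
    for i in range(num_pairs):
        # One preference probability per ensemble member for this pair.
        probs = [preference_prob(*member[i]) for member in ensemble_returns]
        # Population variance across members is the disagreement score.
        scores.append(pvariance(probs))
    ranked = sorted(range(num_pairs), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries], scores


# Toy data (assumed): 3 ensemble members, 3 candidate pairs.
ensemble_returns = [
    [(1.0, 1.0), (2.0, 0.0), (0.0, 3.0)],  # member 0
    [(1.0, 1.0), (0.0, 2.0), (0.0, 3.0)],  # member 1 flips its view on pair 1
    [(1.0, 1.0), (2.0, 0.0), (0.5, 3.0)],  # member 2
]
selected, scores = select_by_disagreement(ensemble_returns, num_queries=1)
```

In this toy data the members agree on pairs 0 and 2 but split on pair 1, so pair 1 is the one sent for a preference label. The idea is that labels are spent where the current reward estimate is least certain.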