ICPL Baseline Methods: Disagreement Sampling and PrefPPO for Reward Learning
Briefly

The disagreement sampling scheme used in our study improves training by first generating a larger batch of candidate trajectory pairs and then querying only the pairs on which the reward models disagree most (highest variance), using those labels for reward model updates.
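The following is a minimal sketch of disagreement-based query selection, assuming an ensemble of reward networks and segment tensors of shape (batch, T, obs_dim); names such as `reward_ensemble`, `seg_a`, and `select_queries` are illustrative and not taken from the ICPL codebase.

```python
import torch

def preference_probs(reward_ensemble, seg_a, seg_b):
    """Per-ensemble-member probability that segment A is preferred over B
    (Bradley-Terry on summed predicted segment rewards)."""
    probs = []
    for net in reward_ensemble:
        r_a = net(seg_a).sum(dim=1).squeeze(-1)  # (batch,) summed reward of A
        r_b = net(seg_b).sum(dim=1).squeeze(-1)  # (batch,) summed reward of B
        probs.append(torch.sigmoid(r_a - r_b))
    return torch.stack(probs)                    # (ensemble, batch)

def select_queries(reward_ensemble, seg_a, seg_b, num_queries):
    """Keep the pairs on which the ensemble disagrees most (highest variance)."""
    with torch.no_grad():
        probs = preference_probs(reward_ensemble, seg_a, seg_b)
    disagreement = probs.var(dim=0)              # variance across ensemble members
    top = torch.topk(disagreement, num_queries).indices
    return seg_a[top], seg_b[top]
```

Only the selected pairs are sent for (proxy) human labeling, so the label budget is spent where the reward models are most uncertain.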
In the proxy human preference experiment, we cap the number of human queries, which lets us measure the human effort required by the reward learning process.
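As a simple illustration of such a cap, a query counter can refuse labeling requests once the budget is spent; the class below is an assumed helper, and the limit value would come from the experiment configuration rather than from this text.

```python
class QueryBudget:
    """Tracks a fixed budget of (proxy) human preference queries."""

    def __init__(self, max_queries: int):
        self.max_queries = max_queries
        self.used = 0

    def can_query(self, n: int = 1) -> bool:
        return self.used + n <= self.max_queries

    def spend(self, n: int = 1) -> None:
        if not self.can_query(n):
            raise RuntimeError("human query budget exhausted")
        self.used += n
```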
The pseudocode for both reward learning and PrefPPO lays out the core procedures of our algorithms, giving the technical details needed for replication and further research.
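To make the reward-learning step concrete, here is a hedged sketch of a single reward-model update from labeled preference pairs, using the standard Bradley-Terry cross-entropy objective common to PrefPPO-style methods; tensor shapes and function names are assumptions for illustration, not the paper's exact pseudocode.

```python
import torch
import torch.nn.functional as F

def reward_learning_step(reward_net, optimizer, seg_a, seg_b, labels):
    """One gradient step on labeled segment pairs.

    seg_a, seg_b: (batch, T, obs_dim) trajectory segments
    labels:       (batch,) float, 1.0 if segment A is preferred, 0.0 if B
    """
    r_a = reward_net(seg_a).sum(dim=1).squeeze(-1)  # summed predicted reward of A
    r_b = reward_net(seg_b).sum(dim=1).squeeze(-1)  # summed predicted reward of B
    logits = r_a - r_b                              # Bradley-Terry preference logit
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The learned reward would then be used in place of the environment reward when running PPO on collected rollouts.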
Our research highlights the balance between learning efficiently from human feedback and minimizing the burden on human evaluators through strategic query sampling.