Our proposed method, In-Context Preference Learning (ICPL), reduces the complexity of preference learning by using large language models to autonomously generate reward functions and iteratively refine them from human feedback.
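As a rough illustration of this loop, the sketch below assumes hypothetical helpers (`llm_generate`, `train_policy`, `render_rollout`, `human_pick_best`); it is not the paper's implementation, only a minimal outline of how LLM-generated reward functions could be refined around a single human preference per round.

```python
from typing import Callable, List

def icpl_iteration(
    llm_generate: Callable[[str], List[str]],     # hypothetical: returns candidate reward fns as code strings
    train_policy: Callable[[str], object],        # hypothetical: trains an RL policy under a given reward fn
    render_rollout: Callable[[object], str],      # hypothetical: renders a rollout clip for human viewing
    human_pick_best: Callable[[List[str]], int],  # hypothetical: index of the clip the human prefers
    context: str,
    num_candidates: int = 6,
) -> str:
    """One round: generate rewards, train policies, collect one preference, refine the LLM context."""
    candidates = llm_generate(context)[:num_candidates]
    rollouts = [render_rollout(train_policy(code)) for code in candidates]
    best = human_pick_best(rollouts)
    # Fold the preferred reward function back into the prompt so the next
    # round of generation refines around what the human liked.
    return context + f"\n# Preferred reward function from the last round:\n{candidates[best]}"
```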
Experimental results show that ICPL surpasses traditional Reinforcement Learning from Human Feedback (RLHF) in efficiency and is competitive with methods that use ground-truth rewards, demonstrating its flexibility in capturing human intent.
ICPL's success on complex tasks such as humanoid jumping highlights its versatility, while its limitations on tasks that rely on subjective video assessments point to future work on combining human preferences with artificial metrics.
The initial diversity of the generated reward functions is crucial to ICPL's performance, yet producing that diversity solely through the language model remains a limitation that calls for further study (one possible approach is sketched below).
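One way such diversity might be encouraged, offered purely as an assumption rather than the paper's procedure, is to oversample reward-function candidates at a higher temperature and discard near-duplicates before any policy training; `sample_reward_code` is a hypothetical LLM sampler.

```python
import difflib
from typing import Callable, List

def diverse_candidates(
    sample_reward_code: Callable[[float], str],  # hypothetical: samples one reward fn at a given temperature
    k: int = 6,
    oversample: int = 20,
    temperature: float = 1.0,
    max_similarity: float = 0.9,
) -> List[str]:
    """Collect up to k textually distinct reward-function candidates."""
    kept: List[str] = []
    for _ in range(oversample):
        code = sample_reward_code(temperature)
        # Keep a candidate only if it differs enough from those already kept.
        if all(difflib.SequenceMatcher(None, code, c).ratio() < max_similarity for c in kept):
            kept.append(code)
        if len(kept) == k:
            break
    return kept
```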
#in-context-preference-learning #reinforcement-learning #human-feedback #large-language-models #reward-functions