Self-supervised language models of increasing scale can complete tasks in a zero-shot or few-shot manner, and their performance improves significantly when they are fine-tuned on human-written completions. Despite the success of such instruction tuning, relative human judgments of response quality are often easier to collect than expert demonstrations, making preference data a more scalable signal for optimizing model behavior.
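To make the contrast between the two kinds of data concrete, the sketch below shows the shape of a single relative judgment: a prompt, two candidate responses, and a binary choice between them, rather than an expert-written reference completion. The `PreferencePair` type and its field names are hypothetical, a minimal illustration rather than any specific dataset's schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One relative human judgment over two responses to the same prompt."""
    prompt: str
    chosen: str    # the response the annotator judged better
    rejected: str  # the response the annotator judged worse

# A toy example: the annotator only has to compare two candidates,
# not author a high-quality completion from scratch.
pair = PreferencePair(
    prompt="Summarize the article in one sentence.",
    chosen="The study finds that sleep loss impairs memory consolidation.",
    rejected="The article is about some research on sleep, it seems.",
)
```

Pairwise comparisons of this form are typically cheaper to gather than demonstrations because judging which of two responses is better demands less annotator skill and time than writing an expert completion.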