Many large language models (LLMs) are trained on data scraped from the web, which is often "unstructured, noisy, and poorly phrased," making it less effective for training.
Rather than generating synthetic data from scratch, WRAP (Web Rephrase Augmented Pre-training) uses an "off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as 'like Wikipedia' or in 'question-answer format.'"
According to the report, WRAP sped up pretraining by roughly a factor of three when applied to a "naturally noisy" dataset.
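To make the idea concrete, here is a minimal sketch of WRAP-style rephrasing. The model name, prompt wording, and style list are illustrative assumptions, not the paper's exact setup:

```python
# A rough sketch of WRAP-style data rephrasing: prompt an off-the-shelf
# instruction-tuned model to paraphrase noisy web documents in a chosen style.
# Model choice and prompt text below are assumptions for illustration only.
from transformers import pipeline

# Style prompts loosely follow the idea of rephrasing "like Wikipedia" or in
# "question-answer format"; exact wording is hypothetical.
STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a clear, encyclopedic style like Wikipedia:",
    "qa": "Rewrite the following text as a series of question-answer pairs:",
}

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def rephrase(document: str, style: str = "wikipedia", max_new_tokens: int = 512) -> str:
    """Paraphrase one web document in the requested style."""
    prompt = f"{STYLE_PROMPTS[style]}\n\n{document}\n\nRephrased text:"
    output = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the new text.
    return output[0]["generated_text"][len(prompt):].strip()

# Pretraining then mixes the rephrased documents with the original web text.
noisy_docs = ["lol this page has the BEST deals click here 4 more info..."]
augmented = [(doc, rephrase(doc, style="wikipedia")) for doc in noisy_docs]
```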
#generative-ai #large-language-models #data-quality #training-techniques #web-rephrase-augmented-pre-training