OpenAI transcribed over a million hours of YouTube videos to train GPT-4
Briefly

OpenAI transcribed over a million hours of YouTube videos to train GPT-4, a legally questionable method that has raised copyright concerns. The approach shows how hard it has become for AI companies to obtain high-quality training data.
Companies like OpenAI are looking to a range of sources for unique datasets that can improve their models' understanding of the world. These include publicly available data, non-public data obtained through partnerships, and potentially synthetic data.
Google raised concerns about OpenAI's methods, pointing to its robots.txt files and Terms of Service, which restrict unauthorized scraping of YouTube content. The dispute underscores the legal and ethical complexity of data acquisition for AI companies.
Read at The Verge