OpenAI transcribed over a million hours of YouTube videos to train GPT-4

from The Verge 11 months ago

OpenAI resorted to questionable methods, including transcribing over a million hours of YouTube videos, to train its AI models, raising legal concerns. The approach highlights how difficult it is for AI companies to obtain quality training data.
The Verge: https://www.theverge.com/2024/4/6/24122915/openai-youtube-transcripts-gpt-4-training-data-google

Companies like OpenAI are exploring various sources for unique datasets, aiming to enhance their models' understanding of the world. Strategies include using publicly available data, forming partnerships for non-public data, and potentially generating synthetic data.

Google expressed concerns over OpenAI's methods, noting that its robots.txt files and Terms of Service prohibit unauthorized scraping of YouTube content. This underscores the legal and ethical complexities AI companies encounter in data acquisition.


#ai-companies #training-data #copyright-law #data-acquisition #ethical-concerns
