
"Why Does Data Matter? First, it is necessary to address the question of data itself. As is well known, AI systems can only be designed if a large dataset is available to train the algorithm. From the training process of these AI models to the moment they generate outputs, the breadth and quality of the datasets available to them significantly affect their processing and generative capabilities."
"This dataset can come from various sources: it might be assembled through web scraping, which allows for the indiscriminate collection of data, for example from social media, that is later labeled by human workers (often poorly paid). This is the method used by nearly all companies that have developed generative AI models, such as OpenAI, Google, and others. Alternatively, data can be meticulously assembled by humans, as in the case of AlphaFold, the AI system that predicts the 3D structure of proteins."
Artificial intelligence's capabilities and risks stem from the datasets used in training and generation. Large, high-quality datasets are required to design AI systems, and they largely determine those systems' processing and generative performance. Data sources range from indiscriminately scraped web data, later labeled by often poorly paid human workers, to meticulously assembled, human-curated datasets. Nearly all major generative AI companies have relied on scraped and labeled web data, while scientific breakthroughs like AlphaFold relied on decades of collaboratively assembled public protein-structure databases. Data-related problems underlie many AI harms, including biased facial recognition and other well-documented failures rooted in dataset composition and labeling practices.
Read at Apaonline