Fair human-centric image dataset for ethical AI benchmarking - Nature
Briefly

"Image datasets have played a foundational role in the history of AI development, with ImageNet12 enabling the rise of deep learning methods in the early 2010s13. While AI technologies have made tremendous strides in their capabilities and adoption since then, bias in data and models remains a persistent challenge2,14. Inadequate evaluation data can result in fairness and robustness issues, making it challenging to identify potential harms. These harms include the perpetuation of racist, sexist and physiognomic stereotypes2,4, as well as the exclusion or misrepresentation of entire populations. Such data inadequacies therefore compromise the fairness and accuracy of AI models."
"The large-scale scraping of images from the web without consent not only exacerbates issues related to data bias, but can also present legal issues, particularly related to privacy and intellectual property (IP)20. Consequently, prominent datasets have been modified or retracted8. Moreover, the lack of fair compensation for data and annotations presents critical concerns about the ethics of supply chains in AI development21,22."
"Datasets made available by government agencies such as NIST23 or using third-party licensed images24 often have similar issues with the absence of informed consent and compensation. Many dataset developers mistakenly assume that using images with Creative Commons licences addresses relevant privacy concerns3. Only a few consent-based fairness datasets with self-reported labels exist. However, these datasets have little geographical diversity. They also lack pixel-level annotations, meaning that they can be used for only a small number of human-centric computer vision tasks3. Evaluating models and mitigating bias are key for ethical AI development. Recent methods such as PASS28, FairFaceVar29 and MultiFair30 aim to reduce demographic leakage or enforce fairness constraints through adversarial training and fairness-aware representations."
Image datasets have enabled significant AI progress, but bias in data and models remains a persistent problem that undermines fairness and robustness. Large-scale scraping of images without consent amplifies this bias and creates privacy and intellectual-property risks, which have already prompted dataset modifications and retractions. The lack of fair compensation for data and annotations raises ethical supply-chain concerns. Government and third-party licensed image collections often suffer the same absence of informed consent and compensation. The few consent-based fairness datasets that exist lack geographic diversity and pixel-level annotations, limiting the tasks they can support. Evaluation and mitigation methods such as PASS, FairFaceVar, and MultiFair aim to reduce demographic leakage or enforce fairness constraints during model development.
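
The adversarial-training approach mentioned above typically works by pitting an adversary that predicts a protected attribute against an encoder that is penalized when the adversary succeeds. The sketch below is a minimal illustration of that general idea in PyTorch using gradient reversal; the module sizes, names, and training loop are assumptions made for illustration, not the published implementations of PASS, FairFaceVar, or MultiFair.

```python
# Minimal adversarial-debiasing sketch (assumed architecture, not the cited methods):
# an encoder feeds a task head and an adversary; gradient reversal makes the
# encoder remove protected-attribute information while the adversary learns it.
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(512, 128), nn.ReLU())  # hypothetical feature extractor
task_head = nn.Linear(128, 10)                           # main task, e.g. 10 classes
adversary = nn.Linear(128, 2)                            # predicts a binary protected attribute

params = list(encoder.parameters()) + list(task_head.parameters()) + list(adversary.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
ce = nn.CrossEntropyLoss()

def training_step(x, y_task, y_attr, lam=1.0):
    z = encoder(x)
    task_loss = ce(task_head(z), y_task)
    # The adversary sees the representation through the reversal layer: it still
    # learns to predict the attribute, but the flipped gradient pushes the
    # encoder to strip attribute information ("demographic leakage") out of z.
    adv_loss = ce(adversary(GradientReversal.apply(z, lam)), y_attr)
    loss = task_loss + adv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return task_loss.item(), adv_loss.item()

# Toy usage with random tensors standing in for image features and labels.
x = torch.randn(32, 512)
y_task = torch.randint(0, 10, (32,))
y_attr = torch.randint(0, 2, (32,))
print(training_step(x, y_task, y_attr))
```

Gradient reversal lets a single optimizer step serve both objectives: the adversary descends on its loss while the encoder effectively ascends on it, so attribute information is squeezed out of the representation without alternating training phases.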
Read at Nature