EleutherAI releases 8TB collection of licensed and open training data
Briefly

EleutherAI has introduced a significant resource for the AI community with the launch of Common Pile v0.1, an 8TB collection of texts drawn from publicly licensed and public domain sources. The initiative, developed in collaboration with institutions such as the US Library of Congress and the University of Toronto, aims to make it possible to train AI systems ethically amid growing concern over generative AI companies' use of copyrighted material. EleutherAI's previous collection, The Pile, sparked debate over data provenance, and this new release demonstrates a commitment to responsible data sourcing.
EleutherAI has launched Common Pile v0.1, an 8TB text dataset composed solely of publicly licensed or public domain texts, intended for training AI models without copyright concerns.
Development of Common Pile v0.1 took two years and involved collaborations with organizations including the US Library of Congress and the University of Toronto, reflecting the extensive effort behind the resource.
The release responds to rising concerns over generative AI companies' use of copyrighted materials; EleutherAI hopes to demonstrate that substantial training datasets can be sourced ethically.
Common Pile v0.1 offers developers a legal alternative for training models responsibly, addressing copyright issues that are prevalent in the AI community.
Read at Computerworld