Hey there, tech fans! Nuked here, ready to share some exciting AI news with you all.
EleutherAI, a well-known AI research group, has launched an enormous dataset named The Common Pile v0.1. This collection—a whopping 8 terabytes—was created over two years with big names like Poolside and Hugging Face, plus academic partners.
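The whole collection lives on Hugging Face, so you can actually poke at it without downloading all 8 terabytes. Here's a minimal sketch using the datasets library in streaming mode; heads up that the dataset ID below is my assumption about the repo name, so check the common-pile organization on the Hub for the exact identifiers before running.

```python
from datasets import load_dataset

# Dataset ID is an assumption; browse https://huggingface.co/common-pile
# for the actual repository names.
ds = load_dataset("common-pile/common_pile_v0.1", split="train", streaming=True)

# Streaming mode yields examples lazily, so you never pull the full 8 TB.
for i, example in enumerate(ds):
    print(example)
    if i >= 2:
        break
```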
This dataset taps into a variety of sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive, as well as audio transcriptions produced with OpenAI's Whisper speech-to-text model. It's designed to show that models trained on openly licensed and public domain data can match the performance of those trained on unlicensed copyrighted data.
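Curious how the Whisper side of a pipeline like that might look? Here's a minimal sketch with the open-source openai-whisper package; the filename is a placeholder, and this illustrates the general technique, not EleutherAI's actual pipeline.

```python
# pip install openai-whisper  (also requires ffmpeg on your system)
import whisper

# Load a Whisper checkpoint; "base" is small and fast, while larger
# checkpoints ("medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a public domain recording (placeholder filename).
result = model.transcribe("public_domain_lecture.mp3")

print(result["text"])
```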
EleutherAI used this dataset to train two models, Comma v0.1-1T and Comma v0.1-2T, each with 7 billion parameters (the suffixes refer to the roughly 1 trillion and 2 trillion tokens they were trained on). The group says both perform competitively with models like Meta's first Llama on benchmarks for coding, image understanding, and math.
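Want to try the models yourself? Something like this sketch should work with Hugging Face's transformers library; the repo ID is my best guess at the checkpoint name, so verify it on the Hub first.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID is an assumption; confirm the exact name on the Hugging Face Hub.
repo = "common-pile/comma-v0.1-1t"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Prompt the base model with a code snippet and sample a continuation.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note this is a base model, not a chat-tuned assistant, so completion-style prompts like the one above are the natural way to exercise it.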
Amidst ongoing legal battles over AI training practices, with many companies being sued for scraping copyrighted content, EleutherAI emphasizes transparency and careful source curation. The group worked with legal experts and drew only on openly licensed and public domain sources to ensure the dataset's legitimacy. It argues that as the pool of licensed and open data grows, models trained on it will keep improving while legal risk shrinks.
The release also answers criticism of EleutherAI's earlier dataset, The Pile, which included copyrighted material and drew legal scrutiny. Going forward, the group commits to more frequent releases of openly licensed data, aiming to boost transparency and support responsible AI research.
Overall, this release reflects a push for open, fair, and ethically sourced training data in the AI field, encouraging better models and clearer practices. Stay tuned for what’s next in AI innovation!