Here we go again: Giant corporations, including Apple and Nvidia, have used video transcripts from thousands of YouTube creators for AI training without consent or compensation. The news is not all that surprising, as it seems par for the course. They are simply joining the ranks of Microsoft, Google, Meta, and OpenAI in the unethical use of copyrighted material.
An investigation by Proof News has uncovered that some of the wealthiest AI companies, including Anthropic, Nvidia, Apple, and Salesforce, have used material from thousands of YouTube videos to train their AI models. This practice violates YouTube's terms of service, which prohibit data harvesting from the platform without permission, but it follows a trend set by Google, OpenAI, and others.
The data, called "YouTube Subtitles," is a subset of a larger dataset called "The Pile." It includes transcripts of 173,536 YouTube videos from over 48,000 channels, spanning educational content providers like Khan Academy, MIT, and Harvard, as well as popular media outlets like The Wall Street Journal, NPR, and the BBC. The cache also includes entertainment shows like "The Late Show With Stephen Colbert." Even YouTube megastars like MrBeast, Jacksepticeye, and PewDiePie have content in the cache.
Proof News contributor Alex Reisner uncovered The Pile last year. It contains scraps of everything, from copyrighted books and academic papers to online conversations and YouTube closed-caption transcripts. In response to the find, Reisner created a searchable database of the content because he felt that IP owners should know whether AI companies are using their work to train their systems.
"I think it's hard for us as a society to have a conversation about AI if we don't know how it's being built," Reisner said. "I thought YouTube creators might want to know that their work is being used. It's also relevant for anyone who's posting videos, photos, or writing anywhere on the internet because right now AI companies are abusing whatever they can get their hands on."
David Pakman, host of "The David Pakman Show," expressed his frustration, revealing that he found nearly 160 of his videos in the dataset. These transcripts were taken from his channel, stored, and used without his knowledge. Pakman, whose channel supports four full-time employees, argued that he deserves compensation if AI companies benefit financially from his work. He highlighted the substantial effort and resources invested in creating his content, describing the unauthorized use as theft.
"No one came to me and said, 'We would like to use this,'" said Pakman. "This is my livelihood, and I put time, resources, money, and staff time into creating this content. There's really no shortage of work."
Dave Wiskus, CEO of the creator-owned streaming service Nebula, echoed this sentiment, calling the practice disrespectful and exploitative. He warned that generative AI could replace artists and harm the creative industry. Compounding the problem, some large content producers like the Associated Press are signing lucrative deals with AI developers while smaller ones are having their work taken without notice or payment.
The investigation revealed that EleutherAI is the group behind The Pile dataset. Its stated goal is to make cutting-edge AI technologies available to everyone. However, its methods raise ethical concerns, primarily over the hush-hush deals made with big AI players. Various AI developers, including multitrillion-dollar tech giants like Apple and Nvidia, have used The Pile dataset to train their models. None of the companies involved have responded to requests for comment.
Lawmakers have been slow to respond to the various threats that AI brings. After years of deepfake technology advances and abuses, a bill to curb deepfake and AI abuse has finally been introduced in the US Senate: the "Content Origin Protection and Integrity from Edited and Deepfaked Media Act," or COPIED Act. The bill aims to create a framework for the legal and ethical gray area of AI development. Among other things, it promises transparency and an end to the rampant theft of intellectual property via internet scraping.