A hot potato: Once again, it's been revealed that a company has been scraping data from the internet to train its AI models using a questionable interpretation of copyright law. On this occasion, Nvidia has been downloading videos from YouTube, Netflix, and other platforms to gather data for its commercial AI products.
According to internal Slack chats, emails, spreadsheets, and several other sources obtained by 404 Media, Nvidia asked workers to download videos from various online platforms to compile data to train its Omniverse, autonomous vehicle, and digital human products.
Codenamed Cosmos, the project involved using between 20 and 30 virtual machines on Amazon Web Services to download the equivalent of 80 years of videos every day. Nvidia was downloading so much that it managed to accumulate over 30 million URLs in the space of one month.
In addition to Netflix and YouTube, Nvidia workers were told to train the AI models on the movie trailer database MovieNet, internal libraries of video game footage, and the GitHub video dataset WebVid, which has since been taken down. It also used InternVid-10M, a dataset containing 10 million YouTube video IDs.
Copyright issues are always at the forefront of discussions when it comes to companies scraping data from the web. Nvidia employees reportedly discussed the issue and used several methods to try to avoid any potential legal blowback, including using data marked for academic or non-commercial purposes only.
HD-VG-130M was one of the datasets Nvidia used. This library of 130 million YouTube videos states in its license that it's for academic use only, something Nvidia appears to have ignored. Employees also used Google's cloud service to download the YouTube-8M dataset, as directly downloading the videos isn't allowed under the terms of service.
"We cleared the download with Google/YouTube ahead of time and dangled as a carrot that we were going to do so using Google Cloud," wrote one person in a Slack channel. "After all, usually, for 8 million videos, they would get lots of ad impressions, revenue they lose out on when downloading for training, so they should get some money out of it."
Nvidia also reportedly used VMs with rotating IP addresses in some cases to avoid YouTube detecting what it was doing and banning the users.
In April, it was reported that OpenAI researchers, looking for more reputable English-language text on the internet in 2021, created a speech recognition tool called Whisper. It was designed to transcribe audio from YouTube videos, giving the company a trove of data to train its LLMs. Why didn't Google object? Possibly because it also transcribed YouTube videos for its AI models, potentially infringing creators' copyrights.
YouTube previously said that scraping data to train AI models was a "clear violation" of its terms. Nvidia told 404 Media that its actions were "in full compliance with the letter and the spirit of copyright law."
If you were wondering whether Nvidia used gameplay footage from its own GeForce Now service to train its AI – no, it didn't, though it sounds like such a thing could happen at some point. "We don't yet have statistics or video files yet, because the infras is not yet set up to capture lots of live game videos & actions," a senior Nvidia research scientist told other employees. "There're both engineering & regulatory hurdles to hop through."
Many AI firms that scrape data defend their actions by claiming it's fair use under copyright law. Music-generating AI startups Udio and Suno are using this excuse in the copyright lawsuits filed against them by major record companies.