Apple Trained Its AI on YouTube Transcripts Without Permission, Report Says [CNET]

YouTube creators were unaware that tech companies were using transcripts of their content to train AI systems.

An investigation from Proof News claims some of the world’s largest tech companies, including Apple and Nvidia, are training AI systems with YouTube video transcripts without creators’ permission.

The report, which includes a search tool to determine whether a YouTube channel is in the dataset, says “subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple and Salesforce.” The dataset includes channels belonging to late-night shows such as The Late Show with Stephen Colbert and Jimmy Kimmel Live, as well as content from popular YouTube personalities including MrBeast, tech reviewer Marques Brownlee and PewDiePie.

Proof News said that the dataset was part of a compilation called the Pile that came from a nonprofit, EleutherAI. In a 2020 research paper, the nonprofit described the Pile as containing 22 separate datasets.

Apple, Anthropic and EleutherAI didn’t immediately respond to requests for comment. Nvidia declined to comment.

In an email to CNET, a Google spokesperson said the company stands by its previous statements on the subject and linked to a Bloomberg article from April. In that article, YouTube CEO Neal Mohan said he didn’t know whether OpenAI had in fact used YouTube videos to train its text-to-video generator, but that if it had, that would violate the platform’s terms of service. He didn’t address whether Google itself used the videos in this way.

While AI continues to be a key technology pursued by tech titans including Apple, Google, Microsoft, Meta and IBM, advancing the technology requires feeding AI models gigantic amounts of data. Leaders in the space, including OpenAI, have acknowledged that it’s getting harder and harder to find datasets to train AI systems. That has led OpenAI, the creator of ChatGPT, to negotiate deals with content companies, including News Corp. and Reddit, to acquire content for its AI systems.

The report, however, suggests that tech companies such as Apple and Nvidia may be gobbling up datasets containing material whose use, at least in spirit, doesn’t align with what content creators would expect from a platform like YouTube, which ostensibly prohibits data mining of videos or their transcripts.

A spokesperson for Anthropic, a public benefit AI startup, told Proof News that it uses the Pile to train its AI assistant Claude and said, “The Pile includes a very small subset of YouTube subtitles.” 

Spokesperson Jennifer Martinez said, “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.”

As the report points out, Google itself has been taken to task for mining YouTube content. The company told the New York Times that its agreement with content creators allows for YouTube content to be used for AI training. 
