Major tech companies including Apple, Anthropic, Nvidia, and Salesforce have used a large, unauthorized dataset of YouTube captions to train their artificial intelligence (AI) systems. According to a joint investigation by Proof News and Wired, the dataset consists of captions from over 170,000 YouTube videos drawn from more than 48,000 channels. It contains only the caption text of those videos and no visual content.
The dataset includes videos from popular YouTubers such as MrBeast and Marques Brownlee and from news outlets such as ABC News, the BBC, The New York Times, The Verge, and Vox, among many others. Brownlee stated in a post on X that Apple sources data for its AI from various companies, and that one of these companies collected a large amount of data and subtitles from YouTube videos.
YouTube declined to comment on the dataset. However, YouTube CEO Neal Mohan has said that using video content and transcripts to train AI violates the platform’s terms of use. Google CEO Sundar Pichai has backed that view, saying that companies developing AI should abide by YouTube’s terms of use.
The caption dataset is part of a larger open-source collection called The Pile, created by EleutherAI. The Pile is composed of datasets that include books, Wikipedia articles, and more. Last year, an analysis of a dataset called Books3 revealed that authors’ works were being used to train AI systems, which led authors to file lawsuits against the companies involved.
Lack of transparency from AI companies
AI companies are often opaque about the data they use to train their systems. The use of YouTube content in particular has become a hot topic in recent months. When OpenAI introduced its powerful video generation tool Sora, CTO Mira Murati sidestepped questions about whether the system was trained on YouTube videos, stating only that “publicly available or licensed data” was used.
Proof News has introduced an interactive search tool that lets users check whether their own content, or that of their favorite YouTubers, is included in the dataset.