AI companies are finally being forced to cough up for training data [MIT Tech Review]

This story originally appeared in The Algorithm, our weekly newsletter on AI.

The generative AI boom is built on scale. The more training data, the more powerful the model. 

But there’s a problem. AI companies have pillaged the internet for training data, and many websites and data set owners have started restricting the ability to scrape their websites. We’ve also seen a backlash against the AI sector’s practice of indiscriminately scraping online data, in the form of users opting out of making their data available for training and lawsuits from artists, writers, and the New York Times, claiming that AI companies have taken their intellectual property without consent or compensation. 

Last week three major record labels (Sony Music, Warner Music Group, and Universal Music Group) announced they were suing the AI music companies Suno and Udio over alleged copyright infringement. The music labels claim the companies made use of copyrighted music in their training data “at an almost unimaginable scale,” allowing the AI models to generate songs that “imitate the qualities of genuine human sound recordings.” My colleague James O’Donnell dissects the lawsuits in his story and points out that they could determine the future of AI music.

But this moment also sets an interesting precedent for all of generative AI development. Thanks to the scarcity of high-quality data and the immense pressure and demand to build even bigger and better models, we’re in a rare moment where data owners actually have some leverage. The music industry’s lawsuit sends the loudest message yet: High-quality training data is not free. 

It will likely take a few years at least before we have legal clarity around copyright law, fair use, and AI training data. But the cases are already ushering in changes. OpenAI has been striking deals with news publishers such as Politico, the Atlantic, Time, the Financial Times, and others, and exchanging publishers’ news archives for money and citations. And YouTube announced in late June that it will offer licensing deals to top record labels in exchange for music for training.

These changes are a mixed bag. On one hand, I’m concerned that news publishers are making a Faustian bargain with AI. For example, most of the media houses that have made deals with OpenAI say the deal stipulates that OpenAI cite its sources. But language models are fundamentally incapable of being factual and are best at making things up. Reports have shown that ChatGPT and the AI-powered search engine Perplexity frequently hallucinate citations, which makes it hard for OpenAI to honor its promises.   

It’s tricky for AI companies too. This shift could lead them to build smaller, more efficient models, which are far less polluting. Or they may fork out a fortune to access data at the scale they need to build the next big one. Only the companies most flush with cash, and/or with large existing data sets of their own (such as Meta, with its two decades of social media data), can afford to do that. So the latest developments risk concentrating power even further into the hands of the biggest players.

On the other hand, the idea of introducing consent into this process is a good one—not just for rights holders, who can benefit from the AI boom, but for all of us. We should all have the agency to decide how our data is used, and a fairer data economy would mean we could all benefit. 


Now read the rest of The Algorithm

Deeper Learning

How AI video games can help reveal the mysteries of the human mind

Neuroscientists and psychologists have long been using games as research tools to learn about the human mind. Video games have been either co-opted or specially designed to study how people learn, navigate, and cooperate with others, for example. AI video games—where characters don’t need scripts and appear to play when you’re not watching—could allow us to probe more deeply and unravel enduring mysteries about our brains and behavior, suggests my colleague Jessica Hamzelou in our weekly biotech newsletter, The Checkup.

Ready, set, go: Scientists who have done this type of study were able to observe how players behaved in these games: how they explored their virtual environment, how they sought rewards, how they made decisions. And research volunteers didn’t need to travel to a lab; their gaming behavior could be observed from wherever they happened to be playing, whether that was at home, at a library, or even inside an MRI scanner.

Bits and Bytes

AI is already wreaking havoc on global power systems
A really well-done data visualization of the insane amount of electricity AI requires and how it is transforming our energy grid. A startling statistic: Data centers use more electricity than most countries. (Bloomberg)

The AI boom has an unlikely early winner: wonky consultants
It seems every company out there is thinking about how to use AI. But the problem is that nobody is sure exactly how to do that. And so in come consultants, who are profiting from AI FOMO. Work related to generative AI will make up about 40% of McKinsey’s business this year. (The New York Times)

Deepfake creators are revictimizing sex trafficking survivors
A new low: For the past few months, the largest deepfake sexual abuse website has posted deepfake videos based on footage from GirlsDoPorn, a now-defunct sex trafficking operation. (Wired)

I paid $365.63 to replace 404 Media with AI
A journalist paid gig workers to use ChatGPT to plagiarize news. The result: grammatically correct nonsense. (404 Media)


