The fuel of ChatGPT and co. is finished, the battle for data can begin

In the course of 2024, possibly within the coming weeks, the stock of good-quality text on the internet will be completely exhausted: all of it will have been used to train the largest AI systems, the so-called large language models (LLMs). That is the prediction of a study by the research institute Epoch AI. If the major AI companies such as OpenAI, Google, Anthropic and Meta do not acquire more data, development threatens to grind to a halt.

Each successive generation of LLMs, the AI technology behind chatbots like ChatGPT, is more powerful than the last, but also requires ever more data to train on. In 2019, GPT-2 was trained on 40 gigabytes of text, or about 8 million pages. A year later, GPT-3 already needed 570 gigabytes of text. And GPT-4? We don’t know: OpenAI has stopped disclosing such details about its technology.

What OpenAI, Google, Anthropic and the other AI companies are also extremely discreet about is which data they use. Not only to avoid helping each other, but also to avoid waking sleeping dogs. Current models have most likely been trained on copyrighted material, some of which should not be on the internet at all (such as pirated copies of books). But it is technically almost impossible to establish that with certainty.

Sarah Silverman’s book

Those dogs woke up last year. In a growing number of lawsuits, writers, illustrators and newspaper publishers are seeking compensation because they suspect that their work sits in the training data of GPT-4 and other models. And it is about much more than text alone. Image generators, another form of generative AI, are also under fire. And despite the name, the large ‘language models’ are themselves trained on more and more images, audio and video.

Writer and comedian Sarah Silverman has demonstrated that ChatGPT can provide a detailed summary of her book The Bedwetter. This means that her book has undoubtedly been used as training material, which may constitute copyright infringement. The New York Times filed a similar complaint against OpenAI and Microsoft.

Getty Images, which owns the rights to an immense collection of photos, believes Stability AI trained its Stable Diffusion image generator on images from that collection.

What is at stake in the lawsuits is who will benefit from AI revenues in the coming years. According to a study by McKinsey, technology companies can expect a turnover increase of 4.8 to 9.3 percent, which could amount to an additional $240 to $460 billion per year. Will some of that money go to writers, artists, illustrators and photographers? Or will these professional groups only become victims, because AI systems make it possible to produce faster and more cheaply than they can?

“Training data is the fuel of AI,” says Thomas Wolf, head of scientific research at American-European AI company Hugging Face. For the time being, these companies pay little or nothing for that fuel.

An exception?

The lawsuits are not only about the training material of the AI models, but also about what they produce: can that be described as plagiarism? ChatGPT is able to reproduce the text of some New York Times articles almost verbatim, to which OpenAI responds that ChatGPT would never do that in normal use.

The AI companies argue that they have the right to use what they find on the internet. They appeal to fair use, an exception in (American) copyright law. The judges will have to decide whether they are right.

Things are a little different in Europe: the EU passed the AI Act, a regulation that sets legal rules for a number of aspects of AI. A crucial provision is that the owners of the major language models will be required to list which copyrighted material was used to train their systems.

That would at least provide clarity to the authors of the texts, images, audio and video used – but not yet money. They will be able to resist the use of their work in the future. Whether that work has also been used for current AI systems such as GPT-3.5 and GPT-4, and whether they will ever be compensated for it, is another question.

And who checks whether companies are honest in listing the data they used? “How on earth are you going to enforce that?” sighs Jozefien Vanherpe, assistant professor at CiTiP (KU Leuven).

At the same time, the AI Act seems to open the door to unbridled data use by tech companies: the text makes it clear that AI training data falls under the ‘text and data mining’ exception to copyright. In short: even in Europe, despite the specific legislation, there is anything but clarity.

Copy or adaptation?

What actually happens when an AI system is trained with the text of a book? That is a technical question, but also a philosophical and a legal one.

The large language models are based on a decades-old concept from artificial intelligence: the neural network, a computer system inspired by the functioning of the human brain. GPT-4 essentially consists of billions of ‘weights’ or ‘parameters’, each representing the strength of the connection between two neurons in the network.

The neural network gradually begins to recognize patterns in the text with which it is trained, and thus learns to predict the next word better and better. Depending on whether this succeeds or fails, the weights are adjusted slightly.
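
A minimal sketch of that training loop in Python, using PyTorch: the toy sizes, the random stand-in tokens and the two-layer model are illustrative assumptions, not anything from the article or from any AI lab. It only shows the principle: the model guesses the next token, and its weights are nudged slightly depending on how wrong it was.

```python
# Toy next-word prediction: a tiny "language model" whose weights are
# adjusted a little after every prediction error. Assumes PyTorch.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32           # toy sizes; real LLMs have billions of weights
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # token -> vector
    nn.Linear(embed_dim, vocab_size),     # vector -> a score for every possible next token
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (64,))  # random stand-in for real training text
inputs, targets = tokens[:-1], tokens[1:]     # the task: predict each next token

for step in range(100):
    logits = model(inputs)           # the model's predictions
    loss = loss_fn(logits, targets)  # how badly did it guess the next tokens?
    optimizer.zero_grad()
    loss.backward()                  # work out how each weight contributed to the error
    optimizer.step()                 # adjust the weights slightly
```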

But is there a ‘copy’ of a copyrighted work hidden somewhere in those billions of weights? And if the AI system itself later produces text that resembles that book, is it an adaptation?

Or is it more as if the AI system ‘reads’ a book, just as a human does, to draw inspiration from it for its own work? Author Stephen King (76) leans in that direction. “Would I forbid my stories from being taught – if that’s the right word – to computers? Not even if I could,” he wrote in The Atlantic. Although he added that his equanimity about the impact of AI on the writing profession could have something to do with his advanced age.

But for the time being the simple answer is: this has not been resolved yet.

What if GPT-4 were to train GPT-5?

There is another possible solution to the growing data hunger of ever-larger AI models: synthetic data. Simply put: you let GPT-4 generate millions of pages of text and then use them to train GPT-5. But a number of studies have meanwhile shown that AI models can go off the rails this way: the errors and flaws of the earlier model are amplified, and that effect compounds with each generation.
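
That compounding drift is easy to illustrate with a toy experiment, a sketch of the general phenomenon rather than a reproduction of any of those studies: fit a simple statistical “model” to some data, sample synthetic data from it, fit again, and repeat.

```python
# Toy "model collapse": each generation of a model is trained only on
# samples produced by the previous generation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=200)        # generation 0: "human-written" data

for generation in range(1, 11):
    mu, sigma = data.mean(), data.std()      # "train" a model on the current data
    data = rng.normal(mu, sigma, size=200)   # the next generation sees only synthetic data
    print(f"generation {generation}: mean={mu:+.3f}, std={sigma:.3f}")

# The standard deviation tends to shrink over the generations: rare "tail"
# material disappears first, the kind of quiet drift the studies warn about.
```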

According to Thomas Wolf of Hugging Face, synthetic data will be usable for roughly half of the training material. But caution is needed: more and more text on the internet is itself generated by ChatGPT, so that human-written and AI-generated texts can hardly be told apart anymore. That only increases the importance of high-quality data for the AI companies.

The Napster Years

And in the meantime? For Pieter Jan Valgaeren, intellectual property researcher at Erasmus University Rotterdam, it is clear: the tech companies need more data, the rights holders want to see money, and so “there has to be a sit-down.”

That is now happening, although still behind closed doors: the AI companies want at all costs to avoid putting a clear (and high) price on the value of training data. For example, we do not know what OpenAI pays to train its AI on the texts of the news agency Associated Press, the German publishing group Springer or the French newspaper Le Monde.

Valgaeren compares the situation to what happened twenty years ago with digital music: software companies such as Napster distributed it for years without paying. Until the music industry and its lawyers came knocking. Now we have paid streaming services such as Spotify, which pay the music labels rights for every stream.

“The first to cash in will be the biggest and most powerful,” says Valgaeren. Those are rarely the individual artists. They can, however, have themselves represented.

Hollywood actors and screenwriters reached an agreement with video producers last year to prevent their work and likeness from being used without their approval and without compensation. But not all artists have such powerful unions. “Some artists will be left out in the cold,” Valgaeren fears.

Quality

The publishers who are now the first to sign deals with OpenAI or Google have large archives of high-quality texts: articles that have been checked by editors, that meet the medium’s standards, and whose information can be trusted to be factually correct.

Because with training data, quality is even more important than quantity. What exactly constitutes high-quality data is difficult to define, says Thomas Wolf of Hugging Face – this should mainly be reflected in how good the trained model is.

The texts on the forum site Reddit are also considered high quality and very valuable for training AI. In February, Google struck a deal with Reddit under which it pays $60 million a year (a rarity: an amount that became public).

Battle between major players

An interesting question is whether AI companies will try to outdo each other by claiming large, high-quality data collections exclusively for themselves. So that, for example, Google is not allowed to train on the archive of The Washington Post, but OpenAI is. It is not known whether most of the deals so far are exclusive.

Because the battle for data is of course also a battle between the big players. Elon Musk has already announced that only his company xAI and no one else is allowed to train with the messages that people post on X, which is also his property.

A battle also seems to be breaking out around video. In addition to the GPT language models, OpenAI is developing Sora, a new video generator that creates video footage from scratch based on a description. When a journalist asked Mira Murati, chief technology officer at OpenAI, whether Sora had been trained on YouTube videos, Murati had no answer: she claimed not to know where her company had obtained its videos. A Google executive responded with a stern warning: no one is allowed to train AI on video material from YouTube, except Google itself.

That could prove to be a huge asset for Google in the future. Because video, experts such as Belgian researcher Pieter Abbeel predict, will play a key role in the development of AI in the coming years. And so we immediately know where the next battle will take place.
