One of the main defenses used by those who are bullish on AI art generators is that although the models are trained on existing images, everything they create is new. AI evangelists often compare these systems to real-life artists. Creatives are inspired by all those who came before them, so why can't AI be inspired by previous work in the same way?
New research could put a damper on this argument, and could even become a major sticking point in several ongoing lawsuits over AI-generated content and copyright. Researchers from industry and academia have found that the most popular and up-and-coming AI image generators can "remember" images from the data they were trained on. Instead of creating something entirely new, certain prompts will lead the AI to simply reproduce an image, and some of those reproduced images could be copyrighted. Even worse, modern generative AI models are capable of memorizing and reproducing sensitive information scraped up for use in their training sets.
The study was conducted by researchers in the tech industry, notably at Google and DeepMind, and at universities including UC Berkeley and Princeton. The same team worked on an earlier study that identified a similar problem in AI language models, particularly GPT-2, the precursor to OpenAI's extraordinarily popular ChatGPT. Working together again, the researchers, led by Google Brain's Nicholas Carlini, found that both Google's Imagen and the popular open-source Stable Diffusion were capable of reproducing images, some of which carried obvious implications for copyright or image licenses.
The first image in this tweet was generated using the caption listed in Stable Diffusion's dataset, the multi-terabyte database of scraped images known as LAION. The team fed the caption into the Stable Diffusion prompt and got back the exact same image, albeit slightly distorted by digital noise. The process for finding these duplicated images was relatively simple: the team ran the same prompt multiple times, and after getting the same resulting image each time, the researchers manually checked whether the image appeared in the training set.
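As a rough illustration of that kind of check, here is a minimal sketch assuming the Hugging Face diffusers library and a simple pixel-space distance. The checkpoint name, threshold, and vote count are placeholder assumptions for illustration; the actual study uses a more careful patch-based similarity measure.

```python
# Minimal sketch: regenerate images from a training caption and flag
# near-duplicates of the original training image. Assumes the Hugging Face
# `diffusers` library; the distance metric is a simple normalized L2,
# a stand-in for the more careful measure used in the actual study.
import numpy as np
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed model checkpoint
)

def l2_distance(a: Image.Image, b: Image.Image) -> float:
    """Normalized pixel-space L2 distance between two images."""
    a_arr = np.asarray(a.resize((256, 256)), dtype=np.float32) / 255.0
    b_arr = np.asarray(b.resize((256, 256)), dtype=np.float32) / 255.0
    return float(np.sqrt(np.mean((a_arr - b_arr) ** 2)))

def looks_memorized(caption: str, training_image: Image.Image,
                    n_samples: int = 8, threshold: float = 0.1) -> bool:
    """Generate several images from a training caption; if most of them
    land unusually close to the training image, flag for manual review."""
    hits = 0
    for _ in range(n_samples):
        generated = pipe(caption).images[0]
        if l2_distance(generated, training_image) < threshold:
            hits += 1
    return hits >= n_samples // 2
```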
Two of the paper's researchers, Eric Wallace, a PhD student at UC Berkeley, and Vikash Sehwag, a PhD candidate at Princeton University, told Gizmodo in a Zoom interview that image duplication was rare. Their team tried around 300,000 different captions and found a memorization rate of just 0.03%. Copied images were even rarer for models like Stable Diffusion, which worked to de-duplicate images in its training set, though ultimately all diffusion models exhibit the same problem to a greater or lesser degree. The researchers found that Imagen was perfectly capable of memorizing images that existed only once in the dataset.
“The caveat here is that the model is meant to generalize, it’s meant to generate new images rather than spit out a memorized version,” Sehwag said.
Their research showed that as AI systems grow larger and more sophisticated, they become more likely to regurgitate copied material. A smaller model like Stable Diffusion just doesn't have the capacity to store much of that training data. That could change a lot in the next few years.
"Maybe next year, whatever new model comes out that's much bigger and much more powerful, then potentially those types of memorization risks would be much higher than they are now," Wallace said.
Through a complicated process of corrupting their training data with noise and then learning to remove that same distortion, diffusion-based machine learning models generate data, in this case images, similar to the ones they were trained on. Diffusion models were an evolution over generative adversarial networks, or GAN-based machine learning.
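In rough terms, the forward noising step follows the standard DDPM formulation. The toy NumPy sketch below, which is an illustration and not the study's code, shows how a clean image gets noised at a given step and what the trained denoiser is asked to predict; the schedule values are assumed.

```python
# Toy sketch of the diffusion idea (standard DDPM-style noising), not the
# study's code: corrupt a clean image toward pure noise, then train a model
# to predict the added noise so the process can be reversed at sampling time.
import numpy as np

T = 1000                              # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)    # noise schedule (assumed values)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal retention

def noise_image(x0: np.ndarray, t: int, rng: np.random.Generator):
    """Forward process: x_t = sqrt(alpha_bar_t)*x0 + sqrt(1 - alpha_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # the model is trained to predict eps given (xt, t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(64, 64, 3))  # stand-in "image"
xt, eps = noise_image(x0, t=500, rng=rng)
# A trained denoiser would estimate eps from xt; an accurate estimate lets
# the sampler step back toward x0, which is how new images are generated.
```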
The researchers found that GAN-based models don't have the same problem with image memorization, but large companies are unlikely to move beyond diffusion unless an even more sophisticated machine learning model comes along that produces even more realistic, higher-quality images.
Florian Tramèr, a professor of computer science at ETH Zurich who participated in the research, noted that many AI companies advise users, on both the free and paid tiers, that they are granted a license to share or even monetize AI-generated content. The AI companies themselves also reserve some rights to these images. That could become a problem if the AI generates an image that is identical to an existing copyrighted work.
With a memorization rate of only 0.03%, AI developers could look at this study and conclude there isn't much risk. Companies could work to de-duplicate images in the training data, making memorization less likely (a rough sketch of that kind of de-duplication appears below). Hell, they could even develop AI systems that detect whether a generated image is a direct replica of an image in the training data and flag it for deletion. However, this masks the full privacy risk posed by generative AI. Carlini and Tramèr also contributed to another recent paper that argued that even attempts to filter the data still do not prevent training data from leaking through the model.
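For illustration, training-set de-duplication along those lines is often done with perceptual hashing. This is a minimal sketch assuming the imagehash and Pillow libraries, which are not tools named in the study; the distance cutoff is an assumed placeholder.

```python
# Minimal sketch of training-set de-duplication via perceptual hashing,
# assuming the `imagehash` and `Pillow` libraries; the study itself does
# not prescribe this exact approach.
from pathlib import Path
import imagehash
from PIL import Image

def deduplicate(image_dir: str, max_hamming: int = 4) -> list[Path]:
    """Keep one representative per cluster of visually near-identical images."""
    kept_hashes: list[imagehash.ImageHash] = []
    kept_paths: list[Path] = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        # Hamming distance between perceptual hashes approximates visual similarity.
        if all(h - existing > max_hamming for existing in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths
```

As the paragraph above notes, though, filtering of this kind reduces the odds of memorization without eliminating the underlying leakage risk.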
And of course, there's the risk that images no one wants copied end up appearing on users' screens. Wallace asked us to imagine, for example, a researcher who wanted to generate a whole set of synthetic medical data based on people's X-rays. What happens if a diffusion-based AI memorizes and replicates a person's actual medical records?
“It’s pretty rare, so you might not notice it happening at first, and then you might actually deploy that dataset to the web,” the UC Berkeley student said. “The whole point of this work is to sort of anticipate the kinds of possible mistakes that people might make.”