Large language models like OpenAI’s GPT-3 and Google’s GShard learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs. But a new study jointly published by Google, Apple, Stanford University, OpenAI, the University of California, Berkeley, and Northeastern University demonstrates the pitfall of this training approach. In it, the coauthors show that large language models can be prompted to show sensitive, private information when fed certain words and phrases.
It’s a well-established fact that models can “leak” details from the data on which they’re trained. Leakage, also known as data leakage or target leakage, is the use of information in the training process that couldn’t be expected to be available when the model makes predictions. This is of particular concern for all large language models, because their training datasets can sometimes contain names, phone numbers, addresses, and more.
In the new study, the researchers experimented with GPT-2, which predates OpenAI’s powerful GPT-3 language model. They claim that they chose to focus on GPT-2 to avoid “harmful consequences” that might result from conducting research on a more recent, popular language model. To further minimize harm, the researchers developed their training data extraction attack using publicly available data and followed up with people whose information was extracted, obtaining their blessing before including redacted references in the study.
By design, language models make it easy to generate an abundance of output. By seeding with random phrases, the model can be prompted to generate millions of continuations, or phrases that complete a sentence. Most of the time, these continuations are benign strings of text, like the word “lamb” following “Mary had a little…” But if the training data happens to repeat the string “Mary had a little wombat” very often, for instance, the model might predict that phrase instead.
The coauthors of the paper sifted through millions of output sequences from…