Authors file a lawsuit against OpenAI for unlawfully ‘ingesting’ their books
Two authors have filed a lawsuit against OpenAI, the company behind the artificial intelligence tool ChatGPT, claiming that the organisation breached copyright law by “training” its model on novels without the permission of authors.
Mona Awad, whose books include Bunny and 13 Ways of Looking at a Fat Girl, and Paul Tremblay, author of The Cabin at the End of the World, filed the class action complaint to a San Francisco federal court last week.
ChatGPT allows users to ask questions and type commands into a chatbot and responds with text that resembles human language patterns. The model underlying ChatGPT is trained with data that is publicly available on the internet.
Yet, Awad and Tremblay believe their books, which are copyrighted, were unlawfully “ingested” and “used to train” ChatGPT because the chatbot generated “very accurate summaries” of the novels, according to the complaint. Sample summaries are included in the lawsuit as exhibits.
Related: The big idea: Should we worry about artificial intelligence?
This is the first lawsuit against ChatGPT that concerns copyright, according to Andres Guadamuz, a reader in intellectual property law at the University of Sussex. The lawsuit will explore the uncertain “borders of the legality” of actions within the generative AI space, he adds.
Books are ideal for training large language models because they tend to contain “high-quality, well-edited, long-form prose,” said the authors’ lawyers, Joseph Saveri and Matthew Butterick, in an email to the Guardian. “It’s the gold standard of idea storage for our species.”
The complaint said that OpenAI “unfairly” profits from “stolen writing and ideas” and calls for monetary damages on behalf of all US-based authors whose works were allegedly used to train ChatGPT. Though authors with copyrighted works have “great legal protection”, said Saveri and Butterick, they are confronting companies “like OpenAI who behave as if these laws don’t apply to them”.
However, it may be difficult to prove that authors have suffered financial losses specifically because of ChatGPT being trained on copyrighted material, even if the latter turned out to be true. ChatGPT may work “exactly the same” if it had not ingested the books, said Guadamuz, because it is trained on a wealth of internet information that includes, for example, internet users discussing the books.
OpenAI has become “increasingly secretive” about its training data, said Saveri and Butterick. In papers released alongside early iterations of ChatGPT, OpenAI gave some clues as to the size of the “internet-based books corpora” it used as training material, which it called only “Books2”. The lawyers deduce that the size of this dataset – estimated to contain 294,000 titles – means the books could only be drawn from shadow libraries such as Library Genesis (LibGen) and Z-Library, through which books can be secured in bulk via torrent systems.
This case will “likely rest on whether courts view the use of copyright material in this way as ‘fair use’”, said Lilian Edwards, professor of law, innovation and society at Newcastle University, “or as simple unauthorised copying.” Edwards and Guadamuz both emphasise that a similar lawsuit brought in the UK would not be decided in the same way, because the UK does not have the same “fair use” defence.
The UK government has been “keen on promoting an exception to copyright that would allow free use of copyright material for text and data mining, even for commercial purposes,” said Edwards, but the reform was “spiked” after authors, publishers and the music industry were “appalled”.
Since ChatGPT was launched in November 2022, the publishing industry has been in discussion over how to protect authors from the potential harms of AI technology. Last month, The Society of Authors (SoA) published a list of “practical steps for members” to “safeguard” themselves and their work. Yesterday, the SoA’s chief executive, Nicola Solomon told the trade magazine the Bookseller that the organisation was “very pleased” to see authors suing OpenAI, having “long been concerned” about the “wholesale copying” of authors’ work to train large language models.
Richard Combes, head of rights and licensing at the Authors’ Licensing and Collecting Society (ALCS), said that current regulation around AI is “fragmented, inconsistent across different jurisdictions and struggling to keep pace with technological developments”. He encouraged policymakers to consult principles that the ALCS has drawn up which “protect the true value that human authorship brings to our lives and, notably in the case of the UK, our economy and international identity”.
Saveri and Butterick believe that AI will eventually resemble “what happened with digital music and TV and movies” and comply with copyright law. “They will be based on licensed data, with the sources disclosed.”
The lawyers also noted it is “ironic” that “so-called ‘artificial intelligence’” tools rely on data made by humans. “Their systems depend entirely on human creativity. If they bankrupt human creators, they will soon bankrupt themselves.”
OpenAI were approached for comment.