Last month, The Atlantic reported that generative AI systems built by Bloomberg, Meta, and other big tech companies were trained on a dataset of 191,000 books without the authors’ permission. Called “Books3”, this dataset of pirated books includes titles published in the past 20 years, among them works by African authors such as Wayétu Moore and Laila Lalami.
Let’s break this down a bit so that we can understand the root of the problem. Generative AI is artificial intelligence capable of generating text, images, or other media. It uses models that learn the patterns of their training data and then generate new data with similar characteristics. Generative AI includes the large language models behind tools like ChatGPT, which are trained on massive amounts of data to respond to users’ prompts.
According to The Atlantic, generative AI like ChatGPT is trained using text from Wikipedia and other online writing, but “high-quality generative AI requires higher-quality input than is usually found on the internet—that is, it requires the kind found in books.”
This is extremely concerning for the publishing industry, and since the news came out, many authors have begun to sue in a class action lawsuit on the grounds of copyright violation.
Many African writers expressed concern at the news and shared screenshots showing that their copyrighted works were part of the dataset. They have taken a clear stand against this pirating of books.
Moroccan-American author Laila Lalami was outraged that her work was used without her permission:
Two of my novels (The Moor’s Account and The Other Americans) were used to train AI for Meta, Bloomberg, and other billion-dollar corporations. I never consented, was never compensated, etc. It’s really infuriating. I’m glad the Authors’ Guild is suing. https://t.co/271iAcvLvs
— Laila Lalami (@LailaLalami) September 27, 2023
Similarly, Liberian-American author Wayétu Moore posted a screenshot of her name and novel in the Books3 dataset and voiced concern about the state of the literary publishing industry:
Without consent, without compensation, 180,000 books are currently being used to teach generative AI entities how to write, how to mimic the technique and style of authors, standing to eventually make their founders enormous profit. This on the backs of thousands of artists. I read an interview by Scorcese this week and he’s quoted saying “the industry is over” referencing contemporary Hollywood. If this goes unchallenged, literature changes forever.
We are utterly outraged by this violation of authors’ rights and copyright laws. We do not support this illegal pirating of books to train generative AI and hope that justice will be served.
The Atlantic has provided a search tool that lets people look up author names to see whether their works are part of the dataset.