Artificial intelligence, particularly in its form based on Large Language Models (LLMs), relies on the ingestion of very large amounts of textual data. However, revelations have shown that some of these models have been trained on copyrighted content without prior authorization. The use of pirated content, particularly books from databases such as LibGen, raises questions about respect for the law, the sustainability of technological development, and the place of authors in this new ecosystem.
This first part of our May series examines the facts that have recently come to light around the Meta/LibGen affair, retracing the motivations, the technical methods involved, and the first legal disputes.
In March 2025, court documents revealed that Meta had used pirated content to train its LLaMA (Large Language Model Meta AI) model. The company allegedly drew on data from Library Genesis (LibGen), a platform known in academic circles for providing free access to digital books, the vast majority of which are distributed without the rights holders' authorization.
The amount of data involved is significant: approximately 183,000 books were used, representing nearly 32 TB of information. Internal communications at Meta, revealed by The Atlantic and other specialist media, show that the company was well aware of the illegal nature of these sources. Members of the legal team reportedly warned of the legal risks, but management authorized the use of this content in order to “not fall behind” competitors such as OpenAI and Anthropic.
Several factors explain this drift.
Tech giants often invoke the U.S. doctrine of fair use to justify the use of protected content. Under certain conditions, it permits use without authorization for purposes such as research, commentary, or parody. However, this legal basis is being challenged in several ongoing cases, notably because:

- fair use is an exception, assessed case by case by the courts;
- its application to AI training remains largely unsettled;
- it applies only in the United States and has no equivalent in European legislation.
In France, three representative organizations have taken legal action:
- SNE (Syndicat national de l’édition – National Publishing Union),
- SGDL (Société des gens de lettres – Society of Literary Authors),
- SNAC (Syndicat national des auteurs et des compositeurs – National Union of Authors and Composers).
They denounce the unauthorized appropriation of protected works, often available in bookstores or official digital libraries.
Many authors, sometimes without international recognition, have discovered that their books were included in the training datasets used by certain AI systems. Citizen-led initiatives have made it possible to cross-reference the metadata of AI models with those of pirated databases to identify the works involved.
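The cross-referencing described above can be illustrated with a minimal sketch: matching the title/author metadata of a training corpus against the catalog of a pirated library, after normalizing both sides so that trivial differences in capitalization or punctuation do not hide a match. All names and records below are hypothetical, for illustration only; real initiatives work with far larger catalogs and fuzzier matching.

```python
# Hypothetical sketch: cross-referencing a model's training metadata
# against a pirated-library catalog by normalized (title, author) key.
# All records below are illustrative, not real dataset contents.

def normalize(title: str, author: str) -> tuple[str, str]:
    """Lowercase and strip punctuation so near-identical records
    (e.g. 'A. Writer' vs 'a writer') still match."""
    def clean(s: str) -> str:
        return "".join(ch for ch in s.lower()
                       if ch.isalnum() or ch.isspace()).strip()
    return (clean(title), clean(author))

# Metadata extracted from a model's (hypothetical) training corpus.
training_metadata = [
    {"title": "An Example Novel", "author": "A. Writer"},
    {"title": "Field Notes", "author": "B. Scholar"},
]

# Catalog entries from a (hypothetical) pirated database.
pirated_catalog = [
    {"title": "an example novel", "author": "a writer"},
    {"title": "Unrelated Book", "author": "C. Poet"},
]

# Build a lookup set of normalized keys, then intersect.
catalog_keys = {normalize(b["title"], b["author"]) for b in pirated_catalog}
matches = [b for b in training_metadata
           if normalize(b["title"], b["author"]) in catalog_keys]
```

Here the first training record matches the pirated catalog despite differing capitalization and punctuation, which is exactly how authors were able to discover their own books in such datasets.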
The feeling of dispossession is real. Not only were the authors not consulted, but they are now seeing their creations being used to generate texts automatically—sometimes in their own style—without any compensation. Some describe this as a new form of intellectual property theft, where works are no longer simply copied or illegally distributed, but absorbed to feed tools that could end up competing with their very profession.
The cases involving Meta, OpenAI, and Stability AI are likely just the first in a long series. In the United States, class action lawsuits have been filed by writers, visual artists, and publishers. In Europe, several national courts are now addressing the issue, often in the absence of established case law.
The central question remains: can an AI be trained on a work without authorization, as long as it does not explicitly reproduce its content? The debate pits advocates of a functional interpretation (focused on the end result) against defenders of a strict economic-rights approach (under which any use of the work must be compensated).
The use of pirated content in the training of artificial intelligence models reveals deep tensions between technological innovation, copyright protection, and the economic sustainability of creative work. The case of Meta and its use of LibGen crystallizes these issues: it shows that the line between technical exploration and legal circumvention is now being crossed by some digital players, under the guise of efficiency and competitiveness.
Join us again in mid-May for the second part of this series, where we will explore the ethical consequences of these practices, the regulatory approaches being considered at the international level, and the prospects for moving toward a fairer model for AI training. In the meantime, if you have a film, a series, software, or an e-book to protect, don’t hesitate to contact one of our account managers. PDN has been a pioneer in cybersecurity and anti-piracy for over ten years, and we’re sure to have a solution that can help you. Enjoy your reading, and see you soon!
© 2023 PDN Cyber Security Consultant. All rights reserved.