In the first part of this article we reported that several major artificial intelligence models have been trained—at least partially—on copyrighted content obtained illegally. This practice, though largely concealed until recently, is now the subject of international litigation.

This second part explores the ethical implications of this phenomenon, the still-timid responses from lawmakers, and potential avenues for developing AI training practices that are fairer, more transparent, and more respectful of creators.

A Major Ethical Breach

The unauthorized use of protected works to develop advanced technologies represents a profound ethical failure. Unlike marginal uses such as citation or parody, AI systems exploit works in their entirety, often for commercial purposes and on an industrial scale. This raises several issues:

  • Lack of consent: Authors, publishers, researchers, and artists were neither consulted nor informed about the use of their content.

  • Exploitation without compensation: While AI generates profits for the companies that commercialize it, rights holders receive nothing in return.

  • Blurring of traceability: Once ingested, works become invisible within the training corpus; it is nearly impossible to tell whether a generated text is influenced by a specific author.

Beyond legal questions, this is a philosophical challenge to the concept of intellectual property. Copyright law is based on the idea that creation involves intellectual, emotional, and often financial investment that deserves recognition and protection. AI, by absorbing these human productions without permission, undermines this conception in favor of an extractivist paradigm inherited from the platform economy.

The Position of Big Tech

In response to criticism, some tech companies have adopted a defensive strategy based on several arguments:

  • Technological progress as justification: They argue that training on the maximum amount of data is a necessary condition for developing models useful to society (translation, medicine, accessibility, etc.).

  • Fair use as a legal basis (under U.S. law), a doctrine that is itself being contested in court and has no direct equivalent in many other jurisdictions.

  • Dilution of responsibility: Some claim not to have precise knowledge of the sources used, especially when data comes from subcontractors or intermediate public databases.

  • The rise of open-source models as a democratizing force: Companies like Meta promote free access to their models to justify greater tolerance of their training practices.

However, this defense is increasingly fragile. It deliberately ignores the core principles of copyright law and relies on a utilitarian logic, where the potential benefit to the majority is used to justify the harm done to individual creators.

Legal Responses Underway

On the litigation front, lawsuits are multiplying across several countries. In the United States, class actions against OpenAI, Meta, and Stability AI aim to establish case law protecting authors. In Europe, national courts are beginning to weigh in. Noteworthy developments include:

  • An ongoing complaint in France against Meta, discussed in the first part of our article.

  • Calls for action at the European Parliament, where the AI Act could include provisions on the origin of training data.

Some legislators advocate for the introduction of a mandatory licensing scheme for AI training, similar to what exists for photocopying or radio broadcasting. This would legalize existing practices while ensuring some redistribution to rights holders.

Toward Transparency Requirements?

Another possible approach concerns transparency in training datasets. Today, most models are “black boxes” when it comes to their data sources. Without clear information on what was used, it is difficult for creators to assert their rights.

Proposals are emerging to require AI developers to:

  • Publish a full or representative list of works used for training.

  • Provide an interface allowing rights holders to verify whether their content was used.

  • Enable a clear, simple, and accessible opt-out mechanism.

This transparency would not solve all problems (especially regarding past data usage), but it would be a first step toward a fairer and more responsible model.
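
To make these proposals more concrete, here is a minimal, purely illustrative Python sketch of what an ingestion step honoring such obligations might look like: it checks a robots.txt-style opt-out signal before collecting a document, and records each collected item in a hash-based manifest that rights holders could later consult. The crawler name, manifest format, and function names are our own assumptions for illustration, not an existing standard.

    # Purely illustrative sketch: an ingestion step that (1) honors a
    # robots.txt-style opt-out before collecting a document and (2) records
    # a hash-based provenance manifest that rights holders could later query.
    # The crawler name and manifest format are assumptions, not a standard.

    import hashlib
    import json
    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    TRAINING_AGENT = "ExampleAITrainingBot"  # hypothetical crawler name

    def opted_out(url: str) -> bool:
        """Return True if the site's robots.txt disallows our training crawler."""
        parsed = urlparse(url)
        robots = RobotFileParser()
        robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        try:
            robots.read()
        except OSError:
            return True  # if robots.txt cannot be read, treat the site as opted out
        return not robots.can_fetch(TRAINING_AGENT, url)

    def record_provenance(url: str, text: str, manifest: str = "manifest.jsonl") -> None:
        """Append a verifiable entry (URL, content hash, timestamp) to a public manifest."""
        entry = {
            "url": url,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "collected_at": int(time.time()),
        }
        with open(manifest, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def ingest(url: str, text: str) -> bool:
        """Gate a document for the corpus: skip it if opted out, otherwise log provenance."""
        if opted_out(url):
            return False
        record_provenance(url, text)
        return True

A rights holder could then hash their own file and search the published manifest for a match, which is exactly the kind of verification interface described in the list above.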

Rethinking the Value Chain

The debate on AI and pirated content is not merely a legal or moral issue. It profoundly questions the value chain in the digital economy. If works can be absorbed by machines with no compensation, what residual value does human creativity have in the 21st-century economy?

Two pitfalls should be avoided:

  • Technophobia, which sees AI as an inherent threat.

  • Technological solutionism, which dismisses ethical concerns in the name of innovation.

A balanced path is possible: it involves recognizing the role of creators, establishing fair compensation frameworks, and integrating cultural rights into technology governance.

Several options are currently being considered at the international level:

  • Requiring developers to publish the exact sources of their training datasets

  • Setting up a mandatory collective licensing mechanism, similar to those used for music or television

  • Creating a clear opt-out right for authors who do not wish their works to be used

  • Establishing an automatic royalty system, redistributed to rights holders through collective management organizations

Training artificial intelligence on pirated content raises complex legal, economic, political, and ethical issues. While some companies may have considered themselves above the law amid the recent wave of tech enthusiasm, it is now clear that a rebalancing is necessary.

The tools exist: transparency obligations, licensing mechanisms, withdrawal rights, and legislative frameworks. But a strong political will, both national and international, is needed to enforce them. The future of creativity, justice, and responsible innovation depends on it.

Join us in June for our next topic: cryptocurrencies.
In the meantime, if you need to protect a film, series, software, or e-book, don’t hesitate to contact one of our account managers; PDN has been a pioneer in cybersecurity and anti-piracy for over ten years, and we’re sure to have a solution to help you. Happy reading and see you soon!
