Artificial Intelligence vs. Intellectual Property
Much of AI is not about generating new knowledge, but about distributing existing knowledge in new form. That creates copyright conflicts. And it threatens how humans will collaborate in the future.
Suchir Balaji, an AI researcher, worked for four years at OpenAI. He resigned in August 2024 citing ethical and legal concerns about OpenAI’s stance on intellectual property. In November 2024, he was found dead in his apartment in San Francisco. The death was ruled suicide by gunshot.
The case briefly made the news and put Balaji’s concerns in the spotlight. Even public figures like Elon Musk and Tucker Carlson commented on the suspicious circumstances of his death. But today, just half a year later, nobody is talking about the unresolved intellectual property issues in the AI industry anymore.
The problem is still stewing, though, and every one of us had better figure out where we stand on it. Legally, because trillions of dollars of value depend on it. And ethically, because how we coexist and cooperate in the future depends on it. So here’s my contribution to a debate that is as fascinating as it is necessary.
TL;DR Summary
AI is surrounded by a mystical aura: an emerging new form of intelligence that has arrived to accelerate innovation and drive the economy. An important truth, however, is that much of AI’s utility today lies not in generating new knowledge, but in feeding people existing knowledge in a more convenient and tailor-made way than before.
For this business model to work, much of the redistributed knowledge must not be paid for. That becomes problematic from a copyright perspective when said knowledge is owned by someone in the form of intellectual property.
To do this legally, the AI industry hides behind the so-called fair use principle, which allows the distribution of other people’s IP without their consent under certain circumstances. Whether fair use actually holds for deep learning models is questionable, however, especially given the scale at which they operate.
All of this could become a major legal debate once it gets enough traction. The ethical and social dimension, however, matters even more. AI tools are centralizing knowledge building. Much of what previously happened in a decentralized manner, directly between internet users, is now happening between those users and closed-off AI aggregators. The result could be that the rise of AI impedes human collaboration rather than accelerating it.
What is fair use?
The fair use principle is codified in US copyright law to govern the use of copyrighted materials. It exists to balance the rights of creators with the public interest in access to information, education and expression. Similar principles likely exist in other jurisdictions, but I am going to focus on the US, which is the AI frontier and the de facto global standard setter anyway.
To achieve its goal of balancing the above interests, the fair use principle allows limited use of copyrighted material without permission of the copyright owner, especially for criticism, commentary, news reporting, teaching, research and parody.
Whether a given use of copyrighted material is fair use is typically determined by subjective judgement on a case-by-case basis. There is no single checklist of objective yes/no criteria that must be met. Instead, there are guiding principles that can support a decision. Here are some of those principles:
Is the use transformative? If a new perspective is added to the material, it is more likely to qualify as fair use under the criticism/commentary angle.
Is the use commercial? If the user intends to make money off the material, the use is less likely to be fair.
How much of the original work is used? If most of the user’s publication is someone else’s work, it’s less likely to be fair.
Is the original material factual or creative? The more creative, the less likely fair.
Does the use help or harm the copyright owner? If it serves as marketing for them, it’s more likely to be fair. If it takes potential buyers away from them, it’s less likely to be fair.
What was Suchir Balaji’s perspective?
Balaji had worked for OpenAI for four years before he left the firm and voiced his criticism publicly in his essay “When does generative AI qualify for fair use?”, published in October 2024. It’s a 1,500-word essay, and I highly recommend spending the ten minutes to read it alongside this article, because I believe he makes some important points.
In this essay, Balaji argued that OpenAI could not possibly rely on the fair use principle due to the scale of its copying, the lack of transformation and the diminished demand for the original works.
Specifically with respect to transformation, he points to the fact that most LLMs are geared towards low entropy, that is, low output uncertainty. This is because low entropy increases the coherence of outputs and reduces hallucination, which generally improves the user experience. The disadvantage is that low-entropy models tend to simply regurgitate their training data (i.e. the potentially copyrighted material).
For the AI developer, the trade-off is this: the more entropy you allow, the more noise you introduce into the output, and the better you can obfuscate the fact that you are simply presenting someone else’s content as your own. The less entropy you allow, the more coherent your results, but the closer they get to the original work.
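To make this trade-off concrete, here is a minimal Python sketch of temperature sampling, the standard knob that controls output entropy in language models. The logits are invented numbers purely for illustration, not taken from any real model:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw model scores (logits) into a probability distribution.

    Lower temperature -> sharper distribution (low entropy, more
    deterministic, closer to regurgitating the most likely training
    continuation). Higher temperature -> flatter distribution (high
    entropy, noisier, further from the original).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits, made up for this example only.
logits = [4.0, 2.5, 1.0, 0.5]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: entropy {entropy_bits(probs):.2f} bits, "
          f"top-token probability {max(probs):.2f}")
```

Run it and you can watch the entropy rise and the top-token probability fall as the temperature increases. That is the dial between coherence (closeness to the original) and noise (obfuscation) described above.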
Personally, I would argue that this doesn’t even address the central point of the issue. Even if your outputs differ from the original (potentially copyrighted) material, you are still using it. Merely randomizing the output doesn’t change that fact. Imagine you rewrote the entire Harry Potter series and just changed the names of the characters to random other names. It would obviously still be copyright infringement.
Not all transformation is created equal. If the sole purpose of a transformation is to hide where the information comes from, it can’t possibly be fair use in my opinion. It’s a gaming of the principle that goes against the original idea. Every university professor would fail you if you simply paraphrased what others said before.
Below is an exchange between me and ChatGPT on whether it uses Substack as a source to find answers to my questions. It confirms that it does:
Then I asked whether Substack writers get compensated for that. It also confirms that this is not the case:
The problem with this response is that not all public web content has the same purpose or the same IP protection. Even if a Substack writer publishes a free article, they are not doing that out of altruism. They want to advertise their skills and their creativity to attract readers, subscribers or customers for adjacent products. When OpenAI takes their content and presents it without attribution to a ChatGPT Pro subscriber to justify the subscription fee, that is theft. Plain and simple. And in that case it doesn’t matter if they sprinkle some entropy into the output to make it look novel.
Are humans copyright infringement machines?
I have debated this topic with other people before. The most common objection to my point of view is that humans are copyright infringement machines just like large language models. All information and knowledge we have is obtained from somewhere; very little of what we know has not been learned from another person. Yet we would never compensate every one of those people when we use that knowledge to make money. An author would never pay royalties to his high school literature teacher.
I can see where that argument comes from. It’s difficult to draw the line between where you are using someone else’s knowledge and where you can rightfully call it your own. And the debate about the differences between human and artificial knowledge acquisition can become a slippery slope as well.
In my opinion, however, we don’t actually need that debate. It’s not just a question of principle, but also of the magnitude of the problem from a financial perspective. If a human gathers knowledge from another human and uses it to make money, they are typically limited in how much money they can earn from it, simply by how many waking hours they have and how many people they can engage with. In contrast, LLMs can churn through vast amounts of copyrighted material in seconds and distribute it globally. Copyright infringement by an AI is therefore potentially far more harmful to the legal owners than copyright infringement by a person.
This problem shares similarities with the music piracy of the 1990s and early 2000s. Back then, the rise of the internet made it possible for music consumers to download, copy and share large amounts of music without permission. This became a big problem that cost musicians billions of dollars. Tons of lawsuits were filed, many of which were successful.
But why did this problem occur in the 1990s and not in the 1980s? Even before the internet, you could record music at home, put it on a tape and share it with other people, perhaps even make some money doing so. The difference is that it didn’t happen at a scale that was painful enough for musicians to bother. Copying cassettes was slow, it degraded quality, and its physical nature limited the reach of the copies.
Take a look at the cover picture I chose for this article. It shows a robot as a defendant in court, a symbol for the conflict between AI models and digital content producers. That picture is seemingly novel; it didn’t exist before I prompted ChatGPT to draw it for me. But where did ChatGPT get the idea for this quite peculiar design? It probably mixed it together from various other sources, obviously without attribution.
You could say that it just does what humans do, too. Every artist stands on the shoulders of those who came before. But humans don’t copy each other at the scale, and with the industrial approach, that AIs do.
Sincerely,
Rene
PS: As you may have noticed, I dropped the disclaimer that typically precedes my articles. Other people scared me into it, and it has always been an eyesore. I am confident you know very well that nothing I do here is financial advice, even though I publish on the economy and financial markets. :)
Dear Sir,
As a researcher in generative AI and an ethicist working at the intersection of cognitive technology, copyright, and public trust, I found this piece both poignant and prescient.
Your concern over the erosion of intellectual integrity under the guise of "artificial intelligence" is shared by many of us who see how opacity, scale, and a lack of enforceable provenance frameworks are eroding not only copyright norms but the philosophical ground upon which creative authorship stands.
The core issue you raise, the untraceable intermixing of genuine intellectual labour and machine-hallucinated content, is exacerbated by a legal landscape not yet equipped to parse intention, origin, or fair use at the computational scale at which models like Midjourney and Stable Diffusion now operate.
The lawsuits involving these platforms, particularly the Southern California DMCA cases, underscore the risk of precedent establishing liability chains so deep and diffuse that downstream users (collectors, NFT buyers, even curators) may suffer catastrophic loss.
If a takedown is issued on platforms like OpenSea or MagicEden due to a provenance violation, the assets (often purchased in good faith) risk becoming permanently stranded: value locked, art orphaned, and trust in decentralized systems destroyed.
In the short term, we may need to adopt verifiable data-chain registries for training sets and output tracking, essentially ethical audit trails. Solutions like C2PA (Coalition for Content Provenance and Authenticity) show promise when expanded beyond photojournalism. Mid-term, the solution must be infrastructural: embed watermarking, disclosure protocols, and regulatory-compliant opt-in datasets at the model architecture level.
This is not just a technical fix but a moral imperative to avoid continued digital enclosure of human creativity under ambiguous licensing pretenses.
Your call for discernment is timely. Without it, we risk not only losing the rights of the original creator but also confusing the essence of creation itself.
Warm regards,
Graham dePenros
COO, Open Code Mission
GenAI Ethics and AI/Neuro Ethics SME
Great topic and article! I do know that companies like Reddit restrict the use of their data for training new AIs through their Developer Terms of Service, prohibiting developers from using Reddit data to train LLMs without explicit permission.
Now I wonder if Substack content can be taken as easily?