
Sorry, LLMs – Congress Might Make It A Whole Lot Harder To Train On Copyrighted Content


AI bots are hungry.

They’re scraping information found on passports and credit cards and training on novels without authors’ consent. Even fanfiction has been used to train some (presumably quite nerdy) bots.

But a bill proposed in the Senate a few weeks ago could change that.

In late July, Sens. Josh Hawley (R-Mo.) and Richard Blumenthal (D-Conn.) introduced a bipartisan bill called the AI Accountability and Personal Data Protection Act. If passed, the bill would mandate stricter regulations for AI training on copyrighted material, including establishing a federal tort (i.e., a harmful act with legal liability) for misuse of personal data and determining specific remedies for damages.

While the bill protects data and materials that belong to individuals rather than larger entities, publishers and other businesses that work with creators would also be affected if they had published any work copyrighted by an individual.

The material in question ranges from personal data, like browsing history and IP addresses, to copyrighted materials, like books and paintings.


I’m just a bill

The question at the heart of the bill is whether using copyrighted material for LLM training counts as fair use, Chris Mammen, a partner at law firm Womble Bond Dickinson LLP, told AdExchanger.

The bill doesn’t seek to displace previous rulings or suggest that there’s no such thing as fair use of copyrighted material, Mammen explained.

“Fair use” is a broad term referring to any permissible use of copyrighted works without needing a license or explicit consent from the creator, often for purposes like research and reporting.


Now, lawmakers want to create a clearer definition of what is and isn’t deemed fair use, as well as a baseline for penalties if personal data or copyrighted material is used unjustly. The bill calls for compensation equal to the actual financial loss suffered, three times any profit made from exploiting the data or $1,000 – whichever of the three is greatest.
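In plain terms, the remedy works like a three-way maximum. As a minimal sketch of the calculation described above (the function and parameter names are illustrative, not drawn from the bill’s text):

```python
# Hypothetical sketch of the bill's damages formula as described above:
# the claimant recovers whichever is greatest of actual loss,
# three times the infringer's profit, or a $1,000 statutory floor.
# Names are illustrative, not from the bill text.

def statutory_damages(actual_loss: float, infringer_profit: float) -> float:
    """Return the greatest of actual loss, treble profit, or $1,000."""
    return max(actual_loss, 3 * infringer_profit, 1_000)

# Example: $500 in losses and $400 in profit from the data.
# Treble profit ($1,200) exceeds both the loss and the $1,000 floor.
print(statutory_damages(500, 400))  # 1200
```

The $1,000 floor matters most in cases where neither a loss nor a profit can be proven, since it guarantees some recovery.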

Rules and regulations

The good news is that there’s already a four-factor test laid out in the Copyright Act of 1976 to determine whether a given way of using someone else’s work is fair use – which means it should be simple enough to figure out how to apply this to individual use cases, right?

Apparently not. As it turns out, Mammen said, “it’s not a very easy question, given the way the four factors are articulated in the statute.”

Those four components include:

  • the nature of the use (whether it’s commercial or personal, or taken verbatim from the source or paraphrased);
  • the nature of the work (creative or factual);
  • how much of the copyrighted work is used (which can be interpreted to mean how much of a work was input into an LLM or how much of an original product was used in the output);
  • and the market impact (i.e., whether it’s basically a knockoff and displacing a preexisting work).

Interpreting these factors isn’t exactly cut and dried. In June, Mammen said, two judges “reached starkly different conclusions” regarding the market impact of LLMs training on copyrighted works.

In Bartz v. Anthropic, Judge William Alsup, a district judge for the Northern District of California, determined that the training was fair use, since the LLM had not generated knockoffs or imitations of the original books.

However, in Kadrey v. Meta Platforms, Inc., Judge Vince Chhabria (who also sits in the Northern District of California) proposed several ways that LLM training could potentially harm the market, including that models trained on original works could eventually generate similar works, inherently creating competition and the potential to displace the originals.

Bills to pay

Although almost everyone believes that content creators are “entitled to some sort of compensation,” Mammen said, and have the right to require permission before their content is used to generate new, similar works, a vague moral obligation isn’t enough to establish legal rights or protections in court.

Still, some progress is being made on the compensation front. In June, the IAB Tech Lab proposed a new initiative that would offer publishers more control over how LLMs use their content and how they would be paid.

Around the same time, Cloudflare implemented a new model to block AI crawlers from accessing content without express permission and a preset form of payment.
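The article doesn’t detail Cloudflare’s mechanics, but the underlying idea – refuse requests from known AI crawlers unless payment terms are in place – can be sketched roughly. The user-agent tokens below are real crawler identifiers; the function and the paid_crawlers parameter are hypothetical:

```python
# Rough illustration of permission-gated crawling, not Cloudflare's
# actual implementation. GPTBot, ClaudeBot and CCBot are real AI
# crawler user-agent tokens; everything else here is illustrative.

AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

def allow_request(user_agent: str, paid_crawlers: set[str]) -> bool:
    """Allow a request unless it comes from an unpaid AI crawler."""
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in user_agent.lower():
            return token in paid_crawlers
    return True  # not a known AI crawler; serve normally

# Example: GPTBot has no payment agreement, so it is blocked.
print(allow_request("Mozilla/5.0 (compatible; GPTBot/1.0)", set()))  # False
```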

But what about data that shouldn’t be used at all, even for a fair price?

The sincerest form of flattery?

The way that generative AI processes data isn’t really comparable to the way that humans, or even other machines, have used existing content in the past – hence the need for new regulations.

Historically, data has been primarily used for analytics or automated decision-making, said Mammen, rather than generating new content.

While creating art from someone else’s creative outputs isn’t a new concept – “like cover bands,” he said, “or people who make new art in the style of somebody else” – generative AI brings it to a new level.

What’s really “giving us some pause,” said Mammen, “is the fact that AI can do it at scale with great fidelity and with great speed.”

Still, despite the hesitations voiced by attorneys and lawmakers alike, the bill’s future remains uncertain.

Once a bill is introduced, it’s a “long, long journey to the capital city” – and there’s no guarantee it will become a law, especially considering the current administration’s goal to eliminate “bureaucratic red tape” around AI development.

Although only a first step, this bill takes the widely held belief that creators deserve control over the use of their work and turns it into a concrete call to action.
