Meta AI Model Reproduces Nearly Half of Harry Potter Book

Introduction to the Issue

The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that. In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.

Understanding Fair Use

“The fair use analysis you’ve gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’" Lemley said. "That complicates the defendants’ story.” This means that the court will have to consider not only whether the initial data collection was fair use but also whether the final product, which incorporates that data, is also fair use.

Open-Weight Models and Legal Jeopardy

Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The Cornell and Stanford researchers could only do their work because the authors had access to the underlying model—and hence to the token probability values that allowed efficient calculation of probabilities for sequences of tokens. Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.

Closed-Weight Models and Copyright Protection

Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it. Moreover, this kind of filtering makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.

The Perverse Outcome of Copyright Law

“It’s kind of perverse,” Mark Lemley told me. “I don’t like that outcome.” On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models. “There’s a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”

Conclusion

The issue of open-weight models and copyright law is complex and multifaceted. While the Google Books precedent may provide some guidance, it is unclear how courts will ultimately decide these cases. One thing is certain, however: the outcome of these cases will have significant implications for the development of AI and the dissemination of knowledge.

FAQs

Q: What is the Google Books precedent?
A: The Google Books precedent is a legal decision that allows for the copying of copyrighted works for certain purposes, such as search and research.
Q: What is the difference between open-weight and closed-weight models?
A: Open-weight models make their underlying data and models available for public use, while closed-weight models keep this information proprietary.
Q: Why might copyright law create a disincentive for companies to release open-weight models?
A: Copyright law might create a disincentive for companies to release open-weight models because these models are more vulnerable to copyright infringement claims.
Q: What is the potential outcome of these cases for the development of AI?
A: The outcome of these cases will have significant implications for the development of AI, potentially shaping the way that companies approach data sharing and model development.