Introduction to AI in Law
The use of Artificial Intelligence (AI) in the legal profession has been a topic of growing interest in recent years. Large Language Models (LLMs) have shown promise on a variety of tasks, but their ability to do real-world legal work is still being evaluated. New benchmarks have been developed to assess model performance on legal and financial tasks, and the early results point to significant gaps.
Evaluating LLMs’ Performance
The Professional Reasoning Benchmark, published by Scale AI, evaluated leading LLMs on tasks designed by professionals in the field. The study found critical gaps in the models’ reliability for professional adoption, with the best-performing model scoring only 37% on the most difficult legal problems; in other words, it earned just over a third of the possible points on the evaluation criteria. The models frequently made inaccurate legal judgments, and even when they reached correct conclusions, they often did so through incomplete or opaque reasoning.
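To make that figure concrete: benchmarks of this kind grade each response against a rubric of expert-written criteria, and the headline score is the share of possible points earned. The sketch below illustrates only the arithmetic; the criteria, weights, and point values are hypothetical, not taken from the benchmark itself.

```python
# Hypothetical rubric grading: each response is scored against weighted
# criteria, and the final score is the fraction of possible points earned.

def rubric_score(criteria):
    """criteria: list of (points_earned, points_possible) pairs."""
    earned = sum(e for e, _ in criteria)
    possible = sum(p for _, p in criteria)
    return earned / possible

# Illustrative criteria for one hard legal task (invented values):
# correct conclusion, complete issue-spotting, accurate citations,
# and transparent reasoning.
task = [(2, 4), (1, 3), (0, 2), (0.5, 1)]
print(f"{rubric_score(task):.0%}")  # 35%, in the neighborhood of the reported 37%
```

Under this kind of grading, a model can reach the right bottom-line answer and still score poorly if its supporting analysis is incomplete, which matches the reported pattern of correct conclusions reached through opaque reasoning.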
Limitations of LLMs in Legal Work
According to Afra Feyza Akyurek, the lead author of the paper, "The tools actually are not there to basically substitute [for] your lawyer. Even though a lot of people think that LLMs have a good grasp of the law, it’s still lagging behind." The AI Productivity Index, published by Mercor, also found that the models have "substantial limitations" in performing legal work. The best-performing model scored 77.9% on legal tasks, meaning it satisfied roughly four out of five evaluation criteria. However, in fields where errors are costly, a model with such a score may not be useful at all.
Challenges in Evaluating LLMs
Professional benchmarks are a big step forward in evaluating LLMs’ real-world capabilities, but they may still not capture what lawyers actually do. Jon Choi, a law professor at the University of Washington School of Law, notes that "these questions, although more challenging than those in past benchmarks, still don’t fully reflect the kinds of subjective, extremely challenging questions lawyers tackle in real life." Unlike math or coding, legal reasoning may be hard for models to learn because real-world legal problems are messy and ambiguous.
Why LLMs Struggle with Legal Reasoning
The law deals with complex problems that often have no single right answer, and much legal work isn’t recorded in ways that can be used to train models. When it is, the relevant material can span hundreds of pages, scattered across statutes, regulations, and court cases that sit in a complex hierarchy of authority. Furthermore, LLMs may lack a mental model of the world, that is, the ability to simulate a scenario and predict what will happen. This capability could be at the heart of complex legal reasoning, and the current paradigm of LLMs trained on next-word prediction may not be sufficient to produce it.
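To see what "next-word prediction" means in its simplest form, here is a toy sketch, with an invented corpus and no connection to any production system: a bigram model that predicts the most frequent successor of each word. Real LLMs replace these raw counts with a large neural network over far longer contexts, but the training objective, predicting the next item from the preceding context, is the same in spirit.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a tiny corpus,
# then predict the most frequent successor. Purely illustrative.
corpus = "the court held that the statute applies and the court affirmed".split()

successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def predict_next(word):
    counts = successors.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))    # "court" (follows "the" twice, "statute" once)
print(predict_next("court"))  # "held" ("held" and "affirmed" are tied;
                              # the tie breaks by first occurrence)
```

A model like this can produce fluent continuations without any ability to simulate the scenario those words describe, which is the gap the world-model critique points at.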
Conclusion
While LLMs have shown promise across a variety of tasks, their ability to do real-world legal work remains limited. The models have critical gaps in reliability, and their performance on legal tasks is not yet satisfactory. Further research is needed to develop more effective benchmarks and to improve the models’ ability to reason like lawyers.
FAQs
- Q: Can LLMs replace human lawyers?
A: No, LLMs are not yet capable of replacing human lawyers. While they can perform certain tasks, they lack the ability to reason like lawyers and make accurate legal judgments.
- Q: What are the limitations of LLMs in legal work?
A: LLMs have substantial limitations in performing legal work, including their inability to fully reason about problems like humans do and their lack of a mental model of the world.
- Q: How are LLMs evaluated?
A: LLMs are evaluated using benchmarks that assess their performance on legal and financial tasks. These benchmarks are designed to simulate real-world scenarios and test the models’ ability to reason and make accurate judgments.
- Q: What is the current state of LLMs in the legal profession?
A: LLMs are still in the early stages of development, and their use in the legal profession is limited. While they have shown promise, they are not yet capable of performing complex legal tasks with accuracy and reliability.