AI Models for Document Processing: A Review of Current Performance
Promotional claims for new OCR models don’t always match real-world performance, according to recent tests. “I’m typically a pretty big fan of the Mistral models, but the new OCR-specific one they released last week really performed poorly,” Willis noted.
Real-World Performance Issues
“A colleague sent this PDF and asked if I could help him parse the table it contained,” says Willis. “It’s an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model really performed poorly, repeating the names of cities and botching a lot of the numbers.”
AI app developer Alexander Doria also recently pointed out on X a flaw with Mistral OCR’s ability to understand handwriting, writing, “Unfortunately Mistral-OCR has still the usual VLM curse: with challenging manuscripts, it hallucinates completely.”
Current Leaders in AI Models for Document Processing
According to Willis, Google currently leads the field in AI models that can read documents: “Right now, for me the clear leader is Google’s Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral did not with a tiny number of mistakes, and I’ve run multiple messy PDFs through it with success, including those with handwritten content.”
Gemini’s performance stems largely from its ability to process expansive documents (in a type of short-term memory called a “context window”), which Willis specifically notes as a key advantage: “The size of its context window also helps, since I can upload large documents and work through them in parts.” This capability, combined with more robust handling of handwritten content, apparently gives Google’s model a practical edge over competitors in real-world document-processing tasks for now.
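Working through a large document "in parts," as Willis describes, can be sketched roughly as follows. This is a minimal illustration and not tied to any particular model or API: the function names are hypothetical, and the four-characters-per-token estimate is a crude assumption standing in for a real tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def batch_pages(pages: List[str], token_budget: int) -> List[List[int]]:
    """Group consecutive page indices so each batch fits the token budget.

    A single page that exceeds the budget still gets a batch of its own,
    so no page is silently dropped.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, page in enumerate(pages):
        cost = estimate_tokens(page)
        # Close the current batch if adding this page would exceed the budget.
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += cost
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    pages = ["a" * 400, "b" * 400, "c" * 400]  # roughly 100 tokens each
    print(batch_pages(pages, token_budget=250))
```

A larger context window simply means a larger budget and fewer batches, which is the practical advantage Willis points to.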
The Drawbacks of LLM-Based OCR
Despite their promise, LLMs bring several new problems to document processing. They can produce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions that appear in the document text (treating them as part of the user prompt), or simply misinterpret the data.
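Some of these failures can be caught with simple checks on the extracted output. The sketch below is an illustration only, with hypothetical function names and an assumed row layout (a text label followed by numeric cells). It flags the two symptoms Willis described: repeated labels, such as city names, and cells that no longer parse as numbers. It cannot detect a number that is wrong but still well formed, so it supplements human review and does not replace it.

```python
from collections import Counter
from typing import Dict, List, Optional


def parse_number(cell: str) -> Optional[float]:
    """Return the cell as a float, or None if it does not look numeric."""
    cleaned = cell.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def flag_suspect_rows(rows: List[List[str]]) -> Dict[str, list]:
    """Flag repeated labels and non-numeric cells in an extracted table.

    Assumed layout: each row is [label, value1, value2, ...] as strings.
    """
    labels = Counter(row[0].strip().lower() for row in rows if row)
    repeated = sorted(label for label, count in labels.items() if count > 1)
    bad_cells = [
        (r, c)  # (row index, column index) of each unparseable cell
        for r, row in enumerate(rows)
        for c, cell in enumerate(row[1:], start=1)
        if parse_number(cell) is None
    ]
    return {"repeated_labels": repeated, "non_numeric_cells": bad_cells}


if __name__ == "__main__":
    extracted = [
        ["Boston", "1,204", "317"],
        ["Chicago", "2,950", "3l7"],  # letter "l" where a digit should be
        ["Boston", "1,204", "88"],    # label repeated from the first row
    ]
    print(flag_suspect_rows(extracted))
```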
Conclusion
AI models for document processing show real promise, but their performance varies widely in real-world use. In Willis’s tests, Google’s Gemini 2.0 Flash Pro Experimental currently leads the field, handling messy PDFs and handwritten content with few mistakes, while Mistral’s OCR-specific model repeated city names and botched numbers. As the technology evolves, users should stay alert to the drawbacks of LLM-based OCR: confabulated or hallucinated text, instructions in a document being mistaken for part of the prompt, and general misinterpretation of the data.
Frequently Asked Questions
Q: What is OCR, and how does it work?
A: OCR stands for optical character recognition, a technology that converts printed or handwritten text in images or scanned documents into machine-readable text. Traditional OCR systems do this by recognizing the shapes of individual characters in the image.
Q: What are LLMs, and how do they relate to OCR?
A: LLM stands for large language model, a type of AI model trained on large amounts of text to process and generate human language. In the context of OCR, LLMs that accept image input can be used to recognize and extract text from documents, including those with messy layouts or handwriting.
Q: What are the limitations of LLM-based OCR?
A: LLM-based OCR can introduce confabulations or hallucinations (plausible-sounding but incorrect text), accidentally follow instructions that appear in the document, or misinterpret the data. Any of these can make the results less accurate and less reliable.