PDF: Please Don’t Fail – Optical Character Recognition and the Holy Grail of Intelligent Automation

By Chris Surdak, JD, Senior IRPA AI Contributor

If there is a “Holy Grail” of business process automation, it’s likely the automatic reading — and understanding — of the lowly portable document format (PDF). To most humans, this task seems mundane, if not downright punishing. But in the world of software and algorithms, reading and understanding such documents is notoriously difficult.

Many, if not most, organizations seek to automate the process of reading such documents by leveraging technology, such as Robotic Process Automation (RPA), Machine Learning (ML), and Artificial Intelligence (AI). Given the history and mathematics involved, it’s unlikely this challenge will be solved in a comprehensive way any time soon. It’s important to appreciate just how difficult a task this is. For most organizations, even small advances in this process may lead to significant business gains, and incremental success should encourage, rather than confound, organizations’ efforts to this end.

If it were easy, it would already be done

Here in 2021, it’s rather easy to take the level of automation achieved over the last half-century for granted. We literally live in the world of “The Jetsons,” as flying cars, 3D printed food, robots, and video phone calls are now science fact rather than science fiction. And the pace of advancement continues to accelerate as these innovations resonate and build upon one another in waves of disruption crashing on the shores of our traditions, habits, rules, regulations, and ultimately, our expectations.

Let’s give ourselves a little bit of credit. Those business tasks and processes that lent themselves to easy automation were automated ten, twenty, or even thirty years ago. What remains is the hard stuff: that which by some combination of cost, complexity, or compliance has eluded all our prior attempts at automation.

If we accept this reality, we must also accept that our wishes for the most cantankerous and intractable business issues (and business users) will remain unsatisfied in the digital era. And the business needs that remain unfulfilled are naturally the most difficult to solve, either technically or, more often, politically.

Old habits die hard

As companies adopted widespread business process automation throughout the 1980s and 1990s, they faced a significant problem: the need for legal traceability and compliance in digital transactions. Previously, business deals were consummated with a handshake and a signature — with emphasis on the latter. For centuries, bookkeepers kept books, and in those books were signatures or “marks” which showed people were legally bound to the words and numbers recorded on the physical page they signed.

Digitization meant those who lived in the world of “books” and “pages” had to separate themselves from the physical documents — something they simply could not bring themselves to do. Today, people may not know or remember just how much consternation digitization created among legal, audit, and financial professionals back in the day. But it was something of a holy war, and it shook some professions to their very foundations — physical papers with physical signatures written upon them.

It only took 30 years for attorneys, accountants, and financiers to accept PDFs over hard copies. When they finally capitulated, it was a major concession and a major step toward ending 19th-century business and legal practices. Business processes evolved to work with this new form of legally binding document over the same 30-year period.

In response to this victory, organizations built document-processing capabilities whereby PDFs are collected and fed to human workers who read, interpret, and hand-key pertinent information from them into various business information systems. Once keyed in, relevant information can be processed by digital means using digital systems. Known as “swivel chairing” (the process of swiveling from the screen with the PDF to the screen where its contents must be keyed in), this activity is the final frontier between the old, paper-based, analog world and the new, computer-based, digital world.

Life in this frontier is slow, monotonous, error-prone, and expensive. Ironically, the same people who fought so hard against acceptance of the PDF are now actively trying to accelerate its demise as the cost of transposing information from these image files into our systems is little different from that of transposing from hard copy to digital. We digitized the means of transmission, but we have yet to digitize the means of consumption.

OCR to the rescue?

Optical character recognition (OCR) technology has been around for a long time; I first worked with it in the mid-1990s while consulting with a telecommunications company that scanned checks sent in by its customers. The company’s check scanners cost millions of dollars apiece and could scan thousands of checks per hour at a reasonably high level of accuracy (i.e., around 99%). While expensive, they paid for themselves quickly since they replaced dozens of humans who previously had to key in the same information by hand.

One might assume by now, a quarter of a century later, OCR technology would have advanced to the point of near perfect accuracy in reading such scanned documents. But if one were to make such an assumption, one would be wrong.

It’s not unusual to hear OCR vendors claim to provide 99% scan accuracy — which sounds quite good. But they don’t tell you accuracy is measured on a character-by-character basis. Unless you’re reading documents filled with one-letter words, “99%” is somewhat meaningless to actual outcome accuracy.

Let me explain. Apply the 99% accuracy to the five-letter word “donut.” The chance that 99% accurate OCR will get “donut” right is:

99% * 99% * 99% * 99% * 99% ≈ 95.10%

which is neither terrible nor great.

But if this is one word in a document with 500 words (i.e., an average page), then the expected number of misread words on the page, assuming every word has five letters, is about 25 (500 words × a 4.9% chance of error each). That’s a lot of errors to correct. Given the loquacious and garrulous nature of most legal speak, five-letter words are the exception in a contract rather than the rule. And rare is the contract that doesn’t span dozens, if not hundreds, of pages of text.
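The arithmetic above is easy to check for yourself. The following is a minimal sketch, not a vendor formula: it assumes character errors are independent of one another (real OCR errors tend to cluster, so treat this as a rough model), and the function names are my own invention for illustration.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character in a word is read correctly.

    Assumes per-character errors are independent, which is a simplification.
    """
    return char_accuracy ** word_length


def expected_misread_words(char_accuracy: float, word_length: int, words_per_page: int) -> float:
    """Expected count of words on a page containing at least one misread character."""
    return words_per_page * (1.0 - word_accuracy(char_accuracy, word_length))


if __name__ == "__main__":
    # "donut": five characters at 99% per-character accuracy
    print(f"word accuracy: {word_accuracy(0.99, 5):.2%}")
    # a 500-word page made up entirely of five-letter words
    print(f"expected misread words: {expected_misread_words(0.99, 5, 500):.1f}")
```

Try raising `word_length` to 10 or 12 to approximate legal vocabulary, and watch the expected error count climb.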

By now, this should be readily apparent: The likelihood of OCR reading any document more complex than a check without a single error is vanishingly small.

More math to the rescue?

For those of us working deeply in the space of natural language processing (NLP) and unstructured data analytics, Claude Shannon is both a legend and a fiend. Shannon was an expert on “information theory” and spent his entire life trying to solve the “fundamental problem of communication.” This was a man who needed some hobbies but who, nevertheless, put together some complex and gorgeous mathematics that continues to help practitioners understand the challenges of analyzing communications.

One result that follows from Shannon’s information theory, conceptually complex but pragmatically easy to leverage, is that in a given document, a longer word that also appears infrequently has more relevance to the true meaning of the document. In other words, if there is something Finance and Legal want to know from a given document, it will likely be a long word that only appears once or twice.

Consider a typical contract. It will name the relevant parties at the beginning — we’ll use Kermit and Snuffleupagus — and thereafter refer to them as “party of the first part” and “party of the second part.” The word “party” will appear repeatedly throughout the document while “Kermit” and “Snuffleupagus” each appear only once, so if you want to know what the contract is about and who is involved, there’s a good chance “Kermit” and “Snuffleupagus” are important clues — because they are big and rare.
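One way to make the “big and rare” intuition concrete is Shannon’s self-information, where a token with probability p carries -log2(p) bits. The sketch below is only an illustration under stated assumptions: probabilities are estimated from word counts within a single toy document, the sample text is invented, and the measure captures rarity only, not word length.

```python
import math
from collections import Counter


def surprisal_bits(tokens: list[str]) -> dict[str, float]:
    """Self-information, -log2(p), of each distinct token.

    p is estimated as the token's relative frequency in this one document.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Toy stand-in for a contract; the wording is invented for illustration.
text = (
    "kermit and snuffleupagus agree that the party of the first part "
    "shall pay the party of the second part"
)
scores = surprisal_bits(text.split())

for word in ("party", "kermit", "snuffleupagus"):
    print(f"{word}: {scores[word]:.2f} bits")
```

The names score higher than “party” simply because they appear once while “party” repeats. A production system would estimate probabilities from a large corpus rather than one document.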

Similarly, a purchase order (PO) or invoice, two popular grails in the current quest for automation, typically includes a single appearance of its corresponding PO or invoice number in the document. For a large company, these numbers are also likely to be long because big companies use a lot of POs and invoices. Shannon’s theory would thus imply that these numbers should be easy to find due to their size and scarcity. This makes the identification problem sound easier, but when we return to OCR’s accuracy numbers, we begin to see the conundrum we face.

Recall that OCR’s accuracy of roughly 99% is on a per-character basis. Apply the 99% accuracy number to a larger “word,” such as a U.S. Social Security number (SSN), and we have 11 characters, hyphens included: ABC-DE-FGHI. The resulting accuracy would be:

99% * 99% * 99% * 99% * 99% * 99% * 99% * 99% * 99% * 99% * 99% = 89.53%.

So, there is a greater than 1 in 10 chance the SSN will be misread on a given scanned document. You might readily recognize it as an SSN, but you still have a roughly 10% chance of reading it incorrectly. If you built a bot to process 1,000 such documents per day, it would throw you about 100 erroneous documents per day.
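It can help to turn the question around: how good would per-character accuracy need to be for an 11-character field to come out right 99% of the time? The sketch below works that out under the same independence assumption used above. The function names are hypothetical, chosen for illustration.

```python
def field_accuracy(char_accuracy: float, field_length: int) -> float:
    """Chance an entire field is read correctly, assuming independent character errors."""
    return char_accuracy ** field_length


def expected_bad_documents(char_accuracy: float, field_length: int, documents: int) -> float:
    """Expected number of documents per batch whose field contains a misread."""
    return documents * (1.0 - field_accuracy(char_accuracy, field_length))


def required_char_accuracy(target_field_accuracy: float, field_length: int) -> float:
    """Per-character accuracy needed to reach a target whole-field accuracy."""
    return target_field_accuracy ** (1.0 / field_length)


ssn_length = len("ABC-DE-FGHI")  # 11 characters, hyphens included

print(f"field accuracy: {field_accuracy(0.99, ssn_length):.2%}")
print(f"bad documents per 1,000: {expected_bad_documents(0.99, ssn_length, 1000):.0f}")
print(f"per-character accuracy needed for a 99% field: {required_char_accuracy(0.99, ssn_length):.3%}")
```

Under these assumptions, a 99%-accurate SSN read needs per-character accuracy of roughly 99.91%, which is a far taller order than the headline “99%.”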

Not all errors are the same

The challenge of these 100 errors is not just that they exist but that you won’t know the kind of error that has occurred. Did the OCR fail to recognize the number as an SSN and think it was something else? Did it misread a correct SSN, or did it read an incorrect SSN as if it were correct? The nature of the error has a significant impact on the nature of its associated risks and costs, and these possibilities must be taken into account BEFORE you decide to automate.

A false positive is when the system thinks there is an error when there is not. This is a waste of time, attention, and money. It’s inefficient but likely not fatal. Alternatively, a false negative is when there is an error the system fails to recognize. These are the bad errors — the ones that make Finance and Legal cringe in fear or curtail the career of one or more executives. I have seen false negative errors occur in production, and they can cost an organization millions of dollars in one night.
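If you audit a sample of processed documents by hand, you can count how often each kind of error occurs before you commit to automating. The sketch below uses the definitions given above (a false positive is a flag with no real error; a false negative is a real error with no flag). The `Extraction` record and its fields are hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    flagged: bool         # the system raised an error on this document
    actually_wrong: bool  # ground truth from a human audit (hypothetical)


def tally(results: list[Extraction]) -> dict[str, int]:
    """Count each outcome type across a batch of audited extractions."""
    counts = {"true_positive": 0, "false_positive": 0, "false_negative": 0, "true_negative": 0}
    for r in results:
        if r.flagged and r.actually_wrong:
            counts["true_positive"] += 1
        elif r.flagged:
            counts["false_positive"] += 1   # wasted review effort
        elif r.actually_wrong:
            counts["false_negative"] += 1   # the costly, silent kind
        else:
            counts["true_negative"] += 1
    return counts


# Made-up audit sample for illustration only.
sample = [
    Extraction(flagged=True, actually_wrong=True),
    Extraction(flagged=True, actually_wrong=False),
    Extraction(flagged=False, actually_wrong=True),
    Extraction(flagged=False, actually_wrong=False),
    Extraction(flagged=False, actually_wrong=False),
]
print(tally(sample))
```

Multiplying each count by its estimated cost per incident gives a first-order view of whether the false negatives alone outweigh the labor you hope to save.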

Accept and address the complexity

Because of these risks, and the limits inherent in some of these technologies, many executives have retreated from their adoption of intelligent automation (IA) technologies, such as RPA. Having been burned once, twice, or thrice, some have chosen to look to alternatives, such as AI, to address their automation needs. But poor adoption is poor adoption, and a more complex technology is not likely to generate better results under such circumstances. It would be like trying to learn to juggle and starting with chainsaws rather than red rubber balls.

It is far better to change your technique than your technology. The systems engineering knowledge required to automate such processes effectively is rare, but it’s out there. As research firm Gartner stated in 2018, only 4% of organizations were successful deploying RPA at scale. So, it is possible; it’s just not as easy as some would suggest.

Solid systems engineering, process modeling, and incremental, experimental deployment all contribute to success with these tools — just as shortcuts and false promises contribute to less than stellar results. The current wave of automation adoption is not optional — any more than having a website was optional by 2010. As such, it behooves organizations to adopt IA — and adopt it with some intelligence.

Lessons from the front lines

As I outlined in my book, “The Care and Feeding of Bots,” there are several best practices for using RPA correctly. I have seen organizations derive significant value from this process by approaching PDF digestion automation the right way. For automating PDF reading and extraction, I offer the following as a good foundation:

Clearly communicate value and limitations. 99% accuracy sounds pretty good, but as shown above, when you really dig into and understand the numbers, this level of accuracy is not likely to meet your business expectations. Limitations must be understood and communicated before committing to an automation program rather than uncovered after the fact.

Trust but verify. All processes have an error or failure rate. No process is perfect. The potential for failures can be modeled and anticipated, and mitigation tools and techniques implemented, prior to deployment. All these things CAN be true, but ARE they true for you? As you design and build your automations, ensure these solid systems engineering principles are applied from the beginning and rigorously tested prior to production use. Then ensure various monitoring logic and gates are applied, so things don’t go off the rails unexpectedly. Failure is inevitable; failing gracefully isn’t.

Seek symbiosis versus automation. Most organizations pursue a “bots instead of humans” approach to automation. This is a mistake. Bots are great at some things (e.g., repetitive tasks) and lousy at others (e.g., pattern recognition). Humans are generally the opposite. Trying to entirely replace humans with bots is akin to throwing the baby out with the bath water — with similar distressing results.

This is where symbiosis between an attended bot and a human being can really shine. The bot performs the repetitive and tedious clickety-click required between different systems, and the human is free to focus on the most critical pattern recognition task of “what is present in the document,” which remains stubbornly hard to encode. Bot and human form a team, and each contributes according to their strengths while accommodating the other’s weaknesses. Too often, this symbiosis is stumbled upon by accident rather than stated as a strategic goal.
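A simple gate is one way to put this division of labor into practice: the bot keys in the fields it can vouch for and hands everything else to a person. The sketch below is a hedged illustration, not a description of any particular RPA or OCR product. The `OcrField` record, its `confidence` score, and the 0.98 threshold are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Shape of a U.S. SSN as printed: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


@dataclass(frozen=True)
class OcrField:
    text: str
    confidence: float  # hypothetical engine-reported score in [0, 1]


def route(field: OcrField, min_confidence: float = 0.98) -> str:
    """Return 'bot' if the field can be keyed in automatically, else 'human'."""
    if SSN_PATTERN.fullmatch(field.text) is None:
        return "human"  # wrong shape: likely misread, or not an SSN at all
    if field.confidence < min_confidence:
        return "human"  # right shape, but the engine itself is unsure
    return "bot"


print(route(OcrField("123-45-6789", 0.995)))  # well-formed, high confidence
print(route(OcrField("123-45-678O", 0.995)))  # letter O where a digit belongs
print(route(OcrField("123-45-6789", 0.900)))  # well-formed, low confidence
```

Note what this gate cannot catch: one digit misread as another digit still matches the pattern. That is exactly the false negative described earlier, which is why a human spot-check of the “bot” pile remains worthwhile.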

Learn by doing. RPA is often sold like a late-night TV rotisserie: Set it and forget it. Such automation is rarely achieved. The observer effect, often conflated with the Heisenberg Uncertainty Principle, roughly states that the act of measuring something changes what is measured. Process automation loosely follows the same principle: The act of automating a process changes the process. As such, you’ll never design the perfect bot before you deploy it. Likewise, you’ll never deploy a successful bot that doesn’t require some ongoing care and feeding, hence the title of my book.

RPA is different from nearly all other information technologies in that its cost of maintenance almost always exceeds its cost of acquisition. This is the opposite of nearly all other software investments, which is why many organizations have struggled to achieve a positive return on investment (ROI). To use RPA correctly, build it, deploy it, use it, AND expect to change it, modify it, and tweak it for quite some time to come. One of the most successful bots I ever witnessed in production went through 27 revisions in its first year of operation. And like fine wine, it got better and better with age.

Be pound wise. RPA doesn’t follow the normal rule of “acquisition costs more than maintenance.” So, the normal procurement approach is all wrong when it comes to buying RPA. The hyper-sensitivity to upfront cost is deeply ingrained in most organizations’ purchasing DNA, and that’s not likely to change for one investment which just so happens to break the rules of the IT industry.

Unfortunately, this drives a lot of incorrect behavior in the acquisition of IA technology. Organizations that run out and hire Jack and Diane’s Custom Cakes, Small Engine Repair, and RPA Consultancy, LLC because their proposal was 75 cents less per hour than the next lowest bid are going to get exactly what they paid for and should expect nothing less, or rather, nothing more. The process may have gone perfectly from a “procurement” perspective, but the results for the business will almost certainly be underwhelming.

The uncomfortable truth that stems from Gartner’s research on RPA success is this: If only 4% of organizations are succeeding with RPA at scale, you need to be engaging with the top 4% of available expertise, not the bottom 4%. I ask companies all the time about their RPA efforts, “Are you trying to succeed, or are you trying to fail as cheaply as possible?” Adopting RPA should not be something to post about on TikTok. It must be treated as a necessary and strategic initiative to remain relevant in an ever more digital world.

Am I saying, “Just spend a lot”? Far from it. You must also keep in mind that the most expensive is not always the best. I know firsthand that when you’re paying an hourly rate for expertise, “expertise” is often the least of what you’ll get. Are you paying for expertise, or are you paying for expertise, a senior partner, three partners, two directors, a manager, an art department, a contract department, sales and marketing, regulatory compliance, a receptionist, and a partridge in a pear tree? How precisely is each of your dollars spent when you acquire an “expert”?

How do you pick wisely? A few rules of thumb to consider include:

  • Can they explain things plainly and succinctly? As Einstein reportedly said, “If you can’t explain it simply, you don’t understand it well enough.”
  • Ask for more case studies than offered. If an expert provides X case studies, ask for X plus at least 3, and focus on the last 3. Most people only offer as many as they have — or only their best.
  • Use references. PowerPoints and LinkedIn tell a story. See if the story is consistent with the lived experience of others who have been there.


By now you may be asking yourself, “Is this even worth it? All I wanted to do is have my PDFs read into my ERP system!” Fortunately, or perhaps unfortunately, the answer is, “Yes, but . . .” I have personally seen organizations achieve great, repeatable, and sustainable results in OCR data extraction with RPA. I have also seen poorly planned and executed “solutions” which cost their owners many, many times their cost of acquisition in failed processing. The key takeaway is this: It won’t be easy, but it will be worth it. Patience and perseverance win in the end!
