The business uses of OCR

This article appears in the issue October 1998 [Volume 7, Issue 11]

Evaluating recognition tools and their role in business automation

Judging OCR's effectiveness depends mostly on what your documents look like and just what you hope to accomplish by digitizing them. This article attempts to frame the value of OCR (optical character recognition) and its alternatives by explaining how recognition works in getting paper documents into digital form.

Users and potential users of Xerox (www.xerox.com) TextBridge98, Caere (www.caere.com) OmniPage 8.0, Adobe (www.adobe.com) Acrobat Capture 2.0 and all their variants and competitors can apply the observations in this article to their actual applications. The author has extensively tested all of those packages and many more in arriving at these observations. The one solid fact in OCR: Accuracy and efficiency are determined more by the document than the OCR software package.

The goal of OCR: better than typing

The biggest factor that determines the effectiveness of OCR is the ability to capture the document in an acceptable format.

There are three primary categories of OCR users:

  • The desktop user: OCR is embedded in many common applications, including the word processors and fax programs people use every day. The ability to convert a few pages of fax or scanned documents to editable text can be handy once in a while, and this class of occasional users can balance the tasks of OCR and editing against their own typing time.
  • The tactical user: OCR was adopted most enthusiastically by law firms early on because they could capture the textual content from their crucial paper documents better than with the previous solution. When OCR works, as in this example of scanning documents of reliable quality and simple layout, ROI is measured in hours.
  • The strategic user: OCR is used to meet corporate goals, such as print on demand or online publishing initiatives where paper communications are intentionally minimized.

For this article, we'll define "OCR" as the recognition of machine print, and "ICR" as the recognition of hand print.

The more characters there are on the page to be captured, the greater the impact of recognition in general. By that measure, books, manuals and office documents are the best subjects for successful OCR usage. Now, as more information moves to the Web, we see companies converting information assets, such as research libraries and core documentation, to electronic form. OCR has gained renewed significance and potential in the market.

OCR is a critical technology because the largest cost of implementing new technology is the labor demanded by the change. In the case of most business publications and office documents, the cost of converting from a paper legacy may be minimized, if OCR is applied properly.

While the goal of OCR is to reduce the labor costs involved in document capture projects, OCR can never reduce those costs to zero. Fully automatic OCR is technically possible and even widely available, but the output is not suitable for most purposes. For that reason, labor remains the single biggest cost factor in even the most efficient applications. And most of that labor cost is incurred in the last step of the following list, quality control.

Paper to digital: basic steps

Whether done by a single user on a desktop or shared among hundreds of network clients, most scanning and OCR jobs follow the same path:

  • document prep: debind, remove staples, organize stacks
  • scanning: flatbed or automatic document feeder
  • batch management: control stacks of documents in process
  • job tracking: control documents through each step
  • job priority change: modify order in which documents are processed
  • rescan: if document is illegible, upside down, etc.
  • image enhancement: deskew, despeckle, thresholding
  • page segmentation: choose areas of page for specific handling
  • OCR processing: can be done on desktop or network server
  • quality control: editing and cleanup of OCR output.
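
The path above can be sketched as a small job tracker. This is only an illustration under stated assumptions: the class `Job`, the function `next_job` and the priority numbers are hypothetical names invented here, not part of any product mentioned in this article. Batch management, job tracking, priority changes and rescans are treated as control actions on a job rather than as processing stages.

```python
from dataclasses import dataclass, field

# Processing stages taken from the list above, in order.
# Tracking, priority changes and rescans are handled by the Job itself.
STAGES = [
    "document prep",
    "scanning",
    "image enhancement",
    "page segmentation",
    "OCR processing",
    "quality control",
]


@dataclass
class Job:
    """Hypothetical record for one document moving through the stages."""
    name: str
    priority: int = 0          # higher number = processed sooner (assumption)
    stage: int = 0             # index into STAGES of the next stage to run
    history: list = field(default_factory=list)

    @property
    def done(self):
        return self.stage >= len(STAGES)

    def advance(self):
        """Record the current stage as finished and move to the next one."""
        if self.done:
            raise ValueError(f"{self.name} has already finished every stage")
        self.history.append(STAGES[self.stage])
        self.stage += 1

    def rescan(self):
        """Send an illegible or upside-down document back to scanning."""
        self.history.append("rescan requested")
        self.stage = STAGES.index("scanning")


def next_job(batch):
    """Return the unfinished job with the highest priority, or None."""
    pending = [job for job in batch if not job.done]
    return max(pending, key=lambda job: job.priority) if pending else None
```

Raising a job's `priority` before calling `next_job` corresponds to the "job priority change" step, and `history` gives a simple audit trail for job tracking.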

Expanding the envelope

To make a major impact on the overall process, several successful solutions have been introduced to reduce error correction. The first class of labor-saving techniques included such technical advancements as split-screen viewing of source images and data entry forms online. Those techniques were designed to facilitate the cleanup necessary to correct OCR output, but they do nothing to reduce the number of real or suspected errors that need to be worked on.

A few pioneering companies have taken the lead in addressing the core question of accuracy for large-scale OCR applications. Widespread testing has shown that OCR engines differ in accuracy from one class of documents to another, and that each has relative strengths and weaknesses. Prime Recognition (www.primerecognition.com) has taken advantage of that diversity of talents among OCR engines by applying five or six of them to each page. PrimeOCR offers a high-powered network solution for the most demanding requirements. By applying all of the leading OCR programs to each page and comparing the results in a smart editing format, such demanding users as litigation services report up to 50% reductions in OCR errors. That 50% reduction in errors translates directly to a 50% cut in the overwhelming labor costs to clean up documents and often spells the difference between not doing a project and making it a winner.
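
The general idea of comparing several engines can be illustrated with a simple word-level majority vote. This is a minimal sketch, not the method PrimeOCR actually uses, which the article does not describe. It assumes each engine's output for a line has already been aligned to the same number of words, and the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(engine_outputs):
    """Word-level majority vote across several OCR outputs of one line.

    Assumes the outputs are already aligned, i.e. every engine returned
    the same number of words. Returns the voted text and the positions
    where the engines disagreed, so an editor can review only those.
    """
    token_rows = [text.split() for text in engine_outputs]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("outputs must be aligned to the same word count")

    voted, suspects = [], []
    for position, candidates in enumerate(zip(*token_rows)):
        word, votes = Counter(candidates).most_common(1)[0]
        voted.append(word)
        if votes < len(candidates):   # not unanimous: flag for review
            suspects.append(position)
    return " ".join(voted), suspects


text, suspects = vote([
    "The qu1ck brown fox",    # engine A misreads "quick"
    "The quick brown fox",    # engine B is correct
    "The quick brovvn fox",   # engine C misreads "brown"
])
print(text)       # The quick brown fox
print(suspects)   # [1, 2]
```

Because each engine tends to fail on different characters, the vote corrects errors that only one engine makes, and the list of disagreements narrows the editor's attention to the words most likely to be wrong.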

For some applications, even the enhanced accuracy of multi-engine OCR leaves a lot of work to be done manually. For example, in conversions to SGML or XML, everything must be perfect. Not only must the words, letters and numbers be recognized accurately, but all requisite coding for identifying content and data must be captured as fully as possible. Innodata (SS) has automated that level of QC with a proprietary system that analyzes the output of multiple OCR engines and applies thousands of rules of syntax and grammar. Such rules flag problems as small as missing punctuation in the OCR results, automatically leading the cleanup editor through every questionable instance. That process brings us closer to 100% automated/99.995% accurate OCR by reducing QC labor as much as possible.
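
Rule-based checking of this kind can be pictured with a few regular expressions. The sketch below is an assumption-laden toy, not Innodata's proprietary system: the three rules and the function name `flag` are invented for illustration, whereas a production system would apply thousands of rules.

```python
import re

# Three illustrative rules (assumptions, not taken from any real product).
RULES = [
    ("digit inside word", re.compile(r"\b[A-Za-z]+\d+[A-Za-z]+\b")),
    ("missing end punctuation", re.compile(r"[A-Za-z0-9]$")),
    ("space before punctuation", re.compile(r"\s+[,.;:]")),
]


def flag(sentences):
    """Return (sentence number, character offset, rule label) for every hit.

    Each item in `sentences` is assumed to be one complete sentence of
    OCR output, so a sentence that ends in a letter or digit is suspect.
    """
    findings = []
    for number, sentence in enumerate(sentences, start=1):
        for label, pattern in RULES:
            for match in pattern.finditer(sentence):
                findings.append((number, match.start(), label))
    return findings


for finding in flag(["The c0st is high.", "No stop here", "Fine , really."]):
    print(finding)
# (1, 4, 'digit inside word')
# (2, 11, 'missing end punctuation')
# (3, 4, 'space before punctuation')
```

The output is a worklist: the cleanup editor jumps from one flagged offset to the next instead of proofreading every character.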

What is "accurate" in OCR?

  • One nominal page: single-spaced, typewritten, with 2,000 characters
  • 99.995% accuracy is commercial premium performance, and expensive
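
To put those two figures together, the arithmetic below converts an accuracy percentage into expected character errors on the 2,000-character nominal page. Only the 99.995% figure and the page size come from this article. The 99% and 99.9% rows are assumed comparison points, and the function names are hypothetical.

```python
def errors_per_page(accuracy_percent, chars_per_page=2000):
    """Expected number of wrong characters on one page."""
    return chars_per_page * (100 - accuracy_percent) / 100


def pages_per_error(accuracy_percent, chars_per_page=2000):
    """Average number of pages between errors (infinite at 100%)."""
    errors = errors_per_page(accuracy_percent, chars_per_page)
    return float("inf") if errors == 0 else 1 / errors


# 99.0 and 99.9 are assumed comparison points; 99.995 is the article's figure.
for accuracy in (99.0, 99.9, 99.995):
    print(f"{accuracy}% accurate: "
          f"{errors_per_page(accuracy):.2f} errors per page")
```

At 99.995% the result is about 0.1 errors per page, or roughly one wrong character every 10 pages. At 99%, which sounds high, the same page would carry about 20 errors, which is why small differences in the accuracy figure translate into large differences in cleanup labor.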

However, no matter how automatic, OCR output still requires human attention and intelligent intervention. As we move documents to XML or PDF, for example, we will take advantage of those new media (formats) to add new value to the documents. Such value-adds didn't exist in the paper documents, so there is no alternative to investing in the labor or programming to create them in the new library. Even given that ultimate limiting factor, OCR offers dramatic cost savings in most conversions of business and academic documents to electronic usage.

The easy way for a novice or a CKO to assess a potential OCR application is to first consider the nature of the source documents. If they are machine-printed, whether from printing presses, laser printers or typewriters, there is a good chance OCR will work 10 to 100 times faster than a human typist at the same or better accuracy. If the source documents are hand-printed (or worse, in cursive handwriting), call in an expert or just hire a lot of typists. Every application is unique, but if you are looking into the feasibility of OCR, technology and techniques have never been more promising.

