Processing unstructured forms on the fly
By Arthur Gingrande
According to Laura Ramos, director of Research at Giga Information Group, unstructured data accounts for nearly 80% of all corporate data on record. That figure covers every data type: e-mail and voice messages, presentations, videos, attachments, paper documents and forms. The overwhelming volume of that data makes processing it for storage and consumption a major priority for most corporations. In the world of forms, the hottest high-volume ICR (intelligent character recognition) application is processing unstructured forms, such as purchase orders and invoices.
The term “unstructured” is a misnomer. Purchase orders and invoices, for example, are consistently structured from one form to the next, at least from the standpoint of the issuing company. The term becomes meaningful only from the perspective of the receiving company, which must process the myriad purchase orders, invoices, shipping documents, medical claims and explanations of benefits (EOBs) that arrive daily in a variety of layouts, differing arbitrarily from one sender to another. Those random differences mean that the forms cannot be processed using the traditional template-based approach, in which one software template matches each and every data field on all the forms in a presorted batch. Instead, the forms must be processed on the fly, using sophisticated algorithms that locate the data on a diverse array of layouts within a given form type. A more accurate term would be “semi-structured” or “loosely structured” forms, labels that are slowly creeping into popular use within the information technology community.
Examples of unstructured forms include invoices, purchase orders, medical claims (HCFAs, UBs), explanation of benefits forms, and shipping documents (bills of lading, customs declarations and so forth). Those categories of high-volume documents would normally require manual data entry to capture the information they contain; they are now being processed automatically by intelligent software algorithms that locate the data fields and recognize them with industrial-strength accuracy.
Processing unstructured documents is nothing new to the field of forms automation. Techniques for processing forms on the fly were pioneered in the mid-1990s by Nestor, Symbus (renamed Captiva), Mitek Systems, Daimler-Benz (changed to OCE), Microsystems Technology and others. Those companies typically used neural networks to analyze and generalize the image morphology of a particular class of unstructured forms such as faxes, invoices and medical claims. Advanced feature extraction techniques helped to facilitate page decomposition, and complex search routines were used to locate specific data fields on variable and problematic forms. The software could locate form data fields even if they had moved from their expected locations by as much as an inch. However, due to the complexity of the algorithms and the expensive, pre-Pentium hardware requirements, attaining acceptable processing speeds at reasonable accuracy levels was costly; hence, during the 1990s, the software failed to achieve mainstream user adoption.
Today, however, given the incredible computing horsepower and memory that reside in the average desktop PC, a variety of techniques, including brute force, can be run in parallel to achieve remarkable results. Blob analysis, edge detection, multiline character segmentation and long-line location can find form objects, columns and data fields based solely on their topology. Sometimes the geometrical and spatial relationships between the text data elements, such as rows or subheadings (rather than graphical objects), are used to locate the places where data most likely will be found. In fact, the process need not involve character recognition at all; the text can be treated as a pattern of blobs.
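To make the idea of treating text as a pattern of blobs concrete, here is a minimal sketch in Python. It is not taken from any vendor's product. It assumes the page has already been binarized into a grid of 0s and 1s, and the function names find_blobs and group_into_rows are hypothetical. It labels connected ink regions, records their bounding boxes, and groups boxes that overlap vertically into rows, all without recognizing a single character.

```python
from collections import deque


def find_blobs(image):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink blobs."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def group_into_rows(boxes):
    """Group blobs whose vertical extents overlap; each row is sorted left to right."""
    rows = []
    for box in sorted(boxes):
        for row in rows:
            if box[0] <= row["bottom"] and box[2] >= row["top"]:
                row["boxes"].append(box)
                row["top"] = min(row["top"], box[0])
                row["bottom"] = max(row["bottom"], box[2])
                break
        else:
            rows.append({"top": box[0], "bottom": box[2], "boxes": [box]})
    return [sorted(row["boxes"], key=lambda b: b[1]) for row in rows]


# A toy "page": two rows of ink, each holding two separate blobs.
PAGE = [
    "110011000",
    "110011000",
    "000000000",
    "111000011",
    "111000011",
]
image = [[int(ch) for ch in line] for line in PAGE]
blobs = find_blobs(image)
text_rows = group_into_rows(blobs)
```

On a real scan the same row and column groupings would be computed on far larger images, and a production system would presumably add noise filtering and skew correction before this step.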
During application setup, the user can define a dynamic template or a map that describes the data zones contained on the form type in question. Some vendors use a set of image-parsing rules written in a VB program or special scripting language to guide the software in searching for and locating data fields. The fields are often detected at the pixel level with fine precision, which allows for granular-level separation of wanted and unwanted data elements—especially desirable when processing densely packed forms like HCFAs and EOBs. Predefined attributes can include the size of one data field relative to the size of another, number of characters, the width of an expected logo relative to the width of certain data columns on the page, etc. After the fields are located, an OCR/ICR engine recognizes the data and validates the results through rules and lookup tables.
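As an illustration only, a dynamic template of this kind might be sketched in Python as follows. The Zone class, its fields and the sample vendor list are assumptions made for this example, not any vendor's actual format. Each zone is expressed relative to an anchor object such as a logo rather than as fixed pixel coordinates, and the recognized text is then checked against a rule and, optionally, a lookup table.

```python
import re
from dataclasses import dataclass


@dataclass
class Zone:
    """A data zone defined relative to an anchor (hypothetical structure)."""
    name: str
    dx: float       # horizontal offset from the anchor's left edge, in anchor widths
    dy: float       # vertical offset from the anchor's top edge, in anchor heights
    width: float    # zone width, in anchor widths
    height: float   # zone height, in anchor heights
    pattern: str    # regular expression the recognized text must match


def resolve_zone(zone, anchor):
    """Convert a relative zone into absolute pixels (left, top, right, bottom)."""
    a_left, a_top, a_right, a_bottom = anchor
    a_w, a_h = a_right - a_left, a_bottom - a_top
    left = a_left + zone.dx * a_w
    top = a_top + zone.dy * a_h
    return (round(left), round(top),
            round(left + zone.width * a_w), round(top + zone.height * a_h))


def validate(zone, text, lookup=None):
    """Apply the zone's rule, then an optional lookup table, to recognized text."""
    if not re.fullmatch(zone.pattern, text):
        return False
    return lookup is None or text in lookup


# The logo was found at these pixel coordinates on this particular page.
logo_box = (100, 50, 300, 150)
invoice_no = Zone("invoice_no", dx=2.0, dy=0.0, width=1.0, height=0.5, pattern=r"\d{6}")
vendor = Zone("vendor", dx=0.0, dy=1.5, width=2.0, height=0.5, pattern=r"[A-Za-z ]+")
known_vendors = {"Acme Supply"}  # illustrative lookup table

invoice_no_box = resolve_zone(invoice_no, logo_box)
```

Because the zone moves and scales with the anchor, the same template keeps working when a sender shifts or resizes the layout, which is the property that separates a dynamic template from a fixed one.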
Alternatively, some vendors utilize powerful OCR/ICR-based, literal search procedures coupled with state-of-the-art form removal technology to automatically remove the form, locate the data fields, then extract and massage the required data according to user-defined rules. Users employ a special dialog box to construct routines and edit, map and present the results. They base the routines on predefined search parameters, frequently displayed in a flowchart format. Results are converted into a variety of graphical displays, including pie charts, bar charts, X-Y coordinate graphs, etc. OCR confidence values can control data validation rules. Most unstructured document processing software products are hybrids: It is not uncommon for vendors to apply a variety of rules, character recognition engines and morphology generalization techniques, all in parallel with one another, to optimize performance.
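A literal search driven by OCR output can be sketched in a few lines of Python. Everything here is hypothetical: the Token structure, the 0.90 acceptance threshold and the sample values are assumptions chosen for illustration. The sketch finds a keyword label, takes the nearest token to its right on the same line as the field value, and uses the OCR confidence value to decide whether the result is accepted or sent for manual review.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One recognized word with its position and OCR confidence (hypothetical)."""
    text: str
    x: int
    y: int
    conf: float


def find_value(tokens, label, line_tolerance=5):
    """Return the nearest token to the right of `label` on the same text line."""
    anchors = [t for t in tokens if t.text.lower() == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    candidates = [
        t for t in tokens
        if t.x > anchor.x and abs(t.y - anchor.y) <= line_tolerance
    ]
    return min(candidates, key=lambda t: t.x, default=None)


def route(token, accept_at=0.90):
    """Let the OCR confidence value control what happens to the field."""
    if token is None:
        return "exception"
    return "accept" if token.conf >= accept_at else "review"


tokens = [
    Token("Date:", 400, 100, 0.98),
    Token("O3/l4/2003", 470, 101, 0.61),   # a poorly recognized date
    Token("Total:", 400, 700, 0.99),
    Token("1,250.00", 480, 702, 0.95),
]

total = find_value(tokens, "Total:")
date = find_value(tokens, "Date:")
```

Here route(total) accepts the amount, route(date) sends the low-confidence date to review, and a label that never appears on the page is flagged as an exception.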
Most vendors support their unstructured document processing engines with dynamic template libraries that accelerate the process of identifying forms and locating their data elements. The user designs a series of generalized form templates that recognize form types by customer, organized by company and document layout. Dynamic form templates do not define data element regions by exact pixel location the way conventional templates do; rather, they are set up by form region, using rules based on topological structures or text elements found in those form regions. A predefined, dynamic template library can usually handle its batch-processing assignment. But if the software detects an exception, the variations between that form and the nearest matching template are noted and stored as either a new or a variant template.
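The library lookup and the exception handling can be sketched as follows. This is a simplified Python illustration under stated assumptions: each template is reduced to a set of "element:region" features, similarity is measured with the Jaccard index, and the two thresholds are arbitrary. Real products presumably use richer descriptions and their own matching logic.

```python
def jaccard(a, b):
    """Share of features two forms have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(form_features, library, match_at=0.8, variant_at=0.5):
    """Match a form to the nearest template, or store it as a variant or new one."""
    best_name, best_score = None, 0.0
    for name, features in library.items():
        score = jaccard(form_features, features)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= match_at:
        return ("match", best_name)
    if best_score >= variant_at:
        # Close to a known layout: keep it as a variant of the nearest template.
        new_name = f"{best_name}-variant-{len(library)}"
        library[new_name] = set(form_features)
        return ("variant", new_name)
    # Nothing similar enough: store the layout as a brand-new template.
    new_name = f"template-{len(library)}"
    library[new_name] = set(form_features)
    return ("new", new_name)


library = {
    "acme_invoice": {
        "logo:top-left", "invoice_no:top-right",
        "table:middle", "total:bottom-right",
    },
}

same_layout = {"logo:top-left", "invoice_no:top-right",
               "table:middle", "total:bottom-right"}
moved_number = {"logo:top-left", "invoice_no:top-left",
                "table:middle", "total:bottom-right"}
unrelated = {"ship_to:top-left", "carrier:top-right"}

results = [classify(f, library) for f in (same_layout, moved_number, unrelated)]
```

In this run the first form matches the existing template, the second is stored as a variant because its invoice number has moved, and the third is stored as a new template, so the library grows as exceptions are encountered.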
A number of vendors make products with the capability of processing unstructured documents. To date, no benchmark tests of recognition accuracy have been published, so it is impossible to quote accuracy rates. Hence, if you are a potential customer, there is only one way to determine what product to use: Have the vendor process a few batches of your own invoices, medical claims or other unstructured documents that you regularly process and see how well the software performs.
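When running such a trial, it helps to score every vendor the same way. The short Python sketch below shows one possible measure, field-level accuracy against values that your own staff have keyed and verified. The function name, the data layout and the sample values are assumptions made for illustration.

```python
def field_accuracy(truth, extracted):
    """Fraction of ground-truth fields that the software extracted exactly."""
    total = correct = 0
    for doc_id, fields in truth.items():
        got = extracted.get(doc_id, {})
        for name, value in fields.items():
            total += 1
            # A missing field counts as an error, the same as a wrong value.
            if got.get(name, "").strip() == value.strip():
                correct += 1
    return correct / total if total else 0.0


# Values keyed and verified by hand for a small sample batch.
truth = {
    "inv-001": {"invoice_no": "004512", "total": "1,250.00"},
    "inv-002": {"invoice_no": "004513", "total": "89.90"},
}
# What a candidate product returned for the same batch.
extracted = {
    "inv-001": {"invoice_no": "004512", "total": "1,250.00"},
    "inv-002": {"invoice_no": "004518", "total": "89.90"},
}

score = field_accuracy(truth, extracted)
```

Scoring each candidate product on the same batch with the same function gives a like-for-like comparison, which published figures cannot currently provide.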
Unstructured document processing technology is just starting to vie for mainstream user adoption. The most popular applications include:
Invoices--Every major company processes mountains of invoices every day, not to mention their sister form, the purchase order. Yet their arbitrary formats have traditionally made it necessary to enter the data they contain manually. Nowadays, every major forms processing vendor offers an invoice solution that can also handle purchase orders.
Medical claims--HCFAs, UBs and dental claim forms are composed of inconsistently placed, densely packed data fields, and each form has its own peculiar array of lines, boxes and fine-print instructions. The forms are available in red, green and black, and each color presents its own set of imaging problems. Add in the degraded print on claims filled out with faded ribbons on the old dot matrix impact printers chronically used by doctors, and processing medical claims becomes the most challenging task in forms automation.
Explanation of benefits forms--Derived from medical claims, EOBs are hard to recognize for much the same reasons as HCFAs. The overabundance of variable-width columns and tightly packed data fields makes EOBs particularly difficult for conventional recognition systems to read.
Transportation documents--Those include shipping documents such as bills of lading and customs declarations. Like invoices, the forms are processed by parties that did not originate them, so their layouts appear arbitrary to the processor. Moreover, the use of carbon copies and the beating that these documents take en route to their destinations compound the recognition problems involved.
Vendors and Products
Vendor | Product | Applications
Captiva (captivasoftware.com) | InvoicePack, ClaimPack | Invoices, medical claims
Cardiff (cardiff.com) | MEDIclaim Option | Medical claims
Ceresoft (ceresoft.com) | InvoiceAgent, EOBAgent | Invoices, EOBs
dakota imaging (dakotaimaging.com) | HealthClaim Expert | Medical claims
Datacap (datacap.com) | BDOcs, HCcs | Invoices, medical claims
Microsystems Technology (microsystemsonline.com) | AnyForm for Invoices, Quick Pass, Quick Assist | Invoices, EOBs, medical claims
OCE (cgk.de/english) | DOKuStar | Invoices, shipping docs
Open Scan Technologies (openscantech.com) | Dynamic Template | Invoices, payment docs
ReadSoft (readsoft.com) | Eyes & Hands Invoices | Invoices
Recognition Research Inc. (rrinc.com) | FormWorks for Health Insurance | Medical claims
SER Solutions (ser.com) | SERdistiller | Invoices, medical claims, payment docs
Top Image Systems (topimagesystems.com) | eFlow Freedom, AFPSPRO | Invoices, medical claims
Arthur Gingrande is a partner with IMERGE Consulting (imergeconsult.com), 781-258-8181 or e-mail email@example.com.