The key to the future: Intelligent document recognition
Printing paper and moving paper are both expensive and time-consuming. Companies that need to cut those costs and manage processes more effectively are simply converting documents that used to be sent by mail into faxes or e-mailed PDFs. From the recipient's standpoint, the incoming documents are all unstructured or semi-structured. Whether they are image-based or data (for example, PDF normal), the documents might look similar (all invoices), but their data elements carry no understandable metatags. Therefore, each document must be looked at and interpreted into a common format that the IT backend procedures can understand. And that is an expensive process.
Capturing data from images using traditional forms processing depends on knowing the specific form layout, so that you can build a template that defines which fields to capture and where to find them, the rules to apply to each field and any cross-field validations. The template also defines the associated output metadata for the fields. Output is usually comma-delimited or XML-tagged, although in some cases it can be electronic data interchange (EDI). The approach works well only when the layout of the forms is the same or when clear identifiers define the format. That has confined forms processing to turnaround documents or regulated forms such as tax returns, credit card applications and medical claims.
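As a rough illustration of the template approach described above, the following Python sketch defines a template for one known layout, applies a rule to each field, runs one cross-field validation and emits comma-delimited output. Everything in it is hypothetical: the template name, the zone coordinates, the field rules and the `read_zone` callback (which stands in for an OCR engine reading a region of the image) are assumptions made for the example, not the design of any particular product.

```python
import csv
import io
import re

# Hypothetical template for one known form layout. Each field has a zone
# (left, top, right, bottom) and a per-field rule; all values are illustrative.
INVOICE_TEMPLATE = {
    "form_id": "EXAMPLE-INV-01",
    "fields": {
        "invoice_no": {"zone": (400, 40, 560, 60), "rule": r"\d{6}"},
        "net": {"zone": (400, 500, 560, 520), "rule": r"\d+\.\d{2}"},
        "tax": {"zone": (400, 525, 560, 545), "rule": r"\d+\.\d{2}"},
        "total": {"zone": (400, 550, 560, 570), "rule": r"\d+\.\d{2}"},
    },
}


def capture(template, read_zone):
    """Read each templated zone, apply field rules, then a cross-field check."""
    record, errors = {}, []
    for name, spec in template["fields"].items():
        text = (read_zone(spec["zone"]) or "").strip()
        if not re.fullmatch(spec["rule"], text):
            errors.append(f"{name}: {text!r} fails field rule")
        record[name] = text
    # Cross-field validation (specific to this example template):
    # net + tax must equal total to within half a cent.
    if not errors:
        net, tax, total = (float(record[k]) for k in ("net", "tax", "total"))
        if abs(net + tax - total) > 0.005:
            errors.append("total does not equal net + tax")
    return record, errors


def to_csv(template, record):
    """Emit the record as one comma-delimited line, in template field order."""
    buf = io.StringIO()
    csv.writer(buf).writerow([record[name] for name in template["fields"]])
    return buf.getvalue().strip()


# Stand-in for OCR results: text found in each zone of one scanned form.
FAKE_OCR = {
    (400, 40, 560, 60): "104522",
    (400, 500, 560, 520): "100.00",
    (400, 525, 560, 545): "8.25",
    (400, 550, 560, 570): "108.25",
}

record, errors = capture(INVOICE_TEMPLATE, FAKE_OCR.get)
print(errors or to_csv(INVOICE_TEMPLATE, record))
```

The sketch also shows the limitation: if a supplier moves the total to a different place on the page, the zones no longer line up and the template has to be rebuilt.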
Capturing images for indexing using batch capture software requires the manual insertion of coded batch and document separator sheets between documents; those separators supply the index metadata used for automated retrieval. Release scripts then format the images and data for the backend document management (DM) or enterprise content management (ECM) solution.
Capturing data from unstructured, unknown layouts can use search engines, which hunt through unstructured text to identify and extract contextually relevant documents and phrases. However, creating understandable metadata for output into business processes requires business-specific rules, which means that the software must understand what the document is.
New intelligent document recognition (IDR) technologies, originally developed for invoice processing and the electronic mailroom, use techniques from each of the above areas and eliminate their limitations. It is no longer necessary to know what the form layout looks like. It is no longer necessary to insert separators. It is no longer necessary to presort. Specific rules can make the data understandable. IDR has the ability to figure out what the document category is and apply the appropriate business rules.
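The two-step idea in this paragraph (work out the document category first, then apply that category's business rules) can be sketched in a few lines of Python. This is a deliberately naive illustration that classifies recognized text by counting cue words; commercial IDR products rely on trained classifiers and layout knowledge rather than a keyword list. The category names, cue words and regular expressions below are all assumptions made for the example.

```python
import re

# Hypothetical category profiles: cue words used for classification and
# per-category extraction rules. All of these values are illustrative.
CATEGORIES = {
    "invoice": {
        "cues": ["invoice", "amount due", "remit"],
        "rules": {
            "invoice_no": r"invoice\s*(?:no\.?|number|#)\s*:?\s*(\w+)",
            "amount_due": r"amount due\s*:?\s*\$?([\d,]+\.\d{2})",
        },
    },
    "purchase_order": {
        "cues": ["purchase order", "ship to", "deliver by"],
        "rules": {
            "po_no": r"(?:po|purchase order)\s*(?:no\.?|number|#)\s*:?\s*(\w+)",
        },
    },
}


def classify(text):
    """Pick the category whose cue words occur most often, or None if none do."""
    lowered = text.lower()
    scores = {
        name: sum(lowered.count(cue) for cue in profile["cues"])
        for name, profile in CATEGORIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract(text):
    """Classify the document, then apply only that category's rules."""
    category = classify(text)
    if category is None:
        return None, {}
    fields = {}
    for name, pattern in CATEGORIES[category]["rules"].items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            fields[name] = match.group(1)
    return category, fields


sample = "ACME Supply\nInvoice No: 77812\nPlease remit payment.\nAmount Due: $1,250.00\n"
print(extract(sample))
```

No template, separator sheet or presorting is involved here: the same `extract` call handles an invoice or a purchase order, because the rules are chosen after the content has been examined.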
IDR, which is also called intelligent data capture (see Robert Smallwood's article on page 10 of the July/August KMWorld, Vol. 15, No. 7, also available online), works a lot more like a human does. It relies on training and an internal knowledge of the layout and content of generic form types, which it uses to understand and extract the required information and to initiate workflows. That widens the types of forms that can be captured and reduces costs, but IDR also substantially changes capture, turning it into a series of tools that can interpret and extract data from all sorts of unstructured information.
The information can be input as scanned paper or as document-formatted information, whether it is data-centric, such as Word or PDF normal, or image-based. Typically IDR combines multiple methods, including pattern recognition, OCR, other recognition engines and search engines, to locate and extract the required information before applying business rules to it. That was the reason DICOM/Kofax bought Mohomine and LCI and Captiva bought SWT, and it was one of the reasons behind Verity's purchase of Cardiff.
IDR capture provides the ability to make sense of and help manage the unstructured, untagged information that is coming into the corporation or organization. It can provide the front-end understanding needed to feed business process management (BPM) and business intelligence (BI) applications, as well as traditional accounting and document or records management systems.
Big companies in the BPM and BI space are beginning to realize the importance of this, and the big news in 2005 was EMC's acquisition of Captiva--a major supplier of capture solutions. The purchase caused many people to start to look at capture in a new light. EMC stated that the acquisition was to improve document life cycle management. Although that is a term not well understood outside the storage industry, EMC clearly saw much more to it than just capturing documents. Capture is evolving into a critical business systems need that improves core business processes and competitiveness through its development of business rules-based document understanding.
According to our research, the capture software market grew by 18 percent between 2004 and 2005 to $1.1 billion at user cost, up from growth of 14 percent in the previous period. We break that down into four sub-segments, which we call:
- ad hoc and desktop scanning--used by office workers who want to convert paper documents into usable electronic documents on which they can work or collaborate. The devices used are slow-speed scanners or networked office digital copiers (multifunction machines).
- batch and distributed batch scanning--used to get documents into a centralized document repository or used to classify and route them to a centralized point as quickly as possible.
- full-text capture OCR--converts textual documents, such as scanned magazine articles, into ASCII data that can be edited or managed or used to find documents.
- transaction capture and process management--previously forms processing. Similar to batch and distributed batch capture, but the output is data-centric and used to provide data for use in a business process.
Those sub-segments each showed some interesting trends. The first three grew at more than 20 percent, driven by a number of key issues that are coming together to cause market stress:
- Increased business velocity, which requires moving documents electronically instead of via the mail or couriers. Incoming paper needs to be converted to image as close to the point of entry into the corporation as possible or, preferably, by the sender using fax servers or e-mailed PDFs. That is causing a move from centralized scanning to distributed scanning.
- The business need to reduce costs. The expense of handling, moving and storing paper can be reduced or eliminated through converting it to an image in combination with small DM systems, while office collaboration tools such as Microsoft's SharePoint can improve efficiency.
- The need to optimize business equipment usage. Freestanding copiers are often underutilized. By networking them, corporations can use them for printing, faxing and scanning, as well as copying. Businesses are looking at other alternatives such as distributed scanners and printers, and copier manufacturers are making a push into the capture markets.
- The availability of lower-cost (sub-$1,000), easier-to-use (single-button), quick (25 ppm or more) distributed duplex desktop scanners.
- Full-text search at the desktop, which is increasingly used to locate and manage documents in the small office.
- Images being standardized and accepted just like data. In particular, PDF and the Internet have caused users to stop differentiating between images and data. The hybrid, searchable image, combined with improved optical character recognition (OCR), has helped with that by blurring the line.
The transaction capture and process management sector is currently the largest segment, accounting for 34 percent of the overall market, but it grew the slowest, at 7 percent, and its share dropped from 38 percent of the total. That might seem counterintuitive, because those forms processing-inspired technologies are the key to future growth and feed into knowledge management (KM) systems. Right now, however, the bulk of the segment still consists of fixed template-oriented sales. If the forms are known, that proven forms processing technology offers some major cost reductions over in-house data entry or even offshore processing, which is increasing in cost.
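The share figures in this paragraph can be cross-checked against the 18 percent overall market growth quoted earlier. The short Python calculation below is only a back-of-the-envelope check using the rounded percentages as published; the function name is made up for the example, and the comparison assumes that the 34 and 38 percent shares refer to 2005 and 2004 respectively.

```python
def implied_segment_growth(market_growth, share_before, share_after):
    """Segment growth implied by overall market growth and a change in share."""
    return share_after * (1 + market_growth) / share_before - 1


# Rounded figures from the article: 18 percent market growth, with the
# transaction segment's share moving from 38 percent to 34 percent.
transaction = implied_segment_growth(0.18, 0.38, 0.34)

# The other three sub-segments combined hold the remaining share.
others = implied_segment_growth(0.18, 0.62, 0.66)

print(f"transaction capture: {transaction:.1%}")
print(f"other three combined: {others:.1%}")
```

On these rounded inputs the transaction segment works out to growth of roughly 6 percent, in the same range as the 7 percent reported (the gap is presumably down to rounding in the published shares, which is an assumption), while the other three sub-segments together come out at about 26 percent, consistent with each of them growing at more than 20 percent.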