Data capture: extending the net to new capabilities

Paper-based forms still pervade the business world, despite the growth of online data capture. Account applications, invoices, tax filings and a multitude of other forms remain largely paper-based. Major enterprises such as insurance companies and financial institutions have made significant commitments to dealing with this paper through automated data capture, scanning forms and feeding the data into back-end systems. But U.S. companies, particularly small to midsize organizations, still spend an estimated $15 billion each year keying in data.

Unfortunately, many companies do not realize that the cost of implementing automated data capture in place of manual keying is relatively modest, generally much less than the investment they have already made in their back-end systems. Labor savings over keying from paper or image can be considerable, producing a solid ROI in a short time. In addition, new capabilities in data capture, such as XML output and data capture from unstructured forms, are making imaging systems more versatile.

The ability to handle eXtensible Markup Language (XML) is a feature that most data capture companies have been adding to their repertoire. XML is becoming the de facto common language that can cross the lines between different enterprises, supporting e-commerce and other e-business initiatives. InputAccel from ActionPoint, for example, is geared for mid- to high-end applications and can convert scanned data to XML for delivery to back-end systems. After optical character recognition (OCR), the data is converted to the designated output format and directed by the InputAccel server to the appropriate repository. Data from a form can be split up so that one field is indexed and stored in a document management system, while another, such as order status information, is converted to XML for ready deployment on the Web.
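The field-splitting idea can be illustrated with a minimal Python sketch. It is not InputAccel code; the function name, the field names and the `<form>` element are all hypothetical, chosen only to show how some recognized fields might go to an index record while others are serialized as XML.

```python
import xml.etree.ElementTree as ET


def split_captured_fields(fields, index_keys, xml_keys):
    """Split one form's recognized fields into an index record and an XML string.

    Hypothetical helper for illustration; not part of any vendor product.
    """
    # Fields destined for the document management system's index.
    index_record = {key: fields[key] for key in index_keys if key in fields}

    # Fields destined for the Web, serialized as a small XML document.
    root = ET.Element("form")
    for key in xml_keys:
        if key in fields:
            ET.SubElement(root, key).text = str(fields[key])
    return index_record, ET.tostring(root, encoding="unicode")


# Example data standing in for OCR output from a shipping record (invented values).
captured = {
    "order_id": "A-1001",
    "customer": "Acme Freight",
    "order_status": "shipped",
}
index_record, status_xml = split_captured_fields(
    captured,
    index_keys=["order_id", "customer"],
    xml_keys=["order_id", "order_status"],
)
print(index_record)
print(status_xml)
```

In a real deployment the XML structure would follow whatever schema the receiving system expects, rather than the flat layout assumed here.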

“One of our customers is converting scanned information from its shipping records into XML so that its freight customers can view it online,” says Alan Abrina, product manager for InputAccel. The application illustrates the value of XML in extending accessibility of data to users outside the enterprise. Abrina also cites InputAccel’s open architecture as a strong asset. “Our customers can choose technology from among our offerings or integrate third-party software that complements our suite,” he says. “InputAccel lets users adapt to changing technology as well as to changing requirements.”

XML can also be captured at the front end. Ascent Capture from Kofax can import XML, printstream and other input, and convert it to an image. The image is then put through the OCR process and output in the desired format.

“The value of this approach is that the same system can be used for capturing electronic and paper-based data,” says Tom Rossi, senior marketing manager at Kofax, “which saves time and money. Anything that can talk to XML, such as enterprise resource planning (ERP), can also talk to Ascent Capture.” Ascent Capture exports to 50 different formats (including XML); data can be sent to FileNet (filenet.com), Documentum (documentum.com), and IBM’s (ibm.com) Content Manager, as well as to repositories from many smaller software firms. Ascent Capture is also tightly integrated with Microsoft’s (microsoft.com) newly introduced SharePoint.

One of Kofax’s strengths is an improved scanning method called virtual rescan (VRS), which was shown by Doculabs (doculabs.com) to produce a 35% increase in character and field recognition accuracy. The process evaluates the scanned images in real time and adjusts for brightness, contrast and other characteristics. VRS provides images that are easier to read, which also facilitates keying from image, and improves recognition significantly during the OCR process. Because the image quality is higher, users rarely need to rescan images, a time-consuming and expensive process.
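To give a feel for what adjusting contrast on a scanned image means, here is a minimal Python sketch of a generic linear contrast stretch on a grayscale image. It is not the VRS algorithm, whose internals are not described here; the function name and the tiny sample image are assumptions made for illustration.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values (0-255) so they span the full range.

    Generic textbook technique, shown only as an illustration of contrast
    adjustment; it does not represent any vendor's method.
    """
    flat = [p for row in pixels for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A perfectly flat image has no contrast to stretch; return a copy.
        return [row[:] for row in pixels]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in pixels]


# A faded 2x2 "scan" whose values occupy only part of the 0-255 range.
faded = [[100, 120], [140, 160]]
enhanced = stretch_contrast(faded)
print(enhanced)  # [[0, 85], [170, 255]]
```

After the stretch, the darkest pixel maps to 0 and the lightest to 255, which widens the gap between ink and background.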

Datacap converts all intermediate data from OCR and intelligent character recognition (ICR) into XML format in its Task Master 5.0 three-tier (client, server and browser) capture environment. The data remains in XML while operations such as verification and corrections are performed. It can then be retained in XML or readily converted to some other format.

“XML data is Internet-friendly,” says Scott Blau, CEO of Datacap, adding that it can also be delivered to legacy systems. By using XML, Datacap reduces the amount of information sent between browser and server, while making it easier to map the data to legacy systems through an XSLT style sheet.
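The mapping step can be sketched in a few lines of Python. Because the Python standard library has no XSLT processor, this sketch swaps in a plain field map in place of a style sheet; the element names, the field widths and the fixed-width record layout are hypothetical and do not describe Datacap's product.

```python
import xml.etree.ElementTree as ET

# Hypothetical legacy layout: (XML element name, fixed column width).
FIELD_MAP = [("invoice_number", 10), ("vendor", 20), ("total", 12)]


def xml_to_fixed_width(xml_text, field_map=FIELD_MAP):
    """Map captured XML fields onto a fixed-width record for a legacy system."""
    root = ET.fromstring(xml_text)
    parts = []
    for tag, width in field_map:
        value = root.findtext(tag, default="")
        # Truncate or pad so that every field occupies exactly `width` columns.
        parts.append(value[:width].ljust(width))
    return "".join(parts)


captured_xml = (
    "<invoice>"
    "<invoice_number>INV-42</invoice_number>"
    "<vendor>Acme</vendor>"
    "<total>199.50</total>"
    "</invoice>"
)
record = xml_to_fixed_width(captured_xml)
print(repr(record))
```

The point of keeping the mapping in a table (or, in practice, a style sheet) is that a change in the legacy layout means editing the map, not the capture logic.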

Some vendors report that requests for XML conversion are more common from new companies than from established ones. However, older companies that have grown through acquisitions and mergers find that XML is one way to integrate data across multiple storage systems without major data conversions. Although developing the right schema is non-trivial, it is easier than trying to physically integrate data that resides in disparate systems.

Unstructured form market

Data capture from structured forms is a well-established and mature technology. On those forms, the data appears in predictable locations. Some imaging companies are taking on the challenge of capturing data from unstructured forms, a more difficult task. Those documents include invoices and other forms that have standard content but no standard format. They are referred to as unstructured or semi-structured forms. The capture process is sometimes called free form recognition.

“The absolute hottest issue in data capture is unstructured forms processing,” maintains Jerry Evans, product manager for Microsystems Technology. Microsystems developed its AnyForm product as an integrated solution within its OCR for Forms software. The first application was for invoice processing, but more applications have emerged that fit in the same footprint. The AnyForm software allows users to specify the data items they want from a checklist on screen. An underlying rule base allows the system to find the desired items. Only a small subset of information contained on the form is processed, but the form is imaged and indexed so that all the information is retained. The image, index and data are all linked so that at the back end, they can all be pulled up together.
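A rule base of this kind can be pictured with a small Python sketch. It is an assumption about the general approach, not AnyForm's actual rules: each wanted item gets a pattern anchored on a label that tends to appear near the value, so the same rule works wherever the item sits on the page. The rule names, patterns and sample invoices are invented for illustration.

```python
import re

# Hypothetical rule base: one label-anchored pattern per data item.
RULES = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"(?:total|amount)\s+due\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(ocr_text, wanted):
    """Return the requested items found in OCR text, or None where a rule fails."""
    found = {}
    for name in wanted:
        match = RULES[name].search(ocr_text)
        found[name] = match.group(1) if match else None
    return found


# Two invoices with the same content in different positions and wording.
layout_a = "ACME SUPPLY\nInvoice No: 7731-B\nTotal Due: $1,250.00\n"
layout_b = "Amount due 980.10      Globex Ltd\nINVOICE # GX-204\n"

fields_a = extract_fields(layout_a, ["invoice_number", "total"])
fields_b = extract_fields(layout_b, ["invoice_number", "total"])
print(fields_a)
print(fields_b)
```

Only the checked items are extracted; everything else on the page is ignored by the rules, which mirrors the idea of processing a small subset of the form while keeping the full image.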

Previously, the only options for dealing with unstructured forms were to key in the data or create a template. “One company had been adding templates for years to accommodate various forms,” says Evans. “When we installed AnyForm, the need for this process was eliminated in a single day.”

Built-in intelligence also allows Microsystems’ data capture products (for both structured and unstructured forms) to perform such operations as double checking the math on an order form and verifying addresses from U.S. Postal Service databases. Once the data is captured, business rules can be used to route forms to the appropriate individual. For example, if an invoice amount is greater than a specified level, it can be routed differently than a smaller invoice.
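Two of those checks, verifying the arithmetic and routing by amount, are simple enough to sketch in Python. The function names, the queue names and the 5,000.00 threshold are hypothetical; a real system would take its rules from the customer's own business policy.

```python
from decimal import Decimal


def check_math(line_items, stated_total):
    """Return True if quantity x unit price, summed, equals the stated total.

    line_items is a list of (quantity, unit_price) pairs; prices are passed
    as strings so that Decimal keeps exact cents.
    """
    computed = sum(
        (Decimal(qty) * Decimal(price) for qty, price in line_items),
        Decimal("0"),
    )
    return computed == Decimal(stated_total)


def route(invoice_total, threshold="5000.00"):
    """Send large invoices to a different (hypothetical) queue than small ones."""
    if Decimal(invoice_total) > Decimal(threshold):
        return "manager_review"
    return "standard_queue"


items = [(3, "19.99"), (1, "40.03")]
print(check_math(items, "100.00"))  # True
print(check_math(items, "110.00"))  # False
print(route("7200.00"))             # manager_review
print(route("350.00"))              # standard_queue
```

Using `Decimal` instead of floating point avoids rounding surprises when totals are compared to the cent.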

Captiva Software, another strong contender in the unstructured forms arena, first began exploring that market in response to inquiries from companies that process property tax bills for mortgage companies. Although the forms, which are sent by local governments, contain standard information, they are not in a standard format. After developing and implementing the Free Form technology for that application, Captiva opted to focus on one specific application: invoice processing.

“Accounts payable is a very large market that cuts across many vertical industries,” says Captiva CEO Reynolds Bish, “including manufacturing, retail and other segments with large volumes of paper invoices.” Based on the number of firms using such forms, Bish believes that the market for data capture from unstructured forms is considerably larger than that for structured forms.

Free Form uses neural network technology to identify and capture data from unstructured forms. When a data field cannot be located, the system presents the form to the operator, who simply keys in the value instead of marking the field’s location on screen. From those keyed entries, the system learns by example where the data sits on the form. That strategy lets operators (who are already handling corrections) maintain their keystroke pace, and it automates the system’s learning process.
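The learn-by-keying loop can be sketched in Python under some loud assumptions. This is not Captiva's neural network; it substitutes a plain lookup table, and it assumes the system finds the field's position by matching the operator's keyed value against the OCR words on the page. The class, function and layout names are hypothetical.

```python
class FieldLocationMemory:
    """Remembers where each field was found on each known form layout."""

    def __init__(self):
        self._locations = {}

    def locate(self, layout_id, field):
        """Return the remembered bounding box, or None if not yet learned."""
        return self._locations.get((layout_id, field))

    def learn(self, layout_id, field, box):
        self._locations[(layout_id, field)] = box


def find_keyed_value(ocr_words, keyed_value):
    """Find the bounding box of the OCR word that matches what was keyed."""
    for text, box in ocr_words:
        if text == keyed_value:
            return box
    return None


memory = FieldLocationMemory()

# OCR output as (text, (x, y, width, height)) pairs; values are invented.
page = [("Invoice", (10, 10, 60, 12)), ("7731-B", (80, 10, 50, 12))]

# First encounter: the location is unknown, so the operator keys the value.
first_lookup = memory.locate("acme_layout", "invoice_number")
keyed = "7731-B"
box = find_keyed_value(page, keyed)
if box is not None:
    memory.learn("acme_layout", "invoice_number", box)

# Next form with the same layout: the location is now known.
second_lookup = memory.locate("acme_layout", "invoice_number")
print(first_lookup, second_lookup)
```

The operator never marks a region on screen; the position is inferred from the keystrokes already being typed, which is the behavior the paragraph above describes.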

Judith Lamont is a research analyst with Zentek Corporation, e-mail jlamont@sprintmail.com.
