Government: leveraging resources for greater effectiveness
The volumes are posted on a Web site and are used by policymakers, scholars and the general public. Because most of the users are carrying out some form of research, the ability to search the documents is a high priority. Program managers for the FRUS series found that presenting the documents in PDF form, although easy to post online, did not provide a sufficiently powerful search function. Alternatively, having the volumes posted as a series of separate documents helped visitors find the right material, but the manual uploading process for the thousands of documents in each volume was very time-consuming.
After considering several options, the department decided to convert the documents to XML and use a native XML server to host them on the Web site, according to a State Department official involved in the project. The evaluation required extensive in-house learning, as the department looked into all aspects of the switch to XML. eXist-db, a free, open-source native XML database, was selected to store the content. It uses index-based XQuery for search functionality. An entity extraction-enabled product from Mark Logic allows identification of names, dates and places to enrich the documents and support more sophisticated in-house research than is exposed on the public Web site.
Many standard forms of XML are available, but the one that seemed to be the best fit was the one from the Text Encoding Initiative (TEI), a nonprofit consortium that includes libraries, museums, publishers and individual scholars. Using an established standard avoided the need to develop a customized schema that would have to be modified as new topics and their associated metadata were incorporated. An important interim step was the conversion of the department's back catalog of existing documents into XML format. The department outsourced that task, providing the contractor with a very detailed set of guidelines about how the metadata should be defined and marked up.
Advantages of the conversion affected both providers and users of the data. An entire volume is stored in a single XML file, which can be uploaded to the server by drag and drop. That method represents a dramatic improvement over the previous Web publishing process, reducing the time required from weeks down to minutes. Text is now automatically indexed and is immediately available to the search engine.
Enriched by entity extraction with XML metadata that is verified by researchers for accuracy, the text now allows for new ways of reading the documents and carrying out research. The combination of XML tagging and an improved search strategy has greatly enhanced the ability of users to zero in on the information they are seeking. The entity extraction feature enables enhancements such as presenting glossary terms or biographical information in a resource sidebar, events in a dynamic timeline and locations on a Google map as the reader views each document.
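To make the sidebar idea concrete, here is a minimal Python sketch of how tagged entities might be gathered from one marked-up document. It is an illustration only, not the department's actual pipeline. The element names persName, placeName and date are borrowed from TEI, but the sample fragment, its names and the helper collect_entities are invented for the example, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified TEI-style fragment with invented names and dates.
# Real TEI documents declare a namespace and carry much richer markup.
SAMPLE = """<div type="document" n="12">
  <head>Memorandum of Conversation</head>
  <p><persName>Jane Doe</persName> met <persName>John Roe</persName>
  in <placeName>Paris</placeName> on <date when="1950-03-01">March 1, 1950</date>.
  <persName>Jane Doe</persName> reported the results.</p>
</div>"""


def collect_entities(xml_text):
    """Return the distinct tagged people, places and dates in one document."""
    root = ET.fromstring(xml_text)
    found = defaultdict(list)
    for tag in ("persName", "placeName", "date"):
        for el in root.iter(tag):
            # Collapse internal whitespace in the element's text content.
            text = " ".join("".join(el.itertext()).split())
            # For dates, prefer the machine-readable 'when' attribute if present.
            value = el.get("when", text) if tag == "date" else text
            if value not in found[tag]:
                found[tag].append(value)
    return dict(found)


if __name__ == "__main__":
    for kind, values in collect_entities(SAMPLE).items():
        print(kind, values)
```

A display layer could then feed the people to a biographical sidebar, the dates to a timeline and the places to a map, as described above.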
The availability of better tools and technology for XML helped the department make the decision to go with the XML model and will offer additional advantages in the future. For example, XQuery was approved as a standard by the World Wide Web Consortium (W3C) in 2007. Now that Microsoft Office 2007 uses XML as its native file storage format, people can create XML documents without knowing XML. Documents can be written in MS Word and saved to SharePoint, and then they become immediately searchable and reusable at a granular level by XML servers such as Mark Logic Server.
XML is a key enabler in government applications, according to John Kreisa, director of industry solutions at Mark Logic. "XML applications can be accessed at speeds and scales that were not possible before," says Kreisa. "Relational databases are not designed for semi-structured textual information. With conversion to XML, customers see improvement of one to three orders of magnitude."
With the proper markup, the right section—not just a document—can be presented to the user. In addition, there are numerous ways to mark up the content to identify entities for extraction, which enriches the content and allows such functions as faceted navigation.
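The difference between returning a whole document and returning the right section can be sketched in a few lines of Python. This is a rough illustration using the standard library rather than a native XML server. The volume structure, the sample text and the function find_sections are assumptions made for the example, and a production system would use an indexed XQuery search instead of scanning every section.

```python
import xml.etree.ElementTree as ET

# Hypothetical volume: one XML file holding several marked-up sections.
VOLUME = """<volume>
  <div n="1"><head>Telegram on trade talks</head><p>Discussion of tariffs and quotas.</p></div>
  <div n="2"><head>Memorandum on aid</head><p>Proposal for economic assistance.</p></div>
  <div n="3"><head>Note on tariffs</head><p>Follow-up on tariff schedules.</p></div>
</volume>"""


def find_sections(xml_text, term):
    """Return (section number, heading) for each section that mentions the term."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    hits = []
    for div in root.iter("div"):
        # Flatten the section's text, including its heading, for a simple match.
        text = " ".join("".join(div.itertext()).split())
        if term in text.lower():
            hits.append((div.get("n"), div.findtext("head")))
    return hits


if __name__ == "__main__":
    for number, heading in find_sections(VOLUME, "tariff"):
        print(number, heading)
```

Because the markup delimits each section, the search can point the reader to sections 1 and 3 of the sample rather than handing back the entire volume.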
Enhanced analysis through entity extraction
Queries run against databases can be effective analytical tools but also have limitations. Records in databases are sometimes inconsistent; for example, names may be spelled in various ways, which would not be picked up in a single query. Initiate Systems recently introduced a solution called Initiate Entity Resolution. The product is a statistical algorithm that calculates the likelihood that two entities are related.
"The customer can set the threshold for a level of confidence," says Scott Schumacher, chief scientist at Initiate Systems. "In law enforcement, users might want to see all the possibilities before excluding any, while in medical records, they might want a very high degree of certainty before relating different sets of information."
The solution can be used across any data source or combination of sources, and is suitable for searching unstructured as well as structured data. It is being used by the FBI Criminal Justice Information Services Division’s National Data Exchange (N-DEX), which will provide federal, state, local and tribal law enforcement agencies with a national system for capturing and sharing criminal justice information.
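The confidence threshold Schumacher describes can be illustrated with a small Python sketch. It does not reproduce the Initiate algorithm, whose details are not given here. As a stand-in it uses the simple string-similarity ratio from Python's standard difflib module, and the sample names and the function candidate_matches are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude stand-in score between 0 and 1; not a true statistical likelihood."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(names, threshold):
    """Return every pair of names whose score meets the chosen threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Invented records with one spelling variation.
    records = ["Jon Smith", "John Smith", "Mary Jones"]
    # A strict threshold keeps only near-certain matches.
    print(candidate_matches(records, 0.9))
    # A loose threshold surfaces more possibilities for a human to review.
    print(candidate_matches(records, 0.2))
```

Lowering the threshold corresponds to the law enforcement case, where users want to see all the possibilities, while raising it corresponds to the medical records case, where a high degree of certainty is required before records are related.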