

XML Search and Query: A New Frontier for Search

The term "best practices" describes methods, processes and practices that have performed exceptionally well, and are widely recognized as improving an organization's performance and efficiency. This article describes best practices in enterprise XML search and query (XQuery) applications.

XML-based content aggregation shares many features with traditional Web crawling. Documents are accessed through Web servers, content management systems, possibly databases, and most often the file system. In the XML environment, an XML representation must be chosen for opaque binary formats such as PDF and MS Office. Natural choices for target XML schemas include the W3C standard XHTML, Norman Walsh's DocBook and the MS Office 2003 XML schemas. Numerous industry-specific schemas also exist and are appropriate choices for applications in those industries (e.g., FIXML, RIXML and XBRL for the financial services industry).

Enterprise search differs from Web surfing. Enterprise users want to find specific information in the course of performing their jobs. They expect to get back a well-filtered list of results containing direct answers to specific questions, and they may expect to see documents they already know. The enterprise user frequently needs to drill down inside a document to find individual elements of content, such as sections, chapters or even single paragraphs (e.g., contract provisions), that contain the information they are looking for.

Best practices in XML search and query merge external hierarchical directory structure with internal hierarchical document structure, and allow users to access content at any useful level of granularity and context. Documents that contain many different topics are best stored, classified and accessed at a micro-document level.

For example, a medical reference work contains a "chapter/section" hierarchy that covers a broad spectrum of subjects. A medical researcher wants to see only the sections relevant to a specific diagnosis or procedure gathered from an electronic library and combined with public web information. There is almost no utility in returning the URL to a 2,000-page medical reference work. A similar argument applies, for example, to large corporate legal documents and government regulatory documents.
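The section-level access described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production search engine: the sample document, its element and attribute names (`book`, `chapter`, `section`, `title`) and the `find_sections` helper are all assumptions made for the example, and matching is a simple case-insensitive substring test.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference work; element and attribute names are assumptions.
SAMPLE = """
<book title="Medical Reference">
  <chapter title="Cardiology">
    <section title="Atrial fibrillation">Diagnosis relies on an ECG.</section>
    <section title="Heart failure">Treatment includes diuretics.</section>
  </chapter>
  <chapter title="Neurology">
    <section title="Migraine">Diagnosis is clinical; no ECG is needed.</section>
  </chapter>
</book>
"""

def find_sections(xml_text, term):
    """Return (chapter title, section title) for each section mentioning term."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    hits = []
    for chapter in root.iter("chapter"):
        for section in chapter.iter("section"):
            # Join all text inside the section, including nested elements.
            text = " ".join(section.itertext()).lower()
            if term in text:
                hits.append((chapter.get("title"), section.get("title")))
    return hits

print(find_sections(SAMPLE, "diuretics"))  # [('Cardiology', 'Heart failure')]
```

The result identifies the matching section and its enclosing chapter, so the user sees the relevant fragment in context rather than a pointer to the whole work.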

Categorically Speaking

People are accustomed to thinking in categories, and enterprise search products generally include taxonomy and classification subsystems. Metadata records containing document classification, document descriptions or digests, and some subset of Dublin Core metadata elements are used to qualify search results and allow the search user to rapidly focus results. In the XML search and query environment, best practices allow the user to select any document element as a possible dimension for metadata-guided search or classification.
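As a rough sketch of using any element as a search dimension, the Python snippet below counts the distinct values of a chosen element across a result set. The record layout and the `facet_counts` helper are hypothetical; `creator`, `subject` and `type` are borrowed from the Dublin Core element set, but any other element name would work the same way.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical result set; the record layout is an assumption for illustration.
RECORDS = """
<results>
  <doc><creator>Lee</creator><subject>contracts</subject><type>memo</type></doc>
  <doc><creator>Lee</creator><subject>patents</subject><type>brief</type></doc>
  <doc><creator>Ortiz</creator><subject>contracts</subject><type>memo</type></doc>
</results>
"""

def facet_counts(xml_text, element_name):
    """Count the values of one element across all records."""
    root = ET.fromstring(xml_text)
    values = (el.text.strip() for el in root.iter(element_name) if el.text)
    return Counter(values)

# The caller picks the dimension; nothing is fixed in advance.
print(facet_counts(RECORDS, "subject"))
print(facet_counts(RECORDS, "creator"))
```

Because the dimension is just an element name passed at query time, the same function serves as a facet over authors, subjects, document types or any other markup present in the content.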

For example, contract documents could contain markup capturing the names of the parties to the contract, dates, definitions, provisions, footnotes, exhibits and possibly even individual sentences within paragraphs. A search across a collection of contracts could be focused on any one of these elements. A user might request an inventory of all extant provisions relevant to limitation of liability involving sums larger than $1M, with the results classified by the state whose law governs the contract (e.g., California, Texas or New York).
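The limitation-of-liability query above might look like the following Python sketch. The contract markup is invented for the example: the `governingLaw` element and the `type` and `amount` attributes on `provision` are assumptions, as is the `liability_provisions_by_state` helper. A real collection would need its own schema and a currency-aware amount parser.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical contract collection; all element and attribute names are assumed.
CONTRACTS = """
<contracts>
  <contract id="c1">
    <governingLaw>California</governingLaw>
    <provision type="limitation-of-liability" amount="2500000">Liability is capped.</provision>
    <provision type="termination">Either party may terminate.</provision>
  </contract>
  <contract id="c2">
    <governingLaw>Texas</governingLaw>
    <provision type="limitation-of-liability" amount="500000">Liability is capped.</provision>
  </contract>
  <contract id="c3">
    <governingLaw>California</governingLaw>
    <provision type="limitation-of-liability" amount="1200000">Liability is capped.</provision>
  </contract>
</contracts>
"""

def liability_provisions_by_state(xml_text, min_amount):
    """Group contracts with a liability cap above min_amount by governing state."""
    root = ET.fromstring(xml_text)
    by_state = defaultdict(list)
    for contract in root.findall("contract"):
        state = contract.findtext("governingLaw", default="unknown")
        # Attribute predicate selects only the provisions of interest.
        for prov in contract.findall("provision[@type='limitation-of-liability']"):
            if int(prov.get("amount", "0")) > min_amount:
                by_state[state].append(contract.get("id"))
    return dict(by_state)

print(liability_provisions_by_state(CONTRACTS, 1_000_000))
# {'California': ['c1', 'c3']}
```

The query combines a structural constraint (the provision type), a parametric constraint (the amount) and a classification dimension (the governing state) in a single pass over the markup.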

Good relevance algorithms include term frequency counts, stemming, thesaurus-based query expansion and phrase-query evaluation using term proximity and relevance score accrual. In the XML environment, the same methods are needed, but they should be available at all levels of granularity in the document structure, and they should allow the markup structure itself to participate in the relevance computation. For example, a search user should be able to specify that the presence of a "summary" or "abstract" structure contributes to relevance in the same way that any other term contributes to relevance. Best practices allow relevance to be based on XPath location path expressions such as '/section/footnote', and allow users to search for any specific structure. The relevance algorithm should take into account the frequency of occurrence of that structure, as well as the frequency of terms or phrases within it.
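To make the idea of structure participating in relevance concrete, here is a toy scoring function in Python. It is a sketch under stated assumptions, not the algorithm of any particular product: the formula, the `path_weight` and `structure_bonus` values and the sample documents are all invented for illustration. The score adds the term frequency in the whole document, a weighted term frequency inside elements matched by a location path, and a bonus for each occurrence of the matched structure. Note that Python's `ElementTree` accepts only relative paths, so the paths are written as `./abstract` and `.//section/footnote`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-document collection.
DOCS = {
    "a": "<doc><abstract>XML search overview</abstract>"
         "<section><p>Search engines index terms.</p></section></doc>",
    "b": "<doc><section><p>Search, search and more search.</p>"
         "<footnote>See XML search.</footnote></section></doc>",
}

def term_count(element, term):
    """Count whole-word, case-insensitive occurrences of term under element."""
    words = re.findall(r"\w+", " ".join(element.itertext()).lower())
    return words.count(term.lower())

def score(xml_text, term, path, path_weight=2.0, structure_bonus=1.0):
    """Toy score: document tf + weighted tf inside path + bonus per structure."""
    root = ET.fromstring(xml_text)
    matches = root.findall(path)
    base = term_count(root, term)
    inside = sum(term_count(el, term) for el in matches)
    return base + path_weight * inside + structure_bonus * len(matches)

def rank(docs, term, path):
    """Order document names by descending score."""
    return sorted(docs, key=lambda name: score(docs[name], term, path), reverse=True)

print(rank(DOCS, "search", "./abstract"))           # ['a', 'b']
print(rank(DOCS, "search", ".//section/footnote"))  # ['b', 'a']
```

Document "b" mentions the term more often, yet document "a" ranks first when the abstract structure is favored; switching the path to footnotes reverses the order. The same terms yield different rankings depending on which structure the user asks the engine to reward.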

Relevance tuning allows search engines to respond to customer-specific requirements for document selection and relevance ordering. Traditional search engines allow ad hoc inclusion of term context (title, heading, etc.), location (near the top), and page popularity/authority measures in the tuning process. Best practices in XML search allow the search application designer to use any element of structure in the relevance-tuning process. For example, relevance could be tuned based on the appearance of a search term in the title child element of one of the first two section children of the document root element. Constraints of this kind can be combined with general parametric queries targeted to any element or attribute appearing in the document. For example, a further tuning constraint might move certain document fragments to the top of the search-result list: those that contain a table element with a column header labeled "price" and, on a row labeled "license," a cell in that column whose value exceeds $10K.
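The two tuning constraints described above can be sketched in Python as follows. Everything here is an assumption made for illustration: the HTML-like table markup (`tr`, `th`, `td`), the convention that the first cell of a row is its label, the boost values and the helper names. In full XPath the first constraint would be written with a positional predicate, but the standard-library `ElementTree` does not support `position()`, so the sketch slices the list of section children instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the table layout is an assumed convention.
DOC = """
<doc>
  <section><title>Pricing overview</title>
    <table>
      <tr><th>item</th><th>price</th></tr>
      <tr><td>license</td><td>12000</td></tr>
      <tr><td>support</td><td>3000</td></tr>
    </table>
  </section>
  <section><title>Terms</title></section>
  <section><title>Pricing appendix</title></section>
</doc>
"""

def term_in_early_title(root, term, first_n=2):
    """True if term occurs in the title of one of the first_n root sections."""
    for section in root.findall("section")[:first_n]:
        title = section.findtext("title", default="")
        if term.lower() in title.lower():
            return True
    return False

def has_priced_row(root, column, row_label, threshold):
    """True if a table has a row_label row whose column value exceeds threshold."""
    for table in root.iter("table"):
        rows = table.findall("tr")
        if not rows:
            continue
        headers = [(th.text or "").strip().lower() for th in rows[0].findall("th")]
        if column not in headers:
            continue
        col = headers.index(column)
        for row in rows[1:]:
            cells = [(td.text or "").strip() for td in row.findall("td")]
            if len(cells) > col and cells[0].lower() == row_label:
                try:
                    if float(cells[col]) > threshold:
                        return True
                except ValueError:
                    continue  # Non-numeric cell: ignore this row.
    return False

def tuned_score(xml_text, term, base_score=1.0):
    """Apply two illustrative boosts on top of a base relevance score."""
    root = ET.fromstring(xml_text)
    score = base_score
    if term_in_early_title(root, term):
        score += 1.0   # Arbitrary boost for an early-title hit.
    if has_priced_row(root, "price", "license", 10_000):
        score += 5.0   # Arbitrary boost that pushes the fragment to the top.
    return score

print(tuned_score(DOC, "pricing"))   # 7.0
print(tuned_score(DOC, "appendix"))  # 6.0
```

The term "appendix" appears only in the third section title, so it misses the early-title boost while still receiving the table boost. This shows how structural position and parametric table content contribute independently to the tuned score.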

Best practices in enterprise search and query drive a convergence of new search and traditional database access paradigms.


Paul Pedersen has more than 25 years of experience in the software industry. Prior to co-founding Mark Logic, Pedersen held senior leadership positions at Inktomi, Google and Infoseek. In addition, Pedersen has extensive experience in bioinformatics and expert systems for financial services. While working at Kidder-Peabody in 1986, Pedersen helped to build one of the world's first program trading systems on Wall Street.

Mark Logic, an XML content server provider, helps information-product providers accelerate new product creation, build custom publishing systems, deliver products through multiple channels, integrate content from different sources, repurpose content into multiple products and mine content to find previously undiscovered information. MarkLogic Server does this by enabling companies to query, manipulate and render XML content. MarkLogic Server delivers millisecond response times against multi-terabyte content bases. For more information, visit Mark Logic or call 650-655-2300.
