Meeting Government's Need for Enterprise Search

The government market for enterprise search products has witnessed substantial growth over the past several years, a trend that is projected to continue well into 2005 and beyond. This increased demand for search solutions can be attributed to several major factors:

  • Information overload: The ubiquity of digital communications, personal computers, cheap storage and Web publishing has created an ever-growing glut of digital information. To deal with the huge influx of this digital content, government agencies are looking for innovative solutions that help filter, find, organize and route information to those persons who need it, when they need it.

  • Legislation: The E-Government Act of 2002 is fundamentally reshaping the way government does business. Subsection (207.f.1), for instance, requires all federal agencies by the end of 2004 to have committed to a strategy and timetable for making pertinent public information available in electronic form. Effective search can improve the accessibility of published information by reducing the time users spend finding relevant information.

  • Evolving threats: The events of 9/11 and the subsequent war on terror have forced the intelligence, law enforcement and defense departments to rethink the ways they gather, reuse and share information both within and across organizations. Effective search technologies play a critical role in helping analysts connect the dots.

  • Technology-savvy users: A Web-savvy population, coupled with the popularity of Web search engines, has turned "search" from a skill used primarily by librarians and researchers into an everyday household activity. Both government employees and the general public have come to rely on search as a dependable way to quickly find the information they need.

For government organizations starting to deal with these issues, choosing the right search solution can be challenging. From search appliances to question-answer systems, the choices seem endless. To complicate matters, many vendors claim to support similar features but differ greatly in their approach and level of coverage. To ensure successful deployment, organizations need to move past the product literature and high-level feature checklists to gain understanding of the fundamentals of enterprise search and how it applies to their environment.

Enterprise Search vs. Web Search

For many people, the term "search" has become synonymous with services such as Google, Yahoo and MSN. A user can enter a few keywords into a search box, press a button, and usually find some relevant Web pages within a few clicks. The buzz generated around these businesses has led some IT decision-makers to assume that the same approaches and technologies for Web search can be directly applied within an enterprise. Closer examination shows that the spaces are quite distinct. To see why, let's look at a few of the underlying assumptions of Web search.

One significant difference between enterprise and Web search involves the nature of the content being searched. Web search companies have invested heavily in technologies to address filtering of spam, promotion of paid sponsored links and development of algorithms that depend primarily on the social popularity of Web sites. While effective over the entire World Wide Web, these techniques provide little or no advantage when applied to enterprise assets such as office documents, database tables, XML records or internal Web sites. A second difference is that simple keyword searching works well in most Web search engines because chances are extremely high that someone on the WWW has already constructed a Web page that uses the exact keywords specified by the user. The scale of the Web and the social network effect contribute to the impression that highly relevant results are always returned. A more accurate description of what is going on is that users usually get an acceptable answer, not necessarily the best or complete answer. Because of the scale and patchy nature of the Web, users rarely realize this. For enterprise search, approaches beyond keyword search and popularity-based ranking are required to get the best documents back and to ensure higher recall of all the information relevant to a query.

Data Aggregation

In most organizations, critical business information is spread across file servers, Web servers, databases, desktops, content management systems, e-mail, collaboration servers and other business applications. Most of these systems limit searching to content managed locally on the server and provide very basic search capabilities. An effective enterprise search solution can often solve this problem by providing a single access point that spans multiple heterogeneous repositories. The decision of which content repositories and data to index is highly dependent on the business problem being solved. For example, let's say a regulatory agency wants to reduce the time it takes its agents to research a complaint. By indexing the e-mail server mailbox where complaints are received, an e-mail archive of historical complaints and a database where investigative evidence is stored, agents would be able to quickly piece together all relevant information about a case. When evaluating a vendor's data-integration capabilities, issues to consider are:

  • Does the vendor support the content repositories you plan to index?

  • Does the solution deal well with indexing databases?

  • Does the solution use native APIs to connect to the repository?

  • How quickly are changes in the underlying repository reflected in the index?

  • Does the vendor provide a robust, scalable spider to crawl internal Web sites?

  • Does the solution allow you to configure which custom metadata is retrieved from each repository?

  • Does the solution provide a flexible API to allow integration into legacy IT applications?

  • Does the solution support hierarchical objects: embedded files, zips or multi-page scanned documents?

Note that indexing solutions that rely primarily on Web crawling instead of native APIs will be limited to finding assets accessible through a Web page and have very limited flexibility in dealing with metadata.
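
As a rough illustration of the "single access point" idea in the regulatory-agency example, the following Python sketch builds one toy index over documents drawn from three sources. The class names, repository labels and sample text are all hypothetical, and a real product would pull content through repository connectors rather than hand-built records.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    repository: str          # which source system the document came from
    text: str
    metadata: dict = field(default_factory=dict)


class UnifiedIndex:
    """Toy inverted index spanning several repositories (illustrative only)."""

    def __init__(self):
        self._postings = {}   # term -> set of doc_ids
        self._docs = {}       # doc_id -> Document

    def add(self, doc):
        self._docs[doc.doc_id] = doc
        for term in doc.text.lower().split():
            self._postings.setdefault(term, set()).add(doc.doc_id)

    def search(self, query):
        """Return documents containing every query term, from any repository."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted((self._docs[i] for i in ids), key=lambda d: d.doc_id)


# Hypothetical sample content standing in for three separate systems.
index = UnifiedIndex()
index.add(Document("mail-17", "complaint_inbox", "complaint about acme billing"))
index.add(Document("arch-03", "email_archive", "earlier acme complaint from 2003"))
index.add(Document("ev-88", "evidence_db", "invoice copies supplied by acme"))
index.add(Document("mail-18", "complaint_inbox", "question about office hours"))

hits = index.search("acme complaint")
```

Here one query returns matches from both the live inbox and the archive, each result still labeled with the repository it came from, which is the behavior the checklist above is probing for.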

Security

While the power of search stems from its ability to aggregate data, along with its benefits come some risks. The highly publicized incidents of personal data theft from companies such as Seisint and Choicepoint illustrate the problems that occur if appropriate policies, procedures and technologies are not in place to protect access to information. Clearly, strong security capabilities must be a cornerstone of every enterprise search solution. Issues to consider are:

  • Does the solution use industry standard authentication and access control protocols?

  • Does the solution protect server-to-server communication?

  • Does the solution provide flexible policies for choosing what is indexed, what is not indexed and where indexes are stored?

  • Can the system overlay additional access control policies for logical groups of aggregated data?

  • Does the solution provide access control down to the user and document level?

  • Does the solution integrate with IT directory infrastructures used for managing users and group membership (e.g., LDAP, Active Directory, custom)?

  • Can the solution use document access control attributes stored in the repository?

  • Can the solution deal with multiple authentication and access control systems available for each repository?

  • Can the security system be extended to support custom access control schemes?

  • Does the system provide sufficient audit and logging capabilities?

Carefully evaluate solutions to determine if vendors have added security as an afterthought. These add-on approaches can leave security holes or result in significant performance impacts when security is turned on.
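
To make document-level access control concrete, here is a minimal Python sketch in which each indexed document carries the groups allowed to read it, and results are filtered against the searching user's group membership before they are returned. The user names, groups and the in-memory directory are assumptions for illustration; a real deployment would look up membership in LDAP or Active Directory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    text: str
    allowed_groups: frozenset   # ACL copied from the source repository at index time


# Hypothetical directory lookup; a real deployment would query LDAP or Active Directory.
GROUPS_BY_USER = {
    "alice": {"investigators", "staff"},
    "bob": {"staff"},
}

DOCS = [
    IndexedDoc("case-1", "fraud complaint witness statement", frozenset({"investigators"})),
    IndexedDoc("memo-7", "fraud awareness training schedule", frozenset({"staff"})),
]

AUDIT_LOG = []   # (user, term, returned ids) for every query


def secure_search(user, term):
    """Return ids of matching documents the user may see, and log the query."""
    groups = GROUPS_BY_USER.get(user, set())
    visible = [
        d.doc_id
        for d in DOCS
        if term.lower() in d.text.lower().split() and groups & d.allowed_groups
    ]
    AUDIT_LOG.append((user, term, tuple(visible)))
    return visible
```

Because the filter is part of the search path itself, an unknown user gets an empty result rather than an unfiltered one, and every query leaves an audit record.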

Metadata Search

While keyword searching represents a base capability available in all search engines, more sophisticated search strategies are often called for. In many circumstances, end-users can more quickly locate relevant information assets by using attributes of the content rather than simple keywords applied directly to the document contents. To enable powerful metadata search, solutions should be evaluated on their ability to support:

  • Indexing of multiple metadata schema in the same repository;

  • Mapping of metadata schema across repositories;

  • Powerful parser scripting language to easily extract metadata from unstructured documents;

  • Indexing, extraction and search of metadata stored in XML records;

  • Natural language, integer, floating point and date searching over metadata fields;

  • Exact match as well as range searches for numbers and dates;

  • The ability to aggregate metadata stored in separate repositories at runtime (e.g., a system should be able to index documents in a file system while using metadata attributes stored in separate RDBMSs);

  • Query operators such as AND, NOT, OR and field weighting across the document body and across fields; and

  • Incremental indexing for documents whose useful lifetime is short or whose metadata changes often.
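
Two of the capabilities listed above, schema mapping across repositories and exact-match plus range searching, can be sketched in a few lines of Python. The field names, repository names and sample records below are hypothetical and chosen only to show the shape of the idea.

```python
from datetime import date

# Hypothetical per-repository field names mapped onto one shared schema.
FIELD_MAP = {
    "file_share": {"author": "author", "modified": "date"},
    "records_db": {"created_by": "author", "filed_on": "date", "amount_usd": "amount"},
}


def normalize(repository, raw):
    """Rename repository-specific metadata fields to the shared schema."""
    mapping = FIELD_MAP[repository]
    return {mapping[k]: v for k, v in raw.items() if k in mapping}


def matches(meta, exact=None, ranges=None):
    """True if meta satisfies every exact-match and inclusive range condition."""
    for name, wanted in (exact or {}).items():
        if meta.get(name) != wanted:
            return False
    for name, (low, high) in (ranges or {}).items():
        value = meta.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


records = [
    normalize("file_share", {"author": "lee", "modified": date(2004, 3, 1)}),
    normalize("records_db", {"created_by": "lee", "filed_on": date(2005, 1, 15), "amount_usd": 1200.0}),
    normalize("records_db", {"created_by": "kim", "filed_on": date(2004, 11, 2), "amount_usd": 80.5}),
]

# Exact match on author combined with a date range over the shared "date" field.
in_2004_by_lee = [
    r for r in records
    if matches(r, exact={"author": "lee"},
               ranges={"date": (date(2004, 1, 1), date(2004, 12, 31))})
]
```

Once the fields are normalized, the same query works across both repositories even though one calls the field "modified" and the other "filed_on".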

Text Mining and Advanced Discovery

Aside from the modes of searching described so far, many civilian regulatory, law enforcement and intelligence organizations need sophisticated discovery capabilities in the area of text mining. Let's say, for example, an intelligence analyst is given the task of scouring large amounts of raw intelligence data to determine the most likely location of a wanted terrorist. Obviously, using a simple keyword query such as "Where is terrorist XYZ?" will be fruitless. The clues to the answer may be spread across multiple nuggets of information that only a human analyst can reasonably piece together. The purpose of the search application, then, is to provide a set of tools to enable analysts to efficiently explore the information space, filter out noise, connect the dots and see a pattern. Many different questions will be asked, which over time will help the analyst piece together the puzzle. For these types of applications, analysts require a wide range of capabilities including:

  • High-performance and scalable system that supports complex queries (thousands of terms and operators);

  • Ability to work with extremely large sets of results;

  • Storing of results and queries for reuse and refinement;

  • Filtering huge amounts of information, signal detection and notifying users of new relevant information;

  • A powerful query language with, for instance, Boolean, proximity and adjacency, wildcard and pattern-matching operators;

  • Support for diverse document formats and noisy data;

  • Easy integration with other analysis tools and repositories;

  • Ability to slice and dice data to see the big picture and quickly drill down to relevant information;

  • Support for searching over time-based multimedia assets;

  • Support for multiple languages and cross-lingual search;

  • Customizable domain-specific knowledge resources;

  • Entity extraction; and

  • Co-occurrence detection.
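
The last two items, entity extraction and co-occurrence detection, can be illustrated with a small Python sketch. It assumes a simple watch list as the "domain-specific knowledge resource" and counts how often pairs of listed names appear in the same report; the names and report text are invented, and production systems use far richer extraction than dictionary lookup.

```python
from collections import Counter
from itertools import combinations

# Hypothetical watch list standing in for a domain-specific knowledge resource.
KNOWN_ENTITIES = {"karachi", "lisbon", "falcon", "meridian"}

REPORTS = [
    "Courier Falcon seen near Karachi market",
    "Meridian shipment rerouted through Lisbon",
    "Falcon met Meridian contact in Karachi",
]


def extract_entities(text):
    """Dictionary-lookup entity extraction: keep tokens found in the watch list."""
    tokens = (t.strip(".,;:").lower() for t in text.split())
    return sorted({t for t in tokens if t in KNOWN_ENTITIES})


def cooccurrence_counts(texts):
    """Count how many documents mention each pair of entities together."""
    counts = Counter()
    for text in texts:
        counts.update(combinations(extract_entities(text), 2))
    return counts


pairs = cooccurrence_counts(REPORTS)
```

Pairs that recur across many reports are candidates for the analyst to examine first, which is one simple way software can help "connect the dots" without trying to answer the question itself.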

The Change to E-government

The push to e-government and the ubiquity of digital information are forcing agencies all across the government to re-evaluate the way they get business done on a daily basis. Enterprise search solutions can introduce significant efficiencies and cost savings by helping agencies filter, find, organize and route critical information to those persons who need it, when they need it. When selecting search solutions, it is important to understand that different business problems require fundamentally different approaches to search.


Sameer Kalbag is director of product management at Convera Corporation. He directs Convera's product line in the areas of index and search, system architecture, security and APIs. He has a degree in computer science from Cornell University and has worked in areas related to information retrieval, information security and system architecture for the past 10 years.

Convera is a leading provider of mission-critical enterprise search and categorization solutions. Convera's RetrievalWare solutions provide highly scalable, fast, accurate and secure search across 200 forms of information, in 45 languages. More than 800 customers in 33 countries rely on Convera's search solutions to power a broad range of mission-critical applications. For more information, please visit www.convera.com
