Searching across the enterprise

This article appears in the September 2001 issue [Volume 10, Issue 8]

The software in demand today creates and manages knowledge collections by acquiring information from such diverse sources as Web crawlers or authoring tools, then indexing and organizing it in a structure that is pertinent and understandable to the organization. The need to maintain document versions and archives and, more importantly, to make the information readily accessible through advanced, easy-to-use search and browsing tools is driving sales of content management and enterprise search and retrieval technology. IDC calls this collection of software "information system infrastructure software."

As people become more sophisticated in their use of new retrieval tools, we can expect their queries to become more elaborate and specialized. While documents will still be retrieved by author, number or date, we can also expect that range searching by date or numbers may be needed. Good systems will eventually understand the context in which searchers are asking questions so that a search on "ABC" will be interpreted as either "activity-based costing" or a major television network, depending on the context of the question. Subject queries should become more complex (e.g., "methods for placing a value on intellectual assets"), or they may demand an answer instead of a set of documents (e.g., "Who was director of marketing in 1994?").

Challenges

A robust information system infrastructure must support a single point of access to one virtual data store. Otherwise, employees run the risk of forgetting or ignoring important information sources, thus missing key data that might swing a decision one way or another. Unfortunately, keeping track of information within a large enterprise is a staggering task. Each department or business process generates collections that are unique in their structure and content.

Information systems are complex. Designing a system that searches well across disparate types of data has become one of the greatest challenges facing information scientists. For example, while it may be easy to compare two documents to see which is more relevant to a query, it is no simple matter to decide what the value of a single fact from a customer relationship management (CRM) database is when it is weighed against a paragraph from a 10-page, full-text document. Which should be served first to the user? Can the system provide alternate views to each individual, depending on his or her role in the company? Can weights be changed to reflect how the company uses data in its decision-making process? Is the system capable of providing several views into the same data so that it can be examined from different angles? Can access to sensitive financial or personal information be limited?

That complexity demands more than a basic search system. Enterprises need high-end tools to support involved information needs. Without them, the data analysis upon which good decision-making rests is dangerously incomplete. Yet, keeping up with new retrieval technologies requires a depth of understanding that is outside the expertise or interests of most organizations. Investment in the wrong technology dooms employees to unmitigated frustration. Worse, it results in lost productivity and poor decisions. Enterprises need advanced systems that can be installed and used with minimal knowledge or effort.

The ideal information system infrastructure

A well-functioning information system infrastructure stands at the heart of today's knowledge-centric enterprise. It must include the following elements in order to invite use and collaboration among users:

Search

Good retrieval is an elusive goal. It depends not only on the sophistication of the search engine but also on the preparation of the information, the underlying structure of contents and the index. In addition, the aptitude, skills and patience of the users can determine success or failure. Good search is crucial to any information system, but in an enterprise, good search may mean the difference between making well-grounded decisions and making poor ones. Search tools have a quantifiable impact on the success of the organization. A good search system returns the information that's needed, when it's needed. While "good" is impossible to define functionally, technical capabilities contribute to making a search system a good one. Effective search and retrieval software must do the following:

  • return the most relevant documents first;

  • search across multiple formats and repositories;

  • find documents about a topic even if the author's and the searcher's terminology doesn't match exactly; and

  • alert users to new information on topics they are monitoring.

Relevance ranking

Proper relevance ranking determines whether a user feels that a system is effective. If the system finds relevant documents but does not present them early in the search results, the user feels that his or her question has not been interpreted properly or that the system is returning materials haphazardly. Yet, relevance ranking is somewhat in the eye of the beholder. Based on a one-, two-, or three-word query, it is difficult to infer what will be most useful to a particular searcher.

Most systems determine relevance by counting the number of times a query term appears in a document. Those documents that have the most occurrences of the most query terms are deemed to be most relevant. However, other factors must also be considered. For instance, long documents have more words in them, so the number of occurrences must be averaged over the length of each document. The location of the words in the document may determine how important they are to that text. If they appear in the title, the lead paragraph or the conclusion, they are more likely to be important than if they appear in the middle of the text in an illustrative anecdote. Many systems take those factors into account.
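The factors described above can be illustrated with a minimal sketch. It is not any vendor's ranking formula: it simply counts query terms, divides by document length, and gives extra weight to matches in the title. The names `TITLE_BOOST`, `score` and `rank`, and the boost value itself, are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical weight: a normalized title match counts three times
# as much as a normalized body match.
TITLE_BOOST = 3.0

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, title, body):
    """Length-normalized term frequency, with extra weight for title matches."""
    terms = set(tokenize(query))
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    title_counts = Counter(title_tokens)
    body_counts = Counter(body_tokens)
    total = 0.0
    for term in terms:
        # Dividing by length keeps long documents from winning
        # merely because they contain more words.
        body_tf = body_counts[term] / max(len(body_tokens), 1)
        title_tf = title_counts[term] / max(len(title_tokens), 1)
        total += body_tf + TITLE_BOOST * title_tf
    return total

def rank(query, docs):
    """docs is a list of (title, body) pairs; return (title, score), best first."""
    scored = [(title, score(query, title, body)) for title, body in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these assumptions, a short memo that mentions "costing" twice outranks a 100-word report that mentions it once, and a document with the term in its title outranks both.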

Relevance ranking is sorely tested when it must return materials of unequal length or different structures. For example, words in a bulleted list should probably carry more weight than words in a long paragraph. Text in captions for illustrations or within a table is usually quite brief, but it is probably worth more in determining the importance of a subject than an entire paragraph of unstructured text.

Systems must compare the content of a PowerPoint presentation with that of a technical report or a news release. A good system would examine the type of document and adjust the relevance weighting for document type and content. The importance of each varies with the context in which the search is performed. Ranking algorithms should be adaptable to the priorities of the organization.
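One simple way to make ranking adaptable is to multiply each hit's base score by a weight the organization sets for its document type. The sketch below assumes hits arrive as dictionaries with `id`, `doc_type` and `score` keys; the type names and weight values are hypothetical, not drawn from any product.

```python
# Hypothetical per-type multipliers that an organization might tune.
DEFAULT_TYPE_WEIGHTS = {
    "technical_report": 1.5,
    "news_release": 1.0,
    "presentation": 0.8,
}

def rerank(hits, type_weights=None):
    """Reorder hits by base score multiplied by the weight of their document type.

    hits: list of dicts with 'id', 'doc_type' and 'score' keys.
    Document types without a configured weight keep their base score.
    """
    weights = DEFAULT_TYPE_WEIGHTS if type_weights is None else type_weights

    def adjusted(hit):
        return hit["score"] * weights.get(hit["doc_type"], 1.0)

    return sorted(hits, key=adjusted, reverse=True)
```

Passing a different `type_weights` table changes the ordering without touching the underlying scores, which is the kind of adjustment an administrator could make without re-indexing.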

Integrated search

Most knowledge-centered enterprises have amassed a diverse collection of sources. Those sources comprise text documents in various formats, presentations, training materials, graphics and illustrations, Web pages, structured databases and the contents of CRM and enterprise resource planning (ERP) systems. They are typically distributed all over the company.

Two streams of external content need to be accessed by the user. The first is high-quality, carefully selected content such as market reports, technical reports, business intelligence and industry studies. That content must be selected and licensed for internal distribution, and it typically carries a high price tag. The second stream is Web content that is pertinent to the enterprise. That content may include competitors' Web sites, product announcements, research in specific areas, or new advances in CRM or human resources.

While access to the entire riches of the Web should be available, delivery of focused content ensures better retrieval of the most important materials. Therefore, any system should be able to configure crawlers for tailored Web coverage. The ideal information system would be able to search across all content sources and all formats to deliver a single answer to a query.
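Tailored Web coverage usually comes down to telling the crawler which sites are in scope. The following is a minimal sketch of such a scope check, assuming a configuration with an allow list of domains and a list of excluded path prefixes. The configuration keys, the `should_crawl` function and the example domain are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical crawl configuration: domains to cover and paths to skip.
CRAWL_CONFIG = {
    "allowed_domains": {"example.com"},
    "excluded_path_prefixes": ["/careers"],
}

def should_crawl(url, config=CRAWL_CONFIG):
    """Return True if the URL is within the configured crawl scope."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Accept the domain itself and any of its subdomains.
    in_scope = any(
        host == domain or host.endswith("." + domain)
        for domain in config["allowed_domains"]
    )
    excluded = any(
        parts.path.startswith(prefix)
        for prefix in config["excluded_path_prefixes"]
    )
    return in_scope and not excluded
```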

Language

Search systems match questions to textual content. Typically, more than 75% of the materials that an organization churns out are text documents. Since language is notoriously imprecise, perfect matching is a tall order. The problem is compounded by users who rarely ask the right question to retrieve the information they need. While it is difficult to ask a question about an unfamiliar subject, matching marginally related query terms to the precise language of a subject expert makes it even more difficult to achieve a perfect match. In addition, there are myriad ways to express the same idea. Any search system that can offer synonyms as well as clues about related topics and terminology immediately improves the likelihood of searching success.

Some search systems ignore the language problem and return exact matches or nothing. Others measure similarity between documents and queries statistically. Those approaches are inadequate for any knowledge-centered organization. Instead, the search system should incorporate advanced retrieval capabilities that do the following:

  • Automatically identify phrases in the query and match them against phrases in documents. (Phrases are better carriers of meaning than are separate words. Automatic use of phrase searching eliminates many poor or puzzling search results.)

  • Identify name and term variants and map them to a single concept.

  • Eliminate documents that contain the right word with the wrong meaning. (Since most words have more than one meaning, exact-match systems often return documents that are not relevant but that contain the terms in the query. Systems that classify documents by topic or that examine the context of a query and a document to determine the concept or idea improve the quality of the results.)
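The first two capabilities, phrase identification and variant mapping, can be sketched with a small lookup table. The example below assumes a hand-built table of variants (`VARIANTS`, hypothetical) and matches the longest phrases first. It does not attempt the third capability: it would map every "ABC" to activity-based costing, even where the context points to the television network.

```python
import re

# Hypothetical variant table: every surface form maps to one canonical concept.
VARIANTS = {
    "abc": "activity-based costing",
    "activity based costing": "activity-based costing",
    "activity-based costing": "activity-based costing",
    "crm": "customer relationship management",
    "customer relationship management": "customer relationship management",
}

def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_concepts(text):
    """Return the canonical concepts mentioned in text, longest variants first."""
    remaining = normalize(text)
    found = set()
    for variant in sorted(VARIANTS, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        if re.search(pattern, remaining):
            found.add(VARIANTS[variant])
            # Blank out the matched phrase so its words are not matched again.
            remaining = re.sub(pattern, " ", remaining)
    return found

def matches(query, document):
    """A document matches when it shares at least one concept with the query."""
    return bool(find_concepts(query) & find_concepts(document))
```

Under these assumptions, a query for "ABC" finds a document titled "A guide to activity based costing" even though the two share no words.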

Appropriate file structure

Systems that retrieve unstructured text, such as documents and reports, require a different functionality and structure than those that deal primarily with structured data. To search large collections of text well, a system must be able to search across entire documents at once, including the metadata and illustrations. When those elements are stored separately, relevance ranking is impaired. Calculations must be performed on each of the separate elements, and then the results must be joined. In a small system, this is not a problem, but text-based systems grow quickly, and extensive processing slows them down.

In addition, the unpredictable nature of unstructured text makes it a poor candidate for relational database structuring. Inevitably, the text is placed in a table as a single object, obviating the usefulness of relational database operations. Information system infrastructure must support such searches as:

  • What are the current best practices for risk assessment in the insurance industry?

  • Notify me when the following companies release a new product.

  • Find all instances of airline mergers that have not been approved.

Large databases present unique problems to an information system. So much data may slow down a system unless it is optimized to handle large amounts of text as well as many simultaneous queries. Large text databases require specialized file structures to enable searching by proximity, phrase identification, entity identification and other linguistic features. Distributed databases must be searched simultaneously and then have their separate results merged and compared for relevance to the query.
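Merging results from separate repositories is harder than it sounds, because each repository may score on its own scale. A minimal sketch follows, assuming each repository returns `(doc_id, score)` pairs. It rescales each list by its top score and then merges the sorted lists. Dividing by the top score is only one possible normalization and is an assumption here, as are the function names.

```python
import heapq

def normalize_scores(hits):
    """Scale one repository's scores into [0, 1] by dividing by its top score."""
    top = max((score for _, score in hits), default=0.0)
    if top <= 0:
        return [(doc_id, 0.0) for doc_id, _ in hits]
    return [(doc_id, score / top) for doc_id, score in hits]

def merge_results(per_repository, limit=10):
    """Merge per-repository hit lists into one ranked list.

    per_repository: dict mapping repository name -> list of (doc_id, score).
    Returns a list of (repository, doc_id, normalized_score), best first.
    """
    streams = []
    for repo, hits in per_repository.items():
        ordered = sorted(normalize_scores(hits), key=lambda h: h[1], reverse=True)
        streams.append([(score, repo, doc_id) for doc_id, score in ordered])
    # Each stream is already sorted from best to worst, so a k-way merge
    # produces the combined ranking without re-sorting everything.
    merged = heapq.merge(*streams, key=lambda item: item[0], reverse=True)
    return [(repo, doc_id, score) for score, repo, doc_id in merged][:limit]
```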

Large databases by their very nature return large numbers of documents in response to a query. Those documents must be presented to the user so that the entire retrieved set can be comprehended at a glance and navigated easily. Because of their size, large databases will also be slower to search unless searching techniques are optimized.

Alerting features

Most employees have continuing needs for the same kind of information: "What are new techniques for risk assessment?"; "Send me all changes in management at my competitors' companies"; "Notify me about new trends in benefits packages"; and "Alert me to any new research on theophylline." A good information system should be able to collect preset profiles of employees' interests and run them against all incoming new materials. It should also be able to establish a repeating query from any query that a user stipulates without requiring lengthy procedures or forms.

Categorization and clusters of related terms

Good categorization creates the foundation for a good search system. Categorization adds indexing information to documents as they are entered into the system. That indexing metadata should include title, author, date, format or document type, as well as other identifying data. It should also include subject headings so that documents about a single concept are grouped together. The process must be automatic in order to avoid bottlenecks. Additionally, it must be intelligent enough to avoid "dumb computer" errors.

Using categorization, a search system can find related materials even if all the query terms are not in the document. Good categorization also builds clusters of related terms and synonyms. Those are used to find documents that may be about a topic even if the searcher doesn't use the most relevant terms in the query. Clusters of synonyms can be used to expand a query to enrich it with appropriate additional terms.
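Query expansion with synonym clusters can be sketched in a few lines. The clusters below are hypothetical and hand-written. In a real system they would come from the categorization step described above.

```python
# Hypothetical synonym clusters, e.g. produced by a categorization step.
CLUSTERS = [
    {"merger", "acquisition", "takeover"},
    {"valuation", "appraisal", "value"},
]

def expand_query(terms, clusters=CLUSTERS):
    """Return the query terms plus every synonym from clusters they belong to."""
    expanded = []
    seen = set()
    for term in (t.lower() for t in terms):
        related = {term}
        for cluster in clusters:
            if term in cluster:
                related |= cluster
        # Keep the original term first, then its synonyms in alphabetical order.
        for candidate in [term] + sorted(related - {term}):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded
```

For example, the terms "airline" and "merger" would expand to "airline", "merger", "acquisition" and "takeover", so a document that speaks only of an airline takeover can still be found.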

Categorization can be used to create "instant directories" that show the user the contents of a set of search results by categories so that the most useful results can be found quickly. Most important, categorization enables a search system to hone its results, eliminating many of the irrelevant hits that plague systems that do not categorize. Directories are an important entry point into a collection of invisible documents. Directory building also requires categorization. Since some users prefer to browse rather than search, directories are valuable information-finding tools that improve any information system.
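An "instant directory" over a result set amounts to grouping the hits by the categories assigned at indexing time. The sketch below assumes each result carries a list of category labels. The `build_directory` name and the "Uncategorized" fallback label are illustrative choices.

```python
from collections import defaultdict

def build_directory(results):
    """Group search results by category.

    results: iterable of (title, categories) pairs.
    Returns a list of (category, titles), largest categories first.
    """
    directory = defaultdict(list)
    for title, categories in results:
        # Results with no categories are kept visible under a fallback label.
        for category in categories or ["Uncategorized"]:
            directory[category].append(title)
    # Sort by descending size, then alphabetically to keep the order stable.
    return sorted(directory.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```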

Usability and good interface design

Usability is a critical component of any system. Design, reliability, predictability and ability to understand are all facets of usability. A system must make its features known and understandable so that users can comprehend how to use it, as well as how it works. A system must be available when needed, and it must return the same answers to the same question, barring updates to content or changes in context. Usability requires an easy-to-use interface that will accept all kinds of queries--questions, sentences, keywords, or Boolean commands--and interpret them as they are intended to be interpreted. The interface must be easy to navigate and offer visual navigation features and multiple views for filling different kinds of information needs. Ease of use is important for users, system administrators and content contributors.

Good information finding requires many kinds of functionality within a system if it is to serve the varied needs of an organization. Both browsing and searching should be supported. While searching is best for finding answers to specific questions, it does not answer questions such as "What is in this information system?" or "What marketing materials or company profiles have already been written that I can use for my next press release?"

Thus, a good information system should support both browsing and searching by creating and maintaining a directory of information that can be viewed from several perspectives (e.g., by subject, type of materials, author or date). Based on the extensive categorization described above, the directory should present materials in a hierarchy that makes sense to the user community.

Flexibility

The information usage patterns and needs of any enterprise must be reflected in the structure of its information system. Terminology should match internal use. Relevance ranking must reflect the priorities of the organization. Therefore, any good search system should be adaptable and customizable as follows:

  • The taxonomy must be easy to change and modify.

  • Access should be customizable for different parts of the organization based on business rules and the need for sensitive information.

  • Search boxes should accommodate different types of queries: to retrieve employee records by date or name or number, to ask for facts, to launch Boolean queries or to ask natural language queries of any length. A good system should let the user cut and paste an entire passage into the search box in order to locate the source document as well as any other documents in which it was used or paraphrased.

  • Relevance ranking should be adjustable to meet the priorities of the enterprise, giving higher weight to types of documents or repositories that are central to the organization.

  • Personalization of the interface or the type of information delivered should be possible based on an employee's role in the company or that employee's need for information.

  • Customization features should permit changes in the interface design, a choice of which external content to crawl and the opportunity to determine the crawling schedule of both internal and external content.
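The cut-and-paste passage search mentioned in the list above can be approximated by comparing overlapping word sequences, often called shingles. The sketch below is one common technique, not something the article prescribes. It measures what fraction of the passage's three-word shingles appear in each document. The shingle size and threshold are assumptions, and the method finds copied or lightly edited text rather than true paraphrase.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word n-grams ('shingles') in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def containment(passage, document, size=3):
    """Fraction of the passage's shingles that also occur in the document."""
    passage_shingles = shingles(passage, size)
    if not passage_shingles:
        return 0.0
    shared = passage_shingles & shingles(document, size)
    return len(shared) / len(passage_shingles)

def find_sources(passage, docs, threshold=0.5):
    """docs: dict of name -> text. Return (name, score) at or above threshold."""
    scored = [(name, containment(passage, text)) for name, text in docs.items()]
    hits = [(name, s) for name, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```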

Administration tools

Any system requires tools to enable system administrators, content administrators and appropriate staff to organize materials and analyze usage. Those tools include the following:

  • reporting tools for tracking usage, analyzing queries and determining which information sources are of most value to employees;

  • workflow features to control and track input of and changes to content;

  • taxonomy construction and maintenance tools; and

  • versioning and rollback capabilities.

Security

Secure access to sensitive or proprietary information is a major concern for all enterprises. In addition to protecting its intellectual capital, an organization must ensure the privacy of its employees. The problem is compounded by a mobile work force that requires access from within or from outside the firewall. Therefore, information system infrastructure must contain:

  • built-in rights and permissions processes that use established business rules;

  • a hacker-resistant architecture; and

  • secure, private access to data from any location worldwide.
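Rights and permissions driven by business rules can be pictured as a filter applied to search results before they reach the user. The following is a minimal sketch, assuming each document carries a sensitivity label and each user a set of roles. The rule table, role names and labels are hypothetical, and unknown labels are denied by default.

```python
# Hypothetical business rules: which roles may read each sensitivity level.
READ_RULES = {
    "public": {"employee", "manager", "hr", "finance"},
    "financial": {"finance", "manager"},
    "personnel": {"hr"},
}

def can_read(roles, sensitivity, rules=READ_RULES):
    """True if any of the user's roles may read this sensitivity level.

    Sensitivity levels missing from the rules are denied by default.
    """
    return bool(set(roles) & rules.get(sensitivity, set()))

def filter_results(results, roles, rules=READ_RULES):
    """results: list of (doc_id, sensitivity). Keep only what the user may see."""
    return [
        doc_id
        for doc_id, sensitivity in results
        if can_read(roles, sensitivity, rules)
    ]
```

Filtering before display, rather than after a user clicks a result, also keeps the titles of restricted documents from leaking into result lists.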

Scalability

Information systems grow quickly. The more successful they are, the more quickly they grow. What begins as a useful tool for a single department may eventually spread to an entire enterprise as well as to its customers and suppliers. Some information architectures are more scalable than others.

Conclusion

The ideal information system infrastructure does not exist. Enterprise needs vary, and so do vendor offerings. The information system is the foundation, infrastructure and heart of any enterprise portal, acting as the single point of access into the infrastructure. However, enterprises may need to add CRM or ERP systems, data mining systems and collaborative applications for conferencing and collaborative work environments.

The degree to which those elements can be integrated easily determines how extensible the system will be. It is vital that the eventual enterprise portal be presented to users as a single, integrated whole. It must be the only logical starting place for working within the enterprise so that pertinent information is found and used as a matter of course.

Brian McDonough is research manager for Knowledge Management and Intranet Strategies with IDC. He can be reached at bmcdonough@idc.com.

