Desperately seeking search
Whether on the desktop, the intranet or the Web site, tools that facilitate information finding have become indispensable. Most of us rely on search engines to tell us where our stuff is. Search engines connect customers, partners and employees to the information or the products that they need. Within most enterprises, information is continuously retrieved and then shuffled between software applications. Search and its allied access tools, then, are no longer a luxury for most organizations. Given the need, however, how are we to make sense of a crowded search market?
Unfortunately, no two search engines or information access platforms are exactly alike. Even among plain, unvarnished search engines, the retrieval technologies and ranking algorithms differ, some favoring one kind of search or collection, some another. Broad, shallow collections may need different tools from narrow, deep ones. Large or heavily used collections will have different demands for scalability, performance, categorization or interaction than those that are small or lightly used. It is advisable to start looking for a search engine or platform first by analyzing what you want it to do and whom you want it to serve.
Begin by considering the types of information-seeking tasks that the information access platform should support. They include browsing, retrieval, monitoring, discovery and analysis, distribution of information, and feedback.
Each of those tasks requires a slightly different type of software. For instance, browsing requires categorization and/or clustering so that users can explore a topic or collection without having to frame a query. Visualization is particularly helpful for browsing because images convey more information and convey it more quickly than text might. Visualization is also useful for interacting with the results of a search, or for exploring and analyzing large collections of data that are located in a business intelligence or text analytics discovery session.
Different tasks, different needs
Retrieval requires an explicit query, and is the most familiar content access technology--a standard search engine. Specialized forms of retrieval, such as question answering/online technical support, require language analysis and text mining, in addition to search. Categorization is an important element in sharpening retrieval, and it has been incorporated into most of the large information access platforms.
Monitoring is a type of retrieval. An ordinary search is fixed in time; to make it an ongoing process, the search must run continuously. Monitoring, also known as "alerting," matches a search against a continuous stream of new information. It pulls out the matches as they arrive and notifies the searcher that there's something new to look at. Once multiple alerts have been set, categorization and visualization become important features, so that the user can see all the new arrivals at a glance without having to check each topic separately.
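A rough sketch of alerting, under simplifying assumptions: each alert is a stored query whose words must all appear in an arriving document, and matches are grouped by topic so that new arrivals can be scanned at a glance. The names Alert, Monitor and run are hypothetical and do not come from any particular product; real systems use full query parsing and an index rather than plain word matching.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """A standing query: a topic label plus the words that must all appear."""
    topic: str
    terms: frozenset


@dataclass
class Monitor:
    """Holds the saved alerts (hypothetical class, for illustration only)."""
    alerts: list = field(default_factory=list)

    def add_alert(self, topic, query):
        # Store the query as a set of lowercase words.
        self.alerts.append(Alert(topic, frozenset(query.lower().split())))

    def check(self, document):
        """Return the topics whose words all occur in the new document."""
        words = set(document.lower().split())
        return [a.topic for a in self.alerts if a.terms <= words]


def run(monitor, stream):
    """Match each arriving document against every alert, grouped by topic."""
    hits = {}
    for doc in stream:
        for topic in monitor.check(doc):
            hits.setdefault(topic, []).append(doc)
    return hits
```

Grouping the hits by topic is the simplest stand-in for the categorization step: a searcher with many alerts reads one list per topic instead of re-running each search.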
Discovery and analysis is the newest and least familiar set of tasks to users of search engines. Sometimes called "text analytics," these applications are used to discover "who," "what," "where," "when" and "why," in order to analyze large volumes of text. They identify events, concepts, names of people, places and things, and the relationships among them. Text analytics is the corollary to data mining for unstructured information. Most recently, these technologies have begun to extract "sentiment"—whether a text passage is favorable or negative about the subject. Manufacturers use text-mining applications today to make sense of their large volume of e-mails and other customer interactions. They forage for patterns of customer complaints or suggestions, product failures or patterns in warranty claims. Sentiment analysis can help a manufacturer learn what customers think of its products or help a political party find out the effect of its messages. Government agencies use these applications to discover interactions among suspected criminals or terrorists. Financial institutions may use them for detecting money laundering.
Information seeking is almost always part of a larger task or process that requires moving the information around. Combined with collaborative tools, a good information access platform can facilitate the appropriate distribution of information. Once a useful fact or document is found, it can be distributed through e-mail, RSS feeds, blogs, newsletters, conferencing applications or Web sites.
In the feedback category, we place a growing number of system analysis tools that can be employed to monitor use of the system, discover problem queries or interactions, and then make suggestions on how to improve the system or its content in order to solve those problems. Feedback tools use the byproducts of information management and access, such as query logs, to improve the quality and performance of the system. Feedback is used to add missing content to a Web site, to aid in product design by notifying product managers of frequently requested features, or to redesign Web sites in order to get people to the information they are seeking more intuitively. These tools can also be used to upsell and cross-sell products that appear to be of interest to the same types of people, to suggest topics that may be of interest to users of the system, and to connect people with similar interests to each other within the enterprise.
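As an illustration of the query log idea, the sketch below counts the queries that repeatedly return nothing, which is one simple signal that content may be missing. The log format, a list of (query, result count) pairs, and the function name failed_query_report are assumptions made for this example; actual log formats vary by product.

```python
from collections import Counter


def failed_query_report(log, min_count=2):
    """Summarize queries that returned no results.

    log: iterable of (query, result_count) pairs (an assumed format).
    Returns (query, failures) pairs for queries that failed at least
    min_count times, most frequent first.
    """
    failures = Counter(
        " ".join(query.lower().split())  # normalize case and spacing
        for query, result_count in log
        if result_count == 0
    )
    return [(q, n) for q, n in failures.most_common() if n >= min_count]
```

A content manager could review the top of such a report to decide which topics to add, or which synonyms the engine fails to recognize.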
Spectrum of technologies
Like all the applications that make up access tools, these tasks have always been done successfully by people. However, the sheer amount of information has made it impossible for one person to manage and comprehend enough of it. These applications bootstrap human efforts.
In many ways, these tasks mirror the processes associated with business intelligence tools. One of the major trends we see today is the emergence of a single platform for both content and data. Business intelligence tools are converging with content intelligence, otherwise known as text analytics. Data and content can now be browsed, searched, monitored and analyzed from a single access platform. These platforms will become more prevalent and their functions more integrated in the near future. Some are already available.
It is apparent that search is no longer a single application. It is a spectrum of related technologies that work together to accomplish a wide range of information access tasks. The needs and tasks within one organization may differ between departments, and they certainly differ from one organization to the next. Search is tied to so many complex requirements that the only sure thing is that one size will never fit all. How then does one go about selecting appropriate information access software? Some of the variables to assess before looking at search and information access products include:
- Who are the users? Are they customers? Partners? Internal business users or IT staff? Researchers? Analysts? Marketing professionals? Sales?
Type and number of questions
- What kinds of questions will they ask? Do they need to browse through categories? Are sales being lost because customers aren't finding the products they seek? Do users want to find all the information in the system or just a quick answer? Do they need to search by field, perhaps with drop-down lists (for product categories, for documents by publication number or for flights by destination)?
- How many queries will be received in an hour or a day? If a system is good, it will be used far more than was predicted. Design for increased use and for more types of uses.
Information to be gathered, indexed and searched
- How many documents need to be indexed?
- What kinds of information will be searched for? What file formats need to be indexed? Text? Data in databases? Images? MP3 files? How about using search to index people so that they can be searched for as well?
- Where is that information located? On file servers? In content management repositories? In databases? In e-mail messages and attachments? On desktops? On Web sites? The search engine must be able to index and retrieve all of this information. Federated search is necessary if more than one repository is to be searched. Having separate rules for access or for relevance ranking by collection may be another requirement.
- At what rate are new documents or files added?
Information delivery
- In what format will the information need to be delivered? HTML? Plain text? PDF?
- Does access need to be secure? Should access be controlled by role or job title?
- Will remote access to the information be necessary?
- Will users need to receive output on multiple devices, including some with small screen formats, like mobile phones?
- Should the information be protected by digital rights management software to prevent it being copied?
- Tools to tweak relevance ranking to boost the "best" information on a topic to the top of the list, or to promote seasonal products.
- Tools for query log analysis in order to understand what users are seeking and to recommend changes or additions to content to satisfy high volume requests that are not being answered.
- Tools to monitor and optimize hardware and software performance, balance loads.
Every search engine on the market today does a creditable job of matching keywords to documents. Most base their estimate of relevance to a query on a complex mixture of the frequency of the term in the document, where that term appears, whether related terms also occur in that document, etc. But many have included additional features or technologies that go beyond basic matching. This next generation of search engines uses a variety of devices in order to improve the search process. Some of these features are:
- Categorization. By automatically categorizing and adding metatags for subject, date, author, etc. to each document, search engines today attempt to sharpen their results. Categorization adds weight to the major topics in the document, helping a search engine to distinguish between documents that focus on the topic and those that give it only a passing mention. Categorization enables browsing as well.
- Degree of interaction. Sometime in the not-too-distant future, a searcher will be able to have a semi-natural conversation with a search engine, explaining what is being searched for and why. Today, true conversational systems are still a dream, but some vendors are taking a stab at this kind of interaction. We know that users are often confused about how to frame a question or how (and why) the search engine has interpreted it. Search engines that reflect how they have interpreted a query help the user to understand whether the query needs to be rephrased or not.
- Entity extraction. Entities are names of people, places and things. Names are one of the major targets for searchers. Extracting these names and adding them to the metadata helps a search engine locate names accurately.
- Question answering. Specialized search engines used for online self-help, for instance, answer questions directly.
- Text analytics. The text equivalent of data mining, text analytics tools forage in a collection of text to find patterns, trends, emerging problems, risks, non-compliant e-mails, customer dissatisfaction, etc. Combined with business intelligence tools, they are transforming how some of the largest corporations understand and react to their customers.
- Price of the software and the implementation.
- Time from purchase to complete implementation.
- Security of content, of users' identities, of intellectual property.
- Need for reliable, constant access with no interruptions.
- Index freshness--Must new information be searchable immediately? How often is the index refreshed?
All of these last considerations will affect the number of servers required, as well as the system architecture. These and other questions that surface as you assess your information access needs become the foundation of a shopping list that will narrow down the candidates quickly. The next step is to investigate vendors.
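To make the earlier description of basic relevance estimation concrete, here is a minimal sketch that mixes the three signals mentioned: how often a term occurs, where it appears, and whether related terms also occur. The weights (title_boost, related_weight), the related-terms table and the function names are all assumptions chosen for illustration; commercial engines use their own, far more elaborate, formulas.

```python
def relevance(query_terms, title, body, related=None,
              title_boost=3.0, related_weight=0.5):
    """Score one document against a query (illustrative weights only).

    related: optional dict mapping a lowercase query term to a list of
    lowercase related terms.
    """
    related = related or {}
    title_words = title.lower().split()
    body_words = body.lower().split()
    total = len(title_words) + len(body_words)
    if total == 0:
        return 0.0
    score = 0.0
    for term in query_terms:
        term = term.lower()
        # Frequency of the term, with occurrences in the title counting more.
        score += title_boost * title_words.count(term) + body_words.count(term)
        # Related terms add a smaller amount of supporting evidence.
        for rel in related.get(term, ()):
            score += related_weight * (title_words.count(rel) + body_words.count(rel))
    # Divide by length so long documents are not favored automatically.
    return score / total


def rank(query_terms, documents, related=None):
    """documents: list of (title, body). Return matching titles, best first."""
    scored = [(relevance(query_terms, t, b, related), t) for t, b in documents]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [t for s, t in ordered if s > 0]
```

Changing the weights changes the ranking, which is one reason two engines can return the same documents in a different order.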
The search market in 2005
The search market today is confusing and crowded. That fact makes it hard to select a search engine, but it also favors bargain hunters. We can divide the participants—and there are literally hundreds of them—into a few large groups, with vendors listed alphabetically:
- Major players. Autonomy, FAST and Verity are the largest players in terms of software revenue.
- Specialized applications for e-commerce, online self-help, compliance, text analytics, etc. (see the list below).
Each search engine on the market today offers a different set of features. Many have specialized applications as well. Here are a few, with apologies to those I have omitted. Note that the larger search vendors can do many of these, but I have listed those that are product differentiators. Listed in alphabetical order within each category:
- Search appliances: Low-cost, generally small-scale, plug-and-play application, embedded in its own hardware. Can scale up by adding additional boxes: Google, Search Cacher, Thunderstone.
- Hosted search: Atomz, CrownPeak, FAST.
- Categorization and specialized taxonomy building tools: Autonomy, Inxight, Verity.
- Desktop search: Apple, Autonomy, blinkx, Copernic, dtSearch, Enfish, Google, ISYS, Lycos, MSN, Open Road Technologies (now Intellext), Tukaroo (bought by Ask Jeeves), X1 (x1.com), ZyLAB.
- Platform for information access: Autonomy, FAST, IBM.
- E-commerce search: EasyAsk, Endeca, Siderean.
- Question answering/technical support: InQuira, iPhrase.
- Search add-ons and spin-offs:
1. Visual interface plus search and categorization platform: Autonomy, Inxight.
2. "Search derivative applications": FAST (market intelligence, publishing, localized search for yellow pages, mobile search).
3. Call center: Autonomy's Audentify with speech-to-text technology.
4. Compliance: Autonomy's Aungate.
5. Rich media search: Autonomy, FAST, LTU, Nexidia.
6. Business intelligence for text: ClearForest, Endeca.
7. Document imaging plus search: ZyLAB.
Determining which of these features are necessary, which are superfluous, and which are merely nice to have, will help to frame the description of requirements for search software. As with anything related to search today, we have ended up with more questions than answers. But asking the right questions, as any information professional can tell you, is more than half the battle.
Susan Feldman is research VP, Content Technologies, IDC (idc.com), e-mail firstname.lastname@example.org.