Universal, federated or unified search in the land of information silos

This article appears in the issue January 2015, [Vol 24, Issue 1]
Page 1 of 2 next >>

You talk to your mobile phone to search, don’t you? My wife, who is no technology lover, does. Our 75-year-old neighbor thinks Siri, Apple’s voice search system, is a real person. Ask a question on a voice search-equipped Android phone, and you can get specific driving directions to the closest pizza parlor or gas station. The future has arrived—or has it?

Innovative technology performs speech-to-text conversion. The system makes use of information provided by the device—for example, the user’s location. The search system may look at the user’s automatically generated profile of past behavior. The query is passed against a subset of indexed content narrowed and ranked based on the metadata gathered by the search system. The magic of voice search boils down to orchestrated operations and a method for providing content from a relatively modest corpus. The system does not tap “all” of the information available in multiple indexes. Siri ignores some indexed information completely. Who wants a video of a pizza? If I’m looking for pizza, I want a pizza. In an organization, the businessperson wants an answer, not a list of possible answers.
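The narrowing and ranking step described above can be sketched in a few lines of Python. This is a toy illustration, not how Siri or any Android voice search actually works: the `Listing` class, the `voice_search` function, the flat map coordinates and the sample businesses are all hypothetical. It assumes the speech has already been converted to text and that the device has supplied the user's position.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Listing:
    name: str
    category: str  # e.g. "pizza" or "gas"
    x: float       # toy map coordinates, not real latitude/longitude
    y: float


def voice_search(query_text, user_xy, corpus, limit=3):
    """Narrow the corpus by the query terms, then rank by distance to the user."""
    terms = set(query_text.lower().split())
    # Narrow: keep only listings whose category was mentioned in the query.
    candidates = [item for item in corpus if item.category in terms]
    # Rank: closest listing first, using the position reported by the device.
    ux, uy = user_xy
    candidates.sort(key=lambda item: hypot(item.x - ux, item.y - uy))
    return candidates[:limit]


corpus = [
    Listing("Tony's Pizza", "pizza", 1.0, 1.0),
    Listing("Slice House", "pizza", 5.0, 5.0),
    Listing("Quick Fuel", "gas", 0.5, 0.5),
]
results = voice_search("closest pizza", (0.0, 0.0), corpus)
print([r.name for r in results])
```

Note that the gas station never appears in the results even though it is the nearest listing. The system answers from a small, filtered slice of what it has indexed, which is the point of the paragraph above.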

Information access and retrieval within most organizations is a work in progress. How many search systems does your organization make available? There might be a general search system for marketing information, and probably one or more database search systems. The larger the organization, the greater the number of information retrieval systems. Each laptop and mobile device has a search system. Mobile phone apps sport their own search systems. The lawyers in an organization may have different search systems for specific types of legal matters. The enterprise resource planning (ERP) users have a search system. When it comes to enterprise search, silos abound.

A “silo” is a content collection available to certain users. Who feels comfortable with employee health and salary data running on the search system in the sales department? In the face of the reality of silos, vendors of information access systems beat their drums loudly for the impractical idea of providing access to “all” information. “All” may not mean all or even some available information. Big data is easy to talk about but difficult to make accessible. The same challenge arises for images, audio recordings and engineering drawings with details tucked into the proprietary system’s database.
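One way to picture a silo boundary inside a shared search system is as a filter applied to every results list. The sketch below is an assumption-laden illustration, not a description of any vendor's product: the `allowed_groups` field, the `visible_results` function and the sample documents are hypothetical.

```python
def visible_results(results, user_groups):
    """Keep only the hits that share at least one group with the user."""
    return [hit for hit in results if hit["allowed_groups"] & user_groups]


results = [
    {"title": "Q3 sales forecast", "allowed_groups": {"sales", "finance"}},
    {"title": "Salary bands", "allowed_groups": {"hr"}},
]
sales_view = visible_results(results, {"sales"})
print([hit["title"] for hit in sales_view])
```

A user in the sales group sees the forecast but not the salary data, so "all" information is already out of reach before any question of cost or file format comes up.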

Early information retrieval

Archive.org consists of more than 10 petabytes of data. The content includes e-books, videos and digital images. A leader of the Internet Archive is Brewster Kahle, who developed the Wide Area Information Server (WAIS) and sold the server technology to AOL. WAIS was the precursor for information retrieval systems that could use metadata to permit such functions as faceting—that is, a hot link that directs the user to related content.

The Archive.org home page uses an approach to content that is not federated. Archive.org implements what design experts call infinite scrolling. On the right-hand side of the display, the interface offers facets (hot links) that allow the content to be narrowed by text, audio, movies and software. Below the facets, the interface displays a list of topics such as television, U.S. patents and radio. The metadata functionality pioneered by Kahle and his colleagues at Thinking Machines Corporation in the 1980s is integrated, in part, in the Archive.org interface.

What about the search box? I entered the query “knowledge management” in the search box. The results list presents links on an infinite scrolling page.

The approach provides a version of what is variously called universal, unified or federated search. The term metasearch is often used to describe an integrating function that passes the user’s query across discrete content indexes and returns a single results list to the user. Endeca, Inxight Software, Northern Light, Sagemaker and Vivisimo were content processing vendors that helped promote this approach to information retrieval in the enterprise. Endeca trademarked “Guided Navigation” and enjoyed a marketing advantage until the phrase “faceted navigation” gained currency.
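The integrating function can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by any of the vendors named above: each index is a plain dictionary, relevance is a simple count of query terms, and scores are scaled against the best hit in each source so that results from different indexes can be compared. The names `search_index` and `metasearch` and the sample documents are hypothetical.

```python
def search_index(index, query):
    """Return (doc_id, score) pairs; score counts query-term occurrences."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in index.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((doc_id, score))
    return hits


def metasearch(indexes, query):
    """Pass one query across every index and merge the hits into one ranked list."""
    merged = []
    for source, index in indexes.items():
        hits = search_index(index, query)
        if not hits:
            continue
        # Scale each source against its own best hit so scores are comparable.
        top = max(score for _, score in hits)
        for doc_id, score in hits:
            merged.append((score / top, source, doc_id))
    # Highest score first; ties are broken by source name, then document id.
    merged.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    return merged


indexes = {
    "marketing": {"m1": "knowledge management brochure", "m2": "pricing sheet"},
    "legal": {
        "l1": "contract management policy",
        "l2": "knowledge management knowledge audit",
    },
}
results = metasearch(indexes, "knowledge management")
for score, source, doc_id in results:
    print(f"{score:.2f} {source} {doc_id}")
```

Even in this toy version, the hard part is visible: each source scores documents in its own way, so merging requires a decision about how to make the scores comparable.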

A key development was the use of the word “discovery” to explain how the supplemental hot links assisted a person looking for information. The initial query might not unlock the information stored in the system’s index. The facets, topics and suggestions made it easy for the user to click through the links without having to craft additional queries.
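Facets of this kind can be computed from the metadata attached to each result. The sketch below assumes every result carries a `type` field; the field names, the functions `facet_counts` and `narrow`, and the sample records are hypothetical and are not drawn from Archive.org or any vendor's system.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of a metadata field."""
    return Counter(doc[field] for doc in results if field in doc)


def narrow(results, field, value):
    """Simulate a click on a facet link: keep only the matching results."""
    return [doc for doc in results if doc.get(field) == value]


results = [
    {"title": "KM overview", "type": "text"},
    {"title": "KM lecture", "type": "audio"},
    {"title": "KM panel", "type": "movies"},
    {"title": "KM handbook", "type": "text"},
]
facets = facet_counts(results, "type")
text_only = narrow(results, "type", "text")
print(dict(facets))
print([doc["title"] for doc in text_only])
```

The user narrows the list by clicking a value rather than by writing a second query, which is what the word "discovery" was meant to convey.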

Formats galore

Behind the curtains, federating search results requires some housekeeping. A user does not want to know the file format in which the information he or she needs is stored. The user wants answers. Early federating systems like WAIS relied on standards for content representation. Today, however, there are many “standards,” and content processing systems must be able to process content in the hundreds of formats found in organizations. In addition, each Internet service that gains popularity may introduce a specialized format.

In the intelligence community, the ANB or Analyst’s Notebook format is the property of IBM i2. Google uses different file formats for its content. Search results may be a variant of XML, but Google Maps outputs are presented in KML (Keyhole Markup Language). To deliver a “universal search” experience, Google has elected to provide basic information about content in Google Web search. To dig into a particular content domain, Google requires the user to navigate to a specific collection of information and run the query there. The “universal” in Google Universal Search is a fiction, not a practical reality.

If you want to run a query for Google News, you navigate to the Google News page and run the query in the search box on that page. To get access to the blog content that is part of Google News, you need to run a query, look for the option “Search tools,” click it, then click on “All news” on a secondary navigation bar, and finally select “Blogs.” The same situation arises when one searches for images, videos, scholarly books, journals, patents and other collections of content on Google. In my view, no Universal Search is available via Google. By extension, the Google Search Appliance shares the same characteristics, except that the licensee has to specify what textual and data content will be processed. Images and videos, engineering drawings, maps and other content types are not supported.

Most enterprise information retrieval vendors are in the same boat. Talk about unified, federated and metasearch is much easier than delivering a system that makes an organization’s disparate types of digital content available. Most vendors’ systems exclude video streams from the index. If video is indexed, the system processes only the text included in the digital file or the indexing provided by the video owner. If Google cannot crack the “universal search” nut, how can a non-governmental organization, a midsize insurance company, a local hospital or a state university?

Four barriers

The barriers to unified, federated or integrated search are high. These are what I call silo force fields. Here are four that some organizations face:

  • Some digital content cannot be included in a general purpose search system for security, business or legal reasons.
  • Technical content such as chemical structure information at a pharmaceutical company requires special purpose systems. The same specialist need applies to product manufacturing data, legal information and engineering drawings.
  • The cost of creating connectors to hook into certain content types is too great, or license fees are required to gain access to the file formats. i2 sued Palantir over that type of file format access issue in 2010.
  • The computational burden required to process certain types of content exceeds the organization’s ability to fund the content processing. Big data, for example, requires a computing capability able to handle the Twitter stream, RSS feeds and telemetry data from tracking devices. Cost shuts the door on the “all” concept quickly.
