By Steve Barth
Last month I talked about the importance of retrieving explicit knowledge by starting with personal documents and spiraling out through community and corporate repositories before heading for the information outback of the Internet. Ultimately, however, that’s where many searches will still wind up. While there may be a higher knowledge-value-per-page to your own documents, the sheer size of the World Wide Web raises the probability that you will find some valuable new knowledge among the trillions of publicly accessible pages posted to date.
Of course, the key to realizing that value without unreasonable investments of time and energy is the efficiency and effectiveness of your Web searches. Getting the right answers depends on asking the right questions.
Not all Web content is created equal
Web searching starts with understanding how the Web works, especially in terms of what you are looking for. For this purpose, the University at Albany has a nice, concise guide to Internet research at http://library.albany.edu/internet.
Google claims to be the largest search site, with more than 2 billion pages indexed. As we’ll see, that’s just the tip of the iceberg, but let’s start with the limitations of even the best online search tools.
There are huge differences in the accuracy, objectivity and relevance of Internet documents. The Internet is very different from a bookstore, newsstand or public library, where much of the information has at least been vetted in some way by professional processes. Major newspapers still require reporters to confirm facts; academic authors go through the peer review process. On the other hand, anyone can self-publish professional-looking pages to the Internet. Appearances can be deceiving as to the quality of information, even without considering the deliberately misleading or misrepresented material that causes occasional mischief.
The University of Florida recommends starting with some basic questions to evaluate search results:
- Can you vet the author or creator of the page?
- Is the page on a Web site published by a reliable source?
- What are the biases and motivations of the publisher?
- How well is the site maintained and how often is it updated?
- Is the information current (if you need current information)?
The depths of cyberspace
The Web might seem infinite, but there are limits to what any search engine can retrieve. Even the best individual Web crawlers, such as Google or FAST’s AlltheWeb, index only a portion of the available Web. They are often limited to pages hyperlinked from other pages; they only index a portion of the pages on each site; and they all struggle valiantly to keep up with the millions of new pages posted daily.
Comprehensive Internet research means tapping multiple search sources. Instead of going from one search engine to another, this can be accomplished more easily using “metasearch” sites, such as Dogpile or Webcrawler, that tap into multiple search engines from a single query. Desktop metasearch applications such as those listed below can add useful functionality to the process.
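The core metasearch idea, one query fanned out to several engines with duplicate hits consolidated, can be sketched in a few lines of Python. To be clear, this is my own illustrative sketch, not code from any of the products mentioned here: the engine functions are stand-ins returning canned results, where a real tool would issue an HTTP request to each engine and parse the result pages.

```python
# Illustrative metasearch "fan-out" sketch. The engines below are stand-ins
# returning canned hit lists; a real metasearch tool would query each
# engine's search URL over HTTP and parse the results it gets back.

def engine_a(query):
    return ["http://example.com/km", "http://example.org/search-tips"]

def engine_b(query):
    return ["http://example.com/km/", "http://example.net/deep-web"]

def normalize(url):
    # Treat trailing-slash and case variants as the same page.
    return url.rstrip("/").lower()

def metasearch(query, engines):
    seen = set()
    merged = []
    for engine in engines:
        for url in engine(query):
            key = normalize(url)
            if key not in seen:  # consolidate multiple hits for the same page
                seen.add(key)
                merged.append(url)
    return merged

results = metasearch("knowledge management", [engine_a, engine_b])
```

Because the two engines both return http://example.com/km (one with a trailing slash), the merged list contains it only once; that consolidation step is what separates a metasearch from simply pasting the same query into several browser tabs.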
But there is still more to the Net, because not every document on the Web can be indexed at all. Many pages are dynamically generated from databases that have their own search protocols. Because that information is invisible to most search engines, it has been called the “Invisible Web.”
BrightPlanet estimates that what it calls the “Deep Web” is 500 times larger than the surface Web and is growing at a much faster pace. But because Deep Web content is narrower in focus and purpose, its quality can be much higher. Among the hundreds of thousands of database-driven sites are some of the most valuable and authoritative sources on the Web: e-commerce sites, news publications, message boards, discussion groups and more. BrightPlanet, Intelliseek (intelliseek.com) and others make such content accessible by writing specific query tools for each source.
For power research, metasearch applications offer elaborate toolsets for accessing and processing information on the World Wide Web. There are dozens of applications in the metasearch category, but the best offer a blend of efficiency, effectiveness and useful extras:
- Query multiple search engines with a single click;
- Consolidate multiple hits for the same page;
- Set up trackers that update search results automatically and periodically;
- Verify URLs and remove dead links;
- Provide built-in viewers with search terms highlighted;
- Include deep/invisible Web sources in searches;
- Handle complex and natural language queries;
- Offer advanced filtering;
- Summarize retrieved pages;
- Allow offline viewing of downloaded or captured pages.
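Two of those features lend themselves to a brief sketch: pruning dead links and tracking what is new between runs. Again, these are minimal illustrations under function names of my own invention, not code from any product listed below; a real link checker would issue an HTTP HEAD request per URL instead of the pluggable test function used here.

```python
# Minimal sketches of two metasearch features; the function names are
# illustrative, not any vendor's API.

def prune_dead_links(urls, is_alive):
    # "Verify URLs and remove dead links": keep only hits that still respond.
    # is_alive is a pluggable checker; a real tool would send an HTTP HEAD
    # request per URL and drop those that fail or return an error status.
    return [u for u in urls if is_alive(u)]

def tracker_diff(previous_hits, current_hits):
    # "Trackers" in sketch form: re-run the saved search later and report
    # only the hits that were not present the last time.
    seen = set(previous_hits)
    return [u for u in current_hits if u not in seen]

# Example with a stand-in checker that "knows" which links are dead:
status = {"http://example.com/a": True, "http://example.com/gone": False}
live = prune_dead_links(list(status), lambda u: status[u])
```

The tracker idea is just set difference against the previous run's results, which is why metasearch tools can schedule it cheaply in the background.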
Based on what has worked best for me, BullsEye has consistently been my personal preference among metasearch tools. But in reviewing the latest versions of these and other applications, I found merits to each.
- BullsEye Pro from Intelliseek (intelliseek.com), $49 to $199;
- Copernic Pro from Copernic (copernic.com), free to $79;
- LexiBot from BrightPlanet (lexibot.com), free to $289;
- NetBrilliant from Tenebril (netbrilliant.com), free.
Because I’ll usually test the waters with a quick search on Google or AlltheWeb, I like to set my metasearch parameters broadly. That results in searches that take longer and return more hits than some might have patience for, but hey, go get a cup of coffee. I find that human filtering, done while drinking that coffee, is still the fastest and most effective way to crunch through the last 100 hits or so.
Don’t forget that the Web is also a source of human knowledge. Searching documents and discussions doesn’t just lead you to the words; it can frequently also link you to the people who wrote them.
Steve Barth writes and speaks frequently about KM; e-mail firstname.lastname@example.org. For more on personal knowledge management, see his Web site, global-insight.com.