A critical evaluation of MOSS search

In other words, you’d want to commit to the full MOSS 2007 platform to employ the broadest of the three search services.

SharePoint Server for Search, as the most standalone offering, is a product that could theoretically provide you with a relatively easily deployable, Microsoft-based solution. It requires a Microsoft Windows 2003 server, but no separately licensed products. In theory, it could be deployed as an appliance--a prepackaged Windows blade server with SharePoint Server for Search installed, ready to be configured from the Web interface. In fact, a U.K.-based company announced the "Scan Orange Spider" as a direct Google competitor. The main problem there would be licensing (which is so complicated you’ll have to check with a specialist to see whether your existing licensing would cover such an appliance), though, in theory, a SharePoint Appliance could be more capable than Google’s boxes, mainly in support for security features.

Technology

A diagram of SharePoint Search is easily drawn, and will display the main ingredients of any search engine: Content sources are crawled, indexed and stored in an index; then searchers’ queries will be passed to the query engine, which in turn will access the index and return results. As with most of the infrastructure vendors, however, reality isn’t that simple, and Microsoft draws on a sizable number of "generic" modules also seen elsewhere in its product lineup.
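The crawl, index and query flow described above can be illustrated with a toy inverted index. This is a minimal sketch of the general technique, not SharePoint's implementation; the document ids, sample text and function names are all hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query(index, terms):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(t.lower(), set()) for t in terms.split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical content standing in for crawled documents.
documents = {
    "doc1": "quarterly sales report",
    "doc2": "sales team contact list",
    "doc3": "holiday schedule",
}
index = build_index(documents)
results = query(index, "sales report")
```

The query engine never touches the source documents here; it only consults the index, which is why the index has to be rebuilt or updated whenever content is crawled again.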

As with Oracle SES, SharePoint Search does a fairly good job of installing everything you need for search. On the one hand, it depends on many more pre-installed server components (Microsoft’s IIS Web server, .NET Framework and Windows Workflow Foundation, to name but a few). On the other hand, Microsoft has, over the years, also managed to make those components more "generic" and logically interconnected than Oracle has. Oracle SES is easier to install as a standalone, closed box than SharePoint Search, but once the size of the implementation increases beyond the departmental level, that isn’t the main criterion anymore. Expertise in the underlying platform will come into play, and that is more a matter of what you or your integrator are comfortable with while keeping the engine humming at higher revs.

Scaling a SharePoint Search implementation is not a difficult process; the basic install assumes you’ll be running the crawling, indexing and query engines on the same server, but it’s relatively easy to divide it into crawling/indexing on one server and index/query on another without too much detailed knowledge of the platform. With more expertise, it’s possible to set up a completely custom architecture, running separate components (such as Web interfaces) on server farms. But if you limit it to clustering query servers, the process is relatively straightforward. The search service will continually propagate its index to all query servers when content is crawled and indexed.

Support for security features is fair. SharePoint will cache access rights tokens while crawling, and filter the results according to authorization at query time (early-binding security). That is, of course, mostly limited to Windows authorization and access schemes. Lotus Notes ACLs will have to be mapped, as will rights to business data sources. In that respect, it is comparable to Coveo’s features, and surpasses the rather weak HTTP mechanism Google’s Appliance will employ, although it remains no match for Oracle’s early- and late-binding capabilities or Autonomy’s advanced options.
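To make the early-binding idea concrete, here is a minimal sketch of filtering results against access rights cached at crawl time. It assumes a simple group-based model; the file names, group names and the `trim_results` helper are hypothetical and do not reflect SharePoint's actual data structures.

```python
# Hypothetical ACLs captured at crawl time: document id -> groups allowed to read it.
crawled_acls = {
    "budget.xls": {"finance"},
    "handbook.doc": {"finance", "engineering", "sales"},
    "roadmap.ppt": {"engineering"},
}

def trim_results(hits, user_groups, acls):
    """Keep only hits whose cached ACL shares at least one group with the user."""
    return [doc for doc in hits if acls.get(doc, set()) & user_groups]

hits = ["budget.xls", "handbook.doc", "roadmap.ppt"]
visible = trim_results(hits, {"sales"}, crawled_acls)
```

Because the rights are read from the cache rather than from the source system, a change in permissions would only show up in results after the next crawl. A late-binding approach checks the source at query time instead, at the cost of slower queries.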

Content collection

MOSS, by default, can crawl SharePoint sites, Web sites, file shares, Exchange public folders, Lotus Notes and Business Data (in the Enterprise Edition only). Setting up content sources is a fairly straightforward process if your requirements are basic; getting a crawl of a Web site going is a matter of specifying the URLs to start crawling, and the depth and the number of sites to traverse. "Crawler impact rules" will help restrict the load the process puts on source content servers.
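The effect of a depth setting on a Web crawl can be sketched as a breadth-first traversal that stops following links past a given level. This is an illustration of the general idea only; the link graph, URLs and `crawl` function are hypothetical, and no network access is made.

```python
from collections import deque

# Hypothetical link graph standing in for real pages.
LINKS = {
    "http://intranet/": ["http://intranet/hr", "http://intranet/it"],
    "http://intranet/hr": ["http://intranet/hr/policies"],
    "http://intranet/it": [],
    "http://intranet/hr/policies": ["http://intranet/hr/policies/travel"],
    "http://intranet/hr/policies/travel": [],
}

def crawl(start_urls, max_depth):
    """Breadth-first traversal that does not follow links beyond max_depth."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # Page is collected, but its links are not followed.
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

pages = crawl(["http://intranet/"], max_depth=2)
```

With a depth of 2, the travel policy page three clicks from the start URL is never reached, which is the kind of gap worth checking for when a crawl returns fewer documents than expected.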

SharePoint becomes a lot harder to configure if there’s anything more specific you’d want to define. For example, if you want to authenticate the crawler with anything other than NTLM security (so you can spider external password protected files), using forms or cookie-based authentication is possible--but you’ll have to use a command line tool, which imports an XML configuration file, to define the source.

SharePoint sites, of course, are assumed to be the main source of information, and the instance you’ll be running search on will be added as a default source. Interestingly, though, the crawler doesn’t connect directly to the content database, but instead crawls the front-end Web servers. In a large SharePoint implementation, that could put a huge load on your Web servers, and in those circumstances, Microsoft recommends you set up a dedicated Web site front end to crawl.

In converting source documents, Microsoft is one of the few vendors CMS Watch evaluates that supplies its own technology, "IFilters." (Other vendors will mostly either use Stellent’s--now Oracle’s--filters or the KeyView filters now owned by Autonomy.) The default set of IFilters supports 50 different document types (including most common file types and, unsurprisingly, most of Microsoft’s own file formats). The same technology is used in Vista’s embedded search. (If you’re running Vista on your computer, you can have a look at the filters to see how they perform.) If you want to convert exotic file types, some third-party filters are available or you could develop your own. That isn’t particularly difficult to accomplish, but will, of course, require familiarity with both Microsoft’s environment and the format of the files (which might be hard to discover). You should carefully audit your content for such exotica before committing to SharePoint search.

Query processing

SharePoint Search’s query processing is basic, even more so than with Oracle SES. The "+" and "-" operators and quoted phrases are supported, but even wildcard search is lacking [which has allowed Mondosoft to create the free "Ontolica Wildcard" product, adding just that, to lure customers to its full "Ontolica for MOSS 2007" product]. That, of course, makes sense within MOSS 2007 (where keyword search is just one of many ways of navigating the SharePoint environment), but might justify looking into alternative search engines if you find your users are prone to precisely constructing their requests.
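A query syntax limited to "+", "-" and quoted phrases is simple enough to sketch in a few lines. The parser below illustrates that style of syntax in general; it is an assumption about how such queries could be split, not a description of SharePoint's own parser, and the `parse_query` name is hypothetical.

```python
import re

# An optional +/- sign followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into required, excluded and optional terms or phrases."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # Skip empty quoted phrases such as "".
        bucket = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[bucket].append(term)
    return parsed

parsed = parse_query('+sharepoint -"lotus notes" search')
```

Note that nothing in this sketch expands a trailing "*": a term such as "share*" would simply be kept as a literal string, which mirrors the missing wildcard support discussed above.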

 

 
