Roundtable discussion: Enterprise search, Part 1
KMWorld recently hosted a roundtable discussion that focused on enterprise search. Led by KMWorld senior writer Judith Lamont, the roundtable included Eileen Quam, information architect at the Minnesota Office of Technology; Hadley Reynolds, VP and director of research at Delphi Group; and Andy Feit, senior VP of marketing at Verity.
Lamont: How did the use of a search engine evolve for the North Star Web site?
Quam: In 1998, I was working in the Department of Natural Resources (DNR) on a project called the Foundations Project to bring the Web sites for Minnesota's 13 environmental agencies through one gateway. We purchased the Ultraseek search engine, at that time owned by InfoSeek. The Office of Technology (OT) needed a search engine to use across all the Minnesota Web sites, and after discussions with staff members involved in the Foundations Project implementation, decided to expand the Ultraseek license. When I moved to the Office of Technology several years ago, we began working on the North Star portal. We integrated Verity's Ultraseek into the portal and went live in 2002.
Lamont: How many agencies are represented in North Star?
Quam: We have 240 agencies in the state, and every agency has its own Web site. Some have more than one, in fact, so we have a total of 560 Web sites. Minnesota is a very decentralized state. The North Star portal is now the only door into all those agencies, but most of them are not yet hosted on the portal. A couple of them are, but most of them are managed separately by their own agencies.
Lamont: Was there a particularly convincing argument for integrating the Ultraseek search engine into the portal?
Quam: We were beta testers for the Verity Content Classification Engine or CCE, which organizes content into browsable topics. As soon as CCE was released, we implemented it on the DNR sites. So we had a taxonomy early on, before the portal. Then the North Star portal came into being, with its content management system and all the personalization that portals offer, and the site became much more mature. As with most content management systems, it had the ability to create taxonomies within the portal software. However, the OT soon realized that creating a taxonomy for the portal would be very labor-intensive. The upkeep of a portal version of a taxonomy also can be demanding. If links are broken, they don't get repaired until someone goes in and fixes them. And we already had a taxonomy from CCE, so I made a pitch to integrate the search engine into the portal.
Lamont: What was the implementation like?
Quam: It was very simple. We downloaded the APIs from the Verity Web site, and the integration was very rapid. The search engine is integrated into the portal to the point where I don't think the user knows what is search-driven and what is content management-driven. All topics on the site are run by the search engine, by the content classification engine. Our links are always current, because the search engine doesn't show anything that's a broken link. All the topics are dynamically populated so that when new pages go up we don't have to link them to the topics--that happens automatically.
Lamont: How does new data get tagged and added to the site?
Quam: We use the Dublin Core metadata here in the state, and our CCE rules leverage that. When I put in a new topic, I identify the terms that should fit in. The Web page creators put the terms on their pages, which classifies the documents into the correct topics.
In the search results, we can identify starred sites--that's part of CCE, to ensure that the most relevant sites always show up at the top of the results when people click on topics. If you click on nearly any topic in our structure, you will see a starred site that is hardwired, linked directly, to make sure that the most important pieces are always on top. The rest are metadata-driven. We also have quick links that take users directly to information.
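As a rough illustration of the approach Quam describes, the sketch below places pages into a topic by matching their subject metadata against the topic's terms, and lists hand-picked "starred" URLs first. It is a minimal sketch under stated assumptions, not Verity's CCE rule syntax or API: the `Page` and `Topic` classes, the `topic_listing` function and the example.org URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Page:
    url: str
    subjects: Set[str]  # assumed: terms taken from the page's Dublin Core "subject" metadata


@dataclass
class Topic:
    name: str
    terms: Set[str]  # controlled-vocabulary terms that place a page in this topic
    starred: List[str] = field(default_factory=list)  # hand-picked URLs pinned to the top


def topic_listing(topic: Topic, pages: List[Page]) -> List[str]:
    """Return the URLs shown for a topic: starred sites first, then metadata matches."""
    matched = sorted(
        p.url
        for p in pages
        if p.subjects & topic.terms and p.url not in topic.starred
    )
    return list(topic.starred) + matched


# Hypothetical data for illustration only.
pages = [
    Page("https://example.org/dnr/lake-maps", {"lakes", "fishing"}),
    Page("https://example.org/dnr/fishing-licenses", {"fishing", "licenses"}),
    Page("https://example.org/revenue/tax-forms", {"taxes"}),
]
fishing = Topic("Fishing", {"fishing"}, starred=["https://example.org/dnr/fishing-home"])
listing = topic_listing(fishing, pages)
```

Because the listing is computed from the metadata each time, a newly published page tagged with "fishing" would appear under the topic without anyone linking it by hand, which is the behavior Quam describes.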
Lamont: How does Ultraseek handle structured data?
Quam: Ultraseek accesses both structured and unstructured data. For example, we have a lot of GIS information in the state of Minnesota. The GIS data has its own metadata schema and databases. If a user searches for a piece of geographical information, the results will show unstructured pages on that topic, but will also get the user to the databases with the structured information.
Feit: I think that's an important point. Your users, especially in this kind of environment--a constituency support environment--don't know if they need to look in a database or full text. They want a piece of information, but there is no way for them to know how it might be stored or which tool they are supposed to use to get it. Putting it all behind one search box or a browse tree is a way to enable users who aren't familiar with the contents to just get to whatever they need.
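The "one search box" idea can be sketched as a single function that queries both full-text pages and a database and merges the results. This is only an illustration under assumptions: the `lakes` table, its columns, the function names and the sample data are hypothetical, and a real engine would rank results rather than simply concatenate them.

```python
import sqlite3
from typing import Dict, List


def search_pages(pages: Dict[str, str], term: str) -> List[dict]:
    """Naive full-text match over unstructured pages (url -> page text)."""
    needle = term.lower()
    return [
        {"source": "page", "ref": url}
        for url, text in sorted(pages.items())
        if needle in text.lower()
    ]


def search_records(conn: sqlite3.Connection, term: str) -> List[dict]:
    """Match structured rows in a hypothetical 'lakes' table by name."""
    rows = conn.execute(
        "SELECT name, county FROM lakes WHERE name LIKE ? ORDER BY name",
        (f"%{term}%",),
    ).fetchall()
    return [{"source": "database", "ref": f"{name} ({county})"} for name, county in rows]


def unified_search(pages: Dict[str, str], conn: sqlite3.Connection, term: str) -> List[dict]:
    """One entry point: the user never has to choose between full text and database."""
    return search_pages(pages, term) + search_records(conn, term)
```

The caller passes one search term and gets back results labeled by where they came from, so the choice of storage stays hidden from the user.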
Lamont: What developments have helped move search technology forward?
Reynolds: There has been a tremendous amount of progress in search technology over the past five or six years. Presenting a raw results list that is not sorted by relevance and just relies on keyword search is a thing of the past, as far as enterprise search is concerned. The most important contributor to moving beyond the sorry state of having to read through a hundred results is reflected in Eileen's title of "information architect." The idea that information architects look at things like standards for classification or metadata, as with the Dublin Core that she mentioned, is critical to presenting meaningful search results.
Lamont: How are standards making an impact?
Reynolds: Managing metadata in an intelligent way and looking across the organization, not from just one business unit or governmental department's perspective, but looking at it from the enterprise as a whole, is absolutely key. It enables organizations to use quick links and to promote particular kinds of content when they want users to see suggested responses rather than just lists of results. The combination of organized metadata with taxonomies and classification schemes supports a cross-departmental view of information. From the user's perspective, it makes the whole access mechanism transparent. All of that rolls up into a much more attractive user experience than was available just a few years ago.
Lamont: Can you describe the actions that a typical user might go through in looking for an environmental regulation, or accomplishing a business task?
Quam: Across the top of the North Star home page are what we call "themes," such as education, transportation and natural resources. Along the left side are online services such as vehicle registration and business tax filing. Along the right side are the quick links. From the home page, for example, the quick links take users to travel information, a list of elected officials and so forth. Each theme contains its own mix of services, topics and quick links. Users can search or use the links. The search engine operates on the topical, subject-oriented information. The portal is designed for the users, rather than according to how the government is laid out. If someone wanted to find out information about soil pollution, they would not realize that they needed to go to four different agencies, but the search engine will help them find the right information.
Lamont: You mentioned that you already had a taxonomy when the portal was implemented. How was this developed?
Quam: I am a librarian by training, and my specialty is organization of information, so indexing, cataloging, taxonomy creation and thesauri are part of my background. I had six librarians working under me, and their job was to apply metadata as a proof of concept project to all these environmental pages to see if it was really going to help people find information better. Once we got the OK for the search engine to be a statewide effort, I went ahead and used their time to create this original taxonomy, and it's still visible on the North Star Web site right next to the search box. It uses only terms from the controlled vocabulary in our state thesaurus. Developing a taxonomy is a tedious process, but there is a big payoff.
Lamont: How were the themes developed?
Quam: We had a group of representatives from multiple agencies work together to create the themes that are now the top eight topics. Six of us decided what should be the top-level topics.
Lamont: Will taxonomy development be completely automated in the future?
Feit: Taxonomies are not as easy as turning on a piece of software; they never have been, and the tools that have claimed that in the past have not been of a good enough quality to meet the needs of enterprise users, government users or anybody else. However, some new tools that have come to market make the process a little bit easier. Verity introduced the Verity Collaborative Classifier, which allows the management of a taxonomy to be distributed to the subject matter experts—the various groups that have a stake in making that taxonomy a good one. Instead of having one ultimate caretaker for all the taxonomy across all of the state's Web sites, you can use a tool like Collaborative Classifier and distribute responsibility for one part of that taxonomy—one of those 300-plus sites that we talked about—to somebody in that group.
Lamont: How does this distributed development work?
Feit: You can have multiple authors, dozens or even hundreds of them, and still maintain the control and oversight of the centralized librarian who can make the decisions, who would know that's not consistent with the other sites, for example. You can make changes to the wording, or change structure centrally, but you are still relying on others to help build it.
Lamont: Can you compare that process to automated classification?
Feit: Once you've given a topic or a category in your taxonomy 10 or 20 examples, it can begin to find additional documents that seem to be a good fit and recommend them. Once again, the Collaborative Classifier can actually ask someone to approve or reject it in addition to relying on the automation. Tools have come a long way since Eileen first started, but the general concept of giving those users alternate ways to find information in context is what it's all about.
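The example-driven workflow Feit outlines can be sketched as follows: build a word profile from a topic's example documents, score new documents against it, and pass the suggestions to a human reviewer. This is a minimal sketch that assumes plain word-count cosine similarity, which is a stand-in and not a description of how Verity's classifier works. The function names, the 0.3 threshold and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def word_counts(text: str) -> Counter:
    """Lowercased word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def topic_profile(examples: List[str]) -> Counter:
    """Combine the example documents for a topic into one word-count profile."""
    profile = Counter()
    for doc in examples:
        profile.update(word_counts(doc))
    return profile


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(examples: List[str], candidates: List[str], threshold: float = 0.3) -> List[str]:
    """Suggest candidates that resemble the topic's examples, best match first."""
    profile = topic_profile(examples)
    scored = [(cosine(profile, word_counts(doc)), doc) for doc in candidates]
    return [doc for score, doc in sorted(scored, reverse=True) if score >= threshold]


def review(suggestions: List[str], approve: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Split suggestions into (accepted, rejected) according to a human decision."""
    accepted = [doc for doc in suggestions if approve(doc)]
    rejected = [doc for doc in suggestions if not approve(doc)]
    return accepted, rejected
```

The `approve` callback stands in for the approve-or-reject step: the automation only proposes documents, and a person with subject knowledge makes the final call.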
Lamont: Isn't there a risk of ending up with a more specialized vocabulary when subject matter experts take a lead role?
Quam: Yes, that's very true. I actually had that situation with the aviation division. Our state vocabulary uses the terms general aviation, military and commercial. The aviation division knew what it meant, but to the public, general might mean "overall" rather than "private." Since we have the ability within the search engine to match up terms, if somebody types in "general aviation" I can get "private" to show up, which clarifies it for the non-specialist.
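The term matching Quam mentions can be pictured as a small equivalence table consulted at query time. The sketch below assumes a simple two-way mapping between a specialist term and its plain-language counterpart. The function names and the table layout are hypothetical and do not reflect how Ultraseek stores its term matches.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_equivalents(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a two-way lookup so either term in a pair leads to the other."""
    table: Dict[str, Set[str]] = {}
    for a, b in pairs:
        a, b = a.lower(), b.lower()
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table


def expand_query(query: str, table: Dict[str, Set[str]]) -> List[str]:
    """Return the query followed by any matched-up terms to search or display."""
    q = query.strip().lower()
    return [q] + sorted(table.get(q, ()))


# Hypothetical pairing based on the aviation example above.
table = build_equivalents([("general aviation", "private aviation")])
expanded = expand_query("General Aviation", table)
```

With this table, a search for either phrase surfaces the other, so the specialist's vocabulary and the public's wording lead to the same material.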
Lamont: What are your observations about centralized vs. distributed Web site development?
Reynolds: Different lines of business and government departments operate in different environments, and naturally they are going to use words and language differently. The positive thing is that the distributed development can be implemented in a shared environment, and you get many benefits from that, including learning curves for the people that support it. Participation may also increase confidence levels relative to whether it will all continue to work when the next version of the software arrives on the scene, and so forth.
Lamont: Is there a risk of too much decentralization?
Reynolds: What creates the greatest liability for any enterprise is the over-investment in and not so terribly intelligent deployment of lots of search engines, because you have many opportunities to "put search in there." But this will cause trouble in the long run, both for users and for the IT organization that has to support them all: first, the users can't search across the different engines, and second, the IT group is spending a lot of time trying to maintain these engines through various versions of the software.
Lamont: When would you want to use multiple taxonomies or views of the data?
Feit: Sometimes the need for multiple taxonomies is not just driven by different business units or departments, but by a valid need for multiple views. Let's say I want to look for apartments close to my home. That's a geographic taxonomy. I might want to drill down just to see what's in that area. I might want to see all the parks, and all of the different city facilities and libraries. I want to see things near me.
Quam: We have a group that meets monthly to talk about the themes on North Star, and we had a brainstorming session last time to see what really interesting, futuristic things we can do to make the site work better. The idea of faceted taxonomies definitely came up. We have a kids' page right now that brings together things for kids across all agencies, and we'd like to have a senior page. We hope to add some other user groups and offer that type of faceted view.
Lamont: What are some of the lessons you have learned in the course of using the Ultraseek search technology in the North Star portal?
Quam: At least yearly we need to rework the taxonomy with input from the relevant holders of information. The other thing I learned is that we do need to use multiple taxonomies. I have two that are similarly topic-oriented, but some other sites I've brought up have a services taxonomy that is organized by user group, such as services for business, government units or employees. Information is also available topically, but to approach it from a user group perspective is really good. Finally, whenever we get upgrades for the search engine, I like to have people test out the new features to see if they can add value to our Web portal.
Lamont: What should an organization do to keep its information structure current?
Reynolds: Eileen was right when she talked about reviewing all the elements in the taxonomy to see if they are still relevant. Verity's tools can provide a lot of intelligence about the "goodness of fit" for the documents in a particular folder, or when new documents are not getting appropriately classified. Not all search tools have this. It is important to conduct audits of classification on a regular basis.
About North Star
Minnesota North Star is the official Web site for the state of Minnesota. It's operated by the Office of Technology (OT), which also manages the state's IT expenditures and sets policies for technology infrastructure solutions. The Office of Technology has established an information technology architecture to support interoperability and cooperation among state services. As part of its role in developing the Web site, the OT was responsible for implementing a search engine for North Star.
(Part 2 of the Search Roundtable will focus on the future of search, including analytics, entity extraction and the role of search in business processes and composite applications.)