-->

A Primer on Structured vs. Unstructured Search

(And why read a primer on structured and unstructured search by me? More on that in a moment.)

If you search for a monkey in the jungle, it's tougher than finding one at the zoo, and if you search for unstructured content, it's tougher than finding structured content. Completely different search technologies specialize in one or the other, but many technologies now say they search both. That's why if you're researching new search solutions, you're going to hear about the difference between structured and unstructured information, and it's going to get confusing, because those definitions have evolved over time.

So where's the decoder ring to know which kind of content you actually have? Why do different technologies specialize in one or the other? And why should that matter to you?

By way of background, I've spent time in the polar extremes of technology companies that manage structured or unstructured data. Now at Endeca, I'm straddling the two. A decade ago, I worked at Teradata, then the foremost vendor of solutions for massive-scale structured data. Back when a terabyte was still huge, we built the data warehouses that powered all of Wal-Mart's transactions. After that, Endeca's CTO, Dave Gourley, and I joined the founding team at Inktomi, where we helped launch the largest-scale unstructured search engine ever built at that time, then powering more than half of all Internet searches. But you would never use Teradata's technology to search Inktomi's content, and vice versa. The user experiences offered by the two were very different.

Who Says It's Structured?
Veterans of our industry insist that for data to be called structured, it must live in a database. By elimination, all other content is unstructured. Sales transactions in data warehouses are structured, and PDFs of those sales transactions are unstructured.

But that definition of structure doesn't feel satisfying, since it classifies the content based on the technology that stores it, and not based on how easy it might be for users to find it. Doesn't that PDF of transactions feel more structured than a PDF of the Congressional Record? And does that change if the Librarian of Congress tagged the PDF with metadata?

Identifying content based on where it resides is no longer meaningful, since that distinction was set back when a terabyte WAS still huge, so you really did need different technologies optimized to search different kinds of storage systems. But today, most of the top search vendors can index content from hundreds of different formats and storage systems, including databases, which means that most search both structured and unstructured content.

Information theorists measure the "information content" of a message in a different way, which, to simplify, tells how far the message is from random noise. A definition in this vein would feel a little more satisfying, since it rings true to our common-sense feeling that a well-tagged PDF of the Congressional Record IS more structured than, say, a blog mentioning an act of Congress. But measuring "information content" still doesn't help anyone find something more easily.
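To make "how far from random noise" concrete, here is a rough sketch, not a measure this article prescribes. It assumes one common reading: Shannon entropy of the character frequencies, turned into a redundancy score of 1 - H/H_max. In Shannon's terms uniformly random text has the highest entropy, so its redundancy is near 0, while repetitive, fielded text scores higher. The function names, the 27-character alphabet and the sample strings are all hypothetical, and a character-frequency measure ignores word order entirely.

```python
import math
import random
import string
from collections import Counter


def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(text: str, alphabet_size: int) -> float:
    """1 - H / H_max: near 0 for uniformly random text, higher for repetitive text."""
    h_max = math.log2(alphabet_size)
    return 1.0 - entropy_bits_per_char(text) / h_max


# Assumed alphabet: lowercase letters plus space (27 symbols).
ALPHABET = string.ascii_lowercase + " "

rng = random.Random(0)  # fixed seed so the run is repeatable
noise = "".join(rng.choice(ALPHABET) for _ in range(5000))
repetitive = "sale sale sale " * 300  # hypothetical, highly regular content

noise_score = redundancy(noise, len(ALPHABET))
repetitive_score = redundancy(repetitive, len(ALPHABET))
print(f"noise: {noise_score:.3f}  repetitive: {repetitive_score:.3f}")
```

The score separates noise from regular text, but it says nothing about which fields a user could filter on, which is the limitation noted above.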

Search, Or Find?
So how can structure help people find things? "Search" itself is a misnomer, since most solutions now employ a whole bag of "find" features beyond the search box, and those almost all rely on the varying degrees of structure that all content holds. Every vendor has its special sauce, whether it turns structure into Guided Navigation that helps users browse, uses it to improve relevance ranking, or even drives charts and graphs. So the question now flips from whether a technology can search structured or unstructured content to how your search solution can best leverage whatever structure is in your content.

And that takes us finally to a more meaningful definition of structure. We define structure by how well it makes organized information easier to find than messy information. By that measure, content is no longer a binary choice between structured and unstructured; instead, it falls on a continuum between those extremes. You have highly structured content in databases, and less structured content with metatags, fielded information, tags from "folksonomies" and file system paths. Lots of structure is implicit, and smart software can extract that, too. For example, entity extractors can pull out names of people and places that might be helpful to filter on, and logs and links reveal which content is most popular. All of this structure can help people find and reuse information, but to varying degrees.
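As a rough illustration of that continuum, the sketch below treats explicit tags and extracted entities alike as facets that a user can count and filter on. It is a minimal sketch under stated assumptions, not Endeca's implementation: the `Document` class, the `KNOWN_PLACES` lookup standing in for a real entity extractor, and the sample documents are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Document:
    title: str
    text: str
    # Explicit structure: metatags, fields, folksonomy tags, and so on.
    facets: Dict[str, Set[str]] = field(default_factory=dict)


# Hypothetical gazetteer standing in for a real entity extractor.
KNOWN_PLACES = {"Washington", "Cambridge"}


def extract_places(doc: Document) -> None:
    """Add a 'place' facet for each known place name found in the text."""
    found = {place for place in KNOWN_PLACES if place in doc.text}
    if found:
        doc.facets.setdefault("place", set()).update(found)


def facet_counts(docs: Iterable[Document], facet: str) -> Counter:
    """Count documents per facet value, as a navigation sidebar might."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(doc.facets.get(facet, ()))
    return counts


def refine(docs: Iterable[Document], facet: str, value: str) -> List[Document]:
    """Keep only the documents carrying the chosen facet value."""
    return [doc for doc in docs if value in doc.facets.get(facet, ())]


docs = [
    # Highly structured: fields that might come from a database row.
    Document("Sales report", "Quarterly sales for the Cambridge store.",
             {"type": {"report"}, "source": {"database"}}),
    # Lightly structured: a single tag.
    Document("Blog post", "A blog mentioning an act of Congress in Washington.",
             {"type": {"blog"}}),
    # No explicit structure at all.
    Document("Memo", "Untitled notes with no tags."),
]

for doc in docs:
    extract_places(doc)  # implicit structure becomes explicit facets

print(dict(facet_counts(docs, "place")))
print([doc.title for doc in refine(docs, "place", "Washington")])
```

The untagged memo never shows up under any facet, which is the "varying degrees" point: the less structure a document carries or yields, the fewer ways there are to navigate to it.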

We started Endeca because the price/performance of computers had come so far that it was time to engineer a new kind of technology that could finally take all kinds of structure on the continuum and turn that into new kinds of tools that make content easier to find. It no longer made sense that the technologies powering Teradata and Inktomi shared almost no code. So with these new kinds of tools on the market, don't ask whether you can search structured or unstructured content; ask how structured your content is, and how that structure can help people find more in enterprise jungles and zoos.


Endeca, headquartered in Cambridge, Massachusetts, was founded in 1999 to transform the online search and navigation experience so that people can easily access the full breadth and depth of large data sets. Today, Endeca solutions for enterprise search and commerce are already helping businesses across a variety of sectors including financial services, manufacturing, retail, information providers and business-to-business with applications that address the information overload problems associated with enterprise information access and retrieval and content and catalog management.
