
Open data commons for business

This article appears in the issue July/August 2011, [Vol 20, Issue 7]



It is hard to overestimate the importance of open data commons to science. That's why if I were the CEO of a company, I'd be building an open data commons for my business.

The simplest idea behind an open data commons is that there is value in making massive amounts of data available for any and every purpose anyone can come up with. So, make a giant pile of data and invite the world in to play. From that come the first two principles of an open data commons: Put in everything, and let anyone use it.

But, of course, it then gets more complicated. You can't really put in everything. And people won't be able to use any of it if you don't provide the metadata that lets them know what's what. You can't put in everything if only because your commons is going to be about something. It's going to be a commons of data about genomics, economics or Barbie doll models. So, you may want to exclude data about Formula One racing and panda insemination frequencies. Even so, if you provide usable metadata, you can include all the borderline cases, because the commons' users will be able to sort out what isn't relevant to their project.

Nevertheless, you'll want to do a few more things to make your data commons useful.

First, you'll want to encourage the use of additional metadata so that people know where the data is coming from and what its quality is. That's important because allowing the inclusion of raw data vastly lowers the hurdle for those who have data to contribute. If they have to verify each stat and line up all the decimal points, they'll never get around to releasing the data in the first place. Half-baked data that is available is infinitely more valuable than fully baked data that is not.
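To make that concrete, here is a minimal sketch in Python of what "raw data plus provenance metadata" might look like. Everything in it is an assumption made for illustration: the `Contribution` record, its field names, the three quality levels and the `accept` and `at_least` helpers are hypothetical, not part of any real data commons. The point is only that the commons insists on knowing where data came from while never insisting that the data be clean.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical quality labels, ordered from least to most checked.
QUALITY_LEVELS = ("unverified", "spot-checked", "verified")


@dataclass
class Contribution:
    """One contributed dataset plus the metadata describing its origin."""
    title: str
    source: str                      # who or what produced the data
    rows: List[Dict[str, Any]]       # the raw data, accepted as-is
    quality: str = "unverified"      # the contributor's own assessment
    notes: Optional[str] = None      # free-form caveats, if any


def accept(contribution: Contribution, commons: List[Contribution]) -> None:
    """Add a contribution: provenance is required, clean data is not."""
    if not contribution.source.strip():
        raise ValueError("a contribution must say where its data came from")
    if contribution.quality not in QUALITY_LEVELS:
        raise ValueError(f"quality must be one of {QUALITY_LEVELS}")
    commons.append(contribution)


def at_least(commons: List[Contribution], minimum: str) -> List[Contribution]:
    """Return the contributions whose stated quality meets a minimum level."""
    floor = QUALITY_LEVELS.index(minimum)
    return [c for c in commons if QUALITY_LEVELS.index(c.quality) >= floor]


if __name__ == "__main__":
    commons: List[Contribution] = []
    # Half-baked data goes in, labeled honestly as unverified.
    accept(Contribution("raw sensor dump", "plant floor team",
                        [{"temp": "71.3?"}]), commons)
    accept(Contribution("audited sales", "finance",
                        [{"q1": 10}], quality="verified"), commons)
    # A user who needs checked numbers filters on the metadata.
    print([c.title for c in at_least(commons, "spot-checked")])
```

The contributor's hurdle stays low because the only required extras are a source and an honest quality label. Users who need more rigor filter on that label rather than asking contributors to do the checking up front.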

Then you'll want to consider the matter of a license for your data. For example, OpenDataCommons.org has two licenses ready-made for your data commons: one that puts the data into the public domain, and another that requires those who use it to attribute where they got the data from and to share the data under the same conditions of openness. ScienceCommons.org (a part of CreativeCommons.org) also will help you out, as well as give you reasons why requiring attribution and share-alike are generally bad ideas for databases.

You'll also want to decide how you want the data structured. Or maybe you'll simply want to require the contributors to describe how they've structured their contribution. The first way makes the data more easily searchable and reusable, but it also requires contributors to conform to your standards. Plus, data standards can inadvertently result in obscuring or eliminating data that might be useful for some unanticipated purpose.

More and more commons

In fact, you might want to make your data available as linked open data, which prescribes the general form of the data (RDF "triples" of the form "A is in relation B to C") without over-specifying what the universal standard semantics and terminology for your commons' topic should be. Linked data makes it easier to pool data into a commons.
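A rough sketch may help show why triples pool so easily. The Python below uses plain tuples of strings as a stand-in for RDF, which is a simplification: real linked data identifies things with URIs and uses RDF serializations and tooling. The `pool` and `match` helpers and the sample facts from a "sales" and a "support" team are hypothetical, invented for illustration.

```python
from typing import Iterable, List, Optional, Set, Tuple

# A triple reads "A is in relation B to C": (subject, relation, object).
Triple = Tuple[str, str, str]


def pool(*sources: Iterable[Triple]) -> Set[Triple]:
    """Merge triples from any number of contributors into one set."""
    merged: Set[Triple] = set()
    for source in sources:
        merged.update(source)
    return merged


def match(triples: Iterable[Triple],
          subject: Optional[str] = None,
          relation: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit a pattern; None matches anything."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    )


if __name__ == "__main__":
    # Two hypothetical teams contribute facts without agreeing on a schema.
    sales = [("widget-7", "soldIn", "Ohio"),
             ("widget-7", "soldIn", "Texas")]
    support = [("widget-7", "hasOpenTickets", "12"),
               ("widget-9", "hasOpenTickets", "3")]

    commons = pool(sales, support)
    # Everything the commons knows about one product, from both teams.
    for triple in match(commons, subject="widget-7"):
        print(triple)
```

Because every fact has the same three-part shape, merging two contributors' data is just a set union. Neither team had to adopt the other's tables or column names, and a question that spans both can still be asked.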

Having wended your way through these questions, you will have a commons of data, the main value of which is that its value cannot be predicted. Researchers and innovators will come with questions you would never ever have thought of, and they'll get answers that no one ever anticipated.

That's why we're seeing more and more data commons. Data.gov gathers tons of information from U.S. executive branch agencies. The Genome Commons has data that helps interpret genetic information. The Proteome Commons has a huge amount of information about proteins. MetroBoston includes data about cities and towns in Massachusetts, with topics including public health, housing, arts and culture, and education. The list is getting longer and the data pool is getting wider and deeper, quite rapidly.

So, why should you do this in your company?

Because your company has lots and lots of data, and you can't know what value the data has until everyone has a crack at finding its value. So, create a company data commons, and put everything you possibly can into it. Sure, you're going to hold back on some personnel information, and information that is truly confidential. But all the rest should go into the commons. Keep it behind the firewall if you must, or require a company ID to log in, but let your folks know about it, and encourage them to fish through it. You might even want to provide them with some analytic and visualization tools. Provide a place for people to post their findings, and publicly reward the most innovative and pragmatic results.

Open up your data to your community. You never know what they will make of it. And that is exactly the point.

