Big data, cows and cadastres
What is big data? Speakers at a conference I attended in the spring explained that billions of Twitter messages shoot around each day. Digital information is doubling every few months. A gigabyte of data is nothing compared to a petabyte. Analytics is the key to unlocking the value of "huge flows of structured and unstructured information."
Why have analytics or, more accurately, advanced mathematical procedures become the solution to finding information? Is social media the answer? Can Twitter messages (tweets) and Facebook postings improve the revenue of an organization? Two days of presentations provided many generalizations but scant evidence of those buzzwords delivering a payoff to organizations.
On the flight home from the conference, I wondered about the excitement over big data, analytics and social media, and I thought about the Soviet mathematician Andrei Nikolaevich Kolmogorov (1903-1987). For social media consultants, Kolmogorov means little. He lives on for advanced math students who learn that Kolmogorov (assisted by one of my distant relatives, Vladimir Igorevich Arnold) cracked one of Hilbert's 23 problems. Kolmogorov was into big data when the words and numbers were in quarto volumes stuffed with tiny print and hand notations.
Big data and analytics arrive
While a university student, Kolmogorov decided to analyze cadastres (land registers) compiled on a farm-by-farm basis for tax purposes. Money, then as now, was important. The tax documents were incomplete and fraught with inconsistencies. Young Kolmogorov, who reveled in detail, manually identified anomalies. When confronted with inconsistencies, he employed Bayesian and other probabilistic methods to make sense of the hundreds of thousands of information items recorded in what today would be hundreds of thousands of database records.
The analytics paid off. Kolmogorov's manual calculations revealed that what tax authorities used as a method of counting land was in reality a way to measure total profitability. For those looking for revenue, Kolmogorov's findings were revelatory in post-revolution Russia. Big data and analytics had arrived. It was not until the 1970s that Kolmogorov's application of advanced math to hybrid data emerged as a discipline for the study of historical documents.
Big data and bulls
Today the ubiquitous phrase "big data" does not evoke dusty, leather-bound ledgers. For the dairy industry, big data and analytics mean more money from milk-producing cows. "The Perfect Milk Machine: How Big Data Transformed the Dairy Industry" (theatlantic.com/technology/archive/2012/05/the-perfect-milk-machine-how-big-data-transformed-the-dairy-industry/256423) offers an interesting example of the value of big data and analytics in an industry sector plagued by narrow margins and rising costs.
The hero of the story is a bull named Badger-Bluff Fanny Freddie. Dairy cattle sired by him yield more milk. Genetic information processed by sophisticated numerical recipes yields more efficiency. With Badger-Bluff Fanny Freddie, the dairy industry has an opportunity to convert big data into more milk per head. Therefore, the knowledge generated by big data analytics methods translates directly to money.
The article explained: "Dairy breeding is perfect for quantitative analysis. Pedigree records have been assiduously kept; relatively easy artificial insemination has helped centralize genetic information in a small number of key bulls since the 1960s; there are a relatively small and easily measurable number of traits—milk production, fat in the milk, protein in the milk, longevity, udder quality—that breeders want to optimize; each cow works for three or four years, which means that farmers invest thousands of dollars into each animal, so it's worth it to get the best semen money can buy. The economics push breeders to use the genetics."
The buzz about social media suggests that entrepreneurs and those interested in obtaining high-value information can blend big data, analytics and social media. No one can dispute that mobile devices output huge volumes of data 24 hours a day. Twitter messages flow in the billions. Facebook has 900 million users who post text, images and links in volumes described in petabytes and zettabytes. But a number followed by 21 zeros does not translate to something that I can easily grasp. I understand tax rates. I understand the bull.
Established enterprise knowledge management vendors have big data solutions based on dozens, if not hundreds, of separate components. From those components, the vendors build a custom or semi-custom solution for each customer. IBM is a good example of the traditional approach to enterprise big data systems.
IBM is often the system vendor of choice in the financial services, healthcare and manufacturing sectors. Where IBM does not have a foothold, its archrivals Microsoft, Oracle or SAP, among others, will. Let's focus on IBM. The company has invested billions acquiring companies that have big data, analytics and content synthesis technology. Cognos, SPSS and Vivisimo are part of the IBM lineup. In addition, IBM research teams have created proprietary solutions such as WebFountain technology to discover nuggets within a fast-rushing flow of data.
The IBM approach is to understand the prospect's or customer's problem, develop a plan of action and then assemble the solution from the components in IBM's toolbox. IBM can deliver a traditional business forecasting system using Cognos or assemble a predictive analytics solution to guide the client in its financial strategy. The approach is successful because it makes sense to Fortune 1000 companies, commands six- and seven-figure engagements, and leads to evergreen revenue from upgrades, customization and maintenance contracts. The problem is that millions of organizations want to tap into big data without the IBM-scale investment.
At the same time that enterprise giants like IBM were ramping up their big data capabilities, open source solutions were thriving. Google published descriptions of some of its data management technology, including BigTable. Working from those ideas, open source developers have been able to stand on the shoulders of Google's engineers and create such software systems as Hadoop (hadoop.apache.org), named after a child's toy elephant, and Cassandra, named after the daughter of King Priam and Queen Hecuba of Troy.
Entrepreneurs realized that the giants of enterprise software were creating opportunities for products and services. Within the last year or two, open source analytics companies have sprouted. Examples range from Palantir (palantir.com), a company that has received more than $150 million in venture funding, to Pentaho and Jaspersoft, two companies that have used open source technology to deliver innovative, high-value solutions.
Consider Pentaho. The system is constructed from open source components. The company says it offers "intuitive, Web-based software," and asserts: "Pentaho tightly couples data integration with business analytics in a modern platform that brings together IT and business users to easily access, visualize and explore all data that impacts business results."