As Stephen Purpura, CEO of Context Relevant, was describing a fraud detection application, my mind was writing rules. I spent a number of years crafting expert systems based on interviews with experts. Purpura was clearly describing conditions that any knowledge engineer would recognize as if-then statements, which, when loaded into a forward-chaining expert system and applied to data, would analyze that data and tell the customer, with some degree of certainty, when fraud was being committed.
Purpura, however, wasn't describing a system that took hundreds of hours of software engineering to develop, test and deliver. The Context Relevant fraud detection system didn't act on data; it was derived from the data itself. Moreover, the system he described worked in near real time from transactions being made across the Internet.
Imagine the new phenomenon of flash fraud, where a person places 500 simultaneous transactions using a single card across multiple retailers, or one person creates 150 different cards in rapid succession at one site. Because of the latency of current fraud detection systems, the fraud has been committed hours, if not days, before it is discovered. Context Relevant's system monitors data and identifies the fraud in less than a minute. And its model, rather than being handcrafted over months, took a couple of dozen Python statements to code.
Context Relevant and companies like it are reigniting the heat beneath machine learning, data topologies and other technologies to meet the demand for finding big data's hidden gems that will lead to medical breakthroughs, innovative new materials or improvements in trading and markets.
What has changed over the last 20 years? Faster, more reliable hardware, better development suites and developers who better appreciate the need for quality production systems. That, along with data that earlier generations couldn't imagine having access to, has brought the predictive analytics market to a new level of relevance. And if all goes well, those systems will be deployed not just to data scientists, but also to subject matter experts within organizations who can transform newly discovered insights into business value.
Welcome to the era of machine learning
The World Future Society and McKinsey Global Institute see a future where the need for data analysts far outstrips supply. McKinsey's report Big Data: The Next Frontier For Innovation, Competition and Productivity states that "by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."
There is, however, another scenario. Rather than tens of thousands of individuals handcrafting SQL queries based on a hypothesis, and then refining that hypothesis over time, a machine learning system looks at the data, takes into account a basic set of goals and then presents patterns that it discovers in the data. For the previous flash fraud example, traditional approaches would prove humanly impossible when facing 500 simultaneous transactions through a banking or retail system. And unlike fragile, rule-based systems that often fail when their logic no longer correlates to real-world conditions, machine-learning systems would present anomalous data patterns regardless of whether they had seen the pattern before or just discovered it.
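The idea of letting the data define what counts as anomalous can be sketched in a few lines of Python. This is a minimal illustration, not Context Relevant's method: the function name `flag_burst_cards`, the input layout of (card ID, timestamp) pairs, the one-minute fixed window and the cutoff of 6 are all assumptions made for the example. Rather than hard-coding a limit such as "more than 100 transactions," it compares each card's busiest minute against the median behavior of all cards.

```python
from collections import Counter
from statistics import median


def flag_burst_cards(transactions, window_seconds=60, cutoff=6.0):
    """Return the card IDs whose busiest time window is an outlier.

    transactions: iterable of (card_id, unix_timestamp) pairs (assumed layout).
    The threshold is derived from the data via the median and the median
    absolute deviation (MAD), not from a hand-written rule.
    """
    # Count transactions per card within fixed (non-overlapping) windows.
    counts = Counter(
        (card, int(ts // window_seconds)) for card, ts in transactions
    )

    # Keep only the busiest window for each card.
    peak = {}
    for (card, _window), n in counts.items():
        peak[card] = max(peak.get(card, 0), n)

    if not peak:
        return set()

    values = list(peak.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # 1.4826 scales the MAD to be comparable to a standard deviation;
    # fall back to 1.0 when most cards behave identically (MAD of zero).
    scale = 1.4826 * mad if mad > 0 else 1.0

    return {card for card, n in peak.items() if (n - med) / scale > cutoff}
```

Called on a batch in which one card places 500 transactions within a few seconds while the others place one or two, the function returns only that card. A production system would use overlapping windows and many more signals, but the principle is the same: the baseline comes from the data.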
Cutting through the big data hype
No one knows exactly how much of the data being created is analyzed for value, but the fraction is very small. Big data isn't a panacea just because data exists. When not flowing through a system, data lies inert on disk drives as magnetic representations of ones and zeros. There is no inherent value in having more ones and zeros unless those who own them know how to intelligently extract value from the relationships and patterns implied by how the words and numbers relate beyond any explicit organization (as in rows, columns and tables).
One of the big problems with big data is not having the right data, either because it isn't available or because it wasn't considered. Crime prediction, for instance, shouldn't rely exclusively on police data. With increasing levels of public data from sources like Facebook and Twitter, privately collected data can be supplemented with public data to help make correlations. GPS-tagged data, combined with police reports, could solidify insights about where crime is taking place as criminals or their victims report activity through public social media.
Ayasdi applied its Iris topological analysis to cancer data from 507 breast cancer patients, including somatic mutation, gene expression and clinical data. The analysis found patterns that earlier researchers hadn't seen, like the role hormones play in mutations and cell regulation.
Eric Schadt, professor and chair of genetics and genomic sciences at Mount Sinai Hospital, sees analytically driven visualization as a powerful tool for making sense of the new data being collected in healthcare: "In the past, drawing correlations in complex datasets has taken months or years of work and often exceeded available computer analytic capabilities," he says. "Now we can take hundreds of thousands of variables and score them across hundreds of thousands of people and try to extract patterns. We're now able to ask some novel questions. My team and I use the Ayasdi software to study bacterial outbreaks and genetic mutations."
Another problem with traditional analytics is that it often involved black boxes. Gartner VP Merv Adrian says, "The trouble with black boxes is that they conceal black swans." Algorithms were written, data was processed and results were presented. As financial markets and ultimately the world learned, when the Gaussian copula function that predicted investment risk stopped working effectively, no one noticed. Investment bankers were paid to heed risk recommendations, not question the underlying system. Wired magazine dubbed the formula "The Secret Formula that Destroyed Wall Street" on its March 2009 cover.
New approaches to predictive analytics learned from the financial market fiasco by making reasoning more transparent. If the outputs and the reasoning don't make sense, numerate domain experts should avoid acting on them.
Forecasting the wrong workforce
Asking the right question of the right data is a common thread in big data analytics, and it is one of the reasons that McKinsey forecasts the need for so many data analysts. But what if machine-learning systems discovered patterns, and the real need was for people who could transform the insights provided by systems into new solutions, remedies and innovations?
It has been suggested that the future of the pharmaceutical industry would be tightly bound to an understanding of the human genome. But when DNA gave up its secrets to the computer, researchers quickly discovered that many assumptions were incorrect. Insights will come not from a genome, but rather from the comparison of multiple genomes.
Unfortunately, as personal genetic mapping becomes commonplace, the shortage of data analysts will come into play. Although common genetic markers will be easy for clinicians to identify for patients, comparing data about human genetics will quickly outstrip humankind's ability to make sense of what it is seeing, unless it automates its vision.
In Race Against the Machine, authors Erik Brynjolfsson and Andrew McAfee posit a future in which computers replace more and more labor, including non-physical labor. Most jobs created during the most recent recovery came from low-wage service industries, not from high-paying managerial or skilled-labor roles.
Data analysis may be a boon for mathematically inclined graduates, but if data analysis becomes productionized, much of the work won't be for analysts, but for subject matter experts. Much as with factory floor robotics, the labor shifts away from assembly work. It can even be argued that automation and the reduction of manufacturing costs funded improved customer service; higher-quality, more innovative products; new retail models and even shifts in advertising. But unlike robots that simply replace manual labor with automation, analytics potentially generates new insights that will require a workforce that can transform those insights into business value. And because the insights will often come at a rapid pace in highly competitive industries, both the workforce and the business models must keep pace with change.
The future of analytics
Computing is cheap and fast, and it's getting cheaper and faster. All the world's music will fit on a $500 disk drive, acquiring incremental CPU power is inconsequential, and software tools like Hadoop are available as open source. Rather than spend years training data analysts and invest in providing them industry-level experiences so they can correctly build models and attempt to anticipate the questions that need to be asked, why not apply computational methods to speculate on hundreds of hypotheses, ask hundreds of questions and see which ones lead to something interesting? Much as IBM's Watson or Deep Blue used the brute force of machine intelligence to conquer trivia games and chess, predictive analytics firms apply inexpensive computing to parse out insights hidden among a wealth of seemingly unrelated data.
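A toy version of that brute-force approach is easy to write. The Python sketch below treats every pair of columns in a small dataset as a hypothesis ("are these two related?"), scores each pair with a plain Pearson correlation and returns the strongest candidates. The function names `pearson` and `screen_pairs` and the dictionary-of-lists input are assumptions for illustration only; commercial products rely on far richer techniques than pairwise correlation.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def screen_pairs(columns, top=5):
    """Score every column pair and return the `top` strongest.

    columns: dict mapping a column name to a list of numbers (assumed layout).
    Returns (name_a, name_b, strength) tuples, where strength is the
    absolute correlation rounded to three places.
    """
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    scored.sort(reverse=True)
    return [(a, b, round(strength, 3)) for strength, a, b in scored[:top]]
```

With hundreds of columns this loop tests tens of thousands of pairs in seconds. Some of the strong correlations it surfaces will be coincidence, so the ranked list is best read as a set of leads for a subject matter expert to examine, not as conclusions.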
Unlike IBM's demonstrations of computing innovation, those firms and other emergent competitors in the space have much more humble ambitions. They don't want to fuel public spectacles, but rather quietly and accurately help commercial firms, researchers and government agencies make better sense of the data they collect so that they can deliver power more efficiently, prevent crime or create more personalized medicines.