Psychographics, statistics and Big Brother

This article appears in the May 2017 issue [Volume 26, Issue 5]

In March, The New York Times published a front-page article titled “Bold Promises Fade to Doubts for a Trump-Linked Data Firm,” which focused on the work of a Republican-leaning data analytics firm called Cambridge Analytica and its claims to be able to predict voting outcomes for any citizen across the United States by running algorithms on personality and political expression data.

It is time for the press to take a crash course in statistics. When a company like Cambridge Analytica claims its software can “predict the personality and hidden political leanings of every American adult,” we need to take that promise with an ocean of salt. The fact is that any of the techniques available to the market today for predicting one’s actions based on a profile can be important tools, but they cannot, with any great accuracy, predict the behavior of one single individual. They rely on groups of similar individuals to predict the probability of one person’s actions.

Text and sentiment analytics

Probability, not certainty. Marketers, pollsters, advertisers and search engines all use those techniques for various purposes. So, too, do physicians when they craft a treatment plan based on the cases of other people whose conditions, genetically and historically, are most similar to yours. We want to increase the probability that we will get a more effective medical treatment or that we will get recommendations for movies we want to watch or products we want to buy. In a world drowning in information, most of it irrelevant, we need this kind of help. But no technology today can delve into the personality of each individual to accurately predict that person’s actions.

Text analytics and sentiment analytics are two technologies that are used for that purpose. They have a long history of usefulness. At least 15 years ago, companies like SPSS were able to find patterns of words that helped telecom companies predict whether a customer was LIKELY to move to a competitor, based on the customer’s use of words. Banks used that software to improve their chances of predicting whether a customer was likely to default on a loan. Netflix went so far as to sponsor a competition with a $1M prize for anyone who could design an algorithm that would beat the performance of Netflix’s own collaborative filtering algorithm in predicting what type of movies particular Netflix customers might like to see.

Probabilistic recommendations

The operative word is MIGHT. In the nearly 20 years that we as consumers have been exposed to probabilistic, data-driven recommendations, we have all collected stories about really dumb recommendations we have received and whole marketing campaigns gone weirdly wrong.

In the political sphere, Cambridge Analytica is far from the first firm to take on predicting voting patterns from political expression and demographics. A number of years ago, text analytics firm Linguamatics analyzed tweets in real time during one of the British political debates to see if conservatives used different words from liberals, in order to predict their subsequent voting behavior.

In each of those examples, the critical thing to take note of is that we are talking about how groups of people behave within the context of a particular task—this is the “collaborative” that lies at the heart of collaborative filtering.

Collaborative filtering

Here’s how it works. First we find a very large group of individuals whose behavior we already know with some certainty: lovers of action films, buyers of running shoes who also bought smart watches, etc. We build models, we look for patterns, we train a system for identifying the salient features that distinguish that group from all others. But we really don’t know from that process what the most important predictor of a behavior may be. It’s entirely possible, for example, that those who disliked Hillary Clinton didn’t like her voice, or those pant suits, and that policies had nothing to do with their emotional response to her.

Once we have developed our “profiles,” we try to match each new customer (think data) who comes in to the closest of those groups. That approach can work nicely, as long as we are not seeking perfection. If we improve our movie recommendations by 10 percent, that’s a big advantage for Netflix and for its customers who don’t have to watch movies they really hate. (In fact, a 10.06 percent improvement won Netflix’s $1M prize.) For cancer patients, such an improvement may mean the difference between life and death. But 10 percent or even 80 percent is not certainty when one individual’s movie choice, or life, is at stake.
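The matching step described above can be sketched in a few lines of Python. This is a toy illustration only, not Netflix’s algorithm or anyone’s production system: the viewers, titles and ratings are invented, and the choice of cosine similarity with a similarity-weighted average is an assumption made for the sake of the example.

```python
from math import sqrt

# Hypothetical ratings (1-5) from viewers whose tastes we already know.
# All names, titles and numbers are made up for illustration.
KNOWN_RATINGS = {
    "ann":   {"Die Hard": 5, "Mad Max": 4, "Amelie": 1},
    "bob":   {"Die Hard": 4, "Mad Max": 5, "Speed": 5},
    "carla": {"Amelie": 5, "Notting Hill": 4, "Die Hard": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity of two rating dicts; unrated titles count as 0."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(new_user, title, known=KNOWN_RATINGS):
    """Similarity-weighted average of known viewers' ratings for a title.

    Returns None when no similar viewer has rated the title.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for ratings in known.values():
        if title not in ratings:
            continue
        weight = cosine_similarity(new_user, ratings)
        weighted_sum += weight * ratings[title]
        weight_total += weight
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


def recommend(new_user, known=KNOWN_RATINGS):
    """Rank titles the new user has not rated, best predicted score first."""
    unseen = {t for r in known.values() for t in r} - set(new_user)
    scored = [(t, predict_rating(new_user, t, known)) for t in unseen]
    scored = [(t, s) for t, s in scored if s is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    newcomer = {"Die Hard": 5, "Mad Max": 5}
    for title, score in recommend(newcomer):
        print(f"{title}: {score:.2f}")
```

With this made-up data, an action fan is steered toward “Speed” first. “Notting Hill” comes second, however, on the strength of a single viewer who barely resembles the newcomer, which is the sort of guess that turns into a dumb recommendation. The score is an estimate borrowed from the group, not knowledge of the individual.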

The inexactness

By and large, a recommendation system may do pretty well by being right 50 to 70 percent of the time. The important word that is missing from most news reports on modern analytics practices and results is probability. Software is notoriously incapable of understanding individuals, just as statistics is. Collaborative filtering is not the only vulnerable piece in the puzzle. With all the attention on machine learning in cognitive systems and big data analytics today, there continues to be a lack of understanding of the inexactness built into any analytic approach that is inherently probabilistic. That imprecision can come from human bias, error or oversight that gets expressed in the training process of machine learning algorithms. It can also come from the operation of the algorithms themselves, where there is a well-understood but still difficult-to-manage presence of inherent bias and “drift” in the results as they move through analytic cycles.
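A little arithmetic shows why a system that is right 70 percent of the time can be useful for groups and still unreliable for any one person. The sketch below assumes, purely for illustration, that each person in a profile independently takes some action with probability 0.7; the function names and the numbers are hypothetical.

```python
from math import sqrt


def individual_error_rate(p_correct):
    """Chance that a single prediction about one person is wrong."""
    return 1.0 - p_correct


def group_share_standard_error(p, n):
    """Standard error of the share of n people who take the action.

    Assumes each person acts independently with probability p (binomial).
    """
    return sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    p = 0.7  # assumed probability for everyone matching the profile
    print(f"Wrong about any one person: {individual_error_rate(p):.0%}")
    for n in (100, 10_000):
        se = group_share_standard_error(p, n)
        print(f"Group of {n}: share is {p:.2f} give or take {se:.4f}")
```

Under those assumptions, the estimate for a group of 10,000 is off by well under one percentage point, while the guess about a single member of that group is still wrong about three times in ten.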

As far as Cambridge Analytica predicting my vote or your vote individually, those claims have already suffered embarrassment enough on the front page of the Times. But the firm’s claim to have broken the code on vote prediction serves as a cautionary tale to anyone working on cognitive systems. Even our improving technologies still only provide tools that are quite crude when it comes to many analytics tasks. We rarely get THE right answer. We need to learn to manage decisions based on ranges and estimates, and, yes, probabilities.

Maybe the time will come when Big Brother knows your next move, and mine, but I doubt that that will be any time soon. Until then, with all the endless variety of humans and their motivations, there’s plenty of room to hide.
