
Big data: expediting and validating analyses

This article appears in the issue January 2016, [Volume 25, Issue 1]


The market for big data products and services continues to grow, and the number of available technology options is increasing. The latest projection from IDC is for the market to expand from $17.2 billion in 2014 to $21 billion in 2015. The software sector is expected to grow at a rate of 26 percent per year from 2014 to 2019, with infrastructure and services also increasing at a rate of over 20 percent per year during the same interval.
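As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The function names are illustrative only, and the software multiplier assumes the 26 percent rate compounds evenly over the five years from 2014 to 2019, which the projection does not state explicitly.

```python
def implied_growth_rate(start: float, end: float, years: int = 1) -> float:
    """Compound annual growth rate that takes `start` to `end` over `years`."""
    return (end / start) ** (1 / years) - 1


def compound_multiplier(rate: float, years: int) -> float:
    """Factor by which a value grows at `rate` per year over `years`."""
    return (1 + rate) ** years


# Overall market: $17.2 billion (2014) to $21 billion (2015), per the IDC figures above.
market_growth = implied_growth_rate(17.2, 21.0)

# Software sector: 26 percent per year over 2014-2019 (five years).
# Assumption: even compounding; the article gives no base figure for software.
software_multiplier = compound_multiplier(0.26, 5)

print(f"Implied overall market growth, 2014 to 2015: {market_growth:.1%}")
print(f"Software sector size in 2019 relative to 2014: {software_multiplier:.2f}x")
```

Under those assumptions, the overall market grows by roughly 22 percent in a single year, and the software sector would roughly triple over the five-year interval.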

“The biggest shift since 2013 is that the hype has subsided,” says Dan Vesset, program VP of business analytics and big data at IDC. “A growing number of companies have gone through the initial deployments of some of the new technologies such as Hadoop. Now the conversation is more pragmatic about data integration and governance issues, appropriate analytic techniques and specific value to business.”

A key issue is how to get up and running quickly, given the complexity of setting up big data storage and analytics, as well as the shortage of qualified data scientists. One option is to turn to big data services that can be tapped into as needed. Unisys has created a service that combines an analytic platform with expertise from data scientists and subject matter experts. “This service evolved from our extensive experience in mission-critical homeland security initiatives,” says Rod Fontecilla, VP of application services with Unisys Federal Systems. “We have been able to leverage it for several other verticals, including telecommunications, life sciences, financial services and retail banking.”

One leading telecommunications company wanted to create new products that reflected what customers really wanted. “We looked at several years of data and created some interesting predictive models on customer segmentation,” Fontecilla says. “We identified about 10 main characteristics of customers based on millions of data points and provided a dashboard of those categories. The company felt it understood its customers for the first time and is now using this information to develop new products that match the customers’ profiles.”

In another application, a financial services company with a million loans wanted to decrease loan delinquency but could not determine the best approach. At first, it seemed the company should look more carefully at credit ratings. “Then they started to look outside the box and found that a key predictor of bankruptcy was when the loan crossed the threshold of being 60 days overdue,” Fontecilla explains. Once the institution identified that turning point, it was able to focus attention on the high-value loans that were moving toward being 60 days overdue.
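The idea of watching high-value loans as they approach the 60-day mark can be illustrated with a minimal Python sketch. Only the 60-day threshold comes from the account above; the `Loan` record, the 15-day watch window and the $250,000 "high-value" cutoff are hypothetical choices made for illustration, not details of the Unisys service or the lender's actual rules.

```python
from dataclasses import dataclass

DELINQUENCY_THRESHOLD_DAYS = 60   # turning point described in the article
WATCH_WINDOW_DAYS = 15            # hypothetical: how early to start watching
HIGH_VALUE_BALANCE = 250_000.0    # hypothetical: what counts as "high-value"


@dataclass
class Loan:
    loan_id: str
    balance: float
    days_overdue: int


def loans_to_watch(loans: list[Loan]) -> list[Loan]:
    """Return high-value loans nearing, but not yet past, the 60-day threshold.

    Results are ordered with the loans closest to the threshold first,
    and larger balances first among loans equally overdue.
    """
    lower_bound = DELINQUENCY_THRESHOLD_DAYS - WATCH_WINDOW_DAYS
    flagged = [
        loan
        for loan in loans
        if loan.balance >= HIGH_VALUE_BALANCE
        and lower_bound <= loan.days_overdue < DELINQUENCY_THRESHOLD_DAYS
    ]
    return sorted(flagged, key=lambda loan: (-loan.days_overdue, -loan.balance))


# Made-up portfolio, for illustration only.
portfolio = [
    Loan("A-100", 400_000.0, 52),
    Loan("A-101", 90_000.0, 58),    # close to the threshold, but low value
    Loan("A-102", 650_000.0, 10),   # high value, but not yet near the threshold
    Loan("A-103", 300_000.0, 59),
    Loan("A-104", 500_000.0, 75),   # already past the threshold
]

watch_list = loans_to_watch(portfolio)
for loan in watch_list:
    print(loan.loan_id, loan.days_overdue, loan.balance)
```

In practice the window and cutoff would be tuned against the institution's own data rather than fixed by hand as they are here.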

The human element

Crowdsourcing can be used in conjunction with analytics to enhance results by providing insights that are difficult for automated solutions to produce, especially those relating to context or linguistic interpretations. As a leading social media company, Pinterest is intently focused on bringing relevant content to its customers. One of the top sources of referral traffic, Pinterest tends to attract serious buyers who are looking for products to purchase. The site has millions of pictures posted by visitors, though, so if the search process is not effective, visitors will rapidly get discouraged. To determine whether its algorithms are working optimally, Pinterest began using CrowdFlower to validate search relevance using, as the name suggests, crowdsourced evaluations.

“CrowdFlower fills our requirement for human curation and labeling,” says Mohammad Shahangian, data scientist at Pinterest. CrowdFlower is truly crowdsourced; it has a signup process for people who would like to assess relevancy of content and anyone can sign up. The responses are first vetted against a known sample to provide an indication of whether the individual is good at the job. Evaluators continue to receive a sprinkling of test items even when working on the full data set to make sure they are on the right track.
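The vetting approach described above, scoring each evaluator against items with known answers and then relying only on those who do well, can be sketched as follows. This is an illustration of the general technique, not CrowdFlower's actual API or scoring method; the function names, the 0.8 accuracy cutoff and the majority-vote aggregation are all assumptions.

```python
from collections import Counter

Answers = dict[str, str]  # item id -> label given by one evaluator


def gold_accuracy(answers: Answers, gold: Answers) -> float:
    """Share of known-answer (test) items an evaluator labeled correctly."""
    scored = [item for item in answers if item in gold]
    if not scored:
        return 0.0
    correct = sum(answers[item] == gold[item] for item in scored)
    return correct / len(scored)


def trusted_evaluators(
    responses: dict[str, Answers], gold: Answers, min_accuracy: float = 0.8
) -> set[str]:
    """Evaluators whose accuracy on test items meets a hypothetical cutoff."""
    return {
        name
        for name, answers in responses.items()
        if gold_accuracy(answers, gold) >= min_accuracy
    }


def majority_labels(
    responses: dict[str, Answers], gold: Answers, trusted: set[str]
) -> Answers:
    """Majority vote over trusted evaluators, for non-test items only."""
    votes: dict[str, Counter] = {}
    for name in trusted:
        for item, label in responses[name].items():
            if item not in gold:
                votes.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


# Made-up data: g1 and g2 are test items with known answers, q1 is a real query result.
gold = {"g1": "relevant", "g2": "not_relevant"}
responses = {
    "ann": {"g1": "relevant", "g2": "not_relevant", "q1": "relevant"},
    "bob": {"g1": "relevant", "g2": "not_relevant", "q1": "relevant"},
    "cat": {"g1": "not_relevant", "g2": "relevant", "q1": "not_relevant"},
    "dee": {"g1": "relevant", "g2": "not_relevant", "q1": "not_relevant"},
}

trusted = trusted_evaluators(responses, gold)
labels = majority_labels(responses, gold, trusted)
print(sorted(trusted), labels)
```

Mixing test items into the regular workload, as the article describes, means the accuracy check can be rerun continuously instead of only at signup.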

Burst capacity

Prior to using CrowdFlower, search engineers at Pinterest would develop an algorithm that included a variety of factors that might affect relevancy and then measure the impact of the algorithm on user engagement. “This process typically took several days,” Shahangian says, “and the engineer could only test a finite number of queries.” Now, Pinterest can assess the improvement in relevance within a matter of hours, and the test does not have to be run on actual users.

The flexibility of launching a test is also a big asset. “We can quickly have 100 or 1,000 people respond to a sequence of queries because the system has ‘burst capacity,’” Shahangian explains. “And CrowdFlower has layers that make it easy to perform the evaluation tasks. We can get concrete examples of where the algorithm went wrong or prove that it is doing what we expected.” In the future, Pinterest plans to use CrowdFlower to validate imagery searches for a particular item as well as text searching.

