A goal of knowledge management over the years has been the ability to integrate information from multiple perspectives to provide the insights required for valid decision-making. Organizations do not make decisions just based on one factor, such as revenue, employee salaries or interest rates for commercial loans. The total picture is what should drive decisions, such as where to invest marketing dollars, how much to invest in R&D or whether to expand into a new geographic market.
In the past, the cost of collecting and storing data limited the ability of enterprises to obtain the comprehensive information needed to create this holistic picture. However, automated collection of digital information and cheap storage have removed the barriers to making data accessible. Data is now available in abundance, but relational databases have reached the limits of their ability to make sense of it.
Volume, variety, velocity
New solutions have now emerged to deal with so-called "big data." Big data does not have a precise definition in terms of volume, but data crosses into that realm when a relational database is no longer effective in analyzing it. The solutions depend on breaking up the data, sending out subsets for analysis and then regrouping the results to produce the output (see the sidebar following this article and on page 10, KMWorld, Vol. 21, Issue 4, for definitions of some of the components of big data, including Apache Hadoop).
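The break-up, analyze and regroup pattern described here is essentially the map/reduce model that Apache Hadoop popularized. A minimal sketch in plain Python, with the record fields and page-view counting purely illustrative (this is the general idea, not any vendor's API):

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs from each raw record --
    # here, one page view per URL.
    for record in records:
        yield record["url"], 1

def reduce_phase(pairs):
    # Regroup: combine all values that share a key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The data set is broken into subsets ("splits"); each split is
# mapped independently (on separate machines in a real cluster),
# and the partial results are regrouped to produce the output.
splits = [
    [{"url": "/hotels"}, {"url": "/flights"}],
    [{"url": "/hotels"}, {"url": "/cars"}],
]
pairs = (pair for split in splits for pair in map_phase(split))
result = reduce_phase(pairs)
print(result)  # {'/hotels': 2, '/flights': 1, '/cars': 1}
```

Because each split is mapped independently, the work scales out across as many machines as the cluster provides; only the much smaller intermediate pairs need to be brought together.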
Volume, however, is not the only dimension that defines big data. "Variety is also a factor because many different types of data may be pertinent to an analysis," says Mark Beyer, research VP at Gartner. "With the amount of information in documents and social media feeds such as Twitter, enterprises need to be able to combine their analyses to include information from both structured relational databases and content such as word processing documents, videos, images, blogs and Tweets."
Velocity is a third factor associated with big data. Not only is there a lot of data, but it is also arriving quickly and must often be processed quickly. In addition, velocity itself can vary. Take the case of two people clicking through a website: if data is being collected over time, some users will generate more of it within a given period. "The variation in velocity affects analytical outcomes," adds Beyer, "particularly if the data model specifies an event."
In addition, when massive amounts of data are involved, a lot of noise is found amidst the relevant signals. "You need to be able to carry out an iterative process to discover what may have been overlooked initially, because every type of analysis evolves," Beyer says. "One person's news is another person's noise, so determining what each information consumer needs is important." One of the jobs that big data can perform is real-time filtering, to distinguish between the two.
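At its simplest, real-time filtering of this kind is a per-consumer predicate applied to an event stream. A toy sketch, with the event strings and keyword lists entirely hypothetical:

```python
def filter_stream(events, keywords):
    # Pass through only the events relevant to this consumer;
    # "one person's news is another person's noise."
    for event in events:
        if any(word in event.lower() for word in keywords):
            yield event

events = [
    "Hotel bookings up 4% in Q2",
    "Office cafeteria menu updated",
    "New flight route announced",
]
travel_news = list(filter_stream(events, ["hotel", "flight"]))
print(travel_news)  # ['Hotel bookings up 4% in Q2', 'New flight route announced']
```

In practice the predicate would be learned or iteratively refined per information consumer, rather than a fixed keyword list, which is why Beyer stresses that every type of analysis evolves.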
Very few organizations are far along the maturity curve in dealing with big data, but the incentive is there. According to Beyer, the ability to address big data is going to be the most intensive and important infrastructure change for IT in the next decade. Moreover, it has major implications for knowledge management.
Big data in travel
One approach that is working well is using big data techniques to store, process and retrieve information along with established business intelligence (BI) solutions for detailed analyses. That approach combines the expanded capability in big data with the familiarity and usability of business intelligence products.
Expedia.com pioneered the online travel industry in the mid-1990s, and is the world's leading online travel site. The company offers a full range of services, including flight bookings, hotel reservations, car rentals, cruises and opportunities for special activities at travel destinations. Its websites provide information in local languages in 26 countries, and more than 75 million unique users visit the Expedia sites each year. Therefore, the amount of data associated with the visits and transactions adds up quickly, placing Expedia squarely in the big data category for its analytical needs.
The focus of Expedia's analyses is on customer service and retention, as well as measuring the effectiveness of its marketing activities. The company collects tens of terabytes of usage data each month and its data warehouse contains approximately 200 terabytes, an amount of information that would have been impossible to handle 10 years ago. Now, with a mix of enabling technologies, Expedia can store the data, analyze it and produce results that guide decision-making throughout the company.
"We are moving toward a broad vision for our analytics," says Joe Megibow, VP and general manager for Expedia.com, "combining a number of different technologies, including Hadoop for distributed storage." A key element in Expedia's analytics projects is the SAS (sas.com) Analytics platform. "Using SAS Analytics, we can take meaningful subsets of the data from large data sets for extensive analysis," says Megibow.
The analyses allow Expedia to determine customer preferences and evaluate the effectiveness of different marketing channels. A user of SAS Analytics for several years, Expedia is increasingly able to benefit from the big data not only from its transactional and click-through data, but also from the growing volume of social media input. "It has been easy to leverage SAS analytics alongside our other technologies to support our efforts to analyze large stores of data," Megibow adds. Within the past year, Expedia has carried out large-scale analyses of customer behavior over time and developed models that help determine causal relationships between its marketing efforts and customer response.
An important goal of Expedia's analytical efforts is to discover which of the advertising links it invests in drive customers to its site and convert visitors into customers. "Most users visit multiple times before they complete a transaction," Megibow says. "Having this large volume of data stored and available for analysis provides us with important insights into customer behavior either at a particular time or over the longer term."
The emergence of big data techniques over the past few years has made a substantial difference in the analytic tasks that are feasible. "We always had big ideas for running queries against big data, but in the past, the technology did not yet support our analytical goals," Megibow explains. "Extracting a large amount of data could take a full day, and the analyses would have taken months. Now, we can pull big data sets in less than an hour and do our analyses in SAS in a timeframe that allows us to take business action on large volumes of data that we could not have before."
Comments carry weight
Big data capability is almost mandatory for analyzing social media. "There are many ways that social media relates to big data," Megibow says. "Text analytics lets us create data from large amounts of unstructured sources and build sentiment scores, which in turn can be related to consumer interests." People care a great deal about reviews and base decisions on them. "Travel is very social," Megibow emphasizes, "and a lot of weight is given to the comments of other travelers. People want to know what to expect when they open the door to a hotel, and these comments help reassure them about the destination, or in some cases redirect them to another one. We want to incorporate these observations into our analyses."
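The sentiment scores Megibow describes can be sketched, in greatly simplified form, as lexicon-based counting. The word lists below are illustrative only; production text analytics platforms such as SAS's use far more sophisticated linguistic models:

```python
# Hypothetical sentiment lexicons for hotel reviews.
POSITIVE = {"great", "clean", "friendly", "comfortable"}
NEGATIVE = {"dirty", "noisy", "rude", "broken"}

def sentiment_score(review):
    # Score = (positive hits - negative hits) / total hits,
    # a value in [-1.0, 1.0]; 0.0 when no lexicon words appear.
    words = review.lower().split()
    pos = sum(w.strip(".,!") in POSITIVE for w in words)
    neg = sum(w.strip(".,!") in NEGATIVE for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

score = sentiment_score("Great location, friendly staff, but noisy at night.")
print(round(score, 3))  # 0.333 -- mildly positive overall
```

Scores like these, computed over millions of traveler comments, become structured data that can be joined against transactional and click-through records for the kind of combined analysis Beyer describes.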
Megibow championed Expedia's initiative to support data-driven decision-making based on analytics. The effort required consolidating data from many sources into a system that allowed for timely analysis. Hadoop is "bleeding-edge" technology, and many companies have difficulty finding staff with extensive experience in this emerging area. However, staff trained in classical statistics can readily use SAS Analytics without needing the more esoteric expertise in Hadoop. The blending of multiple technologies has allowed Expedia to meet its goals for both big data performance and usability in its analyses.
A significant change resulting from the new technologies for big data is the ability to analyze all the existing data rather than sampling it. "Different data is relevant to different people in the organization," says Mark Troester, an information technology and CIO thought leader at SAS. "Previously, organizations had to pick and choose what they analyzed, but combining SAS Analytics with today's big data technologies lets them do it all." More complex models can be developed based on different segments of the population, and can be run quickly to stay abreast of rapidly changing conditions.
Staffing up for big data
Most leading BI vendors are developing tools for handling big data. For example, Pentaho recently announced that it is making available as open source the big data capabilities in its Pentaho Kettle 4.3 release. Pentaho Kettle can input, output and analyze data using Apache Hadoop and NoSQL stores. "What our product brings to the big data market is data discovery, analytic data mining and full visualization," says Ian Fyfe, chief technology evangelist and VP for product marketing at Pentaho. "The Hadoop framework is ideal for managing all kinds of data, such as video and audio, XML documents and blogs; but for interpreting and presenting the results, a BI solution is the most effective approach."
Although a lot of hype surrounds big data, it is a genuine phenomenon, says Fyfe. "Growth is nearly exponential," he maintains, "and not a lot of people fully understand it." Data scientists who are highly educated and mathematically literate are the most able to develop and test hypotheses about the data. Those who have mastered the technology are in high demand. BI solutions that can work synergistically with big data solutions will help expedite an organization's entry into this field.
Consulting companies are also responding to this need. Mu Sigma provides decision sciences and analytic services, and has ramped up its capabilities in big data. "We can provide support from direct services to training in Hadoop and related technologies," says Zubin Dowlaty, VP at Mu Sigma and head of innovation and development. "Big data is a very disruptive technology. For the first time, organizations have access to the type of analytics that previously could only be performed by expensive, high-performance computers."