Big data is a hot topic these days, and with good reason. The emerging technology provides new ways of analyzing massive amounts of data and extracting business value for a multitude of purposes. KMWorld senior writer Judith Lamont interviewed four experts in this fast-moving field, who share their insights here. They are: Kapil Bakshi, chief architect for Cisco Public Sector; Anjul Bhambhri, VP of big data, IBM; Charles Zedlewski, VP, products, Cloudera; and Dan Vesset, program VP for business analytics solutions, IDC.
Q Lamont: How did big data get its start?
A Vesset: Much of the initial excitement about big data was triggered by companies like Google, Yahoo, Amazon, Facebook and Twitter, all of which produce a lot of clickstream data that is valuable only if it is collected and analyzed. The volume and flow of information were such that traditional Web analytics methods could not handle them.
Q Lamont: Why has big data become so important lately?
A Zedlewski: Data volume is now growing faster than Moore's Law, and the old approaches by which companies met the challenge of increased data are not sustainable. Also, there is a whole class of problems that have not been addressed because no available solution was scalable, economical and flexible. With the new Hadoop technology, which can scale across thousands of commodity servers, solutions to these problems become feasible.
Bakshi: The amount of digital information being collected and stored is growing exponentially, especially unstructured data. According to one Cisco study, global IP traffic will reach 1.3 zettabytes annually by 2016, which is a fourfold increase from 2011. By 2016 there will be 19 billion global network connections, the equivalent of two-and-a-half connections for every person on earth. This new tsunami of data is being generated by new types of sources, mostly machine-generated, such as sensors, smartphones and other Internet-connected devices. All these trends together mean that a huge amount of data needs to be moved, collected, stored and analyzed to create value out of it.
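As a rough check on the figures quoted above, the short Python sketch below works out what they imply: the 2011 traffic level, the average yearly growth rate and the world population behind the "two-and-a-half connections" ratio. The constant compound growth rate is an assumption made only for illustration, since the interview gives just the 2011 and 2016 endpoints, and the function names are hypothetical.

```python
# Figures as quoted in the interview (Cisco study cited by Bakshi).
TRAFFIC_2016_ZB = 1.3          # zettabytes of global IP traffic per year in 2016
GROWTH_FACTOR = 4.0            # "fourfold increase from 2011"
YEARS = 2016 - 2011            # span covered by the forecast
CONNECTIONS_2016 = 19e9        # global network connections in 2016
CONNECTIONS_PER_PERSON = 2.5   # "two-and-a-half connections for every person"


def implied_2011_traffic_zb():
    """Traffic in 2011 implied by the 2016 figure and the fourfold increase."""
    return TRAFFIC_2016_ZB / GROWTH_FACTOR


def implied_annual_growth_rate():
    """Average yearly growth, ASSUMING constant compound growth (not stated in the source)."""
    return GROWTH_FACTOR ** (1 / YEARS) - 1


def implied_population():
    """World population implied by the connections-per-person ratio."""
    return CONNECTIONS_2016 / CONNECTIONS_PER_PERSON


if __name__ == "__main__":
    print(f"Implied 2011 traffic: {implied_2011_traffic_zb():.3f} ZB per year")
    print(f"Implied annual growth: {implied_annual_growth_rate():.1%}")
    print(f"Implied population: {implied_population() / 1e9:.1f} billion")
```

Under these assumptions the quoted numbers correspond to roughly a third of a zettabyte of traffic in 2011, growth of about 32 percent a year, and a world population of about 7.6 billion.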
Bhambhri: Businesses are realizing that they need to make decisions based on all the data that is available, particularly the 80 percent that is unstructured. Information is now coming from Facebook, Twitter and many other sources that did not exist before. People can express themselves in ways they previously could not. This information is valuable, particularly in consumer markets. Companies that do not tap into this information, but only look at point-of-sale information, are missing out on a lot of insights, and they are increasingly recognizing this.
Q Lamont: What are the primary drivers for big data?
A Vesset: One is efficiency. Finding the right tool for a given workload is important. Relational databases are not the most efficient way to store and process large semi-structured or unstructured data sets, so users are looking for alternatives. Another is innovation. Big data analytics lets organizations do things that were not feasible before, either because the technology was not there or because it was too expensive. Finally, compliance is a large and growing problem because large amounts of data need to be stored for longer periods and sometimes retrieved relatively quickly.
Q Lamont: What role do Cisco, Cloudera and IBM respectively play in big data?
A Bakshi: First, Cisco enables the connected systems of the Internet of Things, which is the main source of (machine-generated) big data. Second, we are addressing the data-in-motion aspect of big data, and Cisco's networking products support capturing and moving large data sets. Third, with our ecosystem analytics partners, we are providing data center network and unified computing-based architectures and solutions for big data analytics. Here we are focusing on MapReduce, NoSQL, in-memory databases and massively parallel database system architectures.
Zedlewski: Cloudera is an open source data management platform company. We provide a system that includes Apache Hadoop and other subsystems that let enterprises store, process and analyze large volumes of data. We have 400 partners that develop packaged software applications for our platform, ranging from business intelligence vendors such as MicroStrategy to Hadoop startups. IBM's BigInsights runs on top of our platform, as do IBM's other big data tools, such as InfoSphere DataStage for ETL and data integration.
Bhambhri: IBM is making it easier for customers to handle big data through several offerings. InfoSphere BigInsights is a platform built on top of Hadoop that complements the open source product by providing analysis and visualization of data. InfoSphere Data Explorer is a discovery and navigation product that allows users to access and analyze big data along with data from enterprise applications. InfoSphere Streams analyzes data streams on a continuous basis, monitoring them for information in real time. Vivisimo can federate and integrate information from other enterprise sources to allow it to be incorporated into big data analyses.
Q Lamont: What are some of the viable use cases for big data analytics?
A Bakshi: Many government organizations have large amounts of data that can be analyzed productively. The use cases therefore span the government verticals of the DoD, the intelligence community, healthcare, citizen services, and scientific research and experimentation. Some of the more commonly discussed use cases include big data analytics for cybersecurity, intelligence, full motion video, electronic health records, financial fraud detection and scientific experimentation.
Vesset: Big data supports improved fraud detection, not only in banking but also in government and retail. It is also effective at complex optimization, such as the pricing of airline tickets, where many variables can affect the price. Sensor data is also a very big area. Railroads, for example, have sensors on the train cars to monitor the performance of wheels to assist in maintenance, the goal being to fix problems while they are small. Utilities that are now using smart meters are looking for ways of using that data to improve power availability, load balancing and outage response.