There’s a reason why the big data trend is keeping many companies so busy. We’re creating vast amounts of data every day. According to a study conducted by IBM, 2.5 quintillion bytes of new data are generated on a daily basis through things like news, online transactions, or social networking. How can we get to grips with the overwhelming task of analyzing this information? What exactly does this megatrend hold in store for us? We asked Erwin Selg, CTIO of the GFT Group, for his expert opinion.
JB: Hello Erwin. Big data seems omnipresent at the moment. We know it’s all about data, but what makes this big data megatrend so different from conventional data processing and analysis?
ES: Enormous data volumes are being generated these days, doubling in size every two years, primarily on account of the increasing number of electronic data sources. Conventional database and graphical representation technologies simply can’t process these volumes. We need new ways to solve this problem, ideas for how to cope with the collection, storage, distribution, search techniques, analysis and graphical representation of such large volumes – in acceptable runtimes. Add to that the fact that part of this data is unstructured. So another challenge we face lies in gathering insights from this mixture of structured and unstructured information, like all the millions of websites out there.
JB: We’re now hearing the term “fast big data” more and more. Analyzing huge data volumes doesn’t seem to be the problem, but rather gaining real-time accessibility. People often point to solutions like SAP HANA and Apache Hadoop in this context. Which technological solutions exist at present? And which ones do you think will establish themselves?
ES: It’s important to make a distinction here. If you’re carrying out a long-term analysis of geological data you don’t need a real-time solution. But if you’re a supermarket and you want to analyze every single buying decision made by your customers on a nightly basis (as a popular American store chain does) in order to adjust your prices for the next day, you need to do it quickly. The time needed for this kind of analysis has now been reduced from several days to 20 minutes. This was made possible by using in-memory technology – technology that is sure to become standard as the existing platforms mature and storage costs continue to drop. Companies are particularly interested in the kinds of real-time analysis solutions that can eliminate, or at least greatly speed up, the lengthy data transfers from transactional database systems to the analysis systems. SAP HANA has an interesting way of doing this. No doubt cloud-based services will also come into play. They make use of enormous in-memory platforms – real-time analysis for everyone. Small medical labs, R&D departments at small and medium-sized companies, or nuclear researchers could all easily use such grids, on demand.
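The nightly analysis Erwin describes boils down to aggregating a day’s transactions while they sit in memory, rather than exporting them to a separate analysis system first. A minimal sketch of that idea – the product names and record layout are purely illustrative, not taken from any real retailer’s system:

```python
from collections import defaultdict

# Hypothetical transaction records for one day: (product_id, units_sold, unit_price)
transactions = [
    ("milk", 2, 1.10),
    ("milk", 1, 1.10),
    ("bread", 3, 0.90),
    ("bread", 1, 0.90),
]

def nightly_summary(transactions):
    """Aggregate one day's transactions entirely in memory.

    Returns per-product unit and revenue totals that a pricing
    process could read the next morning.
    """
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for product, units, price in transactions:
        totals[product]["units"] += units
        totals[product]["revenue"] += units * price
    return dict(totals)

summary = nightly_summary(transactions)
# summary["milk"] -> {"units": 3, "revenue": 3.30}
```

The point of in-memory platforms is that this kind of pass scales to billions of rows because nothing is written to or read from disk mid-analysis; the Python loop here only stands in for what a column store does in parallel.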
JB: But what’s the situation in practice? How many companies have already implemented these kinds of solutions? This often affects internal processes. So how long does it generally take before a project of this magnitude is completely implemented?
ES: Developments are still in their infancy. Most companies that have started looking at the big data issue haven’t moved past the experimental stages. Only a few companies – those that came up with a clear business case for a switchover despite the high initial costs – have actually implemented a productive, usable platform. There’ll be huge changes in the coming years as infrastructure costs go down and the technology matures. If companies turn their back on this deluge of knowledge they risk gradually blending into the background and losing their competitive edge. Gartner estimates that approximately 35% of large and medium-sized companies will efficiently implement an in-memory solution by as early as 2015.
ES: Indeed, in the not-so-distant future, most companies won’t be able to accommodate the amount of data they produce – and this despite falling storage costs. Outsourcing data might be an option, but it will definitely be important to think beyond that. Lots of data does not necessarily mean quality data: quantity doesn’t equate to quality. So another crucial step will be to segment the data and find a strategy for avoiding unnecessary data accumulation. Which data do we really need? Which data is relevant for real-time analysis now, but will no longer be meaningful further down the line? Segmenting is also crucial if you want to outsource data. Less security-sensitive data can then be stored in the cloud (if possible, a national cloud), while the more sensitive data is kept on the premises for the time being. Last but not least, it’s also important to think about advancements in compression technology.
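The segmentation strategy Erwin outlines – sensitive data on premises, the rest in the cloud – can be pictured as a simple routing rule applied before storage. The field names and the two-tier split below are illustrative assumptions, not a prescription from the interview:

```python
# Hypothetical rule: any record carrying a sensitive field stays on premises;
# everything else may be outsourced to (ideally national) cloud storage.
SENSITIVE_FIELDS = {"patient_id", "account_number"}

def storage_tier(record):
    """Return the storage tier for a record based on its field names."""
    if SENSITIVE_FIELDS & record.keys():
        return "on_premises"
    return "cloud"

records = [
    {"patient_id": 42, "result": "negative"},      # sensitive -> on premises
    {"region": "EU", "page_views": 1800},          # aggregate stats -> cloud
]

placement = [storage_tier(r) for r in records]
# placement -> ["on_premises", "cloud"]
```

In practice such rules would be driven by a data classification policy rather than hard-coded field names, but the core idea – segment first, then decide where each class of data lives and how long it is kept – is the same.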
JB: What options does big data offer in terms of business intelligence (BI)?
ES: Until now, BI in companies has been hampered by the fact that data is no longer current by the time it’s used for analysis. Analysis takes too long and can’t aid decision-making. In the cocktail of big data issues, real-time technologies no doubt make the most direct contribution to improving corporate BI. Another major drawback with current corporate BI is that almost the only data being referred to is structured data. By using big data technology to include unstructured information, you add more context to the process, and the results of an analysis paint a more realistic picture. Pattern and correlation recognition also improves. Big data technology enables companies to expand their knowledge to “the real world” and not just base it on business intelligence derived from internal data. This is made possible by things like systematic corporate intelligence and social intelligence.
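The idea of adding context by combining structured figures with unstructured text can be illustrated with a deliberately naive sketch: structured sales numbers enriched with a crude keyword-based sentiment count over free-text customer comments. All names, comments, and word lists here are invented for illustration; real systems would use proper text analytics rather than keyword matching.

```python
# Structured side: internal sales figures per product (hypothetical).
sales = {"product_a": 1200, "product_b": 300}

# Unstructured side: free-text customer comments (hypothetical).
comments = [
    "product_a is great, love it",
    "product_b broke after a week",
    "product_a works great",
]

POSITIVE = {"great", "love"}
NEGATIVE = {"broke", "bad"}

def sentiment_score(product):
    """Naive score: positive minus negative keywords in comments mentioning the product."""
    score = 0
    for text in comments:
        if product in text:
            words = set(text.replace(",", "").split())
            score += len(words & POSITIVE) - len(words & NEGATIVE)
    return score

# Enriched view: sales figures now carry real-world context from the text.
enriched = {p: {"units": u, "sentiment": sentiment_score(p)}
            for p, u in sales.items()}
# enriched["product_a"] -> {"units": 1200, "sentiment": 3}
```

Even this toy join shows the payoff Erwin describes: the structured number (300 units of product_b) reads differently once the unstructured data reveals why customers are unhappy.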