The Truth About Hadoop


Hadoop-LogoThe press often talks about Hadoop as if it were a single technology or even a product. This is far from the truth and creates a false expectation in the industry about the complexity of using Hadoop in real-life situations. I thought I would clarify this and provide a quick overview of the key components of Hadoop and how they can be used in the financial services industry.

To start, Hadoop is not a single technology or product; it is a framework of related open source technologies that provides reliable and scalable distributed computing.

Hadoop is based upon two core technologies:

  • HDFS (Hadoop Distributed File System) – a distributed file system. We have used HDFS, for example, to store XML files representing a huge archive of trade data. HDFS permits years and years of data to be stored, current and archived data together, something a single file system or database is unable to do.
  • MapReduce – a programming model for distributed computing. MapReduce provides a Java API to break an “embarrassingly parallelizable” job into pieces and federate it to a number of servers. We have used MapReduce to improve the performance of ETL processes, for example the daily loading, standardization, and enrichment of data of tens of millions trades collected from across a large investment bank.

With HDSFS and MapReduce we can do a lot, but there are many other tools in the Hadoop shed which are necessary to build a complete, enterprise-ready solution: 

  • Sqoop – bulk transfers of data between HDFS and relational databases
  • Flume – collection and streaming of data into HDFS
  • Oozie – coordination and scheduling of MapReduce jobs so they can be linked together to create a more complex workflow
  • Zookeeper – maintenance of configuration information and synchronizing services across Hadoop

HDFS and MapReduce are also very “low-level” technologies which aren’t always very easy to use, especially for non-technical folk. Many tools have been built on top of them to provide a simpler interface, in particular for data warehousing and data analytics:

  • HBase – a distributed database for storing huge volumes of tabular data (“millions of columns, billions of rows”).
  • Hive – data warehouse infrastructure for ad-hoc querying. With its similar syntax, Hive is very useful in the hands of analysts who already dominate SQL.
  • Pig – data flow execution framework and procedural language which enables data processing and analysis. Like Hive, Pig creates MapReduce jobs in execution and is a good way to get away from low level programming.
  • Mahout – a scalable machine learning and data mining library. Mahout provides algorithms for clustering, classification, and collaborative filtering and is helpful in sorting through huge volumes of data to extract insight.

As you see, Hadoop isn’t one thing – it’s a collection of tools and technologies. When a client says “we want to use Hadoop”, I know there are still a lot of technology choices remaining. We work with our clients to understand their problem and design a solution which brings all the different pieces together. It definitely isn’t as easy as the press makes it seem.

Karl Rieder is Executive Consultant at GFT. His article appeared on Finextra.