Three ways to leverage big data analytics using Hadoop

Many companies want to take the plunge into a big data project using the open source Hadoop framework but are concerned about technical, cost, and control issues. Here's a breakdown of the basic considerations for three different types of Hadoop deployments.

The world is awash in data. It comes from all kinds of sources—applications, mobile devices, point-of-sale terminals, assembly line sensors, industrial robots, warehouse scanners, video cameras, and increasingly, emerging technologies such as self-driving vehicles and drones. 

As the data available to companies and other organizations increases, the question of how to derive value from it has become an urgent priority. Big data analytics offers a way to identify patterns and correlations in massive datasets and apply them to real-world problems.

For instance, in the financial services industry, banks and securities firms use big data to support compliance efforts and spot signs of money laundering and fraud. In manufacturing, big data can be used for bottleneck analysis, better QA, and supply chain optimization. Retailers can use big data to improve customer segmentation and store design.

These and other applications of big data analytics are encouraging more companies to invest in the technology. According to IDC, global revenue for big data and business analytics will grow at a compound annual rate of 11.9 percent over the next several years, reaching more than $210 billion by 2020.

When it comes to working with big data, the open source Apache Hadoop ecosystem is widely supported by major vendors and has superior cost benefits and scalability when compared with traditional relational database management systems. Hadoop can power everything from deep analysis of customer buying behavior to combing through millions of photographs that match a certain pattern. Unlike traditional databases, Hadoop can work with both structured and unstructured data. It’s no wonder that companies as diverse as Netflix, JPMorgan, and BMW have used Hadoop and big data to understand their customers and improve their businesses.

There are several ways to launch a big data project based on Hadoop:

  1. Deploy Hadoop in a corporate data center
  2. Leverage a cloud-based Hadoop service
  3. Use an on-premises Hadoop service supported and operated by an experienced vendor

The pros and cons of the three approaches are outlined below.

Why companies fear a Hadoop rollout in their own data centers

A company may be interested in Hadoop for its own big data applications or may want to offer Hadoop as a service to its customers. Regardless, configuring a corporate data center to support Hadoop is fraught with challenges, including:

  • Technical. Many organizations may not have the internal expertise to set up or manage this highly complex technology.
  • Time to value. Business and IT managers are interested in gaining new insights from big data or offering big data as a service to customers. The time required to get a working cluster up and running may determine the winners and losers, with faster companies gaining a competitive edge.
  • Security and control. There are security issues around sensitive data, including banking details, health data, and records subject to compliance requirements.
  • Costs. Companies want to understand the ongoing costs of running Hadoop—and how much they will need to spend to expand capacity.  

The technical requirements for Hadoop are unique. Its distributed computing model was designed around a specific, uniform hardware configuration, which keeps the hardware setup straightforward. Both structured and unstructured data, whether call logs, video feeds, transaction records, or some other format, are stored on server nodes with local disk drives rather than on a dedicated storage system.

In a Hadoop cluster, head nodes handle management and control tasks, while each worker node holds a subset of the data and processes it locally, so commands run on many servers in parallel. Because each block of data is replicated on multiple nodes (three copies by default), the loss of one server does not halt processing: other servers holding the same data simply carry on. Hadoop can suffer outages because of capacity limits or a misconfigured system, but it is difficult to crash because of a hardware failure. Hadoop is also designed to scale quickly; adding capacity is usually a matter of adding more worker nodes. This built-in fault tolerance and scalability are what make Hadoop such an attractive framework.
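To make the replication idea concrete, here is a minimal Python sketch that models how the blocks of a file might be spread across worker nodes with three copies each, and why losing a single node does not lose data. The node names, block count, and replication factor are illustrative assumptions, not output from a real cluster or a real Hadoop API.

```python
import random

# Illustrative model of HDFS-style block placement (not a real Hadoop API).
REPLICATION = 3                                   # assumed default replication factor
workers = [f"worker-{i}" for i in range(1, 7)]    # hypothetical worker nodes
blocks = [f"block-{i}" for i in range(1, 11)]     # a file split into 10 blocks

# Place each block on REPLICATION distinct workers.
placement = {b: random.sample(workers, REPLICATION) for b in blocks}

def readable_blocks(failed_node):
    """Blocks still readable after one worker node fails."""
    return [b for b, nodes in placement.items()
            if any(n != failed_node for n in nodes)]

failed = "worker-3"
survivors = readable_blocks(failed)
print(f"{failed} failed; {len(survivors)}/{len(blocks)} blocks still readable")
# With 3 copies per block, every block survives a single node failure.
```

In a real cluster, HDFS goes further: when a node fails, it automatically re-replicates the affected blocks onto healthy nodes to restore the replication factor.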

The complexity of Hadoop lies not in the hardware but in the software. The ecosystem is a zoo of standard tools with unusual names (YARN, Spark, Pig, and Mahout, to name a few) plus specialized applications sitting on top of the hardware. Getting all of these components to work in unison with the underlying operating system, middleware, and server firmware is a huge challenge. I've talked with customers who need big data analytics but don't have the engineering talent required to hold everything together. Even if an organization does have data scientists and engineers on staff, there is real worry about how long it will take to build an operational Hadoop environment.
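As a small illustration of what "sitting on top of the hardware" looks like in practice, the sketch below uses PySpark, one of the ecosystem tools mentioned above, to count words in a file stored on HDFS. The HDFS path is a placeholder, and the snippet assumes a cluster where Spark is already installed and wired up to YARN and HDFS; getting to that point is exactly the integration work described above.

```python
from pyspark.sql import SparkSession

# Assumes Spark is installed and the cluster's YARN/HDFS configuration is
# already in place; "hdfs:///data/call_logs.txt" is a placeholder path.
spark = (SparkSession.builder
         .appName("word-count-sketch")
         .getOrCreate())

lines = spark.sparkContext.textFile("hdfs:///data/call_logs.txt")

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

for word, count in counts.take(10):                  # sample of the results
    print(word, count)

spark.stop()
```

Writing a job like this is the easy part; the operational challenge is keeping Spark, YARN, HDFS, and the rest of the stack on compatible versions and tuned to the hardware underneath.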

Hadoop in the cloud won’t work for everyone

There is increasing interest in Hadoop deployments in the cloud. The idea is simple: Instead of a Hadoop environment running in the data center, the data and analytics are run on virtual Hadoop clusters provided by a cloud services provider.

Vendors offering cloud-based Hadoop services promise to take away the pain of complex Hadoop deployments, as well as reduce staff and other operational costs. It’s an attractive proposition for companies that don’t have the in-house technical expertise to deploy and manage Hadoop on their own. 

The public cloud won’t work for everyone, though. Customers in certain verticals as well as many service providers won’t upload sensitive data to the cloud because of security restrictions, regulatory requirements, and a desire to avoid latency and cloud service outages.

On-premises Hadoop supported and operated by a trusted services provider

There is a third path companies can take that addresses concerns over complexity and the cloud: an on-premises Hadoop architecture supported and operated by a trusted partner.

This type of relationship entails more than merely hiring consultants to set up a Hadoop cluster. A partner with deep Hadoop expertise can take care of the early heavy lifting required to get a working environment in place and can provide reference architectures that simplify and speed deployments. Just as important, the partner should be around for the long haul to help clients operate the Hadoop ecosystem and scale it up as needed. Control stays with the customer rather than with a cloud service provider.

An on-premises Hadoop architecture supported and operated by a partner can also address concerns around capacity planning. This is a big headache for businesses worried about not having enough available capacity, or overprovisioning and paying for resources that are never used.

Depending on the vendor, this type of Hadoop deployment can offer pay-as-you-go pricing, a critical feature for big data environments with rapid growth or unpredictable needs. Metered usage is based on the number of compute and storage nodes rather than on processor counts, user licenses, or hourly service fees. Growing the platform is simple: provision more worker nodes to increase capacity. This consumption model is similar to cloud-based pricing but has the security benefit of keeping data on premises instead of on someone else's servers.
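As a back-of-the-envelope illustration of node-metered pricing, the short sketch below compares the monthly cost of a cluster before and after adding worker nodes. The per-node rates and node counts are made-up placeholder figures, not any vendor's actual pricing.

```python
# Hypothetical node-metered pricing; the rates below are illustrative only.
COMPUTE_NODE_RATE = 900.0   # assumed $/node/month for compute (worker) nodes
STORAGE_NODE_RATE = 400.0   # assumed $/node/month for storage-heavy nodes

def monthly_cost(compute_nodes, storage_nodes):
    """Cost under a simple pay-per-node model."""
    return compute_nodes * COMPUTE_NODE_RATE + storage_nodes * STORAGE_NODE_RATE

baseline = monthly_cost(compute_nodes=10, storage_nodes=4)
expanded = monthly_cost(compute_nodes=16, storage_nodes=4)   # add 6 worker nodes

print(f"Baseline: ${baseline:,.0f}/month")
print(f"Expanded: ${expanded:,.0f}/month (+${expanded - baseline:,.0f})")
```

The same arithmetic makes it easy to see what adding worker nodes will cost before they are provisioned, which is the core of the capacity-planning argument above.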

Hadoop has opened up new ways of deriving value from diverse data sources. Using Hadoop for big data analytics can lead to new products, better efficiency, competitive advantages, and insights into customer behavior. As the market for Hadoop services matures, companies have more options when it comes to deploying and managing big data projects. These options not only reduce technical hurdles, but also speed time to value through faster deployment and quicker availability of insights from big data analytics.

How to leverage big data analytics using Hadoop: Lessons for leaders

  • Hadoop's distributed computing framework makes it possible to find patterns and correlations in large datasets far more efficiently and cost-effectively than with traditional relational databases.
  • Hadoop uses a standard hardware architecture that is simple to design and implement; the software environment, however, can be very complex and requires specialized skills.
  • Cloud-based Hadoop installations are widely available and offer simple metered pricing, but they may not work for organizations that have stringent security, latency, or uptime requirements.
  • An on-premises Hadoop environment supported and operated by a trusted partner can not only reduce technical complexity, but also offer a simple pay-as-you-go pricing model.

This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.