Hadoop
What is Apache Hadoop?
Apache Hadoop provides an open-source framework that enables the distributed processing of large data sets across clusters of compute resources. Its design can scale from single servers to thousands, each offering local compute and storage capabilities.
Why is Hadoop useful?
The explosion of Big Data and data-collecting devices throughout business operations offers companies significant opportunities to innovate and succeed. Because Hadoop detects and handles failures at the application layer rather than the hardware layer, it can deliver high availability on top of a cluster of computers, even when the individual servers are prone to failure.
How was Hadoop developed?
Hadoop was born out of a need to process increasingly large volumes of Big Data and was inspired by Google’s MapReduce, a programming model that divides an application into smaller components to run on different server nodes. Unlike the proprietary data warehouse solutions that were prevalent at the time it was introduced, Hadoop makes it possible for organisations to analyse and query large data sets in a scalable way using free, open-source software and off-the-shelf hardware. It enables companies to store and process their Big Data with lower costs, greater scalability and increased computing power, fault tolerance and flexibility. Hadoop also paved the way for further developments in Big Data analytics, such as Apache Spark.
What are the benefits of Hadoop?
Hadoop has five significant advantages that make it particularly useful for Big Data projects. Hadoop is:
1. Scalable
Because it can store and distribute large data sets across hundreds of inexpensive servers that operate in parallel, Hadoop is highly scalable. Unlike traditional relational database management systems (RDBMSes), Hadoop can scale up to run applications on thousands of nodes involving thousands of terabytes of data.
2. Flexible
Hadoop can tap into both structured and unstructured data to generate value. This allows businesses to derive business insights from a variety of data sources, such as social media channels, website data and email conversations. In addition, Hadoop can be used for purposes ranging from recommendation systems, log processing and data warehousing to marketing campaign analysis and fraud detection.
3. Cost-effective
Traditional RDBMSes are prohibitively expensive to scale to Big Data volumes. Companies using such systems previously had to delete large amounts of raw data, as it was too expensive to keep everything they had. In contrast, Hadoop’s scale-out architecture makes it much more affordable for a company to store all its data for later use.
4. Fast
Hadoop employs a unique storage method based on a distributed file system that maps data wherever it is located on a cluster. Plus, its tools for data processing are often on the same servers where the data is located, allowing for much faster data processing. Because of these features, Hadoop can efficiently process terabytes of unstructured data in minutes and petabytes in hours.
5. Fault tolerant
Data stored on any node of a Hadoop cluster is replicated on other nodes of the cluster to prepare for the possibility of hardware or software failures. This intentionally redundant design ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
Hadoop makes handling large data sets safely and cost-effectively much easier than traditional RDBMSes do. And its value for a business increases as the amount of unstructured data that the organisation possesses grows. Hadoop is well suited for search functionality, log processing, data warehousing, and video and image analysis.
How does Hadoop work?
HDFS
The Hadoop Distributed File System (HDFS) allows massive amounts of data to be stored in various formats and distributed across a Hadoop cluster. It provides high-throughput access to application data and is suitable for applications that have large data sets. Unlike some other distributed file systems, HDFS is highly fault tolerant and designed to be deployed on low-cost, commodity hardware.
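To make the storage layer more concrete, the following is a minimal Java sketch that writes and then reads a file through the HDFS FileSystem API. The NameNode address and file path are illustrative placeholders rather than details taken from this article; on a real cluster they would normally come from the core-site.xml configuration.

// Minimal sketch of writing and reading a file through HDFS using
// Hadoop's Java FileSystem API. The cluster address and paths are
// illustrative placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/example/hello.txt");

            // Write a small file; HDFS replicates its blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}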
MapReduce
The MapReduce module is both a programming model and a Big Data processing engine used for the parallel processing of large data sets. With MapReduce, the processing logic is sent to various slave nodes and then the data is processed in parallel across these different nodes. The processed results are then sent to the master node where they are merged and this response is sent back to the client. Originally, MapReduce was the only execution engine available in Hadoop, but later Hadoop added support for others, such as Apache Tez and Apache Spark.
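As an illustration of the programming model, here is a sketch of the classic word-count job written against Hadoop's Java MapReduce API. The map tasks emit (word, 1) pairs alongside the data, and the reduce tasks merge the partial counts into totals; the input and output paths are supplied on the command line and are purely illustrative.

// Sketch of the classic word-count job using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged into a JAR and submitted with the hadoop jar command, with YARN (described next) allocating the containers in which the map and reduce tasks run.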
YARN
Hadoop’s Yet Another Resource Negotiator (YARN) is another core component in the Hadoop framework. It is used for cluster resource management, planning tasks and scheduling jobs that are running on Hadoop. It allows for parallel processing of the data stored across HDFS. YARN makes it possible for the Hadoop system to make efficient use of available resources, which is crucial for processing a high volume of data.
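As a small illustration of how applications interact with this resource layer, the sketch below uses Hadoop's YarnClient API to ask the ResourceManager how many NodeManagers are offering resources and which applications are currently tracked. It assumes the machine it runs on has a yarn-site.xml pointing at the cluster's ResourceManager; the printed fields are illustrative.

// Minimal sketch that queries YARN for basic cluster information.
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml from the classpath to locate the ResourceManager.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // How many NodeManagers are contributing resources to the cluster.
        int nodeManagers = yarnClient.getYarnClusterMetrics().getNumNodeManagers();
        System.out.println("NodeManagers in cluster: " + nodeManagers);

        // List the applications YARN is currently tracking and their states.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " "
                    + app.getName() + " "
                    + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}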
How is Hadoop used?
Companies across many industries are using Hadoop for Big Data analytics to drive a range of benefits for their organisations.
Financial services firms
Financial organisations are leveraging Hadoop to make critical investment decisions and reduce risk. Financial and banking firms use Big Data analysis to approve and reject loan and credit card applicants with greater accuracy. This analysis is also used to identify potentially suspicious account activity based on past purchasing behaviour. Insurance companies are also using Hadoop to help them detect and prevent fraudulent claims. Medical insurers can leverage Big Data to formulate policies tailored to specific patient demographics. Hadoop is also being used to gain insights from online chat conversations with customers to improve the quality of service delivery and to create more personalised customer experiences.
Telecommunications
Telecom providers regularly generate large amounts of data at massive velocity and maintain billions of call records. Big Data is used to help generate accurate billing details for millions of customers and estimate future bandwidth demand and customer communication trends. This information is then used for future infrastructure planning as well as for the creation of new products and services for customers.
Healthcare
The healthcare industry has enormous amounts of data available to it through patient records, research and trial data, electronic health devices and more. Hadoop provides unconstrained parallel data processing, fault tolerance and storage for billions of medical records. The platform is also used to analyse medical data, which can then be used both to evaluate public health trends for populations of billions and to create personalised treatment options for individual patients based on their needs.
Retail
The massive amounts of data retailers generate today require advanced processing. Historical transaction data can be loaded into a Hadoop cluster to build analytics applications that predict demand, forecast inventory, create targeted promotions and predict consumer preferences.
HPE solutions for Hadoop
The HPE Elastic Platform for Big Data Analytics (EPA) is designed as a modular infrastructure foundation to address the need for a scalable multi-tenant platform. It does this by enabling independent scaling of compute and storage through infrastructure building blocks that are optimised for density and workloads. Two different deployment models are available:
- HPE Balanced and Density Optimized (BDO) system: Supports conventional Hadoop deployments that scale compute and storage together, with some flexibility in choice of memory, processor and storage capacity.
- HPE Workload and Density Optimized (WDO) system: Harnesses the power of faster Ethernet networks and enables a building block approach to independently scale compute and storage, letting you consolidate data and workloads that grow at different rates.
HPE also offers a scalable solution that radically simplifies your experience with Hadoop. It allows you to offload much of the complexity and cost of your Hadoop environment so that you can focus on deriving intelligence from your Hadoop cluster(s). Supporting both symmetrical and asymmetrical environments, HPE GreenLake offers a complete end-to-end solution for Big Data that includes hardware, software and services. HPE experts will get you set up and operational, and help you manage and maintain your cluster(s). They will also simplify billing, aligning it with business KPIs. With HPE’s unique pricing and billing method, it is much easier to understand your existing Hadoop costs and to better predict the future costs associated with your solution.