Is there life after Hadoop? The answer is definitely yes.
Over the past few years, Hadoop, the powerful open source framework for storing and processing data that takes its name from a stuffed toy elephant, has received plenty of praise. Of course, Hadoop has also been pronounced dead, or at least mostly dead, on more than one occasion. Yet many organizations that invested heavily in the Hadoop ecosystem now find themselves at a crossroads, wondering what life after Hadoop looks like. This article describes life after Hadoop and presents strategies for organizations entering the post-Hadoop era.
Remember the elephant
For many organizations, Hadoop served them well, enabling them to process large amounts of unstructured data. For some users, SQL-on-Hadoop solutions also helped offload work from more complex (and expensive) data warehouses. That said, the care and feeding of Hadoop, like that of any other elephant, was not trivial, especially when it came time to clean out the tub or the pen. I won't stray further into personal pet trauma, but suffice it to say that living with a large animal had its downsides.
For example, the 3x replication scheme for data stored on the Hadoop Distributed File System (HDFS 2.x) imposed a 200% overhead on storage and other resources. Additionally, because Hadoop clusters do not separate compute and storage, they chronically underutilized compute resources (i.e., CPU) while constantly maxing out the storage on each server. This contributed to another unfortunate by-product: the chaotic growth of Hadoop clusters. As data growth exploded, organizations kept adding Hadoop clusters, each with its own complex setup, poor utilization, and voracious appetite for storage.
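The storage math above is simple back-of-the-envelope arithmetic; a minimal sketch in Python (not tied to any particular HDFS version or tooling):

```python
def hdfs_raw_storage(logical_tb: float, replication: int = 3) -> float:
    """Raw disk consumed for a given amount of logical data under
    HDFS block replication (default factor of 3)."""
    return logical_tb * replication

def overhead_pct(logical_tb: float, replication: int = 3) -> float:
    """Extra storage beyond the logical data itself, as a percentage."""
    extra = hdfs_raw_storage(logical_tb, replication) - logical_tb
    return extra / logical_tb * 100

# 100 TB of data under 3x replication occupies 300 TB of raw disk:
print(hdfs_raw_storage(100))  # 300.0
print(overhead_pct(100))      # 200.0 -- the 200% overhead noted above
```

Erasure coding in HDFS 3.x reduces this overhead, but for the 3x-replicated clusters most organizations ran, two-thirds of every disk purchase went to copies.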
Yes, life with Hadoop was never easy, and Hadoop-based applications were no exception. Hadoop MapReduce provided plenty of power for manipulating large amounts of data, but at a cost. Due to the poor performance of its large, on-disk processing model, MapReduce was ill suited to supporting the new wave of data-driven applications. Additionally, the Hadoop market itself has suffered major upheavals in recent years, giving organizations pause as they consider Hadoop's future. Given these challenges and uncertainties, many organizations have concluded that, despite its usefulness, it is time for the elephant to leave.
Of course, organizations planning for life after Hadoop had two basic questions to answer:
- What should I do with Hadoop Distributed File System (HDFS) data?
- What should I do with the Hadoop-based applications (such as MapReduce jobs) that consume that data?
The answers might have seemed simple and obvious at first, but many organizations have learned that they are not as straightforward as initially thought. Managing huge amounts of data has never been easy, and the advent of distributed file systems like HDFS didn't eliminate the challenge. Likewise, MapReduce did not magically eliminate the challenges of distributed, data-driven applications.
The bottom line is that Hadoop was designed and optimized for the data needs of another era. Today's data landscape is very different from what it was ten years ago, yet the two main drivers of data technology adoption remain the same: price and performance. And Hadoop is no longer the leader in either category. This raises difficult questions with few easy answers. However, two clear strategies have emerged to help organizations enter the post-Hadoop era.
1 – Make a better lake
For a long time, the Hadoop data lake was the preferred strategy for handling large amounts of unstructured data: send everything to the lake and let the MapReduce apps sort it out. However, things weren't that simple, and most data lakes still harbored many redundant copies and inefficient data movements. Additionally, new technologies that challenge key data management assumptions have gradually replaced Hadoop services such as HDFS and MapReduce, making it clear that a better approach to managing large amounts of data is needed.
Enter the data fabric. Basically, a data fabric is a way to efficiently access and share data in a distributed environment. It brings together disparate data assets and makes them accessible through a managed set of broad data services. In essence, the data fabric picks up where the Hadoop data lake left off: it provides efficient, multiprotocol access to very large amounts of data while helping to minimize data movement and providing much-needed separation between compute and storage.
In today's data landscape, a single-protocol Hadoop data lake is no longer enough. Modern data fabrics like HPE Ezmeral Data Fabric, on the other hand, provide HDFS-style access to centrally managed data assets while offering significantly enhanced functionality.
2 – Optimize compute
As mentioned earlier, Hadoop MapReduce apps have provided plenty of muscle for processing data over the years, and they still work well for some tasks (such as distcp). For a variety of other use cases, however, MapReduce performance left much to be desired. As a result, new engines such as Spark were introduced to fill the gaps MapReduce left behind.
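For instance, a bulk copy between clusters with MapReduce-backed distcp might look like this (hostnames, ports, and paths are illustrative):

```shell
# Bulk-copy a directory tree between two HDFS clusters with distcp,
# which runs as a distributed MapReduce job under the hood.
hadoop distcp \
  hdfs://old-cluster-nn:8020/data/events \
  hdfs://new-cluster-nn:8020/data/events

# distcp can also target object storage (via the s3a connector),
# a common pattern when migrating data off HDFS entirely:
hadoop distcp hdfs://old-cluster-nn:8020/data/events s3a://my-bucket/events
```

Embarrassingly parallel bulk work of this kind is exactly where MapReduce still earns its keep.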
Spark introduced significant innovations, moving beyond MapReduce's limited functional grammar (map and reduce operations over rows of data) to a directed acyclic graph (DAG) execution model and a columnar representation of structured data. This approach is well suited to managing advanced workloads such as machine learning and graph analytics. The combination of Spark's innovations and its in-memory processing model also drastically improved performance: in some cases, Spark is 100 times faster than MapReduce.
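The contrast between the two programming models can be illustrated in plain Python. This is a toy sketch of the two styles, not actual Hadoop or Spark API calls:

```python
from collections import Counter
from functools import reduce

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# MapReduce's functional grammar: a map phase that emits (key, value)
# pairs, followed by a reduce phase that aggregates values per key.
mapped = [(word, 1) for line in lines for word in line.split()]
counts_mr = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    mapped,
    {},
)

# A DAG-style engine instead lets you chain arbitrary transformations.
# The lazy generators below mimic Spark's lazy evaluation: nothing runs
# until the final aggregation pulls data through the whole pipeline,
# so a planner could optimize across all stages before execution.
words = (w for line in lines for w in line.split())   # stage 1: tokenize
keep = (w for w in words if w != "the")               # stage 2: filter a stop word
counts_dag = Counter(keep)                            # stage 3: aggregate

print(counts_mr["the"])   # 3
print(counts_dag["fox"])  # 2
```

In real Spark, each chained transformation becomes a node in the DAG, and the scheduler plans the whole graph at once rather than one map/reduce pair at a time.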
The significant improvements in Spark performance are due to several factors, including:
- Spark is not bound by disk I/O. Thanks to its in-memory processing model, Spark does not pay a disk I/O penalty each time it completes a stage of a job.
- Spark's DAG enables optimization across job stages. Hadoop has no comparable linkage between MapReduce stages, so performance cannot be tuned at that level.
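The I/O difference can be shown with a toy simulation in plain Python (again, not actual Hadoop or Spark code): a MapReduce-style pipeline materializes each stage's full output to disk, while an in-memory engine simply composes the stages.

```python
import json
import os
import tempfile

data = list(range(1000))

def run_stage_on_disk(records, fn):
    """Apply fn to every record, writing the stage's full output to disk
    and reading it back, as MapReduce does between jobs."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump([fn(r) for r in records], f)
        path = f.name
    with open(path) as f:
        out = json.load(f)
    os.remove(path)
    return out

# MapReduce-style: one full disk round-trip per stage.
stage1 = run_stage_on_disk(data, lambda x: x * 2)
stage2 = run_stage_on_disk(stage1, lambda x: x + 1)

# Spark-style: the stages compose in memory; intermediate results never
# touch disk unless they exceed memory or are explicitly persisted.
in_memory = [x * 2 + 1 for x in data]

assert stage2 == in_memory
```

Both paths compute the same result, but the first pays serialization and disk costs at every stage boundary, which is exactly the penalty that makes iterative workloads like machine learning so slow on MapReduce.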
Additionally, Spark benefits from a flexible deployment architecture: Spark clusters can be deployed using various cluster managers, including Kubernetes and Hadoop YARN. Hadoop MapReduce is still a solid choice for batch processing large amounts of data, but for most other use cases, Spark is the better option.
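That deployment flexibility shows up directly in how applications are submitted: the same job can target either cluster manager by changing the `--master` flag on `spark-submit` (hostnames, image names, and paths below are illustrative):

```shell
# Submit to a Hadoop YARN cluster:
spark-submit --master yarn --deploy-mode cluster my_app.py

# Submit the same application to Kubernetes instead:
spark-submit \
  --master k8s://https://k8s-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-registry/spark-app:latest \
  local:///opt/app/my_app.py
```

This is what makes a phased migration practical: the application code stays the same while the underlying cluster manager changes.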
Given Spark's flexibility, its support for AI and machine learning, and its dramatically improved performance, Spark adoption has risen sharply in recent years. Investing in Spark and related technologies is a sound strategy for the future.
Life after Hadoop
Over the past few years, we've seen dramatic changes in AI and data-driven applications, along with increasingly diverse data storage options. These changes, combined with Hadoop's complexity, performance issues, and market consolidation, have led to a sharp decline in Hadoop usage and left many organizations contemplating life after Hadoop.
Going forward, organizations should consider winding down their investment in Hadoop and adopting a Spark + data fabric strategy instead. HPE Ezmeral Software covers both sides of that strategy: HPE Ezmeral Data Fabric provides an enterprise data fabric, while the HPE Ezmeral Container Platform offers enhanced support for Spark and the ability to run remaining Hadoop assets in containers, with a common control plane for Spark and Hadoop workloads.
By adopting HPE Ezmeral, organizations can make the transition to the post-Hadoop era, freeing up time and resources to focus on their business's new data challenges.
About Randy Thomasson
Randy is a Global Solutions Architect for HPE Ezmeral Software, providing technical leadership, strategy, and architectural guidance across a wide range of technologies and disciplines, including application development and modernization, big data and advanced analytics, infrastructure automation, in-memory data technologies, NoSQL, and DevOps.
Copyright © 2021 IDG Communications, Inc.