Hadoop 3 is a major version of a framework for developing and running distributed programs. In 2017 Hadoop 3 was released. For users of cloud Hadoop versions 1 and 2, we will tell you about the new features of the third and why it is worth switching to it.
Hadoop 2 uses replication to provide fault tolerance. It means that all stored data is in a redundant state. By default, the replication factor is three, which means that all information in three replicas will be held: master data plus two copies.
For example, a 1 GB file is stored in four 256 MB blocks. For each of these blocks, two additional copies are created, resulting in 3 GB. The storage redundancy is 200%.
Hadoop 3 takes a different approach – Erasure Coding. It is a data fragmentation technique that creates only a few extra parity blocks. They store information on how to restore the rest of the blocks if they are unavailable. Learn more about how it works.
Similarly, the Erasure Coding approach will use only two additional blocks, and together the data will take up 1.5 TB. It has 50% storage redundancy.
In Hadoop 2, there can be one active and one standby management node (NameNode). It means that the cluster can only survive the failure of one node. If two nodes fail, the group will be unavailable.
Hadoop 3 can provide better fault tolerance – it now supports multiple redundant NameNodes. Thus, the cluster can withstand the failure of two or more control nodes.
YARN was designed to scale up to 10,000 nodes. The new version of Hadoop introduces the YARN Federation, which will allow YARN to scale to 40,000 or even 100,000 nodes. It is useful for high-load applications
The YARN resource model has been generalized to support additional quantifiable resource types in addition to CPU and RAM. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally attached storage.
It allows you to schedule tasks in YARN based on the availability of these resources. For example, you might consider the number of available software licenses or free disk space when planning lessons.
The Timeline Server is responsible for collecting and searching information about running applications in the cluster. Timeline Service 2.0 is the next major iteration of Timeline Server since v.1. Version v.2 solves two main problems of versions v.1:
Hadoop 3 introduces new methods for tuning heap sizes. Auto-tuning is now possible based on host memory size, and the HADOOP_HEAPSIZE variable is deprecated. The desired heap size no longer needs to be specified in the task configuration and Java parameter.
Hadoop 2 runs on Java Development Kit 7. Apache Hadoop 3 has migrated to Java Development Kit 8. In addition to Hadoop 3 itself, JDK 8 support is present in its HBase 2.0, Hive 3.0, and Phoenix 3.0 components.
Also Read: How And Why “Auchan” Built A Platform For Working With Big Data In A Public Cloud
Key Takeaways Understand current innovations reshaping payroll processes. Learn how automation improves payroll accuracy and…
Convert URL To MP3: Your Comprehensive Guide To Easy Online Conversions Description: Discover how to…
Spending a lot of time on the internet, I am always looking for tools that…
Due to the abundance of options available in the field of cloud storage, it may…
Lately, I have been searching for YouTube alternatives. Even though I enjoy YouTube for its…
Internet marketing and entrepreneurship are dynamic fields, but BizGurukul assists fresh and experienced marketing personnel.…