Saturday, July 6, 2019

Difference between Hadoop 2.x and Hadoop 3.x

Feature
Hadoop 2.x
Hadoop 3.x
Java Version
Java 7
Java 8, must
Fault tolerance
Achieved with replication
Via erasure coding
Storage
Replication=3 for data reliability
Which increase disk usage.
for ex. File A of 3 blocks occupies 3*3 blocks.
Storage overhead = 9 /3 * 100 = 300%
Erasure coding for data reliability
Under erasure coding the blocks are not replicated in fact HDFS calculates the parity blocks for all file blocks. 
Whenever the file blocks get corrupted, the Hadoop framework recreates using the remaining blocks along with the parity blocks.
Storage overhead is drastically reduced to more than 50%.
for ex. File A of 3 blocks occupies 9 blocks.
Storage overhead = 3/3 * 100 = 100%
Yarn Timeline Service
Scalability issues over data increases.
Yarn Timeline Service 1.x present since Hadoop 1.x, and its not scalable beyond small clusters. It has a single instance of writer and storage.
Yarn Timeline Service 2.x , provides for more scalability, reliability and enhanced usability. It has scalable back-end storage and distributed writer architecture.
Heap Size Management
Need to configure HADOOP HEAPSIZE
There are new ways to configure Map & Reduce daemon heap sizes. Auto tuning based on the memory of the host and globally. HADOOP_HEAPSIZE & JAVA_HEAP_SIZE variable is no longer used. We have HEAP_MAX_SIZE and HEAP_MIN_SIZE variables in MB. Also, if you want to enable the old default then configure HADOOP_HEAPSIZE_MAX in hadoop-env.sh.
Standby NN
Supports only 1 Standby NN, tolerating the failures of cluster.
Supports 2 and more Standby NN. Only One is in active state and others are in standby state.
Containers
Hadoop 2.x works on the principle of guaranteed containers.The container will start running immediately as there is a guarantee that the resources will be available. But it has drawbacks.
FeedBack Delays  Once the container finishes execution it notifies to RM. When RM schedules a new container at that node, AM gets notified. Then AM starts the new container. Hence there is a delay in terms of notifications to RM and AM.
Allocated v/s utilized resources – The resources which RM allocates to the container can be under-utilized. For ex, RM may allocates container of 4 GB and out of which it uses only 2 GB. This reduces effective resource utilization.
 Hadoop 3.x implements opportunistic containers.
Containers wait in a queue if the resources are not available.
The opportunistic containers have less priority than guaranteed containers.
Hence, the scheduler attempts opportunistic containers to be available for guaranteed containers.
Port Numbers for multiple services
It uses ephemeral port numbers range (32768-61000), which lead failure of Hadoop services in startup in Linux Server.
The ephemeral port numbers changes affected to NN, SN and DN port numbers.
Name Node ports: 
50470 –> 9871,
50070 –> 9870,
8020 –> 9820
Secondary Name Node ports: 
50091 –> 9869,
50090 –> 9868
Data Node ports: 
50020 –> 9867,
50010 –> 9866,
50475 –> 9865,
50075 –> 9864
Data Load balancer
A single Data Node manages many disks. These disks fill up during a normal write operation. But, adding or replacing disks can lead to significant issues within a Data Node. Hadoop 2.x has HDFS balancer which cannot handle this situation. 
New intra-Data Node balancing functionality handles the above situation.
diskbalancer CLI invokes intra-DataNode balancer.
To enable  this,
dfs.disk.balancer.enabled=true on all DataNodes.
File System Support
Local filesystems,HDFS (Default FS), FTP File system, Amazon S3 (Simple Storage Service) file system, Windows Azure Storage Blobs (WASB) file system, Distributed Filesystems, etc.
It supports all the previous one as well as Microsoft Azure Data Lake filesystem and Aliyn object storage system.
Scalability
Cluster can be scale upto 10000 nodes.
Cluster can be scale more than 10000 nodes.