Hadoop 2.x vs Hadoop 3.x: feature-by-feature comparison

Java Version
Hadoop 2.x: Java 7 is the minimum supported runtime.
Hadoop 3.x: Java 8 is the minimum required runtime.

Fault Tolerance
Hadoop 2.x: Achieved with block replication.
Hadoop 3.x: Achieved with erasure coding.

Storage
Hadoop 2.x: Replication factor 3 is used for data reliability, which increases disk usage. For example, File A of 3 blocks occupies 3 * 3 = 9 blocks, so storage overhead = 9/3 * 100 = 300%.
Hadoop 3.x: Erasure coding is used for data reliability. Blocks are not replicated; instead HDFS computes parity blocks for the file's data blocks, and whenever data blocks are lost or corrupted, the framework reconstructs them from the remaining data blocks plus the parity blocks. Storage overhead is roughly halved compared to replication. For example, with the RS-6-3 policy a file of 6 blocks occupies 9 blocks (6 data + 3 parity), so storage overhead = 9/6 * 100 = 150%.

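The overhead arithmetic above is easy to check in a few lines of code. The sketch below is only illustrative: the 3x replication factor and the 6-data/3-parity split of RS-6-3 match the figures used above, and the class and method names are made up for the example.

public class StorageOverhead {

    // Blocks stored under N-way replication: every data block is copied N times.
    static long replicatedBlocks(long dataBlocks, int replicationFactor) {
        return dataBlocks * replicationFactor;
    }

    // Blocks stored under Reed-Solomon erasure coding with k data + m parity blocks
    // per stripe (e.g. RS-6-3): each stripe of k data blocks adds m parity blocks.
    static long erasureCodedBlocks(long dataBlocks, int k, int m) {
        long stripes = (dataBlocks + k - 1) / k;   // round up to whole stripes
        return dataBlocks + stripes * m;
    }

    // Overhead as used above: total stored blocks / data blocks * 100.
    static double overheadPercent(long storedBlocks, long dataBlocks) {
        return 100.0 * storedBlocks / dataBlocks;
    }

    public static void main(String[] args) {
        long dataBlocks = 6;   // example file of 6 blocks
        System.out.printf("3x replication : %.0f%%%n",
                overheadPercent(replicatedBlocks(dataBlocks, 3), dataBlocks));      // 300%
        System.out.printf("RS-6-3 encoding: %.0f%%%n",
                overheadPercent(erasureCodedBlocks(dataBlocks, 6, 3), dataBlocks)); // 150%
    }
}
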
YARN Timeline Service
Hadoop 2.x: Timeline Service v1 has scalability problems as the amount of data grows; it uses a single writer and storage instance and does not scale beyond small clusters.
Hadoop 3.x: Timeline Service v2 provides better scalability, reliability and usability, with scalable back-end storage and a distributed writer architecture.

Heap Size Management
Hadoop 2.x: Daemon heap sizes have to be configured by hand via HADOOP_HEAPSIZE.
Hadoop 3.x: Heap management is simplified: daemon heap sizes can be auto-tuned based on the memory of the host, and map/reduce task heap sizes no longer need to be set separately from the task memory size. HADOOP_HEAPSIZE is deprecated and the internal JAVA_HEAP_MAX variable is gone; HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN are used instead (values without a unit are treated as MB). To keep the old fixed-heap behaviour, set HADOOP_HEAPSIZE_MAX explicitly in hadoop-env.sh.

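If you want to confirm what heap a Hadoop JVM actually ended up with after auto-tuning, the JVM itself can tell you. The snippet below is a generic JVM check, not a Hadoop API; in practice you would more likely inspect the running daemon with standard JDK tools such as jcmd or jmap.

public class ShowMaxHeap {
    public static void main(String[] args) {
        // Runtime.maxMemory() reports the maximum heap the JVM will attempt to use:
        // the value of -Xmx if one was passed (e.g. via HADOOP_HEAPSIZE_MAX), or the
        // JVM's own memory-based default when no explicit size was configured.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d MB%n", maxBytes / (1024 * 1024));
    }
}
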
Standby
NN
|
Supports
only 1 Standby NN, tolerating the failures of
cluster.
|
Supports
2 and more Standby NN. Only One is in active state and others are in standby
state.
|
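For illustration, this is roughly how a nameservice with one active and two standby NameNodes is declared. The keys (dfs.nameservices, dfs.ha.namenodes.*, dfs.namenode.rpc-address.*) are the standard HDFS HA settings that normally go into hdfs-site.xml; the nameservice name, NameNode IDs and hostnames below are made-up examples, and the Configuration API is used only to keep the sketch in Java.

import org.apache.hadoop.conf.Configuration;

public class ThreeNameNodeHaSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // One logical nameservice backed by three NameNodes (hypothetical IDs and hosts).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1.example.com:9820");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2.example.com:9820");
        conf.set("dfs.namenode.rpc-address.mycluster.nn3", "host3.example.com:9820");

        // At any time only one of nn1..nn3 is active; the other two stay in standby
        // and can take over on failover.
        System.out.println("NameNodes in mycluster: " + conf.get("dfs.ha.namenodes.mycluster"));
    }
}
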
Containers
|
Hadoop 2.x works on the principle of guaranteed containers.The
container will start running immediately as there is a guarantee that the
resources will be available. But it has drawbacks.
FeedBack Delays – Once the container finishes execution it notifies to RM.
When RM schedules a new container at that node, AM gets notified. Then AM
starts the new container. Hence there is a delay in terms of notifications to
RM and AM.
Allocated v/s utilized
resources – The resources which RM allocates to the container can be
under-utilized. For ex, RM may allocates container of 4 GB and out of which
it uses only 2 GB. This reduces effective resource utilization.
|
Hadoop 3.x implements
opportunistic containers.
Containers wait in a queue if the resources are not available.
The opportunistic containers have less priority than
guaranteed containers.
Hence, the scheduler attempts opportunistic containers to
be available for guaranteed containers.
|
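A hedged sketch of what turning this on involves. The two property names below come from the Hadoop 3 opportunistic-containers feature (treat the exact keys and the queue length of 10 as assumptions to verify against your release); they normally live in yarn-site.xml on the ResourceManager and NodeManagers, and the Configuration API is used here only to keep the example in Java.

import org.apache.hadoop.conf.Configuration;

public class OpportunisticContainersSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Allow the RM to allocate opportunistic containers in addition to guaranteed ones.
        conf.setBoolean("yarn.resourcemanager.opportunistic-container-allocation.enabled", true);

        // How many opportunistic containers a NodeManager may queue while waiting for
        // resources to free up (0 means no queuing).
        conf.setInt("yarn.nodemanager.opportunistic-containers-max-queue-length", 10);

        // Applications then request ExecutionType.OPPORTUNISTIC (instead of the default
        // GUARANTEED) for the containers they are willing to run at lower priority.
        System.out.println("Opportunistic allocation enabled: "
                + conf.getBoolean("yarn.resourcemanager.opportunistic-container-allocation.enabled",
                        false));
    }
}
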
Port Numbers for Multiple Services
Hadoop 2.x: Several default service ports fall in the Linux ephemeral port range (32768-61000), which can make Hadoop services fail at startup on a Linux server when a port is already in use.
Hadoop 3.x: The default NameNode, Secondary NameNode and DataNode ports have been moved to the 98xx range:
NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
Secondary NameNode ports: 50091 -> 9869, 50090 -> 9868
DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864

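One way to confirm the new defaults from code is to load the stock HDFS defaults and print the relevant keys. The snippet assumes Hadoop 3 client jars on the classpath; with those defaults the values should come back on the new 98xx ports listed above.

import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ShowDefaultHdfsPorts {
    public static void main(String[] args) {
        // HdfsConfiguration pulls in hdfs-default.xml (and hdfs-site.xml if present),
        // so unset keys fall back to the defaults shipped with the Hadoop version in use.
        HdfsConfiguration conf = new HdfsConfiguration();

        System.out.println("NameNode HTTP  : " + conf.get("dfs.namenode.http-address"));   // 0.0.0.0:9870
        System.out.println("NameNode HTTPS : " + conf.get("dfs.namenode.https-address"));  // 0.0.0.0:9871
        System.out.println("DataNode HTTP  : " + conf.get("dfs.datanode.http.address"));   // 0.0.0.0:9864
    }
}
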
Data Load Balancer
Hadoop 2.x: A single DataNode manages many disks. During normal write operations these disks fill up fairly evenly, but adding or replacing disks leads to significant skew within a DataNode. The HDFS balancer only balances data across DataNodes and cannot handle this situation.
Hadoop 3.x: New intra-DataNode balancing functionality handles the above situation. The intra-DataNode balancer is invoked via the hdfs diskbalancer CLI. To enable it, set dfs.disk.balancer.enabled=true on all DataNodes.

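The property mentioned above is the on/off switch; the sketch below just shows it being set (in a real cluster it goes into hdfs-site.xml on every DataNode) and notes the usual diskbalancer CLI workflow in the comments. The hostname is a made-up example.

import org.apache.hadoop.conf.Configuration;

public class DiskBalancerSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Enable the intra-DataNode disk balancer (hdfs-site.xml on every DataNode).
        conf.setBoolean("dfs.disk.balancer.enabled", true);

        // Typical workflow once it is enabled, run from the command line:
        //   hdfs diskbalancer -plan dn1.example.com     -> writes a JSON move plan
        //   hdfs diskbalancer -execute <plan-file>      -> moves blocks between that node's disks
        //   hdfs diskbalancer -query dn1.example.com    -> reports progress
        System.out.println("Disk balancer enabled: "
                + conf.getBoolean("dfs.disk.balancer.enabled", false));
    }
}
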
File System Support
Hadoop 2.x:
Hadoop 3.x: