Hadoop YARN Architecture


In this chapter we will look at YARN, Yet Another Resource Negotiator, introduced in Hadoop 2.x, which is Hadoop's scalable compute platform. Let's quickly look at the agenda: I'll give you a short introduction to Hadoop and MapReduce version 1, we'll talk about the inadequacies of MapReduce version 1, then we'll get into the new architecture of YARN, with a quick introduction to YARN (also referred to as MapReduce version 2) and a quick insight into how to develop YARN applications. So let's get started.
Hadoop clusters can scale from single nodes, in which all Hadoop entities operate on the same node, to thousands of nodes, where functionality is distributed across the nodes to increase parallel processing. The figure here illustrates the high-level components of a Hadoop cluster. A Hadoop cluster can be divided into two abstract entities: a MapReduce engine, which is coordinated by the job tracker, and a distributed file system, which is coordinated by the name node. The MapReduce engine provides the ability to execute map and reduce jobs across the cluster and report the results, while the distributed file system provides a storage scheme that can replicate data across the nodes for processing. As you can see here, the data is replicated across four nodes in the cluster, which are also the data nodes. The Hadoop Distributed File System was designed to support very large files, which are commonly stored as data blocks of 64 MB each.
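As a quick, hedged illustration of block size and replication, here is a minimal Java sketch using the standard HDFS FileSystem API; the file path, the 64 MB block size, and the replication factor of 3 are assumptions chosen for the example, not values from the video.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path; 64 MB blocks and 3 replicas are illustrative values.
            Path file = new Path("/tmp/example.txt");
            long blockSize = 64L * 1024 * 1024;
            short replication = 3;

            // Create the file with an explicit block size and replication factor.
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.writeUTF("hello hdfs");
            }

            // Read back the metadata the name node keeps for this file.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("block size  = " + status.getBlockSize());
            System.out.println("replication = " + status.getReplication());
        }
    }

The name node records this metadata and decides where the replicated blocks are placed across the data nodes.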
When a client makes a request to a Hadoop cluster, the request is managed by the job tracker. The job tracker, working with the name node, distributes work as close as possible to the data on which it will operate, which is also called the data locality principle. The name node is the master of the file system, providing metadata services for data distribution and replication. Moving on to the job tracker: it schedules map and reduce tasks into available slots at one or more task trackers. As you will notice, task trackers run on each of the data nodes; the task trackers work with their data node to execute the map and reduce tasks on the data held by that node. When the map and reduce tasks are complete, the task tracker notifies the job tracker, which identifies when all the tasks are complete and eventually notifies the client of the job completion.
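To make the MapReduce programming model concrete, here is a minimal word count job in Java using the standard Hadoop MapReduce API; it is only a sketch, and the class names and the command-line input/output paths are illustrative, not something from the video.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In MapReduce version 1 this job is handed to the job tracker, which breaks it into map and reduce tasks and schedules them into slots on the task trackers; the same code also runs unchanged on YARN.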
As you can see from this picture, MapReduce version 1 implements a relatively straightforward cluster manager for MapReduce processing. MapReduce version 1 provides a hierarchical scheme for cluster management in which big data jobs filter into a cluster as individual map and reduce tasks and eventually aggregate back into a job report to the user. But in this simplicity lie some hidden and not-so-hidden problems: the simplicity of the first version of MapReduce has been both a strength and a weakness. MapReduce version 1 is the standard big data processing scheme in use today; however, this architecture does have inadequacies, mostly coming into play for larger clusters. As cluster sizes exceeded 4,000 nodes, where each node could be multi-core, some amount of unpredictability surfaced. One of the biggest issues was cascading failures, where a failure resulted in a serious deterioration of the overall cluster because of attempts to replicate data and overload live nodes through network flooding.

But the biggest issue with MapReduce version 1 is multi-tenancy. As clusters increase in size, it is desirable to employ them for a variety of models, yet MapReduce version 1 dedicates its nodes to Hadoop, where it is desirable to repurpose them for other applications and workloads. As big data and Hadoop become an even more important use model for cloud deployments, the importance of this capability will increase, because it permits running Hadoop directly on physical servers rather than requiring virtualization, with its added management, computational, and input/output overhead.
Let's now look at the new architecture of YARN, or Yet Another Resource Negotiator, to see how it supports MapReduce version 2 and other applications using different processing models. A Hadoop cluster can be scaled almost linearly by adding more computers, or nodes, to the cluster (computers in the cluster are called nodes, as we know). We have already seen how the storage capacity of the cluster can be increased by adding more nodes with disks attached; since these nodes also have their own memory and processing capacity, we are also making more memory and processing capacity available to the cluster as a whole. With an increased number of nodes in the cluster, the compute time for a given job drops significantly, because the job can be spread across more processors. As discussed before, the cost associated with the cluster grows roughly linearly, so the cost per unit of storage, and for that matter per unit of processing, remains constant, while the compute time decreases significantly because we now have more processing power at our disposal. Hadoop is basically designed for cost-effective scaling out to very large numbers of nodes; the largest deployments in use today have 40,000-plus nodes, and even bigger clusters are also known to work.
YARN provides a generic processing platform for running different workloads at the same time; we might use different processing engines, which can be written in a variety of programming languages, and YARN supports running many thousands of applications concurrently. The core components of Hadoop, as you can see in this picture, are HDFS, the distributed file system, and YARN, the distributed computing layer. YARN is the architectural center of Hadoop, specifically designed to support many different processing engines such as MapReduce, Tez, Spark, and Storm, covering interactive SQL, real-time processing, real-time streaming, data science, and batch processing, all handling data stored in a single platform and unlocking an entirely new approach to analytics. High-level languages such as Hive and Pig are compiled into MapReduce or Tez jobs, which are then executed. Other engines have appeared recently which provide new capabilities, such as Spark and Storm for real-time processing. Before the advent of YARN, Hadoop was mostly a batch-only processing model; the YARN layer introduced in Hadoop version 2 allows other processing engines to be plugged into Hadoop, including some which give near real-time capabilities, so because of these recent developments Hadoop is no longer batch-only.
YARN provides dramatic scalability improvements over Hadoop's previous versions. Basically, YARN decouples the programming model from the resource management infrastructure. YARN delegates many of the job management tasks to the nodes in the cluster rather than to a single master service, as we have seen in Hadoop 1 where the job tracker took care of those management tasks, and YARN also supports much bigger clusters of 40,000 nodes or more. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture. For the enterprise, Hadoop YARN is a prerequisite, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost-effective, linear-scale storage and processing, and it provides developers a consistent framework for writing data access applications that run on Hadoop.

With the advent of YARN you are no longer constrained by the simple MapReduce paradigm of development but can instead create more complex distributed applications. In fact, you can think of the MapReduce model as simply one more application in the set of possible applications that the YARN architecture can run, in effect exposing more of the underlying framework for customized development. This is powerful, because the use model of YARN is potentially limitless and no longer requires segregation from other, more complex distributed application frameworks that may exist on a cluster, as MapReduce version 1 did. It could even be said that as YARN becomes more robust, it may be able to replace some of these other distributed processing frameworks completely, freeing up resources dedicated to those frameworks as well as simplifying the overall system.
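As a small, hedged illustration of that point: in Hadoop 2 a classic MapReduce job is steered onto YARN purely through configuration. The sketch below reuses the WordCount mapper and reducer from the earlier example and assumes the standard mapreduce.framework.name and yarn.resourcemanager.hostname properties; the "rm-host" value is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountOnYarn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the client to submit the job to YARN rather than run it locally.
            conf.set("mapreduce.framework.name", "yarn");
            // Hypothetical resource manager host; normally picked up from yarn-site.xml.
            conf.set("yarn.resourcemanager.hostname", "rm-host");

            Job job = Job.getInstance(conf, "word count on yarn");
            job.setJarByClass(WordCountOnYarn.class);
            job.setMapperClass(WordCount.TokenMapper.class);   // reuse the earlier mapper
            job.setReducerClass(WordCount.SumReducer.class);   // reuse the earlier reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Nothing in the job itself changes; only the framework it is submitted to does, which is exactly why MapReduce can be treated as just one more YARN application.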
As you can see in this picture, at the root of the YARN hierarchy is the resource manager, which is the master daemon. This entity governs an entire cluster and manages the assignment of applications to the underlying compute resources. The resource manager orchestrates the division of resources, which are basically compute, memory, bandwidth, GPU, and so on, among the underlying node managers. These node managers are YARN's per-node agents; if you look at Hadoop version 1.x you have something called task trackers, but here you have node managers. The resource manager also works with application masters, as you can see here, to allocate resources, and works with the node managers to start and monitor the underlying applications. In this context the application master has taken over some of the role of the earlier task tracker, and the resource manager has taken the role of the job tracker.

An application master manages each instance of an application that runs within YARN. The application master is responsible for negotiating resources from the resource manager and, through the node manager, monitoring the execution and resource consumption of containers (CPU, memory, and so on). You may also note that today's resources are the more traditional ones like CPU cores and memory; tomorrow will bring new resource types based on the task at hand, for example graphics processing units or specialized processing devices. From YARN's perspective, application masters are user code and therefore a potential security issue; YARN assumes that application masters are buggy or even malicious and therefore treats them as unprivileged code.
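To show what such a negotiated resource looks like in code, here is a minimal, hypothetical sketch of the container request an application master hands to the resource manager, using Hadoop's YARN Java records API; the memory, vcore, priority, host, and rack values are all assumptions for illustration.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ResourceRequestExample {
        public static void main(String[] args) {
            // A container "capability": today this means memory (MB) and vcores.
            Resource capability = Resource.newInstance(2048, 2);   // 2 GB, 2 vcores (illustrative)

            // Priority lets the application master rank its outstanding requests.
            Priority priority = Priority.newInstance(1);

            // Optional locality hints: preferred hosts and racks (hypothetical names).
            String[] nodes = { "worker-node-07" };
            String[] racks = { "/rack-3" };

            // This object is what an application master adds to its resource-manager client.
            ContainerRequest request = new ContainerRequest(capability, nodes, racks, priority);
            System.out.println("would ask the resource manager for: " + request);
        }
    }

Today the capability covers only memory and vcores, which matches the point above that new resource types such as GPUs may be added over time.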
The node manager manages each node within the YARN cluster. The node manager is a worker daemon which provides per-node services within the cluster, from overseeing the management of a container over its life cycle to monitoring resources and tracking the health of its node. Unlike MapReduce version 1, which managed execution of map and reduce tasks via slots, the node manager manages abstract containers, which represent per-node resources available to a particular application. YARN continues to use the HDFS layer, with its master name node for metadata services and edit logs, and data nodes for replicated storage services across the cluster. Each node manager tracks its own local resources and communicates its resource configuration to the resource manager, as you can see from this particular arrow here, and the resource manager keeps a running total of the cluster's available resources. By keeping track of the total, the resource manager knows how to allocate resources as they are requested.
Use of a YARN cluster begins with a request from a client, consisting of an application. The resource manager negotiates the necessary resources for a container and launches an application master to represent the submitted application. Using a resource-request protocol, the application master negotiates resource containers for the application on each node. Upon execution of the application, the application master monitors the containers until completion. When the application is complete, the application master deregisters its container with the resource manager, and the entire cycle is complete.
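As a rough illustration of that first step, here is a minimal, hypothetical client-side sketch using Hadoop's YarnClient Java API to submit an application whose application master is just a shell command; the application name, queue, memory, vcores, and the echo command are assumptions chosen for the example.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SimpleYarnSubmitter {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // Client used to talk to the resource manager.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Ask the resource manager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("simple-yarn-app");
            appContext.setQueue("default");                        // assumed queue name

            // Describe the container that will run the application master;
            // here the "application master" is just a shell command for illustration.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(), Collections.emptyMap(),
                    Collections.singletonList("echo hello from the application master"),
                    null, null, null);
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(512, 1));  // 512 MB, 1 vcore for the AM

            // Submit and poll until the application reaches a terminal state.
            ApplicationId appId = appContext.getApplicationId();
            yarnClient.submitApplication(appContext);
            YarnApplicationState state;
            do {
                Thread.sleep(1000);
                state = yarnClient.getApplicationReport(appId).getYarnApplicationState();
                System.out.println("application " + appId + " is " + state);
            } while (state != YarnApplicationState.FINISHED
                    && state != YarnApplicationState.FAILED
                    && state != YarnApplicationState.KILLED);
        }
    }

In a real application the command would launch the application master code, which then negotiates its own containers, as described above.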
Let's also quickly talk about containers here. Containers are an important YARN concept: you can think of a container as a request to hold resources on the YARN cluster. Currently a container request consists of vcores and memory; once the hold has been granted on a host, the node manager launches a process there called a task. An application is a YARN client program that is made up of one or more tasks. For each running application, a special piece of code called an application master helps coordinate the tasks on the YARN cluster; the application master is the first process run after the application starts.

Running an application on a YARN cluster consists of the following steps: first, the application client starts and talks to the resource manager for the cluster; the resource manager makes a single container request on behalf of the application; the application master starts running within that container; the application master then requests subsequent containers from the resource manager, which are allocated to run tasks for the application; those tasks do most of their status communication with the application master; once all tasks are finished, the application master exits and the last container is de-allocated from the cluster; and finally the application client exits.
An application master launched in a container like this is called a managed application master; unmanaged application masters run outside of YARN's control. Finally, a point that should be clear from this discussion is that the older Hadoop architecture (1.x) was highly constrained through the job tracker, which was responsible both for resource management and for scheduling jobs across the cluster, and because of that the job tracker was clearly overburdened and frequently prone to crashes. The new YARN architecture breaks down this model, allowing a new resource manager to manage resource usage across the applications, with application masters taking responsibility for managing the execution of jobs. This change removes a bottleneck and also improves the ability to scale Hadoop clusters to much larger configurations than previously possible. In addition, beyond traditional MapReduce, YARN also permits simultaneous execution of a variety of programming models, including graph processing, iterative processing, machine learning, and general cluster computing using standard communication schemes like the Message Passing Interface.

Finally, if you are interested in developing YARN applications, you can take a look at some of the options available here. You can even develop a simple Java program that is converted into a YARN application, since YARN internally provides the capabilities to build custom application frameworks on top of Hadoop. A boilerplate implementation of this life cycle is also available under a GitHub project called Kitten (the URL is already provided), so as a developer you don't have to worry about implementing the core capabilities of YARN; instead you can use the Kitten project, which provides the complete implementation, and you just have to focus on your application logic.

In the next video we will touch upon some of the new components and features that are available as part of Hadoop 2, and the core differences between Hadoop 1 and Hadoop 2. If you enjoyed this video, please subscribe to the channel for more upcoming videos. Thank you.
