About us Wangda Tan • 5+ years in the big data field (Hadoop, Open MPI, etc.) • Now • Apache Hadoop Committer @Hortonworks, working full time on YARN. • Spending most of his time on resource scheduling enhancements. • Past • Pivotal (PHD team, brought OpenMPI/GraphLab to YARN) • Alibaba (ODPS team, platform for distributed data mining) Mayank Bansal • Hadoop Architect @ eBay • Apache Hadoop Committer • Apache Oozie PMC and Committer • Current • Leading Hadoop Core Development for YARN and MapReduce @ eBay • Past • Working on Scheduler / Resource Managers
Agenda • Overview • Problems • What is node label • Understand by example • Architecture • Case study • Status • Future
Overview – Background • Resources are managed by a hierarchy of queues. • One queue can have multiple applications. • A container is the result of resource scheduling: a bundle of resources that can run one or more processes.
Overview – How to manage your workload by queues • By organization: • Marketing/Finance queue • By workload: • Interactive/Batch queue • Hybrid: • Finance-batch/Marketing-realtime queue
Problems • No way to describe special resources on nodes • E.g. nodes with GPU / SSD • No way for an application to request nodes with specific resources. • Unable to partition a cluster based on organizations/workloads
What is Node Label? • Group nodes with similar profile • Hardware • Software • Organization • Workloads • A way for app to specify where to run in a cluster
Node Labels • Types of node labels • Node partition (since 2.6) • Node constraints (WIP) • Node partition • One node belongs to only one partition • Related to resource planning • Node constraints • One node can be assigned multiple constraints • Not related to resource planning
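The difference between the two label types can be pictured with a small data model. This is only a sketch: the `Node` class, the `partition_sizes` helper, the host names, and the 64 GB node size are all made up for illustration and are not YARN classes or settings. The empty string stands for the default partition here, which is an assumption of the sketch.

```python
from dataclasses import dataclass, field

DEFAULT_PARTITION = ""  # assumption: empty string stands for the default partition


@dataclass
class Node:
    """Toy model of a cluster node; not a YARN class."""
    host: str
    partition: str = DEFAULT_PARTITION             # exactly one partition per node
    constraints: set = field(default_factory=set)  # any number of constraints


def partition_sizes(nodes, memory_per_node_gb):
    """Sum node memory per partition.

    Partitions are disjoint, so every node is counted exactly once. That is
    what makes partitions usable for resource planning; constraints overlap
    and therefore cannot be summed this way.
    """
    sizes = {}
    for node in nodes:
        sizes[node.partition] = sizes.get(node.partition, 0) + memory_per_node_gb
    return sizes


nodes = [
    Node("n1"),
    Node("n2", constraints={"ssd"}),
    Node("n3", partition="GPU", constraints={"ssd", "x86_64"}),
]
sizes = partition_sizes(nodes, memory_per_node_gb=64)
print(sizes)
```

Because each node appears in exactly one partition, the per-partition totals always add up to the cluster total, which is the property capacity planning relies on.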
Understand by example (1) • A real-world example of why node partitions are needed: • Company-X has a big cluster; the Engineering, Marketing, and Sales teams each have a 33% share of it. [Diagram: one YARN RM managing the cluster, shared as Engineering 33%, Marketing 33%, Sales 33%]
Understand by example (2) • The engineering and marketing teams need GPU-equipped servers for visualization work, so they spent equal amounts of money on machines with GPUs. • They want to share these new machines 50:50. • The sales team spent $0 on these new nodes, so it cannot run anything on them at all. [Diagram: new GPU nodes, shared as Engineering 50%, Marketing 50%]
Understand by example (3) • Here is the problem: • If you create a separate YARN cluster, the ops team will be unhappy. • If you add the new nodes to the original cluster, you cannot guarantee that the engineering/marketing teams get preference on them. [Diagram: one YARN RM with the original nodes and the new GPU nodes; how to share them is an open question]
Understand by example (4) • Node partitions solve this problem: • Add a "GPU" partition, managed by the same YARN RM. The admin can specify different percentage shares in different partitions. [Diagram: one YARN RM with two partitions. "Default" partition: Engineering 33%, Marketing 33%, Sales 33%. "GPU" partition: Engineering 50%, Marketing 50%]
Understand by example (5) • Understand non-exclusive node partitions: • In the previous example, the "GPU" partition can be used only by the engineering and marketing teams. • This is bad for resource utilization. • The admin can instead allow the sales queue to use the "GPU" partition when it has idle resources. When engineering/marketing come back, resources allocated to the sales queue are preempted. • (available since Hadoop 2.8) [Diagram: "Default" partition guaranteed to Engineering 33%, Marketing 33%, Sales 33%. "GPU" partition guaranteed to Engineering 50%, Marketing 50%; Sales 0%, can use it only if it is idle]
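The exclusive versus non-exclusive behaviour described above can be sketched as a tiny decision function. This is a deliberately simplified illustration, not the Capacity Scheduler's algorithm: `try_allocate`, `preempt_for`, and the container ids are hypothetical, and real scheduling also considers user limits, queue maximums, and request sizes.

```python
# Guaranteed share of each queue in the "GPU" partition, in percent (from the example).
GPU_GUARANTEE = {"engineering": 50, "marketing": 50, "sales": 0}


def try_allocate(queue, idle_units, exclusive):
    """Decide whether `queue` may take one idle unit of the GPU partition.

    Returns "guaranteed", "borrowed" (may be preempted later), or None (rejected).
    """
    if idle_units <= 0:
        return None
    if GPU_GUARANTEE.get(queue, 0) > 0:
        return "guaranteed"
    # A queue without a share may only borrow when the partition is non-exclusive.
    return None if exclusive else "borrowed"


def preempt_for(borrowed_containers, demand_units):
    """Pick borrowed containers to reclaim when an owning queue needs capacity back."""
    return borrowed_containers[:demand_units]


print(try_allocate("sales", idle_units=4, exclusive=True))
print(try_allocate("sales", idle_units=4, exclusive=False))
print(preempt_for(["sales-c1", "sales-c2", "sales-c3"], demand_units=2))
```

The only difference between the two modes in this sketch is the last line of `try_allocate`: an exclusive partition rejects queues without a share, while a non-exclusive one lends them idle capacity that stays preemptable.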
Understand by example (6) • Configuration for the above example (Capacity Scheduler)
# Original configuration, without node partitions
yarn.scheduler.capacity.root.queues=engineering,marketing,sales
yarn.scheduler.capacity.root.engineering.capacity=33
yarn.scheduler.capacity.root.marketing.capacity=33
yarn.scheduler.capacity.root.sales.capacity=33
# Queue ACLs for node partitions
yarn.scheduler.capacity.root.engineering.accessible-node-labels=GPU
yarn.scheduler.capacity.root.marketing.accessible-node-labels=GPU
# Capacities for node partitions
yarn.scheduler.capacity.root.engineering.accessible-node-labels.GPU.capacity=50
yarn.scheduler.capacity.root.marketing.accessible-node-labels.GPU.capacity=50
# (Optional) Applications running in the queue will run in the GPU partition by default
yarn.scheduler.capacity.root.engineering.default-node-label-expression=GPU
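To check how these properties translate into per-partition shares, the following sketch parses the key/value pairs and reads back each queue's capacity in the default and GPU partitions. The helpers `parse_properties` and `queue_capacity` are hypothetical, written only for this illustration; treating a missing capacity key as 0 is an assumption of the sketch.

```python
CONFIG = """
yarn.scheduler.capacity.root.queues=engineering,marketing,sales
yarn.scheduler.capacity.root.engineering.capacity=33
yarn.scheduler.capacity.root.marketing.capacity=33
yarn.scheduler.capacity.root.sales.capacity=33
yarn.scheduler.capacity.root.engineering.accessible-node-labels=GPU
yarn.scheduler.capacity.root.marketing.accessible-node-labels=GPU
yarn.scheduler.capacity.root.engineering.accessible-node-labels.GPU.capacity=50
yarn.scheduler.capacity.root.marketing.accessible-node-labels.GPU.capacity=50
"""

PREFIX = "yarn.scheduler.capacity.root."


def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def queue_capacity(props, queue, label=None):
    """Percent of a partition configured for `queue`; 0 when no key is present."""
    if label is None:
        key = f"{PREFIX}{queue}.capacity"
    else:
        key = f"{PREFIX}{queue}.accessible-node-labels.{label}.capacity"
    return float(props.get(key, 0))


props = parse_properties(CONFIG)
queues = props[PREFIX + "queues"].split(",")
# For each queue: (share of default partition, share of GPU partition).
shares = {q: (queue_capacity(props, q), queue_capacity(props, q, "GPU")) for q in queues}
for queue, (default_pct, gpu_pct) in shares.items():
    print(f"{queue}: default={default_pct}% GPU={gpu_pct}%")
```

The sales queue has no GPU capacity key, so it ends up with a 0% share of the GPU partition, matching the example.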
Understand by example (7) • Without node partition (one YARN RM): Company (100%) → R & D (50%) [QE (20%), Dev (80%)], Sales (50%) • With node partition (same YARN RM): • Default partition: Company (100%) → R & D (50%) [QE (20%), Dev (80%)], Sales (50%) • GPU partition: Company (100%) → R & D (100%) [QE (50%), Dev (50%)], Sales (0%)
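With hierarchical queues, a leaf queue's share of a partition is the product of the percentages along its path from the root. The sketch below works this out for the queue tree in the example; the `QUEUES` table and `absolute_capacity` function are illustrative names, and the per-partition percentages are read off the diagram under the assumption that the left tree is the default partition and the right tree is the GPU partition.

```python
# Each queue maps to (parent, {partition: percent of the parent's capacity}).
QUEUES = {
    "Company": (None, {"default": 100, "GPU": 100}),
    "R & D": ("Company", {"default": 50, "GPU": 100}),
    "Sales": ("Company", {"default": 50, "GPU": 0}),
    "QE": ("R & D", {"default": 20, "GPU": 50}),
    "Dev": ("R & D", {"default": 80, "GPU": 50}),
}


def absolute_capacity(queue, partition):
    """Fraction of the whole partition guaranteed to `queue` (0.0 to 1.0)."""
    parent, capacities = QUEUES[queue]
    fraction = capacities[partition] / 100
    if parent is None:
        return fraction
    return fraction * absolute_capacity(parent, partition)


for name in ("QE", "Dev", "Sales"):
    default_share = absolute_capacity(name, "default")
    gpu_share = absolute_capacity(name, "GPU")
    print(f"{name}: default={default_share:.0%} GPU={gpu_share:.0%}")
```

So QE is guaranteed 20% of 50% of the default partition but 50% of 100% of the GPU partition: the same queue tree carries a separate set of percentages per partition.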
Architecture • Central piece: NodeLabelsManager • Stores labels and their attributes • Stores the nodes-to-labels mapping • It can be read and written by • the CLI and REST API (which we call centralized configuration) • OR each NM can collect its own labels and send them to the RM (which we call distributed configuration) • The scheduler uses the node labels manager to make decisions: it receives resource requests from the AM and returns allocated containers to the AM
Case study (1) – uses node label • Use node labels to create isolated environments for batch/interactive/low-latency workloads. • Deploy YARN containers onto compute nodes that are optimized and accelerated for each workload: • Using RDMA-enabled nodes to accelerate shuffle. • Using powerful CPU nodes to accelerate compression. • It is possible to DOUBLE THE DENSITY of today’s traditional Hadoop cluster with substantially better price performance. • Create a converged system that allows Hadoop / Vertica / Spark and other stacks to share a common pool of data.
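The role of the central store can be sketched with a toy class. `ToyNodeLabelsManager` and its methods are hypothetical names for illustration, not the YARN `NodeLabelsManager` API; the point is only that centralized (CLI/REST) and distributed (NM report) configuration both end up writing the same label set and node-to-label mapping, which the scheduler then reads.

```python
class ToyNodeLabelsManager:
    """Illustrative stand-in for the RM-side label store; not the YARN class."""

    def __init__(self):
        self.labels = {}         # label name -> attributes, e.g. {"exclusive": True}
        self.node_to_label = {}  # host -> label name (one partition per node)

    def add_label(self, name, exclusive=True):
        """Register a label together with its attributes."""
        self.labels[name] = {"exclusive": exclusive}

    def set_node_label(self, host, name):
        """Write path used by both admin commands and NM-reported labels."""
        if name not in self.labels:
            raise KeyError(f"unknown label: {name}")
        self.node_to_label[host] = name

    def nodes_with_label(self, name):
        """Read path used by the scheduler when placing containers."""
        return sorted(h for h, label in self.node_to_label.items() if label == name)


mgr = ToyNodeLabelsManager()
mgr.add_label("GPU", exclusive=False)
mgr.set_node_label("n3", "GPU")  # e.g. set by an admin (centralized)
mgr.set_node_label("n4", "GPU")  # e.g. reported by the node itself (distributed)
gpu_nodes = mgr.nodes_with_label("GPU")
print(gpu_nodes)
```

Rejecting unknown labels in `set_node_label` mirrors the idea that labels are defined cluster-wide first and only then attached to nodes; whether a real deployment enforces this in the same way is not something this sketch claims.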
Case study (2) – uses node label
Case study (3) – eBay cluster uses node labels • Separate Machine Learning workloads from regular workloads • Use node labels to confine licensed software to certain machines • Enabling GPU workloads • Separation of organizational workloads
Case study (4) – Slider use cases • HBase region servers run on nodes with SSD (non-exclusive). • The HBase master has exclusive use of its nodes. • MapReduce jobs run on other nodes, and they can use idle resources of the region server nodes. [Diagram: three partitions: HBase Master (exclusive), HBase Region Server (non-exclusive), and Default. Slider launches the HBase Master and the region servers (RS); a user submits a MapReduce job whose MR AM and tasks run in the Default partition, with some tasks on idle region server nodes]
Status – Completed parts of Node Labels • Exclusive / non-exclusive node partition support in Capacity Scheduler (√) • User-limit • Preemption • All now respect node partitions! • Centralized configuration via CLI/REST API (√) • Distributed configuration in Node Manager’s config/script (√)
Status - Node Labels Web UI
Status – Other Apache projects support node label • The following projects already support node labels: • Spark (SPARK-6470) • MapReduce (MAPREDUCE-6304) • Slider (SLIDER-81) • (via SLIDER) • (via SLIDER) • (via SLIDER) • Ambari (AMBARI-10063)
Future of Node Label • Support constraints (YARN-3409) • Orthogonal to partitions: they describe attributes of a node’s hardware/software and are used only for affinity. • Some examples of constraints: • glibc version • JDK version • Type of CPU (x86_64/i686) • Physical or virtualized • With this, an application can ask for resources with an expression such as • glibc.version >= 2.20 && JDK.version >= 8u20 && x86_64 • Support node labels in FairScheduler (YARN-2497) • Support in more projects • Tez • Oozie • …
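As a rough illustration of how a constraint expression like the one above could be evaluated against a node, here is a minimal sketch. Constraints were still work in progress at the time of the talk, so nothing here reflects an actual YARN API or expression syntax: `parse_version`, `node_matches`, the node dictionaries, and the way the expression is split into minimum versions plus required tags are all assumptions made for this example.

```python
import re


def parse_version(text):
    """Turn "2.20" or "8u20" into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def node_matches(node, min_versions, required_tags):
    """True when the node meets every minimum version and carries every required tag."""
    for name, minimum in min_versions.items():
        actual = node["versions"].get(name)
        if actual is None or parse_version(actual) < parse_version(minimum):
            return False
    return required_tags <= node["tags"]


# Hand-translated form of: glibc.version >= 2.20 && JDK.version >= 8u20 && x86_64
request_versions = {"glibc": "2.20", "JDK": "8u20"}
request_tags = {"x86_64"}

new_node = {"versions": {"glibc": "2.21", "JDK": "8u45"}, "tags": {"x86_64", "physical"}}
old_node = {"versions": {"glibc": "2.9", "JDK": "8u45"}, "tags": {"x86_64"}}

print(node_matches(new_node, request_versions, request_tags))
print(node_matches(old_node, request_versions, request_tags))
```

Comparing versions as integer tuples matters here: glibc 2.9 is older than 2.20, but a plain numeric or string comparison of "2.9" and "2.20" would get that wrong.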