Running Multiple Schedulers in Kubernetes
Xiaoning Ding, Principal Architect, Huawei
Agenda
Default Scheduling in Kubernetes
Why Multiple Schedulers
How It Works in Kubernetes
Improvements in Huawei HOSS
Comparison and Summary
Q & A
Default Scheduling in Kubernetes
1. Incoming pods are submitted to the API Server.
2. The API Server persists them in etcd.
3. The scheduler watches pods with nodeName = "" and schedules each one, updating its nodeName (binding).
4. Each kubelet watches pods with nodeName = $nodeName and launches them on its node (Node 1 … Node n).
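A minimal, self-contained sketch of this loop, using toy Pod/Node types rather than the real client-go API: the scheduler picks up pods whose NodeName is empty, chooses a node, and "binds" by setting NodeName; kubelets then launch the pods assigned to their node.

```go
package main

import "fmt"

// Toy stand-ins for the Kubernetes objects involved in scheduling.
type Pod struct {
	Name     string
	NodeName string // empty means "not scheduled yet"
}

type Node struct{ Name string }

// scheduleOne mimics step 3: pick a node for an unscheduled pod and bind it
// by writing the node name back (the real scheduler posts a Binding object).
func scheduleOne(pod *Pod, nodes []Node) {
	if pod.NodeName != "" {
		return // already bound, nothing to do
	}
	// Trivial placement policy for illustration: the first node wins.
	pod.NodeName = nodes[0].Name
	fmt.Printf("bound pod %s to node %s\n", pod.Name, pod.NodeName)
}

func main() {
	nodes := []Node{{Name: "node-1"}, {Name: "node-2"}}
	pods := []Pod{{Name: "pod-a"}, {Name: "pod-b", NodeName: "node-2"}}

	// Step 3: the scheduler handles the pods it watches with nodeName == "".
	for i := range pods {
		scheduleOne(&pods[i], nodes)
	}

	// Step 4: each kubelet launches the pods bound to its node.
	for _, n := range nodes {
		for _, p := range pods {
			if p.NodeName == n.Name {
				fmt.Printf("kubelet on %s launches %s\n", n.Name, p.Name)
			}
		}
	}
}
```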
Default Scheduling in Kubernetes (Cont.)
Predicates: PodFitsResources, PodFitsPorts, MatchNodeSelector, …
Priorities: ImageLocalityPriority, BalancedResourceAllocation, LeastRequestedPriority, …
Predicates: the scheduling rules that filter out unqualified nodes.
Priorities: the scheduling rules that rank the remaining nodes according to preferences.
Scheduling policy: a combination of predicates and priorities.
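As an illustration (toy types and a single memory dimension, not the real scheduler code), a predicate is a Boolean filter over a (pod, node) pair and a priority is a scoring function over the nodes that survive filtering:

```go
package main

import "fmt"

// Toy resource model: one "memory" dimension keeps the example short.
type Pod struct {
	Name      string
	RequestMB int
}

type Node struct {
	Name   string
	FreeMB int
}

// Predicate: filters out nodes that cannot run the pod (cf. PodFitsResources).
func podFitsResources(p Pod, n Node) bool {
	return p.RequestMB <= n.FreeMB
}

// Priority: ranks the remaining nodes; here more free memory scores higher,
// roughly the idea behind LeastRequestedPriority.
func leastRequested(p Pod, n Node) int {
	return n.FreeMB - p.RequestMB
}

func main() {
	pod := Pod{Name: "web", RequestMB: 2048}
	nodes := []Node{
		{Name: "node-1", FreeMB: 1024},
		{Name: "node-2", FreeMB: 4096},
		{Name: "node-3", FreeMB: 3072},
	}

	best, bestScore := "", -1
	for _, n := range nodes {
		if !podFitsResources(pod, n) {
			continue // predicate failed: node filtered out
		}
		if s := leastRequested(pod, n); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	fmt.Println("selected node:", best) // node-2
}
```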
Why Multiple Schedulers
Scenario: diverse workloads on a shared cluster
Better flexibility and maintainability: easier to try out new schedulers; easier to develop, maintain and evolve them independently
Better availability and scalability: add, update or remove schedulers without downtime; ability to scale out schedulers at the process level
How It Works in Kubernetes: Job Dispatch
Each pod carries a scheduler name annotation, e.g. scheduler.alpha.kubernetes.io/name = "Scheduler1", and each scheduler is configured with its own name.
1. Incoming pods are submitted to the API Server.
2. The API Server persists them in etcd.
3. Each scheduler (Scheduler 1, Scheduler 2, …) watches pods with nodeName = "", checks the scheduler name annotation, drops the pod if it does not match its own name, and otherwise schedules and binds it.
4. Kubelets watch pods with nodeName = $nodeName and launch them.
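A sketch of the dispatch check each scheduler performs in step 3, with toy types; the annotation key is the one shown on the slide (in later Kubernetes releases this moved to the spec.schedulerName field):

```go
package main

import "fmt"

const schedulerNameAnnotation = "scheduler.alpha.kubernetes.io/name"

type Pod struct {
	Name        string
	Annotations map[string]string
	NodeName    string
}

// responsibleFor mimics step 3: a scheduler only handles pods whose
// scheduler-name annotation matches its own configured name.
func responsibleFor(myName string, p Pod) bool {
	return p.Annotations[schedulerNameAnnotation] == myName
}

func main() {
	pods := []Pod{
		{Name: "pod-a", Annotations: map[string]string{schedulerNameAnnotation: "Scheduler1"}},
		{Name: "pod-b", Annotations: map[string]string{schedulerNameAnnotation: "Scheduler2"}},
	}

	const me = "Scheduler1"
	for _, p := range pods {
		if !responsibleFor(me, p) {
			fmt.Printf("%s: dropping %s (not mine)\n", me, p.Name)
			continue
		}
		fmt.Printf("%s: scheduling and binding %s\n", me, p.Name)
	}
}
```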
How It Works in Kubernetes: Conflict Detection
Example scenario: pods P1 and P2 each request 2 GB of memory; Node 1 has 3 GB available.
1. P1 and P2 are submitted to the API Server.
2. The API Server persists them in etcd.
3. Scheduler 1 binds P1 to Node 1 (P1.nodeName = "node 1"); concurrently, Scheduler 2 binds P2 to the same node (P2.nodeName = "node 1").
4. The kubelet on Node 1 watches pods with nodeName = $nodeName and re-runs all predicates before launching each pod, so the pod that no longer fits is rejected only at launch time.
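A toy model of that late detection, using the numbers from the slide (3 GB free, two 2 GB pods): both optimistic schedulers bind to the same node, and the kubelet's admission-time predicate re-check catches the second pod.

```go
package main

import "fmt"

type Pod struct {
	Name      string
	RequestGB int
	NodeName  string
}

type Node struct {
	Name   string
	FreeGB int
}

// admit mimics the kubelet re-running predicates before launching a pod:
// it only accepts the pod if the node still has enough free memory.
func admit(n *Node, p Pod) bool {
	if p.RequestGB > n.FreeGB {
		return false // conflict detected only now, at launch time
	}
	n.FreeGB -= p.RequestGB
	return true
}

func main() {
	node := Node{Name: "node-1", FreeGB: 3}

	// Two schedulers, each with a slightly stale view, bind to the same node.
	p1 := Pod{Name: "P1", RequestGB: 2, NodeName: "node-1"} // bound by Scheduler 1
	p2 := Pod{Name: "P2", RequestGB: 2, NodeName: "node-1"} // bound by Scheduler 2

	for _, p := range []Pod{p1, p2} {
		if admit(&node, p) {
			fmt.Printf("kubelet launches %s (free now %d GB)\n", p.Name, node.FreeGB)
		} else {
			fmt.Printf("kubelet rejects %s: only %d GB free\n", p.Name, node.FreeGB)
		}
	}
}
```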
HOSS: What We Want to Improve
Job dispatch: instance-level load balancing; dynamic scheduling policy
Conflict resolution: early conflict detection and rescheduling; flexible conflict criteria; conflict resolution policies
HOSS: Job Dispatch
Pods carry annotations for scheduler type and policy, e.g. scheduler.alpha.kubernetes.io/type = "Type1" and scheduler.alpha.kubernetes.io/policy = "PolicyA". The Scheduler Controller is configured with the scheduler types and their instance names (Scheduler Type 1: instance1, instance2, …; Scheduler Type 2: instance1, instance2, instance3, instance4, …).
1. Incoming pods are submitted to the API Server.
2. The API Server persists them in etcd.
3. The Scheduler Controller assigns scheduler instances dynamically by updating the "name" annotation, balancing load across the instances of the requested scheduler type.
4. The assigned scheduler instance watches and schedules its pods based on the specified scheduling policy.
5. Kubelets watch the bound pods and launch them on their nodes.
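A sketch of what such a scheduler controller could do in step 3 (hypothetical instance names and a simple round-robin policy; the slide does not specify how HOSS actually picks an instance): it reads the type annotation and writes a concrete instance back into the name annotation.

```go
package main

import "fmt"

const (
	typeAnnotation = "scheduler.alpha.kubernetes.io/type"
	nameAnnotation = "scheduler.alpha.kubernetes.io/name"
)

type Pod struct {
	Name        string
	Annotations map[string]string
}

// Controller load-balances pods across the instances of each scheduler type
// by rewriting the name annotation (round-robin is just one possible policy).
type Controller struct {
	instances map[string][]string // scheduler type -> instance names
	next      map[string]int      // round-robin cursor per type
}

func (c *Controller) assign(p *Pod) {
	t := p.Annotations[typeAnnotation]
	insts := c.instances[t]
	if len(insts) == 0 {
		return // unknown type: leave the pod untouched
	}
	i := c.next[t] % len(insts)
	c.next[t]++
	p.Annotations[nameAnnotation] = insts[i]
	fmt.Printf("assigned %s (type %s) to %s\n", p.Name, t, insts[i])
}

func main() {
	c := &Controller{
		instances: map[string][]string{
			"Type1": {"type1-instance1", "type1-instance2"},
			"Type2": {"type2-instance1", "type2-instance2", "type2-instance3"},
		},
		next: map[string]int{},
	}

	pods := []Pod{
		{Name: "pod-a", Annotations: map[string]string{typeAnnotation: "Type1"}},
		{Name: "pod-b", Annotations: map[string]string{typeAnnotation: "Type1"}},
		{Name: "pod-c", Annotations: map[string]string{typeAnnotation: "Type2"}},
	}
	for i := range pods {
		c.assign(&pods[i])
	}
}
```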
HOSS: Conflict Detection
1. Incoming pods are submitted to the API Server.
2. The API Server persists them in etcd.
3. Schedulers watch and schedule pods; on each binding, the Conflict Resolver performs conflict checks based on the specified criteria.
4. If the binding fails the conflict check, the scheduler re-schedules the current pod; otherwise the kubelet on the target node (which watches pods with nodeName = $nodeName and re-runs all predicates) launches it.
HOSS: Multi-level Conflict Criteria
Strong: based on a node resource versioning mechanism; any version mismatch is a conflict.
Weak: based on node resource quantity; only resource insufficiency is a conflict.
Customizable: based on a Boolean expression over node properties and pod properties; a negative evaluation result is a conflict.
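A sketch of the three checks with a toy data model (the versioning scheme and the expression form are assumptions for illustration): the strong check compares the resource version the scheduler saw against the node's current version, the weak check only verifies that the resources still fit, and the customizable check evaluates a caller-supplied Boolean predicate over the pod and node.

```go
package main

import "fmt"

type Pod struct {
	Name      string
	RequestGB int
}

type Node struct {
	Name            string
	FreeGB          int
	ResourceVersion int // bumped whenever the node's resources change
	Labels          map[string]string
}

// Strong: the scheduler decided against seenVersion; any change to the node
// since then counts as a conflict.
func strongConflict(seenVersion int, n Node) bool {
	return seenVersion != n.ResourceVersion
}

// Weak: only an actual shortage of resources counts as a conflict.
func weakConflict(p Pod, n Node) bool {
	return p.RequestGB > n.FreeGB
}

// Customizable: a Boolean expression over pod and node properties;
// a false result counts as a conflict.
func customConflict(p Pod, n Node, expr func(Pod, Node) bool) bool {
	return !expr(p, n)
}

func main() {
	p := Pod{Name: "P1", RequestGB: 2}
	n := Node{Name: "node-1", FreeGB: 3, ResourceVersion: 7,
		Labels: map[string]string{"zone": "a"}}

	fmt.Println("strong conflict:", strongConflict(6, n)) // true: node changed since version 6
	fmt.Println("weak conflict:  ", weakConflict(p, n))   // false: 2 GB still fits in 3 GB

	// Example expression: the pod fits and the node is in zone "a".
	expr := func(p Pod, n Node) bool {
		return p.RequestGB <= n.FreeGB && n.Labels["zone"] == "a"
	}
	fmt.Println("custom conflict:", customConflict(p, n, expr)) // false
}
```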
HOSS: Conflict Resolution Policies
Pod-priority-based conflict resolution
Scheduler-priority-based conflict resolution
Batch conflict resolution
Group-based conflict resolution
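As one illustration, a sketch of pod-priority-based resolution (the slide only names the policies, so the exact rule here is an assumption): when two pods conflict over the same node, the lower-priority pod loses its binding and goes back to the scheduling queue.

```go
package main

import "fmt"

type Pod struct {
	Name     string
	Priority int // higher priority wins a conflict
	NodeName string
}

// resolveByPodPriority keeps the higher-priority pod bound and unbinds the
// other one so it can be sent back for rescheduling.
func resolveByPodPriority(a, b *Pod) (winner, loser *Pod) {
	winner, loser = a, b
	if b.Priority > a.Priority {
		winner, loser = b, a
	}
	loser.NodeName = "" // unbind: this pod will be rescheduled
	return winner, loser
}

func main() {
	p1 := &Pod{Name: "P1", Priority: 10, NodeName: "node-1"}
	p2 := &Pod{Name: "P2", Priority: 5, NodeName: "node-1"}

	winner, loser := resolveByPodPriority(p1, p2)
	fmt.Printf("%s stays on %s; %s is re-queued for scheduling\n",
		winner.Name, winner.NodeName, loser.Name)
}
```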
HOSS Multi-scheduler Framework
HOSS: all schedulers, including the Kubernetes native scheduler, the Firmament scheduler, the Tarcil scheduler, Mesos and YARN, run side by side on top of the HOSS multi-scheduler framework, which is backed by etcd.
Comparison: Multi-Schedulers in Mesos
1. The Allocation Module in the Mesos Master makes a resource offer to a framework scheduler, e.g. <node1, 4cpu, 4gb, …>.
2. The framework's scheduler replies with tasks to launch on the offered resources, e.g. <task 1, node1, 2cpu, 1gb, …> and <task 2, node1, 1cpu, 2gb, …>.
3. The master launches the tasks through the executors on the nodes, e.g. <framework1, task 1, 2cpu, 1gb, …> and <framework2, task 2, 1cpu, 2gb, …>.
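A toy model of that offer cycle (simplified: one node, sequential offers, no offer timeouts or revocation) to contrast with the shared-state approach above: each framework only sees the resources currently offered to it, and the master subtracts whatever a framework accepts before offering the remainder to the next framework, so no two frameworks ever hold the same resources.

```go
package main

import "fmt"

// Offer is the partial view of cluster resources a framework gets to see.
type Offer struct {
	Node  string
	CPU   int
	MemGB int
}

// Task is what a framework asks the master to launch out of an offer.
type Task struct {
	Name  string
	CPU   int
	MemGB int
}

// Framework decides which of its pending tasks fit into an offer.
type Framework struct {
	Name    string
	Pending []Task
}

func (f *Framework) accept(o Offer) []Task {
	var launched []Task
	for _, t := range f.Pending {
		if t.CPU <= o.CPU && t.MemGB <= o.MemGB {
			o.CPU -= t.CPU
			o.MemGB -= t.MemGB
			launched = append(launched, t)
		}
	}
	return launched
}

func main() {
	// node1 has 4 CPUs and 4 GB, as in the slide's example.
	freeCPU, freeMem := 4, 4

	frameworks := []Framework{
		{Name: "framework1", Pending: []Task{{Name: "task 1", CPU: 2, MemGB: 1}}},
		{Name: "framework2", Pending: []Task{{Name: "task 2", CPU: 1, MemGB: 2}}},
	}

	// The allocation module offers the remaining resources to one framework
	// at a time (pessimistic concurrency control).
	for _, f := range frameworks {
		offer := Offer{Node: "node1", CPU: freeCPU, MemGB: freeMem}
		for _, t := range f.accept(offer) {
			freeCPU -= t.CPU
			freeMem -= t.MemGB
			fmt.Printf("master launches <%s, %s, %dcpu, %dgb> on %s\n",
				f.Name, t.Name, t.CPU, t.MemGB, offer.Node)
		}
	}
	fmt.Printf("remaining on node1: %dcpu, %dgb\n", freeCPU, freeMem)
}
```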
Summary

                            Kubernetes             Mesos*          HOSS
Concurrency Control         Optimistic             Pessimistic     Optimistic
Scheduler's Resource View   Shared global state    Partial state   Shared global state
Job Dispatch                Multi-type             N/A             Multi-instance, dynamic policy
Conflict Detection          Late detection         N/A             Early detection
Conflict Model              Coarse-grained model   N/A             Fine-grained model

* Without the ongoing improvement MESOS-1607 "Optimistic Offer"
References
Kubernetes scheduler: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md
Kubernetes scheduler algorithm: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler_algorithm.md
Kubernetes multiple schedulers proposal: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/multiple-schedulers.md
Firmament: http://www.firmament.io/blog/scheduler-architectures.html
Tarcil: http://web.stanford.edu/~cdel/2014.insubmission.tarcil.pdf
Omega: http://research.google.com/pubs/pub41684.html
Borg, Omega and Kubernetes: http://research.google.com/pubs/pub44843.html
Mesos optimistic offer: https://issues.apache.org/jira/browse/MESOS-1607
An interview about different multi-scheduler architectures: https://kismatic.com/company/qa-with-malte-schwarzkopf-on-distributed-systems-orchestration-in-the-modern-data-center/
Thank You