Datacenter As a Computer
Mosharaf Chowdhury
EECS 582 (W16), 3/7/16
Announcements
- Midterm grades are out. There were many interesting approaches. Thanks!
- Meeting on April 11 moved earlier to April 8 (Friday). No reviews for the papers for April 8.
- Meeting on April 13 moved later to April 15 (Friday).
Mid-Semester Presentations
- 20 minutes per group, plus Q/A (strictly followed)
- Four parts: motivation, approach overview, current status, and end goal
March 21
1. Juncheng and Youngmoon
2. Andrew
3. Dong-hyeon and Ofir
4. Hanyun, Ning, and Xianghan
March 23
1. Clyde, Nathan, and Seth
2. Chao-han, Chi-fan, and Yayun
3. Chao and Yikai
4. Kuangyuan, Qi, and Yang
Why is One Machine Not Enough?
- Too much data
- Too little storage capacity
- Not enough I/O bandwidth
- Not enough computing capability
Warehouse-Scale Computers
- Single organization
- Homogeneity (to some extent)
- Cost efficiency at scale
- Multiplexing across applications and services
- Rent it out!
Many concerns: infrastructure, networking, storage, software, power/energy, failure/recovery, …
Architectural Overview
[Figure: inside a server, components connected over the memory bus, PCIe, and SATA; servers connected over Ethernet to top-of-rack (ToR) and aggregation switches]
Datacenter Networks
Traditional hierarchical topology (core, aggregation, and edge layers):
- Expensive
- Difficult to scale
- High oversubscription
- Limited path diversity
- …
Datacenter Networks
Clos topology (core, aggregation, and edge layers):
- Cheaper
- Easier to scale
- No/low oversubscription
- Higher path diversity
- …
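To make the oversubscription point concrete, here is a minimal sketch (with hypothetical port counts and speeds, not numbers from the slides) that computes a switch layer's oversubscription ratio as aggregate downlink capacity divided by aggregate uplink capacity.

```python
def oversubscription(downlink_gbps: float, num_down: int,
                     uplink_gbps: float, num_up: int) -> float:
    """Ratio of aggregate downlink to aggregate uplink capacity at a switch layer."""
    return (downlink_gbps * num_down) / (uplink_gbps * num_up)

# Hypothetical three-tier edge switch: 48 x 10G server-facing ports, 4 x 40G uplinks.
print(oversubscription(10, 48, 40, 4))   # 3.0 -> 3:1 oversubscribed

# Hypothetical Clos (fat-tree) edge switch: half the ports face down, half face up.
print(oversubscription(10, 24, 10, 24))  # 1.0 -> no oversubscription
```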
Storage Hierarchy
[Figure: per-machine hierarchy from L1, L2, and L3 caches to RAM, 3D XPoint, SSD, and HDD, extended across machines, racks, and pods]
Power, Energy, Modeling, Building, …
- Many challenges
- We’ll focus primarily on software infrastructure in this class
Datacenter Needs an Operating System
A datacenter is a collection of:
- CPU cores
- Memory modules
- SSDs and HDDs
all connected by an interconnect.
A computer is a collection of:
- CPU cores
- Memory modules
- SSDs and HDDs
all connected by an interconnect.
Some Differences
1. High level of parallelism
2. Diversity of workloads
3. Resource heterogeneity
4. Failure is the norm
5. Communication dictates performance
Three Categories of Software
1. Platform-level: software and firmware present on every machine
2. Cluster-level: distributed systems that enable everything
3. Application-level: user-facing applications built on top
Common Techniques

Technique                Performance   Availability
Replication              X             X
Erasure coding                         X
Sharding/partitioning    X             X
Load balancing           X
Health checks                          X
Integrity checks                       X
Compression              X
Eventual consistency     X             X
Centralized controller   X
Canaries                               X
Redundant execution      X
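As one concrete illustration of two rows in this table, here is a minimal sketch (all names and constants are hypothetical) of hash-based sharding with replication: sharding spreads keys across machines for performance, and storing each key on several shards keeps it available when one shard fails.

```python
import hashlib

NUM_SHARDS = 8          # hypothetical shard count
REPLICATION_FACTOR = 3  # each key is stored on 3 distinct shards

def shard_of(key: str) -> int:
    """Map a key to a shard via a stable hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def replicas(key: str) -> list[int]:
    """Shards holding a key's replicas: the primary plus the next consecutive shards."""
    primary = shard_of(key)
    return [(primary + i) % NUM_SHARDS for i in range(REPLICATION_FACTOR)]

print(replicas("user:42"))  # e.g., [5, 6, 7] -> survives the loss of any two shards
```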
Datacenter Programming Models
Fault-tolerant, scalable, and easy access to all the distributed datacenter resources. Users submit jobs to these models without having to worry about low-level details.
- MapReduce: grandfather of big data as we know it today; two-stage, disk-based, network-avoiding
- Spark: common substrate for diverse programming requirements; many-stage, memory-first
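A minimal, single-machine sketch of the two-stage model described above (illustrative only; this is not Google's or Hadoop's API, and the function names are mine): the user supplies map and reduce functions, and the framework handles grouping by key, which in a real deployment also covers partitioning, shuffling over the network, and fault tolerance.

```python
from collections import defaultdict
from typing import Callable, Iterable

def map_reduce(records: Iterable[str],
               map_fn: Callable[[str], Iterable[tuple[str, int]]],
               reduce_fn: Callable[[str, list[int]], tuple[str, int]]):
    # Map stage: emit intermediate key/value pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)        # "shuffle": group values by key
    # Reduce stage: combine all values for each key.
    return [reduce_fn(key, values) for key, values in groups.items()]

# Classic word count.
docs = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(
    docs,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: (word, sum(ones)),
)
print(counts)  # [('the', 2), ('quick', 1), ('brown', 1), ...]
```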
Datacenter “Operating Systems”
Fair and efficient distribution of resources among many competing programming models and jobs. Does the dirty work so that users won’t have to.
- Mesos: started with a simple question: how to run different versions of Hadoop? Fairness-first allocator.
- Borg: Google’s cluster manager. Utilization-first allocator.
Resource Allocation and Scheduling
How do we divide the resources anyway?
- DRF: multi-resource max-min fairness; two-level; implemented in Mesos and YARN
- HUG: DRF + high utilization
- Omega: shared-state resource allocator; many schedulers interact through transactions
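A minimal sketch of DRF's progressive-filling idea: repeatedly launch one task for the user with the smallest dominant share (the largest fraction of any single resource that user has been allocated) until no further task fits. The capacities and demands below are the worked example from the DRF paper; the code itself is only an illustration, not the Mesos or YARN implementation.

```python
def drf(capacity, demands, users):
    """Progressive filling: give the next task to the user with the
    smallest dominant share, until no user's next task fits."""
    allocated = {u: [0.0] * len(capacity) for u in users}
    tasks = {u: 0 for u in users}
    while True:
        def dominant_share(u):
            # Largest fraction of any resource allocated to user u.
            return max(a / c for a, c in zip(allocated[u], capacity))
        # Users whose next task still fits within the remaining capacity.
        feasible = [
            u for u in users
            if all(sum(allocated[v][i] for v in users) + demands[u][i] <= capacity[i]
                   for i in range(len(capacity)))
        ]
        if not feasible:
            return tasks
        u = min(feasible, key=dominant_share)
        for i, d in enumerate(demands[u]):
            allocated[u][i] += d
        tasks[u] += 1

# DRF paper example: 9 CPUs and 18 GB total;
# user A's tasks need <1 CPU, 4 GB>, user B's need <3 CPUs, 1 GB>.
print(drf(capacity=[9, 18],
          demands={"A": [1, 4], "B": [3, 1]},
          users=["A", "B"]))   # {'A': 3, 'B': 2}
```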
File Systems
Fault-tolerant, efficient access to data.
- GFS: data resides with compute resources; compute goes to data; hence, data locality. The game changer: centralization isn’t too bad!
- FDS: data resides separately from compute; data comes to compute; hence, requires a very fast network.
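To illustrate the data-locality idea that co-locating data and compute enables (a sketch with hypothetical data structures, not the GFS API): a scheduler prefers a machine that already holds a replica of the task's input block, and only falls back to a remote read when it must.

```python
def place_task(input_block: str,
               block_locations: dict[str, set[str]],
               free_slots: set[str]) -> str:
    """Prefer a machine that already stores the input block (data locality);
    otherwise fall back to any free machine and read over the network."""
    local_choices = block_locations.get(input_block, set()) & free_slots
    if local_choices:
        return sorted(local_choices)[0]   # node-local: no network read
    return sorted(free_slots)[0]          # remote: data must cross the network

# Hypothetical cluster state.
locations = {"blk-1": {"m3", "m7"}, "blk-2": {"m1"}}
print(place_task("blk-1", locations, free_slots={"m2", "m7"}))  # 'm7' (local)
```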
Memory Management
What to store in cache and what to evict?
- PACMan: disk locality is irrelevant for fast-enough networks. All-or-nothing property: caching is useless unless all of a job’s task inputs are cached. The best eviction algorithm for a single machine isn’t so good for parallel computing.
- Parameter Server: shared-memory architecture (sort of). Data and compute are still collocated, but communication is automatically batched to minimize overheads.
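A tiny sketch of the all-or-nothing property above (the task times are assumed, purely illustrative): because a parallel stage finishes only when its slowest task does, caching helps only once every task's input is in memory.

```python
def stage_completion_time(num_tasks: int, num_cached: int,
                          cached_s: float = 10.0, uncached_s: float = 100.0) -> float:
    """A parallel stage finishes when its slowest task does."""
    task_times = [cached_s] * num_cached + [uncached_s] * (num_tasks - num_cached)
    return max(task_times)

# Illustrative: 10 tasks; caching 9 of 10 inputs buys nothing, 10 of 10 buys 10x.
print(stage_completion_time(10, 0))   # 100.0
print(stage_completion_time(10, 9))   # 100.0 -> partial caching doesn't help
print(stage_completion_time(10, 10))  # 10.0  -> all-or-nothing
```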
Network Scheduling
Communication cannot be avoided; how do we minimize its impact?
- DCTCP: application-agnostic; point-to-point. Outperforms TCP through ECN-enabled multi-level congestion notifications.
- Varys: application-aware; multipoint-to-multipoint; all-or-nothing in communication. Concurrent open-shop scheduling with coupled resources. Centralized network bandwidth management.
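A minimal sketch of DCTCP's sender-side reaction, following the update rules in the DCTCP paper (variable names are mine; window growth and other TCP machinery are omitted): the sender estimates the fraction of ECN-marked packets and cuts its window in proportion to that estimate, instead of TCP's fixed halving on any congestion signal.

```python
def dctcp_update(cwnd: float, alpha: float, marked: int, acked: int,
                 g: float = 1.0 / 16) -> tuple[float, float]:
    """Once per window of ACKs: update the marked-fraction estimate
    alpha <- (1 - g) * alpha + g * F, then cut cwnd by alpha / 2."""
    frac_marked = marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * frac_marked
    cwnd = cwnd * (1 - alpha / 2)
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
for _ in range(20):                     # sustained mild congestion: 10% of packets marked
    cwnd, alpha = dctcp_update(cwnd, alpha, marked=10, acked=100)
print(round(cwnd, 1), round(alpha, 3))  # gentle backoff, unlike TCP's halving
```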
Unavailability and Failure
- In a datacenter with many thousands of servers, even when each machine has an MTBF of many years, some machine will fail every day on average.
- Build fault-tolerant software infrastructure and hide failure-handling complexity from application-level software as much as possible.
- Configuration is one of the largest sources of service disruption.
- Storage subsystems are the biggest sources of machine crashes.
- Tolerating/surviving failures is different from hiding failures.
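A back-of-the-envelope version of that first claim, using assumed numbers (the specific figures on the original slide did not survive extraction): with a 30-year MTBF per machine and 10,000 machines, expect roughly one machine failure per day.

```python
machines = 10_000                   # assumed fleet size
mtbf_days = 30 * 365                # assumed per-machine MTBF: 30 years
failures_per_day = machines / mtbf_days
print(round(failures_per_day, 2))   # ~0.91 -> about one machine failure per day
```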
What’s the most critical resource in a datacenter? Why?
Will we come back to client-centric models?
- As opposed to the server-centric, datacenter-driven model of today
- If yes, why and when? If not, why not?