Presentation is loading. Please wait.

Presentation is loading. Please wait.

Datacenter As a Computer Mosharaf Chowdhury EECS 582 – W1613/7/16.

Similar presentations


Presentation on theme: "Datacenter As a Computer Mosharaf Chowdhury EECS 582 – W1613/7/16."— Presentation transcript:

1 Datacenter As a Computer Mosharaf Chowdhury EECS 582 – W1613/7/16

2 Announcements Midterm grades are out There were many interesting approaches. Thanks! Meeting on April 11 moved earlier to April 8 (Friday) No reviews for the papers for April 8 Meeting on April 13 moved later to April 15 (Friday) 3/7/16EECS 582 – W162

3 Mid-Semester Presentations Strictly followed 20 minutes per group 15 + 5 minutes Q/A Four parts: motivation, approach overview, current status, and end goal March 21 1.Juncheng and Youngmoon 2.Andrew 3.Dong-hyeon and Ofir 4.Hanyun, Ning, and Xianghan March 23 1.Clyde, Nathan, and Seth 2.Chao-han, Chi-fan, and Yayun 3.Chao and Yikai 4.Kuangyuan, Qi, and Yang 3/7/16EECS 582 – W163

4 Why is One Machine Not Enough? Too much data Too little storage capacity Not enough I/O bandwidth Not enough computing capability 3/7/16EECS 582 – W164

5 Warehouse-Scale Computers Single organization Homogeneity (to some extent) Cost efficiency at scale Multiplexing across applications and services Rent it out! Many concerns Infrastructure Networking Storage Software Power/Energy Failure/Recovery … 3/7/16EECS 582 – W165

6 Architectural Overview 3/7/16EECS 582 – W166 Memory Bus Ethernet SATA PCIe Server ToR Aggregation

7 Datacenter Networks Traditional hierarchical topology Expensive Difficult to scale High oversubscription Smaller path diversity … 3/7/16EECS 582 – W167 Core Agg. Edge

8 Datacenter Networks CLOS topology Cheaper Easier to scale NO/low oversubscription Higher path diversity … 3/7/16EECS 582 – W168 Core Agg. Edge

9 Storage Hierarchy 3/7/16EECS 582 – W169 (https://www.youtube.com/watch?v=IWsjbqbkqh8) L1 cache L2 cache L3 cache RAM 3D Xpoint SSD HDD Across machines, racks, and pods

10 Power, Energy, Modeling, Building,… Many challenges We’ll focus primarily on software infrastructure in this class 3/7/16EECS 582 – W1610

11 Datacenter Needs an Operating System Datacenter is a collection of CPU cores Memory modules SSDs and HDDs All connected by an interconnect A computer is a collection of CPU cores Memory modules SSDs and HDDs All connected by an interconnect 3/7/16EECS 582 – W1611

12 Some Differences 1.High-level of parallelism 2.Diversity of workload 3.Resource heterogeneity 4.Failure is the norm 5.Communication dictates performance 3/7/16EECS 582 – W1612

13 Three Categories of Software 1.Platform-level Software firmware that are present in every machine 2.Cluster-level Distributed systems to enable everything 3.Application-level User-facing applications built on top 3/7/16EECS 582 – W1613

14 Common Techniques TechniquePerformanceAvailability Replication Erasure coding Sharding/partitioning Load balancing Health checks Integrity checks Compression Eventual consistency Centralized controller Canaries Redundant execution 3/7/16EECS 582 – W1614

15 Common Techniques TechniquePerformanceAvailability ReplicationXX Erasure codingX Sharding/partitioningXX Load balancingX Health checksX Integrity checksX CompressionX Eventual consistencyXX Centralized controllerX CanariesX Redundant executionX 3/7/16EECS 582 – W1615

16 Datacenter Programming Models Fault-tolerance, scalable, and easy access to all the distributed datacenter resources Users submit jobs to these models w/o having to worry about low-level details MapReduce Grandfather of big data as we know today Two-stage, disk-based, network-avoiding Spark Common substrate for diverse programming requirements Many-stage, memory-first 3/7/16EECS 582 – W1616

17 Datacenter “Operating Systems” Fair and efficient distribution of resources among many competing programming models and jobs Does the dirty work so that users won’t have to Mesos Started with a simple question – how to run different versions of Hadoop? Fairness-first allocator Borg Google’s cluster manager Utilization-first allocator 3/7/16EECS 582 – W1617

18 Resource Allocation and Scheduling How do we divide the resources anyway? DRF Multi-resource max-min fairness Two-level; implemented in Mesos and YARN HUG: DRF + High utilization Omega Shared-state resource allocator Many schedulers interact through transactions 3/7/16EECS 582 – W1618

19 File Systems Fault-tolerant, efficient access to data GFS Data resides with compute resources Compute goes to data; hence, data locality The game changer: centralization isn’t too bad! FDS Data resides separately from compute Data comes to compute; hence, requires very fast network 3/7/16EECS 582 – W1619

20 Memory Management What to store in cache and what to evict? PACMan Disk locality is irrelevant for fast-enough network All-or-nothing property: caching is useless unless all tasks’ inputs are cached Best eviction algorithm for single machine isn’t so good for parallel computing Parameter Server Shared-memory architecture (sort of) Data and compute are still collocated, but communication is automatically batched to minimize overheads 3/7/16EECS 582 – W1620

21 Network Scheduling Communication cannot be avoided; how do we minimize its impact? DCTCP Application-agnostic; point-to-point Outperforms TCP through ECN-enabled multi-level congestion notifications Varys Application-aware; multipoint-to-multipoint; all-or-nothing in communication Concurrent open-shop scheduling with coupled resources Centralized network bandwidth management 3/7/16EECS 582 – W1621

22 Unavailability and Failure In a 10000-server DC, with 10000-day MTBF machines, one machine will fail everyday on average Build fault-tolerant software infrastructure and hide failure- handling complexity from application-level software as much as possible Configuration is one of the largest sources of service disruption Storage subsystems are the biggest sources of machine crashes Tolerating/surviving from failures is different from hiding failures 3/7/16EECS 582 – W1622

23 What’s the most critical resource in a datacenter? Why? 3/7/16EECS 582 – W1623

24 Will we come back to client-centric models? As opposed to server-centric/datacenter-driven model today If yes, why and when? If not, why not? 3/7/16EECS 582 – W1624


Download ppt "Datacenter As a Computer Mosharaf Chowdhury EECS 582 – W1613/7/16."

Similar presentations


Ads by Google