Datacenter As a Computer. Mosharaf Chowdhury. EECS 582 – W16, 3/7/16.

Presentation transcript:

Announcements
- Midterm grades are out. There were many interesting approaches. Thanks!
- Meeting on April 11 moved earlier to April 8 (Friday)
- No reviews for the papers for April 8
- Meeting on April 13 moved later to April 15 (Friday)

Mid-Semester Presentations
- 20 minutes per group, strictly enforced, including Q/A
- Four parts: motivation, approach overview, current status, and end goal
- March 21
  1. Juncheng and Youngmoon
  2. Andrew
  3. Dong-hyeon and Ofir
  4. Hanyun, Ning, and Xianghan
- March 23
  1. Clyde, Nathan, and Seth
  2. Chao-han, Chi-fan, and Yayun
  3. Chao and Yikai
  4. Kuangyuan, Qi, and Yang

Why is One Machine Not Enough?
- Too much data
- Too little storage capacity
- Not enough I/O bandwidth
- Not enough computing capability

Warehouse-Scale Computers
- Single organization
- Homogeneity (to some extent)
- Cost efficiency at scale
  - Multiplexing across applications and services
  - Rent it out!
- Many concerns
  - Infrastructure
  - Networking
  - Storage
  - Software
  - Power/Energy
  - Failure/Recovery
  - …

Architectural Overview
[Figure labels: Memory Bus, Ethernet, SATA, PCIe, Server, ToR, Aggregation]

Datacenter Networks: Traditional hierarchical topology
- Expensive
- Difficult to scale
- High oversubscription
- Smaller path diversity
- …
[Figure labels: Core, Aggregation, Edge]

Datacenter Networks: Clos topology
- Cheaper
- Easier to scale
- No/low oversubscription
- Higher path diversity
- …
[Figure labels: Core, Aggregation, Edge]
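
Oversubscription at a switch is commonly expressed as the ratio of server-facing (downlink) capacity to core-facing (uplink) capacity. The sketch below only illustrates that ratio; the port counts and link speeds are hypothetical and do not come from the slides.

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of downlink to uplink capacity at one switch (1.0 means none)."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Hypothetical ToR switch in a hierarchical design:
# 40 servers at 10 Gbps share 4 uplinks at 10 Gbps.
hierarchical_tor = oversubscription(40, 10, 4, 10)

# Hypothetical Clos edge switch: half the ports face down, half face up.
clos_edge = oversubscription(24, 10, 24, 10)

print(hierarchical_tor, clos_edge)
```

A ratio of 10 means that servers under that switch can get only a tenth of their link speed when all of them send traffic upward at once; a ratio of 1 means the uplinks can carry the full offered load.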

Storage Hierarchy
- L1 cache
- L2 cache
- L3 cache
- RAM
- 3D XPoint
- SSD
- HDD
- Across machines, racks, and pods

Power, Energy, Modeling, Building, …
- Many challenges
- We’ll focus primarily on software infrastructure in this class

Datacenter Needs an Operating System
- A datacenter is a collection of
  - CPU cores
  - Memory modules
  - SSDs and HDDs
  - All connected by an interconnect
- A computer is a collection of
  - CPU cores
  - Memory modules
  - SSDs and HDDs
  - All connected by an interconnect

Some Differences
1. High level of parallelism
2. Diversity of workload
3. Resource heterogeneity
4. Failure is the norm
5. Communication dictates performance

Three Categories of Software
1. Platform-level: software and firmware present on every machine
2. Cluster-level: distributed systems that enable everything
3. Application-level: user-facing applications built on top

Common Techniques

Technique               Performance   Availability
Replication             X             X
Erasure coding                        X
Sharding/partitioning   X             X
Load balancing          X
Health checks                         X
Integrity checks                      X
Compression             X
Eventual consistency    X             X
Centralized controller  X
Canaries                              X
Redundant execution     X

Datacenter Programming Models
- Fault-tolerant, scalable, and easy access to all the distributed datacenter resources
- Users submit jobs to these models without having to worry about low-level details
- MapReduce
  - Grandfather of big data as we know it today
  - Two-stage, disk-based, network-avoiding
- Spark
  - Common substrate for diverse programming requirements
  - Many-stage, memory-first
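
To make the two-stage structure of MapReduce concrete, here is a minimal single-process word-count sketch. It is not Hadoop's or Google's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real system would run the map and reduce stages on many machines with the shuffle crossing the network.

```python
from collections import defaultdict

def map_phase(document):
    """Map stage: emit a (word, 1) pair for every word in one input split."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce stage: combine all values for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count([
    "the datacenter is the computer",
    "the network is the bottleneck",
])
print(counts)
```

The user writes only the map and reduce functions; partitioning, shuffling, and re-running failed tasks are left to the framework, which is what lets users ignore the low-level details.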

Datacenter “Operating Systems”
- Fair and efficient distribution of resources among many competing programming models and jobs
- Does the dirty work so that users won’t have to
- Mesos
  - Started with a simple question: how to run different versions of Hadoop?
  - Fairness-first allocator
- Borg
  - Google’s cluster manager
  - Utilization-first allocator

Resource Allocation and Scheduling
- How do we divide the resources anyway?
- DRF
  - Multi-resource max-min fairness
  - Two-level; implemented in Mesos and YARN
  - HUG: DRF + High utilization
- Omega
  - Shared-state resource allocator
  - Many schedulers interact through transactions
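
Multi-resource max-min fairness can be illustrated with a small progressive-filling sketch: repeatedly hand one task to the user whose dominant share (their largest fractional use of any single resource) is currently lowest. This is a simplified illustration, not the Mesos or YARN implementation; the function name `drf`, the cluster capacity, and the per-task demands are assumptions chosen for the example, and every demand is assumed to be positive in at least one resource.

```python
def drf(capacity, demands):
    """Progressive filling: give one task at a time to the user with the
    lowest dominant share; stop when that user's next task no longer fits."""
    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(tasks[user] * d / c for d, c in zip(demands[user], capacity))

    while True:
        user = min(demands, key=dominant_share)
        need = demands[user]
        if any(u + n > c for u, n, c in zip(used, need, capacity)):
            return tasks  # cluster is full for the neediest user
        used = [u + n for u, n in zip(used, need)]
        tasks[user] += 1

# Illustrative cluster: 9 CPUs and 18 GB of memory.
# User A runs <1 CPU, 4 GB> tasks; user B runs <3 CPUs, 1 GB> tasks.
allocation = drf(capacity=(9, 18), demands={"A": (1, 4), "B": (3, 1)})
print(allocation)
```

With these inputs both users end at a dominant share of two thirds: A's is memory (12 of 18 GB) and B's is CPU (6 of 9 CPUs), which is the equalization that max-min fairness over dominant shares aims for.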

File Systems
- Fault-tolerant, efficient access to data
- GFS
  - Data resides with compute resources
  - Compute goes to data; hence, data locality
  - The game changer: centralization isn’t too bad!
- FDS
  - Data resides separately from compute
  - Data comes to compute; hence, requires a very fast network

Memory Management
- What to store in cache and what to evict?
- PACMan
  - Disk locality is irrelevant for a fast-enough network
  - All-or-nothing property: caching is useless unless all tasks’ inputs are cached
  - The best eviction algorithm for a single machine isn’t so good for parallel computing
- Parameter Server
  - Shared-memory architecture (sort of)
  - Data and compute are still collocated, but communication is automatically batched to minimize overheads
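
The all-or-nothing property follows from the fact that a wave of parallel tasks finishes only when its slowest task does. A minimal sketch, assuming hypothetical per-task read times of 10 seconds from disk and 1 second from memory (neither number is from the slides):

```python
def job_time(task_cached, disk_s=10.0, mem_s=1.0):
    """Completion time of one wave of parallel tasks: the slowest task decides."""
    return max(mem_s if cached else disk_s for cached in task_cached)

none_cached = job_time([False, False, False, False])
three_of_four = job_time([True, True, True, False])
all_cached = job_time([True, True, True, True])

print(none_cached, three_of_four, all_cached)
```

Caching three of the four inputs consumes memory without shortening the job at all, which is why an eviction policy that is good for a single machine (where every hit helps) can be a poor fit for parallel jobs.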

Network Scheduling
- Communication cannot be avoided; how do we minimize its impact?
- DCTCP
  - Application-agnostic; point-to-point
  - Outperforms TCP through ECN-enabled multi-level congestion notifications
- Varys
  - Application-aware; multipoint-to-multipoint; all-or-nothing in communication
  - Concurrent open-shop scheduling with coupled resources
  - Centralized network bandwidth management
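
The "multi-level" notification in DCTCP comes from the sender estimating what fraction of its packets were ECN-marked and shrinking its window in proportion to that estimate, rather than always halving it. The following is a simplified per-window sketch of that idea, not a full TCP implementation; the function name `dctcp_update` and the default gain `g = 1/16` are assumptions made for illustration.

```python
def dctcp_update(cwnd, alpha, marked, total, g=1 / 16):
    """One per-window update of a DCTCP-style sender.

    cwnd:   congestion window in packets
    alpha:  running estimate of the fraction of ECN-marked packets
    marked: ECN-marked packets acknowledged in the last window
    total:  all packets acknowledged in the last window
    Returns (new_cwnd, new_alpha).
    """
    frac = marked / total
    alpha = (1 - g) * alpha + g * frac  # exponentially weighted moving average
    if marked > 0:
        cwnd = cwnd * (1 - alpha / 2)   # cut in proportion to congestion extent
    else:
        cwnd = cwnd + 1                 # additive increase when nothing is marked
    return cwnd, alpha

# Heavy, persistent congestion: behaves like TCP and halves the window.
heavy = dctcp_update(cwnd=100, alpha=1.0, marked=10, total=10)
# Mild congestion: only a small reduction.
mild = dctcp_update(cwnd=100, alpha=0.0, marked=1, total=16)
print(heavy, mild)
```

Because the reduction scales with the marked fraction, a sender backs off gently when queues are only slightly over the marking threshold and sharply when they stay over it.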

Unavailability and Failure
- In a datacenter of N servers whose machines each have an MTBF of N days, one machine will fail every day on average
- Build fault-tolerant software infrastructure and hide failure-handling complexity from application-level software as much as possible
- Configuration is one of the largest sources of service disruption
- Storage subsystems are the biggest sources of machine crashes
- Tolerating/surviving failures is different from hiding failures
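
The failure arithmetic on this slide can be sketched as follows, assuming machines fail independently at a constant rate of 1/MTBF. The server count and MTBF below are hypothetical values picked for illustration, not figures from the slide.

```python
def expected_failures_per_day(num_servers, mtbf_days):
    """Expected machine failures per day, assuming independent machines
    that each fail at a constant rate of 1 / mtbf_days."""
    return num_servers / mtbf_days

# Hypothetical fleet: 10,000 servers, each with a 10,000-day MTBF.
per_day = expected_failures_per_day(10_000, 10_000)

# A single such machine, by contrast, fails about once in 10,000 days.
single = expected_failures_per_day(1, 10_000)

print(per_day, single)
```

Even very reliable individual machines therefore produce daily failures at datacenter scale, which is why failure handling has to live in the software infrastructure.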

What’s the most critical resource in a datacenter? Why?

Will we come back to client-centric models?
- As opposed to the server-centric/datacenter-driven model of today
- If yes, why and when? If not, why not?