MS108 Computer System I, Lecture 13: Warehouse Computing. Prof. Xiaoyao Liang, 2015/6/10

Server Computers. Applications are increasingly run on servers: web search, office apps, virtual worlds, and more. This requires large data-center servers with multiple processors, network connections, and massive storage, under space and power constraints. Rack server equipment is often sized in units of 1.75" (1U), e.g., a 1U switch or a 2U server.

Rack-Mounted Servers. Example: the Sun Fire X4150, a 1U server.

Scalability vs. Cost (figure): plots personal systems, servers, departmental systems, SMP super servers, MPPs, and clusters of PCs against scalability and cost.

Motivations for Using Clusters over Specialized Parallel Computers
- Individual PCs are becoming increasingly powerful.
- Communication bandwidth between PCs is increasing and latency is decreasing (Gigabit Ethernet, Myrinet).
- PC clusters are easier to integrate into existing networks.
- User utilization of PCs is typically low (<10%).
- Development tools for workstations and PCs are mature.
- PC clusters are cheap and readily available.
- Clusters can be easily grown.

Cluster Architecture (figure): sequential and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure); each PC/workstation node provides a network interface and communications software, and the nodes are tied together by a cluster interconnection network/switch.

How Can We Benefit From Clusters? Given a certain user application:
Phase 1: If the application runs fast enough on a single PC, there is no need to do anything else; otherwise go to Phase 2.
Phase 2: Try to fit the whole application in DRAM to avoid going to disk. If that is not possible, use the DRAM of other idle workstations; network DRAM is 5 to 10 times faster than local disk.

Remote Memory Paging. Background: applications' working sets have increased dramatically and require more memory than a single workstation can provide. Solution: insert network DRAM into the memory hierarchy between local memory and disk, and swap pages to remote memory. (Figure: cache, main memory, network RAM, disk.)

How Can We Benefit From Clusters? (cont.) In this case, the DRAM of the networked PCs behaves like a huge cache for the disk; otherwise go to Phase 3. (Figure: execution time versus problem size for 512 MB + disk, networked DRAM, and all-DRAM configurations; 512 MB is marked on the problem-size axis.)

How Can We Benefit From Clusters? Phase 3: If the network DRAM is not large enough, try using all the disks in the network in parallel for reading and writing data and program code (e.g., RAID) to speed up the I/O; or go to Phase 4. (Figure: network RAID striping with local cluster caching over the communication network.)

How Can We Benefit From Clusters? Phase 4: Execute the program on multiple workstations (PCs) at the same time: parallel processing.
Tools: many tools handle all of these phases transparently (except parallelizing the program) and also provide load balancing and scheduling:
- Beowulf (Caltech and NASA), USA
- Condor (University of Wisconsin), USA
- MPI (MPI Forum; MPICH is one of the popular implementations)
- NOW (Network of Workstations), Berkeley, USA
- PVM (Oak Ridge National Lab / UTK / Emory), USA

What network should be used?

             Fast Ethernet    Gigabit Ethernet   Myrinet           10GbE
Latency      ~120 µs          ~120 µs            ~7 µs             tens of µs
Bandwidth    ~100 Mbps peak   ~1 Gbps peak       ~1.98 Gbps real   10 Gbps peak

2007 TOP500 List. Clusters are the fastest-growing category of supercomputers in the TOP500 list:
- 406 clusters (81%) in the November 2007 list
- 130 clusters (23%) in the June 2003 list
- 80 clusters (16%) in the June 2002 list
- 33 clusters (6.6%) in the June 2001 list
Only 4% of the supercomputers in the November 2007 TOP500 list use Myrinet technology, while 54% use Gigabit Ethernet technology.

Introduction. A warehouse-scale computer (WSC) provides Internet services: search, social networking, online maps, video sharing, online shopping, email, cloud computing, and so on.
Differences from HPC "clusters": clusters have higher-performance processors and networks, and clusters emphasize thread-level parallelism whereas WSCs emphasize request-level parallelism.
Differences from datacenters: datacenters consolidate different machines and software into one location, and they emphasize virtual machines and hardware heterogeneity in order to serve varied customers.

Introduction. Important design factors for a WSC:
- Cost-performance: small savings add up.
- Energy efficiency: affects power distribution and cooling; measured as work per joule.
- Dependability via redundancy.
- Network I/O.
- Interactive and batch processing workloads.
- Ample computational parallelism is not a concern: most jobs are totally independent ("request-level parallelism").
- Operational costs count: power consumption is a primary, not secondary, constraint when designing the system.
- Scale and its opportunities and problems: WSCs can afford to build customized systems since they purchase in volume.

Google (query flow):
1. The user enters a query on a web form sent to the Google web server.
2. The web server sends the query to the Index Server cluster, which matches the query to documents.
3. The match is sent to the Doc Server cluster, which retrieves the documents to generate abstracts and cached copies.
4. The list, with abstracts, is displayed by the web server to the user, sorted (using a secret formula involving PageRank).

Google Requirements
- Google: a search engine that scales at Internet growth rates.
- Search engines need 24x7 availability.
- Google served 600M queries/day, or an average of 7,500 queries/s all day (old data).
- Google crawls the WWW and puts up a new index every 2 weeks (old data).
- Storage: 5.3 billion web pages, 950 million newsgroup messages, and 925 million images indexed, plus millions of videos (very old data).
- Response-time goal: < 0.5 s per search (old data).

Google (based on old data). Requests require a high amount of computation: a single query on Google (on average) reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. A peak request stream on Google requires an infrastructure comparable in size to the largest supercomputer installations. A typical Google data center: 15,000 PCs (Linux) and 30,000 disks, almost 3 petabytes! The Google application affords easy parallelization: different queries can run on different processors, and a single query can use multiple processors because the overall index is partitioned.

Programming Models and Workloads. Batch processing framework: MapReduce.
Map: applies a programmer-supplied function to each logical input record; runs on thousands of computers; produces a new set of key-value pairs as intermediate values.
Reduce: collapses the values using another programmer-supplied function.

Programming Models and Workloads. Example (word count):

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");   // produce list of all words

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);      // get integer from key-value pair
    Emit(AsString(result));
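To make the pseudocode above concrete, here is a minimal single-process Python sketch of the same word count (an illustration added to these notes, not part of Google's MapReduce library): map_fn and reduce_fn mirror the slide's functions, and a tiny driver stands in for the runtime that groups intermediate values by key.

from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    # key: a word, values: a list of counts as strings
    return str(sum(int(v) for v in values))

def run_word_count(documents):
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)      # group values by intermediate key
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

print(run_word_count({"doc1": "the quick brown fox", "doc2": "the lazy dog"}))
# {'the': '2', 'quick': '1', 'brown': '1', 'fox': '1', 'lazy': '1', 'dog': '1'}

In a real deployment the grouping and the calls to map_fn and reduce_fn are spread across thousands of machines, as the following slides describe.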

Distributed Word Count (figure): very big data is split, each split is counted in parallel, and the partial counts are merged into the final merged count.

Map + Reduce.
Map: accepts an input key/value pair and emits intermediate key/value pairs.
Reduce: accepts an intermediate key plus a list of values and emits an output key/value pair.
(Figure: very big data flows through map tasks, a partitioning function, and reduce tasks to produce the result.)
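The partitioning function in the figure decides which reduce task receives each intermediate key. The slides do not spell it out, but a common choice (assumed here) is hashing the key modulo the number of reduce tasks R, as in this small Python sketch:

# Hypothetical hash partitioner: every occurrence of a key is routed to the
# same reduce task, so all of its intermediate values can be merged there.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks   # Python's hash is stable within one run

R = 5
for key in ["apple", "banana", "apple", "cherry"]:
    print(key, "-> reduce task", partition(key, R))
# "apple" prints the same task number both times.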

Reverse Web-Link Graph.
Map: for each URL linking to a target, output <target, source> pairs.
Reduce: concatenate the list of all source URLs for a target; output <target, list(source)> pairs.
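As an added illustration (with hypothetical page names), the same job can be sketched in a few lines of Python; map_links emits one <target, source> pair per outgoing link and reduce_links collects the sources for each target:

from collections import defaultdict

def map_links(source_url, outgoing_links):
    for target in outgoing_links:
        yield (target, source_url)

def reduce_links(target, sources):
    return (target, list(sources))

pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
grouped = defaultdict(list)
for src, links in pages.items():
    for tgt, s in map_links(src, links):
        grouped[tgt].append(s)
print([reduce_links(t, s) for t, s in grouped.items()])
# [('b.html', ['a.html']), ('c.html', ['a.html', 'b.html'])]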

Execution. How is this distributed?
1. Partition the input key/value pairs into chunks and run map() tasks in parallel.
2. After all map()s are complete, consolidate all emitted values for each unique emitted key.
3. Partition the space of output map keys and run reduce() in parallel.
If a map() or reduce() task fails, re-execute it.
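A toy single-machine driver (an illustration, not the production implementation) makes these steps explicit; thread pools stand in for the worker machines:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_job(map_fn, reduce_fn, records, num_map_tasks=4, num_reduce_tasks=2):
    # 1. Partition the input key/value pairs into chunks.
    chunks = [records[i::num_map_tasks] for i in range(num_map_tasks)]

    def map_task(chunk):
        pairs = []
        for key, value in chunk:
            pairs.extend(map_fn(key, value))
        return pairs

    # 2. Run the map() tasks in parallel (threads stand in for machines).
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, chunks))

    # 3. Consolidate all emitted values for each unique emitted key.
    grouped = defaultdict(list)
    for output in map_outputs:
        for k, v in output:
            grouped[k].append(v)

    # 4. Partition the space of output keys and run reduce() in parallel.
    partitions = [{} for _ in range(num_reduce_tasks)]
    for k, values in grouped.items():
        partitions[hash(k) % num_reduce_tasks][k] = values

    def reduce_task(partition):
        return {k: reduce_fn(k, vs) for k, vs in partition.items()}

    with ThreadPoolExecutor() as pool:
        reduced = pool.map(reduce_task, partitions)
    return {k: v for part in reduced for k, v in part.items()}

def wc_map(key, value):
    return [(word, 1) for word in value.split()]

def wc_reduce(key, values):
    return sum(values)

print(run_job(wc_map, wc_reduce, [("d1", "a b a"), ("d2", "b c")]))
# {'a': 2, 'b': 2, 'c': 1}  (key order may differ)

A real runtime additionally handles scheduling, data movement across machines, and the fault tolerance discussed on later slides.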

Architecture (figure): the user submits a job to the job tracker on the master node; a task tracker on each slave node (1..N) manages that node's workers.

Task Granularity. Use fine-grained tasks: many more map tasks than machines. This minimizes the time for fault recovery, lets shuffling be pipelined with map execution, and gives better dynamic load balancing. A job often uses 200,000 map tasks and 5,000 reduce tasks running on 2,000 machines.

GFS. Goal: a global view that makes huge files available in the face of node failures.
Master node (meta server): centralized; indexes all chunks on the data servers.
Chunk servers (data servers): each file is split into contiguous chunks, typically 16-64 MB; each chunk is replicated (usually 2x or 3x), and replicas are kept in different racks where possible.
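The following Python fragment sketches the bookkeeping implied above (illustrative only; the chunk size, server names, and placement policy are assumptions, not GFS internals): a file is cut into fixed-size chunks and each chunk is assigned replicas on chunkservers in distinct racks when enough racks exist.

import random

CHUNK_SIZE = 64 * 2**20   # 64 MB, at the upper end of the 16-64 MB range
REPLICAS = 3              # typical replication factor

def split_into_chunks(file_size):
    # returns (offset, length) pairs covering the file
    return [(off, min(CHUNK_SIZE, file_size - off))
            for off in range(0, file_size, CHUNK_SIZE)]

def place_chunk(chunkservers_by_rack):
    # choose replicas on chunkservers in distinct racks when possible
    racks = random.sample(list(chunkservers_by_rack),
                          k=min(REPLICAS, len(chunkservers_by_rack)))
    return [random.choice(chunkservers_by_rack[r]) for r in racks]

servers = {"rack1": ["cs1", "cs2"], "rack2": ["cs3"], "rack3": ["cs4", "cs5"]}
for offset, length in split_into_chunks(200 * 2**20):   # a 200 MB file
    print(f"chunk at offset {offset}: {length} bytes ->", place_chunk(servers))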

GFS (figure): a client asks the GFS master where chunks live, then reads and writes chunks (C0, C1, C2, C3, C5, ...) that are replicated across chunkservers 1 through N.

Execution (figure).

Workflow (figure).

Fault Tolerance: the reactive way.
Worker failure: workers are periodically pinged by the master (heartbeat); no response means the worker has failed, and if a worker's processor fails, its tasks are reassigned to another worker.
Master failure: the master writes periodic checkpoints, so another master can be started from the last checkpointed state; if the master eventually dies, the job is aborted.
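A minimal sketch of the heartbeat logic, with hypothetical class and method names (the real master/worker protocol is RPC-based and more involved):

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a response => worker is considered failed

class Master:
    def __init__(self, workers):
        self.last_heartbeat = {w: time.time() for w in workers}
        self.tasks_of = {w: [] for w in workers}
        self.pending_tasks = []          # tasks waiting to be (re)assigned

    def on_heartbeat(self, worker):
        # called whenever a ping response arrives from a worker
        self.last_heartbeat[worker] = time.time()

    def check_workers(self):
        now = time.time()
        for worker, last in list(self.last_heartbeat.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                # no response = failed worker: reassign its tasks elsewhere
                self.pending_tasks.extend(self.tasks_of.pop(worker, []))
                del self.last_heartbeat[worker]

master = Master(["worker-1", "worker-2"])
master.tasks_of["worker-1"].append("map-17")
master.check_workers()   # nothing expires immediately after construction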

Fault Tolerance: the proactive way (redundant execution). The problem of "stragglers" (slow workers): other jobs may be consuming resources on the machine, bad disks with soft errors can transfer data very slowly, and weird things happen (e.g., processor caches disabled!). When the computation is almost done, reschedule the in-progress tasks; whenever either the primary or the backup execution finishes, mark the task as completed.
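A sketch of the backup-task idea with assumed names and a 95% completion threshold (the exact trigger condition is an implementation detail not given on the slide):

def maybe_schedule_backups(in_progress, total_tasks, completed, schedule_fn,
                           threshold=0.95):
    # near the end of the job, launch a duplicate of every in-progress task
    if completed / total_tasks >= threshold:
        for task in in_progress:
            if not task.get("backup_scheduled"):
                schedule_fn(task)
                task["backup_scheduled"] = True

def on_task_finished(task_id, completed_ids):
    # primary or backup, whichever reports first, marks the task completed;
    # the later copy is simply ignored
    if task_id not in completed_ids:
        completed_ids.add(task_id)

stragglers = [{"id": 7}, {"id": 9}]
maybe_schedule_backups(stragglers, total_tasks=100, completed=96,
                       schedule_fn=lambda t: print("backup launched for task", t["id"]))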

Fault Tolerance: input errors (bad records). Map/Reduce functions sometimes fail for particular inputs; the best solution is to debug and fix, but that is not always possible. On a segmentation fault, the worker sends a UDP packet to the master from its signal handler, including the sequence number of the record being processed. To skip bad records: if the master sees two failures for the same record, the next worker is told to skip that record.
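Reduced to a local data structure (the real mechanism uses UDP messages from the worker's signal handler), the master-side bookkeeping for skipping bad records could look like this:

from collections import Counter

failure_counts = Counter()   # (task_id, record_seqno) -> number of crashes
records_to_skip = set()

def report_failure(task_id, record_seqno):
    # a worker crashed while processing this record and told the master
    failure_counts[(task_id, record_seqno)] += 1
    if failure_counts[(task_id, record_seqno)] >= 2:
        records_to_skip.add((task_id, record_seqno))

def should_skip(task_id, record_seqno):
    # the next worker asks this before processing each record
    return (task_id, record_seqno) in records_to_skip

report_failure("map-42", 1337)
report_failure("map-42", 1337)
print(should_skip("map-42", 1337))   # True after two failures on the same record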

Programming Models and Workloads. The MapReduce runtime environment schedules map and reduce tasks to WSC nodes.
Availability: use replicas of data across different servers; use relaxed consistency, so there is no need for all replicas to always agree.
Workload demands often vary considerably.

Computer Architecture of WSC. WSCs often use a hierarchy of networks for interconnection. Each rack holds dozens of servers connected to a rack switch, and rack switches are uplinked to a switch higher in the hierarchy. The goal is to maximize the locality of communication relative to the rack.

Storage. Storage options: use disks inside the servers, or network-attached storage over InfiniBand. WSCs generally rely on local disks. The Google File System (GFS) uses local disks and maintains at least three replicas.

Array Switch. A switch that connects an array of racks. An array switch should have 10x the bisection bandwidth of a rack switch, but the cost of an n-port switch grows as n². Such switches often use content-addressable memory chips and FPGAs.

WSC Memory Hierarchy. Servers can access DRAM and disks on other servers using a NUMA-style interface.

Infrastructure and Costs of WSC. Location of a WSC: proximity to Internet backbones, electricity cost, property tax rates, and low risk from earthquakes, floods, and hurricanes. Power distribution (figure).

Infrastructure and Costs of WSC: Cooling. Air conditioning is used to cool the server room to 64°F to 71°F; keep the temperature higher (closer to 71°F). Cooling towers can also be used; the minimum temperature achievable is the "wet bulb temperature".

Infrastructure and Costs of WSC. The cooling system also uses water (evaporation and spills), e.g., 70,000 to 200,000 gallons per day for an 8 MW facility.
Power cost breakdown: chillers consume 30-50% of the power used by the IT equipment; air conditioning consumes 10-20% of the IT power, mostly due to fans.
How many servers can a WSC support? Each server's "nameplate power rating" gives its maximum power consumption; to get the actual figure, measure power under real workloads. Oversubscribe the cumulative server power by 40%, but monitor power closely.
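A back-of-the-envelope sketch of that sizing arithmetic, using the 8 MW facility mentioned above as the IT power budget and assumed per-server numbers (500 W nameplate, 300 W measured); the server figures are illustrative, not from the slides:

it_power_budget_w = 8_000_000   # 8 MW facility, as in the water-usage example
nameplate_w       = 500         # assumed nameplate rating per server
measured_w        = 300         # assumed power measured under real workloads
oversubscription  = 1.40        # oversubscribe cumulative server power by 40%

servers_by_nameplate   = it_power_budget_w / nameplate_w                     # 16,000
servers_oversubscribed = it_power_budget_w * oversubscription / nameplate_w  # 22,400
servers_by_measurement = it_power_budget_w / measured_w                      # ~26,667

print(int(servers_by_nameplate), int(servers_oversubscribed), int(servers_by_measurement))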

Measuring Efficiency of a WSC. Power utilization effectiveness (PUE) = total facility power / IT equipment power; the median PUE in a 2006 study was 1.69.
Performance: latency is an important metric because it is seen by users; a Bing study showed that users search less as response time increases.
Service Level Objectives (SLOs) / Service Level Agreements (SLAs), e.g., 99% of requests must complete in under 100 ms.
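A worked example of the PUE formula; the 8 MW and 13.5 MW figures below are assumed so that the ratio lands at the median 1.69 reported in the 2006 study:

it_equipment_power_mw   = 8.0    # assumed IT load
total_facility_power_mw = 13.5   # assumed total draw (IT + cooling + distribution losses)

pue = total_facility_power_mw / it_equipment_power_mw
print(round(pue, 2))   # 1.69, matching the median reported in the 2006 study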

Cost of a WSC. Capital expenditures (CAPEX): the cost to build a WSC. Operational expenditures (OPEX): the cost to operate a WSC.

Cloud Computing. WSCs offer economies of scale that cannot be achieved with a datacenter: a 5.7x reduction in storage costs, a 7.1x reduction in administrative costs, and a 7.3x reduction in networking costs. This has given rise to cloud services such as Amazon Web Services: "utility computing" based on open-source virtual machine and operating system software.