The Google Cluster Architecture


The Google Cluster Architecture Presented by Fatma Canan Pembe 2004800193

PURPOSE To give an overview of the computer architecture of Google, one of the most widely known and used search engines today, and of how it achieves such processing power under such a heavy workload

OUTLINE Introduction Cluster architectures Google architecture overview Serving a Google query Design principles of Google clusters Leveraging commodity parts Power problem Hardware-level characteristics Memory system Summary

INTRODUCTION Search engines
- A single query on Google (on average) requires a high amount of computation per request: it reads hundreds of megabytes of data and consumes tens of billions of CPU cycles
- A peak request stream on Google, thousands of queries per second, requires an infrastructure comparable in size to the largest supercomputer installations

INTRODUCTION (Cont.) Google combines more than 15,000 commodity-class PCs instead of a smaller number of high-end servers
- The most important factors that influenced the design: energy efficiency and price-performance ratio
- The Google application affords easy parallelization: different queries can run on different processors, and a single query can use multiple processors because the overall index is partitioned

CLUSTER ARCHITECTURES A collection of independent computers using a switched network to provide a common service
- Many mainframe applications (databases, file servers, Web servers, simulations, etc.) run on more "loosely coupled" machines rather than shared-memory machines
- They often need to be highly available, requiring fault tolerance and repairability
- They often need to scale

DISADVANTAGES OF CLUSTERS
- The cost of administering a cluster of N machines is like administering N independent machines, whereas administering a shared-address-space N-processor multiprocessor is like administering 1 big machine
- Clusters are usually connected via the I/O bus, whereas multiprocessors are usually connected on the memory bus
- A cluster of N machines has N independent memories and N copies of the OS, while a shared-address-space multiprocessor allows 1 program to use almost all of the memory

ADVANTAGES OF CLUSTERS
- Error isolation: separate address spaces limit the contamination caused by an error
- Repair: easier to replace a machine without bringing down the system than in a shared-memory multiprocessor
- Scale: easier to expand the system without bringing down the application that runs on top of the cluster
- Cost: a large-scale machine has low volume, so there are fewer machines over which to spread development costs; clusters instead leverage high-volume off-the-shelf switches and computers
- Amazon, AOL, Google, Hotmail, and Yahoo rely on clusters of PCs to provide services used by millions of people every day

GOOGLE ARCHITECTURE OVERVIEW
- Reliability is provided at the software level rather than through server-class hardware, so that commodity PCs can be used to build a cluster at a low price
- Design for the best aggregate throughput rather than peak server response time
- Goal: building a reliable computing infrastructure from clusters of unreliable commodity PCs

SERVING A GOOGLE QUERY When a user enters a query, e.g. www.google.com/search?q=ieee+society, the user's browser performs a Domain Name System (DNS) lookup to map the name to a particular IP address; multiple Google clusters are distributed worldwide, each with a few thousand machines, to handle query traffic

SERVING A GOOGLE QUERY (Cont.)
- The geographically distributed setup protects against catastrophic failures
- A DNS-based load-balancing system selects a cluster according to the user's geographic proximity and the available capacity at the various clusters (see the sketch below)
- The user's browser sends an HTTP request to one of the clusters; thereafter, processing is local to that cluster
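To make the DNS-based selection concrete, here is a minimal Python sketch of one plausible policy: pick the nearest cluster that still has spare serving capacity. The cluster names, round-trip times, and capacity numbers are invented for illustration; the actual policy is not specified in the slides.

```python
# Minimal sketch of DNS-based cluster selection (illustrative only).
# Cluster names, distances, and capacities are hypothetical.

def pick_cluster(clusters, user_region):
    """Pick the nearest cluster that still has spare serving capacity."""
    candidates = [c for c in clusters if c["spare_qps"] > 0]
    if not candidates:
        raise RuntimeError("no cluster has available capacity")
    # Prefer low round-trip time to the user, break ties by spare capacity.
    return min(candidates,
               key=lambda c: (c["rtt_ms"][user_region], -c["spare_qps"]))

clusters = [
    {"name": "us-east", "spare_qps": 1200, "rtt_ms": {"EU": 90, "US": 20}},
    {"name": "eu-west", "spare_qps": 300,  "rtt_ms": {"EU": 15, "US": 95}},
]
print(pick_cluster(clusters, "EU")["name"])   # -> eu-west
```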

SERVING A GOOGLE QUERY (Cont.) A hardware-based load balancer in each cluster monitors the available Google Web Servers (GWSs) and performs local load balancing of requests; a GWS machine coordinates the query execution and returns the results as an HTML response


SERVING A GOOGLE QUERY (Cont.) Query execution phases:
1. The index servers determine the relevant documents by consulting an inverted index
- Challenging due to the large amount of data: raw documents -> several tens of terabytes; inverted index -> many terabytes
- Fortunately, the search is highly parallelizable by dividing the index into pieces (index shards)
- Each shard is served by a pool of machines, improving reliability; a load balancer distributes requests within the pool (see the sketch below)
2. The document servers determine the actual URLs and query-specific summaries of the found documents
- Again, the documents are divided into shards
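A minimal sketch of the index-shard idea described above: the inverted index is split by document into shards, each shard is searched independently (in reality by a pool of index servers), and the partial results are merged by score. The toy postings lists, scoring, and shard count are made up for illustration.

```python
# Minimal sketch of a sharded inverted index lookup (illustrative only).
# Real index servers hold shards of many gigabytes; these toy postings
# lists and the simple AND/merge logic are made up for the example.

# Each shard maps a term to the (doc_id, score) postings for the
# documents assigned to that shard.
shards = [
    {"ieee": [(1, 0.9), (4, 0.3)], "society": [(1, 0.5)]},      # shard 0
    {"ieee": [(7, 0.8)], "society": [(7, 0.7), (9, 0.2)]},      # shard 1
]

def search_shard(shard, terms):
    """Return (doc_id, total_score) for docs containing all query terms."""
    per_term = [dict(shard.get(t, [])) for t in terms]
    common = set.intersection(*(set(p) for p in per_term)) if per_term else set()
    return [(doc, sum(p[doc] for p in per_term)) for doc in common]

def search(terms):
    # Each shard can be searched by a separate pool of machines in parallel;
    # here we simply loop over shards and merge the partial results by score.
    hits = []
    for shard in shards:
        hits.extend(search_shard(shard, terms))
    return sorted(hits, key=lambda h: h[1], reverse=True)

print(search(["ieee", "society"]))   # -> [(7, 1.5), (1, 1.4)]
```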

DESIGN PRINCIPLES OF GOOGLE CLUSTERS
- Software-level reliability: no fault-tolerant hardware features such as redundant power supplies or a redundant array of inexpensive disks (RAID); instead, tolerate failures in software
- Use replication for better request throughput and availability
- Price/performance beats peak performance: use the CPUs giving the best performance per unit price, not the CPUs with the best absolute performance
- Using commodity PCs reduces the cost of computation

FIRST GOOGLE SERVER RACK In the Computer History Museum (from 1999); each tray contains eight 22GB hard drives and one power supply

LEVERAGING COMMODITY PARTS Google's racks consist of 40 to 80 x86-based servers
- Server components are similar to those of a mid-range desktop PC, except for larger disk drives
- Servers range from single-processor 533-MHz Intel Celeron based servers to dual 1.4-GHz Intel Pentium III servers
- Servers on each rack are interconnected via 100 Mbps Ethernet; all racks are interconnected via a gigabit switch

LEVERAGING COMMODITY PARTS (Cont.) Selection criterion: cost per query = [capital expense (with depreciation) + operating costs (hosting, system administration, repairs)] / performance
Inexpensive PC-based clusters vs. high-end multiprocessor servers (rough comparison below):
- Rack -> 176 2-GHz Xeon CPUs + 176 Gbytes RAM + 7 Tbytes of disk space = $278,000
- Server -> 8 2-GHz Xeon CPUs + 64 Gbytes RAM + 8 Tbytes of disk space = $758,000
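A back-of-the-envelope check on those figures, using only the listed capital costs; the slides' full metric also includes depreciation and operating costs, which are not broken out here.

```python
# Back-of-the-envelope comparison of the two configurations on the slide.
# Only the listed capital costs are compared; depreciation and operating
# costs (hosting, administration, repairs) are not included.

rack   = {"cpus": 176, "ram_gb": 176, "disk_tb": 7, "price": 278_000}
server = {"cpus":   8, "ram_gb":  64, "disk_tb": 8, "price": 758_000}

print(f"price ratio (server/rack): {server['price'] / rack['price']:.1f}x")
print(f"CPU ratio   (rack/server): {rack['cpus'] / server['cpus']:.0f}x")
print(f"RAM ratio   (rack/server): {rack['ram_gb'] / server['ram_gb']:.1f}x")
print(f"$ per CPU: rack={rack['price'] / rack['cpus']:.0f}  "
      f"server={server['price'] / server['cpus']:.0f}")
# -> the server costs ~2.7x more while offering 22x fewer CPUs and ~2.8x less RAM
```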

LEVERAGING COMMODITY PARTS (Cont.) The multiprocessor server is about 3 times more expensive, yet has 22 times fewer CPUs and about 3 times less RAM; the cost premium of the high-end server is due to its higher interconnect bandwidth and reliability, which are not necessary in Google's highly redundant architecture

THE POWER PROBLEM A mid-range server with dual 1.4-GHz Pentium III processors draws about 90 W of DC power:
- 55 W for the two CPUs
- 10 W for a disk drive
- 25 W for DRAM and motherboard
With the typical efficiency of an ATX power supply (about 75%), this means 120 W of AC power per server, or roughly 10 kW per rack

THE POWER PROBLEM (Cont.) A rack fits in 25 ft2 of space, giving a power density of 400 W/ft2 (700 W/ft2 with higher-end processors); the typical power density for commercial data centers is between 70 and 150 W/ft2, much lower than that required for PC clusters, so special cooling or additional space is needed to bring the power density down to a tolerable level
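The power arithmetic on these two slides can be reproduced directly; the only added assumption is taking 80 servers per rack, the high end of the 40-80 range mentioned earlier.

```python
# Reproducing the slides' power arithmetic. All figures come from the
# slides except the 80-servers-per-rack count, which is the high end of
# the 40-80 range and is assumed here.

cpu_w, disk_w, dram_mb_w = 55, 10, 25           # DC power per server
dc_power = cpu_w + disk_w + dram_mb_w           # 90 W DC
psu_efficiency = 0.75                           # typical ATX supply
ac_power = dc_power / psu_efficiency            # = 120 W AC per server

servers_per_rack = 80
rack_power_w = ac_power * servers_per_rack      # ~9.6 kW, "roughly 10 kW"

rack_area_ft2 = 25
density = rack_power_w / rack_area_ft2          # ~384 W/ft2, roughly 400 W/ft2
print(f"{ac_power:.0f} W/server, {rack_power_w / 1000:.1f} kW/rack, "
      f"{density:.0f} W/ft2 (data centers: 70-150 W/ft2)")
```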

THE POWER PROBLEM (Cont.) Reduced-power servers can be used, but they must come without a performance penalty and must not be considerably more expensive

HARDWARE-LEVEL CHARACTERISTICS The architectural characteristics of the Google query-serving application were examined to determine which hardware platforms give the best price/performance; the index server most heavily impacts the overall price/performance

INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
Characteristic (on a 1-GHz dual-processor Pentium III system)   Value
Cycles per instruction                                          1.1
Ratios (percentage)
  Branch mispredict                                             5.0
  Level 1 instruction miss*                                     0.4
  Level 1 data miss*                                            0.7
  Level 2 miss*                                                 0.3
  Instruction TLB miss*                                         0.04
  Data TLB miss*
* Cache and TLB ratios are per instructions retired

HARDWARE-LEVEL CHARACTERISTICS Moderately high CPI, although the Pentium III is capable of issuing 3 instructions/cycle
- Reason: a significant number of difficult-to-predict branches, from the traversal of dynamic data structures and data-dependent control flow
- On the newer Pentium 4 processor, the CPI for the same workload is nearly twice as high with approximately the same branch prediction performance, even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic
- The Google workload does not contain much exploitable instruction-level parallelism (ILP)
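As a rough, hypothetical illustration of why mispredicted branches hurt, the estimate below assumes the 5% mispredict figure applies per branch, that branches are 15-20% of instructions, and that a mispredict costs 10-15 cycles of pipeline refill on a Pentium III-class core; none of these latter assumptions come from the slides.

```python
# Rough estimate of the CPI cost of branch mispredicts alone.
# Assumptions (not from the slides): mispredicts are ~5% of branches,
# branches are ~15-20% of instructions, and a mispredict costs ~10-15 cycles.

mispredict_rate = 0.05
for branch_fraction in (0.15, 0.20):
    for penalty in (10, 15):
        extra_cpi = branch_fraction * mispredict_rate * penalty
        print(f"branches={branch_fraction:.0%} penalty={penalty:2d} cycles "
              f"-> ~{extra_cpi:.2f} extra CPI")
# Together with cache and TLB misses on pointer-chasing index traversals,
# this helps explain a measured CPI of 1.1 on a core that can issue 3
# instructions per cycle (ideal CPI ~0.33).
```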

HARDWARE-LEVEL CHARACTERISTICS (Cont.) To exploit parallelism:
- The trivially parallelizable computation in the processing of queries requires little communication; this is already exploited at the cluster level using a large number of inexpensive nodes
- Thread-level parallelism at the microarchitecture level: simultaneous multithreading (SMT) systems and chip multiprocessor (CMP) systems (see the sketch below)
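A minimal sketch of the request-level (thread-level) parallelism being exploited: independent queries need essentially no communication, so they can simply be fanned out across hardware threads or cores. The handle_query function is a placeholder, not Google's code.

```python
# Minimal sketch of the request-level parallelism the slides describe:
# independent queries need no coordination, so they can be fanned out
# across hardware threads (or, at cluster level, across machines).
from concurrent.futures import ThreadPoolExecutor

def handle_query(query):
    # Placeholder for the real work: consult index shards, fetch documents,
    # format results. Any per-query computation would go here.
    return f"results for '{query}'"

queries = ["ieee society", "cluster architecture", "commodity pc"]

# With SMT or CMP hardware, each worker thread can occupy a separate
# hardware context; there is essentially no inter-query communication.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(handle_query, queries):
        print(result)
```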

HARDWARE-LEVEL CHARACTERISTICS (Cont.) Simultaneous multithreading (SMT): experiments with a dual-context (SMT) Intel Xeon processor showed more than 30% performance improvement over a single-context setup, which is at the upper bound of the improvements reported by Intel for their SMT implementation

HARDWARE-LEVEL CHARACTERISTICS (Cont.) Chip multiprocessor (CMP) architectures, such as Hydra and Piranha
- Multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core
- The penalties of in-order execution are minor because of the little ILP in the Google application, and shorter pipelines reduce or eliminate branch mispredict penalties
- The available thread-level parallelism can allow near-linear speedup with the number of cores
- A shared L2 cache of reasonable size can speed up inter-processor communication

MEMORY SYSTEM [Table: main memory system performance parameters]
- Good performance for the instruction cache and instruction translation look-aside buffer, due to the relatively small inner-loop code size
- Index data blocks have no temporal locality, due to the size of the index data and the unpredictability of access patterns, but they benefit from spatial locality; hardware prefetching or larger cache lines can be used
- Good overall cache hit ratios (even for relatively modest cache sizes)


MEMORY SYSTEM (Cont.) Memory bandwidth does not appear to be a bottleneck; a suitable memory system for this load has a relatively modest sized L2 cache, short L2 cache and memory latencies, and longer (perhaps 128-byte) cache lines
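A small illustration of why longer cache lines help: a streaming scan over index data with no temporal locality takes roughly one miss per cache line touched, so doubling the line size halves the misses. The 4-byte posting size and the line sizes are assumptions for the example.

```python
# Why longer cache lines help a sequential scan over index data: with no
# temporal locality, roughly one miss is taken per cache line touched.
# The 4-byte posting size and the line sizes are assumptions for illustration.

posting_bytes = 4
for line_bytes in (32, 64, 128):
    postings_per_line = line_bytes // posting_bytes
    miss_per_posting = 1 / postings_per_line
    print(f"{line_bytes:3d}-byte lines -> 1 miss per {postings_per_line} "
          f"postings ({miss_per_posting:.3f} misses/posting)")
# Doubling the line size halves the miss count for a streaming scan, which
# is why ~128-byte lines or hardware prefetching are attractive here.
```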

SUMMARY The Google infrastructure: a massively large cluster of inexpensive machines, versus a smaller number of large-scale shared-memory machines
- Large shared-memory machines are useful when the computation-to-communication ratio is low, when communication patterns or data partitioning are dynamic or hard to predict, or when the total cost of ownership is much greater than the hardware costs (due to management overhead and software licensing prices); in those cases they justify their high prices
- None of these requirements apply at Google

SUMMARY (Cont.) Google
- Partitions index data and computation to minimize communication and to evenly balance the load across servers
- Produces its software in-house
- Minimizes system management overhead through extensive automation and monitoring, so hardware costs become the important factor
- Deploys many small multiprocessors, so faults affect smaller pieces of the system; large-scale shared-memory machines do not handle individual hardware component or software failures well enough, with most fault types causing a full system crash

SUMMARY (Cont.) It appears there are few applications like Google that require many thousands of servers and petabytes of storage; however, many applications share its key characteristics: a focus on price/performance and the ability to run on servers without private state (so servers can be replicated), allowing a PC-based cluster architecture, e.g. high-volume Web servers and application servers that are computationally intensive but essentially stateless

SUMMARY (Cont.) At Google's scale, some limits of massive server parallelism become apparent, e.g. the limited cooling capacity of commercial data centers and the less-than-optimal fit of current CPUs for throughput-oriented applications; nevertheless, using inexpensive PCs has increased the amount of computation that can be afforded per query, thus helping to improve the search experience of the users

THANK YOU