1 Lessons from Giant-Scale Services
IEEE Internet Computing, Vol. 5, No. 4, July/August 2001
Eric A. Brewer, University of California, Berkeley, and Inktomi Corporation
Presentation: Ηλίας Τσιγαρίδας (Μ484)

2 Examples of Giant-Scale Services
- AOL
- Microsoft Network
- Yahoo
- eBay
- CNN
- Instant messaging
- Napster
- Many more…
The demand: they must always be available, despite their scale, growth rate, rapid evolution of content and features, etc.

3 Article Characteristics
Characteristics:
- An "experience" article
- No literature references
- Principles and approaches, not a quantitative evaluation
The reasons:
- Focus on high-level design
- A new area
- The proprietary nature of the information

4 Article Scope
- Look at the basic model of giant-scale services
- Focus on the challenges of:
  - High availability
  - Evolution
  - Growth
- Principles for the above that simplify the design of large systems

5 Basic Model (General)
"Infrastructure services": Internet-based systems that provide instant messaging, wireless services, and so on.

6 Basic Model (General)
We discuss:
- Single-site, single-owner, well-connected clusters
- Perhaps part of a larger service
We do not discuss:
- Wide-area issues: network partitioning, low or discontinuous bandwidth, multiple administrative domains
- Service monitoring
- Network QoS
- Security
- Logging and log analysis
- DBMSs

7 Basic Model (General)
We focus on:
- High availability
- Replication
- Graceful degradation
- Disaster tolerance
- Online evolution
The scope is bridging the gap between the basic building blocks of giant-scale services and the real-world scalability and availability they require.

8 Basic Model (Advantages)
- Access anywhere, anytime
- Availability via multiple devices
- Groupware support
- Lower overall cost
- Simplified service updates

9 Basic Model (Advantages): Access anywhere, anytime
- The infrastructure is ubiquitous
- You can access the service from home, work, an airport, and so on

10 Basic Model (Advantages): Availability via multiple devices
- The infrastructure handles the processing (most of it, at least)
- Users access the services via set-top boxes, network computers, smart phones, and so on
- In this way, the devices can offer more functionality for a given cost and battery life

11 Basic Model (Advantages): Groupware support
- Centralizing data from many users enables groupware applications such as calendars, teleconferencing systems, and so on

12 Basic Model (Advantages): Lower overall cost
- Overall cost is hard to measure, but infrastructure services have an advantage over designs based on stand-alone devices
- High utilization
- Centralized administration reduces the cost, though this is harder to quantify

13 Basic Model (Advantages): Simplified service updates
- Updates without physical distribution
- The most powerful long-term advantage

14 Basic Model (Components)

15 Basic Model (Assumptions)
- The service provider has limited control over the clients and the IP network
- Queries drive the service
- Read-only queries greatly outnumber update queries
- Giant-scale services use CLUSTERS

16 Basic Model (Components)
- Clients, such as Web browsers, initiate the queries to the service.
- The IP network, either the public Internet or a private network, provides access to the service.
- The load manager provides indirection between the service's external name and the servers' physical names (IP addresses) and balances load. Proxies or firewalls may sit before the load manager.
- Servers combine CPU, memory, and disks into an easy-to-replicate unit.
- The persistent data store is a replicated or partitioned database spread across the servers, optionally with external DBMSs or RAID storage.
- The backplane (optional) handles inter-server traffic.

17 Basic Model (Load Management)
- Round-robin DNS
- "Layer-4" switches: understand TCP and port numbers
- "Layer-7" switches: parse URLs
- Custom "front-end" nodes: act as service-specific "layer-7" routers
- Include the clients in the load balancing, e.g., via an alternative DNS or name server (smart clients)
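As a rough illustration of the difference between the first and third mechanisms above, here is a minimal sketch of round-robin selection versus URL-based ("layer-7") routing. The server pools and path prefix are made up for the example and are not from the paper.

```python
from itertools import cycle

# Hypothetical server pools; the names are illustrative, not from the paper.
WEB_SERVERS = ["web1", "web2", "web3"]
IMAGE_SERVERS = ["img1", "img2"]

_rr = cycle(WEB_SERVERS)

def pick_round_robin() -> str:
    """Round-robin DNS style: rotate through the pool, ignoring request content."""
    return next(_rr)

def pick_layer7(url_path: str) -> str:
    """'Layer-7' style: inspect the URL and route by content."""
    if url_path.startswith("/images/"):
        # Requests for one content type are sent to a dedicated pool.
        return IMAGE_SERVERS[hash(url_path) % len(IMAGE_SERVERS)]
    return pick_round_robin()

print(pick_layer7("/index.html"))    # rotates through web1, web2, web3
print(pick_layer7("/images/a.png"))  # always lands in the image pool
```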

18 Basic Model (Load Management)
Two opposite approaches:
- A simple Web farm
- A search-engine cluster

19 Basic Model (Load Management) Simple Web Farm

20 Basic Model (Load Management) Search engine cluster

21 High Availability (General)
- Expected to be as dependable as telephone, rail, or water systems
- Features:
  - Extreme symmetry
  - No people
  - Few cables
  - No external disks
  - No monitors
- Inktomi, in addition:
  - Manages the cluster offline
  - Limits temperature and power variations

22 High Availability (metrics)
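For reference, the availability metrics the paper defines, which the DQ discussion and the conclusions rely on, are:

```latex
\[
\text{uptime} = \frac{\text{MTBF} - \text{MTTR}}{\text{MTBF}}, \qquad
\text{yield} = \frac{\text{queries completed}}{\text{queries offered}}, \qquad
\text{harvest} = \frac{\text{data available}}{\text{complete data}}
\]
```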

23 High Availability (DQ principle)
- The system's overall capacity has a particular physical bottleneck, e.g., total I/O bandwidth or total seeks per second
- The DQ value: the total amount of data the system can move per second (data per query × queries per second)
- It is measurable and tunable, e.g., by adding nodes or through software optimization, OR it is reduced by faults

24 High Availability (DQ principle)
- Focus on relative DQ values, not on absolute ones
- Define the DQ value for your own system
- Normally the DQ value scales linearly with the number of nodes
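In shorthand (notation mine, following the definitions above), the principle and its scaling can be written as:

```latex
\[
DQ \;=\; \underbrace{D}_{\text{data per query}} \times \underbrace{Q}_{\text{queries per second}}
\;\le\; \text{bottleneck bandwidth},
\qquad DQ_{\text{cluster}} \;\approx\; n \cdot DQ_{\text{node}}
\]
```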

25 High Availability (DQ principle)
- Analyzing the impact of faults: focus on how the DQ reduction affects the three metrics
- Applies only to data-intensive sites

26 High Availability: Replication vs. Partitioning
Example: a 2-node cluster with one node down.
- Replication: 100% harvest, 50% yield; DQ drops by 50% (D maintained, Q reduced)
- Partitioning: 50% harvest, 100% yield; DQ drops by 50% (D reduced, Q maintained)
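A toy version of this 2-node comparison in code (my illustration, not from the paper; it assumes the surviving nodes can absorb the full load in the partitioned case):

```python
def replicated(total_nodes: int, nodes_up: int):
    """Every node holds a full copy: harvest stays at 100%,
    yield (query capacity) drops with the lost nodes."""
    harvest = 1.0 if nodes_up > 0 else 0.0
    yld = nodes_up / total_nodes
    return harvest, yld

def partitioned(total_nodes: int, nodes_up: int):
    """Data is split across nodes with no copies: queries still complete,
    but part of the data is unreachable."""
    harvest = nodes_up / total_nodes
    yld = 1.0 if nodes_up > 0 else 0.0   # assumes survivors handle the load
    return harvest, yld

# 2-node cluster, one node down: (1.0, 0.5) vs. (0.5, 1.0); DQ halves either way.
print(replicated(2, 1), partitioned(2, 1))
```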

27 High Availability Replication

28 High Availability: Replication vs. Partitioning
- Replication wins if the bandwidth (DQ) is the same: the extra cost is in the bandwidth, not in the disks
- Easier recovery
- We might also use partial replication and randomization

29 High Availability: Graceful degradation
We cannot avoid saturation, because:
- The peak-to-average load ratio is typically 1.6:1 to 6:1, and it is expensive to build capacity above the (normal) peak
- Single-event bursts occur (e.g., online ticket sales for special events)
- Faults such as power failures or natural disasters substantially reduce the overall DQ, and the remaining nodes become saturated
So we MUST have mechanisms for degradation.

30 High Availability: Graceful degradation
The DQ principle gives us the options:
- Limit Q (capacity) to maintain D: focus on harvest through admission control (AC), which reduces Q
- Reduce D and increase Q: reduce D on dynamic databases, e.g., cut the effective database size in half (a newer approach)
- Or a combination of both

31 High Availability: Graceful degradation
More sophisticated techniques:
- Cost-based AC: estimate each query's cost, reduce the data touched per query, and thereby increase Q
- Priority- (or value-) based AC: drop low-valued queries, e.g., "execute a stock trade within 60 seconds or the user pays no commission"
- Reduced data freshness: reducing freshness reduces the work per query, increasing yield at the expense of harvest
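A minimal sketch of how priority- and cost-based admission control might shed load against a fixed DQ budget; the budget, the cost cap, and the query fields are invented for the example and are not from the paper:

```python
import random

DQ_BUDGET = 1000   # total data the system can move per unit time (arbitrary units)

def admit(queries, saturated: bool):
    """Admit queries until the DQ budget is spent.
    Under saturation, serve high-value queries first (priority-based AC)
    and cap the data an expensive query may touch (cost-based AC)."""
    admitted, spent = [], 0
    for q in sorted(queries, key=lambda q: q["value"], reverse=True):
        cost = q["data_per_query"]
        if saturated and cost > 50:
            cost = 50        # degrade harvest: touch less data per query
        if spent + cost > DQ_BUDGET:
            continue         # reject: yield drops gracefully instead of crashing
        spent += cost
        admitted.append(q)
    return admitted

queries = [{"value": random.random(), "data_per_query": random.choice([10, 50, 200])}
           for _ in range(100)]
print(len(admit(queries, saturated=True)), "of", len(queries), "admitted")
```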

32 High Availability: Disaster Tolerance
- A combination of managing replica groups and graceful degradation
- How many locations? How many replicas at each location?
- Load management: a "layer-4" switch does not help with the loss of a whole cluster; smart clients are the solution

33 Online Evolution & Growth
- We must plan for continuous growth and frequent functionality updates
- Maintenance and upgrades are controlled failures
- The total DQ loss of an upgrade is ΔDQ = n · u · (average DQ per node) = DQ · u, where n is the number of nodes and u is the time each node needs for the online upgrade

34 Online Evolution & Growth
Three approaches, illustrated for a 4-node cluster:
- Fast reboot: upgrade all nodes at once
- Rolling upgrade: upgrade one node at a time
- Big flip: upgrade half the cluster at a time
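A toy DQ-loss comparison of the three approaches for the 4-node example (my illustration; the per-node DQ and upgrade time are arbitrary units, and the point is the paper's observation that the total loss DQ · u is the same, only its distribution over time differs):

```python
NODES = 4
DQ_PER_NODE = 1.0
U = 1.0   # time u needed to upgrade one node (arbitrary units)

def dq_loss_fast_reboot():
    # All nodes down together for u: the whole cluster's DQ is lost for u.
    return NODES * DQ_PER_NODE * U

def dq_loss_rolling_upgrade():
    # One node down at a time, for n rounds of length u each.
    return sum(1 * DQ_PER_NODE * U for _ in range(NODES))

def dq_loss_big_flip():
    # Half the cluster down for u, done twice.
    return 2 * (NODES // 2) * DQ_PER_NODE * U

# All three lose the same total DQ (= DQ * u); they differ in whether the loss
# shows up as a short full outage, reduced capacity, or reduced harvest over time.
print(dq_loss_fast_reboot(), dq_loss_rolling_upgrade(), dq_loss_big_flip())
```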

35 Conclusions: The basic lessons learned
- Get the basics right: a professional data center, layer-7 switches, symmetry
- Decide on your availability metrics: everyone must agree on the goals; harvest and yield are more telling than uptime
- Focus on MTTR at least as much as on MTBF: improving MTTR is easier and has the same impact on availability
- Understand load redirection during faults: data replication alone is insufficient; you also need excess DQ

36 Conclusions: The basic lessons learned (continued)
- Graceful degradation is a critical part: use intelligent admission control and dynamic database reduction
- Use DQ analysis on all upgrades: it is essentially capacity planning; automate upgrades as much as possible, and have a fast, simple way to return to the old version

37 Final Statement Smart clients could simplify all of the above