Systems Issues for Scalable, Fault Tolerant Internet Services
Yatin Chawathe, Eric Brewer
To appear in Middleware ’98



Motivation
Proliferation of network-based services
Two critical issues must be addressed by Internet services:
– System scalability: incremental and linear scalability
– Availability and fault tolerance: 24x7 operation

A Reusable SNS Framework
Clusters of workstations are ideal for Internet services [FGC+97]
But clusters are difficult to manage:
– To ensure linear scalability, the service must distribute load across the cluster
– The service must grow the cluster with increasing load
– Partial failures within a cluster complicate fault management
Isolate the common requirements of cluster-based Internet applications into a reusable substrate: the Scalable Network Services (SNS) framework

Architecture
[Figure labels: SNS Manager, Internal Network, Worker, Worker Driver, Outside World]

Workers
Workers are grouped into classes. Within a class, workers are identical
Workers can receive tasks from the outside world, or from other workers
Workers have a simple serial interface for tasks:
– The originator sends a task to the consumer by specifying the class and inputs for the task
– Tasks are atomic and restartable
– Worker Drivers present a narrow interface between the SNS substrate and the worker application
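A minimal Python sketch of the task interface described on this slide. The names `Task`, `WorkerDriver`, and `handle` are hypothetical and are not taken from the SNS code; the sketch only illustrates that an originator names a worker class and inputs, and that a task with no side effects can safely be run again.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: the originator names a worker class and inputs."""
    worker_class: str
    inputs: tuple


class WorkerDriver:
    """Hypothetical narrow interface between the substrate and one worker application."""

    def __init__(self, worker_class: str, app: Callable[..., object]):
        self.worker_class = worker_class
        self._app = app  # the worker application, assumed to be a pure function of its inputs

    def handle(self, task: Task) -> object:
        if task.worker_class != self.worker_class:
            raise ValueError("task sent to a worker of the wrong class")
        # Tasks are handled one at a time; because the application keeps no
        # state between tasks, running the same task again is safe.
        return self._app(*task.inputs)


driver = WorkerDriver("uppercase", lambda text: text.upper())
task = Task("uppercase", ("hello",))
first = driver.handle(task)
second = driver.handle(task)  # restarting the same task gives the same result
```

Keeping the worker application behind a single `handle` call is what lets the substrate reroute or repeat a task without the application knowing.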

Centralized SNS Manager
SNS Manager is intentionally centralized
– Makes it easier to reason about and implement the various policies
– “All” we need to do is ensure the fault tolerance of the manager, and make sure it is not a performance bottleneck
Three key functions:
– Resource location
– Load balancing and scalability
– Fault tolerance

Resource Location
[Figure labels: Worker, Worker Driver, SNS Manager; Multicast Beacons, Register, Find, Found, Persistent Connection]

Load Balancing
Load measurement and reporting
– Each worker examines incoming requests and estimates the “load” that would be generated
– Simplest load metric: queue length at workers
– Workers periodically report their current load to the SNS Manager
– SNS Manager maintains load history and aggregates load reports from all workers
– Load reports are piggybacked on manager beacons to the rest of the system
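The reporting path above can be sketched as a small manager-side table. This is an illustration under stated assumptions: `LoadTable`, the fixed-length history window, and averaging as the aggregation rule are all hypothetical, since the slide does not say how history is kept or combined.

```python
from collections import defaultdict, deque


class LoadTable:
    """Hypothetical manager-side table of recent load reports per worker."""

    def __init__(self, history: int = 4):
        # Keep only the most recent `history` reports for each worker.
        self._reports = defaultdict(lambda: deque(maxlen=history))

    def report(self, worker_id: str, queue_length: int) -> None:
        # Queue length is the simplest load metric named on the slide.
        self._reports[worker_id].append(queue_length)

    def aggregate(self) -> dict:
        # Assumed aggregation: the mean of the retained reports for each worker.
        return {w: sum(q) / len(q) for w, q in self._reports.items()}


table = LoadTable(history=2)
for length in (10, 4, 6):
    table.report("w1", length)
table.report("w2", 3)
summary = table.aggregate()  # the kind of summary a beacon could carry
```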

Load Balancing
Each worker makes local load balancing decisions
Use lottery scheduling: the number of tickets is inversely proportional to worker load
Stale load reports can cause oscillations
– Use a correction factor based on the number of requests that were sent since the last load report
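A minimal sketch of lottery scheduling with a staleness correction, in Python. The exact ticket formula is an assumption: the slide says only that tickets are inversely proportional to load and that the correction depends on requests sent since the last report. The names `tickets` and `pick_worker` are hypothetical.

```python
import random


def tickets(reported_load: float, sent_since_report: int) -> float:
    """Ticket count inversely proportional to the corrected load estimate."""
    # Assumptions of this sketch: each request sent since the last report
    # counts as one unit of load, and +1 avoids division by zero.
    return 1.0 / (1.0 + reported_load + sent_since_report)


def pick_worker(loads: dict, sent: dict, rng: random.Random) -> str:
    """Hold a lottery in which a worker's chance is proportional to its tickets."""
    workers = list(loads)
    weights = [tickets(loads[w], sent.get(w, 0)) for w in workers]
    winner = rng.choices(workers, weights=weights, k=1)[0]
    sent[winner] = sent.get(winner, 0) + 1  # remembered until the next report
    return winner


rng = random.Random(0)
loads = {"w1": 0.0, "w2": 9.0}  # last reported queue lengths
sent = {}
picks = [pick_worker(loads, sent, rng) for _ in range(1000)]
```

Without the `sent` correction, every originator would keep favouring `w1` until its next report arrived; with it, `w1` loses tickets each time it is chosen, which damps the oscillation the slide warns about.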

Auto-launch for Scalability
Worker replication to handle short traffic bursts
– Multiple workers handle requests in parallel
– If load on a class of workers gets too high, the SNS Manager launches a new one
Overflow pool for long bursts
– Non-dedicated set of machines (e.g. users’ desktop machines)
– When all dedicated nodes are exhausted, harness an overflow node; release it after the burst subsides
– Useful for incremental scalability
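The placement rule above can be sketched as follows. The function name, the single load threshold, and the list-based node pools are hypothetical; the slide does not specify how "too high" is measured, and releasing an overflow node after the burst is left out of the sketch.

```python
def place_new_worker(avg_load, threshold, dedicated_free, overflow_free):
    """Return the node on which to launch a new worker, or None."""
    if avg_load <= threshold:
        return None                    # load is acceptable: no launch needed
    if dedicated_free:
        return dedicated_free.pop(0)   # prefer dedicated cluster nodes
    if overflow_free:
        return overflow_free.pop(0)    # dedicated nodes exhausted: borrow an overflow node
    return None                        # nothing left to launch on


dedicated = ["d1"]
overflow = ["desk1"]
first = place_new_worker(12, 10, dedicated, overflow)
second = place_new_worker(12, 10, dedicated, overflow)
third = place_new_worker(12, 10, dedicated, overflow)
idle = place_new_worker(5, 10, ["d2"], [])
```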

Fault Tolerance
Starfish fault tolerance
– “Peer” monitoring as opposed to primary/secondary fault tolerance
Two mechanisms:
– Timeouts and retries
– Preemptive detection and component restart
Reliance on soft state simplifies crash recovery
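A minimal sketch of the first mechanism, timeouts and retries, in Python. The helper `call_with_retries`, the `send` callback, and the round-robin choice of the next worker are hypothetical. The retry is safe only because tasks are atomic and restartable.

```python
def call_with_retries(send, workers, max_attempts=3):
    """Try a restartable task on successive workers until one answers."""
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]  # assumed policy: move to the next worker
        try:
            return send(worker)
        except TimeoutError as err:
            last_error = err  # treat the worker as failed and try another
    raise RuntimeError("all attempts timed out") from last_error


def send(worker):
    # Stand-in for a network call: worker "w1" never answers in time.
    if worker == "w1":
        raise TimeoutError(worker)
    return "ok from " + worker


result = call_with_retries(send, ["w1", "w2"])
```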

Fault Tolerance
[Figure labels: Worker, Worker Driver, SNS Manager; AmRestarting, ReRegister]

Example Applications
TranSend
– Web proxy for on-the-fly content distillation
Wingman
– The world’s only graphical web browser for the 3COM PalmPilot
TopGun Mediaboard
– PDA groupware: shared electronic whiteboard for the 3COM PalmPilot
MARS
– MBone archive server

Evaluation
[Three figure slides; the only surviving annotations are: Worker 2 started, Worker 3 started, Workers 4 & 5 started]

Summary
Reusable architecture substrate for building Internet service applications
Application developers program their services to a well-defined narrow interface
SNS takes care of resource location, spawning, load balancing, fault tolerance
Number of interesting applications on top of the SNS substrate
Next step: SNSv2 NINJA