1
Systems Issues for Scalable, Fault Tolerant Internet Services
Yatin Chawathe, Eric Brewer
To appear in Middleware ’98
http://www.cs.berkeley.edu/~yatin/papers/sns-crc.ps
2
Motivation
Proliferation of network-based services
Two critical issues must be addressed by Internet services:
  – System scalability
      Incremental and linear scalability
  – Availability and fault tolerance
      24x7 operation
3
A Reusable SNS Framework
Clusters of workstations are ideal for Internet services [FGC+97]
But, clusters are difficult to manage
  – To ensure linear scalability, service must distribute load across the cluster
  – Service must grow the cluster with increasing load
  – Partial failures within a cluster complicate fault management
Isolate common requirements of cluster-based Internet apps into a reusable substrate -- the Scalable Network Services (SNS) framework
4
Architecture
[Figure: architecture diagram. Labels: SNS Manager, Internal Network, Worker, Worker Driver, Outside World]
5
Workers
Workers are grouped into classes. Within a class, workers are identical
Workers can receive tasks from the outside world, or from other workers
Workers have a simple serial interface for tasks
  – The originator sends a task to the consumer by specifying the class and inputs for the task
  – Tasks are atomic and restartable
  – Worker Drivers present a narrow interface between the SNS substrate and the worker application
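The task interface described on this slide can be illustrated with a minimal Python sketch. It is not the SNS API: `Task`, `WorkerDriver`, `run`, and the "uppercase" worker class are hypothetical names chosen for illustration. It only shows a task that names a class and carries inputs, and a driver that exposes a single narrow entry point to the application code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Task:
    """A unit of work: the target worker class plus its inputs (hypothetical shape)."""
    worker_class: str
    inputs: Any


class WorkerDriver:
    """Hypothetical driver: the only surface the substrate sees of a worker application."""

    def __init__(self, worker_class: str, handler: Callable[[Any], Any]):
        self.worker_class = worker_class
        self._handler = handler  # application code: one task in, one result out

    def run(self, task: Task) -> Any:
        if task.worker_class != self.worker_class:
            raise ValueError(f"task is for class {task.worker_class!r}")
        # The handler in this sketch keeps no state between calls, so the same
        # task can simply be run again after a failure (restartable).
        return self._handler(task.inputs)


driver = WorkerDriver("uppercase", lambda text: text.upper())
result = driver.run(Task("uppercase", "hello"))
```

Because the driver only needs the class name and the inputs, any identical worker of the same class could run the task, which is what allows load to be spread within a class.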
6
Centralized SNS Manager
SNS Manager is intentionally centralized
  – makes it easier to reason about and implement the various policies
  – “all” we need to do is ensure the fault tolerance of the manager, and make sure it is not a performance bottleneck
Three key functions
  – Resource location
  – Load balancing and scalability
  – Fault tolerance
7
Resource Location
[Figure: resource location diagram. Labels: Worker, Worker Driver, SNS Manager, Multicast Beacons, Register, Find, Found, Persistent Connection]
8
Load Balancing
Load measurement and reporting
  – Each worker examines incoming requests and estimates the “load” that would be generated
  – Simplest load metric: queue length at workers
  – Workers periodically report their current load to the SNS Manager
  – SNS Manager maintains load history and aggregates load reports from all workers
  – Load reports are piggybacked on manager beacons to rest of the system
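A small Python sketch of the reporting path above, under stated assumptions: the slide does not say how the manager aggregates its load history, so the averaging over a fixed window here is an assumption, and `LoadTable`, `report`, and `snapshot` are hypothetical names rather than SNS interfaces. Load is taken to be the queue length, the simplest metric named on the slide.

```python
from collections import defaultdict, deque


class LoadTable:
    """Hypothetical manager-side table of recent load reports per worker."""

    def __init__(self, history: int = 4):
        # Keep only the last `history` reports per worker (window size is an assumption).
        self._reports = defaultdict(lambda: deque(maxlen=history))

    def report(self, worker_id: str, queue_length: int) -> None:
        """Record one periodic report; load here is the worker's queue length."""
        self._reports[worker_id].append(queue_length)

    def snapshot(self) -> dict:
        """Average each worker's recent reports; a beacon could carry this mapping."""
        return {w: sum(r) / len(r) for w, r in self._reports.items()}


table = LoadTable(history=2)
for q in (4, 8, 2):
    table.report("worker-a", q)
table.report("worker-b", 1)
beacon_payload = table.snapshot()
```

With a window of two reports, only the last two values from `worker-a` contribute to its average; older reports fall out of the history automatically.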
9
Load Balancing
Each worker makes local load balancing decisions
Use lottery scheduling -- the number of tickets is inversely proportional to worker load
Stale load reports can cause oscillations
  – Use a correction factor based on the number of requests that were sent since last load report
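The lottery can be sketched in a few lines of Python. The exact ticket formula and the form of the correction factor are not given on the slide, so both are assumptions here: tickets are `1 / (1 + estimated load)`, and the estimate adds the requests this sender has dispatched since the last report to the reported load. The names `tickets` and `pick_worker` are hypothetical.

```python
import random


def tickets(reported_load: float, sent_since_report: int) -> float:
    """Tickets are inversely proportional to the estimated load.

    The estimate adds requests sent since the last report to the reported load
    (an assumed form of the correction factor); the +1 avoids division by zero.
    """
    return 1.0 / (1.0 + reported_load + sent_since_report)


def pick_worker(loads: dict, sent: dict, rng: random.Random) -> str:
    """Hold a lottery: each worker wins with probability tickets / total tickets."""
    workers = list(loads)
    weights = [tickets(loads[w], sent.get(w, 0)) for w in workers]
    winner = rng.choices(workers, weights=weights, k=1)[0]
    sent[winner] = sent.get(winner, 0) + 1  # correct the stale report locally
    return winner


rng = random.Random(0)
loads = {"a": 0.0, "b": 9.0}   # last reported queue lengths
sent = {}                      # requests sent since those reports
picks = [pick_worker(loads, sent, rng) for _ in range(100)]
```

Without the local `sent` counter, every sender would keep favoring worker `a` until the next report arrived, which is the oscillation the slide warns about; counting dispatched requests makes `a` progressively less attractive between reports.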
10
Auto-launch for Scalability
Worker replication to handle short traffic bursts
  – Multiple workers handle requests in parallel
  – If load on a class of workers gets too high, the SNS Manager launches a new one
Overflow pool for long bursts
  – non-dedicated set of machines (e.g. users’ desktop machines)
  – when all dedicated nodes are exhausted, harness an overflow node; release it after burst subsides
  – useful for incremental scalability
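The launch decision can be sketched as a single Python function. The slide does not define "too high", so the average-over-threshold rule is an assumption, and `plan_launch` and its parameters are hypothetical names. The sketch prefers a dedicated node and falls back to the overflow pool only when no dedicated node is free.

```python
def plan_launch(class_loads, threshold, free_dedicated, free_overflow):
    """Decide where, if anywhere, to launch one more worker of a class.

    class_loads: recent loads of the class's current workers.
    Returns ("dedicated", node), ("overflow", node), or None.
    """
    if not class_loads:
        return None                      # nothing to measure in this sketch
    average = sum(class_loads) / len(class_loads)
    if average <= threshold:
        return None                      # load is acceptable; do nothing
    if free_dedicated:
        return ("dedicated", free_dedicated[0])
    if free_overflow:
        # Borrowed machine: the caller should release it once the burst subsides.
        return ("overflow", free_overflow[0])
    return None                          # no node left to harness
```

For example, `plan_launch([8, 9], 5, [], ["desk-1"])` would select the overflow node `desk-1`, since the average load exceeds the threshold and no dedicated node is free.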
11
Fault Tolerance
Starfish fault tolerance
  – “Peer” monitoring as opposed to primary/secondary fault tolerance
Two mechanisms:
  – Timeouts and retries
  – Preemptive detection and component restart
Reliance on soft state simplifies crash recovery
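The first mechanism, timeouts and retries, can be sketched in Python as follows. This is an illustration under assumptions, not the SNS implementation: `dispatch`, `send`, and `flaky_send` are hypothetical, and a timeout is modeled by raising Python's built-in `TimeoutError`. Retrying on a different worker is safe here only because tasks are atomic and restartable.

```python
def dispatch(task, workers, send, max_attempts=3):
    """Try workers in turn until one answers.

    send(worker, task) is a hypothetical transport call that returns a result
    or raises TimeoutError when the worker does not answer in time.
    Returns (result, workers_that_timed_out).
    """
    failed = []
    for worker in workers[:max_attempts]:
        try:
            return send(worker, task), failed
        except TimeoutError:
            failed.append(worker)   # candidate for a restart by the substrate
    raise RuntimeError(f"task failed on all of {failed}")


def flaky_send(worker, task):
    """Stand-in transport for the demo: worker 'w1' never answers."""
    if worker == "w1":
        raise TimeoutError(worker)
    return f"{worker} handled {task}"


result, failed = dispatch("distill page", ["w1", "w2", "w3"], flaky_send)
```

The list of workers that timed out is returned alongside the result so that a caller could feed it into the second mechanism, detection and component restart.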
12
Fault Tolerance
[Figure: fault tolerance diagram. Labels: Worker, Worker Driver, SNS Manager, AmRestarting, ReRegister]
13
Example Applications
TranSend
  – Web proxy for on-the-fly content distillation
Wingman
  – The world’s only graphical web browser for the 3COM PalmPilot
TopGun Mediaboard
  – PDA groupware: shared electronic whiteboard for the 3COM PalmPilot
MARS
  – MBone archive server
14
Evaluation
15
Evaluation
16
Evaluation
[Figure annotations: Worker 2 started; Worker 3 started; Workers 4 & 5 started]
17
Summary
Reusable architecture substrate for building Internet service applications
Application developers program their services to a well-defined narrow interface
SNS takes care of resource location, spawning, load balancing, fault tolerance
Number of interesting applications on top of the SNS substrate
Next step: SNSv2 NINJA