Scalable Applications and Real Time Response. Ashish Motivala, CS 614, April 17th, 2001.

Scalable Applications and Real Time Response. Papers covered: (1) Using Group Communication Technology to Implement a Reliable and Scalable Distributed IN Coprocessor; Roy Friedman and Ken Birman; TINA. (2) Manageability, Availability and Performance in Porcupine: a Highly Scalable, Cluster-based Mail Service; Yasushi Saito, Brian N. Bershad and Henry M. Levy; Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999, pages 1–15.

Real-time Two categories of real-time: –When an action needs to be predictably fast, e.g. in critical applications. –When an action must be taken before a time limit passes. More often than not, real-time doesn’t mean “as fast as possible”; it means “slow and steady”.

Real problems need real-time Air Traffic Control, Free Flight –the system must track when planes are at various locations. Medical Monitoring, Remote Tele-surgery –doctors judge how a patient responded after a drug was given, or change therapy after some amount of time. Process control software, Robot actions –a process controller runs a factory floor by coordinating machine-tool activities.

More real-time problems Video and multi-media systems –synchronous communication protocols that coordinate video, voice, and other data sources Telecommunications systems –guarantee real-time response despite failures, for example when switching telephone calls

Predictability If this is our goal… –Any well-behaved mechanism may be adequate –But we should be careful about uncommon disruptive cases. For example, the cost of failure handling is often overlooked; the risk is that an infrequent scenario will be very costly when it occurs.

Predictability: Examples Probabilistic multicast protocol –Very predictable if our desired latencies are larger than the protocol’s expected convergence time –Much less so if we seek latencies close to the expected latency of the protocol itself

Back to the paper Telephone networks need a mixture of properties –Real-time response –High performance –Stable behavior even when failures and recoveries occur Can we use our tools to solve such a problem?

Role of coprocessor A simple database –The switch does a query: “How should I route a call to … from …?” Reply: “use output line 6” –Time limit of 100ms per transaction. Also used for caller ID, call conferencing, automatic transferring, voice menus, etc., and for database updates.

IN coprocessor (diagram: four interconnected SS7 switches)

IN coprocessor (diagram: the same four SS7 switches, now with a coprocessor attached)

Present coprocessor Right now, people use hardware fault-tolerant machines for this –E.g. Stratus “pair and a spare” –Mimics a single computer but tolerates hardware failures –Is performance an issue?

Goals for coprocessor Requirements –Scalability: ability to use a cluster of machines for the same task, with better performance when we use more nodes –Fault-tolerance: a crash or recovery shouldn’t disrupt the system –Real-time response: must satisfy the 100ms limit at all times Downtime: any period when a series of requests might all be rejected Desired: 7 to 9 nines availability

SS7 experiment Horus runs the “800 number database” on a cluster of processors next to the switch Provide replication management tools Provide failure detection and automatic configuration

IN coprocessor example (diagram): Query Element (QE) processors do the number lookup (in-memory database); the goal is scalable memory without loss of processing performance as the number of nodes increases. The switch itself asks for help when a remote-number call is sensed. External adaptor (EA) processors run the query protocol. A primary-backup scheme, adapted using small Horus process groups, provides fault-tolerance with real-time guarantees.

Options? A simple scheme: –Organize nodes as groups of 2 processes –Use virtual synchrony multicast For query For response Also for updates and membership tracking

IN coprocessor example (diagram) Step 1: The switch sees an incoming request.

IN coprocessor example (diagram) Step 2: The switch waits while the EA processes multicast the request to the group of query elements (a “partitioned” database).

IN coprocessor example (diagram) Step 3: The query elements do the query in duplicate.

IN coprocessor example (diagram) Step 4: They reply to the group of EA processes.

IN coprocessor example (diagram) Step 5: The EA processes reply to the switch, which routes the call.

Results!! Terrible performance! –The solution has 2 Horus multicasts on each critical path –Experience: about 600 queries per second, but no more. Also: slow to handle failures –freezes for as long as 6 seconds. Performance doesn’t improve much with scale either.

Next try Consider taking Horus off the critical path. The idea is to continue using Horus –It manages groups –And we use it for updates to the database and for partitioning the QE set. But no multicasts on the critical path –Instead use a hand-coded scheme. Use sender (FIFO) ordering instead of total ordering.

Hand-coded scheme Queue up the set of requests from an EA to a QE. Periodically (every 15 ms), sweep the set into a single message and send it as a batch. The QE processes the queries as a batch too, and sends the batch of replies back to the EA.
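
Below is a minimal sketch, in Python for brevity, of the EA-to-QE batching idea: requests are queued as they arrive and swept into one batched message every 15 ms. The names (Batcher, handle_batch) and the structure are illustrative assumptions, not the authors' Horus-based implementation.

```python
import queue
import threading
import time

BATCH_INTERVAL = 0.015  # 15 ms sweep period, as on the slide

class Batcher:
    """EA side: queue requests and ship them to the QE as one batch per sweep."""
    def __init__(self, send_batch):
        self.pending = queue.Queue()
        self.send_batch = send_batch  # callback that sends one batched message
        threading.Thread(target=self._sweep, daemon=True).start()

    def submit(self, request):
        self.pending.put(request)     # queries accumulate between sweeps

    def _sweep(self):
        while True:
            time.sleep(BATCH_INTERVAL)
            batch = []
            while not self.pending.empty():
                batch.append(self.pending.get())
            if batch:
                self.send_batch(batch)  # one message carries the whole batch

def handle_batch(batch):
    # QE side: process the queries as a batch and return a batch of replies.
    replies = ["use output line 6" for _ in batch]
    print(f"processed {len(batch)} queries, returning {len(replies)} replies")

if __name__ == "__main__":
    batcher = Batcher(handle_batch)
    for i in range(100):
        batcher.submit(f"route call {i}")
        time.sleep(0.001)             # roughly 1 ms between incoming queries
    time.sleep(0.05)                  # let the last sweep run
```

The point of the 15 ms sweep is that the per-message protocol cost is paid once per batch rather than once per query, which is why the critical path no longer needs a multicast.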

Clever twists Split the EAs into a primary and a secondary for each request –The secondary steps in if no reply is seen within 50ms –The batch size is calculated so that 50ms should be “long enough” –Alternate primary and secondary roles after each request.

Handling Failure and Overload Failure –QE failure: the backup EA reissues the request after half the deadline has passed, without waiting for the failure detector –EA failure: the other EA takes over and handles all the requests. Overload –Drop requests when there is no chance of servicing them, rather than missing all deadlines –Use high and low watermarks.
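
A sketch of the high/low-watermark overload policy described above, under stated assumptions: a single request queue is measured by depth, requests are rejected once the depth crosses the high watermark, and acceptance resumes only after the backlog drains below the low watermark. The threshold values and class name are made up for illustration.

```python
from collections import deque

HIGH_WATERMARK = 1000  # start shedding load above this queue depth (illustrative)
LOW_WATERMARK = 600    # resume accepting once the backlog drains below this

class SheddingQueue:
    """Drop requests we have no chance of serving, instead of missing every deadline."""
    def __init__(self):
        self.q = deque()
        self.shedding = False

    def offer(self, request):
        depth = len(self.q)
        if self.shedding and depth <= LOW_WATERMARK:
            self.shedding = False      # backlog has drained; accept again
        elif not self.shedding and depth >= HIGH_WATERMARK:
            self.shedding = True       # overloaded; reject new work for now
        if self.shedding:
            return False               # caller sees an immediate rejection
        self.q.append(request)
        return True

    def take(self):
        return self.q.popleft() if self.q else None
```

Using two thresholds instead of one avoids oscillating between accepting and dropping when the load hovers right at the limit.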

Results Able to sustain 22,000 emulated telephone calls per second Able to guarantee response within 100ms and no more than 3% of calls are dropped (randomly) Performance is not hurt by a single failure or recovery while switch is running Can put database in memory: memory size increases with number of nodes in cluster

Other settings with a strong temporal element Load balancing –Idea is to track load of a set of machines –Can do this at an access point or in the client –Then want to rebalance by issuing requests preferentially to less loaded servers

Load balancing in farms Akamai widely cited –They download the rarely-changing content from customer web sites –Distribute this to their own web farm –Then use a hacked DNS to redirect web accesses to a close-by, less-loaded machine Real-time aspects? –The data on which this is based needs to be fresh or we’ll send to the wrong server

Conclusions Protocols like pbcast are potentially appealing in a subset of applications that are naturally probabilistic to begin with, and where we may have knowledge of expected load levels, etc. More traditional virtual synchrony protocols with strong consistency properties make more sense in standard networking settings

Future directions in real-time Expect GPS time sources to be common within five years Real-time tools like periodic process groups will also be readily available (members take actions in a temporally coordinated way) Increasing focus on predictable high performance rather than provable worst-case performance Increasing use of probabilistic techniques

Dimensions of Scalability We often say that we want systems that “scale” But what does scalability mean? As with reliability & security, the term “scalability” is very much in the eye of the beholder

Scalability As a reliability question: –Suppose a system experiences some rate of disruptions r –How does r change as a function of the size of the system? If r rises when the system gets larger we would say that the system scales poorly Need to ask what “disruption” means, and what “size” means…

Scalability As a management question –Suppose it takes some amount of effort to set up the system –How does this effort rise for a larger configuration? –Can lead to surprising discoveries E.g. the 2-machine demo is easy, but setup for 100 machines is extremely hard to define

Scalability As a question about throughput –Suppose the system can do t operations each second –Now I make the system larger Does t increase as a function of system size? Decrease? Is the behavior of the system stable, or unstable?

Scalability As a question about dependency on configuration –Many technologies need to know something about the network setup or properties –The larger the system, the less we know! –This can make a technology fragile, hard to configure, and hence poorly scalable

Scalability As a question about costs –Most systems have a basic cost E.g. 2PC “costs” 3N messages –And many have a background overhead E.g. gossip involves sending one message per round, receiving (on average) one per round, and doing some retransmission work (rarely) Can ask how these costs change as we make our system larger, or make the network noisier, etc.

Scalability As a question about environments –Small systems are well-behaved –But large ones are more like the Internet Packet loss rates and congestion can be problems Performance gets bursty and erratic More heterogeneity of connections and of machines on which applications run –The larger the environment, the nastier it may be!

Scalability As a pro-active question –How can we design for scalability? –We know a lot about technologies –Are certain styles of system more scalable than others?

Approaches Many ways to evaluate systems: –Experiments on the real system –Emulation environments –Simulation –Theoretical (“analytic”) But we need to know what we want to evaluate

Dangers “Lies, damn lies, and statistics” –It is much too easy to pick some random property of a system, graph it as a function of something, and declare success –We need sophistication in designing our evaluation or we’ll miss the point. Example: message overhead of gossip –Technically, O(n) –Does any process or link see this cost? Perhaps not, if the protocol is designed carefully.
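
As a small worked example of that last point (my own illustration, not from the slides): in one round of gossip each of the n processes sends a single message to a random peer, so the total is O(n) messages per round, yet every process sends exactly one and receives about one on average, so no individual process or link bears the O(n) cost.

```python
import random
from collections import Counter

def gossip_round(n):
    """One gossip round: each process sends one message to a random peer."""
    received = Counter()
    for sender in range(n):
        peer = random.randrange(n - 1)
        if peer >= sender:             # skip self without rebuilding the peer list
            peer += 1
        received[peer] += 1
    return received

if __name__ == "__main__":
    n = 1000
    received = gossip_round(n)
    print("total messages:", sum(received.values()))               # n, i.e. O(n) overall
    print("max received by one process:", max(received.values()))  # a small constant
    print("average received:", sum(received.values()) / n)         # about 1 per process
```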

Technologies TCP/IP and O/S message-passing architectures like U-Net RPC and client-server architectures Transactions and nested transactions Virtual synchrony and replication Other forms of multicast Object oriented architectures Cluster management facilities

You’ve Got Mail Cluster research has focused on web services. Mail is an example of a write-intensive application –disk-bound workload –reliability requirements –failure recovery. Mail servers have relied on a “brute force” approach to scaling –big-iron file server, RDBMS.

Conventional Mail Servers (diagram: sendmail and popd front ends, a user DB server, and NFS servers, with users statically partitioned across them). Performance problems: no dynamic load balancing. Manageability problems: manual data-partitioning decisions. Availability problems: limited fault tolerance.

Porcupine’s Goals Use commodity hardware to build a large, scalable mail service Performance: Linear increase with cluster size Manageability: React to changes automatically Availability: Survive failures gracefully 1 billion messages/day (100x existing systems) 100 million users (10x existing systems) 1000 nodes (50x existing systems)

Key Techniques and Relationships (diagram). Framework: functional homogeneity, “any node can perform any task”. Techniques: automatic reconfiguration, load balancing, replication. Goals: manageability, performance, availability.

Porcupine Architecture (diagram): every node (A, B, …, Z) runs the same set of components: SMTP server, POP server, IMAP server, mail map, mailbox storage, user profile, user map, replication manager, membership manager, RPC, and load balancer.

Basic Data Structures (diagram): a user name such as “bob” is run through a hash function that indexes the user map, a small replicated table (e.g. B C A C A B A C) assigning each hash bucket to a node. That node’s mail map / user info records which nodes hold the user’s messages (bob : {A,C}, ann : {B}, suzy : {A,C}, joe : {B}). Mailbox storage on nodes A, B, and C holds the message fragments themselves (Bob’s, Suzy’s, Joe’s, and Ann’s messages spread across the nodes).
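
A minimal sketch of these two structures, with illustrative names and an assumed bucket count: hashing a user name indexes the user map to find the managing node, and that node's mail map lists the nodes holding the user's messages.

```python
import hashlib

# User map: a small replicated table mapping hash buckets to nodes
# (the B/C/A/... pattern mirrors the slide, but is otherwise arbitrary).
USER_MAP = ["B", "C", "A", "C", "A", "B", "A", "C"]

# Mail map, kept at each user's managing node: user -> nodes storing mail fragments
MAIL_MAP = {
    "bob": {"A", "C"},
    "ann": {"B"},
    "suzy": {"A", "C"},
    "joe": {"B"},
}

def bucket(user: str) -> int:
    # A stable hash of the user name selects a user-map bucket.
    return hashlib.md5(user.encode()).digest()[0] % len(USER_MAP)

def manager_of(user: str) -> str:
    return USER_MAP[bucket(user)]

if __name__ == "__main__":
    for user, fragments in MAIL_MAP.items():
        print(f"{user}: managed by {manager_of(user)}, messages on {sorted(fragments)}")
```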

Porcupine Operations (diagram): 1. “send mail to bob” arrives from the Internet at a node chosen by DNS-RR selection (protocol handling). 2. That node asks “who manages bob?” via the user map; the answer is node A (user lookup). 3. It sends “Verify bob” to A. 4. A replies “OK, bob has msgs on C and D”. 5. The load balancer picks the best node to store the new message, say C (load balancing). 6. “Store msg” is sent to C (message store).
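
A self-contained sketch of that six-step path with hypothetical names (not the Porcupine API): the receiving node hashes the user to find the manager, learns which nodes already hold the user's mail, and picks a lightly loaded storage node for the new message.

```python
import hashlib

USER_MAP = ["B", "C", "A", "C", "A", "B", "A", "C"]   # bucket -> managing node
MAIL_MAP = {"bob": {"C", "D"}}                        # manager's view: user -> fragment nodes
PENDING_IO = {"A": 3, "B": 1, "C": 0, "D": 5}         # per-node load measure

def manager_of(user):
    return USER_MAP[hashlib.md5(user.encode()).digest()[0] % len(USER_MAP)]

def deliver(user, message):
    manager = manager_of(user)                          # step 2: who manages this user?
    fragments = MAIL_MAP.get(user, set())               # steps 3-4: verify user, learn fragment nodes
    candidates = fragments if fragments else set(PENDING_IO)
    target = min(candidates, key=PENDING_IO.get)        # step 5: pick the best storage node
    MAIL_MAP.setdefault(user, set()).add(target)        # step 6: store the message, record location
    PENDING_IO[target] += 1
    return manager, target

if __name__ == "__main__":
    # Prints (manager node, storage node); the new message lands on C, the
    # node with the fewest pending I/O requests among bob's fragment nodes.
    print(deliver("bob", "hello"))
```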

Measurement Environment 30-node cluster of not-quite-identical PCs; 100 Mb/s Ethernet with 1 Gb/s hubs; Linux; roughly 41,000 lines of C++ code; synthetic load; compared against sendmail+popd.

Performance Goals Scale performance linearly with cluster size Strategy: Avoid creating hot spots Partition data uniformly among nodes Fine-grain data partition

How does Performance Scale? (graph of messages per day vs. cluster size: Porcupine reaches about 68m messages/day on the full cluster, versus about 25m/day for the conventional sendmail+popd configuration)

Availability Goals: Maintain function after failures. React quickly to changes regardless of cluster size. Graceful performance degradation / improvement. Strategy: Hard state (messages, user profile): optimistic fine-grain replication. Soft state (user map, mail map): reconstruction after membership change.

Soft-state Reconstruction (diagram, a timeline across nodes A, B, and C): 1. The membership protocol detects the change and the user map is recomputed, reassigning hash buckets among the live nodes. 2. A distributed disk scan then rebuilds the mail-map entries (bob : {A,C}, joe : {C}, suzy : {A,B}, ann : {B}) at whichever nodes now manage those users.
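
A minimal sketch of the two phases described on this slide, with made-up names and a fixed bucket count: after the membership protocol settles, the user map is recomputed over the surviving nodes, and a disk scan on each node repopulates the mail-map entries at each user's new manager.

```python
import hashlib

BUCKETS = 8

def bucket(user):
    return hashlib.md5(user.encode()).digest()[0] % BUCKETS

def recompute_user_map(live_nodes):
    # Phase 1: reassign hash buckets across the surviving nodes (round-robin here).
    return [live_nodes[b % len(live_nodes)] for b in range(BUCKETS)]

def rebuild_mail_map(user_map, mailboxes_on_disk):
    # Phase 2: "distributed disk scan": each node reports the users whose mail
    # it stores, and the entry is installed at that user's new managing node.
    mail_map = {node: {} for node in set(user_map)}
    for node, users in mailboxes_on_disk.items():
        for user in users:
            manager = user_map[bucket(user)]
            mail_map[manager].setdefault(user, set()).add(node)
    return mail_map

if __name__ == "__main__":
    user_map = recompute_user_map(["A", "C"])                  # node B has failed
    disks = {"A": ["bob", "suzy"], "C": ["bob", "joe", "suzy"]}
    for manager, entries in rebuild_mail_map(user_map, disks).items():
        print(manager, {u: sorted(nodes) for u, nodes in entries.items()})
```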

How does Porcupine React to Configuration Changes?

Hard-state Replication Goals: Keep serving hard state after failures Handle unusual failure modes Strategy: Exploit Internet semantics Optimistic, eventually consistent replication Per-message, per-user-profile replication Efficient during normal operation Small window of inconsistency
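
A rough sketch of what “optimistic, eventually consistent” per-message replication can look like, under my own simplifying assumptions (this is not the paper's protocol): the write is applied at one replica and acknowledged immediately, updates are pushed to the peers, and a peer that is down is caught up by a background retry thread, leaving a small window of inconsistency.

```python
import queue
import threading
import time

class Replica:
    def __init__(self, name):
        self.name, self.msgs, self.up = name, set(), True

    def apply(self, msg):
        if self.up:
            self.msgs.add(msg)   # messages are immutable, so re-applying is harmless
            return True
        return False             # node is down; the update must be retried later

class Replicator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.retry = queue.Queue()
        threading.Thread(target=self._retry_loop, daemon=True).start()

    def store(self, msg):
        self.replicas[0].apply(msg)          # acknowledge after the first copy is durable
        for peer in self.replicas[1:]:
            if not peer.apply(msg):          # push to the other replicas;
                self.retry.put((peer, msg))  # a failed push is retried in the background

    def _retry_loop(self):
        while True:
            peer, msg = self.retry.get()
            while not peer.apply(msg):       # keep trying until the replica recovers
                time.sleep(0.01)

if __name__ == "__main__":
    a, b = Replica("A"), Replica("B")
    b.up = False                             # B is temporarily unreachable
    r = Replicator([a, b])
    r.store("msg-1")                         # client is acked even though B missed it
    b.up = True
    time.sleep(0.05)                         # background retry brings B up to date
    print(sorted(a.msgs), sorted(b.msgs))
```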

How Efficient is Replication? (graph: about 68m messages/day without replication drops to about 24m/day with replication)

How Efficient is Replication? (graph: roughly 68m messages/day without replication, roughly 24m/day with replication, and roughly 33m/day for a third, optimized replication configuration)

Load balancing: Deciding where to store messages. Goals: Handle skewed workloads well. Support hardware heterogeneity. Strategy: Spread-based load balancing. Spread: a soft limit on the number of nodes per mailbox. Large spread: better load balance. Small spread: better affinity. Load is balanced within the spread, using the number of pending I/O requests as the load measure.
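
A minimal sketch of spread-limited load balancing under stated assumptions: each user gets a deterministic candidate set of `spread` nodes (chosen here via a rendezvous-style hash, which is my own choice rather than Porcupine's), and the least-loaded candidate, measured by pending I/O requests, receives the new message.

```python
import hashlib

SPREAD = 2  # soft limit on the number of nodes per mailbox (illustrative value)

def candidates(user, nodes, spread=SPREAD):
    # Deterministic per-user candidate set: rank nodes by hash(user + node).
    ranked = sorted(nodes, key=lambda n: hashlib.md5((user + n).encode()).digest())
    return ranked[:spread]

def pick_node(user, nodes, pending_io):
    # Balance only within the user's spread, using pending I/Os as the load measure.
    return min(candidates(user, nodes), key=lambda n: pending_io[n])

if __name__ == "__main__":
    nodes = ["A", "B", "C", "D"]
    pending_io = {"A": 4, "B": 0, "C": 7, "D": 2}
    print("candidates for bob:", candidates("bob", nodes))
    print("store bob's new message on:", pick_node("bob", nodes, pending_io))
```

A larger SPREAD lets the balancer smooth out skew at the cost of scattering a mailbox across more nodes (worse affinity), which is exactly the trade-off the slide describes.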

How Well does Porcupine Support Heterogeneous Clusters? (graph: adding one faster node yields +16.8m messages/day (+25%) with Porcupine’s dynamic load balancing, but only +0.5m/day (+0.8%) with static partitioning)

Claims Symmetric function distribution Distribute user database and user mailbox –Lazy data management Self-management –Automatic load balancing, membership management Graceful Degradation –Cluster remains functional despite any number of failures

Retrospect Questions: –How does the system scale? –How costly is the failure recovery procedure? Two scenarios tested –Steady state –Node failure. Does Porcupine scale? –The paper says “yes” –But in their work we can see a reconfiguration disruption when nodes fail or recover. With larger scale, the frequency of such events will rise, and the cost is linear in system size –Very likely that on large clusters this overhead would become dominant!

Some Other Interesting Papers The Next Generation Internet: Unsafe at Any Speed?, Ken Birman. Lessons from Giant-Scale Services, Eric Brewer, UCB.