A Framework for Highly-Available Session-Oriented Internet Services Bhaskaran Raman, Prof. Randy H. Katz {bhaskar, The ICEBERG Project.


A Framework for Highly-Available Session-Oriented Internet Services Bhaskaran Raman, Prof. Randy H. Katz {bhaskar, The ICEBERG Project EECS, U.C.Berkeley

Resilient Session-Oriented Internet Services
For Internet services (e.g., web-servers, proxies), robustness & high-availability are very important
Session-Oriented Services
  – Client sessions live for long: many minutes to hours
  – E.g., Audio/Video transcoding proxies, Internet Video-on-demand, Game servers
  – Important, growing class of Internet applications
For such services, high-availability means:
  – No session interruption at client in presence of failures
[Figure labels: Internet; Web-server; Web-cache; Transcoding proxy; Clients; Internet Services]

Motivation and Problem Scope
Motivation: Internet components subject to failure
  – Should be hidden from end-clients
  – Services often critical to users
  – Session re-instantiation on failure little studied so far
Goal: A framework for robust session-oriented services
  – Handle failures at all levels: process, machine, site-failure, network partition
  – Quick recovery from failure: minimal service interruption for interactive sessions
Assumption: Stateless services
  – Little or no state build-up at service during session: all state at client(s), so the problem of re-instantiating sessions is tractable
  – Examples: transcoding proxies, video-on-demand servers
  – We do not consider “stateful” services: stateful => entirely different semantics for moving sessions

Our Framework
Replicated Clusters
  – In the wide-area
  – Tolerance to site/network failures
  – Clusters could belong to independent service providers: orthogonality good for fault-tolerance
Monitoring Mechanism
  – Peer-clusters monitor each other
  – Keep track of live-ness and performance behavior, for backup/fail-over purposes
Optimal choice of server instance
  – Not only for session setup but for the entire session duration
  – Session-transfer on detecting poor performance or failure
[Figure labels: Client; Service Cluster]

Support for Session-transfer
Cluster monitors performance of its clients, or gets performance reports from clients
Peers know about each other's clients: info exchanged during session setup
Periodic heart-beats to keep track of liveness; timeouts to detect failure
On cluster/network-failure, or poor performance, session is taken over by peer cluster
[Figure labels: Service cluster; Peer cluster; Client]
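The heartbeat-and-timeout idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the fixed timeout, and the use of caller-supplied timestamps are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer cluster (hypothetical API)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # assumed fixed timeout, in seconds
        self.last_seen = {}          # peer id -> time of last heartbeat

    def record_heartbeat(self, peer, now):
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def failed_peers(self, now):
        """Peers whose last heartbeat is older than the timeout."""
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout_s)


def sessions_to_take_over(monitor, peer_clients, now):
    """Map each failed peer to its client list.

    `peer_clients` stands in for the client info exchanged during
    session setup (an assumed dict: peer id -> list of client ids).
    """
    return {p: peer_clients.get(p, []) for p in monitor.failed_peers(now)}
```

Passing `now` explicitly keeps the sketch deterministic; a deployment would presumably read a monotonic clock instead.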

Support for Session-transfer (Contd.)
Process/machine failures handled within cluster, by cluster-manager
Peer-monitoring done by manager-nodes: overhead amortized across many clients
Replicated cluster-manager for redundancy
Aggregation of application performance information, based on client “location”, for choosing “best” peer-backup-cluster
[Figure labels: Service cluster; Peer cluster; Manager node; Service node; Client]
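One way to picture the aggregation step is sketched below. The slide does not say which performance metric is aggregated or how; this example assumes latency reports in milliseconds, averaged per (client location, cluster) pair, and all function names are hypothetical.

```python
from collections import defaultdict


def aggregate_reports(reports):
    """Average reported latency per (client location, cluster).

    `reports` is an iterable of (location, cluster, latency_ms) tuples;
    latency is an assumed stand-in for "application performance".
    """
    sums = defaultdict(lambda: [0.0, 0])   # key -> [total latency, count]
    for location, cluster, latency_ms in reports:
        entry = sums[(location, cluster)]
        entry[0] += latency_ms
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def best_backup(averages, location, serving_cluster):
    """Lowest-latency cluster for `location`, excluding the serving cluster.

    Returns None when no other cluster has reports for that location.
    """
    candidates = {c: avg for (loc, c), avg in averages.items()
                  if loc == location and c != serving_cluster}
    if not candidates:
        return None
    # Ties are broken by cluster name so the choice is deterministic.
    return min(candidates, key=lambda c: (candidates[c], c))
```

Aggregating by location rather than per client is what lets the manager nodes keep state proportional to the number of locations instead of the number of sessions.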

Open Research Questions
Location of replicated services
  – How do clusters find each other?
  – How do they agree to peer? Who monitors who? Who can serve as backup?
  – Which peer is the best backup for a particular client?
  – What are the trust relationships between service providers?
Issues with inter-cluster wide-area monitoring
  – How to detect failures quickly in the wide-area Internet?
  – With low overhead?
  – Issue of false detection of failure due to latency variations
  – Stability: when site fails, do not want to move all sessions to the same backup cluster; this may cause load-collapse of entire system
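The stability concern can be illustrated with a small sketch that spreads a failed site's sessions over several backup clusters instead of sending them all to one. The slides leave the actual policy open; the greedy "most spare capacity first" rule, the capacity numbers, and the function name here are assumptions chosen only for illustration.

```python
def spread_sessions(sessions, spare_capacity):
    """Greedily assign sessions to the backup with the most spare capacity.

    `spare_capacity` is an assumed dict: backup cluster -> number of extra
    sessions it can absorb. Returns (assignment, unplaced), where
    `unplaced` lists sessions no backup had room for.
    """
    remaining = dict(spare_capacity)   # copy so the caller's dict is untouched
    assignment = {}
    unplaced = []
    for session in sessions:
        # Most remaining capacity wins; ties go to the smaller cluster name.
        peer = min(remaining, key=lambda p: (-remaining[p], p), default=None)
        if peer is None or remaining[peer] <= 0:
            unplaced.append(session)
            continue
        assignment[session] = peer
        remaining[peer] -= 1
    return assignment, unplaced
```

Because no backup is ever pushed past its declared spare capacity, a single site failure cannot overload one peer and trigger a cascade; sessions that do not fit are reported rather than forced onto a full cluster.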

Next Steps…
Cluster location and peering:
  – Graph of peering clusters should correspond to “network-closeness” (e.g., the sites in the figure)
  – Peering can be similar to that for Internet backbone routers
To identify “best” backup peer cluster for a client
  – Client clustering [Krishnamurthy & Wang, SIGCOMM’00]
  – Distance estimation [IDMaps, INFOCOM’99, INFOCOM’00]
Design of heartbeat mechanism & timeouts
  – Studies of Internet delay behavior: [Allman & Paxson, SIGCOMM’99], [Acharya & Saltz, UMCP, ’97]
  – RTT spikes are temporal => possible to design tight and reliable heartbeat mechanism in the wide-area?
[Figure labels: SF; LA; Houston; DC; NY; Chicago]
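The slide leaves the timeout design as an open question. One well-known starting point, borrowed here from TCP's retransmission timer (RFC 6298) rather than from this work, is to derive the timeout from a smoothed RTT plus a multiple of its measured variation, so that an isolated spike widens the timeout instead of immediately triggering a false failure detection. The class name is hypothetical, and applying this estimator to inter-cluster heartbeats is an assumption of the sketch.

```python
class AdaptiveTimeout:
    """Timeout = smoothed RTT + K * RTT variation (RFC 6298-style estimator)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variation
    K = 4           # variation multiplier

    def __init__(self):
        self.srtt = None     # smoothed RTT; None until the first sample
        self.rttvar = None   # smoothed mean deviation of the RTT

    def observe(self, rtt):
        """Fold in one RTT sample and return the updated timeout."""
        if self.srtt is None:
            # First sample initializes both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Variation is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.timeout()

    def timeout(self):
        """Current timeout; valid only after at least one observe() call."""
        return self.srtt + self.K * self.rttvar
```

Whether such an estimator is tight enough for quick wide-area fail-over is exactly the question the slide raises; the sketch only shows the mechanics.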

Related Work
Approach: (1) | (2) | (3) | (4) | (5)
Scope of failure recovery: Within cluster | In the wide-area | Local-area | Within and across wide-area clusters
Session-transfer capability: No | Yes | No | Yes
Summary of related work:
1. Cluster-based approaches: TACC [A. Fox et al., SOSP’97], LARD [V.S. Pai et al., ASPLOS’98]
2. Video transcoding proxy: Active Services [E. Amir et al., SIGCOMM 1998]
3. Web-mirror selection methods: SPAND [S. Seshan et al., INFOCOM 2000]
4. Fault-tolerant distributed computing (not Internet services)
5. Our Approach: Wide-area replicated clusters + Monitoring
Session-recovery in the wide-area not considered so far

Summary
Requirement for robust session-oriented services:
  – Arose from use of transcoding services for device communication in hybrid networks [ICEBERG, U.C.Berkeley]
  – Availability requirements especially stringent for communication services (e.g., the telephone system)
Framework for high availability
  – Wide-area monitoring mechanism between service cluster peers
Approach promises low-overhead, low-delay fail-over
  – Amortization of control overhead with use of clusters
  – Common failure cases handled within cluster
  – Backup cluster peers for quick recovery from site/network failures
Ongoing: prototype implementation and evaluation…
  – Study trade-off between (a) service re-instantiation time, (b) monitoring overhead, and (c) amount of pre-provisioning