Dealer: Application-aware Request Splitting for Interactive Cloud Applications
Mohammad Hajjat, Purdue University
Joint work with: Shankar P N (Purdue), David Maltz (Microsoft), Sanjay Rao (Purdue), and Kunwadee Sripanidkulchai (NECTEC Thailand)

Performance of Interactive Applications
- Interactive apps have stringent requirements on user response time
  - Amazon: every 100 ms of added latency costs 1% in sales
  - Google: a 0.5 s increase in delay causes traffic and revenue to drop by 20%
- The tail matters: SLAs are defined on the 90th percentile and higher of response time

Cloud Computing: Benefits and Challenges
- Benefits:
  - Elasticity
  - Cost savings
  - Geo-distribution: service resilience, disaster recovery, better user experience, etc.
- Challenges:
  - Performance is variable: [Ballani'11], [Wang'10], [Li'10], [Mangot'09], etc.
  - Even worse, data centers fail; downtime costs an average of $5,600 per minute!

Approaches for Handling Cloud Performance Variability
- Autoscaling?
  - Cannot tackle storage problems, network congestion, etc.
  - Slow: tens of minutes in public clouds
- DNS-based and server-based redirection?
  - Overloads the remote DC and wastes local resources
  - DNS-based schemes may take hours to react

Contributions
- Introduce Dealer to help interactive multi-tier applications respond to transient performance variability in the cloud
- Split requests at component granularity (rather than at the granularity of an entire DC): pick the best combination of component replicas (potentially across multiple DCs) to serve each individual request
- Benefits over naive approaches:
  - Handles a wide range of cloud variability (performance problems, network congestion, workload spikes, failures, etc.)
  - Adapts on short time scales (tens of seconds to a few minutes)
  - Improves the performance tail (90th percentile and higher):
    - More than 6x under natural cloud dynamics
    - More than 3x over redirection schemes (e.g., DNS-based load balancers)

Outline
- Introduction
- Measurement and Observations
- System Design
- Evaluation

Performance Variability in Multi-tier Interactive Applications
[Figure: architecture of the Thumbnail application — a load balancer in front of a Web Role (IIS) front-end (FE), which passes work through a queue to Worker Role business-logic components (BL1, BL2) and a blob-storage back-end (BE).]
- Multi-tier apps may consist of hundreds of components
- Deploy each app on two DCs simultaneously

Performance Variability in Multi-tier Interactive Applications
[Figure: box plots showing the 25th percentile, median, 75th percentile, and outliers of performance for each component (FE, BL1, BL2, BE).]

Observations
- Replicas of a component are uncorrelated in performance
- Only a few components show poor performance at any given time
- Performance problems are short-lived; 90% last less than 4 minutes
[Figure: per-component measurements (FE, BL1, BL2, BE) in each of the two data centers.]

Outline
- Introduction
- Measurement and Observations
- System Design
- Evaluation

Dealer Approach: Per-Component Re-routing
[Figure: components C1 … Cn replicated in two data centers behind a GTM, with Dealer splitting traffic between the replicas of each component.]
- Split requests at each component dynamically
- Serve each request using a combination of replicas across multiple DCs

Dealer System Overview
[Figure: replicas of components C1 … Cn deployed in multiple data centers, with Dealer instances and a GTM coordinating request routing.]

Dealer High-Level Design
[Figure: Dealer's high-level design — a Determine Delays module and a Compute Split-Ratios module form a control loop around the application, supported by Dynamic Capacity Estimation and Stability mechanisms.]

Determining Delays
- Monitoring: instrument apps to record:
  - Component processing time
  - Inter-component delay
  - Use X-Trace for instrumentation, which propagates a global request ID
  - Automate integration using Aspect-Oriented Programming (AOP)
  - Push logs asynchronously to reduce overhead
- Active probing:
  - Send requests along lightly used links and components
  - Use workload generators (e.g., Grinder)
  - Heuristics for faster recovery by biasing probes towards better-performing paths
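
To make the monitoring side concrete, here is a minimal Python sketch of recording per-component processing time under a request-scoped global ID. It is an illustration only, not the X-Trace/AOP instrumentation the slides describe; the names instrument, delay_log, and business_logic are hypothetical.

# Illustrative sketch of component instrumentation (not Dealer's actual X-Trace/AOP code).
import contextvars
import time
import uuid
from collections import defaultdict

request_id = contextvars.ContextVar("request_id")   # global ID propagated with each request
delay_log = defaultdict(list)                        # component name -> list of (req_id, seconds)

def instrument(component_name):
    """Record the processing time of a component call under the current request ID."""
    def wrap(fn):
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                delay_log[component_name].append(
                    (request_id.get(None), time.perf_counter() - start))
        return wrapped
    return wrap

@instrument("BL1")
def business_logic(payload):
    time.sleep(0.01)                 # stand-in for real work
    return payload.upper()

request_id.set(str(uuid.uuid4()))    # assigned once per incoming request at the front-end
business_logic("thumbnail")
print(delay_log["BL1"])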

Determining Delays
[Figure: monitoring and probing measurements are combined and then smoothed for stability, yielding:]
- Delay matrix D[,]: component processing and inter-component communication delays
- Transaction matrix T[,]: transaction rates between components
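
As a rough illustration of how the two estimate sources might be combined and smoothed into an entry of the delay matrix, consider the sketch below; the blending weight and smoothing factor are assumptions for illustration, not values from the paper.

# Illustrative sketch: blend monitoring and probing estimates, then smooth with a WMA.
def combine(monitored, probed, weight_monitor=0.7):
    """Blend monitoring and probing delay estimates; fall back to whichever exists."""
    if monitored is None:
        return probed
    if probed is None:
        return monitored
    return weight_monitor * monitored + (1 - weight_monitor) * probed

def smooth(previous, current, alpha=0.3):
    """Weighted moving average across successive measurement intervals."""
    return current if previous is None else (1 - alpha) * previous + alpha * current

# D[(i, m), (j, n)]: delay between component i in DC m and component j in DC n
D = {}
key = (("FE", "A"), ("BL1", "B"))
sample = combine(monitored=42.0, probed=55.0)    # milliseconds; made-up numbers
D[key] = smooth(D.get(key), sample)
print(D)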

Calculating Split Ratios
[Figure: the application graph (user → FE → BL1/BL2 → BE) abstracted as components C1 … C5, each with replicas Ci1 and Ci2 in the two data centers.]

Calculating Split Ratios
- Given:
  - Delay matrix D[im, jn]
  - Transaction matrix T[i, j]
  - Capacity matrix C[i, m] (capacity of component i in data center m)
- Goal: find split ratios TF[im, jn], the number of transactions between each pair of component replicas Cim and Cjn, such that overall delay is minimized
- Algorithm: a greedy algorithm that assigns requests to the best-performing combination of replicas (across DCs)
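
The greedy idea can be sketched for a single inter-component edge. The simplified fragment below (with made-up delays and capacities) routes demand to the lowest-delay replica pair that still has spare capacity; it captures the spirit, though not the exact form, of Dealer's algorithm.

# Illustrative sketch of the greedy split-ratio idea for one component pair.
def greedy_split(demand, pair_delay, capacity):
    """
    demand:     total transactions/sec between component i and component j
    pair_delay: {(m, n): delay} for i's replica in DC m talking to j's replica in DC n
    capacity:   {n: remaining capacity of j's replica in DC n}
    returns:    {(m, n): transactions/sec routed over that replica pair}
    """
    split = {}
    remaining = demand
    for (m, n), delay in sorted(pair_delay.items(), key=lambda kv: kv[1]):
        if remaining <= 0:
            break
        take = min(remaining, capacity.get(n, 0))
        if take > 0:
            split[(m, n)] = take
            capacity[n] -= take
            remaining -= take
    return split

# FE -> BL1 demand of 100 req/s, with replicas in DCs A and B (made-up numbers)
print(greedy_split(
    demand=100,
    pair_delay={("A", "A"): 20.0, ("A", "B"): 35.0, ("B", "B"): 25.0, ("B", "A"): 40.0},
    capacity={"A": 60, "B": 80},
))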

Other Design Aspects
- Dynamic capacity estimation:
  - An algorithm that dynamically captures the capacities of components
  - Prevents components from being overloaded by re-routed traffic
- Stability, at multiple levels:
  - Smooth matrices with a Weighted Moving Average (WMA)
  - Damp split ratios by a factor to avoid abrupt traffic shifts
- Integration with apps:
  - Can be integrated with any app, including stateful ones (e.g., StockTrader)
  - Provides generic pull/push APIs
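
For example, damping split ratios to avoid abrupt shifts might look like the following sketch; the damping factor of 0.5 is an assumption for illustration, not Dealer's actual value.

# Illustrative sketch of split-ratio damping (factor is assumed, not from the paper).
def damp(old_ratios, target_ratios, factor=0.5):
    """Move only part of the way from the current split toward the newly computed one."""
    keys = set(old_ratios) | set(target_ratios)
    damped = {k: old_ratios.get(k, 0.0)
                 + factor * (target_ratios.get(k, 0.0) - old_ratios.get(k, 0.0))
              for k in keys}
    total = sum(damped.values()) or 1.0
    return {k: v / total for k, v in damped.items()}   # re-normalize to sum to 1

print(damp({"DC-A": 0.9, "DC-B": 0.1}, {"DC-A": 0.3, "DC-B": 0.7}))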

Outline
- Introduction
- Measurement and Observations
- System Design
- Evaluation

- Real multi-tier, interactive apps:
  - Thumbnails: photo processing, data-intensive
  - StockTrader: stock trading, delay-sensitive, stateful
- Two Azure data centers in the US
- Workload:
  - Real workload trace from a large campus ERP app
  - DaCapo benchmark
- Comparison with existing schemes:
  - DNS-based redirection
  - Server-based redirection
- Performance variability scenarios (single fault-domain failure, storage latency, transaction mix change, etc.)

Running in the Wild
- Evaluate Dealer under natural cloud dynamics
- Explore the inherent performance variability in cloud environments
- More than 6x difference

Running in the Wild
[Figure: per-component measurements (FE, BL, BE) for deployments A and B.]

Dealer vs. GTM
- Global Traffic Managers (GTMs) use DNS to route users to the closest DC
- The best-performing DC is not necessarily the closest DC (as measured by RTT)
- Results: more than 3x improvement at the 90th percentile and higher

Dealer vs. Server-level Redirection
- Server-level redirection re-routes the entire request at the granularity of DCs (e.g., an HTTP 302 from DC A to DC B)
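
For reference, a server-based redirection scheme of this kind could be as simple as the sketch below, which bounces the whole request to DC B whenever DC A considers itself overloaded; the remote-DC address and the load check are hypothetical placeholders, not the paper's setup.

# Illustrative sketch of whole-request, DC-granularity redirection via HTTP 302.
from http.server import BaseHTTPRequestHandler, HTTPServer

REMOTE_DC = "http://dc-b.example.com"    # assumed address of the remote deployment

def local_dc_overloaded():
    return True                          # stand-in for a real load check

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if local_dc_overloaded():
            self.send_response(302)                        # re-route the entire request
            self.send_header("Location", REMOTE_DC + self.path)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"served locally by DC A")

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectingHandler).serve_forever()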

Evaluating Against Server-level Redirection
[Figure: FE, BL1, BL2, and BE replicas in the two data centers used for the comparison.]

Conclusions
- Dealer: a novel technique for handling cloud variability in multi-tier interactive apps
- Per-component re-routing: dynamically split user requests across replicas in multiple DCs at component granularity
- Handles transient cloud variability: performance problems in cloud services, workload spikes, failures, etc.
- Short time-scale adaptation: tens of seconds to a few minutes
- Performance tail improvement:
  - More than 6x under natural cloud dynamics
  - More than 3x over coarse-grained redirection (e.g., DNS-based GTM)

Questions?