Download presentation
Presentation is loading. Please wait.
Published byDonald Parks Modified over 9 years ago
1
Providing Performance Guarantees for Cloud Applications Anshul Gandhi IBM T. J. Watson Research Center Stony Brook University 1 Parijat Dube, Alexei Karve, Andrew Kochut, Li Zhang IBM T. J. Watson Research Center
2
Motivation Businesses have started moving to the cloud for their IT needs ─ reduces capital cost of buying servers ─ allows for elastic resizing of applications that have dynamic workload demand Cloud Service Providers (CSPs) offer monitoring and rule-based triggers to enable dynamic scaling of applications Amazon auto scaling Microsoft Azure Watch 2 Time Demand ?
3
Motivation The values have to be determined by the user ─ requires expert knowledge of application (CPU, memory, n/w thresholds) ─ requires performance modeling expertise (when and how to scale) Amazon auto scaling Microsoft Azure Watch How to set these values ?? 3
4
Motivation The values have to be determined by the user ─ requires expert knowledge of application (CPU, memory, n/w thresholds) ─ requires performance modeling expertise (when and how to scale) 4 arrival rate (req/s) 95% Resp. time (ms) 400 ms 60 req/s Offline benchmarking Trial-and-error Expert application knowledge The values are critical for application performance and require expertise
5
Goal Take the burden away from the user ─ user only specifies the performance requirement ─ CSP should be fully responsible for scaling 5 Offline benchmarking Trial-and-error Expert application knowledge Not possible for CSPs ! arrival rate (req/s) 95% Resp. time (ms) 400 ms 60 req/s
6
6 View from user’s perspective
7
7 View from CSP’s perspective
8
8 Problem statement How to scale an unobservable cloud application to provide performance guarantees ?
9
9 Outline 1.Existing CSP solutions (survey) 2.Our solution: Dependable Compute Cloud (DC2) a)High-level idea behind DC2 b)DC2 system architecture c)Modeling + Optimization (Queueing) (Kalman filtering) 3.Evaluation RUBiS (eBay.com) multi-tier application Various traces Various workload mixes 4.Limitations and future work
10
Existing CSP solutions Resource usage triggers ─Amazon Auto Scaling, Microsoft Azure Watch, VMware AppInsight, CiRBA Request rate for specific software (ex: apache) ─RightScale Latency/VM ─Amazon Elastic Load balancing Web site response time ─Scalr 10 User has to set values
11
11 Outline 1.Existing CSP solutions (survey) 2.Our solution: Dependable Compute Cloud (DC2) a)High-level idea behind DC2 b)DC2 system architecture c)Modeling + Optimization (Queueing) (Kalman filtering) 3.Evaluation RUBiS (eBay.com) multi-tier application Various traces Various workload mixes 4.Limitations and future work
12
12 DC2: High-level idea VM utilization Request rate End-to-end response time Service requirements of requests at each tier Network delay Background utilization (overhead)
13
13 DC2: High-level idea Service requirements of requests at each tier Network delay Background utilization (overhead) VM utilization Request rate End-to-end response time Kalman filtering
14
14 DC2: High-level idea Service requirements of requests at each tier Network delay Background utilization (overhead) VM utilization Request rate End-to-end response time Kalman filtering
15
15 Outline 1.Existing CSP solutions (survey) 2.Our solution: Dependable Compute Cloud (DC2) a)High-level idea behind DC2 b)DC2 system architecture c)Modeling + Optimization (Queueing) (Kalman filtering) 3.Evaluation RUBiS (eBay.com) multi-tier application Various traces Various workload mixes 4.Limitations and future work
16
16 DC2: System architecture DC2User initial topology + perf SLA resource monitoring application monitoring scaling directives modeling+optimization
17
17 Outline 1.Existing CSP solutions (survey) 2.Our solution: Dependable Compute Cloud (DC2) a)High-level idea behind DC2 b)DC2 system architecture c)Modeling + Optimization (Queueing) (Kalman filtering) 3.Evaluation RUBiS (eBay.com) multi-tier application Various traces Various workload mixes 4.Limitations and future work
18
18 DC2: Modeling + Optimization Given initial topology, how to dynamically scale application in a cost-effective manner to ensure user-specified SLA compliance? T SLA
19
19 DC2: Modeling multi-tier queueing network model home browse buy
20
20 DC2: Modeling λ1λ1 λ2λ2 λ3λ3 T1T1 T2T2 T3T3 S 11 S 21 S 31 S 13 S 23 S 33 S 12 S 22 S 32 U0 j didi Parameters: λ i – Request rate for class i T i – Response time for class i S ij – Service requirement for class i at tier j d i – Network latency for class i U0 j – Background utilization on tier j U j – Utilization of tier j 24 parameters 9 known + 15 unknown 6 equations
21
21 DC2: Modeling Parameters: λ i – Request rate for class i T i – Response time for class i S ij – Service requirement for class i at tier j d i – Network latency for class i U0 j – Background utilization on tier j U j – Utilization of tier j 24 parameters 9 known + 15 unknown 6 equations Underdetermined system Need to “infer” unknowns Can leverage monitored values Kalman filtering Observed states: {λ i, T i, U j } Hidden states: {S ij, d i, U0 j } “Guess” unknowns Evaluate functions using guesses Compare with monitored values Improve guess
22
22 Kalman filtering “Guess” unknowns Evaluate functions using guesses Compare with monitored values Improve guess KF is a reactive, feedback-based estimation approach that has only recently been employed for computer systems KF automatically learns the (possibly changing) system parameters, for any system, including combination of workloads We extend KF to a 3-tier 3-workload-class system Based on KF estimation, DC2 automatically, and proactively, detects which tier is the bottleneck, and how to resolve the bottleneck (scale VMs) ─ do not require any knowledge of application, except topology
23
23 Kalman filtering + Queueing “Guess” unknowns Evaluate functions using guesses Compare with monitored values Improve guess KF can be integrated with system models (ex, queueing models) to improve accuracy and convergence Model need not be accurate ─ KF leverages (true) monitored values to account for model inaccuracies ─ Well suited for approximate system models such as queueing-theoretic models ─ Can use other models as well, ex: machine-learning based models
24
24 Kalman filtering + Queueing: Evaluation Time to converge ~1 min (6 intervals) Good accuracy Change in workload triggered Time to converge ~3 min (18 intervals) Good accuracy
25
25 Outline 1.Existing CSP solutions (survey) 2.Our solution: Dependable Compute Cloud (DC2) a)High-level idea behind DC2 b)DC2 system architecture c)Modeling + Optimization (Queueing) (Kalman filtering) 3.Evaluation RUBiS (eBay.com) multi-tier application Various traces Various workload mixes 4.Limitations and future work
26
26 Experimental setup SoftLayer machines OpenStack 8-core, 8 GB RUBiS
27
27
28
28 RUBiS RUBiS is an open source benchmark inspired by ebay.com We focus on scaling Tomcat app tier Different workload classes (home, browse, buy) 4 vCPU 2 vCPU SLA: T browse < 40ms for every 10 s monitoring interval
29
29 Demo
30
30 Evaluation Bursty trace [WITS]Hill trace [ITA]Rampdown trace [WITS] Base MoreDB MoreApp MoreWeb
31
31 DC2: All traces Bursty trace [WITS]Hill trace [ITA]Rampdown trace [WITS]
32
32 THRES(x,y): Bursty trace
33
33 Bursty trace: All policies Bursty trace [WITS]Hill trace [ITA]Rampdown trace [WITS] STATIC-OPT DC2 THRES(30,60) THRES(30,50) THRES(40,60) V=0% K=3.00V=0% K=4.00V=0% K=6.00 V=0% K=2.50V=0% K=2.44V=0% K=4.76 V=0% K=2.50V=6.66% K=2.56V=0% K=6.00 V=0% K=2.79V=0% K=2.72V=0% K=6.00 V=2.02% K=2.19V=15.87% K=2.13V=0% K=4.62
34
34 All traces: All policies Bursty trace [WITS]Hill trace [ITA]Rampdown trace [WITS] STATIC-OPT DC2 THRES(30,60) THRES(30,50) THRES(40,60) V=0% K=3.00V=0% K=4.00V=0% K=6.00 V=0% K=2.50V=0% K=2.44V=0% K=4.76 V=0% K=2.50V=6.66% K=2.56V=0% K=6.00 V=0% K=2.79V=0% K=2.72V=0% K=6.00 V=2.02% K=2.19V=15.87% K=2.13V=0% K=4.62
35
35 All workloads: All policies (Bursty trace) STATIC-OPT DC2 THRES(30,60) V=0% K=3.00V=0% K=4.00V=0% K=3.00 V=0% K=2.50V=0% K=3.66V=0% K=2.94V=0% K=2.87 V=0% K=2.50V=3.06% K=3.40V=2.04% K=2.98V=0% K=3.00 Base MoreDB MoreApp MoreWeb Rule-based policies like THRES require tuning and are not robust Other auto-scaling policies require control of application DC2 is superior to THRES and does not require application control
36
36 Other applications Bottleneck analysis ─HeavyDB use case What-if analysis ─Optimal VM configuration Can be combined with forecasting models Can be combined with other system models ─ML-based instead of queueing models
37
37 Outline 1.Existing CSP solutions (survey) 2.Our solution: Dependable Compute Cloud (DC2) a)High-level idea behind DC2 b)DC2 system architecture c)Modeling + Optimization (Queueing) (Kalman filtering) 3.Evaluation RUBiS (eBay.com) multi-tier application Various traces Various workload mixes 4.Limitations and future work
38
38 Limitations and future work Evaluation limited to dynamic web applications ─Currently investigating Hadoop-type applications Only applies to stateless tiers ─DB scaling would be challenging Scaling algorithm can be further improved ─Add delayedoff Non-zero convergence time Tunable parameters ─Response time threshold (35ms) ─Monitoring interval (10s)
39
39 Conclusions Need for adaptive scaling services for (opaque) cloud applications ─Application agnostic ─Robust to arrival pattern and workload mix Existing commercial offerings do not suffice: rule-based Existing auto-scaling research solutions do not apply due to lack of visibility and control of opaque cloud applications Our solution: Dependable Compute Cloud (DC2) ─Does not require offline benchmarking or expert knowledge ─Can adapt to dynamic changes in workload Well suited for cloud users who lack expertise in system modeling and application knowledge
40
40 Thank You !
41
41 Conclusions Need for adaptive scaling services for (opaque) cloud applications ─Application agnostic ─Robust to arrival pattern and workload mix Existing commercial offerings do not suffice: rule-based Existing auto-scaling research solutions do not apply due to lack of visibility and control of opaque cloud applications Our solution: Dependable Compute Cloud (DC2) ─Does not require offline benchmarking or expert knowledge ─Can adapt to dynamic changes in workload Well suited for cloud users who lack expertise in system modeling and application knowledge
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.