Measurement-based Design

1 Measurement-based Design
Xingbang Liu, Shuting Li

2 Kraken: Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services
Presenter: Xingbang Liu

3 Modern Web Services
• Hundreds of systems
• A large number of machines

4 Challenges

5 Challenges
• Infrastructure heterogeneity
• Changing bottlenecks

6 Can we serve peak load? How to identify bottlenecks? Are we operating efficiently?

7 Common approaches to capacity management
• Load modeling: simulate how the system behaves at high load
• Load testing: benchmark using synthetic workloads

8 Live User Traffic
• Accurate distribution of reads & writes
• Does not need a custom test setup

9 Kraken: Measure peak serving capacity
Monitor health metrics
• Response latency
• Server error
Reset load when thresholds are hit
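The feedback loop on this slide can be sketched roughly as below. This is only an illustration under stated assumptions, not Kraken's implementation: `run_load_test`, the `read_health` callback, the step size, and the threshold values are all hypothetical names and parameters.

```python
from typing import Callable, Tuple


def run_load_test(
    read_health: Callable[[float], Tuple[float, float]],
    base_load: float,
    step: float,
    max_latency_ms: float,
    max_error_rate: float,
    max_steps: int = 100,
) -> float:
    """Raise load step by step until a health threshold is hit.

    read_health(load) is a hypothetical probe returning
    (response_latency_ms, server_error_rate) observed at that load.
    Returns the highest load at which both metrics stayed within their
    thresholds, or base_load if even the first probe was unhealthy.
    """
    load = base_load
    peak_healthy = base_load
    for _ in range(max_steps):
        latency_ms, error_rate = read_health(load)
        if latency_ms > max_latency_ms or error_rate > max_error_rate:
            break  # threshold hit: stop here; the caller resets to base_load
        peak_healthy = load
        load += step
    return peak_healthy


# Toy health model (assumption): latency in ms equals the load, no errors.
peak = run_load_test(lambda load: (load, 0.0), base_load=50, step=10,
                     max_latency_ms=100, max_error_rate=0.01)
```

With this toy model the loop stops at the first load whose latency exceeds 100 ms and reports the last healthy load.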

10 Contributions
Kraken measures peak serving capacity at all scales
• A single web server
• A single cluster
• An entire geographical region
Kraken identifies bottlenecks limiting utilization
Kraken increases Facebook’s infrastructure utilization by over 20%

11 Overview of traffic management
London users 0.7 Europe 0.2 With a smaller capacity 0.1 North America 0.8 0.1 0.1 with a larger capacity

12 Kraken shifts traffic to a particular region or cluster
• Edge weights
• Cluster weights
• Server weights

13 Kraken shifts traffic to the target
London users 0.7 Europe 0.2 0.1 Target Region 0.8 0.1 0.1

14 Kraken shifts traffic to the target
London users 0.1 Europe 0.2 0.7 Target Region 0.8 0.1 0.1
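The change between slides 13 and 14 can be read as moving 0.6 of the London users' edge weight from Europe to the target region (0.7 / 0.2 / 0.1 becomes 0.1 / 0.2 / 0.7). A minimal sketch of that adjustment follows, assuming this reading of the figure labels; `shift_weight` and the destination name `"other"` are hypothetical, since the slide does not name the middle destination.

```python
from typing import Dict


def shift_weight(weights: Dict[str, float], source: str, target: str,
                 amount: float) -> Dict[str, float]:
    """Move `amount` of routing weight from `source` to `target`.

    The total weight is unchanged, so the weights still sum to 1.
    Returns a new dict and leaves the input untouched.
    """
    if amount < 0 or amount > weights[source]:
        raise ValueError("amount must be between 0 and the source weight")
    shifted = dict(weights)
    shifted[source] -= amount
    shifted[target] += amount
    return shifted


# Edge weights for London users as read from slide 13 (assumed mapping).
before = {"europe": 0.7, "other": 0.2, "target": 0.1}
after = shift_weight(before, "europe", "target", 0.6)
```

Because the values are floats, compare results with a tolerance rather than exact equality.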

15 Kraken Overview

16 Evaluation Does Kraken allow us to validate capacity measurements at various scales?

17 Measuring an individual web server’s capacity
A: Load test begins and starts increasing load
B: Kraken decreases load increments
C: Reach queuing latency threshold and start decreasing load
D: Load test ends
Peak web server capacity: 175 requests per second (RPS)
Target: 90% utilization, i.e., 157 RPS

18 Measuring a cluster’s capacity
Max cluster capacity = (web server capacity) * (num. web servers in cluster)
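As a worked example of this formula, the sketch below multiplies per-server capacity by server count and sums over server types, which also covers the heterogeneous case raised in the discussion on slide 42. The 175 RPS figure comes from slide 17; the server counts, the second server type, and the function name `cluster_capacity` are hypothetical.

```python
from typing import Dict, Tuple


def cluster_capacity(server_groups: Dict[str, Tuple[float, int]]) -> float:
    """Sum (per-server capacity in RPS) * (number of servers) per server type."""
    return sum(rps * count for rps, count in server_groups.values())


# One server type: 100 servers (assumed count) at 175 RPS each.
homogeneous = cluster_capacity({"web": (175.0, 100)})

# Two server types with made-up capacities and counts.
mixed = cluster_capacity({"type_a": (175.0, 60), "type_b": (120.0, 40)})
```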

19 Measuring a region’s capacity

20 Evaluation Does Kraken provide a useful methodology for increasing utilization?

21 Factors limiting utilization
A: Load test begins
B: Every 5 minutes, Kraken inspects the health of the cluster and decides how to shift traffic
C: Response latency stays above the threshold level and the load test stops

22 Factors limiting utilization
1. Hash weights for cache
Latency breakdown in cluster load test: web server latency, cache latency, service latency
total latency = web server latency + cache latency + service latency

23 Factors limiting utilization
2. Network saturation
[Figure annotations: load test begins; load test starts decreasing load; load test ends]
3% of all requests experience an error!

24 Factors limiting utilization
3. Poor load balancing
[Figure annotations: load test begins; load test starts decreasing load; load test ends]
Evident load imbalance!

25 Conclusion
• Kraken leverages live user traffic to empirically load test every level of the infrastructure stack to measure capacity.
• It describes a methodology for identifying bottlenecks to improve infrastructure utilization.
• Kraken increases Facebook’s capacity to serve users by over 20% using the same hardware.

26 Lessons from Kraken Great ideas and Revolutions Resistance

27 History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters
Presenter: Shuting Li

28 Background: data centers are underutilized
Data centers are massive and expensive, yet utilization can be as low as 30%.
Resources are overprovisioned because of:
• Latency-critical services (require low tail response time)
• High peaks in user load
• Unexpected spikes and failures

29 Background: utilize the spare resources
Co-locate useful batch workloads.
Challenge 1:
• Interactive services have higher priority (“primary tenant”).
• Resource-harvesting workloads (“secondary tenant”) might sometimes be killed.
• Task killing wastes the cycles and resources already spent.

30 Background: utilize the spare resources
Improve data availability and durability.
Challenge 2:
• Many batch processing applications require massive data to operate.
• The space on the server might not be enough for the data block.
• If the owner decides to wipe the disk, the data might be lost permanently.
• Slowing down one task will affect the entire batch processing job.

31 Main goals
Improve efficiency without sacrificing quality of service.
• Minimize the probability of killing batch tasks
• Maximize data availability and durability

32 Batch task scheduling: is there a pattern?
[Figure panels: time vs. utilization; frequency spectrum]
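One way to look for a pattern in a utilization trace is to inspect its frequency spectrum, as the figure on this slide suggests. The sketch below is an assumption about how that could be done, not the authors' method: it computes a plain discrete Fourier transform in pure Python and reports the period of the strongest frequency. The function name `dominant_period` and the synthetic trace are hypothetical.

```python
import cmath
import math
from typing import Optional, Sequence


def dominant_period(samples: Sequence[float]) -> Optional[float]:
    """Return the period (in samples) of the strongest frequency component.

    Subtracts the mean first so the constant component does not dominate.
    Returns None when the trace has no non-constant component.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient of the mean-centered trace.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


# Synthetic trace (assumption): 4 days of hourly samples with a daily cycle.
daily = [0.5 + 0.3 * math.sin(2 * math.pi * t / 24) for t in range(96)]
period = dominant_period(daily)
flat = dominant_period([0.5] * 48)
```

A trace with a clear peak in the spectrum would be treated as periodic, while a flat trace yields no dominant period.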

33 Batch task scheduling: make use of the patterns

34 Data storage co-location
Save diverse pattern in the same cluster of interactive service node. Maximize data availability and durability.

35 Replica placement
Plot the servers into a 9*9 grid according to peak utilization and disk reimage rate.
Ensure that each row and each column holds at most one replica.
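The row-and-column constraint can be illustrated with a small sketch. It is a simplified assumption, not the placement algorithm from the paper: it picks grid cells for the replicas at random so that no two cells share a row or a column, and it ignores free space and server selection within a cell. The name `pick_cells` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def pick_cells(grid_size: int, replicas: int,
               rng: Optional[random.Random] = None) -> List[Tuple[int, int]]:
    """Choose one (row, column) grid cell per replica.

    Rows index one dimension (e.g. peak utilization class) and columns the
    other (e.g. disk reimage rate class). Sampling rows and columns without
    replacement guarantees that no two replicas share a row or a column.
    """
    if replicas > grid_size:
        raise ValueError("cannot place more replicas than rows/columns")
    rng = rng or random.Random()
    rows = rng.sample(range(grid_size), replicas)
    cols = rng.sample(range(grid_size), replicas)
    return list(zip(rows, cols))


# Three replicas on a grid of the size stated on the slide, seeded for repeatability.
cells = pick_cells(grid_size=9, replicas=3, rng=random.Random(0))
```

Spreading replicas across distinct rows and columns means that servers sharing the same utilization or reimaging behavior hold at most one copy of a block.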

36 System implementation
• Clustering service: extract utilization and reimaging patterns
• YARN-H: protect interactive services by killing batch tasks
• Tez-H: history-based batch task scheduling
• HDFS-H: history-based replica placement; protect interactive services by denying accesses

37 Evaluation: experiment environment
Real-system deployment
• 102-server cluster
• Interactive service: Lucene with utilization trace
• Batch task: TPC-DS queries on Hive
Large-scale simulation
• Trace from 10 production datacenters at Microsoft
• Full datacenters for one month
Production environment deployment
• Data replica placement

38 Evaluation: Batch task scheduling

39 Evaluation: Batch task scheduling - simulation

40 Evaluation: Replica placement - durability
• Deployed to thousands of production servers for almost a year
• Eliminated data losses, except those caused by minor bugs and insufficient diversity

41 Conclusion
History-based resource harvesting
• Resource utilization dynamics
• Data storage co-location
• Complex data analytics distributed across servers
Significantly improves datacenter efficiency
Can also be applied to resources other than CPU

42 Discussion (Kraken)
How does Kraken measure cluster capacity when the cluster has heterogeneous servers?
If a cluster has several kinds of servers, Kraken is first used to measure each kind of server’s capacity, and these are combined to get the total cluster capacity.

43 Discussion (Kraken)
Does Kraken influence the system’s performance?
The influence might be small. Kraken uses response time and error rate as metrics.

44 Discussion(Kraken) Can we apply Kraken to stateful servers?

45 Discussion
Can we apply the techniques introduced in the paper to resources other than CPU?
Yes. We can find patterns in historical data in the same way and classify the resources according to those patterns.

46 Discussion Are 3 different classes enough to include all kinds of the servers?

