Measurement-based Design Xingbang Liu, Shuting Li
Kraken: Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services Presenter: Xingbang Liu
Modern Web Services Hundreds of systems A large number of machines
Challenges
Challenges Infrastructure heterogeneity Changing bottlenecks
Can we serve peak load? How to identify bottlenecks? Are we operating efficiently?
Common approaches to capacity management Load modeling: simulate how system behaves at high load Load testing: benchmark using synthetic workloads
Live User Traffic Accurate distribution of reads & writes Do not need a custom test setup
Kraken: Measure peak serving capacity Monitor health metrics • Response latency • Server error Reset load when thresholds are hit
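The feedback loop on this slide (raise load, watch latency and errors, back off at a threshold) can be sketched as follows. This is a minimal illustration, not Kraken's implementation: `find_peak_load`, the `measure` probe, the `toy_server` model, and every threshold value are hypothetical.

```python
def find_peak_load(measure, start_rps, step_rps, max_latency_ms,
                   max_error_rate, limit_rps):
    """Ramp load until a health threshold is hit; return the last healthy load."""
    load = start_rps
    last_healthy = None
    while load <= limit_rps:
        latency_ms, error_rate = measure(load)  # hypothetical health probe
        if latency_ms > max_latency_ms or error_rate > max_error_rate:
            break  # threshold hit: stop ramping so load can be reset
        last_healthy = load
        load += step_rps
    return last_healthy


def toy_server(load_rps):
    # Hypothetical model: latency climbs sharply once load passes 150 RPS.
    latency_ms = 50 if load_rps <= 150 else 50 + (load_rps - 150) * 10
    return latency_ms, 0.0  # (latency in ms, error rate)


peak = find_peak_load(toy_server, start_rps=100, step_rps=10,
                      max_latency_ms=200, max_error_rate=0.01, limit_rps=300)
print(peak)
```

With the toy model, the ramp stops at the first load whose latency exceeds the 200 ms threshold, and the function reports the last load that stayed healthy.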
Contributions Kraken measures peak serving capacity at all scales A single web server A single cluster An entire geographical region Kraken identifies bottlenecks limiting utilization Kraken increases Facebook’s infrastructure utilization by over 20%
Overview of traffic management London users 0.7 Europe 0.2 With a smaller capacity 0.1 North America 0.8 0.1 0.1 with a larger capacity
Kraken shifts traffic to particular region or cluster Edge weights Cluster weights Server weights
Kraken shifts traffic to the target London users 0.7 Europe 0.2 0.1 Target Region 0.8 0.1 0.1
Kraken shifts traffic to the target London users 0.1 Europe 0.2 0.7 Target Region 0.8 0.1 0.1
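The two slides above show a weight moving from one destination to the target while the total stays at 1. A minimal sketch of that bookkeeping is below. The function name `shift_weight`, the region keys, and the mapping of each weight to a region are assumptions made for illustration only.

```python
def shift_weight(weights, source, target, amount):
    """Move `amount` of traffic share from `source` to `target`."""
    if amount < 0 or amount > weights[source]:
        raise ValueError("cannot shift more than the source carries")
    shifted = dict(weights)  # leave the caller's mapping untouched
    # Round to hide floating-point noise such as 0.09999999999999998.
    shifted[source] = round(shifted[source] - amount, 10)
    shifted[target] = round(shifted[target] + amount, 10)
    return shifted


# Hypothetical edge weights for one group of users.
before = {"europe": 0.7, "other": 0.2, "target": 0.1}
after = shift_weight(before, "europe", "target", 0.6)
print(after)
```

Because the same amount is subtracted from one entry and added to another, the weights still sum to 1 after the shift.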
Kraken Overview
Evaluation Does Kraken allow us to validate capacity measurements at various scales?
Measuring an individual web server’s capacity A: Load test begins and starts increasing load B: Kraken decreases load increments C: Queuing latency threshold is reached and load starts decreasing D: Load test ends Peak web server capacity: 175 requests per second (RPS) Target: 90% utilization, i.e., 157 RPS
Measuring a cluster’s capacity Max cluster capacity = (web server capacity) * (num. web servers in cluster)
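The arithmetic behind the per-server target and the cluster formula can be worked through in a few lines. The 175 RPS peak and the 90% target come from the slides. The server count of 100 is a made-up value, and the function names are hypothetical.

```python
def target_rps(peak_rps, utilization):
    """Load a server should carry at the chosen utilization target."""
    return peak_rps * utilization


def max_cluster_capacity(server_rps, num_servers):
    """Max cluster capacity = (web server capacity) * (num. web servers)."""
    return server_rps * num_servers


per_server_target = target_rps(175, 0.90)     # about 157.5, shown as 157 RPS
cluster_peak = max_cluster_capacity(175, 100)  # 100 servers is an assumption
print(per_server_target, cluster_peak)
```

This assumes every server in the cluster has the same measured capacity.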
Measuring a region’s capacity
Evaluation Does Kraken provide a useful methodology for increasing utilization?
Factors limiting utilization A: Load test begins B: Every 5 minutes, Kraken inspects the health of the cluster and decides how to shift traffic C: Response latency stays above the threshold level and the load test stops
Factors limiting utilization 1. Hash weights for cache Web server latency Cache latency Service latency Latency breakdown in cluster load test total latency = web server latency + cache latency + service latency
Factors limiting utilization 2. Network saturation Load test begins Load test starts decreasing load Load test ends 3% of all requests experience an error!
Factors limiting utilization 3. Poor load balancing Load test begins Load test starts decreasing load Load test ends Evident load imbalance!
Conclusion Kraken leverages live user traffic to empirically load test every level of the infrastructure stack to measure capacity. Describes a methodology for identifying bottlenecks to improve infrastructure utilization. Kraken increases Facebook’s capacity to serve users by over 20% using the same hardware.
Lessons from Kraken Great ideas and Revolutions Resistance
History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters Presenter: Shuting Li
Background: data centers are underutilized Data centers are massive and expensive. Utilization can be as low as 30%. Resources are overprovisioned for: Latency-critical services (require low tail response time) High peaks in user load Unexpected spikes and failures
Background: utilize the spare resources Co-locate useful batch workloads Challenge 1: Interactive services have higher priority (“primary tenant”). Resource-harvesting workloads might sometimes be killed (“secondary tenant”). Task killing wastes the cycles and resources already spent!
Background: utilize the spare resources Improve data availability and durability. Challenge 2: Many batch processing applications require massive data to operate. The space on the server might not be enough for the data block. If the owner decides to wipe the disk, the data might be lost permanently. Slowing down one task will affect the entire batch processing job.
Main goals Improve the efficiency without sacrificing quality of service. Minimize the probability of killing batch tasks Maximize data availability and durability
Batch task scheduling: is there a pattern? Time vs. utilization Frequency spectrum (figure labels: A: 31, B)
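One way to move from a time-versus-utilization trace to a frequency spectrum is a discrete Fourier transform. The sketch below uses a plain DFT from the standard library to find the dominant cycle in a synthetic trace. It is an illustration only: the function `dominant_frequency` and the trace are hypothetical, and the slides do not say which transform the authors use.

```python
import cmath
import math


def dominant_frequency(samples):
    """Return the non-zero DFT bin (cycles per window) with the largest magnitude."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # drop the constant component
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k


# Hypothetical trace: 7 days of hourly utilization with a daily cycle.
hours = 7 * 24
trace = [50 + 30 * math.sin(2 * math.pi * t / 24) for t in range(hours)]
cycles = dominant_frequency(trace)
print(cycles)
```

A strong peak at a single bin indicates a periodic tenant, which makes its idle periods easier to predict for batch scheduling.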
Batch task scheduling : make use of the patterns
Data storage co-location Store data on interactive-service nodes with diverse patterns within the same cluster. Maximize data availability and durability.
Replica placement Plot the servers into a 9*9 grid according to peak utilization and disk reimage rate. Ensure there is only one replica in any given row and column.
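The row-and-column rule can be sketched as a simple selection over grid cells, where each cell is a (peak utilization class, reimage rate class) pair. This is a greedy illustration under assumed inputs, not the placement algorithm from the paper. `place_replicas` and the cell list are hypothetical, and a greedy pass can miss a valid placement that a more careful search would find.

```python
def place_replicas(available_cells, num_replicas):
    """Greedily pick grid cells so no two replicas share a row or a column."""
    used_rows, used_cols, chosen = set(), set(), []
    for row, col in available_cells:
        if row in used_rows or col in used_cols:
            continue  # would put two replicas in the same row or column
        chosen.append((row, col))
        used_rows.add(row)
        used_cols.add(col)
        if len(chosen) == num_replicas:
            return chosen
    raise ValueError("not enough diverse cells for the requested replicas")


# Hypothetical cells that currently have free space.
cells = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
replicas = place_replicas(cells, 3)
print(replicas)
```

Spreading replicas this way means that no two copies sit on servers with the same utilization class or the same reimage class.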
System implementation Clustering service Extract utilization and reimaging patterns YARN-H Protect interactive services by killing batch tasks Tez-H History-based batch task scheduling HDFS-H History-based replica placement Protect interactive services by denying accesses
Evaluation: experiment environment Real-system deployment 102-server cluster Interactive service: Lucene with utilization trace Batch task: TPC-DS queries on Hive Large-scale simulation Trace from 10 production datacenters at Microsoft Full datacenters for one month Production environment deployment Data replica placement
Evaluation: Batch task scheduling
Evaluation: Batch task scheduling - simulation
Evaluation: Replica placement - durability Deployed to thousands of production servers for almost a year Eliminated data losses, except those caused by minor bugs and insufficient diversity
Conclusion History-based resource harvesting Resource utilization dynamics Data storage co-location Complex data analytics distributed across servers Significantly improve datacenter efficiency Can also be applied to resources other than CPU
Discussion(Kraken) How does Kraken measure the cluster capacity when it has heterogeneous servers? If a cluster has several kinds of servers, they use Kraken to first calculate each kind of server’s capacity to get the total cluster capacity.
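The answer above amounts to summing per-type capacity over the server types in the cluster. A minimal sketch follows, with made-up server types, capacities, and counts. `cluster_capacity` is a hypothetical helper, not part of Kraken.

```python
def cluster_capacity(server_types):
    """Sum (measured RPS per server) * (server count) over all server types."""
    return sum(rps * count for rps, count in server_types.values())


# Hypothetical fleet: type name -> (measured RPS per server, number of servers).
fleet = {"type_a": (175, 40), "type_b": (120, 60)}
total = cluster_capacity(fleet)  # 175*40 + 120*60
print(total)
```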
Discussion(Kraken) Does Kraken influence the system’s performance? The influence is likely small: Kraken monitors response time and error rate and resets load when thresholds are hit.
Discussion(Kraken) Can we apply Kraken to stateful servers?
Discussion Can we apply the techniques introduced in the paper to resources other than CPU? Yes. We can use the same approach to find patterns in historical data and classify the resources according to those patterns.
Discussion Are 3 different classes enough to cover all kinds of servers?