The Mystery Machine: End-to-end performance analysis of large-scale Internet services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Internet services are complex
[Figure: tasks of a single request plotted over time]
Scale and heterogeneity make Internet services complex.
Analysis pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
Previous methods derive a causal model by:
– Instrumenting the scheduler and communication layer
– Building the model through human knowledge
We need a method that works at scale with heterogeneous components.
Opportunities
Component-level logging is ubiquitous, giving tremendous detail about each request's execution.
Logs capture a large number of requests, providing coverage of a large range of behaviors.
The Mystery Machine
1) Infer a causal model from a large corpus of traces:
– Identify segments
– Hypothesize all possible causal relationships
– Reject hypotheses with observed counterexamples
2) Analyze requests against the model:
– Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: a task's events delimit Segment 1 and Segment 2]
Each log entry carries five fields: request identifier, machine identifier, timestamp, task, event.
Aggregate existing logs using this minimal schema.
Step 2: Infer causal model
Types of causal relationships
Each hypothesized relationship is kept until a trace provides a counterexample:
– Happens-before: segment A always finishes before segment B begins; refuted by a trace in which B starts before A ends.
– Mutual exclusion: A and B never overlap in time (either A-then-B or B-then-A ordering is allowed); refuted by a trace in which they overlap.
– Pipeline: two tasks process items t1, t2, ... in the same order (A before A', B before B'); refuted by a trace in which the second task handles items out of the first task's order.
[Figure: timeline examples and counterexamples for each relationship]
Producing the causal model
[Figure: hypothesized causal model over segments S1, N1, C1 and S2, N2, C2]
Each observed trace (Trace 1, Trace 2, ...) records the actual timing of these segments. Hypothesized relationships that any trace contradicts are removed; the relationships that survive the whole corpus form the causal model.
Step 3: Analyze individual requests
Critical path using the causal model
[Figure: Trace 1 timeline over segments C1, N1, S1 and C2, N2, S2, with the critical path highlighted]
Applying the causal model to an individual trace identifies the critical path: the chain of dependent segments whose combined duration determines the request's end-to-end latency.
Step 4: Aggregate results
Inaccuracies of naïve aggregation
[Figure: example where naïvely aggregating segment latencies across requests gives a misleading picture]
We need a causal model to correctly understand latency.
High variance in the critical path
The critical-path breakdown shifts drastically: the server, network, or client can each dominate latency.
[Figure: CDFs of the percent of the critical path attributable to server, network, and client]
– For 20% of requests, the server contributes only 10% of latency.
– For 20% of requests, the server contributes 50% or more of latency.
Diverse clients and networks
[Figure: per-request breakdowns showing the server, network, and client shares of latency varying widely across connections]
Differentiated service
Goal: deliver data when needed and reduce average response time.
– No slack in server generation time: producing data faster decreases end-to-end latency.
– Slack in server generation time: producing data slower leaves end-to-end latency the same.
Additional analysis techniques
– Slack analysis
– What-if analysis, using the natural variation in the large data set
[Figure: trace timeline over segments C1, N1, S1 and C2, N2, S2]
What-if questions
– Does server generation time affect end-to-end latency?
– Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: end-to-end latency vs. server generation time, split into connections with slack < 25 ms and slack > 2.5 s]
– Slack < 25 ms: end-to-end latency increases as server generation time increases.
– Slack > 2.5 s: server generation time has little effect on end-to-end latency.
Predicting server slack
Predict slack at the receipt of a request: past slack is representative of future slack.
[Figure: scatter of first slack (ms) vs. second slack (ms) per connection]
The predictor classifies 83% of requests correctly (Type I error 8%, Type II error 9%).
Conclusion
Questions