The Mystery Machine: End-to-end performance analysis of large-scale Internet services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Internet services are complex
[Figure: tasks of a single request plotted over time]
Scale and heterogeneity make Internet services complex.
Analysis pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
Previous methods derive a causal model by:
– Instrumenting the scheduler and communication layer
– Building the model through human knowledge
We need a method that works at scale with heterogeneous components.
Opportunities
Component-level logging is ubiquitous, giving tremendous detail about each request's execution.
Logs capture a large number of requests, providing coverage of a large range of behaviors.
The Mystery Machine
1) Infer a causal model from a large corpus of traces:
– Identify segments
– Hypothesize all possible causal relationships
– Reject hypotheses with observed counterexamples
2) Analyze requests against the model:
– Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: a task's events delimit Segment 1 and Segment 2]
Each log entry carries five fields: request identifier, machine identifier, timestamp, task, event.
Aggregate existing logs using this minimal schema.
Step 2: Infer causal model
Types of causal relationships
Each hypothesized relationship is kept until a trace provides a counterexample:
– Happens-before: segment A always finishes before segment B begins; refuted by a trace in which B starts before A ends.
– Mutual exclusion: A and B never overlap in time (either A-then-B or B-then-A ordering is allowed); refuted by a trace in which they overlap.
– Pipeline: two tasks process items t1, t2, ... in the same order (A before A', B before B'); refuted by a trace in which the second task handles items out of the first task's order.
[Figure: timeline examples and counterexamples for each relationship]
Producing the causal model
[Figure: hypothesized causal model over segments S1, N1, C1 and S2, N2, C2]
Each observed trace (Trace 1, Trace 2, ...) records the actual timing of these segments. Hypothesized relationships that any trace contradicts are removed; the relationships that survive the whole corpus form the causal model.
Step 3: Analyze individual requests
Critical path using the causal model
[Figure: Trace 1 timeline over segments C1, N1, S1 and C2, N2, S2, with the critical path highlighted]
Applying the causal model to an individual trace identifies the critical path: the chain of dependent segments whose combined duration determines the request's end-to-end latency.
Step 4: Aggregate results
Inaccuracies of naïve aggregation
[Figure: example where naïvely aggregating segment latencies across requests gives a misleading picture]
We need a causal model to correctly understand latency.
High variance in the critical path
The critical-path breakdown shifts drastically: the server, network, or client can each dominate latency.
[Figure: CDFs of the percent of the critical path attributable to server, network, and client]
– For 20% of requests, the server contributes only 10% of latency.
– For 20% of requests, the server contributes 50% or more of latency.
Diverse clients and networks
[Figure: per-request breakdowns showing the server, network, and client shares of latency varying widely across connections]
Differentiated service
Goal: deliver data when needed and reduce average response time.
– No slack in server generation time: producing data faster decreases end-to-end latency.
– Slack in server generation time: producing data slower leaves end-to-end latency the same.
Additional analysis techniques
– Slack analysis
– What-if analysis, using the natural variation in the large data set
[Figure: trace timeline over segments C1, N1, S1 and C2, N2, S2]
What-if questions
– Does server generation time affect end-to-end latency?
– Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: end-to-end latency vs. server generation time, split into connections with slack < 25 ms and slack > 2.5 s]
– Slack < 25 ms: end-to-end latency increases as server generation time increases.
– Slack > 2.5 s: server generation time has little effect on end-to-end latency.
Predicting server slack
Predict slack at the receipt of a request: past slack is representative of future slack.
[Figure: scatter of first slack (ms) vs. second slack (ms) per connection]
The predictor classifies 83% of requests correctly (Type I error 8%, Type II error 9%).
Conclusion
Questions