Metrics at scale @UBER Mantas Klasavičius.

Presentation transcript:

1 Metrics at scale @UBER. Mantas Klasavičius

2 About Me Senior software engineer @ Uber

3 About Me Mantas Klasavičius
<metric_path> <value> <timestamp>
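The `<metric_path> <value> <timestamp>` triple has the same shape as a line in Graphite's plaintext protocol. As a minimal sketch, assuming that is the format meant here, one sample can be rendered and parsed like this (the metric name `servers.web01.requests` and both function names are made up for illustration):

```python
import time
from typing import Optional, Tuple


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one sample as '<metric_path> <value> <timestamp>' plus a newline."""
    if any(ch.isspace() for ch in path):
        raise ValueError("metric path must not contain whitespace")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def parse_metric(line: str) -> Tuple[str, float, int]:
    """Split a line back into (path, value, timestamp)."""
    path, value, ts = line.split()
    return path, float(value), int(ts)


# Hypothetical metric name and timestamp, for illustration only.
line = format_metric("servers.web01.requests", 42, 1500000000)
```

Each sample is one whitespace-separated line, which is why the path itself cannot contain spaces.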

4 UBER
6 continents, 70 countries, 400 cities, 6 years, 5 million a day, ish engineers

5 UBER in Vilnius
15 engineers, 2 years ago. 3 teams: Observability, Databases, Foundations

6 Hypergrowth defines us...
Growth of services, week to week

7 Metrics
Metrics @UBER is a first-class citizen: an L0 service.
Handling 500M telemetry time series. Writing 3M values/sec and running 1K queries/sec. Growing >25% month over month.
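To put the growth figure in perspective, here is a quick compounding calculation. It assumes growth holds steady at exactly 25% per month; the slide says ">25%", so these numbers are a lower bound.

```python
import math

monthly_growth = 0.25  # assumption: exactly 25% per month (slide says ">25%")

# Compounded over 12 months.
yearly_factor = (1 + monthly_growth) ** 12

# Months needed for the volume to double at this rate.
doubling_months = math.log(2) / math.log(1 + monthly_growth)

print(f"growth per year: {yearly_factor:.2f}x")
print(f"doubling time: {doubling_months:.1f} months")
```

At that rate the load grows roughly 14.5 times per year and doubles about every three months.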

8 Metrics Collection Graphite ~2013

9 Metrics Collection Graphite 2015
~50 million whisper files, ~600k value updates/s

10 Metrics Collection
Considered choices: update Graphite, Blueflood, Netflix Atlas

11 Metrics Collection M3

12 Metrics Collection M3

13 Metrics Collection
Cassandra is a figure of epic tradition and of tragedy.
High write throughput.
Cassandra data model supports a time-series data store (DTCS).
Cassandra's native TTL support.
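The DTCS and TTL points correspond to two table-level options in CQL: `compaction` and `default_time_to_live`. The sketch below builds such a statement as a string. The keyspace, table, and column names are hypothetical and not taken from the talk, and the two-day TTL is an arbitrary example value.

```python
def create_table_cql(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Build a CREATE TABLE statement for a time-series table (hypothetical schema)."""
    return "\n".join([
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (",
        "    metric_path text,",
        "    ts timestamp,",
        "    value double,",
        "    PRIMARY KEY (metric_path, ts)",
        ") WITH CLUSTERING ORDER BY (ts DESC)",
        # DTCS groups SSTables by write time, so old data tends to sit in the same files.
        "  AND compaction = {'class': 'DateTieredCompactionStrategy'}",
        # Native TTL: rows expire automatically after ttl_seconds.
        f"  AND default_time_to_live = {ttl_seconds};",
    ])


# Example: keep samples for two days (arbitrary value for illustration).
cql = create_table_cql("metrics", "samples", 2 * 24 * 3600)
print(cql)
```

Later Cassandra releases recommend TimeWindowCompactionStrategy over DTCS, so the compaction class would likely differ on a current cluster.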

14 Metrics Collection Cassandra - our use case
Separate clusters for different types of data.
Clusters span multiple datacenters.
Dynamic control over which cluster data is written to.
Forcibly deleting old data.

15 Metrics Collection
Metrics as free resource
*.application_ _0361.*
*. Connections.10_30_3_24.0x64d11081baa1837.*
*. ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*
*. check-<uid_or_uuid>.*
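The metric names above embed IP addresses, hex identifiers, and UUIDs, so every host, connection, or run creates a new time series. One way to spot such names is to collapse the unique parts into placeholders and see how many raw paths map to the same template. This is a sketch of that idea only, not the tooling from the talk; the patterns and the `app`, `count`, and `latency` path components are invented for illustration.

```python
import re

# Hypothetical patterns for path segments that are unique per host, connection, or run.
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)
IPV4 = re.compile(r"^\d{1,3}(_\d{1,3}){3}$")  # dots already replaced by underscores
HEX = re.compile(r"^0x[0-9a-f]+$", re.I)


def normalize_segment(segment: str) -> str:
    """Replace a high-cardinality segment with a placeholder."""
    if IPV4.match(segment):
        return "<ip>"
    if HEX.match(segment):
        return "<hex>"
    return UUID.sub("<uuid>", segment)


def normalize_path(path: str) -> str:
    """Normalize every dot-separated segment of a metric path."""
    return ".".join(normalize_segment(s) for s in path.split("."))


# The "app" prefix and trailing names are made up; the middle segments come from the slide.
a = normalize_path("app.Connections.10_30_3_24.0x64d11081baa1837.count")
b = normalize_path("app.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.latency")
```

Counting raw paths per normalized template gives a rough measure of which producers treat metrics as a free resource.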

16 Metrics Collection Cost accounting - metrics about metrics

17 Metrics Visualization
M3 - Querying

18 Metrics Visualization
Grafana

19 Metrics Visualization M3QL - Query Like It’s Bash
M3QL:
aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn | aggregate | asPercent (fetch name:requests caller:cn | aggregate) | anomalies | sort max | tail 10
Graphite:
tail( sort( anomalies( asPercent( sum(fillNulls(stats.counts.cn.*.requests.errors)), sum(fillNulls(stats.counts.cn.*.requests)) ), max ) ), 10 )
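The difference between the piped and the nested form is purely one of reading order. The toy sketch below shows the same computation written both ways. The helper functions are hypothetical stand-ins that operate on a plain list; they are not M3QL or Graphite functions.

```python
from functools import reduce


def pipe(value, *stages):
    """Apply stages left to right: pipe(x, f, g) == g(f(x))."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


# Toy stand-ins (hypothetical) operating on a list of numbers.
def fill_nulls(series):
    return [0 if v is None else v for v in series]


def sort_desc(series):
    return sorted(series, reverse=True)


def head(n):
    return lambda series: series[:n]


data = [3, None, 7, 1]

# Nested style: read from the inside out.
nested = head(2)(sort_desc(fill_nulls(data)))

# Pipe style: read left to right, like a shell pipeline.
piped = pipe(data, fill_nulls, sort_desc, head(2))
```

Both forms produce the same result; the piped form lists the steps in the order they run.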

20 Metrics Visualization Graphite Way vs. M3QL

21 Alerting based on metrics Query Based Alerting
To complement our new time-series database, we built our own alert configuration front end. It alleviated many of the scale problems we had with Nagios, kept queries flexible, added pluggable outputs, and gave us a place to add future functionality.
graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
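The parameter names `warning_over` and `critical_over` suggest a simple comparison of the query result against two levels. A minimal sketch of that check follows, assuming a strict "greater than" comparison; the function is illustrative and not the implementation of the alerting front end.

```python
def absolute_threshold(value: float, warning_over: float, critical_over: float) -> str:
    """Classify a query result against warning and critical levels.

    Assumption: a value must be strictly greater than a level to trigger it.
    """
    if value > critical_over:
        return "critical"
    if value > warning_over:
        return "warning"
    return "ok"


# Levels taken from the example configuration: warning_over=0.1, critical_over=10.0.
for sample in (0.05, 0.5, 12.0):
    print(sample, absolute_threshold(sample, warning_over=0.1, critical_over=10.0))
```

The critical level is checked first so that a value above both levels reports the more severe state.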

22 Alerting based on metrics Classic Thresholding
Classic high / low thresholds have some intrinsic problems.
Labor-intensive: each threshold is hand-tuned and manually updated.
Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there's an obvious pattern.
Not sensitive enough: thresholds take a long time to catch slow degradations.
Poor UX: configuring really good alerts requires specialized knowledge of the Graphite query language.
No guidance: system doesn't offer automated root cause exploration.

23 Alerting based on metrics Intelligent Monitoring
Zero config: thresholds are set and maintained automatically.
Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
Integrated UX: works within our existing telemetry and alert configuration systems.
Helpful: automated root cause analysis.
In short, the only input is a list of business-critical metrics.
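The talk does not describe the algorithm behind these thresholds. As a deliberately simple stand-in, the sketch below derives bounds from recent history as mean plus or minus k standard deviations. It illustrates thresholds that follow the data instead of being hand-tuned, but it does not handle seasonality or rollouts; the function names and sample values are made up.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds as mean +/- k sample standard deviations."""
    m = mean(history)
    s = stdev(history)
    return m - k * s, m + k * s


def is_anomalous(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Flag a new value that falls outside the bounds derived from history."""
    lower, upper = dynamic_bounds(history, k)
    return value < lower or value > upper


# Hypothetical recent values of a metric.
history = [100, 102, 98, 101, 99]
print(dynamic_bounds(history))
print(is_anomalous(history, 103), is_anomalous(history, 120))
```

Because the bounds are recomputed from recent data, they widen for noisy metrics and shift as the baseline grows, with no per-metric configuration.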

24 Alerting based on metrics Dynamic Thresholds
The max lower threshold exceeds the min upper threshold

25 Alerting based on metrics Outage Detection
< 1% outages missed. 6.5 out of 10 alerts are true issues.
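In classification terms, "outages missed" relates to recall and "alerts that are true issues" to precision. The sketch below shows how the two figures map onto those definitions. The counts are invented solely to reproduce the stated ratios and are not data from the talk.

```python
def precision(true_alerts: int, false_alerts: int) -> float:
    """Share of fired alerts that corresponded to real issues."""
    return true_alerts / (true_alerts + false_alerts)


def recall(detected_outages: int, missed_outages: int) -> float:
    """Share of real outages that triggered an alert."""
    return detected_outages / (detected_outages + missed_outages)


# Hypothetical counts chosen to match the slide's ratios.
p = precision(true_alerts=65, false_alerts=35)        # 6.5 out of 10 alerts are true issues
r = recall(detected_outages=99, missed_outages=1)     # 1% of outages missed (slide: under 1%)
print(p, r)
```

The system is tuned toward high recall: it accepts roughly one false alert in three in exchange for missing almost no outages.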

26 Alerting based on metrics F3
stats.foo anomalies(stats.foo)

27 On-Call Dashboard

28 Tracing Richer data collection
Zipkin-style distributed tracing (opentracing.io).
Event markers (non-timeseries context).
Automated root cause analysis.

29 Thank You!

