1
Metrics at Mantas Klasavičius
2
About Me Senior software engineer @ Uber
3
About Me Mantas Klasavičius
<metric_path> <value> <timestamp>
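The line above has the shape of a Graphite plaintext-protocol sample: a dotted metric path, a numeric value, and a Unix timestamp, separated by spaces. A minimal sketch of formatting and parsing such a line follows; the helper names `format_line` and `parse_line` are illustrative and do not come from the talk.

```python
import time
from typing import Optional, Tuple


def format_line(metric_path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one sample as '<metric_path> <value> <timestamp>'."""
    if timestamp is None:
        # Default to the current time in whole seconds since the epoch.
        timestamp = int(time.time())
    return f"{metric_path} {value} {timestamp}"


def parse_line(line: str) -> Tuple[str, float, int]:
    """Split a sample line back into (metric_path, value, timestamp)."""
    metric_path, value, timestamp = line.split()
    return metric_path, float(value), int(timestamp)


if __name__ == "__main__":
    line = format_line("stats.foo", 1.5, 1500000000)
    print(line)
    print(parse_line(line))
```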
4
UBER 6 continents 70 countries 400 cities 6 years 5 million a day ish engineers
5
UBER in Vilnius 15 engineers 3 Teams: Observability Databases Foundations 2y ago
6
Hypergrowth defines us... Growth of Services, week to week
7
Metrics
Metrics @UBER is a first class citizen L0 Service
Handling 500M telemetry timeseries
Writing 3M values/sec and running 1K queries/sec
Growing >25% month over month
8
Metrics Collection Graphite ~2013
9
Metrics Collection Graphite 2015 ~50 million whisper files
~600k value updates/s
10
Metrics Collection Considered choices
Update Graphite
Blueflood
Netflix Atlas
11
Metrics Collection M3
12
Metrics Collection M3
13
Metrics Collection
Cassandra is a figure of epic tradition and of tragedy.
High write throughput
Cassandra's data model supports a time series data store (DTCS)
Cassandra's native TTL support
14
Metrics Collection Cassandra - our use case
Separate clusters for different types of data
Clusters span multiple datacenters
Dynamically control to which cluster data is written
Forcibly deleting old data
15
Metrics Collection Metrics as free resource
*.application_ _0361.*
*. Connections.10_30_3_24.0x64d11081baa1837.*
*. ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*
*. check-<uid_or_uuid>.*
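The paths above embed per-instance identifiers (an IP address, a hex handle, a UUID), so every new instance creates a new series. One way to spot this, sketched here under assumptions, is to replace identifier-like segments with a placeholder and compare how many distinct paths remain. The patterns and the `normalize_path` helper are hypothetical and are not part of M3.

```python
import re

# Hypothetical patterns for path segments that embed unbounded identifiers.
UNBOUNDED_PATTERNS = [
    # UUID such as 1b09f59b-a3cf-4b9a-99b4-93e8eb16722c
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I),
    # Hex handle such as 0x64d11081baa1837
    re.compile(r"0x[0-9a-f]+", re.I),
    # IPv4 address written with underscores, such as 10_30_3_24
    re.compile(r"\d{1,3}(?:_\d{1,3}){3}"),
]


def normalize_path(path: str) -> str:
    """Replace identifier-like parts of each dotted segment with '<id>'."""
    segments = []
    for segment in path.split("."):
        for pattern in UNBOUNDED_PATTERNS:
            segment = pattern.sub("<id>", segment)
        segments.append(segment)
    return ".".join(segments)


if __name__ == "__main__":
    raw = [
        "svc.Connections.10_30_3_24.0x64d11081baa1837.count",
        "svc.Connections.10_30_3_25.0x64d11081baa1999.count",
    ]
    normalized = {normalize_path(p) for p in raw}
    print(len(set(raw)), "raw series ->", len(normalized), "after normalization")
```

A large gap between the raw and normalized counts suggests that a metric name is being used to carry identifiers rather than a bounded set of dimensions.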
16
Metrics Collection Cost accounting - metrics about metrics
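"Metrics about metrics" can be as simple as counting how many distinct series each owner emits. The sketch below assumes, purely for illustration, that the first dotted segment of a path identifies the owning service; the talk does not say how ownership is actually attributed.

```python
from collections import Counter
from typing import Iterable


def series_per_owner(paths: Iterable[str], depth: int = 1) -> Counter:
    """Count distinct series per owner, where the owner is the first `depth` segments."""
    counts: Counter = Counter()
    for path in set(paths):  # each distinct path is one series
        owner = ".".join(path.split(".")[:depth])
        counts[owner] += 1
    return counts


if __name__ == "__main__":
    observed = ["api.requests", "api.errors", "db.latency", "api.requests"]
    for owner, count in series_per_owner(observed).most_common():
        print(owner, count)
```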
17
Metrics Visualization
M3 - Querying
18
Metrics Visualization
Grafana
19
Metrics Visualization M3QL - Query Like It’s Bash
M3QL:
aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn
  | aggregate
  | asPercent (fetch name:requests caller:cn | aggregate)
  | anomalies
  | sort max
  | tail 10

Graphite:
tail(
  sort(
    anomalies(
      asPercent(
        sum(fillNulls(stats.counts.cn.*.requests.errors)),
        sum(fillNulls(stats.counts.cn.*.requests))
      )
    ),
    max
  ),
  10
)
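The difference between the two query styles is the same as the difference between nested function calls and a left-to-right pipeline. The sketch below shows that equivalence on plain Python lists; `pipe`, `fill_nulls`, and `as_percent` are illustrative stand-ins and not the M3QL or Graphite implementations.

```python
from functools import reduce
from typing import Callable, List, Optional


def pipe(value, *stages: Callable):
    """Feed `value` through each stage in order, like `a | b | c`."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


def fill_nulls(series: List[Optional[float]], default: float = 0.0) -> List[float]:
    """Replace missing points with a default value."""
    return [default if v is None else v for v in series]


def as_percent(numerator: List[float], denominator: List[float]) -> List[float]:
    """Point-wise percentage; a zero denominator yields 0.0."""
    return [100.0 * n / d if d else 0.0 for n, d in zip(numerator, denominator)]


errors = [1.0, None, 3.0]
requests = [10.0, 20.0, 30.0]

# Nested style: read from the inside out.
nested = as_percent(fill_nulls(errors), fill_nulls(requests))

# Pipeline style: read from left to right.
piped = pipe(
    errors,
    fill_nulls,
    lambda s: as_percent(s, pipe(requests, fill_nulls)),
)

if __name__ == "__main__":
    print(nested == piped)
```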
20
Metrics Visualization Graphite Way vs. M3QL
21
Alerting based on metrics Query Based Alerting
To complement our new timeseries database, we built our own alert configuration front-end. It alleviated many of the scale problems we had with Nagios, kept queries flexible, added pluggable outputs, and gave us a place to add future functionality.
graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
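The `warning_over` and `critical_over` parameters suggest a simple comparison of the query result against two levels. The sketch below is one plausible reading of that behaviour, not the actual alerting code: the function name `evaluate_absolute_threshold`, the use of the latest non-null point, and the `unknown` state are all assumptions.

```python
from typing import List, Optional


def evaluate_absolute_threshold(
    values: List[Optional[float]],
    warning_over: float,
    critical_over: float,
) -> str:
    """Classify the latest non-null value as 'ok', 'warning', or 'critical'."""
    # Walk backwards to find the most recent point that has data.
    latest = next((v for v in reversed(values) if v is not None), None)
    if latest is None:
        return "unknown"  # assumed state when the query returned no data
    if latest > critical_over:
        return "critical"
    if latest > warning_over:
        return "warning"
    return "ok"


if __name__ == "__main__":
    blocked_rate = [0.0, 0.02, 0.5, None]
    print(evaluate_absolute_threshold(blocked_rate, warning_over=0.1, critical_over=10.0))
```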
22
Alerting based on metrics Classic Thresholding
Classic high / low thresholds have some intrinsic problems.
Labor-intensive: each threshold is hand-tuned and manually updated.
Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
Not sensitive enough: thresholds take a long time to catch slow degradations.
Poor UX: configuring really good alerts requires specialized knowledge of the Graphite query language.
No guidance: system doesn’t offer automated root cause exploration.
23
Alerting based on metrics Intelligent Monitoring
Zero config: thresholds are set and maintained automatically.
Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
Integrated UX: works within our existing telemetry and alert configuration systems.
Helpful: automated root cause analysis.
In short, the only input is a list of business-critical metrics.
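The talk does not describe the algorithm behind these automatically maintained thresholds. As a generic illustration of the idea only, the sketch below derives bounds from recent history using the median and the median absolute deviation (MAD). This is a swapped-in textbook technique, not Uber's method, and the multiplier `k` is an arbitrary assumption.

```python
from statistics import median
from typing import Sequence, Tuple


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds as median +/- k * MAD of recent history."""
    center = median(history)
    mad = median(abs(v - center) for v in history)
    return center - k * mad, center + k * mad


def is_anomalous(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Flag a value that falls outside the bounds derived from history."""
    lower, upper = dynamic_bounds(history, k)
    return value < lower or value > upper


if __name__ == "__main__":
    recent = [10, 11, 9, 10, 12, 10, 11]
    print(dynamic_bounds(recent))
    print(is_anomalous(recent, 11), is_anomalous(recent, 20))
```

Because the bounds are recomputed from a sliding window, they follow gradual growth without manual retuning. Note that a perfectly flat history gives a MAD of zero, so any deviation would be flagged; a real system would need a floor on the band width and some handling of seasonality.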
24
Alerting based on metrics Dynamic Thresholds
The max lower threshold exceeds the min upper threshold
25
Alerting based on metrics Outage Detection
< 1% outages missed. 6.5 out of 10 alerts are true issues.
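In standard terms, "6.5 out of 10 alerts are true issues" is a precision of 0.65, and "< 1% outages missed" is a recall above 0.99. The sketch below shows the arithmetic with made-up counts chosen only to reproduce those ratios; the talk gives no absolute numbers.

```python
def precision(true_alerts: int, false_alerts: int) -> float:
    """Share of fired alerts that corresponded to real issues."""
    return true_alerts / (true_alerts + false_alerts)


def recall(detected_outages: int, missed_outages: int) -> float:
    """Share of real outages that triggered an alert."""
    return detected_outages / (detected_outages + missed_outages)


if __name__ == "__main__":
    # Hypothetical counts, not figures from the talk.
    print(precision(true_alerts=65, false_alerts=35))
    print(recall(detected_outages=99, missed_outages=1))
```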
26
Alerting based on metrics F3
stats.foo anomalies(stats.foo)
27
On-Call Dashboard
28
Tracing Richer data collection
Zipkin-style distributed tracing (opentracing.io)
Event markers (non-timeseries context)
Automated root cause analysis
29
Thank You!