Metrics at scale @UBER Mantas Klasavičius.

Name: Metrics at scale @UBER Mantas Klasavičius.
Uploaded: 2017-10-11T09:56:03+00:00
Duration: PTM7S40
Channel: Priscilla Snow
Description: Metrics at scale @UBER Mantas Klasavičius.

Metrics at Mantas Klasavičius

About Me
Mantas Klasavičius
Senior software engineer @ Uber
<metric_path> <value> <timestamp>
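
The `<metric_path> <value> <timestamp>` line has the shape of Graphite's plaintext protocol, where each datapoint is one newline-terminated line sent to a carbon listener (port 2003 by default). A minimal sketch of formatting and sending such a line follows; the function names and the metric name in the usage note are hypothetical.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one datapoint as '<metric_path> <value> <timestamp>\\n'."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(host: str, line: str, port: int = 2003) -> None:
    """Send one formatted line over TCP. The host is whatever carbon endpoint you run."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, `format_metric("stats.foo", 42, 1507715763)` yields the line `stats.foo 42 1507715763` followed by a newline.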

UBER: 6 continents, 70 countries, 400 cities, 6 years, 5 million a day, ish engineers

UBER in Vilnius
15 engineers
3 Teams: Observability, Databases, Foundations
2y ago

Hypergrowth defines us...
Growth of Services, week to week

Metrics
Metrics @UBER is a first class citizen, L0 Service
Handling 500M telemetry timeseries
Writing 3M values/sec and running 1K queries/sec
Growing >25% month over month
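
To put the ">25% month over month" figure in perspective, the compounding can be worked out directly. The sketch below treats 25% as a lower bound and assumes the rate stays constant, which the slide does not claim; the variable names are illustrative.

```python
import math

monthly_growth = 0.25  # lower bound taken from the slide (">25%")

# Compounded over 12 months: (1 + r) ** 12
yearly_factor = (1 + monthly_growth) ** 12

# Months needed to double: ln(2) / ln(1 + r)
doubling_months = math.log(2) / math.log(1 + monthly_growth)

print(f"growth factor per year: at least {yearly_factor:.1f}x")
print(f"doubling time: at most {doubling_months:.1f} months")
```

At a steady 25% per month the load grows roughly 14.5x in a year and doubles about every three months.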

Metrics Collection Graphite ~2013

Metrics Collection Graphite 2015 ~50 million whisper files
~600k value updates/s

Metrics Collection
Considered choices
Update graphite
Blueflood
Netflix Atlas

Metrics Collection M3

Metrics Collection
Cassandra is a figure of epic tradition and of tragedy.
High write throughput
Cassandra data model supports time series data-store - DTCS
Cassandra's native TTL support
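
As an illustration of how DTCS and native TTL appear in a table definition, the sketch below builds a CQL `CREATE TABLE` statement. The keyspace, table, and column names are hypothetical and are not the schema used in the talk; only the `compaction` class and the `default_time_to_live` table option are standard Cassandra settings.

```python
def build_create_table(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Return CQL for a time-series table with DTCS compaction and a default TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (\n"
        "    metric_id text,\n"      # hypothetical partition key: one series per partition
        "    ts timestamp,\n"        # clustering column: datapoints ordered by time
        "    value double,\n"
        "    PRIMARY KEY (metric_id, ts)\n"
        ") WITH CLUSTERING ORDER BY (ts DESC)\n"
        "    AND compaction = {'class': 'DateTieredCompactionStrategy'}\n"
        f"    AND default_time_to_live = {ttl_seconds};"
    )


statement = build_create_table("metrics", "datapoints", ttl_seconds=2 * 24 * 3600)
print(statement)
```

Rows written to such a table expire on their own once the TTL passes. Newer Cassandra releases deprecate DTCS in favor of TimeWindowCompactionStrategy.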

Metrics Collection Cassandra - our use case
Separate clusters for different types of data
Clusters span multiple datacenters
Dynamically control to which cluster data is written
Forcibly deleting old data

Metrics Collection
Metrics as free resource
*.application_ _0361.*
*. Connections.10_30_3_24.0x64d11081baa1837.*
*. ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*
*. check-<uid_or_uuid>.*
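
The patterns above share one trait: a path segment carries an unbounded identifier (an IP address, a pointer-like hex value, a UUID), so every new instance creates a new series. A rough sketch of flagging such segments follows. The regular expressions and the `svc` prefix in the sample paths are assumptions made for illustration, not rules from the talk.

```python
import re
from typing import List

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
IP_RE = re.compile(r"\d{1,3}(?:_\d{1,3}){3}")       # dotted quad with '_' separators
HEX_RE = re.compile(r"0x[0-9a-f]{8,}", re.IGNORECASE)  # long hex literal


def unbounded_segments(path: str) -> List[str]:
    """Return the dot-separated segments that look like per-instance identifiers."""
    flagged = []
    for segment in path.split("."):
        if UUID_RE.search(segment) or IP_RE.fullmatch(segment) or HEX_RE.fullmatch(segment):
            flagged.append(segment)
    return flagged


samples = [
    "svc.Connections.10_30_3_24.0x64d11081baa1837.count",
    "svc.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.latency",
    "stats.counts.api.requests",
]
for sample in samples:
    print(sample, "->", unbounded_segments(sample))
```

A check like this could run at ingestion or in code review to catch metric names whose cardinality grows without bound.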

Metrics Collection Cost accounting - metrics about metrics

Metrics Visualization
M3 - Querying

Metrics Visualization
Grafana

Metrics Visualization M3QL - Query Like It’s Bash
M3QL:
aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn | aggregate | asPercent (fetch name:requests caller:cn | aggregate) | anomalies | sort max | tail 10

Graphite:
tail(
  sort(
    anomalies(
      asPercent(
        sum(fillNulls(stats.counts.cn.*.requests.errors)),
        sum(fillNulls(stats.counts.cn.*.requests))
      )
    ),
    max
  ),
  10
)
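
The difference between the two styles is only the order in which the same functions are written: left to right through a pipe, or inside out through nesting. The toy sketch below shows that equivalence in Python. It is not an M3QL implementation; `pipe`, `fill_nulls`, `sum_series`, and `as_percent` are hypothetical stand-ins, and the `anomalies`, `sort`, and `tail` stages are left out.

```python
from functools import reduce
from typing import Callable, List, Optional

Series = List[Optional[float]]


def fill_nulls(series: Series, default: float = 0.0) -> List[float]:
    """Replace missing datapoints with a default value."""
    return [default if v is None else v for v in series]


def sum_series(group: List[List[float]]) -> List[float]:
    """Sum several aligned series point by point."""
    return [sum(points) for points in zip(*group)]


def as_percent(part: List[float], total: List[float]) -> List[float]:
    """Express part as a percentage of total, point by point."""
    return [100.0 * p / t if t else 0.0 for p, t in zip(part, total)]


def pipe(value, *stages: Callable):
    """Feed value through the stages left to right, like a shell pipeline."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


def aggregate(group: List[Series]) -> List[float]:
    """Counterpart of 'fillNulls | sum' for a group of series."""
    return pipe(group, lambda g: [fill_nulls(s) for s in g], sum_series)


errors = [[1.0, None, 2.0], [0.0, 1.0, None]]
requests = [[10.0, 20.0, None], [10.0, None, 20.0]]

# Pipe style: reads in the order the data flows.
piped = pipe(errors, aggregate, lambda e: as_percent(e, aggregate(requests)))

# Nested style: the same computation, read inside out.
nested = as_percent(
    sum_series([fill_nulls(s) for s in errors]),
    sum_series([fill_nulls(s) for s in requests]),
)

print(piped)
print(nested)
```

Both forms produce the same error percentages; the pipe form simply lists each stage in the order it runs.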

Metrics Visualization Graphite Way vs. M3QL

Alerting based on metrics Query Based Alerting
To complement our new timeseries database, we built our own alert configuration front-end. It alleviated many of the scale problems we had with Nagios, kept queries flexible, added pluggable outputs, and gave us a place to add future functionality.
graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
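
To make the semantics of such a query-based alert concrete, here is a small sketch that mimics the query on in-memory data and maps the result to an alert state. It is an illustration under assumptions, not the front-end's actual code: the function names are hypothetical, only the latest datapoint is checked, and "over" is read as strictly greater than.

```python
from typing import List, Optional


def latest_scaled_sum(series_group: List[List[Optional[float]]], factor: float) -> float:
    """Mimic scale(sumSeries(transformNull(..., 0)), factor) for the latest datapoint."""
    latest = [0.0 if s[-1] is None else s[-1] for s in series_group]  # transformNull(..., 0)
    return factor * sum(latest)                                       # scale(sumSeries(...))


def absolute_threshold_state(value: float, warning_over: float, critical_over: float) -> str:
    """Map a value to 'ok', 'warning', or 'critical'; critical takes precedence."""
    if value > critical_over:
        return "critical"
    if value > warning_over:
        return "warning"
    return "ok"


blocked = [[3.0, None], [5.0, 7.0]]  # two hypothetical 'blocked' counters
value = latest_scaled_sum(blocked, 0.1)
state = absolute_threshold_state(value, warning_over=0.1, critical_over=10.0)
print(value, state)
```

With the sample data the scaled sum is about 0.7, which is above the warning level but below the critical one.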

Alerting based on metrics Classic Thresholding
Classic high / low thresholds have some intrinsic problems.
Labor-intensive: each threshold is hand-tuned and manually updated.
Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
Not sensitive enough: thresholds take a long time to catch slow degradations.
Poor UX: configuring really good alerts requires specialized knowledge of the Graphite query language.
No guidance: system doesn’t offer automated root cause exploration.

Alerting based on metrics Intelligent Monitoring
Zero config: thresholds are set and maintained automatically.
Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
Integrated UX: work within our existing telemetry and alert configuration systems.
Helpful: automated root cause analysis.
In short, the only input is a list of business-critical metrics.
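
One simple way to get thresholds that maintain themselves is a band of the mean plus or minus k standard deviations over recent history. The sketch below uses that textbook technique purely as an illustration. It is not the algorithm described in the talk, and it does not model seasonality, growth, or rollouts; `dynamic_bounds`, `is_anomalous`, and `k` are hypothetical names.

```python
from statistics import mean, stdev
from typing import List, Tuple


def dynamic_bounds(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Derive (lower, upper) thresholds from recent values instead of hand-tuning them."""
    if len(history) < 2:
        raise ValueError("need at least two datapoints to estimate spread")
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history: List[float], value: float, k: float = 3.0) -> bool:
    """Flag a new value that falls outside the band learned from history."""
    lower, upper = dynamic_bounds(history, k)
    return value < lower or value > upper


history = [100, 102, 98, 101, 99, 100, 103, 97]
lower, upper = dynamic_bounds(history)
print(lower, upper)
print(is_anomalous(history, 104), is_anomalous(history, 120))
```

Because each metric's band depends only on its own history, the check can run independently per series, which is what makes this kind of approach easy to parallelize.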

Alerting based on metrics Dynamic Thresholds
The max lower threshold exceeds the min upper threshold

Alerting based on metrics Outage Detection
< 1% outages missed. 6.5 out of 10 alerts are true issues.

Alerting based on metrics F3
stats.foo anomalies(stats.foo)

On-Call Dashboard

Tracing Richer data collection
Zipkin-style distributed tracing (opentracing.io)
Event markers (non-timeseries context)
Automated root cause analysis

Thank You!
