Distributed Tracing How to do latency analysis for microservice-based applications Reshmi Krishna @reshmi9k
About Me Software Engineer / Platform Architect, Pivotal Women in Tech community member Twitter : @reshmi9k Meetup : Cloud-Native-New-York
Agenda Distributed Tracing Tracers and Tracing Systems Zipkin Incorporating distributed tracing into an existing microservice Demo
From Monolith …. Customer Loyalty Web Frontend Payment Notifications A monolith usually looks like a big ball of mud with entangled dependencies, a lack of cohesion, and direct DB queries instead of interfaces and APIs. It does NOT do one thing very well; it usually does a lot of things, which become brittle and difficult to reason about. All functionality must be deployed together. No language or framework heterogeneity. A failure is more likely to cascade, reducing resilience - brittle, high-risk deployments. Scale vertically, or limited horizontal scaling of everything at once. Large teams - anti-agile. Harder to reuse. Harder to modify - thousands of lines of hard-to-understand code. Harder to replace - mean time to recovery suffers. Getting up to speed takes longer. Wikipedia: A big ball of mud is a software system that lacks a perceivable architecture. Although undesirable from a software engineering point of view, such systems are common in practice due to business pressures, developer turnover and code entropy. They are a type of design anti-pattern.
To Microservices …. The "Death Star" architecture (a term from Adrian Cockcroft), as visualized by AppDynamics, Boundary.com and Twitter internal tools
Troubleshooting Latency issues When was the event? How long did it take? How do I know it was slow? Why did it take so long? Which microservice was responsible?
Distributed Tracing Distributed tracing is the process of collecting end-to-end transaction graphs in near real time. A trace represents the entire journey of a request; a span represents a single operation call. Distributed tracing systems are often used for this purpose - Zipkin is an example. As a request flows from one microservice to another, tracers add logic to create a unique trace ID and span IDs. A span is a basic unit of work, identified by a unique 64-bit ID; the trace it belongs to is identified by its own 64-bit ID. A span contains timestamped records, any RPC timing data, and zero or more application-specific annotations. The trace gives you the structure through which you can identify your calls: you can think of a trace as a tree, with spans as its nodes, where the edges indicate a causal relationship between a span and its parent span. Independent of its place in the larger trace tree, though, a span is also a simple log of timestamped records encoding the span's start and end time, any RPC timing data, and zero or more application-specific annotations.
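To make the span structure concrete, here is a minimal sketch in plain Java of the kind of data a span carries; the field names are illustrative and simplified, not Zipkin's actual model:

import java.util.Map;

// Illustrative only: a simplified span record, not Zipkin's real data model.
public final class SimpleSpan {
    final long traceId;      // 64-bit ID shared by every span in the same trace
    final long spanId;       // 64-bit ID identifying this unit of work
    final Long parentSpanId; // null for the root span of the trace
    final String name;       // operation name, e.g. the RPC or HTTP call
    final long startTimestampMicros;
    final long durationMicros;
    final Map<String, String> annotations; // zero or more application-specific notes

    SimpleSpan(long traceId, long spanId, Long parentSpanId, String name,
               long startTimestampMicros, long durationMicros,
               Map<String, String> annotations) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
        this.startTimestampMicros = startTimestampMicros;
        this.durationMicros = durationMicros;
        this.annotations = annotations;
    }
}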
Visualization - Traces & Spans
UI (Trace Id: 1, Span Id: 1)
  └─ Back-Office-Microservice (Trace Id: 1, Parent Id: 1, Span Id: 2)
       ├─ Customer-Microservice (Trace Id: 1, Parent Id: 2, Span Id: 4)
       └─ Account-Microservice (Trace Id: 1, Parent Id: 2, Span Id: 5)
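The IDs in this tree are what gets propagated between services. Tracers that report to Zipkin typically do this with B3 HTTP headers on each outgoing call; the sketch below shows what those headers could look like for the trace above (exact header handling varies by tracer and version):

UI -> Back-Office-Microservice
  X-B3-TraceId: 1
  X-B3-SpanId: 2
  X-B3-ParentSpanId: 1

Back-Office-Microservice -> Customer-Microservice
  X-B3-TraceId: 1
  X-B3-SpanId: 4
  X-B3-ParentSpanId: 2

Back-Office-Microservice -> Account-Microservice
  X-B3-TraceId: 1
  X-B3-SpanId: 5
  X-B3-ParentSpanId: 2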
Dapper Paper By Google This paper describes Dapper, Google's production distributed systems tracing infrastructure. Design goals: low overhead, application-level transparency, scalability. Dapper was published in 2010. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
Zipkin Zipkin is a distributed tracing system whose implementation is based on Google's Dapper paper. It aggregates spans into trace trees and manages both the collection and lookup of the data; in 2015, OpenZipkin became the primary fork. Zipkin helps gather the timing data needed to troubleshoot latency problems in microservice architectures. It started as a project during Twitter's first Hack Week, when an initial version of the Dapper paper's ideas was implemented for Thrift; today it has grown to include support for tracing HTTP, Thrift, Memcache, SQL and Redis requests. The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages.
Initial Zipkin Architecture Tracers collect timing data and transport it over HTTP or Kafka. Twitter originally used Scribe to transport all the traces from the different services to Zipkin and Hadoop. Scribe was developed by Facebook and is made up of a daemon that can run on each server in your system; it listens for log messages and routes them to the correct receiver depending on the category. Once the trace data arrives at the Zipkin collector daemon, it is validated, stored, and indexed for lookups. Zipkin was originally built with Cassandra for storage: it is scalable, has a flexible schema, and is heavily used within Twitter. However, the storage component is now pluggable, with support for Redis, HBase, MySQL, PostgreSQL, SQLite, and H2. Users query for traces via Zipkin's web UI or API.
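As a concrete illustration of the reporting side, here is a sketch of sending one span to a Zipkin collector over HTTP using the zipkin-reporter-java library listed in the links. Note that this shows the later zipkin2 reporter API rather than the original Scribe pipeline described above, and the endpoint and IDs are placeholders:

import zipkin2.Span;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.urlconnection.URLConnectionSender;

public class ReportSpanSketch {
    public static void main(String[] args) throws Exception {
        // Sender posts encoded spans to the collector's HTTP endpoint.
        URLConnectionSender sender =
                URLConnectionSender.create("http://localhost:9411/api/v2/spans");
        // AsyncReporter batches spans and ships them off the request path.
        AsyncReporter<Span> reporter = AsyncReporter.create(sender);

        Span span = Span.newBuilder()
                .traceId("463ac35c9f6413ad")                   // hex-encoded trace ID
                .id("72485a3953bb6124")                        // this span's ID
                .name("get /accounts")                         // operation name
                .timestamp(System.currentTimeMillis() * 1000L) // start time, microseconds
                .duration(42_000L)                             // 42 ms, in microseconds
                .build();

        reporter.report(span); // queued and flushed in the background
        reporter.close();      // flush any pending spans before exit
        sender.close();
    }
}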
Tracers Tracers add the logic to create unique trace IDs. A trace ID is generated when the first request is made; a span ID is generated as the request arrives at each microservice. An example tracer is Spring Cloud Sleuth. Tracers execute in your production apps, so they are written not to log too much. Tracers have an instrumentation or sampling policy to manage the volume of traces and spans.
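For example, with Spring Cloud Sleuth 1.x (the generation current around the time of this talk) the sampling policy could be tuned with a Sampler bean. The sketch below exports every span, which is convenient for a demo but usually too much for production; treat the exact class names as version-dependent:

import org.springframework.cloud.sleuth.Sampler;
import org.springframework.cloud.sleuth.sampler.AlwaysSampler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SamplingConfig {

    // Sample (and export) every span. Production systems typically use a
    // percentage-based sampler instead to keep overhead and storage low.
    @Bean
    public Sampler defaultSampler() {
        return new AlwaysSampler();
    }
}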
Demo : Architecture Diagram Apps instrumented with Spring Cloud Sleuth send spans over a transport (MQ/HTTP/log) to the Zipkin collector; the collector writes to the span store, which the query server and Zipkin UI read from.
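For the Zipkin side of this diagram, the collector, span store, query server and UI could at the time be run together as a single Spring Boot app. This is a sketch assuming the io.zipkin.java server modules of that era; the @EnableZipkinServer approach has since been deprecated in favour of running the stock Zipkin server jar or container:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import zipkin.server.EnableZipkinServer;

// Starts the Zipkin collector, storage, query API and web UI in one process,
// listening on port 9411 by default.
@SpringBootApplication
@EnableZipkinServer
public class ZipkinServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(ZipkinServerApplication.class, args);
    }
}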
Let’s look at some code & Demo
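This is not the actual demo code (that lives in the repo linked at the end), but a minimal sketch of what an instrumented service can look like: with spring-cloud-starter-sleuth on the classpath, the app needs no explicit tracing code, because Sleuth continues the trace of the incoming request and adds the B3 headers to the outgoing RestTemplate call. Service names and URLs here are hypothetical:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
public class BackOfficeApplication {

    // Sleuth instruments RestTemplate beans so downstream calls carry the trace.
    @Bean
    RestTemplate restTemplate() {
        return new RestTemplate();
    }

    public static void main(String[] args) {
        SpringApplication.run(BackOfficeApplication.class, args);
    }

    @RestController
    static class CustomerGateway {
        private final RestTemplate restTemplate;

        CustomerGateway(RestTemplate restTemplate) {
            this.restTemplate = restTemplate;
        }

        // Calls a (hypothetical) customer microservice, which joins the same trace.
        @RequestMapping("/customers")
        String customers() {
            return restTemplate.getForObject("http://localhost:8081/customers", String.class);
        }
    }
}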
Summary Distributed tracing allows you to quickly see latency issues in your system Zipkin is a great tool to visualize the latency graph and system dependencies Spring Cloud Sleuth integrates with Zipkin and gives you log correlation Log correlation allows you to match logs for a given trace Pivotal Cloud Foundry makes it easier to integrate your apps with Spring Cloud Sleuth and Zipkin
Links Dapper, Google : http://research.google.com/pubs/pub36356.html Code for this presentation : https://github.com/reshmik/DistributedTracingDemo_Velocity2016.git Sleuth's documentation : http://cloud.spring.io/spring-cloud-sleuth/spring-cloud-sleuth.html Zipkin reporter library for Java : https://github.com/openzipkin/zipkin-reporter-java.git Zipkin deployed as a PCF app : https://github.com/reshmik/Zipkin/tree/master/spring-cloud-sleuth-samples/spring-cloud-sleuth-sample-zipkin-stream Pivotal Web Services trial : https://run.pivotal.io/ Pivotal Cloud Foundry on your laptop : https://docs.pivotal.io/pcf-dev/