Experiences with Tracing Causality in Networked Services Rodrigo Fonseca, Brown Michael Freedman, Princeton George Porter, UCSD April 2010 INM/WREN San.

Slides:



Advertisements
Similar presentations
COM vs. CORBA.
Advertisements

The Top 10 Reasons Why Federated Can’t Succeed And Why it Will Anyway.
Networking Problems in Cloud Computing Projects. 2 Kickass: Implementation PROJECT 1.
Trace Analysis Chunxu Tang. The Mystery Machine: End-to-end performance analysis of large-scale Internet services.
Rake: Semantics Assisted Network- based Tracing Framework Yao Zhao (Bell Labs), Yinzhi Cao, Yan Chen, Ming Zhang (MSR) and Anup Goyal (Yahoo! Inc.) Presenter:
A Java Architecture for the Internet of Things Noel Poore, Architect Pete St. Pierre, Product Manager Java Platform Group, Internet of Things September.
An Automata-based Approach to Testing Properties in Event Traces H. Hallal, S. Boroday, A. Ulrich, A. Petrenko Sophia Antipolis, France, May 2003.
Network Management Overview IACT 918 July 2004 Gene Awyzio SITACS University of Wollongong.
6/4/2015Page 1 Enterprise Service Bus (ESB) B. Ramamurthy.
Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun Minlan Yu, Michael J. Freedman, Jennifer Rexford Princeton University.
Notes to the presenter. I would like to thank Jim Waldo, Jon Bostrom, and Dennis Govoni. They helped me put this presentation together for the field.
UC Berkeley 1 Experiences with X-Trace: an end-to-end, datapath tracing framework George Porter, Rodrigo Fonseca, Matei Zaharia, Andy Konwinski, Randy.
OCT1 Principles From Chapter One of “Distributed Systems Concepts and Design”
Performance Management 1 Performance, Scalability and Management  Topic  Refining the object model  Threading models  Distributed callbacks  Iterators.
Circuit & Application Level Gateways CS-431 Dick Steflik.
Winter Retreat Connecting the Dots: Using Runtime Paths for Macro Analysis Mike Chen, Emre Kıcıman, Anthony Accardi, Armando Fox, Eric Brewer
.NET Mobile Application Development Remote Procedure Call.
Client Server Model and Software Design TCP/IP allows a programmer to establish communication between two application and to pass data back and forth.
Understanding and Managing WebSphere V5
Presenter: Chi-Hung Lu 1. Problems Distributed applications are hard to validate Distribution of application state across many distinct execution environments.
A Framework for Smart Proxies and Interceptors in RMI Nuno Santos P. Marques, L. Silva CISUC, University of Coimbra, Portugal
Course 6421A Module 7: Installing, Configuring, and Troubleshooting the Network Policy Server Role Service Presentation: 60 minutes Lab: 60 minutes Module.
1 Content Distribution Networks. 2 Replication Issues Request distribution: how to transparently distribute requests for content among replication servers.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
 Zhichun Li  The Robust and Secure Systems group at NEC Research Labs  Northwestern University  Tsinghua University 2.
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
13/09/2015 Michael Chai; Behrouz Forouzan Staffordshire University School of Computing Transport layer and Application Layer Slide 1.
Using Queries for Distributed Monitoring and Forensics Atul Singh Rice University Peter Druschel Max Planck Institute for Software Systems Timothy Roscoe.
Institute of Computer and Communication Network Engineering OFC/NFOEC, 6-10 March 2011, Los Angeles, CA Lessons Learned From Implementing a Path Computation.
Objectives Configure routing in Windows Server 2008 Configure Routing and Remote Access Services in Windows Server 2008 Network Address Translation 1.
1 CS 501 Spring 2003 CS 501: Software Engineering Lecture 16 System Architecture and Design II.
Presented by Xiaoyu Qin Virtualized Access Control & Firewall Virtualization.
Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace.
©NEC Laboratories America 1 Huadong Liu (U. of Tennessee) Hui Zhang, Rauf Izmailov, Guofei Jiang, Xiaoqiao Meng (NEC Labs America) Presented by: Hui Zhang.
Orbited Scaling Bi-directional web applications A presentation by Michael Carter
Architectures of distributed systems Fundamental Models
MapReduce How to painlessly process terabytes of data.
1 Introduction to Middleware. 2 Outline What is middleware? Purpose and origin Why use it? What Middleware does? Technical details Middleware services.
Wir schaffen Wissen – heute für morgen Gateway (Redux) PSI - GFA Controls IT Alain Bertrand Renata Krempaska, Hubert Lutz, Matteo Provenzano, Dirk Zimoch.
Advanced Computer Networks Topic 2: Characterization of Distributed Systems.
Usenix Annual Conference, Freenix track – June 2004 – 1 : Flexible Database Clustering Middleware Emmanuel Cecchet – INRIA Julie Marguerite.
ArcGIS Server for Administrators
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Client Call Back Client Call Back is useful for multiple clients to keep up to date about changes on the server Example: One auction server and several.
3.14 Work List IOC Core Channel Access. Changes to IOC Core Online add/delete of record instances Tool to support online add/delete OS independent layer.
Jan 30, 2001CSCI {4,6}900: Ubiquitous Computing1 Announcements Project Milestone 2 due today. Undergraduate projects should have 3 students per project.
Windows Azure Virtual Machines Anton Boyko. A Continuous Offering From Private to Public Cloud.
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
CS533 - Concepts of Operating Systems 1 Threads, Events, and Reactive Objects - Alan West.
Source Level Debugging of Parallel Programs Roland Wismüller LRR-TUM, TU München Germany.
EPICS and LabVIEW Tony Vento, National Instruments
AFS/OSD Project R.Belloni, L.Giammarino, A.Maslennikov, G.Palumbo, H.Reuter, R.Toebbicke.
By Nitin Bahadur Gokul Nadathur Department of Computer Sciences University of Wisconsin-Madison Spring 2000.
KYUNG-HWA KIM HENNING SCHULZRINNE 12/09/2008 INTERNET REAL-TIME LAB, COLUMBIA UNIVERSITY DYSWIS.
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler.
Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin {emrek,
R Some of these slides are from Prof Frank Lin SJSU. r Minor modifications are made. 1.
Distributed Computing & Embedded Systems Chapter 4: Remote Method Invocation Dr. Umair Ali Khan.
 Cloud Computing technology basics Platform Evolution Advantages  Microsoft Windows Azure technology basics Windows Azure – A Lap around the platform.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
Core and Framework DIRAC Workshop October Marseille.
Using Correlated Tracing to Diagnose Query Level Performance What’s slowing down my app? Jerome Halmans Senior Software Development Engineer Microsoft.
Monitoring Dynamic IOC Installations Using the alive Record Dohn Arms Beamline Controls & Data Acquisition Group Advanced Photon Source.
Integrating HA Legacy Products into OpenSAF based system
Nope OS Prepared by, Project Guides: Ms. Divya K V Ms. Jucy Vareed
The Top 10 Reasons Why Federated Can’t Succeed
Enterprise Service Bus (ESB) (Chapter 9)
Mobile Agents.
Client/Server Computing and Web Technologies
Presentation transcript:

Experiences with Tracing Causality in Networked Services Rodrigo Fonseca, Brown Michael Freedman, Princeton George Porter, UCSD April 2010 INM/WREN San Jose, CA

Which way to Bangalore?

Troubleshooting Networked Systems Hard to develop, debug, deploy, troubleshoot No standard way to integrate debugging, monitoring, diagnostics

Status quo: device centric :55:38 PM fire :55:39 PM fire [04:03: ] [notice] Dispatch s1... [04:03: ] [notice] Dispatch s2... [04:04: ] [notice] Dispatch s3... [04:07: ] [notice] Dispatch s1... [04:10: ] [notice] Dispatch s2... [04:03: ] [notice] Dispatch s3... [04:04: ] [crit] Server s3 down [20/Aug/2006:09:12: ] "GET /ga [20/Aug/2006:09:13: ] "GET /rob [20/Aug/2006:09:13: ] "GET /gal [20/Aug/2006:09:15: ] "GET /ga [20/Aug/2006:09:15: ] "GET /ga [20/Aug/2006:09:15: ] "GET /ro [20/Aug/2006:09:15: ] "GET /ga [20/Aug/2006:09:12: ] "GET /ga [20/Aug/2006:09:13: ] "GET /rob [20/Aug/2006:09:13: ] "GET /gal [20/Aug/2006:09:15: ] "GET /ga [20/Aug/2006:09:15: ] "GET /ga [20/Aug/2006:09:15: ] "GET /ro [20/Aug/2006:09:15: ] "GET /ga... LOG: statement: select oid... LOG: statement: SELECT COU... LOG: statement: SELECT g2_... LOG: statement: select oid... LOG: statement: SELECT COU... LOG: statement: SELECT g2_... LOG: statement: select oid... LOG: statement: SELECT COU... LOG: statement: SELECT g2_... LOG: statement: select oid... LOG: statement: SELECT COU... LOG: statement: SELECT g2_... LOG: statement: select oid... LOG: statement: SELECT COU... LOG: statement: SELECT g2_... LOG: statement: select oid Firewall Load Balancer Web 1 Web 2 Database

Status quo: device centric Determining paths: – Join logs on time and ad-hoc identifiers Relies on – well synchronized clocks – extensive application knowledge Requires all operations logged to guarantee complete paths

This talk Causality Tracking: an alternative Many previous frameworks: – X-Trace, PIP, Whodunit, Magpie, Google’s Dapper… Experiences integrating and using X-Trace

Outline Tracing causality with X-Trace Case studies – 802.1X Authentication Service – CoralCDN and OASIS anycast service Challenges Conclusion

Outline Tracing causality with X-Trace Case studies – 802.1X Authentication Service – CoralCDN and OASIS anycast service Challenges Conclusion

X-Trace X-Trace records events in a distributed execution and their causal relationship Events are grouped into tasks – Well defined starting event and all that is causally related Each event generates a report, binding it to one or more preceding events Captures full happens-before relation

X-Trace Output Task graph capturing task execution – Nodes: events across layers, devices – Edges: causal relations between events IP Router IP Router IP TCP 1 Start TCP 1 End IP Router IP TCP 2 Start TCP 2 End HTTP Proxy HTTP Server HTTP Client

Each event uniquely identified within a task: [TaskId, EventId] [TaskId, EventId] propagated along execution path For each event create and log an X-Trace report – Enough info to reconstruct the task graph Basic Mechanism IP Router IP Router IP TCP 1 Start TCP 1 End IP Router IP TCP 2 Start TCP 2 End HTTP Proxy HTTP Server HTTP Client f h b a g m n cde ijkl [T, g][T, a] X-Trace Report TaskID: T EventID: g Edge: from a, f X-Trace Report TaskID: T EventID: g Edge: from a, f

X-Trace Library API Handles propagation within app Threads / event-based (e.g., libasync) Akin to a logging API: – Main call is logEvent(message) Library takes care of event id creation, binding, reporting, etc Implementations in C++, Java, Ruby, Javascript

Outline Tracing causality with X-Trace Case studies – 802.1X Authentication Service – CoralCDN and OASIS anycast service Challenges Conclusion

802.1X Authentication Service Client Authenticator e.g. Acc. Point Auth Server RADIUS Identity Store e.g. LDAP EAP L2 RADIUS Over UDP LDAP Identified 5 common authentication issues from vendor logs Added a few X-Trace instrumentation points sufficient to differentiate these faults Introduced faults in a test environment

802.1X Authentication Service Instrumentation was easy: – Nested invocations – No in-task concurrency – Extensible protocols (RADIUS, LDAP) – Modular, request-oriented server software

802.1X Example Faults Misconfigured Firewall: no LDAP Miscalibrated Timeout Value Key: multiple correlated vantage points Can help tune timeout values Key: multiple correlated vantage points Can help tune timeout values

CoralCDN and OASIS Instrumented production deployment Heavy use of sampling: – 0.1% of requests to CoralCDN traced Leveraged libasync, libarpc X-Trace instrumentation Much more complex program flow – E.g. windowed parallel RPC calls, variable timeouts Found bugs, performance problems, clock skews…

CoralCDN CoralCDN Distributed HTTP Cache

1KB/s 10KB/s 100KB/s 1MB/s 189s: Linux TCP Timeout connecting to origin Slow connection Proxy -> Client Slow connection Origin -> Proxy Timeout in RPC, due to slow Planetlab node! Same symptoms, very different causes 189 seconds CoralCDN Response Times

Outline Brief X-Trace Intro Case studies – 802.1X Authentication Service – CoralCDN – OASIS Anycast Service Challenges Conclusion

Hidden Channels Example: CoralCDN DNS Calls foo DNS Resolver Send Receive * resolve(foo,*) Tasks A B C DNS resolve In general: deferral structures – E.g., queues, thread pools, continuations – Store metadata with the structure Often encapsulated in libraries, high leverage

Incidental vs. Semantic Concurrency Forks and joins tricky for naïve instrumentation – Non-intuitive fork – Incorrect join startdo(1) do(2) do(3)end done(1)done(2)done(3)

Incidental vs. Semantic Concurrency Extra code annotation fixes the problem – Manually change parent of do() events – Manually add edges from done() to end start do(1) do(2) do(3) end done(1) done(2) done(3)

Dealing with Black Boxes Ideal scenario: all components instrumented with X-Trace – Log all events clientproxyserver

Dealing with Black Boxes Gray-box proxy: passes X-Trace metadata on – Log events on the client and server – Layering does this automatically clientproxyserver

Dealing with Black Boxes clientproxyserver Black box proxy: drops X-Trace metadata – No X-Trace events on proxy or server – Can always trace around black box, in client

Outline Brief X-Trace Intro Case studies – 802.1X Authentication Service – CoralCDN – OASIS Anycast Service Challenges Conclusion

Revisiting Troubleshooting Device-centric Logs Depends on well sync’d clocks Joins on ad-hoc identifiers Needs all ops logged for complete traces No modifications to existing code Task-centric traces Does not depend on clocks (can actually fix them) Deterministic joins on standardized ids Sample-based tracing possible Requires instrumentation

X-Trace Instrumentation Instrumenting is easy in most cases A few key libraries go a long way Can be done iteratively – Refining expectations (a la Pip) Partial annotation still useful Independent instrumentation feasible Huge benefits

Conclusions Simple, uniform task graphs useful in debugging, troubleshooting, diagnostics Instrumentation is feasible Causal tracing should be a first-class concept in networked systems

Thank you More details on paper For more info: