Bin Xin, Patrick Eugster, Xiangyu Zhang Dept. of Computer Science Purdue University {xinb, peugster, Lightweight Task Graph Inference.

Presentation transcript:

Lightweight Task Graph Inference for Distributed Applications
Bin Xin, Patrick Eugster, Xiangyu Zhang (Dept. of Computer Science, Purdue University)
Jinlin Yang (Center for Software Excellence, Microsoft Corp.)
IEEE International Symposium on Reliable Distributed Systems

Introduction
New challenges to reliability arise as applications move to the cloud.
Distinct corporate entities manage the infrastructure and own the deployed application.
Application developers do not have access to lower-level debugging information when failures or faults occur, so diagnosis depends on application output or custom application-level logs.
Goal: describe the high-level structural view of a distributed program execution to facilitate easy "after the fact" diagnosis.

Contributions
Defines an abstraction, "tasks", for representing distributed executions.
Presents a lightweight approach to generating "task graphs" from application event logs.
Gives a declarative formulation, in Prolog, of the rules for generating task graphs.
Demonstrates how task graphs help in understanding a distributed execution, including anomaly detection.

Relevance to SmartGrid and CiC
Extensions
Fault detection by real-time log processing (CEP?). The patterns for CEP can be defined by the application developer or auto-generated using code augmentation and static code analysis.
On fault detection, the task graph can be used to decide on "recovery" mechanisms (other than the naïve strategy of restarting the process).
Shortcomings
Does not explicitly consider the "Data Repository"; it is considered only as one of the "tasks".
Not sure how it handles transactions.

Definitions
Event: the execution of an operation that sends data or a signal to, or receives it from, a different thread or process. Events are the smallest building blocks.
Signaling event: a send operation.
Acting event: a receive operation.
Happens-before (a → b): a partial ordering of events, where a belongs to the sender and b to the receiver that acts on that signal.
Task: an autonomous computation within a thread between two acting events, [A_start, A_end). A task contains exactly one acting event and zero or more signaling events.
Task graph: a DAG whose nodes are tasks and whose edges represent happens-before relations.
Request: a pair of signaling and acting events in which the signaling event originates from outside the system.
Reply: a pair of signaling and acting events in which the acting event is triggered outside the system.
E2E service graph:
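The task definition can be illustrated with a minimal Python sketch. It assumes a thread's events are already available in execution order; the `Event` and `Task` classes and the `split_into_tasks` helper are hypothetical names introduced here, not part of the presented system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    event_id: int
    thread_id: str
    kind: str  # "acting" (receive) or "signaling" (send)


@dataclass
class Task:
    thread_id: str
    acting: Event                                   # exactly one acting event
    signaling: List[Event] = field(default_factory=list)  # zero or more


def split_into_tasks(thread_events: List[Event]) -> List[Task]:
    """Cut one thread's ordered events into tasks at each acting event."""
    tasks: List[Task] = []
    for ev in thread_events:
        if ev.kind == "acting":
            # An acting event opens a new task: [A_start, A_end).
            tasks.append(Task(ev.thread_id, ev))
        elif tasks:
            tasks[-1].signaling.append(ev)
        # Signaling events seen before any acting event are ignored here
        # (a simplification made for this sketch).
    return tasks


events = [
    Event(1, "t1", "acting"),
    Event(2, "t1", "signaling"),
    Event(3, "t1", "signaling"),
    Event(4, "t1", "acting"),
]
tasks = split_into_tasks(events)
```

Here the four events yield two tasks: the first holds acting event 1 and signaling events 2 and 3, and the second holds only acting event 4.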

Example

System Setup
Uses HDFS as the example cloud application.
HDFS logs are not sufficient or standardized, so the code is instrumented with a tool called AspectJ.
AspectJ lets the developer insert code during compilation based on specific "rules".
Each event is logged as a 7-field tuple: (EventID, ProcID, ThreadID, SourceLocation, Type, Tag, Value).
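As a rough illustration of the 7-field tuple, the following Python sketch parses one log line into a typed record. The comma-separated layout, the sample line, and the names `LogEvent` and `parse_event` are assumptions made for this example; the slides do not show the concrete on-disk log format.

```python
from typing import NamedTuple


class LogEvent(NamedTuple):
    event_id: int
    proc_id: str
    thread_id: str
    source_location: str
    event_type: str
    tag: str
    value: str


def parse_event(line: str) -> LogEvent:
    """Parse one comma-separated log line (assumed format) into a LogEvent."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    return LogEvent(int(fields[0]), *fields[1:])


# Hypothetical sample line; every field value is made up for illustration.
sample = "17, datanode-1, thread-4, DataXceiver.java:210, acting, socket-A, 512"
ev = parse_event(sample)
```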

Constructing Task Graphs (Prolog formulation) - I
Events: a "fact" is used to parse and store all events.
An entry for hb is made only if the rules on the right are true for events X and Y.
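The actual Prolog rules are not reproduced in this transcript. As a stand-in, the Python sketch below shows one plausible shape of such a rule: hb(X, Y) holds when X is a signaling event, Y is a later acting event, and both carry the same tag. The rule itself, the assumption that the tag identifies the shared sync object or channel, and the `derive_hb` name are all illustrative, not the authors' formulation.

```python
from typing import List, Set, Tuple

# (event_id, event_type, tag), listed in log order
Event = Tuple[int, str, str]


def derive_hb(events: List[Event]) -> Set[Tuple[int, int]]:
    """Return every (X, Y) pair satisfying the assumed hb rule."""
    hb: Set[Tuple[int, int]] = set()
    for i, (x_id, x_type, x_tag) in enumerate(events):
        if x_type != "signaling":
            continue
        # Only events logged after X are candidates for Y.
        for y_id, y_type, y_tag in events[i + 1:]:
            if y_type == "acting" and y_tag == x_tag:
                hb.add((x_id, y_id))
    return hb


log = [
    (1, "signaling", "lockA"),
    (2, "acting", "lockA"),
    (3, "signaling", "sock1"),
    (4, "acting", "sock1"),
]
hb_pairs = derive_hb(log)
```

A rule this permissive pairs a signaling event with every later acting event on the same tag, which is exactly the kind of over-approximation that produces false positives.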

Constructing Task Graphs (Prolog formulation) - II Tasks

Issues & Solutions - I
False positives caused by common sync objects.
A notion of "time" is required, but global clocks or vector clocks are expensive and complex.
Heuristic: use the order of events in the event logs.
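One way to read the log-order heuristic is sketched below in Python: an acting event on a shared sync object is paired only with the most recent preceding signaling event on that object, rather than with every earlier one. This interpretation, the `(event_id, event_type, tag)` layout, and the `nearest_signal_hb` name are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# (event_id, event_type, tag), listed in log order
Event = Tuple[int, str, str]


def nearest_signal_hb(events: List[Event]) -> List[Tuple[int, int]]:
    """Pair each acting event with the latest earlier signal on the same tag."""
    last_signal: Dict[str, int] = {}  # tag -> id of most recent signaling event
    hb: List[Tuple[int, int]] = []
    for ev_id, ev_type, tag in events:
        if ev_type == "signaling":
            last_signal[tag] = ev_id
        elif ev_type == "acting" and tag in last_signal:
            hb.append((last_signal[tag], ev_id))
    return hb


log = [
    (1, "signaling", "lockA"),
    (2, "acting", "lockA"),
    (3, "signaling", "lockA"),
    (4, "acting", "lockA"),
]
pairs = nearest_signal_hb(log)
```

For this log the result is the pairs (1, 2) and (3, 4); the spurious edge from event 1 to event 4, which tag matching alone would add, is not produced.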

Issues & Solutions - II
False positives caused by communication: multiple writes on the same socket.
Heuristic: use the packet size and the total received so far to decide which write to associate with which reads.
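The byte-counting idea can be sketched in Python as follows. The sketch assumes the log records the size of every write and read on one socket, in order, and that the stream is delivered in order; `match_writes_to_reads` is a hypothetical helper, not the authors' code.

```python
from typing import List, Tuple


def match_writes_to_reads(write_sizes: List[int],
                          read_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (write_index, read_index) pairs whose byte ranges overlap."""
    pairs: List[Tuple[int, int]] = []
    w, r = 0, 0
    w_left = write_sizes[0] if write_sizes else 0  # unread bytes of current write
    r_left = read_sizes[0] if read_sizes else 0    # unfilled bytes of current read
    while w < len(write_sizes) and r < len(read_sizes):
        pairs.append((w, r))
        consumed = min(w_left, r_left)
        w_left -= consumed
        r_left -= consumed
        if w_left == 0:          # current write fully delivered
            w += 1
            if w < len(write_sizes):
                w_left = write_sizes[w]
        if r_left == 0:          # current read fully satisfied
            r += 1
            if r < len(read_sizes):
                r_left = read_sizes[r]
    return pairs


# Two writes (100 and 50 bytes) received by three reads (60, 40, 50 bytes).
matches = match_writes_to_reads([100, 50], [60, 40, 50])
```

In this example the first write is associated with reads 0 and 1, and the second write with read 2, so no edge is drawn from the first write to the last read.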

Issues & Solutions - III
False negatives caused by guarded waits.
Multiple waiting threads are notified, and the lock condition may be updated before the current thread executes; hence a condition check is required after waking up.
Fix: manually update such cases by removing the augmented code within the loop and adding a marker just after the loop.
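The guarded-wait pattern and the "marker after the loop" fix can be illustrated with Python's `threading.Condition`. The presented system instruments Java code with AspectJ, so this is only an analogy; `event_log` is a hypothetical stand-in for the instrumented log.

```python
import threading

cond = threading.Condition()
pending = []      # shared state protected by cond
event_log = []    # hypothetical stand-in for the instrumented event log


def consumer():
    with cond:
        while not pending:      # guard: re-checked after every wake-up
            cond.wait()
        item = pending.pop(0)
        # Marker placed just after the loop: one acting event is logged per
        # satisfied wait, no matter how many times the loop iterated.
        event_log.append(("acting", threading.current_thread().name, item))


def producer():
    with cond:
        pending.append("block-1")
        event_log.append(("signaling", threading.current_thread().name, "block-1"))
        cond.notify_all()


c = threading.Thread(target=consumer, name="consumer")
p = threading.Thread(target=producer, name="producer")
c.start()
p.start()
c.join()
p.join()
```

Logging inside the loop would record a wake-up whose guard then fails, or miss the case where the guard is already true and the thread never waits; logging after the loop ties the acting event to the point where the thread actually proceeds.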

Evaluation - I
Performance impact at runtime: 22.2% increase in binary size and 38% increase in execution time.
TaskGraph building using Prolog:

Evaluation - II (Demo)
Helps a new HDFS developer analyze an HDFS execution.
