A Runtime Verification Based Trace-Oriented Monitoring Framework for Cloud Systems Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Wei Dong.

Slides:



Advertisements
Similar presentations
A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud Many slides from authors’ presentation.
Advertisements

Big Data Open Source Software and Projects ABDS in Summary XIV: Level 14B I590 Data Science Curriculum August Geoffrey Fox
SDN + Storage.
G O O G L E F I L E S Y S T E M 陳 仕融 黃 振凱 林 佑恩 Z 1.
The Case for Drill-Ready Cloud Computing Vision Paper Tanakorn Leesatapornwongsa and Haryadi S. Gunawi 1.
Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
From Server to Service: How Microsoft moved Team Foundation Server to Windows Azure Grant Holliday Senior Premier Field Engineer AZR323b.
Spark: Cluster Computing with Working Sets
Rake: Semantics Assisted Network- based Tracing Framework Yao Zhao (Bell Labs), Yinzhi Cao, Yan Chen, Ming Zhang (MSR) and Anup Goyal (Yahoo! Inc.) Presenter:
16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.
1 Introduction Introduction to database systems Database Management Systems (DBMS) Type of Databases Database Design Database Design Considerations.
AN INTRODUCTION TO CLOUD COMPUTING Web, as a Platform…
Database Management Systems (DBMS)
New Challenges in Cloud Datacenter Monitoring and Management
Presenter: Chi-Hung Lu 1. Problems Distributed applications are hard to validate Distribution of application state across many distinct execution environments.
Chapter 7 Configuring & Managing Distributed File System
Microsoft ® Official Course Monitoring and Troubleshooting Custom SharePoint Solutions SharePoint Practice Microsoft SharePoint 2013.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google∗
Chapter 1 Overview of Databases and Transaction Processing.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Network and Active Directory Performance Monitoring and Troubleshooting NETW4008 Lecture 8.
Cong Wang1, Qian Wang1, Kui Ren1 and Wenjing Lou2
Cloud Computing Saneel Bidaye uni-slb2181. What is Cloud Computing? Cloud Computing refers to both the applications delivered as services over the Internet.
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
Software Testing. Definition To test a program is to try to make it fail.
Project 1 Online multi-user video monitoring system.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Towards An Open Data Set for Trace-Oriented Monitoring Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Michael R. Lyu 1,2 1 National University.
An Example Use Case Scenario
DISTRIBUTED COMPUTING
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
SecureMR: A Service Integrity Assurance Framework for MapReduce Author: Wei Wei, Juan Du, Ting Yu, Xiaohui Gu Source: Annual Computer Security Applications.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
COP 4620 / 5625 Programming Language Translation / Compiler Writing Fall 2003 Lecture 10, 10/30/2003 Prof. Roy Levow.
BFTCloud: A Byzantine Fault Tolerance Framework for Voluntary-Resource Cloud Computing Yilei Zhang, Zibin Zheng, and Michael R. Lyu
RELATIONAL FAULT TOLERANT INTERFACE TO HETEROGENEOUS DISTRIBUTED DATABASES Prof. Osama Abulnaja Afraa Khalifah
IT 456 Seminar 5 Dr Jeffrey A Robinson. Overview of Course Week 1 – Introduction Week 2 – Installation of SQL and management Tools Week 3 - Creating and.
Introduction. Readings r Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 3 m Note: All figures from this book.
College of Computer National University of Defense Technology Jingwen Zhou, Zhenbang Chen, Haibo Mi, and Ji Wang {jwzhou, This work.
Introduction to Database Systems1. 2 Basic Definitions Mini-world Some part of the real world about which data is stored in a database. Data Known facts.
Zibin Zheng DR 2 : Dynamic Request Routing for Tolerating Latency Variability in Cloud Applications CLOUD 2013 Jieming Zhu, Zibin.
Example: Rumor Performance Evaluation Andy Wang CIS 5930 Computer Systems Performance Analysis.
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Chapter 1 Overview of Databases and Transactions.
Computer Simulation of Networks ECE/CSC 777: Telecommunications Network Design Fall, 2013, Rudra Dutta.
The IEEE International Conference on Cluster Computing 2010
History & Motivations –RDBMS History & Motivations (cont’d) … … Concurrent Access Handling Failures Shared Data User.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
CS426: Building Decentralized Systems Mahesh Balakrishnan.
COT 4600 Operating Systems Fall 2009 Dan C. Marinescu Office: HEC 439 B Office hours: Tu-Th 3:00-4:00 PM.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
KYUNG-HWA KIM HENNING SCHULZRINNE 12/09/2008 INTERNET REAL-TIME LAB, COLUMBIA UNIVERSITY DYSWIS.
ECE 456 Computer Architecture Lecture #9 – Input/Output Instructor: Dr. Honggang Wang Fall 2013.
Em Spatiotemporal Database Laboratory Pusan National University File Processing : Database Management System Architecture 2004, Spring Pusan National University.
TraceBench: An Open Data Set for Trace-Oriented Monitoring Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Michael R. Lyu 1,2 1 PDL, National.
Chapter 13: Query Processing
Distributed File System. Outline Basic Concepts Current project Hadoop Distributed File System Future work Reference.
Function as a Service An Ad Hoc Approach to Cloud Computing By Keith Downie.
Chapter 1 Overview of Databases and Transaction Processing.
About Dreamwares Dreamwares is a web & mobile application development company specializing in Cloud Computing and has built powerful applications on Amazon.
Experience Report: System Log Analysis for Anomaly Detection
Applying Control Theory to Stream Processing Systems
Troubleshooting Network Communications
TraceBench: An Open Data Set for Trace-Oriented Monitoring
Fault Tolerance Distributed Web-based Systems
Interpret the execution mode of SQL query in F1 Query paper
Presentation transcript:

A Runtime Verification Based Trace-Oriented Monitoring Framework for Cloud Systems Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Wei Dong 1 {jwzhou, 1 PDL & College of Computer, NUDT, Changsha, China 2 Shenzhen Research Institute, CUHK, Shenzhen, China

Motivation 2

3

4

… August, 2013 meltdowns Amazon: $7,000,000/100min Google: $550,000/5min March 13-14, 2008 Malfunction in Windows Azure last 22 hours February 24, 2009 Gmail and Google Apps Engine outage last 2.5 hours January 31, 2009 Google search outage due to programming error last 40 min June 17, 2008 Google AppEngine partial outage due to programming error last 5 hours February 15, 2008 S3 outage: the authentication service overload leading to unavailability last 2 hours July 20, 2008 S3 outage: single bit error leading to gossip protocol blowup last >6 hours August 11, 2008 Gmail site unavailable due to outage in contacts system last 1.5 hours June 29, 2010 intermittent performance problems last 3 hours 5

… August, 2013 meltdowns Amazon: $7,000,000/100min Google: $550,000/5min March 13-14, 2008 Malfunction in Windows Azure last 22 hours February 24, 2009 Gmail and Google Apps Engine outage last 2.5 hours January 31, 2009 Google search outage due to programming error last 40 min June 17, 2008 Google AppEngine partial outage due to programming error last 5 hours February 15, 2008 S3 outage: the authentication service overload leading to unavailability last 2 hours July 20, 2008 S3 outage: single bit error leading to gossip protocol blowup last >6 hours August 11, 2008 Gmail site unavailable due to outage in contacts system last 1.5 hours June 29, 2010 intermittent performance problems last 3 hours 6

Detection Diagnosis Fixing … 7

Currently, many trace-oriented monitoring frameworks exist, such as Dapper, Zipkin, X-ray, P-Tracer, MTracer, … However, two aspects need further investigations: –1. The method for specifying monitoring requirements. Developers or Administrator s Monitoring Requirements Pip IRONModel 8

Currently, many trace-oriented monitoring frameworks exist, such as Dapper, Zipkin, X-ray, P-Tracer, MTracer, … However, two aspects need further investigations: –1. The method for specifying monitoring requirements. – 2. The efficiency of monitoring. T 1 sec Request_1 Request_100 Request_1000 …… … Handled in a complex process. For example, a simple Google search request will trigger more than 200 sub-requests and cross hundreds of servers. Huge trace data Huge trace data Real-time vs 9 A problem faced by all existing methods!

Currently, many trace-oriented monitoring frameworks exist, such as Dapper, Zipkin, X-ray, P-Tracer, MTracer, … However, two aspects need further investigations: –1. The method for specifying monitoring requirements. – 2. The efficiency of monitoring. To facilitate these issues, we bring runtime verification (RV) into the field of the trace-oriented monitoring for cloud systems. Runtime Verification –Expressive specification languages –Automatic monitor generation –Efficient monitoring 10

Framework 11

12 A Cloud System Tracing System PreprocessMonitors Monitor Generator PDB properties traces trace data collecting results

The trace records the execution path of a user request. Trace = events + relationships Event: function name and latency … Relationship: local and remote function calls … Trace → Trace Tree Nodes correspond to events. Edges correspond to relationships, e.g., a and c. Trace Tree → linear event sequence DFS: 1,2,4,3,5 Call and Return: C1C2C4R4R2C3C5R5R3R1 Compared with the resource-oriented methods, traces can record more find-grained information, e.g., RPC and execution time. 13

Preliminary Evaluation 14

We collected a trace data set (TraceBench) in an HDFS deployed on a real environment, –Considering different kinds of user requests write: uploading files to HDFS read: downloading files from HDFS rpc: file management, like querying, removing, renaming, … –Injecting various faults –With various cluster size, request speed, etc. 15

16 Based on TraceBench, we extract many such properties and we can correctly and flexibly expressing them all! Following are some samples. Each read request contains at least one reading operation. And the last reading operationshould be successful. Or else, we say it is a failed read request.

In the form of SQL queries, we check the traces with above properties in all sets with faults injected in. 100% of failed traces are identified without FPs. Several failed traces are also found in the Normal set, with the reason of losing events in the tracing system. 17

18 Checking traces in killDN set using Property 2 with a notebook of 4×2.5 GHz CPU and 4 GB memory. About 10,000 traces can be checked in 1 second in this condition, which is a promising result. In addition, the efficiency can be further improved with various optimizations.

Future work 19

Integrating existing RV frameworks into our tracing system Highly efficient and scalable monitoring algorithms and effective machine learning methods for properties. Using RV to monitor the performance aspects More applications on other real world cloud systems 20

Tracing Framework available at: Data set available at: Online demonstration at: 21 Framework and Data Set

Thanks for Your Attention! And Any Questions?

Backup Slides 23

HDFS In the rest, the traces discussed are collected in HDFS, which is a widely used cloud file storage system. 24

Starts with getFileInfo, to know if file exists. Followed by some other RPCs, for related operations. rpc A failure occurs when a violation happens. 25

read Consist of many data block reading operations –starting with blockSeekTo (B), the last one should be correct –indicated by checksumOK (K). A failure occurs when a violation happens. 26

write Similar with read, consists many data block writing operations, –by calling createBlockOutputStream (C). and the last one should be correct –indicated by the equality of receiveBlock (R) and wirteBlock (W). where O a is the abstract next operator in CaRet. And a failure occurs when a violation happens. 27