G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer.

Presentation transcript:

G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer Yates

Abstract
● Best-effort networks --> QoS
● Manage end-to-end service quality as a whole
● Generic Root Cause Analysis (G-RCA)
  ○ Service Quality Management (SQM)
● FCAPS

Introduction
– Finds the root causes of errors, including transient errors
– Gathers information for network operators
– Helps Service Quality Management (SQM) for ISPs

G-RCA Architecture
Consists of five main components.
G-RCA determines where and when to look for diagnostic events.
Used to:
– Troubleshoot ongoing network problems
– Investigate past behavior

Data Collection and Management
Proactively collects data from the network, such as alarms, logs and performance measurements.
Uses a data collector and a database to store the data.
“Events”
– event name, location type, retrieval process and information

Service Dependency Model
● Figure 2 is used to identify the network elements associated with a problem
● Hard to put the theory into practice
  ○ Traffic sampling data
  ○ Snapshots of router configs

Spatial-Temporal Correlation (1)
● How to relate what has happened to the service problem?
● G-RCA defines a temporal and a spatial joining rule
● Temporal Joining Rule
  ○ Defines a time window that allows a symptom and a diagnostic event to be joined.
  ○ 6 parameters in total, 3 each for the symptom and the diagnostic event
    ■ Left expansion margin
    ■ Right expansion margin
    ■ Expanding option (Start/End, Start/Start or End/End)

Spatial-Temporal Correlation (2)
  ○ A symptom and a diagnostic event are joined when their expanded time windows overlap.
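The temporal joining rule can be sketched as follows. This is a minimal illustration, not G-RCA's implementation: the `Event` and `Expansion` types, the function names, and the reading of the expanding options (Start/End widens the whole event interval, Start/Start anchors both edges of the window at the start time, End/End anchors both at the end time) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    name: str
    start: float  # seconds since some epoch
    end: float


@dataclass(frozen=True)
class Expansion:
    left: float   # left expansion margin, in seconds
    right: float  # right expansion margin, in seconds
    option: str   # "start/end", "start/start" or "end/end"


def expand(event: Event, exp: Expansion) -> Tuple[float, float]:
    """Return the expanded time window (lo, hi) for an event."""
    if exp.option == "start/end":
        lo, hi = event.start, event.end
    elif exp.option == "start/start":
        lo = hi = event.start
    elif exp.option == "end/end":
        lo = hi = event.end
    else:
        raise ValueError(f"unknown expanding option: {exp.option}")
    return lo - exp.left, hi + exp.right


def temporally_joined(symptom: Event, sym_exp: Expansion,
                      diagnostic: Event, diag_exp: Expansion) -> bool:
    """Two events join when their expanded windows overlap."""
    s_lo, s_hi = expand(symptom, sym_exp)
    d_lo, d_hi = expand(diagnostic, diag_exp)
    return s_lo <= d_hi and d_lo <= s_hi


# Hypothetical events: a BGP flap and an interface going down shortly before it.
flap = Event("bgp_flap", start=1000.0, end=1010.0)
link_down = Event("interface_down", start=900.0, end=905.0)
print(temporally_joined(flap, Expansion(180.0, 5.0, "start/start"),
                        link_down, Expansion(0.0, 0.0, "start/end")))
```

Each event contributes three parameters (two margins and one option), which gives the six parameters of a rule.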

Spatial-Temporal Correlation (3)
● Spatial Joining Rule
  ○ Symptom event location type
  ○ Diagnostic event location type
  ○ Joining level
● Joining level
  ○ Links symptom locations and diagnostic event locations together
● Diagnostic signatures are modeled using a diagnosis graph
● A symptom and diagnostic event pair is called a diagnosis rule
● G-RCA evaluates the time and location conditions against the collected data
● This determines whether a diagnostic signature is present
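One way to picture the spatial joining rule is to project both locations onto the joining level and test whether the projections intersect. The sketch below assumes a lookup table that stands in for the service dependency model; the table contents, the location names and the function names are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical dependency data:
# (location type, location, joining level) -> locations at the joining level.
LocationMap = Dict[Tuple[str, str, str], Set[str]]

DEPENDENCIES: LocationMap = {
    ("bgp_session", "r1:cust7", "router"): {"r1"},
    ("interface", "r1:ge-0/0/1", "router"): {"r1"},
    ("interface", "r2:ge-0/0/3", "router"): {"r2"},
}


def project(loc_type: str, location: str, level: str,
            deps: LocationMap) -> Set[str]:
    """Map a location of a given type onto the joining level."""
    if loc_type == level:
        return {location}
    return deps.get((loc_type, location, level), set())


def spatially_joined(sym_type: str, sym_loc: str,
                     diag_type: str, diag_loc: str,
                     level: str, deps: LocationMap) -> bool:
    """Two events join spatially when they share a location at the joining level."""
    return bool(project(sym_type, sym_loc, level, deps)
                & project(diag_type, diag_loc, level, deps))


# A BGP session symptom and an interface event on the same router join at router level.
print(spatially_joined("bgp_session", "r1:cust7",
                       "interface", "r1:ge-0/0/1",
                       "router", DEPENDENCIES))
```

The three arguments that describe a rule here (symptom location type, diagnostic location type, joining level) mirror the three fields of the spatial joining rule.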

Reasoning Logic
Rule-Based Reasoning Module
Priority value in the diagnosis graph
– Assigned by the operator
– A higher value means more confidence that the diagnostic event is the real root cause
– Can be examined in G-RCA’s Result Browser
How does rule-based reasoning work?
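A minimal sketch of priority-based selection, assuming the reasoning step picks the joined diagnostic event with the highest operator-assigned priority. The event names and priority values are invented for illustration and are not taken from the slides.

```python
from typing import Dict, Iterable, Optional

# Hypothetical priorities for a BGP-flap diagnosis graph (higher = more confidence).
PRIORITIES: Dict[str, int] = {
    "router_reboot": 200,
    "interface_flap": 180,
    "cpu_overload": 100,
}


def pick_root_cause(priorities: Dict[str, int],
                    joined_events: Iterable[str]) -> Optional[str]:
    """Return the joined diagnostic event with the highest priority, or None."""
    candidates = [e for e in joined_events if e in priorities]
    if not candidates:
        return None  # no diagnosis rule matched: root cause unknown
    return max(candidates, key=lambda e: priorities[e])


# Two diagnostic events joined with the symptom; the higher priority one wins.
print(pick_root_cause(PRIORITIES, ["cpu_overload", "interface_flap"]))
```

Returning `None` keeps unexplained symptoms visible, so they can later be examined for missing diagnosis rules.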

Diagnosis graph for BGP flaps root cause analysis

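Bayesian Inference
Determining the root cause means identifying the class r that maximizes the following likelihood ratio:

\[
\hat{r} \;=\; \arg\max_{r}\;
\underbrace{\frac{P(r)}{P(\bar{r})}}_{\text{first term}}
\cdot
\underbrace{\frac{P(e_1,\dots,e_n \mid r)}{P(e_1,\dots,e_n \mid \bar{r})}}_{\text{second term}}
\]

– Classes: the potential root causes r
– Features: the presence or absence of the diagnostic evidence and the symptom events themselves, e_1, ..., e_n

When the features are conditionally independent
– The second term can be decoupled to

\[
\frac{P(e_1,\dots,e_n \mid r)}{P(e_1,\dots,e_n \mid \bar{r})}
\;=\; \prod_{i=1}^{n} \frac{P(e_i \mid r)}{P(e_i \mid \bar{r})}
\]

Parameter configuration (the ratios P(r)/P(\bar{r}) and P(e_i | r)/P(e_i | \bar{r}))
– Bootstrap using the rule-based reasoning
– Define a fuzzy type of discrete values Low, Medium, and High, which correspond to the values 2, 100, and …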
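The scoring can be sketched in log space, where the prior ratio and the per-feature likelihood ratios simply add. This is an illustrative naive-Bayes style sketch, not G-RCA's code: the root-cause and feature names are invented, absent features are assumed to contribute a ratio of 1, and the numeric value used for High is a placeholder because the transcript elides it.

```python
import math
from typing import Dict, Optional, Set, Tuple

LOW, MEDIUM = 2.0, 100.0
HIGH = 10_000.0  # hypothetical placeholder; the value for High is elided in the source

# root cause -> (prior ratio P(r)/P(not r), {feature: ratio P(e|r)/P(e|not r)})
Model = Dict[str, Tuple[float, Dict[str, float]]]

MODEL: Model = {
    "link_congestion": (LOW, {"high_utilization": HIGH, "packet_loss": MEDIUM}),
    "cpu_overload": (LOW, {"cpu_spike": HIGH}),
}


def log_ratio(prior_ratio: float, feature_ratios: Dict[str, float],
              observed: Set[str]) -> float:
    """log of prior ratio times the product of ratios of the observed features."""
    score = math.log(prior_ratio)
    for feature, ratio in feature_ratios.items():
        if feature in observed:  # absent features contribute log(1) = 0 (assumption)
            score += math.log(ratio)
    return score


def most_likely(model: Model, observed: Set[str]) -> Optional[str]:
    """Return the root cause with the largest likelihood ratio."""
    if not model:
        return None
    return max(model, key=lambda r: log_ratio(model[r][0], model[r][1], observed))


print(most_likely(MODEL, {"packet_loss", "high_utilization"}))
```

Working with logarithms avoids overflow when many large ratios are multiplied and does not change which root cause scores highest.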

Comparison
In operational practice, rule-based reasoning logic is often preferred over Bayesian inference
– Easier to configure
– Gives a simple and direct association between the diagnosed root cause and the evidence
– Effective in most applications
However, there are a few cases where Bayesian inference is preferred
– The root cause condition is unobservable

Domain Knowledge Building
● Issue: The specification of a diagnosis graph for an SQM application offered by an operator, especially the initial version, can be inaccurate and incomplete.
● G-RCA addresses the concern about an incomplete diagnosis graph through iterative use of the Correlation Tester and the Result Browser.
  ○ First, the operator filters out the symptom events with known root causes, using the root cause classification capability provided in the Result Browser.
  ○ Second, the operator can focus on the remaining symptom events by comparing them with other suspected diagnostic events that occur at the same time and are spatially related to the service problem.

Domain Knowledge Building
● On the one hand, the second step can be done via the manual drill-down and data exploration capability of the Result Browser;
● On the other hand, operators can also run the Correlation Tester blindly between the symptom events without known root causes and each type of suspected diagnostic event.
● As G-RCA emphasizes usability, the newly uncovered diagnosis rules need to be verified by operators before being incorporated into the diagnosis graph.
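The "blind" search for candidate rules can be pictured with a simple co-occurrence score. The sketch below is only a stand-in for the Correlation Tester, whose actual statistical test is not described in these slides: it ranks candidate diagnostic event types by the fraction of unexplained symptoms that have such an event within a time window. All names, timestamps and the window size are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def cooccurrence_rate(symptom_times: Sequence[float],
                      diagnostic_times: Sequence[float],
                      window: float) -> float:
    """Fraction of symptoms with at least one diagnostic event within +/- window."""
    if not symptom_times:
        return 0.0
    hits = sum(
        any(abs(s - d) <= window for d in diagnostic_times)
        for s in symptom_times
    )
    return hits / len(symptom_times)


def rank_candidates(symptom_times: Sequence[float],
                    candidates: Dict[str, Sequence[float]],
                    window: float) -> List[Tuple[str, float]]:
    """Rank candidate diagnostic event types by co-occurrence rate, highest first."""
    rates = {name: cooccurrence_rate(symptom_times, times, window)
             for name, times in candidates.items()}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Symptoms without a known root cause, and two candidate event types.
unexplained = [100.0, 200.0, 300.0, 400.0]
candidates = {
    "config_change": [1000.0],
    "line_card_crash": [98.0, 203.0, 301.0],
}
print(rank_candidates(unexplained, candidates, window=5.0))
```

A high rate only suggests a candidate rule; as the slide notes, an operator still has to verify it before it goes into the diagnosis graph.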

Introduction of G-RCA Applications
The key advantage of G-RCA in SQM is that it can be rapidly customized into different RCA applications in the ISP’s network.
In this section, the following three case studies demonstrate the effectiveness of G-RCA:
– 1) customer BGP flaps
– 2) end-to-end throughput management in a CDN service
– 3) network PIM flaps in multicast VPN

BGP Flaps Root Cause Analysis
Purpose: Understanding the root cause of flaps.
● G-RCA achieves this by constructing application-specific events and rules.
  ○ Start by constructing the BGP flap-specific events.
  ○ Add a few application-specific diagnosis rules.
  ○ Specify priorities for the different diagnosis rules for BGP flaps RCA.
(Please refer to the figure “Diagnosis graph for BGP flaps root cause analysis” shown in the previous slides.)
Table: Application-specific events for BGP flaps root cause analysis

Conclusion
1. G-RCA captures the layered network model in its knowledge library by implementing
   - temporal/spatial correlation,
   - rule-based reasoning, and
   - Bayesian inference.
2. Domain knowledge in an existing RCA application can be refined through the interaction between the RCA engine and the Correlation Tester.
3. In order to analyse a large number of service quality issues and to classify and trend their root causes, it proactively collects all types of data from different sources and normalizes them in real time.