Published by Aubrey Wilcox. Modified over 8 years ago.
Fault Localization via Analysis of Network Dependency
http://pmon
Victor Bahl, Ranveer Chandra, Albert Greenberg, Dave Maltz, Ming Zhang (MSR Redmond)

Failure of Management Systems
- 10% of requests to internal servers take 10x longer than normal
- Persistent user frustration and high care costs
- Invisible to current management systems
[Figure: response time of 1 web server and response time of 17 servers, annotated "~10%"]

Challenges
A typical large enterprise:
- ~100,000 client desktops
- ~10,000 servers
- ~10,000 apps/services
- ~10,000 network devices
Service alerts for 10 days:
- 120,000 "housekeeping"
- 2,000 missed heartbeats from 160 servers
- 18,000 alerts from 194 categories and 877 hosts

State of the Art
[Diagram: client machines, DNS, proxy, Active Directory, WebSvr, and SQL, each handled by a separate team and tool set]
- Server management (application support staff): MOM, MAM, SMS, scripts, ...
- Network management (network support staff): SMARTS, SNMP, NetFlow, scripts, ...
- Desktop management (help desk support staff): RemoteDsktp, SMS, ...
- Management systems do not provide a "big picture"
- Tools are box-centric, not service-centric
- Relationships among servers are often undocumented
- Fragmentation results in more mistakes and outages
- Tools do not directly measure user experience

Mission
What we have today:
- Interdependent distributed systems with hidden and unknown dependencies
- Plethora of tools for graphing SNMP values, paucity of tools for tracking relationships
- Little visibility into the effect of the network on applications
What we want:
- Method to map the IT infrastructure, determining which components affect a given client activity
- Method to localize problems that affect users

Automatically Creating Models of Dependencies
[Diagram: two stages, "Identify Service Dependencies" and "Fault Localization", with the following labels]
- Packet traces at individual agents / vantage-point routers
- Topology and other network information
- Inference Graph
- Inference Engine
- Observations: client-server interaction logs, trouble tickets, etc.
- Actions: run TraceRoute x->y
- Fault suspects: links, routers, servers, clients

Example Extracted Dependencies
[Diagram: dependency graph over client A, client B, DNS Server, web front 1, and SQL Server]

Automatically Localizing Faults
[Diagram: inference graph running from root causes to user experience. Nodes: client A, DNS Server, web front 1, SQL Server, File Server, A → DNS, A → WF1, WF1 → SQL, A → FS. Numeric edge labels include 1, 0.8, and 0.3; example node states are (0, 0, 1) and (1, 0, 0).]
- Nodes can be up, down, or troubled
- State of each node: (P_up, P_troubled, P_down), where P_up + P_troubled + P_down = 1

On-Going Work
- Read/write SML models of applications
  - Automatically generate SML for legacy apps
  - Complement expert-generated SML
- Explore other applications of the Inference Graph
  - Upgrade management (who will be affected)
  - Availability analysis (who is being impacted)

Results
- Algorithm for extraction of dependency models
  - Sniffs and correlates packets between hosts
- Algorithm for flexible and accurate fault localization
  - Scalable to the size of large enterprises
  - Localizes both hard and performance faults
  - Finds problems in the network, even without data from network routers
- Deployed and evaluated on a testbed and several MSIT applications (e.g., msw, itweb)
- Model is probabilistic to cope with caching, load balancing, and failover techniques
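The three-state node model above (each node up, troubled, or down, with probabilities summing to 1, and numeric weights on dependency edges) can be illustrated with a small Python sketch. The poster does not give the actual propagation rule, so the combination used here is an assumption for illustration only: parents are treated as independent, an edge weight is read as the probability that the child really depends on that parent, and the child takes the worst state among the parents it depends on. The names State and combine are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class State:
    """Probabilities that a node is up, troubled, or down."""
    up: float
    troubled: float
    down: float

    def __post_init__(self):
        # The poster requires P_up + P_troubled + P_down = 1.
        if abs(self.up + self.troubled + self.down - 1.0) > 1e-9:
            raise ValueError("state probabilities must sum to 1")


def combine(parents):
    """Derive a child node's state from (weight, State) pairs.

    Assumed rule (not taken from the poster): with independent parents,
    the child is up only if no parent it depends on is troubled or down,
    and it is down if any parent it depends on is down.
    """
    # Probability that no relevant parent is troubled or down.
    p_up = prod(1.0 - w * (1.0 - s.up) for w, s in parents)
    # Probability that no relevant parent is down.
    p_not_down = prod(1.0 - w * s.down for w, s in parents)
    # Clamp to guard against tiny negative values from rounding.
    p_troubled = max(0.0, p_not_down - p_up)
    return State(up=p_up, troubled=p_troubled, down=1.0 - p_not_down)


if __name__ == "__main__":
    # Illustrative inputs only: the weights 1, 0.8, 0.3 and the states
    # (1, 0, 0) and (0, 0, 1) appear on the poster, but this pairing of
    # weights with nodes is made up for the example.
    parents = [
        (1.0, State(1.0, 0.0, 0.0)),  # a parent that is certainly up
        (0.8, State(0.0, 0.0, 1.0)),  # a parent that is certainly down
        (0.3, State(1.0, 0.0, 0.0)),  # a weak dependency that is up
    ]
    print(combine(parents))
```

A weight below 1 softens the effect of a failed parent, which is one way a probabilistic model can account for the caching, load balancing, and failover mentioned under Results.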