
1 Distributed Self Fault-Diagnosis for SIP Multimedia Applications
MMNS, San Jose, Oct. 2007
Kai X. Miao (Intel)
Henning Schulzrinne (Columbia U.)
Vishal Kumar Singh (Columbia U./Motorola)
Qianni Deng (Shanghai Jiaotong University)

2 Overview
The transition in IT cost metrics
End-to-end, application-visible reliability is still poor (~99.5%)
 – even though network elements have become much more reliable
 – particular impact on interactive applications (e.g., VoIP)
 – transient problems
Lots of "voodoo" network management
Existing network management doesn't work for VoIP and other modern applications
Need user-centric rather than operator-centric management
Proposal: peer-to-peer management
 – "Do You See What I See?"
Using VoIP as the running example: the most complex consumer application
 – but also applies to IPTV and other services
Also used for reliability estimation and statistical fault characterization

3 Circle of blame
The OS vendor, the VoIP service provider (VSP), the application vendor, and the ISP each point to someone else:
 – "must be a Windows registry problem → re-install Windows"
 – "probably packet loss in your Internet connection → reboot your DSL modem"
 – "must be your software → upgrade"
 – "probably a gateway fault → choose us as provider"

4 Diagnostic undecidability
Symptom: "cannot reach server"
 – more precisely: packet sent, but no response
Causes:
 – NAT problem (return packet dropped)?
 – firewall problem?
 – path to server broken?
 – outdated server information (server moved)?
 – server dead?
5 causes → very different remedies
 – no good way for a non-technical user to tell them apart
Whom do you call?
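
The symptom alone does not identify the cause, but a node can at least run a few cheap local probes before asking for outside help. A minimal sketch in Java, assuming a hypothetical target sip.example.com:5060; this is not the paper's tool, only an illustration of how far a single node can get on its own before it needs a peer's view:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Narrow down "cannot reach server" with cheap local probes, in order.
// Host and port are placeholders; a real DYSWIS node would take them from
// the failed request it intercepted.
public class ReachabilityProbe {

    public static String diagnose(String host, int port) {
        InetAddress addr;
        try {
            addr = InetAddress.getByName(host);          // DNS resolution
        } catch (UnknownHostException e) {
            return "DNS lookup failed: outdated server information or DNS outage";
        }
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(addr, port), 3000);  // transport-level liveness
            return "TCP connect succeeded: problem is above the transport layer";
        } catch (Exception e) {
            // Connect failed: could still be NAT, firewall, broken path, or dead server.
            // A single node cannot tell these apart; it must ask peers for their view.
            return "TCP connect failed: ask peers whether they can reach " + host;
        }
    }

    public static void main(String[] args) {
        System.out.println(diagnose("sip.example.com", 5060));
    }
}
```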

5 Traditional network management model
SNMP: "management from the center"

6 Old assumptions, now wrong
Single provider (enterprise, carrier)
 – has access to most path elements
 – professionally managed
Problems are hard failures & elements operate correctly
 – element failures ("link dead")
 – substantial packet loss
Mostly L2 and L3 elements
 – switches, routers
 – rarely 802.11 APs
Problems are specific to a protocol
 – "IP is not working"
Indirect detection
 – MIB variables vs. actual protocol performance
End systems don't need management
 – DMI & SNMP never succeeded there
 – each application does its own updates

7 Managing the protocol stack
Each layer has its own failure modes:
 – media: echo, gain problems, VAD action
 – RTP: protocol problems, playout errors
 – UDP/TCP: TCP negotiation failure, NAT time-out, firewall policy
 – IP: no route, packet loss
 – SIP: protocol problems, authorization, asymmetric connectivity (NAT)

8 Types of failures
Hard failures
 – connection attempt fails
 – no media connection
 – NAT time-out
Soft failures (degradation)
 – packet loss (bursts): access network? backbone? remote access?
 – delay (bursts): OS? access networks?
 – acoustic problems (microphone gain, echo)

9 Examples of additional problems
ping and traceroute no longer work reliably
 – Windows XP SP2 turns off ICMP
 – some networks filter all ICMP messages
Early NAT binding time-out
 – the initial packet exchange succeeds, but then the TCP binding is removed ("web-only Internet")
Policy intent vs. failure
 – "broken by design"
 – "we don't allow port 25" vs. "SMTP server temporarily unreachable"
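
One way to catch the early NAT binding time-out described above is to exercise a connection, go idle, and exercise it again. A rough sketch, assuming a hypothetical echo-style responder at echo.example.net:7 and a 15-minute guess at the binding timer; neither value comes from the paper:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative probe for early NAT binding time-outs: talk to an echo-style
// server, go idle past a suspected binding timer, then talk again. If the
// middlebox silently dropped the binding, the second exchange gets no reply.
public class NatTimeoutProbe {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress("echo.example.net", 7), 3000);
            s.setSoTimeout(5000);                       // bound each read
            OutputStream out = s.getOutputStream();
            InputStream in = s.getInputStream();

            out.write("ping-1\n".getBytes());
            out.flush();
            in.read(new byte[64]);                      // initial exchange succeeds

            Thread.sleep(15 * 60 * 1000);               // idle past the suspected NAT timer

            out.write("ping-2\n".getBytes());           // may be silently dropped by the NAT
            out.flush();
            int n = in.read(new byte[64]);              // times out or returns -1 if binding is gone
            System.out.println(n > 0 ? "binding survived" : "likely NAT binding time-out");
        } catch (Exception e) {
            System.out.println("connection broke while idle: likely NAT/firewall time-out");
        }
    }
}
```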

10 Fault localization
Fault classification: local vs. global
 – does it affect only me, or does it affect others as well?
Global failures
 – server failures, e.g., SIP proxy, DNS or database failures
 – network failures
Local failures
 – specific source failure: node A cannot make a call to anyone
 – specific destination or participant failure: no one can make a call to node B
 – locally observed, but global: the DNS service failed, but only B observed it
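
Once peers have been asked to repeat the failing test, the local-vs-global classification above can be reduced to a simple vote. A sketch of that decision with invented names (Verdict, classify); the paper does not prescribe this exact rule:

```java
import java.util.List;

// Combine my own observation of a failed operation with what peers report
// when asked to repeat the same test.
public class FaultLocalizer {

    public enum Verdict { LOCAL_TO_ME, GLOBAL, PATH_OR_DESTINATION, UNKNOWN }

    // iFailed: my own test failed; peerResults: true = the peer's test succeeded.
    public static Verdict classify(boolean iFailed, List<Boolean> peerResults) {
        if (!iFailed) return Verdict.UNKNOWN;               // nothing to localize
        if (peerResults.isEmpty()) return Verdict.UNKNOWN;  // no second opinion available
        long peerFailures = peerResults.stream().filter(ok -> !ok).count();
        if (peerFailures == peerResults.size()) return Verdict.GLOBAL;   // everyone fails: server/network
        if (peerFailures == 0) return Verdict.LOCAL_TO_ME;               // only I fail: my access/NAT/config
        return Verdict.PATH_OR_DESTINATION;                              // mixed: depends on path or destination
    }

    public static void main(String[] args) {
        System.out.println(classify(true, List.of(true, true, true)));   // LOCAL_TO_ME
    }
}
```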

11 Proposal: "Do You See What I See?" (DYSWIS)
Each node has a set of active and passive measurement tools
Use packet interception (NDIS, pcap)
 – to detect problems automatically, e.g., no response to an HTTP or DNS request
 – to gather performance statistics (packet jitter)
 – to capture RTCP and similar measurement packets
Nodes can ask others for their view
 – possibly also dedicated "weather stations"
Iterative process, leading to:
 – an indication to the user of the cause of the failure
 – in some cases, a work-around (application-layer routing) → TURN server, use remote DNS servers
Nodes collect statistical information on failures and their likely causes
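
The peer exchange itself can be very small. A sketch of one possible "do you see what I see?" query, assuming an invented wire format (a one-line CHECK request on TCP port 9999); the deck does not specify how nodes talk to each other:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// A failing node asks a buddy node to run the same reachability test and report back.
public class DyswisPeer {

    // Buddy side: answer CHECK requests with OK/FAIL based on its own view.
    public static void serve() throws Exception {
        try (ServerSocket server = new ServerSocket(9999)) {
            while (true) {
                try (Socket c = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
                     PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
                    String line = in.readLine();                   // e.g. "CHECK sip.example.com"
                    if (line != null && line.startsWith("CHECK ")) {
                        String host = line.substring(6).trim();
                        boolean ok = InetAddress.getByName(host).isReachable(2000);
                        out.println(ok ? "OK" : "FAIL");
                    }
                }
            }
        }
    }

    // Failing node's side: ask a buddy whether it can reach the same target.
    public static String askBuddy(String buddy, String target) throws Exception {
        try (Socket s = new Socket(buddy, 9999);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
            out.println("CHECK " + target);
            return in.readLine();                                  // "OK" or "FAIL"
        }
    }
}
```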

12 Architecture
"Not working" (notification) or an explicit request for diagnostics from the user
Inspect protocol requests (DNS, HTTP, RTCP, ...), e.g., "DNS failure for 15 min"
Orchestrate tests
 – ping 127.0.0.1
 – contact others: can a buddy reach our resolver?
Notify admin (email, IM, SIP events, ...)

13 Solution architecture
[Figure: nodes P1–P8 spread across Domain A, Service Provider 1, and Service Provider 2, linked by a P2P overlay with a DNS server and a SIP server; after "call failed at P1", peers run DNS, SIP, and PESQ tests.]
Nodes in different domains cooperate to determine the cause of a failure.

14 Failure detection tools
STUN server
 – what is your IP address?
ping and traceroute
Transport-level liveness and QoS, per layer (media, RTP, UDP/TCP, IP)
 – open a TCP connection to the port
 – send a UDP ping to the port
 – measure packet loss & jitter
Need scriptable tools with a dependency graph
 – using Drools for now
TBD: remote diagnostics
 – fixed set ("do DNS lookup"), or
 – applets (only remote access)
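
The UDP part of the transport-level probe might look like the following sketch: it sends spaced UDP pings to an assumed echo-style responder (media.example.com:4000 is a placeholder) and derives loss and a smoothed jitter estimate from the round-trip times. The deck only says "send UDP ping to port, measure packet loss & jitter"; the details here are illustrative:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// UDP-level QoS probe: count lost replies and estimate jitter from RTT spread.
public class UdpQosProbe {
    public static void main(String[] args) throws Exception {
        InetAddress target = InetAddress.getByName("media.example.com");
        int port = 4000, probes = 20, received = 0;
        long prevRtt = -1, jitter = 0;                     // RFC 3550-style smoothed jitter

        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setSoTimeout(1000);
            byte[] payload = new byte[32];
            for (int i = 0; i < probes; i++) {
                long sent = System.nanoTime();
                sock.send(new DatagramPacket(payload, payload.length, target, port));
                try {
                    sock.receive(new DatagramPacket(new byte[64], 64));
                    long rtt = (System.nanoTime() - sent) / 1_000_000;   // ms
                    if (prevRtt >= 0) jitter += (Math.abs(rtt - prevRtt) - jitter) / 16;
                    prevRtt = rtt;
                    received++;
                } catch (java.net.SocketTimeoutException lost) {
                    // no reply within the timeout: count as packet loss
                }
                Thread.sleep(20);                          // roughly RTP-like packet spacing
            }
        }
        System.out.printf("loss: %d/%d, jitter estimate: %d ms%n", probes - received, probes, jitter);
    }
}
```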

15 Dependency classification
Functional dependency
 – at the generic service level, e.g., a SIP proxy depends on a DB service and the DNS service
Structural dependency
 – fixed at configuration time, e.g., the Columbia CS SIP proxy is configured to use a mysql database on host metro-north
Operational dependency
 – runtime dependencies (run-time bindings), e.g., the call that failed was using a failover SIP server, obtained from DNS, which was running on host a.b.c.d in the IRT lab
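
For illustration, the three kinds of dependency could be recorded in a structure like the following sketch; the class and field names are invented, not taken from the paper:

```java
import java.util.ArrayList;
import java.util.List;

// A toy data model distinguishing functional, structural, and operational dependencies.
public class DependencyModel {

    enum Kind { FUNCTIONAL, STRUCTURAL, OPERATIONAL }

    record Dependency(Kind kind, String dependent, String dependsOn, String detail) {}

    public static void main(String[] args) {
        List<Dependency> deps = new ArrayList<>();
        // Functional: generic service level
        deps.add(new Dependency(Kind.FUNCTIONAL, "SIP proxy", "DNS service", "generic"));
        deps.add(new Dependency(Kind.FUNCTIONAL, "SIP proxy", "database service", "generic"));
        // Structural: fixed by configuration
        deps.add(new Dependency(Kind.STRUCTURAL, "SIP proxy", "mysql on metro-north", "from config file"));
        // Operational: bound at run time, per call
        deps.add(new Dependency(Kind.OPERATIONAL, "failed call", "failover SIP server a.b.c.d", "resolved via DNS"));

        deps.forEach(System.out::println);
    }
}
```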

16 Dependency Graph

17 Dependency graph encoded as a decision tree
A = SIP call, B = DNS server, C = SIP proxy, D = connectivity
When A fails, its decision tree is used; depending on each test's yes/no outcome, the decision trees for B, C, or D are invoked in turn.
If the cause is still not known, report it and add a new dependency.
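
A sketch of how the decision tree above might be walked in code, with placeholder probes standing in for the real DNS, SIP-proxy, and connectivity tests from slide 14; the ordering of the checks is an assumption, since the flowchart's branch labels are not fully preserved in this transcript:

```java
import java.util.function.BooleanSupplier;

// Walk the decision tree for a failed SIP call (A) by testing its dependencies
// and descending into whichever subtree explains the failure.
public class DecisionTreeDiagnosis {

    static String diagnoseFailedCall(BooleanSupplier dnsOk, BooleanSupplier proxyOk, BooleanSupplier connectivityOk) {
        if (!connectivityOk.getAsBoolean()) {
            return "D: connectivity problem (descend into connectivity subtree)";
        }
        if (!dnsOk.getAsBoolean()) {
            return "B: DNS server problem (descend into DNS subtree)";
        }
        if (!proxyOk.getAsBoolean()) {
            return "C: SIP proxy problem (descend into SIP proxy subtree)";
        }
        return "cause not known: report and add a new dependency";
    }

    public static void main(String[] args) {
        // Example: connectivity and DNS fine, proxy test fails.
        System.out.println(diagnoseFailedCall(() -> true, () -> false, () -> true));
    }
}
```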

18 Current work
Building the decision tree system
Using JBoss Rules (Drools 3.0)
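
A rough idea of how a diagnosis rule could be wired into JBoss Rules (Drools 3.0) is sketched below. The rule file, the Symptom fact class, and the exact loading calls are assumptions recalled from the Drools 3 API rather than the project's actual code, and should be checked against the release:

```java
import java.io.InputStreamReader;
import java.io.Reader;
import org.drools.RuleBase;
import org.drools.RuleBaseFactory;
import org.drools.WorkingMemory;
import org.drools.compiler.PackageBuilder;

// diagnosis.drl might contain a rule such as:
//   rule "call failed, suspect DNS"
//   when
//       Symptom( type == "SIP_CALL_FAILED" )
//       Symptom( type == "DNS_TIMEOUT" )
//   then
//       System.out.println( "probable cause: DNS failure" );
//   end
public class RuleEngineSketch {

    public static void main(String[] args) throws Exception {
        PackageBuilder builder = new PackageBuilder();
        Reader drl = new InputStreamReader(
                RuleEngineSketch.class.getResourceAsStream("/diagnosis.drl"));
        builder.addPackageFromDrl(drl);                         // compile the rule text
        RuleBase ruleBase = RuleBaseFactory.newRuleBase();
        ruleBase.addPackage(builder.getPackage());

        WorkingMemory memory = ruleBase.newWorkingMemory();
        memory.assertObject(new Symptom("SIP_CALL_FAILED"));    // facts observed locally
        memory.assertObject(new Symptom("DNS_TIMEOUT"));
        memory.fireAllRules();                                  // fires matching diagnosis rules
    }

    // Simple fact type the rules match on.
    public static class Symptom {
        private final String type;
        public Symptom(String type) { this.type = type; }
        public String getType() { return type; }
    }
}
```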

19 Future work
Learning the dependency graph from failure events and diagnostic tests
Learning via random or periodic testing to identify failures and determine relationships
Self-healing
Predicting failures
Protocols for labeling failure events → enable new devices and applications to be incorporated into the dependency system automatically
Decision-tree (dependency-graph) based event correlation

20 Conclusion
Hypothesis: network reliability is the single largest open technical issue → it prevents (some) new applications
Existing management tools are of limited use to most enterprises and end users
Transition to "self-service" networks
 – support non-technical users, not just NOCs running HP OpenView or Tivoli
Need a better view of network reliability

