TOTAL 23 SLIDES BELOW The network is Reliable An informal survey of real-world communications failures BY PETER BAILIS AND KYLE KINGSBURY.

3 The Network is Reliable: An informal survey of real-world communications failures. BY PETER BAILIS AND KYLE KINGSBURY

4 CONTENTS Abstract; survey reports of network reliability under different circumstances; Conclusion

5 ABSTRACT “The network is reliable” is one of the fallacies of distributed computing. The degree of network reliability is critical for systems to function robustly, yet it is hard to determine.

6 VARIOUS SURVEY REPORTS OF NETWORK RELIABILITY UNDER DIFFERENT CIRCUMSTANCES

7 LARGE DEPLOYMENTS & ISSUES What are large deployments? A large deployment is a globally run distributed system whose infrastructure spans hundreds of thousands of servers. What is considered a serious issue in large deployments? Partitions: a network partition is a split of the network into parts that cannot communicate with each other, caused for example by the failure of a network device.

8 LARGE DEPLOYMENTS & ISSUES (CONTD.) EXAMPLE: BEHAVIOR OF NETWORK FAILURES IN MICROSOFT DATACENTERS. Average failure rate: 5.2 devices/day and 40.8 links/day. Median loss: about 59,000 packets per failure. Median time to repair: approximately five minutes. Redundancy improves median traffic by only 43%.
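
To put the daily rates on this slide in perspective, they can be scaled to a year. The sketch below uses only the figures quoted on the slide; the function names and the 365-day year are illustrative assumptions, and multiplying a typical per-failure loss by a failure count gives only a crude order-of-magnitude estimate, since failure sizes are unlikely to be evenly distributed.

```python
# Figures quoted on the slide; everything derived from them is plain arithmetic.
DEVICE_FAILURES_PER_DAY = 5.2
LINK_FAILURES_PER_DAY = 40.8
PACKETS_LOST_PER_FAILURE = 59_000  # per-failure loss figure from the slide


def failures_per_year(per_day: float, days: int = 365) -> float:
    """Scale a daily failure rate to a yearly count (assumes a 365-day year)."""
    return per_day * days


def rough_daily_packet_loss(failures_per_day: float, packets_per_failure: int) -> float:
    """Crude estimate: failures per day times the typical loss per failure."""
    return failures_per_day * packets_per_failure


device_failures_year = failures_per_year(DEVICE_FAILURES_PER_DAY)
link_failures_year = failures_per_year(LINK_FAILURES_PER_DAY)
daily_loss = rough_daily_packet_loss(LINK_FAILURES_PER_DAY, PACKETS_LOST_PER_FAILURE)

print(f"device failures per year: {device_failures_year:,.0f}")
print(f"link failures per year:   {link_failures_year:,.0f}")
print(f"rough packets lost/day from link failures: {daily_loss:,.0f}")
```

Even with a repair time of a few minutes, rates like these mean that some part of the network is failing many times every day.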

9 LARGE DEPLOYMENTS & ISSUES (CONTD.) EXAMPLE: NETWORK FAILURES IN HP’S MANAGED NETWORKS. Analysis of support-ticket data: connectivity-related tickets accounted for 11.4% of all tickets, 14% of which were of the highest priority level. Median incident duration: 2 hours 45 minutes for the highest-priority tickets and 4 hours 18 minutes for all tickets.

10 LARGE DEPLOYMENTS & ISSUES (CONTD.) EXAMPLE: THE FIRST YEAR OF A NEW GOOGLE CLUSTER INVOLVES: five faulty racks (40–80 machines seeing 50% packet loss); eight network maintenance events (four of which might cause 30-minute random connectivity losses); three router failures (traffic has to be pulled immediately for an hour).

11 LARGE DEPLOYMENTS & ISSUES (CONTD.) How do these companies try to cope with network partitions? Google (per Dean): “easy-to-use” abstractions. PNUTS: weaker consistency alternatives.

12 DATACENTER NETWORK FAILURES A Google datacenter. Main causes of failure: 1) Power failure 2) Misconfiguration 3) Firmware bugs 4) Topology changes 5) Cable damage 6) Malicious traffic

13 CLOUD NETWORKS What are cloud networks? Key issues: 1) Transient latency 2) Dropped packets 3) Full network partitions

14 CLOUD NETWORKS (CONTD.) What happens when two nodes are both connected to the Internet but unable to see each other? What can we learn from this case?

15 HOSTING PROVIDERS Can hosting providers offer reliable networks? E.g. Freistil IT: a specific data center had 50%–100% packet loss, which caused the GlusterFS distributed file system to enter split-brain undetected. Why? What is the main issue?
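
One common way to reason about split-brain is a majority (quorum) rule: a node accepts writes only while it can reach a strict majority of the cluster, itself included. The sketch below illustrates that general rule only; it is not a description of how GlusterFS or Freistil IT's setup works, and `has_quorum` and its parameters are hypothetical names chosen for this example.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach form a strict majority.

    reachable_peers counts the *other* nodes this node can currently contact.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be between 0 and cluster_size - 1")
    # +1 counts the node itself; strict majority means more than half.
    return (reachable_peers + 1) > cluster_size // 2


# A 5-node cluster partitioned into a group of 3 and a group of 2:
majority_side = has_quorum(reachable_peers=2, cluster_size=5)  # node sees 2 peers
minority_side = has_quorum(reachable_peers=1, cluster_size=5)  # node sees 1 peer
print("majority side may accept writes:", majority_side)
print("minority side may accept writes:", minority_side)
```

Because two disjoint groups cannot both hold a strict majority, at most one side of a partition keeps accepting writes under this rule, at the cost of availability on the other side.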

16 WIDE AREA NETWORKS (WAN) Why are WAN failures particularly interesting? Example: CENIC, average partition duration (over 5 years): software-related failures, 6 minutes; hardware-related failures, 8.2 hours. Conclusion: graceful degradation under partition or increased latency is especially important for WANs.
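
As one illustration of graceful degradation, a client can fall back to a possibly stale local copy when a remote read times out or the connection fails, and tell the caller that the value may be out of date. This is a minimal sketch under that assumption, not a method described in the presentation; `read_with_fallback`, `fetch_remote`, and `local_cache` are hypothetical names.

```python
from typing import Callable, Dict, Tuple


def read_with_fallback(
    key: str,
    fetch_remote: Callable[[str], str],
    local_cache: Dict[str, str],
) -> Tuple[str, bool]:
    """Return (value, is_stale).

    Tries the remote site first. On a timeout or connection error, serves the
    last locally cached value instead of failing, and flags it as stale.
    """
    try:
        value = fetch_remote(key)
    except (TimeoutError, ConnectionError):
        if key in local_cache:
            return local_cache[key], True  # degraded: possibly stale data
        raise  # nothing cached, so the failure must be surfaced
    local_cache[key] = value  # refresh the cache on every successful read
    return value, False


def partitioned_fetch(key: str) -> str:
    """Stand-in for a remote call made while the WAN link is partitioned."""
    raise TimeoutError("remote site unreachable")


cache = {"config": "v1"}
value, is_stale = read_with_fallback("config", partitioned_fetch, cache)
print(value, "(stale)" if is_stale else "(fresh)")
```

Whether stale reads are acceptable depends on the application; the point of the sketch is that the decision is made explicitly rather than discovered during an outage lasting minutes or hours.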

17 GLOBAL ROUTING FAILURES Can a highly redundant Internet system be safe? 1) Firewall configuration error, e.g. CloudFlare 2) Firmware bug, e.g. Juniper Networks 3) BGP misconfiguration, e.g. Pakistan Telecom

18 NICS AND DRIVERS Firmware bug (NIC problem): e.g. BCM5709 (chip model). Misconfiguration (driver problem): e.g. bnx2.

19 APPLICATION-LEVEL FAILURES What causes messages to be dropped or delayed? 1) Crashes 2) Program errors 3) Scheduler latency 4) Overloaded processes
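
From the point of view of a peer, the causes listed on this slide look alike: a process that has crashed and one that is merely delayed by the scheduler or by overload both stop sending messages on time. The timeout-based heartbeat detector below is a hedged sketch of that effect; the class and its method names are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from typing import Dict


class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from node arrived at time now (seconds)."""
        self.last_seen[node] = now

    def suspected(self, node: str, now: float) -> bool:
        """True if node has never been heard from or its last heartbeat is too old."""
        last = self.last_seen.get(node)
        return last is None or (now - last) > self.timeout_s


detector = HeartbeatDetector(timeout_s=2.0)
detector.heartbeat("worker-1", now=0.0)

on_time = detector.suspected("worker-1", now=1.5)   # within the timeout
paused = detector.suspected("worker-1", now=5.0)    # e.g. a long pause or a crash
detector.heartbeat("worker-1", now=5.1)             # the "dead" node speaks again
recovered = detector.suspected("worker-1", now=5.2)
print(on_time, paused, recovered)
```

The detector cannot tell a slow node from a dead one; a node suspected at one moment may resume later, which is why application-level delays can behave like network partitions.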

20 CONCLUSION Where do communication failures occur? Processes, servers, NICs, switches, local and wide area networks, etc.

21 CONCLUSION (CONTD.) Does a reliable network exist? It depends on: 1) Cautious engineering 2) Aggressive network advances 3) Lots of investment

22 CONCLUSION (CONTD.) What can we do? Consider the risk before a partition occurs.

23 QUESTION TIME! LOL!

24 REFERENCES "Physical Network Interface". Microsoft. January 7, 2009. Stonebraker, Michael (April 5, 2010). "Errors in Database Systems, Eventual Consistency, and the CAP Theorem". Communications of the ACM. CityCloud, 2011; https://www.citycloud.eu/cloudcomputing/post-mortem/. Davidson, S.B., Garcia-Molina, H. and Skeen, D. Consistency in a partitioned network: A survey. ACM Computing Surveys 17, 3 (1985), 341–370; http://dl.acm.org/citation.cfm?id=5508.

25 THANK YOU FOR YOUR PATIENCE

