THE NETWORK IS RELIABLE
An informal survey of real-world communications failures
BY PETER BAILIS AND KYLE KINGSBURY

CONTENTS
Abstract
Various survey reports of network reliability under different circumstances
Conclusion

ABSTRACT
“The network is reliable” is one of the fallacies of distributed computing.
The degree of network reliability is critical for systems to function robustly.
The degree of network reliability is hard to determine.

VARIOUS SURVEY REPORTS OF NETWORK RELIABILITY UNDER DIFFERENT CIRCUMSTANCES

LARGE DEPLOYMENTS & ISSUES
What are large deployments? Globally operated distributed systems whose infrastructure spans hundreds of thousands of servers.
What is considered a serious issue in large deployments? Partitions: a network partition is a split of the network caused by the failure of a network device.

LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLE: BEHAVIOR OF NETWORK FAILURES IN MICROSOFT DATACENTERS
Average failure rate: 5.2 devices/day and 40.8 links/day.
Median loss: 59,000 packets per failure.
Median time to repair: approximately five minutes.
Redundancy improves median traffic by only 43%.

LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLE: NETWORK FAILURES IN HP’S MANAGED NETWORKS
Based on an analysis of support ticket data.
Connectivity-related tickets accounted for 11.4% of support tickets, 14% of which were of the highest priority level.
Median incident duration: 2 hours 45 minutes for the highest-priority tickets and 4 hours 18 minutes for all tickets.

LARGE DEPLOYMENTS & ISSUES (CONTD.)
EXAMPLE: THE FIRST YEAR FOR A NEW GOOGLE CLUSTER INVOLVES
Five faulty racks (40–80 machines seeing 50% packet loss)
Eight network maintenances (four of which might cause 30-minute random connectivity losses)
Three router failures (traffic has to be pulled immediately for an hour)
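
To put the Google figures above in perspective, here is a rough back-of-envelope sketch. It assumes, pessimistically and not based on the source, that the events never overlap and that each one disrupts connectivity for its full stated duration. Faulty racks are left out because the slide gives no duration for them.

```python
# Back-of-envelope estimate built from the figures on the slide above.
# Assumption (not from the source): events never overlap and each one
# disrupts connectivity for its full stated duration.
MINUTES_PER_YEAR = 365 * 24 * 60

maintenance_minutes = 4 * 30  # four maintenances, about 30 minutes each
router_minutes = 3 * 60       # three router failures, one hour each

disrupted_minutes = maintenance_minutes + router_minutes
disrupted_fraction = disrupted_minutes / MINUTES_PER_YEAR

print(f"disrupted minutes per year: {disrupted_minutes}")
print(f"fraction of the year: {disrupted_fraction:.4%}")
```

Under these assumptions the total is 300 minutes, or five hours, of disrupted connectivity in the first year.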

LARGE DEPLOYMENTS & ISSUES (CONTD.)
How do these companies deal with network partitions?
Google (according to Dean): “easy-to-use” abstractions
PNUTS: weaker consistency alternatives

DATACENTER NETWORK FAILURES
A Google datacenter
Main causes of failures:
1) Power failure
2) Misconfiguration
3) Firmware bugs
4) Topology changes
5) Cable damage
6) Malicious traffic

CLOUD NETWORKS
What are cloud networks?
Key issues:
1) Transient latency
2) Dropped packets
3) Full network partitions

CLOUD NETWORKS (CONTD.)
What happens when two nodes are both connected to the Internet but unable to see each other?
What lessons can we learn from this case?
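
One way to picture the two-node case is a timeout-based heartbeat detector. The sketch below is illustrative only: the class name `HeartbeatDetector`, the three-second timeout, and the simulated clock are all assumptions, not something taken from the survey. It shows that when the link between two healthy nodes fails, each node ends up suspecting the other, and neither can tell a partition from a crash.

```python
class HeartbeatDetector:
    """Suspect a peer when no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of the last heartbeat received

    def heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def suspected(self, peer, now):
        last = self.last_seen.get(peer)
        return last is None or now - last > self.timeout


# Two nodes exchange heartbeats once per second. The link between them
# fails after t=9, but both processes keep running.
a = HeartbeatDetector(timeout=3)
b = HeartbeatDetector(timeout=3)
for t in range(10):
    a.heartbeat("b", now=t)
    b.heartbeat("a", now=t)

a_suspects_b = a.suspected("b", now=15)
b_suspects_a = b.suspected("a", now=15)
print(a_suspects_b, b_suspects_a)
```

Both nodes are alive and both reach the same conclusion about the other, which is why a timeout alone is not enough to decide which side should keep serving requests.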

HOSTING PROVIDERS
Can hosting providers offer reliable networks?
E.g. Freistil IT: 50%–100% packet loss in a specific data center caused the GlusterFS distributed file system to enter split-brain undetected.
Why?
What is the main issue?
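
A common defence against split-brain is to accept writes only on the side of a partition that can see a strict majority of the cluster. The sketch below illustrates that general idea; it is not a description of how GlusterFS behaves, and the function names and the five-node cluster are hypothetical.

```python
def has_quorum(reachable_peers, cluster_size):
    """True if this node plus the peers it can reach form a strict majority."""
    return (len(reachable_peers) + 1) * 2 > cluster_size


def accept_write(reachable_peers, cluster_size):
    """Refuse the write unless this side of the partition holds a majority."""
    if not has_quorum(reachable_peers, cluster_size):
        raise RuntimeError("no quorum: refusing write to avoid split-brain")
    return "ok"


# A five-node cluster split 3/2 by a partition.
majority_side = accept_write({"n2", "n3"}, cluster_size=5)

try:
    accept_write({"n5"}, cluster_size=5)
    minority_side = "ok"
except RuntimeError:
    minority_side = "rejected"

print(majority_side, minority_side)
```

Because at most one side can hold a strict majority, the two halves cannot both accept conflicting writes. The cost is that the minority side becomes unavailable for writes until the partition heals.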

WIDE AREA NETWORKS (WAN)
Why are WAN failures particularly interesting?
Example: CENIC, average partition duration (over 5 years):
Software-related failures (SRF): 6 minutes
Hardware-related failures (HRF): 8.2 hours
Conclusion: graceful degradation under partition or increased latency is especially important for WANs.
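
Graceful degradation can take many forms. One simple version, sketched here under assumptions, is to serve a possibly stale cached value when the remote site cannot be reached. The helper `read_with_fallback` and the two stand-in fetch functions are hypothetical names used for illustration only.

```python
def read_with_fallback(key, fetch_remote, cache):
    """Return (value, is_fresh); serve a stale cached value if the remote call fails."""
    try:
        value = fetch_remote(key)
    except (TimeoutError, ConnectionError):
        if key in cache:
            return cache[key], False  # degraded: stale but available
        raise  # nothing cached, so the failure is surfaced to the caller
    cache[key] = value
    return value, True


def healthy(key):
    # Stand-in for a remote read that succeeds.
    return f"value-of-{key}"


def partitioned(key):
    # Stand-in for a remote read during a WAN partition.
    raise TimeoutError("remote site unreachable")


cache = {}
fresh = read_with_fallback("user:1", healthy, cache)
stale = read_with_fallback("user:1", partitioned, cache)
print(fresh, stale)
```

Returning the freshness flag alongside the value lets the caller decide whether stale data is acceptable for a given operation.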

GLOBAL ROUTING FAILURES
Can an Internet system with a high level of redundancy be safe?
1) Firewall configuration error: e.g. CloudFlare
2) Firmware bug: e.g. Juniper Networks
3) BGP misconfiguration: e.g. Pakistan Telecom

NICS AND DRIVERS
Firmware bug (NIC problem): e.g. BCM5709 (chip model)
Misconfiguration (driver problem): e.g. bnx2

APPLICATION-LEVEL FAILURES
What causes messages to be dropped or delayed?
1) Crashes
2) Program errors
3) Scheduler latency
4) Overloaded processes
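
From the sender's point of view, all of the causes above look the same: the message was not acknowledged in time. A common response is to retry with exponential backoff. The sketch below assumes a hypothetical `send` callable that raises `ConnectionError` or `TimeoutError` on failure; the attempt count and delays are illustrative.

```python
import time


def send_with_retry(send, message, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call send(message); on a connection error or timeout, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return send(message)
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)


# Simulated peer that drops the first two messages, then acknowledges.
calls = []
delays = []


def flaky_send(message):
    calls.append(message)
    if len(calls) < 3:
        raise ConnectionError("message dropped")
    return "ack"


result = send_with_retry(flaky_send, "hello", sleep=delays.append)
print(result, len(calls), delays)
```

Retries can deliver the same message more than once when the original was only delayed rather than lost, so the receiver needs to tolerate duplicates.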

CONCLUSION
Where do communication failures occur?
Processes, servers, NICs, switches, local and wide area networks, etc.

CONCLUSION (CONTD.)
Does a reliable network exist? It depends on:
1) Careful engineering
2) Aggressive network advances
3) Lots of investment

CONCLUSION (CONTD.)
What can we do? Consider the risk before a partition occurs.

QUESTION TIME! LOL!

THANK YOU FOR YOUR PATIENCE
