Smoke and Mirrors: Shadowing Files at a Geographically Remote Location Without Loss of Performance - and - Message Logging: Pessimistic, Optimistic, Causal, and Optimal

Presentation transcript:

Smoke and Mirrors: Shadowing Files at a Geographically Remote Location Without Loss of Performance - and - Message Logging: Pessimistic, Optimistic, Causal, and Optimal presenter Hakim Weatherspoon CS 6464 February 5th, 2009

Critical Infrastructure Protection and Compliance  U.S. Department of Treasury study: the financial sector is vulnerable to significant data loss in a disaster and needs new technical options  The risks are real and the technology is available, so why is the problem not solved?

Bold New (Distributed Systems) World…  High-performance applications coordinate over vast distances Systems coordinate to cache, mirror, tolerate disasters, etc. Enabled by cheap storage and CPUs, commodity datacenters, and plentiful bandwidth  The scale is amazing, but the tradeoffs are old ones

Mirroring and speed of light dilemma…  Want asynchronous performance to the local data center  And want a synchronous guarantee [Diagram: primary site and remote mirror, with local-sync (async) and remote-sync (sync) paths] Conundrum: there is no middle ground

Challenge  How can we increase the reliability of local-sync protocols? Given that many enterprises use local-sync mirroring anyway  Different levels of local-sync reliability Send each update to the mirror immediately Delay sending updates to the mirror – deduplication reduces bandwidth
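The delayed local-sync variant relies on deduplication to cut mirror bandwidth. A minimal sketch of content-hash deduplication, not part of SMFS itself; the function name and the `mirror_has` index are hypothetical:

```python
import hashlib

def dedup_updates(blocks, mirror_has):
    """Send only blocks whose content hash the mirror does not already have.

    `mirror_has` is the set of content hashes known to be stored at the
    mirror; a real system would keep this index synchronized with the mirror.
    """
    to_send = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in mirror_has:
            to_send.append((digest, block))
            mirror_has.add(digest)  # assume the block will reach the mirror
    return to_send
```

Delaying sends widens the deduplication window but also widens the window of data that a primary-site disaster can destroy, which is exactly the reliability gap the talk targets.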

Talk Outline  Introduction  Enterprise Continuity How data loss occurs How we prevent it A possible solution  Evaluation  Message Logging: Pessimistic, Optimistic, Causal, and Optimal  Cornell National Lambda Rail (NLR) Testbed  Conclusion

How does loss occur? [Diagram: primary site and remote mirror]  Rather, where do failures occur?  Rolling disasters: packet loss, partition, site failure, power outage

Enterprise Continuity: Network-sync [Diagram: primary site and remote mirror connected by a wide-area network, with acknowledgment points for Local-sync, Network-sync, and Remote-sync]
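Local-sync, network-sync, and remote-sync differ only in when the primary's write call returns to the application. A hypothetical sketch of the three acknowledgment points; the `local_disk` and `link` objects are stand-ins, not the SMFS API:

```python
def write_and_ack(data, local_disk, link, mode):
    """Illustrate when each mirroring mode acknowledges a write."""
    local_disk.write(data)
    if mode == "local-sync":
        return "ack"               # ack before any mirror traffic is sent
    link.send(data)
    link.send_repair(data)         # repair (FEC) packets for network-sync
    if mode == "network-sync":
        return "ack"               # ack once data + repair are in flight
    link.wait_for_mirror_ack()
    return "ack"                   # remote-sync: ack after the mirror stores data
```

The point of network-sync is that the middle return path costs roughly one local round trip, like local-sync, yet the in-flight redundancy makes loss about as unlikely as waiting for the mirror.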

Enterprise Continuity Middle Ground  Network-level redundancy and exposure reduce the probability that data is lost due to network failure [Diagram: primary site sends data and repair packets toward the remote mirror; network-level and storage-level acks flow back]

Enterprise Continuity Middle Ground  Network-sync increases data reliability It reduces the data-loss failure modes: it can prevent data loss even when the primary site fails at the same time the network drops a packet And it ensures data is not lost in send buffers and local queues  Data loss can still occur In the split second(s) before/after the primary site fails… Network partitions Disk controller fails at the mirror Power outage at the mirror  Existing mirroring solutions can use network-sync

Smoke and Mirrors File System  A file system constructed over network-sync Transparently mirrors files over the wide area Embraces the concept: a file may be in transit (on the WAN link), but with enough recovery data to ensure that loss rates are as low as in the remote-disk case! Group mirroring consistency

Mirroring consistency and the Log-Structured File System [Diagram: append(B1,B2) and append(V1…) laid out in the log as blocks B1–B4 with metadata I1, I2, R1, V1]

Talk Outline  Introduction  Enterprise Continuity  Evaluation  Message Logging  Conclusion

Evaluation  Demonstrate SMFS performance over Maelstrom In the event of disaster, how much data is lost? What are system and application throughput as link loss increases? How much are the primary and mirror sites allowed to diverge?  Emulab setup A 1 Gbps link with 25 ms to 100 ms latency connects two data centers Eight primary and eight mirror storage nodes 64 testers submit 512 kB appends to separate logs  Each tester submits only one append at a time

Data loss as a result of disaster  Local-sync was unable to recover data dropped by the network  Local-sync+FEC lost only data not yet in transit  Network-sync did not lose any data Represents a new tradeoff in the design space [Setup: 50 ms one-way latency, FEC(r,c) = (8,3); Local-sync vs. Network-sync vs. Remote-sync]

Data loss as a result of disaster  c = 0, no recovery packets: data loss due to packet loss  c = 1 is not sufficient to mask packet loss either  c > 2 can mask most packet loss  Network-sync can also prevent loss in local buffers [Setup: 50 ms one-way latency, FEC(r,c) = (8, varies), 1% link loss; Local-sync vs. Network-sync vs. Remote-sync]
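The effect of c can be estimated with a back-of-the-envelope model: under independent packet loss at rate p, a group of r data plus c repair packets is unrecoverable only if more than c of its r + c packets are lost. This is my simplification, not the paper's model (real WAN loss is bursty, which is why Maelstrom spaces its repair packets):

```python
from math import comb

def group_loss_probability(r, c, p):
    """P(more than c of r + c packets are lost) with independent loss rate p."""
    n = r + c
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(c + 1, n + 1))

# With FEC(8, 3) and 1% loss, a group fails only if more than 3 of
# its 11 packets are dropped -- orders of magnitude rarer than the
# c = 0 case, where any single drop loses data.
```

This matches the qualitative result on the slide: c = 0 loses data at roughly the link loss rate, while a few repair packets drive the group failure probability far down.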

High throughput at high latencies

Application Throughput  Application throughput measures application-perceived performance  Network-sync and Local-sync+FEC throughput are significantly greater than Remote-sync(+FEC)

…There is a tradeoff

Latency Distributions

Cornell National Lambda Rail (NLR) Rings

Cornell NLR Rings Testbed: Data loss as a result of disaster  Local-sync was unable to recover data dropped by the network  Both Local-sync+FEC and Network-sync did not lose data [Setup: 37 ms one-way latency, FEC(r,c) = (8,3); Local-sync vs. Network-sync vs. Remote-sync]

Talk Outline  Introduction  Enterprise Continuity  Evaluation  Message Logging  Conclusion

Message Logging  Optimistic Return the result to the client before recording it to stable storage  Pessimistic Record the result to stable storage first, then return it  Question: how do we prevent orphaned processes? After a crash, how does a recovering node reconstruct the state of the system before the crash? The pessimistic case is easy: just replay the log The optimistic case?...  Either the data is lost  Or state can possibly be scraped from clients – (possible project idea)
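The pessimistic/optimistic distinction comes down to the ordering of "log" and "reply." A toy sketch of the two disciplines; the class and method names are hypothetical:

```python
class LoggingServer:
    """Contrast pessimistic and optimistic message logging (sketch)."""

    def __init__(self, pessimistic=True):
        self.pessimistic = pessimistic
        self.stable_log = []   # stands in for stable storage
        self.volatile = []     # in-memory only; lost on crash

    def handle(self, request):
        result = request.upper()            # stand-in computation
        if self.pessimistic:
            self.stable_log.append(result)  # log BEFORE replying
        else:
            self.volatile.append(result)    # reply now, log lazily
        return result

    def crash_and_recover(self):
        self.volatile = []                  # optimistic: unlogged results vanish
        return list(self.stable_log)        # pessimistic replays everything
```

Clients of the optimistic server may hold replies the recovered server has no record of; those clients are the orphans, and scraping their state back is the recovery trick hinted at above.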

Message Logging  e.g., “Lehman Brothers Network Survives”  Pharmacy prescriptions after hurricane Katrina

Talk Outline  Introduction  Enterprise Continuity  Evaluation  Message Logging  Conclusion

Conclusion  A technology response to critical infrastructure needs  When does the file system return to the application? Fast — return after sending to the mirror Safe — return after the ACK from the mirror  SMFS — return to the user after sending enough FEC  Network-sync: Lossy Network  Lossless Network  Disk!  Result: fast, safe mirroring independent of link length!

Next Time  Read NFS and write a review. Turn it in via CMS before class: Wide-area cooperative storage with CFS, Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Appears in Proceedings of the 18th ACM SIGOPS Symposium on Operating Systems Principles (SOSP), October, Chord: A Peer-to-Peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, Hari Balakrishnan. Appears in Proceedings of the ACM SIGCOMM Conference, September,  Do Lab 1 – due next Tuesday, February 10th

 Questions?

Future Work  Apply network-sync technologies to large “DataNets”  High-speed online processing of massive quantities of real-time data: Performance-enhancing proxies (PEP) Deep packet inspection (DPI) Sensor network monitoring Malware / worm signature detection Software switch, IPv4–IPv6 gateway  Current state of the art – the solution lives in the kernel  A user process cannot keep up with the kernel/NIC

 Backup slides

Future Work  Apply Network-sync technologies to large “DataNets” NSF focus on global-scale scientific computing DARPA interested in unification of globally distributed military data management subsystems and centers AFRL interested in high speed eventing (e.g. Distributed Missions Operations – DMO)

Featherweight Processes  User-Mode Processes with Fast Reflexes  User-mode process abstraction  Reduce copies between protection domains  Thin syscall interface Restrict blocking calls  Proactive resource request Enforce limits  Pin down memory  Kernel IPC – feather buffer

Revolver Buffers  Feather buffer IPC (kernel–feather process communication)  A pair of unidirectional “revolving doors,” one in each direction  Lock free; hold packet slots  Writing to the buffer From the kernel – while in a softirq  Reading from the buffer While in the kernel – on a workqueue  Turn the buffer around – in-place processing
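A revolving-door buffer is essentially a single-producer/single-consumer ring: the producer only advances `head`, the consumer only advances `tail`, so neither side needs a lock. A Python sketch of that idea (the real feather buffer is shared memory between kernel and user process; the names here are illustrative):

```python
class RevolverBuffer:
    """Single-producer/single-consumer ring buffer of packet slots."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0   # next slot the producer (kernel) fills
        self.tail = 0   # next slot the consumer (feather process) drains

    def available(self):
        return self.head - self.tail

    def put(self, pkt):
        if self.head - self.tail == len(self.buf):
            return False                          # full: drop or retry
        self.buf[self.head % len(self.buf)] = pkt
        self.head += 1                            # publish after write
        return True

    def get(self):
        if self.head == self.tail:
            return None                           # empty
        pkt = self.buf[self.tail % len(self.buf)]
        self.tail += 1                            # free the slot
        return pkt
```

"Turning the buffer around" corresponds to processing a packet in its slot and handing the same slot back in the opposite direction, avoiding a copy between protection domains.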

Technical Details  Thin syscall interface: ffork, mmap, await FFORK(feather buffer size, heap limit) MMAP – maps the feather buffer AWAIT – blocks feather process  Feather buffer interface put / get / peek / turnaround / available  Peg kernel and feather process threads