
Rex: Replication at the Speed of Multi-core. Zhenyu Guo, Chuntao Hong, Dong Zhou*, Mao Yang, Lidong Zhou, Li Zhuang. Microsoft Research; CMU*

Tension between Replication and Multi-core. Most applications are multi-threaded, but to replicate them with state machine replication you can use only a single thread: performance is sacrificed for replication. (Figure: databases, lock servers, file servers, and key-value stores are all caught between multi-core and replication.)

Rex: Replication at the Speed of Multi-core. (Figure: Rex reconciles replication with multi-core execution.)

Outline: Motivation, System Overview, Implementation, Evaluation.

State Machine Replication. To replicate a service: 1. model it as a deterministic state machine; 2. order requests with a consensus protocol; 3. execute requests with a single thread. (Figure: sequential execution on each server yields consistent states; parallel execution on multi-core servers yields inconsistent states.)
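To make the classic recipe concrete, here is a minimal C++ sketch assuming a toy key-value service; the names (KvStateMachine, agreed_log) are illustrative, not from the talk. Once consensus fixes the request order, every replica applies the same sequence single-threaded and ends in the same state.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy deterministic state machine: identical request order on
    // every replica yields identical state.
    struct KvStateMachine {
        std::map<std::string, std::string> state;

        // Apply must be deterministic: no threads, clocks, or randomness.
        void Apply(const std::pair<std::string, std::string>& put) {
            state[put.first] = put.second;
        }
    };

    int main() {
        // The consensus protocol has already agreed on this order.
        std::vector<std::pair<std::string, std::string>> agreed_log = {
            {"a", "1"}, {"b", "2"}, {"a", "3"}};
        KvStateMachine replica;
        for (const auto& req : agreed_log)
            replica.Apply(req);  // single-threaded, sequential
        return 0;
    }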

Why Multi-threading Breaks State Machine Replication. Non-deterministic decisions (locking order, etc.) are made independently by each replica, so Server 1 and Server 2 can diverge: multi-threading buys performance at the cost of consistency.
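A minimal illustration of the failure mode, with hypothetical code not taken from the talk: two threads race for the same mutex, and whichever acquires it last wins. Running the same program on the same inputs, two replicas can finish with different values.

    #include <iostream>
    #include <mutex>
    #include <thread>

    int value = 0;
    std::mutex m;

    // Each thread's effect depends on lock acquisition order, which
    // the OS scheduler decides differently on each replica.
    void SetTo(int v) {
        std::lock_guard<std::mutex> g(m);
        value = v;
    }

    int main() {
        std::thread t1(SetTo, 1), t2(SetTo, 2);
        t1.join(); t2.join();
        std::cout << value << "\n";  // prints 1 or 2: replicas may diverge
    }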

Rex: Execute-Agree-Follow. The primary executes requests and records traces of its non-deterministic decisions; the replicas agree on the traces through consensus; the secondaries follow the agreed traces.

Programming with Rex: 1. model the application as a RexRSM; 2. use Rex to make all non-deterministic decisions: RexLocks, RexCond, RexTimeStamp, RexRand, etc.
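The slide only names the primitives, so the following is a speculative sketch of what application code might look like; RexLock and RexRand are stubbed locally here, and their signatures are guesses from the names, not Rex's documented API.

    #include <cstdint>
    #include <mutex>
    #include <random>

    // Hypothetical stand-ins for the primitives named on the slide.
    // In Rex these would record decisions on the primary and replay
    // them on secondaries; here they are plain local stubs.
    struct RexLock {
        std::mutex m;
        void Lock()   { m.lock(); }
        void Unlock() { m.unlock(); }
    };
    inline std::uint64_t RexRand() { static std::mt19937_64 g(42); return g(); }

    class CounterService {  // in Rex: modeled as a RexRSM
        RexLock lock_;
        std::uint64_t count_ = 0;
    public:
        std::uint64_t HandleRequest() {
            lock_.Lock();                        // lock order recorded/replayed
            std::uint64_t id = count_++;
            std::uint64_t nonce = RexRand();     // random value recorded/replayed
            lock_.Unlock();
            return id ^ nonce;
        }
    };

    int main() {
        CounterService s;
        s.HandleRequest();
        return 0;
    }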

Outline: Motivation, System Overview, Implementation, Evaluation.

Normal Execution: Primary. The primary serves requests on multiple threads, stamping each non-deterministic event with a (thread, logical clock) pair: t1 runs request 1 (lockA at clock 2, unlockA at clock 3, reply 1 at clock 4) while t2 runs request 2 and acquires lockA only after t1 releases it. The trace records the events and the cross-thread dependency: (t1, 1, request 1) … causal edge (t1, 3) -> (t2, 2) … (t1, 4, reply 1) …
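A sketch of the recording side under stated assumptions: the (thread, clock) stamps mirror the trace format on the slide, while RecordingLock and its fields are illustrative. Each thread keeps a logical clock, and a lock handed from one thread to another produces a causal edge from the releaser's stamp to the acquirer's.

    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct Stamp { int thread; int clock; };
    thread_local Stamp my = {0, 0};  // this thread's id and logical clock

    class RecordingLock {
        std::mutex m_;
        Stamp last_release_{-1, 0};  // who released last, and when
    public:
        void Lock() {
            m_.lock();
            ++my.clock;
            // Lock changed hands: log a causal edge into the trace.
            if (last_release_.thread >= 0 && last_release_.thread != my.thread)
                std::printf("causal edge (t%d,%d) -> (t%d,%d)\n",
                            last_release_.thread, last_release_.clock,
                            my.thread, my.clock);
        }
        void Unlock() {
            ++my.clock;
            last_release_ = my;  // written while still holding the mutex
            m_.unlock();
        }
    };

    int main() {
        RecordingLock lk;
        auto work = [&](int id) { my.thread = id; lk.Lock(); lk.Unlock(); };
        std::thread t1(work, 1), t2(work, 2);
        t1.join(); t2.join();
    }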

Normal Execution: Secondary. The secondary replays the trace: t2's lockA becomes a waited event that cannot execute until the source of its causal edge, t1's unlockA at clock 3, has run. This forces the secondary to acquire locks in the primary's order before issuing reply 1 and reply 2.
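The replay side, again as an illustrative sketch rather than Rex's actual code: before executing an event, the secondary blocks until the trace says all of the event's causal predecessors have run, which is what the slide calls a waited event.

    #include <condition_variable>
    #include <map>
    #include <mutex>
    #include <thread>

    class Replayer {
        std::mutex m_;
        std::condition_variable cv_;
        std::map<int, int> applied_;  // thread id -> highest clock executed
    public:
        // Block until the source (thread, clock) of a causal edge has run.
        void WaitFor(int src_thread, int src_clock) {
            std::unique_lock<std::mutex> g(m_);
            cv_.wait(g, [&] { return applied_[src_thread] >= src_clock; });
        }
        // Announce that `thread` has executed up to `clock`.
        void MarkDone(int thread, int clock) {
            { std::lock_guard<std::mutex> g(m_); applied_[thread] = clock; }
            cv_.notify_all();
        }
    };

    int main() {
        Replayer r;
        std::thread t2([&] { r.WaitFor(1, 3); /* now safe to take lockA */ });
        r.MarkDone(1, 3);  // t1 has executed its unlockA at clock 3
        t2.join();
    }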

Primary Failover. A crashed primary restarts from a checkpoint and rejoins as a secondary. A secondary upgraded to primary switches from replay to record mode, keeping the committed portion of the trace and discarding the uncommitted portion lost in the crash.
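A sketch of the upgrade path as the slide describes it; Mode, Replica, and committed are assumed names. The new primary keeps only the committed prefix of the trace, finishes replaying it, then flips from replay to record.

    #include <cstddef>
    #include <vector>

    enum class Mode { Record, Replay };
    struct Event { int thread; int clock; /* op ... */ };

    struct Replica {
        Mode mode = Mode::Replay;
        std::vector<Event> trace;
        std::size_t committed = 0;  // length of the committed trace prefix

        void UpgradeToPrimary() {
            trace.resize(committed);  // drop the uncommitted tail
            // ... finish replaying trace[0..committed) before serving ...
            mode = Mode::Record;      // switch from replay to record
        }
    };

    int main() {
        Replica r;
        r.UpgradeToPrimary();
        return r.mode == Mode::Record ? 0 : 1;
    }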

Unique Challenges in Integrating Replication and Record/Replay: inconsistent cuts, "holes" in logs, causal edge pruning, hybrid execution, and more.

The Inconsistent Cut Problem. Logs are collected from each thread asynchronously, so a snapshot of the trace can form an inconsistent cut: it contains the destination node of a causal edge but not its source. Such a trace cannot be followed by a secondary.

Solving the Inconsistent Cut Problem. Define consensus on the last consistent cut; when the primary fails, drop the trace segment beyond it (the C1-to-C2 segment in the slide's example). Reply to a client only when the reply is contained in a committed consistent cut, using vector clocks to track this.
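One way to implement the reply rule, sketched with assumed details: represent a cut as a vector clock (one entry per thread, giving the length of that thread's logged prefix) and release a reply only when the committed cut dominates the reply's dependency vector.

    #include <cstddef>
    #include <vector>

    using VectorClock = std::vector<int>;  // per-thread log positions

    // cut "dominates" dep if every thread has logged at least as far;
    // both vectors are assumed to have one entry per thread.
    bool Dominates(const VectorClock& cut, const VectorClock& dep) {
        for (std::size_t t = 0; t < dep.size(); ++t)
            if (cut[t] < dep[t]) return false;
        return true;
    }

    // Release a reply only once its dependencies sit inside a
    // committed consistent cut (illustrative, not Rex's actual code).
    bool CanReply(const VectorClock& committed_cut, const VectorClock& deps) {
        return Dominates(committed_cut, deps);
    }

    int main() {
        VectorClock committed = {4, 2, 3};  // committed prefix per thread
        VectorClock reply_dep = {4, 1, 3};  // events the reply depends on
        return CanReply(committed, reply_dep) ? 0 : 1;  // dominated: reply ok
    }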

Outline: Motivation, System Overview, Implementation, Evaluation.

Experiment Setup. Real-world applications plus a micro-benchmark for varying the lock contention ratio; servers have 12 cores (24 hardware threads) and a 10 GbE network. Applications:
Thumbnail: generating and storing thumbnails
XLock: lock server similar to Chubby
File Server: file server
Kyoto Cabinet: key-value store
LevelDB: local storage behind BigTable
MemCached: cache server

Performance Overview. Rex scales like the non-replicated baseline, with less than 24% overhead.

LevelDB in Detail. (Plot: overhead vs. number of cores.) The number of waited events grows with the number of threads, and so does the overhead; the overhead drops again when there are more threads available to schedule around the waits.

Lock Conflict Ratio. In the lock-conflict micro-benchmark, overhead stays below 15%.

Summary. Rex uses execute-agree-follow; applied to six real-world applications, it preserves scalability with low overhead.

Thanks! Q&A

Backup Slides

Dealing with Data Races: reply logging and comparison, resource version checking, and lock-free data structures via NATIVE_EXEC. Experience shows that eliminating data races is doable.
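The slide gives no detail on these mitigations, so here is a speculative sketch of the second one, resource version checking: tag each shared resource with a version that the primary records, and have the secondary verify the version on replay, so an unlogged racy write is detected instead of silently diverging.

    #include <atomic>
    #include <cstdio>

    // Speculative sketch of resource version checking; details assumed.
    struct VersionedResource {
        std::atomic<int> version{0};
        int data = 0;
    };

    // Primary: record the version observed at each access into the trace.
    int RecordAccess(VersionedResource& r) { return r.version.load(); }

    // Secondary: verify the replayed access sees the same version;
    // a mismatch means an unlogged (racy) write changed the resource.
    bool ReplayAccess(VersionedResource& r, int logged_version) {
        if (r.version.load() != logged_version) {
            std::fprintf(stderr, "race detected: version diverged\n");
            return false;
        }
        return true;
    }

    int main() {
        VersionedResource r;
        int v = RecordAccess(r);
        return ReplayAccess(r, v) ? 0 : 1;
    }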

Workloads.
Thumbnail: one picture per request.
Key-value stores: 1M pairs, 16-byte keys, 100-byte values, 10% writes.
File system: 16 KB random requests, 20% writes.
XLock: 90% lease renewals, 100 B to 5 KB files.

Lock Granularity

Request Granularity. 10% of the computation is inside locks, with a 1% conflict ratio.

Experimental Results: Scalability

Causal Events & Performance

Improving Performance: Causal Edge Pruning with Vector Clocks. More causal edges mean more overhead; pruning skips edges already implied by previously recorded ones, trading a little primary performance for faster secondaries. Pruning reduces causal edges by 58% to 99%.
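A sketch of the pruning idea under assumed details (vector-clock bookkeeping is standard, but its use here is a reconstruction, not Rex's code): if the acquiring thread's vector clock already dominates the releaser's stamp, the edge is implied by edges recorded earlier and need not be logged.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using VectorClock = std::vector<int>;  // one entry per thread

    bool Dominates(const VectorClock& a, const VectorClock& b) {
        for (std::size_t t = 0; t < b.size(); ++t)
            if (a[t] < b[t]) return false;
        return true;
    }

    // Log the edge only if it is not already implied by happens-before
    // knowledge the acquirer carries in its own vector clock.
    bool ShouldLogEdge(const VectorClock& acquirer_vc,
                       const VectorClock& release_stamp) {
        return !Dominates(acquirer_vc, release_stamp);
    }

    // After acquiring, fold the release stamp into the acquirer's clock.
    void Merge(VectorClock& acquirer_vc, const VectorClock& release_stamp) {
        for (std::size_t t = 0; t < acquirer_vc.size(); ++t)
            acquirer_vc[t] = std::max(acquirer_vc[t], release_stamp[t]);
    }

    int main() {
        VectorClock me = {3, 5, 0}, rel = {2, 5, 0};
        bool log = ShouldLogEdge(me, rel);  // false: edge already implied
        Merge(me, rel);
        return log ? 1 : 0;
    }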

Replicated State Machine

Rex: Causal Order Replication

Correctness. Correctness is guaranteed by three properties: 1. all non-determinism is captured through Rex; 2. replicas reach consensus on traces; 3. the agreed trace is a continuous sequence (no holes).

Inconsistent Cut: Why Is It Bad? Trace: t1 unlock -> t2 lock -> t2 unlock -> t3 lock, reply: 0. Replay: t1 unlock -> t3 lock -> t3 unlock -> t2 lock, reply: 1. Should we reply 0 or 1?

Inconsistent Cut: Solving the Reply Problem. Reply only when the reply and all of its dependencies are committed; use a vector clock to detect when that holds.