Clock-SI: Snapshot Isolation for Partitioned Data Stores

Presentation transcript:

Clock-SI: Snapshot Isolation for Partitioned Data Stores Using Loosely Synchronized Clocks. Jiaqing Du (EPFL), Sameh Elnikety (MSR Redmond), Willy Zwaenepoel (EPFL). SRDS 2013, Braga, Portugal, 30 September - 3 October 2013.

Key Idea
- Completely decentralized implementation of SI; current solutions have a centralized component
- Better availability, latency, and throughput

Snapshot Isolation (SI): a popular multi-version concurrency control scheme

SI, Informally (sketched below)
- T start: get a snapshot
- T execution: reads from the snapshot; writes to a private workspace
- T commit: check for write-write conflicts; install updates
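A minimal sketch of this lifecycle in Python (all names are hypothetical; the conflict check shown is the common first-committer-wins variant, and the VersionedStore interface it relies on is sketched a few slides later):

```python
# Hedged sketch of the SI transaction lifecycle (hypothetical API).
class Abort(Exception):
    pass

class Transaction:
    def __init__(self, store):
        self.store = store
        # Snapshot timestamp: commit timestamp of the last committed txn.
        self.snapshot_ts = store.last_commit_ts
        self.workspace = {}  # private workspace for buffered writes

    def read(self, key):
        if key in self.workspace:          # read-your-own-writes
            return self.workspace[key]
        return self.store.read_version(key, self.snapshot_ts)

    def write(self, key, value):
        self.workspace[key] = value        # invisible until commit

    def commit(self):
        # Write-write conflict check (first-committer-wins): abort if a
        # concurrent transaction committed a newer version of a key we wrote.
        for key in self.workspace:
            if self.store.latest_ts(key) > self.snapshot_ts:
                raise Abort(key)
        commit_ts = self.store.next_commit_ts()
        self.store.install(self.workspace, commit_ts)  # make updates visible
        return commit_ts
```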

SI, More Formally
- Total order on commits
- Transactions read from a consistent snapshot
- No write-write conflicts between concurrent transactions

Consistent Snapshot
- Writes of all transactions committed before the snapshot are visible
- No other writes are visible

Total Order on Commits (timeline: commit T1 writes x=x1, y=y1; then commit T2 writes y=y2; then commit T3)

Snapshots (timeline: commit T1 writes x=x1, y=y1; commit T2 writes y=y2; commit T3 writes x=x3; T4 starts along the way and reads from a snapshot of the committed state)

Timestamps
- Commit timestamp: assigned at commit; gives the transaction's position in the commit total order
- Snapshot timestamp: assigned at start; the commit timestamp of the last committed transaction

Commit Timestamps (timeline: T1, T2, T3 commit with timestamps 1, 2, 3; T4 starts after T1 and sees x1, y1; T5 and T6 start after T2 and see x1, y2)

Snapshot Timestamps (timeline: commits get timestamps 1, 2, 3; T4 gets snapshot timestamp 1 and sees x1, y1; T5 and T6 get snapshot timestamp 2 and see x1, y2)

Multi-Versioning
Each committed write creates a new version tagged with the commit counter. After T1 commits x=x1, y=y1 (counter 1), T2 commits y=y2 (counter 2), and T3 commits x=x3 (counter 3), the store holds x: (x1,1), (x3,3) and y: (y1,1), (y2,2).

Reads from Snapshot
T5 starts with snapshot timestamp 2, so r5(x) -> x1: the latest version of x with commit timestamp <= 2 is (x1,1), not (x3,3). See the sketch below.
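A matching sketch of the versioned store (a hypothetical interface, paired with the Transaction sketch above): each key maps to an ascending list of (commit_ts, value) versions, and a read returns the newest version no newer than the snapshot.

```python
class VersionedStore:
    def __init__(self):
        self.versions = {}       # key -> [(commit_ts, value), ...] ascending
        self.last_commit_ts = 0  # the commit counter

    def read_version(self, key, snapshot_ts):
        # Newest version with commit_ts <= snapshot_ts (None if none).
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def latest_ts(self, key):
        vs = self.versions.get(key)
        return vs[-1][0] if vs else 0

    def next_commit_ts(self):
        self.last_commit_ts += 1
        return self.last_commit_ts

    def install(self, workspace, commit_ts):
        for key, value in workspace.items():
            self.versions.setdefault(key, []).append((commit_ts, value))

# Demo reproducing the slides' example with the Transaction sketch above:
store = VersionedStore()
t1 = Transaction(store); t1.write("x", "x1"); t1.write("y", "y1"); t1.commit()
t2 = Transaction(store); t2.write("y", "y2"); t2.commit()
t5 = Transaction(store)              # snapshot timestamp 2
t3 = Transaction(store); t3.write("x", "x3"); t3.commit()
print(t5.read("x"))                  # -> "x1", matching r5(x) -> x1
```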

Partitioned Data Store

Transaction in Partitioned Data Store

Simplification
- In this talk: only single-partition update transactions
- In the paper: multi-partition update transactions, using (augmented) 2PC

Conventional SI in a Partitioned Data Store [Percolator, OSDI 2010] (diagram: partitions plus a centralized timestamp authority and transaction T)

Conventional SI: Transaction Start (diagram: T obtains its snapshot timestamp from the timestamp authority)

Conventional SI: Transaction Execution (diagram: T reads and writes at partitions P1-P3)

Conventional SI: Transaction Commit (diagram: T obtains its commit timestamp from the timestamp authority)
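A sketch of this centralized design, with hypothetical names: one oracle hands out all timestamps, so every transaction pays one round trip at start and another at commit.

```python
import itertools

class TimestampAuthority:
    """Centralized timestamp oracle: a single monotonic counter."""
    def __init__(self):
        self._counter = itertools.count(1)

    def get_timestamp(self):
        # In a real deployment, every call is a network round trip.
        return next(self._counter)

def run_transaction(authority, body):
    snapshot_ts = authority.get_timestamp()  # round trip 1: at start
    writes = body(snapshot_ts)               # execute against partitions
    commit_ts = authority.get_timestamp()    # round trip 2: at commit
    return commit_ts, writes
```

All timestamp traffic funnels through one machine, which is exactly the availability, latency, and throughput problem listed next.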

Problems
- Single point of failure
- Latency (especially with geo-partitioning)
- Throughput limited by a single machine

Clock-SI: Key Idea
- Eliminate the central timestamp authority
- Replace it with loosely synchronized physical clocks

Clock-SI (diagram: the timestamp authority is removed; T interacts only with partitions P1-P3)

Clock-SI: Transaction Start (diagram: T takes its snapshot timestamp from the local clock of the originating partition)

Clock-SI: Transaction Execution (diagram: T reads and writes directly at partitions P1-P3; no authority is contacted)

Clock-SI: Transaction Commit (diagram: T takes its commit timestamp from the local clock of the committing partition)
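A sketch of the Clock-SI timestamp assignment under these assumptions (hypothetical Partition class; the paper's single-partition case):

```python
import time

class Partition:
    """Each partition keeps its own loosely synchronized physical clock."""
    def __init__(self, pid):
        self.pid = pid

    def clock(self) -> float:
        # Local physical clock, kept loosely in sync (e.g. by NTP).
        return time.time()

def snapshot_timestamp(origin: Partition) -> float:
    # Transaction start: read the originating partition's local clock.
    return origin.clock()

def commit_timestamp(committing: Partition) -> float:
    # Single-partition commit: read that partition's local clock.
    # (Multi-partition commits use an augmented 2PC; see the paper.)
    return committing.clock()
```

No machine other than the partitions themselves is involved, which removes the round trips and the single point of failure.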

SI?
- Total order on commits
- Transactions read from a consistent snapshot
- No write-write conflicts between concurrent transactions

SI: Total Order on Commits?
- Ordered by commit timestamp (as before)
- Ties broken by partition id (see the sketch below)
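A minimal illustration of this ordering as a sort key (hypothetical names; commit records shown as (commit_ts, partition_id) tuples):

```python
# Total order on commits: primary key is the commit timestamp; ties
# between partitions with equal timestamps are broken by partition id.
def commit_order_key(commit_ts: float, partition_id: int):
    return (commit_ts, partition_id)

commits = [(10.5, 2), (10.5, 1), (9.8, 3)]
print(sorted(commits, key=lambda c: commit_order_key(*c)))
# -> [(9.8, 3), (10.5, 1), (10.5, 2)]
```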

SI: Consistent Snapshot? Two challenges: clock skew and commit duration.

Clock-SI Challenges: Clock Skew
(timeline across P1 and P2: T1 starts at P1 and takes snapshot timestamp t from P1's clock; it then reads x at P2, whose clock still lags behind t; P2 delays the read until its local clock passes t, as sketched below)
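A sketch of the clock-skew rule (hypothetical interfaces, reusing the Partition and VersionedStore sketches above): if the partition answered before its clock reached t, the snapshot could miss transactions that this partition will still commit with timestamps below t.

```python
import time

def read_with_skew_delay(partition, key, snapshot_ts):
    # Delay the read until the local clock passes the snapshot timestamp.
    # The wait is bounded by the maximum clock skew.
    while partition.clock() < snapshot_ts:
        time.sleep(0.0001)
    return partition.store.read_version(key, snapshot_ts)
```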

Clock-SI Challenges: Commit Duration
(timeline at P1: T1 writes x and starts committing at time t, finishing at T1.commit.finished; T2 starts at time t' inside that window and reads x with snapshot timestamp t' > t, so T2's read is delayed until T1's commit completes, as sketched below)
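A sketch of the commit-duration rule (hypothetical interfaces): a version whose writer has a commit timestamp assigned but whose commit is not yet finished (e.g. its commit record is still being written to disk) can be neither returned nor ignored, so the read waits.

```python
def read_with_commit_delay(partition, key, snapshot_ts):
    # Hypothetical lookup: the transaction currently in its commit
    # window for this key, if any.
    writer = partition.committing_writer(key)
    if writer is not None and writer.commit_ts <= snapshot_ts:
        # The writer must be visible at this snapshot; wait for its
        # commit to finish (short: roughly one local log write).
        writer.wait_until_committed()
    return partition.store.read_version(key, snapshot_ts)
```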

Reducing Delay
- Snapshot timestamp = Clock() - Δ
- If Δ > max commit delay + max clock skew, then no delays (sketch below)
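A sketch of this choice: taking the snapshot slightly in the past sidesteps both delay rules above, at the cost of slightly stale (but still consistent) reads. DELTA here is an assumed value, tuned per deployment.

```python
# If DELTA exceeds the maximum commit delay plus the maximum clock
# skew, neither the skew delay nor the commit-duration delay triggers.
DELTA = 0.010  # e.g. 10 ms; an assumption, not a value from the paper

def older_snapshot_timestamp(partition) -> float:
    return partition.clock() - DELTA
```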

Clock-SI: Advantages
- No single point of failure
- No round-trip latency to a timestamp authority
- Throughput not limited by a single machine
Tradeoff:
- May delay reads for a short time
- May read slightly stale data

Evaluation
- Partitioned key-value store: index and versioned data kept in memory; commit operation log written to disk synchronously; clocks synchronized by NTP (peer mode)
- Latency numbers: synchronous disk write: 6.7 ms; round-trip network latency: 0.14 ms

Latency
- Round trip(s) to the timestamp authority eliminated
- WAN: important
- LAN: important for short read-only transactions; less important for long or update transactions

LAN Transaction Latency

Read Scalability

Write Scalability

Delay Probability
- Analytical model in the paper
- Large data set, random access: very low probability
- Small data set, hotspot: some probability; choosing an older snapshot helps

Hot Spots Read Throughput

Hot Spots Read Latency

In the Paper
- Augmented 2PC protocol
- Analytical model
- More results on abort and delay probability

Conclusion
- Clock-SI = SI for partitioned data stores, based on loosely synchronized clocks
- Fully distributed: no single point of failure, good latency, good throughput
- Tradeoff: occasional delays or slightly stale data