RAMCloud Design Review: Indexing. Ryan Stutsman, April 1, 2010.

Presentation transcript:


Introduction
Should RAMCloud provide indexing?
  o Leave indexes to the client side, using transactions?
Many apps have similar indexing needs
  o Hash indexes, B+Trees, etc.
  o Can reduce app-visible latency for indexes by optimizing server-side

Implementation Issues
Indexing on "opaque" data
Splitting indexes
Consistency
Recovery/availability of indexes

Explicit Search Keys
Problem: RAMCloud treats objects as opaque
  o Server-side indexing without understanding the data?
put(tableId, person.objectId, person.pickle())
[Figure: the object stored as a single opaque blob, "Max Power (650)"]

Explicit Search Keys
Problem: RAMCloud treats objects as opaque
  o Server-side indexing without understanding the data?
Idea: Apps provide search keys explicitly
  o Apps understand the data
put(tableId, person.objectId, {'first': person.first, 'last': person.last}, person.pickle())
[Figure: search keys (first field ID -> Max, last field ID -> Power) stored alongside the blob "Max Power (650)"]

Explicit Search Keys
Problem: RAMCloud treats objects as opaque
  o Server-side indexing without understanding the data?
Idea: Apps provide search keys explicitly
  o Apps understand the data
Can eliminate redundancy
  o Search keys need not be repeated in the object
  o Search keys + blob are returned to the app on get/lookup
put(tableId, person.objectId, {'first': person.first, 'last': person.last}, person.pickle())
[Figure: search keys (first field ID -> Max, last field ID -> Power) stored alongside a blob that now holds only "(650)"]

Explicit Search Keys
Put atomically updates indexes and object
  o Details to follow
put(tableId, objectId, searchKeys, blob)
get(tableId, objectId) -> (searchKeys, blob)
lookup(tableId, indexName, searchValue) -> (searchKeys, blob)
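The three calls above can be mocked up in a few lines to show how they fit together. This is a minimal single-process sketch, not RAMCloud code: the `Store` class, its dictionaries, and the choice to return a list from `lookup` (so that several objects may share a search value) are all assumptions made for illustration, and it does not model the atomicity or distribution discussed on later slides.

```python
import pickle


class Store:
    """Toy in-process stand-in for the proposed API (hypothetical, not RAMCloud code)."""

    def __init__(self):
        self.objects = {}   # (table_id, object_id) -> (search_keys, blob)
        self.indexes = {}   # (table_id, index_name) -> {search_value: set of object_ids}

    def put(self, table_id, object_id, search_keys, blob):
        # Drop index entries left by a previous version of this object.
        old = self.objects.get((table_id, object_id))
        if old is not None:
            for name, value in old[0].items():
                self.indexes[(table_id, name)][value].discard(object_id)
        # Index the new search keys, then store keys and blob together.
        for name, value in search_keys.items():
            index = self.indexes.setdefault((table_id, name), {})
            index.setdefault(value, set()).add(object_id)
        self.objects[(table_id, object_id)] = (dict(search_keys), blob)

    def get(self, table_id, object_id):
        # Returns (search_keys, blob), or None if the object does not exist.
        return self.objects.get((table_id, object_id))

    def lookup(self, table_id, index_name, search_value):
        # Returns every (search_keys, blob) pair indexed under search_value.
        ids = self.indexes.get((table_id, index_name), {}).get(search_value, set())
        return [self.objects[(table_id, i)] for i in sorted(ids)]


store = Store()
blob = pickle.dumps({"phone": "(650)"})   # the app-defined opaque part
store.put(0, 301, {"first": "Max", "last": "Power"}, blob)
keys, stored = store.get(0, 301)
matches = store.lookup(0, "last", "Power")
```

The server in this sketch never unpickles the blob; it only sees the search keys the app chose to expose, which is the point of the explicit-search-key idea.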

Splitting Indexes
Co-locate index and data
Large tables? Large indexes?
  o Can't avoid multi-machine operations
[Figure: Index A-Z and Data 0-99 on Master A; index and data placed across Masters A, B, and C]

Splitting Indexes
Split indexes on search key
  o One extra access per lookup and put
Split indexes on object ID
  o Lookups go to all index fragments
  o Puts are always local
Our decision (for now): split on search key
  o Don't want weakest-link lookup performance
[Figure: index fragments split by object ID and co-located with data (Index 0-99 with Data 0-99), versus index fragments split by search key (Index A-R, Index S-Z) kept separate from data]
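The trade-off between the two splitting schemes can be made concrete by counting how many index fragments a lookup has to contact. The sketch below is illustrative only: the class names are hypothetical, fragments are plain dictionaries rather than servers, and splitting object IDs by modulo is an assumption (the slide's figure suggests ID ranges).

```python
import bisect


class KeyPartitionedIndex:
    """Index split by search-key range: a lookup contacts exactly one fragment."""

    def __init__(self, split_points):
        # split_points=["S"] yields two fragments: keys below "S" and keys from "S" up.
        self.split_points = split_points
        self.fragments = [dict() for _ in range(len(split_points) + 1)]

    def _fragment_for(self, value):
        return bisect.bisect_right(self.split_points, value)

    def insert(self, value, object_id):
        frag = self.fragments[self._fragment_for(value)]
        frag.setdefault(value, set()).add(object_id)

    def lookup(self, value):
        frag = self.fragments[self._fragment_for(value)]
        return frag.get(value, set()), 1          # (hits, fragments contacted)


class IdPartitionedIndex:
    """Index split by object ID: inserts stay local, lookups fan out to all fragments."""

    def __init__(self, num_fragments):
        self.fragments = [dict() for _ in range(num_fragments)]

    def insert(self, value, object_id):
        frag = self.fragments[object_id % len(self.fragments)]
        frag.setdefault(value, set()).add(object_id)

    def lookup(self, value):
        hits = set()
        for frag in self.fragments:               # every fragment must answer
            hits |= frag.get(value, set())
        return hits, len(self.fragments)          # (hits, fragments contacted)


by_key = KeyPartitionedIndex(["S"])
by_id = IdPartitionedIndex(2)
for value, object_id in [("Powell", 300), ("Powers", 299), ("Smith", 302)]:
    by_key.insert(value, object_id)
    by_id.insert(value, object_id)

key_hits, key_contacted = by_key.lookup("Powell")
id_hits, id_contacted = by_id.lookup("Powell")
```

With the ID-based split, a lookup is only as fast as the slowest fragment it waits for, which is the "weakest-link" concern behind the decision to split on search key.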

Consistency
Problem: Index/object inconsistency on puts
  o Object and index may reside on different hosts
  o Apps can get objects that aren't in the index yet
  o Apps may see index entries for objects not in the table yet
Avoid a commit protocol
Idea: Index entries "commit" on object put
  o Write index entries
  o Then write object to table
  o Index entries are considered invalid until the object is written
Turns atomic puts into atomic index updates

Consistency
[Figure: lastName Index: Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Lookup
lookup(0, 'last', 'Power')
Request goes directly to the correct index
  o "Not found" returns immediately
[Figure: lastName Index: Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Lookup
lookup(0, 'last', 'Powell')
'Powell' == 'Powell': ok
Consistency is checked on hit
  o If table and index agree, then return the object
  o Else "not found"
[Figure: lastName Index: Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell. The index entry Powell -> 300 is checked against object 300.]
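The check-on-hit rule from the two lookup slides can be sketched as a single function. This is an illustration under simplifying assumptions, not the authors' implementation: the index is a plain dict mapping one search value to one object ID, the table is a dict of `(search_keys, blob)` pairs, and the function name `checked_lookup` is hypothetical.

```python
def checked_lookup(index, table, index_name, search_value):
    """Return (search_keys, blob) only when the table agrees with the index entry."""
    object_id = index.get(search_value)
    if object_id is None:
        return None                     # index miss: "not found" immediately
    obj = table.get(object_id)
    if obj is None:
        return None                     # entry points at an object that is not written
    search_keys, _blob = obj
    if search_keys.get(index_name) != search_value:
        return None                     # stale entry: object now has another value
    return obj                          # table and index agree


last_index = {"Powell": 300, "Powers": 299, "Power": 301}   # 301 not yet in the table
table = {
    299: ({"first": "Mary", "last": "Powers"}, b"mary-blob"),
    300: ({"first": "Mel", "last": "Powell"}, b"mel-blob"),
}

hit = checked_lookup(last_index, table, "last", "Powell")
dangling = checked_lookup(last_index, table, "last", "Power")
miss = checked_lookup(last_index, table, "last", "Smith")
```

The same comparison covers both failure modes used on the later slides: an entry whose object is missing (create in progress, or delete awaiting cleanup) and an entry whose object carries a different value (update awaiting cleanup).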

Consistency: Create
put(0, 301, {'first': 'Max', 'last': 'Power'}, person.pickle())
Insert index entries before writing the object
[Figure: lastName Index: Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Create
put(0, 301, {'first': 'Max', 'last': 'Power'}, person.pickle())
Insert index entries before writing the object
  o What if a lookup happens in the meantime?
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Concurrent Lookup
put(0, 301, {'first': 'Max', 'last': 'Power'}, person.pickle())
lookup(0, 'last', 'Power')
Concurrent ops ignore inconsistent entries
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Concurrent Lookup
put(0, 301, {'first': 'Max', 'last': 'Power'}, person.pickle())
lookup(0, 'last', 'Power') -> Not Found
Concurrent ops ignore inconsistent entries
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell. The entry Power -> 301 points at an object that is not in the table yet.]

Consistency: Create (continued)
put(0, 301, {'first': 'Max', 'last': 'Power'}, person.pickle())
Insert index entries before writing the object
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Max -> 301, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Create
put(0, 301, {'first': 'Max', 'last': 'Power'}, person.pickle())
Put completes; index entries now valid
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Max -> 301, Mel -> 300. Data Table: Mary Powers, Mel Powell, Max Power (301).]
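The create sequence walked through above (index entries first, object write as the commit point) can be sketched as follows. The `IndexedTable` class and its method names are hypothetical, everything runs in one process, and each search value maps to a single object ID for brevity; the put is split into two methods only so that a lookup can be issued between the steps.

```python
class IndexedTable:
    """Toy table with secondary indexes (a sketch, not RAMCloud's implementation)."""

    def __init__(self):
        self.objects = {}   # object_id -> (search_keys, blob)
        self.indexes = {}   # index_name -> {search_value: object_id}

    def insert_index_entries(self, object_id, search_keys):
        # Step 1: entries exist but stay invalid until the object is written.
        for name, value in search_keys.items():
            self.indexes.setdefault(name, {})[value] = object_id

    def write_object(self, object_id, search_keys, blob):
        # Step 2: the object write is the commit point for the entries above.
        self.objects[object_id] = (dict(search_keys), blob)

    def put(self, object_id, search_keys, blob):
        self.insert_index_entries(object_id, search_keys)
        self.write_object(object_id, search_keys, blob)

    def lookup(self, index_name, search_value):
        # Ignore entries that the table does not confirm.
        object_id = self.indexes.get(index_name, {}).get(search_value)
        obj = self.objects.get(object_id)
        if obj is None or obj[0].get(index_name) != search_value:
            return None
        return obj


table = IndexedTable()
table.put(299, {"first": "Mary", "last": "Powers"}, b"mary-blob")
table.put(300, {"first": "Mel", "last": "Powell"}, b"mel-blob")

new_keys = {"first": "Max", "last": "Power"}
table.insert_index_entries(301, new_keys)
during = table.lookup("last", "Power")      # concurrent lookup before the commit
table.write_object(301, new_keys, b"max-blob")
after = table.lookup("last", "Power")       # same lookup once the put completes
```

In this sketch `during` is `None` and `after` is the new object, matching the "Not Found" and "index entries now valid" states on the slides.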

Consistency: Delete
delete(0, 301)
Delete object first, then clean up index entries
  o Index entries are invalid with no corresponding object
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Max -> 301, Mel -> 300. Data Table: Mary Powers, Mel Powell, Max Power (301).]

Consistency: Delete
delete(0, 301)
Delete object first, then clean up index entries
  o Index entries are invalid with no corresponding object
[Figure: lastName Index: Powell -> 300, Power -> 301, Powers -> 299. firstName Index: Mary -> 299, Max -> 301, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Delete
delete(0, 301)
Delete object first, then clean up index entries
  o Index entries are invalid with no corresponding object
[Figure: lastName Index: Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]
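Delete runs the create sequence in reverse: removing the object invalidates its index entries immediately, and cleanup can follow afterwards. A minimal sketch, assuming plain dictionaries for the table and indexes and hypothetical function names:

```python
def lookup(objects, indexes, index_name, search_value):
    """Check-on-hit lookup: entries without a matching object are ignored."""
    object_id = indexes[index_name].get(search_value)
    obj = objects.get(object_id)
    if obj is None or obj[0].get(index_name) != search_value:
        return None
    return obj


def delete_object(objects, object_id):
    """Step 1: removing the object invalidates its index entries at once."""
    return objects.pop(object_id, None)


def cleanup_index_entries(indexes, object_id, search_keys):
    """Step 2: remove the now-invalid entries; safe to run after the fact."""
    for name, value in search_keys.items():
        if indexes[name].get(value) == object_id:
            del indexes[name][value]


objects = {
    299: ({"first": "Mary", "last": "Powers"}, b"mary-blob"),
    300: ({"first": "Mel", "last": "Powell"}, b"mel-blob"),
    301: ({"first": "Max", "last": "Power"}, b"max-blob"),
}
indexes = {
    "last": {"Powell": 300, "Power": 301, "Powers": 299},
    "first": {"Mary": 299, "Max": 301, "Mel": 300},
}

before = lookup(objects, indexes, "last", "Power")
removed = delete_object(objects, 301)
between = lookup(objects, indexes, "last", "Power")   # stale entry, no object
stale_entry_present = "Power" in indexes["last"]
cleanup_index_entries(indexes, 301, removed[0])
```

Because lookups already ignore entries with no object behind them, nothing observable changes between step 1 and step 2; cleanup only reclaims space.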

Consistency: Update
put(0, 299, {'first': 'Mary', 'last': 'Miller'}, person.pickle())
[Figure: lastName Index: Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Update
put(0, 299, {'first': 'Mary', 'last': 'Miller'}, person.pickle())
Compare previous index entries
  o Insert new value if updated
[Figure: lastName Index: Miller -> 299, Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Powers, Mel Powell.]

Consistency: Update
put(0, 299, {'first': 'Mary', 'last': 'Miller'}, person.pickle())
Commit by writing the new value
  o Old index entries are ignored by lookup since they are inconsistent
[Figure: lastName Index: Miller -> 299, Powell -> 300, Powers -> 299. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Miller, Mel Powell.]

Consistency: Update
put(0, 299, {'first': 'Mary', 'last': 'Miller'}, person.pickle())
Clean up old, inconsistent entries
[Figure: lastName Index: Miller -> 299, Powell -> 300. firstName Index: Mary -> 299, Mel -> 300. Data Table: Mary Miller, Mel Powell.]
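The three update steps above (insert entries for changed keys, commit by writing the object, clean up stale entries) can be sketched in one function. The names are hypothetical and the data structures are plain dictionaries; the optional `observe` callback is an addition for illustration only, so that the state visible to a lookup can be recorded after each step.

```python
def lookup(objects, indexes, index_name, search_value):
    """Check-on-hit lookup: an entry counts only if the object confirms it."""
    object_id = indexes[index_name].get(search_value)
    obj = objects.get(object_id)
    if obj is None or obj[0].get(index_name) != search_value:
        return None
    return obj


def update(objects, indexes, object_id, new_keys, blob, observe=None):
    """Overwrite an existing object, keeping lookups consistent at every step."""
    notify = observe or (lambda step: None)
    old_keys = objects[object_id][0]
    changed = {n: v for n, v in new_keys.items() if old_keys.get(n) != v}

    for name, value in changed.items():           # 1. insert new entries (invalid so far)
        indexes[name][value] = object_id
    notify("inserted")

    objects[object_id] = (dict(new_keys), blob)   # 2. commit by writing the object
    notify("committed")

    for name in changed:                          # 3. clean up old, inconsistent entries
        old_value = old_keys.get(name)
        if old_value is not None and indexes[name].get(old_value) == object_id:
            del indexes[name][old_value]
    notify("cleaned")
    return changed


objects = {
    299: ({"first": "Mary", "last": "Powers"}, b"mary-blob"),
    300: ({"first": "Mel", "last": "Powell"}, b"mel-blob"),
}
indexes = {
    "last": {"Powell": 300, "Powers": 299},
    "first": {"Mary": 299, "Mel": 300},
}

seen = {}   # step -> (is 'Miller' visible?, is 'Powers' visible?)


def observe(step):
    seen[step] = (
        lookup(objects, indexes, "last", "Miller") is not None,
        lookup(objects, indexes, "last", "Powers") is not None,
    )


changed = update(objects, indexes, 299, {"first": "Mary", "last": "Miller"}, b"mary-blob-2", observe)
```

In this sketch the object is reachable under exactly one last name at every step: 'Powers' until the commit, 'Miller' from the commit onward.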

Consistency: Thoughts
Atomic puts give index updates atomicity
Low latency gives simplified consistency
  o Can afford to have a single writer per object
  o Provides us with an atomic put primitive for free

Index Recovery
Problem: Unavailable until indexes recover
  o Many requests will be lookups
  o These will block until indexes are recovered
Rebuild versus store?
  o Storing comes at a cost to write bandwidth
  o With scale, rebuilding may be faster than storing

Index Recovery: Partitioning
How far does partitioning + rebuilding get us?
Worst case: an entire partition holds index data only
  o At most 640 MB
  o Larger indexes are recovered in parallel, one partition per host

Index Recovery: Partitioning
Recover a single index partition on a new master:
1. Data partitions scan and extract index entries (0.6s)
  o Hashtable: 10 million lookups/sec
  o 640 MB / 100 bytes per object = 6.4 million objects
2. Transmit entries to the new index partition (0.6s)
  o At most Gbit/s
3. New index master reinserts entries (0.6s)
  o Similar time to master hashtable scan
All operations are pipelined
  o 0.6s total to scan, extract, transmit, and rebuild
If the data partitions for the index are also in recovery, add 0.6s
  o 1.2s upper bound for a conservative 100-byte object size
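The recovery estimate above can be reproduced as back-of-envelope arithmetic. The sketch below derives only the scan stage from the slide's inputs; the transmit and reinsert stages reuse the slide's 0.6 s figures directly because the slide's network rate is not legible here. Treating 640 MB as 640 × 10^6 bytes is an assumption chosen to match the slide's 6.4 million objects, and the variable names are hypothetical.

```python
PARTITION_BYTES = 640 * 10**6          # worst-case index partition (decimal MB assumed)
OBJECT_BYTES = 100                     # conservative object size from the slide
HASHTABLE_OPS_PER_SEC = 10 * 10**6     # hashtable lookups per second from the slide

# Stage 1: scan the data partitions and extract index entries.
objects_to_scan = PARTITION_BYTES // OBJECT_BYTES
scan_seconds = objects_to_scan / HASHTABLE_OPS_PER_SEC

# Stages 2 and 3: taken from the slide as stated, not derived here.
stage_seconds = {
    "scan": scan_seconds,
    "transmit": 0.6,
    "reinsert": 0.6,
}

# Pipelined stages overlap, so the total is roughly the slowest stage.
pipelined_seconds = max(stage_seconds.values())

# If the data partitions themselves must be recovered first, the slide adds 0.6 s.
upper_bound_seconds = pipelined_seconds + 0.6

print(f"objects to scan:  {objects_to_scan:,}")
print(f"scan stage:       {scan_seconds:.2f} s")
print(f"pipelined total:  {pipelined_seconds:.2f} s")
print(f"upper bound:      {upper_bound_seconds:.2f} s")
```

The derived scan time is 0.64 s, which the slide rounds to 0.6 s; with that rounding the pipelined total and upper bound match the slide's 0.6 s and 1.2 s.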

Summary
Explicit search keys are both flexible and efficient
Split indexes on search key for fast lookup
Atomic puts simplify atomic indexes
Scale drives index recovery for availability

Discussion