Evaluating and Optimizing Indexing Schemes for a Cloud-based Elastic Key-Value Store
Apeksha Shetty and Gagan Agrawal, Ohio State University; David Chiu, Washington State University

Presentation transcript:

Evaluating and Optimizing Indexing Schemes for a Cloud-based Elastic Key-Value Store
Apeksha Shetty and Gagan Agrawal, Ohio State University
David Chiu, Washington State University
CCGrid 2011, May 23-26, Newport Beach, CA

Slide 2: A Truncated Intro to Cloud Computing
‣ Pay-As-You-Go Computing
‣ We focus on IaaS
  ‣ E.g., Amazon EC2
‣ Elasticity

Slide 3: Elasticity and Distributed Hash Tables (DHTs)
‣ IaaS Cloud:
  ‣ Applications can incrementally scale up and relax resource requirements on demand: elasticity
‣ Distributed Hash Tables:
  ‣ Manage distributed storage over many commodity nodes
  ‣ Increasingly popular for providing massive, reliable storage
  ‣ E.g., Facebook's Cassandra, Amazon Dynamo, many P2P networks

Slide 4: Elasticity and DHTs (cont.)
‣ Clearly, DHTs can benefit from elasticity by harnessing more or fewer nodes as needed.
‣ But the performance of a DHT can be greatly affected by the indexing mechanism on each cooperating node.
‣ We evaluate the effects of three popular indexes: B+Trees, extendible hashing, and Bloom filters.

Slide 5: Outline
‣ Intro to Distributed Hash Tables
‣ Three Indexing Schemes
‣ Performance Evaluation
‣ Conclusion

Slide 7: Anatomy of a DHT Using Consistent Hashing
[Figure: a consistent-hashing ring over the key space 0 to r - 1, with storage nodes A and B placed on the ring]
‣ Nodes: the data on each node is further indexed using one of the indexing schemes

Slide 8: Anatomy of a DHT Using Consistent Hashing
[Figure: the same ring, highlighting the hash buckets along the key space]
‣ Buckets: each bucket points to at most one storage node

Slide 9: Querying over the DHT: Clockwise Successor
[Figure: cache requests are hashed onto the ring (k mod r); each request is served by its clockwise successor node]
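To make the routing rule concrete, here is a minimal Python sketch of clockwise-successor lookup on a consistent-hashing ring. It is illustrative only, not the system's code; the node names and ring positions are made up, loosely following the figure.

```python
import bisect

class ConsistentHashRing:
    """Minimal sketch of the routing described above (illustrative only).

    Node positions and request keys are hashed onto the ring 0 .. r - 1;
    a request is served by the clockwise successor of its ring position.
    """
    def __init__(self, r=2**32):
        self.r = r
        self.positions = []   # sorted node positions on the ring
        self.nodes = {}       # ring position -> node name

    def add_node(self, name, position):
        bisect.insort(self.positions, position)
        self.nodes[position] = name

    def successor(self, pos):
        """Return the clockwise successor node of ring position pos."""
        i = bisect.bisect_left(self.positions, pos)
        if i == len(self.positions):
            i = 0             # wrap past r - 1 back to the start of the ring
        return self.nodes[self.positions[i]]

    def lookup(self, key):
        return self.successor(hash(key) % self.r)   # k mod r, then successor

# Assumed example: two nodes A and B at made-up ring positions.
ring = ConsistentHashRing(r=100)
ring.add_node("A", 8)
ring.add_node("B", 76)
print(ring.lookup("shoreline-tile-42"))   # routed to A or B
```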

Slide 10: Overflow and Splitting Nodes in the DHT
[Figure: a high-traffic region of the ring causes one node to overflow]

Slide 11: Overflow and Splitting Nodes in the DHT
[Figure: a new node C is allocated from the IaaS cloud (EC2); nearly half the keys hashing into the (76, 8] range are migrated to node C]

Slide 12: After Split and Incremental Scaling
[Figure: the ring after the split, with nodes A, B, and C serving cache requests (k mod r)]
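The split in the last three slides can be sketched in the same style (assumed names, not the paper's code): a new node is placed on the ring, and every key whose position it now owns is moved over. This is where the per-node index matters, since the migration is a contiguous range scan over the old node's keys.

```python
def split_node(ring, stores, overloaded, new_name, new_position):
    """Sketch of a node split: add new_name to the ring, then migrate every
    key whose ring position the new node now owns.

    `stores` maps node name -> {ring_position: value}; a real node would back
    this with its local index (B+Tree, extendible hash, ...), which is where
    range-scan performance dominates migration time.
    """
    ring.add_node(new_name, new_position)
    stores[new_name] = {}
    for pos in list(stores[overloaded]):
        if ring.successor(pos) == new_name:          # ownership has moved
            stores[new_name][pos] = stores[overloaded].pop(pos)
    return len(stores[new_name])                     # number of keys migrated

# Continuing the ring above: node C takes over part of A's (76, 8] range.
stores = {"A": {2: "v2", 5: "v5", 80: "v80", 95: "v95"}, "B": {20: "v20"}}
moved = split_node(ring, stores, overloaded="A", new_name="C", new_position=95)
print(moved, stores["C"])   # the keys at positions 80 and 95 now live on C
```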

Slide 14: Indexing Schemes per Node
‣ On each DHT node, an index is used to provide fast access to the key-value pairs.
‣ We evaluate three popular schemes: B+Trees, extendible hashing, and Bloom filters.
‣ Each scheme may impact split and migration time significantly.

Slide 15: B+Trees
‣ Always balanced
‣ The leaf level is sorted in ascending order on the search key

Slide 16: B+Trees (cont.)
‣ Key point queries: O(log n)
‣ Fast key range queries [k_low, k_high]:
  ‣ Point search for k_low: O(log n)
  ‣ Linearly sweep the leaf nodes in order until k_high
  ‣ Significant when splitting/migrating!
‣ A popular option for DHT nodes
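A full B+Tree is too long to show here, so the following sketch stands in a sorted key array for the leaf level to illustrate the access pattern the slide describes: an O(log n) search for k_low followed by an in-order sweep to k_high, which is exactly the pattern a node uses when handing a key range to a new node.

```python
import bisect

def range_scan(sorted_keys, k_low, k_high):
    """Collect all keys in [k_low, k_high] from an ascending-sorted key list.

    Mirrors a B+Tree range query: an O(log n) descent to the first
    qualifying entry, then a linear sweep until k_high is passed.
    """
    start = bisect.bisect_left(sorted_keys, k_low)   # O(log n) point search
    result = []
    for k in sorted_keys[start:]:                    # in-order leaf sweep
        if k > k_high:
            break
        result.append(k)
    return result

# Example: the keys a node would hand off when splitting a range.
keys = [3, 8, 15, 22, 37, 41, 58, 76, 90]
print(range_scan(keys, 15, 58))   # [15, 22, 37, 41, 58]
```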

Slide 17: Extendible Hashing
‣ Each bucket holds up to M keys
‣ A key is placed into a bucket by examining the least significant bit(s) of its hash

Slide 18: Extendible Hashing (cont.)
‣ Key point queries: O(1)
‣ Slow key range queries [k_low, k_high]:
  ‣ Hashing disrupts the natural key ordering
  ‣ Finding every key in the range requires a separate hash lookup
‣ The data structure (directory) can grow exponentially
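For comparison, a minimal extendible-hashing sketch (illustrative, not the paper's implementation): a directory of 2^d slots maps the least significant bits of hash(key) to buckets of at most M keys, giving O(1) point lookups; range queries would still have to probe each key individually.

```python
class ExtendibleHash:
    """Minimal extendible-hashing sketch: LSB directory, bucket splits."""

    def __init__(self, bucket_size=4):     # bucket_size plays the role of M
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [{"depth": 1, "keys": {}}, {"depth": 1, "keys": {}}]

    def _slot(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)   # least significant bits

    def get(self, key):                     # O(1) point query
        return self.directory[self._slot(key)]["keys"].get(key)

    def put(self, key, value):
        bucket = self.directory[self._slot(key)]
        bucket["keys"][key] = value
        if len(bucket["keys"]) > self.bucket_size:
            self._split(bucket)             # one split per overflow, for brevity

    def _split(self, bucket):
        if bucket["depth"] == self.global_depth:
            self.directory = self.directory + self.directory   # directory doubles
            self.global_depth += 1
        bucket["depth"] += 1
        new_bucket = {"depth": bucket["depth"], "keys": {}}
        high_bit = 1 << (bucket["depth"] - 1)
        for i in range(len(self.directory)):        # repoint half of the slots
            if self.directory[i] is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        old_keys, bucket["keys"] = bucket["keys"], {}
        for k, v in old_keys.items():               # redistribute on the new bit
            self.directory[self._slot(k)]["keys"][k] = v

eh = ExtendibleHash(bucket_size=2)
for k in range(10):
    eh.put(k, f"value-{k}")
print(eh.get(7), eh.global_depth)
```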

Slide 19: Bloom Filters
‣ Highly space-efficient, but probabilistic
‣ Often useful as a secondary index for efficiently determining key existence
‣ K hash functions are applied to a key; if any of the corresponding bits is 0, the key is definitely not in the set

Slide 20: Bloom Filters (cont.)
‣ Same analysis as extendible hashing, but must deal with false positives
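As an illustration (again, not the paper's implementation), a plain Bloom filter: k hash functions set k bits per inserted key, and a lookup answers "definitely absent" when any of its bits is 0, otherwise only "possibly present". Deriving the bit positions from salted SHA-1 digests is an assumption of this sketch.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions."""

    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive k bit positions from salted SHA-1 digests of the key.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means "definitely not present"; True only "possibly present".
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("key-42")
print(bf.might_contain("key-42"))   # True
print(bf.might_contain("key-99"))   # False, with high probability
```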

Slide 22: Experimental Configuration
‣ Application: service caching (shoreline extraction)
  ‣ The shoreline extraction service checks the DHT for a cached copy
‣ 65K distinct point queries, submitted randomly
‣ The DHT starts cold, with 1 EC2 node
‣ Amazon EC2
  ‣ Small instances (single core, 1.2 GHz, 1.7 GB memory, 32-bit)
  ‣ Ubuntu 9.10 Linux

Slide 23: Elastic Feasibility of the DHT (Using B+Tree)
[Chart] ** Migration occurs at each EC2 node allocation

Slide 25: Experimental Configuration
‣ Each experiment is run a minimum of 3 times (we report the average)
‣ Over the following indexing configurations:
  ‣ B+Tree
  ‣ Extendible hashing, bucket size = 100 keys (EH100)
  ‣ Extendible hashing, bucket size = 300 keys (EH300)
  ‣ Extendible hashing, bucket size = 500 keys (EH500)
  ‣ Bloom filter (CBF)

Slide 26: Execution Time: 50 requests/sec
[Chart: execution time per indexing configuration at a querying rate of 50 requests/sec]

Slide 28: Node Split/Migration: 50 requests/sec
[Chart: split/migration cost per indexing configuration] ** 7 migrations throughout execution

Slide 29: Execution Time: 250 requests/sec
[Chart: execution time per indexing configuration at a querying rate of 250 requests/sec]

Slide 30: Node Split/Migration: 250 requests/sec
[Chart: split/migration cost per indexing configuration] ** 15 migrations throughout execution

Slide 31: A Small Optimization
‣ We can speculate on node-split potential and pre-launch instances
‣ Launch a new instance when the total number of keys in any node reaches a threshold T
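A minimal sketch of this threshold heuristic, with hypothetical names (`launch_instance`, `PreLauncher`, and the threshold value are assumptions, not the paper's actual API): once any node's key count reaches T, a standby instance is requested in the background so it is already booted by the time the node overflows and splits.

```python
import threading

PRELAUNCH_THRESHOLD_T = 50_000      # assumed value of T; tune to node capacity

def launch_instance():
    """Placeholder for an asynchronous IaaS provisioning call (e.g., on EC2)."""
    print("pre-launching a standby instance...")

class PreLauncher:
    """Speculatively boots an instance before a node actually splits."""

    def __init__(self, threshold=PRELAUNCH_THRESHOLD_T):
        self.threshold = threshold
        self.requested = False

    def observe(self, node_key_counts):
        """Call after inserts with the current per-node key counts."""
        if self.requested:
            return
        if any(count >= self.threshold for count in node_key_counts):
            self.requested = True
            # Boot in the background so the instance is ready at split time.
            threading.Thread(target=launch_instance, daemon=True).start()

prelauncher = PreLauncher()
prelauncher.observe([12_000, 51_250, 8_400])   # the second node crosses T
```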

Slide 32: Pre-launching: Query Rate: 50 requests/sec
[Chart: split/migration cost with pre-launching] ** 7 migrations throughout execution

Slide 33: Experimental Summary
‣ In general:
  ‣ As expected, the B+Tree performs best for splitting and migration on average
  ‣ Bloom filters should be avoided, but can be useful as a space-constrained secondary index

Slide 34: Experimental Summary (cont.)
‣ If the environment is point-query heavy, an optimally configured extendible hash index can outperform B+Trees
  ‣ But it is hard to configure in practice
‣ Our instance-startup speculation heuristic improves the node-splitting overhead by 4x and 14x for EH300 and Bloom filters, respectively

Slide 35: Related Works
‣ Web caching, proxies
  ‣ memcached
  ‣ CRISP Proxy (Rabinovich et al.)
‣ DHTs for P2P file sharing
  ‣ Chord (Stoica et al.)
‣ Other key-value data stores
  ‣ Dynamo (Amazon)
  ‣ Cassandra (Facebook)

Slide 36: Thank You, and Acknowledgments
‣ Questions and comments:
  ‣ David Chiu -
  ‣ Gagan Agrawal -
‣ This project was supported by: