Spencer MacBeth Supervisor - Dr. Ramon Lawrence


Linear Hashing for Flash Memory on Resource-Constrained Microprocessors
Spencer MacBeth
Supervisor - Dr. Ramon Lawrence
Faculty - Computer Science

Overview
The scope of this project is largely defined by 3 domains:
Arduinos
- An inexpensive, extensible computing device with limited resources
- Capable of taking in input through an array of accessories, interfacing with it, and producing output
IonDB
- A high-performance implementation of a map data structure that can run in the Arduino environment
- Currently has an interface which can use several different implementations with different tradeoffs
Flash Memory
- Becoming increasingly popular
- Used in many small devices, including Arduinos
- Has asymmetric read and write performance
- Algorithms can be adapted to exploit this property of flash memory

Research Objective: Assess the performance of the linear hash on the Arduino platform

Motivations
- The linear hash data structure has near-optimal performance for the basic hash table operations
- The linear hash maintains its performance while using little main memory
- Currently there is no implementation of a linear hash data structure for the Arduino platform

The Linear Hash

Terminology
Before we explore the implementation details, we need to become familiar with some terms.
- Bucket: a storage unit in the 2D linked-list structure; each bucket stores a set number of records
- Overflow Bucket: when a new record maps to a full bucket, an overflow bucket is created
- Load: the number of records in the table divided by the table's current capacity
- Split: create a new bucket and redistribute the records in the bucket pointed to by the split pointer
Linear hashes are always operating at some load. Split operations are performed periodically to maintain an equal record distribution amongst buckets.

Diagram
The diagram visualizes the structure of the linear hash. The indexed buckets run horizontally along one dimension of the linked list, and the overflow buckets extend vertically downward.
- Records per bucket = 4
- Capacity = 4 * 4 = 16 (capacity excludes overflow buckets)
- Load = 14 / 16 = 87.5%
- Split pointer = Bucket 0

Properties
This setup leads to some desirable properties:
- Constant Time Operations: insert, update, get, and delete run in O(1); the cost of splits remains relatively constant
- Linear Memory Usage: the size of the table grows linearly in proportion to the number of records
- Average Bucket Load Relatively Constant: buckets are split periodically; the hash function used makes a difference
- Configurable: different parameter values can be used in different environments

Insertion Example
With these concepts in mind, consider a brief example. Suppose we had a linear hash table with the following characteristics:
- Initial size = 4 buckets
- 2 records per bucket
- Next bucket to split = 0
- Split performed when load > 80%
- Records with id 2, 3, 4, 5, and 8 have been inserted
- Current load = records / (buckets * records_per_bucket) = 5 / (4 * 2) = 62.5%
Linear Hash Table
- start_size = 4; split_pointer = bucket 0; split_threshold = .80
- h0(k) = k mod start_size
- h1(k) = k mod (2 * start_size)
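The starting state of the example can be reproduced with a few lines of Python. This is only a sketch: the dictionary of lists and the names `START_SIZE`, `RECORDS_PER_BUCKET`, and `load` are illustrative assumptions, not part of IonDB.

```python
START_SIZE = 4           # initial number of indexed buckets (from the slide)
RECORDS_PER_BUCKET = 2   # records per bucket (from the slide)


def h0(key):
    """Bucket assignment function h0(k) = k mod start_size."""
    return key % START_SIZE


def load(num_records, num_buckets, records_per_bucket=RECORDS_PER_BUCKET):
    """Load = records / capacity, where capacity counts indexed buckets only."""
    return num_records / (num_buckets * records_per_bucket)


# Place the five records from the slide into their buckets.
buckets = {index: [] for index in range(START_SIZE)}
for key in (2, 3, 4, 5, 8):
    buckets[h0(key)].append(key)

current_load = load(5, START_SIZE)
print(buckets)        # bucket 0 holds 4 and 8, so it is already full
print(current_load)   # 0.625, i.e. 62.5%
```

Bucket 0 is the only full bucket at this point, which is why the next insert that maps to it needs an overflow bucket.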

Insertion Example: state after inserting 16
First we insert record 16 into the table. We use the bucket assignment function h0(k) = k mod start_size to determine which bucket to put it in.
- h0(16) = 16 mod 4 = 0
- Bucket 0 is full so an overflow bucket is created
- Load = 6 / 8 = 75%

Insertion Example: state after inserting 9
Next we insert record 9. First we apply the bucket assignment function.
- h0(9) = 9 mod 4 = 1
- Bucket 1 is not full so no overflow bucket is created
- Load = 7 / 8 = 87.5%
- Since 87.5% is above the split threshold, a split is performed on the bucket pointed to by the split pointer, which is bucket 0

Split Example
Linear Hash Table
- start_size = 4; split_pointer = bucket 1; split_threshold = .80
- h0(k) = k mod start_size
- h1(k) = k mod (2 * start_size)
The split is conducted as follows:
- Add a new indexed bucket with an index of n, where n is the number of buckets before the new bucket is added
- For each record in the bucket to split:
  - b0 = h0(record.key)
  - b1 = h1(record.key)
  - If (b0 != b1): delete the record from bucket b0 and insert the record into bucket b1
- Increment the split pointer
Note that h1 returns either the index of the bucket being split or the index of the latest bucket created.
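The insert and split steps above can be sketched in Python as follows. This is a minimal in-memory model, not the IonDB implementation: each indexed bucket is a flat list (records beyond `RECORDS_PER_BUCKET` stand in for overflow buckets), and the class and attribute names are hypothetical. Two parts are standard linear hashing rather than something the slides state: using h1 for lookups on buckets that have already been split, and doubling the round size once every bucket has been split.

```python
START_SIZE = 4
RECORDS_PER_BUCKET = 2
SPLIT_THRESHOLD = 0.80


class LinearHashSketch:
    def __init__(self):
        self.buckets = [[] for _ in range(START_SIZE)]
        self.split_pointer = 0
        self.round_size = START_SIZE   # bucket count at the start of this round
        self.num_records = 0

    def h0(self, key):
        return key % self.round_size

    def h1(self, key):
        return key % (2 * self.round_size)

    def bucket_for(self, key):
        """Use h1 for buckets that were already split in this round."""
        index = self.h0(key)
        if index < self.split_pointer:
            index = self.h1(key)
        return index

    def load(self):
        # Capacity counts indexed buckets only, not overflow buckets.
        return self.num_records / (len(self.buckets) * RECORDS_PER_BUCKET)

    def insert(self, key):
        self.buckets[self.bucket_for(key)].append(key)
        self.num_records += 1
        if self.load() > SPLIT_THRESHOLD:
            self.split()

    def split(self):
        self.buckets.append([])        # new bucket, index n
        old = self.split_pointer
        records = self.buckets[old]
        self.buckets[old] = [k for k in records if self.h1(k) == old]
        for key in records:
            b1 = self.h1(key)
            if b1 != old:              # b0 != b1: move the record
                self.buckets[b1].append(key)
        self.split_pointer += 1
        if self.split_pointer == self.round_size:
            self.round_size *= 2       # every bucket split: start a new round
            self.split_pointer = 0


table = LinearHashSketch()
for key in (2, 3, 4, 5, 8, 16, 9):
    table.insert(key)
print(table.buckets, table.split_pointer)
```

Running the example's inserts ends with record 4 moved into the new bucket 4 and the split pointer at bucket 1, which matches the state shown on the slide.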

Implementation for the Arduino Environment

Specifications
ATmega2560
- Uses flash memory for programs
- Relatively high storage capacity from micro SD card (a 32 GB Lexar MicroSD card was used during testing)
- 8KB of RAM
- 4KB of local storage on device, not used

Strategies Used
Swap-on-Delete
- When deleting a record, pluck the last record in the last bucket for this index and use it to fill the hole in the list.
- Consequences: increased insert performance, as an empty location is always known; 3 additional disk accesses performed for every delete (read swap bucket, update swap bucket, write swap record)
Reversed Linked-List
- When creating an overflow bucket, instead of updating the tail bucket in the list, the new overflow bucket points to the previous head.
- Consequences: eliminates an additional write during insert (writes are more expensive than reads on flash)
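Both strategies can be illustrated on a single bucket chain. The sketch below is an in-memory stand-in under stated assumptions: `Bucket`, `BucketChain`, and their methods are hypothetical names, disk accesses are not modelled, and the "last bucket" is taken to be the head of the reversed list, since that is where new records go. After filling a hole the same slot is checked again, so a swap record that carries the deleted key is removed right away.

```python
RECORDS_PER_BUCKET = 2   # arbitrary choice for the sketch


class Bucket:
    def __init__(self, next_bucket=None):
        self.records = []
        self.next = next_bucket


class BucketChain:
    """One indexed bucket plus its overflow buckets."""

    def __init__(self):
        self.head = Bucket()

    def insert(self, key):
        if len(self.head.records) == RECORDS_PER_BUCKET:
            # Reversed linked list: the new bucket points at the previous
            # head, so no existing bucket has to be rewritten.
            self.head = Bucket(next_bucket=self.head)
        self.head.records.append(key)

    def _take_swap_record(self):
        """Remove the last record of the head; unlink the head if it empties."""
        record = self.head.records.pop()
        if not self.head.records and self.head.next is not None:
            self.head = self.head.next
        return record

    def delete(self, key):
        """Delete every record with this key; return how many were removed."""
        deleted = 0
        bucket = self.head
        while bucket is not None:
            slot = 0
            while slot < len(bucket.records):
                if bucket.records[slot] != key:
                    slot += 1
                    continue
                deleted += 1
                swap = self._take_swap_record()
                # If the hole was the swap slot itself, nothing is left to fill.
                if slot < len(bucket.records):
                    bucket.records[slot] = swap   # slot is re-checked next pass
            bucket = bucket.next
        return deleted

    def keys(self):
        found, bucket = [], self.head
        while bucket is not None:
            found.extend(bucket.records)
            bucket = bucket.next
        return found


chain = BucketChain()
for key in (4, 8, 16, 8, 20):
    chain.insert(key)
removed = chain.delete(8)
print(removed, chain.keys())
```

Because holes are always filled from the head, every bucket behind the head stays full and the only free slots are in the head, which is what lets an insert find an empty location without scanning.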

Strategies Used
Eager Deletions During Swap
- In the IonDB standard, all records with the specified key are deleted. If the swap record retrieved has this key, it is deleted immediately.
- Consequences: reduces the number of disk accesses during deletes; this gain is proportional to the number of records with the same key in the table
Bucket Caching
- During operations where all records in a bucket are checked against some condition, the entirety of the bucket and its records is read into main memory.
- Consequences: reduces the number of disk accesses by a factor of the number of records per bucket on average; requires the cache to be updated when the data file is mutated
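The effect of bucket caching on access counts can be sketched with a simulated store. Everything here is an assumption made for illustration: `CountingStore` is a hypothetical stand-in for the data file, and one call to `read_record` or `read_bucket` is counted as one disk access.

```python
class CountingStore:
    """Stand-in for the data file that counts simulated disk accesses."""

    def __init__(self, buckets):
        self.buckets = buckets
        self.accesses = 0

    def read_record(self, bucket_index, slot):
        self.accesses += 1
        return self.buckets[bucket_index][slot]

    def read_bucket(self, bucket_index):
        self.accesses += 1
        return list(self.buckets[bucket_index])


def count_matches_uncached(store, bucket_index, key, records_per_bucket):
    """Check every record with one access per record."""
    matches = 0
    for slot in range(records_per_bucket):
        if store.read_record(bucket_index, slot) == key:
            matches += 1
    return matches


def count_matches_cached(store, bucket_index, key):
    """Read the whole bucket into memory once, then check it there."""
    cached = store.read_bucket(bucket_index)
    return sum(1 for record in cached if record == key)


store = CountingStore([[7, 3, 7, 9]])
uncached_matches = count_matches_uncached(store, 0, 7, 4)
uncached_accesses = store.accesses

store.accesses = 0
cached_matches = count_matches_cached(store, 0, 7)
cached_accesses = store.accesses
print(uncached_accesses, cached_accesses)
```

With 4 records per bucket the scan drops from 4 accesses to 1, the factor-of-records-per-bucket reduction described above. The sketch leaves out the cost the slide also mentions: the cached copy has to be refreshed whenever the data file changes.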

INSERT Operation
Figure: Time for Inserts vs Records in Table
- Insert time remains constant as the linear hash grows
- Groupings show the triggering of splits (highest time grouping) and the creation of overflow buckets (mid-level grouping)
- The group where time is consistently very low is for inserts into a bucket that is not full

GET Operation
Figure: Time for Gets vs Records in Table
- Constant average retrieval time
- Some variance due to randomly generated values: some buckets will have more overflow buckets than others
- May require a scan of the linked list of overflow buckets

DELETE Operation
Figure: Time for Deletes vs Records in Table
- Average delete time remains constant
- Some variance again due to randomly generated values: some buckets will have more overflow buckets than others
- In the IonDB standard, the performance of delete is proportional to the number of records that share keys

Record Distribution
Figure: Record Counts in Bucket Groups (axes: Record Count in Group, Groups of 50 Consecutive Buckets)
- Record distribution remains relatively equal
- Polynomial string hashing was used on keys before bucket assignment to reduce collisions
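A polynomial string hash treats the bytes of the key as coefficients of a polynomial evaluated at a fixed base. The slides do not give the base or modulus used, so the values below (31 and 2**32) and the function name are assumptions chosen only to make the sketch concrete.

```python
def polynomial_hash(key, base=31, modulus=2**32):
    """Horner evaluation of the key's bytes as a polynomial in `base`."""
    value = 0
    for byte in key.encode("utf-8"):
        value = (value * base + byte) % modulus
    return value


def bucket_for(key, num_buckets):
    """Hash the key first, then apply the modulo bucket assignment."""
    return polynomial_hash(key) % num_buckets


print(polynomial_hash("ab"))   # 97 * 31 + 98 = 3105
print(bucket_for("ab", 4))     # 3105 mod 4 = 1
```

Hashing before the modulo step matters because keys that share low-order structure would otherwise collapse onto a few buckets under k mod n.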

Performance Comparisons

Linear Hash vs. Flat File - Arduino
- Linear hash mean get time = 2.356 ms
- Flat file mean get time = 60.577 ms
- Gap when the flat file falls out of caching

Linear Hash vs. B+ Tree - PC
- Linear hash mean insert time = 0.148 ms
- B+ tree mean insert time = 0.160 ms
- The constant coefficients that affect the B+ tree are small enough that we cannot visualize the logarithmic curve at this scale

Conclusion
- The linear hash data structure maintains its constant-time operations on the Arduino platform
- Swap-on-delete outperforms tombstoning for delete operations when conforming to the IonDB standard
- The specific implementation of the linear hash used outperforms a B+ tree data structure on disk

A special thank you to Dr. Ramon Lawrence and Eric Huang for their continuous support and guidance!