A Look at Modern Dictionary Structures & Algorithms Warren Hunt.


Dictionary Structures
- Used for storing information
  - (key, value) pairs
- Bread and butter of a data structures and algorithms course

Common Dictionary Structures
- List (Array)
  - Sorted List
- Linked List
  - Move to Front List
  - Inverted Index List
  - Skip List (check this one out)
  - …

Common Dictionary Structures
- (Balanced) Binary Search Trees
  - AVL Tree
  - Red-Black Tree
  - Splay Tree
  - B-Tree
  - Trie
  - Patricia Tree
  - …

Common Dictionary Structures
- Hash Tables
  - Linear (or Quadratic) Probing
  - Separate Chaining (or Treeing)
  - Double Hashing
  - Perfect Hashing
  - Hash Trees
  - Cuckoo Hashing (d-ary, binned)
  - …

+Every Hybrid You Can Think Of!
- Unfortunately, they don’t teach the cool ones…
  - Skip lists are a faster, easier-to-code alternative to most binary search trees
    - Invented in 1990!
  - Cuckoo hashing has a huge number of nice properties (IMHO far superior to all other hashing designs)
    - Invented in 2001

So many to choose from!
- Which is best? That depends on your needs…
- Sorted lists are simple and easy to implement (simple means fast on small datasets!)
- Binary search trees and sorted lists provide easy access to sorted data
- B-trees have great page performance for databases
- Hash tables have the fastest asymptotic lookup time

Focus On Hashing for Now
- Fastest lookup/insert/delete time: O(1)
- Used in Bloom filters
  - not the graphics kind!
- Useful in garbage collection (or anywhere you want to mark things as visited)
- Small hash tables implement an associative cache
- Easy to implement! (no pointer chasing)

Traditional Hashing
- Just make up an address in an array for some piece of data and stick it there
  - Hash function generates the address
- Problems arise when two things have the same address, so we’ll address that:
  - Linear (or Quadratic) Probing
  - Separate Chaining (Treeing…)
  - Double Hashing
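To make the "make up an address and stick it there" idea concrete, here is a minimal sketch of linear probing in C. It is not from the talk: the fixed capacity, the `lp_` names, and the multiplicative hash constant are all assumptions chosen for illustration, and deletion is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LP_CAPACITY 16u   /* hypothetical fixed capacity; must be a power of two */

typedef struct {
    uint32_t keys[LP_CAPACITY];
    bool     used[LP_CAPACITY];
} lp_table;

/* Hypothetical multiplicative hash: keep the top 4 bits (16 slots). */
static uint32_t lp_hash(uint32_t key) {
    return (uint32_t)(key * 2654435761u) >> 28;
}

/* Insert a key. Returns false only if every slot is taken by other keys. */
static bool lp_insert(lp_table *t, uint32_t key) {
    assert(t != NULL);
    uint32_t start = lp_hash(key);
    for (uint32_t n = 0; n < LP_CAPACITY; n++) {
        uint32_t slot = (start + n) & (LP_CAPACITY - 1u);  /* wrap around */
        if (!t->used[slot]) {            /* first free slot after the address */
            t->used[slot] = true;
            t->keys[slot] = key;
            return true;
        }
        if (t->keys[slot] == key) {
            return true;                 /* already present */
        }
    }
    return false;
}

/* Look a key up by walking the same probe sequence. */
static bool lp_contains(const lp_table *t, uint32_t key) {
    assert(t != NULL);
    uint32_t start = lp_hash(key);
    for (uint32_t n = 0; n < LP_CAPACITY; n++) {
        uint32_t slot = (start + n) & (LP_CAPACITY - 1u);
        if (!t->used[slot]) {
            return false;                /* an empty slot ends the probe run */
        }
        if (t->keys[slot] == key) {
            return true;
        }
    }
    return false;
}
```

A zero-initialized `lp_table` is an empty table. Note that the probe loop can touch up to `LP_CAPACITY` slots when the table is nearly full, which is the "bad things happen" case for open addressing.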

Problems With Traditional Hashing
- Without separate chaining, they can’t get too full or bad things happen
- With separate chaining, we have poor cache performance and still O(n) worst-case behavior
  - Separate treeing provides O(log n) worst case, but they don’t teach that in school…
- Linear probing is still the most common (fastest cache behavior, bite the bullet on poorer memory utilization)

Good Hash Functions
- All hash table implementations require good hash functions (with the exception of separate treeing)
  - Universal hash functions are required (number theory, I won’t discuss it here)
  - Cuckoo hashing is less strict (different assumptions are made in each paper to make proofs easier)

Cuckoo Hashing
- Guaranteed O(1) lookup/delete
- Amortized O(1) insert
- 50% space efficient
- Requires *mostly* random hash functions
  - Newish and largely unknown (barely mentioned in Wikipedia: Hash Tables)

Cuckoo Hashing
- Use two hash tables and two hash functions
- Each element will have exactly one “nest” (hash location) in each table
  - Guarantee that any element will only ever exist in one of its “nests”
  - Lookup/delete are O(1) because we can check 2 locations (“nests”) in O(1) time

Cuckoo Hashing - Insertion
1. Insert an element by finding one of its “nests” and putting it there
   - This may evict another element! (goto 2.)
2. Insert the evicted element into its *other* “nest”
   - This may evict another element! (goto 2.)
Under reasonable assumptions, this process will terminate in expected O(1) time…
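The two-step insertion loop can be sketched in C as follows. This is an illustration under stated assumptions, not the speaker's code: the table size, the `ck_` names, the two fixed multiplicative hashes, and the kick limit are all hypothetical, and a real table would choose its hash functions at random and rehash when the loop gives up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CK_SIZE      16u   /* slots per table (hypothetical, power of two) */
#define CK_MAX_KICKS 32    /* depth cutoff for the eviction chain */

typedef struct {
    uint32_t key[2][CK_SIZE];
    bool     used[2][CK_SIZE];
} cuckoo_table;

/* Two hypothetical multiplicative hashes, one per table (top 4 bits). */
static uint32_t ck_hash(int which, uint32_t key) {
    static const uint32_t mult[2] = { 2654435761u, 2246822519u };
    return (uint32_t)(key * mult[which]) >> 28;
}

/* Lookup checks exactly two nests, one in each table. */
static bool ck_contains(const cuckoo_table *t, uint32_t key) {
    assert(t != NULL);
    for (int w = 0; w < 2; w++) {
        uint32_t i = ck_hash(w, key);
        if (t->used[w][i] && t->key[w][i] == key) {
            return true;
        }
    }
    return false;
}

/* Insert with eviction. On failure, *homeless receives the element that
 * was left without a nest (it may differ from the key passed in). */
static bool ck_insert(cuckoo_table *t, uint32_t key, uint32_t *homeless) {
    assert(t != NULL && homeless != NULL);
    if (ck_contains(t, key)) {
        return true;
    }
    int w = 0;                                /* step 1: start in table 0 */
    for (int kick = 0; kick < CK_MAX_KICKS; kick++) {
        uint32_t i = ck_hash(w, key);
        if (!t->used[w][i]) {
            t->key[w][i] = key;
            t->used[w][i] = true;
            return true;
        }
        uint32_t evicted = t->key[w][i];      /* take the nest, evict occupant */
        t->key[w][i] = key;
        key = evicted;
        w = 1 - w;                            /* step 2: evictee's other nest */
    }
    *homeless = key;
    return false;
}
```

When `ck_insert` returns false, the caller still owns the element in `*homeless` and is expected to rebuild the table with new hash functions before reinserting it; that rebuild is not shown here.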

Why does this work?
- Matching property of random graphs
- With high probability, any matching under a saturation threshold (50% in this case) can take another edge without breaking
- More details in the paper

Overflowing the Table
- Insertion can potentially fail, causing an infinite insertion loop
- Detected using a depth cutoff
  - Due to unlucky hash functions
  - Due to a full hash table
- Double the size of the table (if need be), choose new hash functions and rehash all of the elements

Example
- To the board!

Asymmetric Cuckoo Hashing
- Choose one (the first) table to be larger than the other
  - Improves the probability that we get a hit on the first lookup
  - Only a minor slowdown on insert

Same Table Cuckoo Hashing
- We didn’t actually need two separate tables.
  - It made the analysis much easier
  - But… in practice, we just need two hash functions

d-ary Cuckoo Hashing
- Guaranteed O(1) lookup/delete
- Amortized O(1) insert
- 97%+ space efficient
- Analysis requires random hash functions
  - (not quite as easy to implement)
  - (robust against crappier hash functions)

d-ary Cuckoo Hashing
- Use d hash tables instead of two!
  - Lookup and delete look at d buckets
  - Insert is more complicated
- Insertion sees a tree of possible eviction+insertion paths
  - BFS to find an empty nest
  - Random walk to find an empty nest (easier)
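A rough C sketch of the random-walk variant with d = 3 follows. Everything named here is an assumption for illustration: the `da_` identifiers, the table size, the three multiplicative constants, and the small linear congruential generator used to pick which nest to evict from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DA_D         3     /* d: number of tables and hash functions */
#define DA_SIZE      16u   /* slots per table (hypothetical, power of two) */
#define DA_MAX_KICKS 64    /* depth cutoff for the random walk */

typedef struct {
    uint32_t key[DA_D][DA_SIZE];
    bool     used[DA_D][DA_SIZE];
    uint32_t rng;          /* state of the toy random number generator */
} dary_table;

/* Hypothetical multiplicative hashes, one per table (top 4 bits). */
static uint32_t da_hash(int which, uint32_t key) {
    static const uint32_t mult[DA_D] = { 2654435761u, 2246822519u, 3266489917u };
    return (uint32_t)(key * mult[which]) >> 28;
}

/* Toy linear congruential generator; returns the high 16 bits of the state. */
static uint32_t da_rand(dary_table *t) {
    t->rng = t->rng * 1664525u + 1013904223u;
    return t->rng >> 16;
}

/* Lookup looks at d nests. */
static bool da_contains(const dary_table *t, uint32_t key) {
    assert(t != NULL);
    for (int w = 0; w < DA_D; w++) {
        uint32_t i = da_hash(w, key);
        if (t->used[w][i] && t->key[w][i] == key) {
            return true;
        }
    }
    return false;
}

/* Random-walk insert. On failure, *homeless receives the displaced element. */
static bool da_insert(dary_table *t, uint32_t key, uint32_t *homeless) {
    assert(t != NULL && homeless != NULL);
    if (da_contains(t, key)) {
        return true;
    }
    for (int kick = 0; kick < DA_MAX_KICKS; kick++) {
        for (int w = 0; w < DA_D; w++) {         /* any empty nest will do */
            uint32_t i = da_hash(w, key);
            if (!t->used[w][i]) {
                t->key[w][i] = key;
                t->used[w][i] = true;
                return true;
            }
        }
        int w = (int)(da_rand(t) % DA_D);        /* all full: evict at random */
        uint32_t i = da_hash(w, key);
        uint32_t evicted = t->key[w][i];
        t->key[w][i] = key;
        key = evicted;                           /* continue with the evictee */
    }
    *homeless = key;
    return false;
}
```

The random walk follows a single path through the tree of eviction choices, which is why it is simpler than a breadth-first search; the price is that the path it finds is not necessarily the shortest one.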

Bucketed Cuckoo Hashing
- Guaranteed O(1) lookup/delete
- Amortized O(1) insert
- 90%+ space efficient
- Requires *mostly* random hash functions
  - (easier to implement)
  - (better, “good” cache performance)

Bucketed Cuckoo Hashing
- Use two hash functions, but each hashes to an associative m-wide bucket
  - Lookup and delete must check at most two whole buckets
  - Insertion into a full bucket leaves a choice during eviction
- Insertion sees a tree of possible eviction+insertion paths
  - BFS to find an empty bucket
    - Best-first uses the most empty target bucket
  - Random walk to find an empty bucket (easier)
    - Use LRI eviction for easiest implementation
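As a sketch of the bucketed scheme, the C code below uses two hash functions into a single array of 4-wide buckets and evicts in insertion order. It reads "LRI" as "least recently inserted", which is an assumption, as are the `bk_` names, the sizes, and the hash constants; deletion is omitted so that a simple ring cursor is enough to track the oldest entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BK_BUCKETS   8u    /* number of buckets (hypothetical, power of two) */
#define BK_WIDTH     4u    /* m: slots per bucket */
#define BK_MAX_KICKS 64    /* depth cutoff for the eviction chain */

typedef struct {
    uint32_t key[BK_WIDTH];
    uint32_t count;        /* number of filled slots */
    uint32_t oldest;       /* least recently inserted slot once the bucket is full */
} bk_bucket;

typedef struct { bk_bucket b[BK_BUCKETS]; } bk_table;

/* Two hypothetical multiplicative hashes into the same bucket array (top 3 bits). */
static uint32_t bk_hash(int which, uint32_t key) {
    static const uint32_t mult[2] = { 2654435761u, 2246822519u };
    return (uint32_t)(key * mult[which]) >> 29;
}

static bool bk_in_bucket(const bk_bucket *b, uint32_t key) {
    for (uint32_t i = 0; i < b->count; i++) {
        if (b->key[i] == key) return true;
    }
    return false;
}

/* Lookup scans at most two whole buckets. */
static bool bk_contains(const bk_table *t, uint32_t key) {
    assert(t != NULL);
    return bk_in_bucket(&t->b[bk_hash(0, key)], key) ||
           bk_in_bucket(&t->b[bk_hash(1, key)], key);
}

/* Insert with LRI eviction. On failure, *homeless receives the displaced element. */
static bool bk_insert(bk_table *t, uint32_t key, uint32_t *homeless) {
    assert(t != NULL && homeless != NULL);
    if (bk_contains(t, key)) return true;
    uint32_t h = bk_hash(0, key);                 /* bucket to evict from if needed */
    for (int kick = 0; kick < BK_MAX_KICKS; kick++) {
        for (int w = 0; w < 2; w++) {             /* room in either bucket? */
            bk_bucket *c = &t->b[bk_hash(w, key)];
            if (c->count < BK_WIDTH) {
                c->key[c->count++] = key;
                return true;
            }
        }
        bk_bucket *full = &t->b[h];               /* both full: evict oldest entry */
        uint32_t evicted = full->key[full->oldest];
        full->key[full->oldest] = key;
        full->oldest = (full->oldest + 1u) % BK_WIDTH;
        key = evicted;                            /* evictee moves to its other bucket */
        uint32_t h0 = bk_hash(0, key), h1 = bk_hash(1, key);
        h = (h == h0) ? h1 : h0;
    }
    *homeless = key;
    return false;
}
```

Because slots fill in order and evictions overwrite at the cursor, the cursor always points at the least recently inserted key. Supporting deletes would need a different bookkeeping scheme.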

Generalization: Use both!
- Use k hash functions
- Use bins of size m
- Get the best of both worlds!

Max load for O(1) Insert – 99% Guarantee (proven)

               1 cell   2 cells   4 cells   8 cells
4 hash func    97%      99%       99.9%     100%*
3 hash func    91%      97%       98%       99.9%
2 hash func    49%      86%       93%       96%
1 hash func    0.06%    0.6%      3%        12%

IBM’s Implementation
- IBM designed a hash table for the Cell processor
  - Parameters: K=2, M=4 (SIMD width)
- If the hash table fits in scratch L2:
  - lookup in 21 cycles
- Simple multiplicative hash functions worked well

Better Cache Performance Than You Would Think
- If prefetching is used, cost of lookup is one memory latency (plus time to compute the hash function, which can be done in SIMD)
  - Exactly two cache-line loads
- Binary search trees, linear probing, linear chaining, etc… usually take more cache-line loads and have a very branchy search loop

Conclusions
- Cuckoo Hashing Provides:
  - Guaranteed O(1) lookup+delete
  - Amortized O(1) insert
  - Efficient memory utilization
    - Both in space and bandwidth!
  - Small constant factors
    - And SIMD friendly!
  - And is simple to implement (easier than linear probing!)

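Good Hash Function?
(very fast, especially if you use the __rotl intrinsic)

/* rot() is not defined on the slide; a portable 32-bit rotate-left is assumed here. */
#define rot(x,k) (((x) << (k)) | ((x) >> (32 - (k))))

/* Mixes three 32-bit unsigned values in place. */
#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}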
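One way the `mix` macro might feed a cuckoo table is to derive both bucket indices from a single mixing pass. The sketch below is an assumption, not something the slides specify: the `two_hashes` helper, the seed parameter, the initial constant, and the choice of `b` and `c` as the two outputs are all hypothetical, and `rot` is written as a plain shift-based rotate on 32-bit unsigned values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable 32-bit rotate-left; k must be between 1 and 31. */
#define rot(x, k) (((x) << (k)) | ((x) >> (32 - (k))))

/* The mixing macro from the slide, reproduced so the block is self-contained. */
#define mix(a, b, c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}

/* Hypothetical helper: produce two bucket indices for one key.
 * mask must be (number of buckets - 1) with a power-of-two bucket count. */
static void two_hashes(uint32_t key, uint32_t seed, uint32_t mask,
                       uint32_t *h0, uint32_t *h1) {
    assert(h0 != NULL && h1 != NULL);
    uint32_t a = key;
    uint32_t b = seed;            /* changing the seed gives new hash functions */
    uint32_t c = 0x9e3779b9u;     /* arbitrary odd starting constant (assumption) */
    mix(a, b, c);
    *h0 = b & mask;
    *h1 = c & mask;
}
```

Picking a fresh `seed` is a cheap stand-in for "choose new hash functions" when a table has to be rehashed. Whether one pass of `mix` is random enough for a given workload is something to measure rather than assume.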

Questions?