CS 277 – Spring 2002Notes 41 CS 277: Database System Implementation Notes 4: Indexing Arthur Keller.

Slides:



Advertisements
Similar presentations
Index tuning-- B+tree. overview Variation on B+tree: B-tree (no +) Idea: –Avoid duplicate keys –Have record pointers in non-leaf nodes.
Advertisements

Hash-based Indexes CS 186, Spring 2006 Lecture 7 R &G Chapter 11 HASH, x. There is no definition for this word -- nobody knows what hash is. Ambrose Bierce,
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Hashing and Indexing John Ortiz.
Advanced Data Structures NTUA Spring 2007 B+-trees and External memory Hashing.
Dr. Kalpakis CMSC 661, Principles of Database Systems Index Structures [13]
1 Lecture 8: Data structures for databases II Jose M. Peña
CPSC-608 Database Systems Fall 2010 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #7.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
COMP 451/651 Indexes Chapter 1.
1 Advanced Database Technology Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Spring 2004 March 4, 2004 INDEXING II Lecture based on [GUW,
CS4432: Database Systems II
Indexing Techniques. Advanced DatabasesIndexing Techniques2 The Problem What can we introduce to make search more efficient? –Indices! What is an index?
CS CS4432: Database Systems II Basic indexing.
B-trees - Hashing. 11.2Database System Concepts Review: B-trees and B+-trees Multilevel, disk-aware, balanced index methods primary or secondary dense.
1 Hash-Based Indexes Yanlei Diao UMass Amherst Feb 22, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 More on Indexes Secondary Indexes B-Trees Source: our textbook, slides by Hector Garcia-Molina.
1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #10.
1 Advanced Database Technology Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Spring 2004 February 19, 2004 INDEXING I Lecture based on [GUW,
CPSC-608 Database Systems Fall 2008 Instructor: Jianer Chen Office: HRBB 309B Phone: Notes #7.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
CS 4432lecture #10 - indexing & hashing1 CS4432: Database Systems II Lecture #10 Professor Elke A. Rundensteiner.
CS 4432lecture #71 CS4432: Database Systems II Lecture #7 Professor Elke A. Rundensteiner.
1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.
Primary Indexes Dense Indexes
CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.
CS 255: Database System Principles slides: B-trees
1 CS 728 Advanced Database Systems Chapter 17 Database File Indexing Techniques, B- Trees, and B + -Trees.
CS4432: Database Systems II
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
Index tuning-- B+tree. overview © Dennis Shasha, Philippe Bonnet 2001 B+-Tree Locking Tree Traversal –Update, Read –Insert, Delete phantom problem: need.
1 CS232A: Database System Principles INDEXING. 2 Given condition on attribute find qualified records Attr = value Condition may also be Attr>value Attr>=value.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
Hashing and Hash-Based Index. Selection Queries Yes! Hashing  static hashing  dynamic hashing B+-tree is perfect, but.... to answer a selection query.
DBMS 2001Notes 4.1: B-Trees1 Principles of Database Management Systems 4.1: B-Trees Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina,
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Index Tuning Conventional index Secondary index To speed up queries on attributes not within primary key Primary index –Determine.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006.
Introduction to Database, Fall 2004/Melikyan1 Hash-Based Indexes Chapter 10.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Indexed Sequential Access Method.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 10.
Index Tuning Conventional index. Overview.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
B+ tree & B tree Extracted from Garcia Molina
Indexing and B+-Trees By Kenneth Cheung CS 157B TR 07:30-08:45 Professor Lee.
1 Chapter 12: Indexing and Hashing Indexing Indexing Basic Concepts Basic Concepts Ordered Indices Ordered Indices B+-Tree Index Files B+-Tree Index Files.
1 CSCE 520 Test 2 Info Indexing Modified from slides of Hector Garcia-Molina and Jeff Ullman.
CS4432: Database Systems II
1 Query Processing Part 3: B+Trees. 2 Dense and Sparse Indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Insertions.
CS 405G: Introduction to Database Systems 12. Index.
1 Ullman et al. : Database System Principles Notes 4: Indexing.
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
Access Structures COMP3211 Advanced Databases Dr Nicholas Gibbins
COMP3017 Advanced Databases
Indexing and hashing.
CS 245: Database System Principles Notes 4: Indexing
CS 245: Database System Principles Notes 4: Indexing
CS232A: Database System Principles INDEXING
(Slides by Hector Garcia-Molina,
CS 245: Database System Principles Notes 4: Indexing
B+Tree Example n=3 Root
Database Design and Programming
Indexing 4/11/2019.
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

CS 277 – Spring 2002Notes 41 CS 277: Database System Implementation Notes 4: Indexing Arthur Keller

CS 277 – Spring 2002Notes 42 Indexing & Hashing value Chapter 4  value record

CS 277 – Spring 2002Notes 43 Topics Conventional indexes B-trees Hashing schemes

CS 277 – Spring 2002Notes 44 Sequential File

CS 277 – Spring 2002Notes 45 Sequential File Dense Index

CS 277 – Spring 2002Notes 46 Sequential File Sparse Index

CS 277 – Spring 2002Notes 47 Sequential File Sparse 2nd level

CS 277 – Spring 2002Notes 48 Comment: {FILE,INDEX} may be contiguous or not (blocks chained)

CS 277 – Spring 2002Notes 49 Question: Can we build a dense, 2nd level index for a dense index?

CS 277 – Spring 2002Notes 410 Notes on pointers: (1) Block pointer (sparse index) can be smaller than record pointer BP RP

CS 277 – Spring 2002Notes 411 (2) If file is contiguous, then we can omit pointers (i.e., compute them) Notes on pointers:

CS 277 – Spring 2002Notes 412 K1 K3 K4 K2 R1R2R3R4 say: 1024 B per block if we want K3 block: get it at offset (3-1)1024 = 2048 bytes

CS 277 – Spring 2002Notes 413 Sparse vs. Dense Tradeoff Sparse: Less index space per record can keep more of index in memory Dense: Can tell if any record exists without accessing file (Later: –sparse better for insertions –dense needed for secondary indexes)

CS 277 – Spring 2002Notes 414 Terms Index sequential file Search key (  primary key) Primary index (on Sequencing field) Secondary index Dense index (all Search Key values in) Sparse index Multi-level index

CS 277 – Spring 2002Notes 415 Next: Duplicate keys Deletion/Insertion Secondary indexes

CS 277 – Spring 2002Notes 416 Duplicate keys

CS 277 – Spring 2002Notes Dense index, one way to implement? Duplicate keys

CS 277 – Spring 2002Notes Dense index, better way? Duplicate keys

CS 277 – Spring 2002Notes Sparse index, one way? Duplicate keys careful if looking for 20 or 30!

CS 277 – Spring 2002Notes Sparse index, another way? Duplicate keys – place first new key from block should this be 40?

CS 277 – Spring 2002Notes 421 Duplicate values, primary index Index may point to first instance of each value only File Index Summary a a a b 

CS 277 – Spring 2002Notes 422 Deletion from sparse index

CS 277 – Spring 2002Notes 423 Deletion from sparse index – delete record 40

CS 277 – Spring 2002Notes 424 Deletion from sparse index – delete record 30 40

CS 277 – Spring 2002Notes 425 Deletion from sparse index – delete records 30 &

CS 277 – Spring 2002Notes 426 Deletion from dense index

CS 277 – Spring 2002Notes 427 Deletion from dense index – delete record 30 40

CS 277 – Spring 2002Notes 428 Insertion, sparse index case

CS 277 – Spring 2002Notes 429 Insertion, sparse index case – insert record our lucky day! we have free space where we need it!

CS 277 – Spring 2002Notes 430 Insertion, sparse index case – insert record Illustrated: Immediate reorganization Variation: – insert new block (chained file) – update index

CS 277 – Spring 2002Notes 431 Insertion, sparse index case – insert record overflow blocks (reorganize later...)

CS 277 – Spring 2002Notes 432 Insertion, dense index case Similar Often more expensive...

CS 277 – Spring 2002Notes 433 Secondary indexes Sequence field

CS 277 – Spring 2002Notes 434 Secondary indexes Sequence field Sparse index does not make sense!

CS 277 – Spring 2002Notes 435 Secondary indexes Sequence field Dense index sparse high level

CS 277 – Spring 2002Notes 436 With secondary indexes: Lowest level is dense Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed)

CS 277 – Spring 2002Notes 437 Duplicate values & secondary indexes

CS 277 – Spring 2002Notes 438 Duplicate values & secondary indexes one option... Problem: excess overhead! disk space search time

CS 277 – Spring 2002Notes 439 Duplicate values & secondary indexes another option Problem: variable size records in index!

CS 277 – Spring 2002Notes 440 Duplicate values & secondary indexes Another idea (suggested in class): Chain records with same key? Problems: Need to add fields to records Need to follow chain to know records

CS 277 – Spring 2002Notes 441 Duplicate values & secondary indexes buckets

CS 277 – Spring 2002Notes 442 Why “bucket” idea is useful IndexesRecords Name: primary EMP (name,dept,floor,...) Dept: secondary Floor: secondary

CS 277 – Spring 2002Notes 443 Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. indexEMP Floor index Toy 2nd  Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s

CS 277 – Spring 2002Notes 444 This idea used in text information retrieval Documents...the cat is fat......was raining cats and dogs......Fido the dog... Inverted lists cat dog

CS 277 – Spring 2002Notes 445 IR QUERIES Find articles with “cat” and “dog” Find articles with “cat” or “dog” Find articles with “cat” and not “dog” Find articles with “cat” in title Find articles with “cat” and “dog” within 5 words

CS 277 – Spring 2002Notes 446 Common technique: more info in inverted list cat Title5 100 Author10 Abstract57 Title12 d3d3 d2d2 d1d1 dog type position location

CS 277 – Spring 2002Notes 447 Posting: an entry in inverted list. Represents occurrence of term in article Size of a list:1Rare words or (in postings) miss-spellings 10 6 Common words Size of a posting: bits (compressed)

CS 277 – Spring 2002Notes 448 IR DISCUSSION Stop words Truncation Thesaurus Full text vs. Abstracts Vector model

CS 277 – Spring 2002Notes 449 Vector space model w1 w2 w3 w4 w5 w6 w7 … DOC = Query= PRODUCT = 1 + ……. = score

CS 277 – Spring 2002Notes 450 Tricks to weigh scores + normalize e.g.: Match on common word not as useful as match on rare words...

CS 277 – Spring 2002Notes 451 How to process VS. Queries? w1 w2 w3 w4 w5 w6 … Q =

CS 277 – Spring 2002Notes 452 Try Altavista, Excite, Infoseek, Lycos...

CS 277 – Spring 2002Notes 453 Summary so far Conventional index –Basic Ideas: sparse, dense, multi-level… –Duplicate Keys –Deletion/Insertion –Secondary indexes –Buckets of Postings List

CS 277 – Spring 2002Notes 454 Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance

CS 277 – Spring 2002Notes 455 ExampleIndex (sequential) continuous free space overflow area (not sequential)

CS 277 – Spring 2002Notes 456 Outline: Conventional indexes B-Trees  NEXT Hashing schemes

CS 277 – Spring 2002Notes 457 NEXT: Another type of index –Give up on sequentiality of index –Try to get “balance”

CS 277 – Spring 2002Notes 458 Root B+Tree Examplen=

CS 277 – Spring 2002Notes 459 Sample non-leaf to keysto keysto keys to keys < 5757  k<8181  k<95 

CS 277 – Spring 2002Notes 460 Sample leaf node: From non-leaf node to next leaf in sequence To record with key 57 To record with key 81 To record with key 85

CS 277 – Spring 2002Notes 461 In textbook’s notationn=3 Leaf: Non-leaf:

CS 277 – Spring 2002Notes 462 Size of nodes:n+1 pointers n keys (fixed)

CS 277 – Spring 2002Notes 463 Don’t want nodes to be too empty Use at least Non-leaf:  (n+1)/2  pointers Leaf:  (n+1)/2  pointers to data

CS 277 – Spring 2002Notes 464 Full nodemin. node Non-leaf Leaf n= counts even if null

CS 277 – Spring 2002Notes 465 B+tree rulestree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

CS 277 – Spring 2002Notes 466 (3) Number of pointers/keys for B+tree Non-leaf (non-root) n+1n  (n+1)/ 2   (n+1)/ 2  - 1 Leaf (non-root) n+1n Rootn+1n11 Max Max Min Min ptrs keys ptrs  data keys  (n+ 1) / 2 

CS 277 – Spring 2002Notes 467 Insert into B+tree (a) simple case –space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

CS 277 – Spring 2002Notes 468 (a) Insert key = 32 n=

CS 277 – Spring 2002Notes 469 (a) Insert key = 7 n=

CS 277 – Spring 2002Notes 470 (c) Insert key = 160 n=

CS 277 – Spring 2002Notes 471 (d) New root, insert 45 n= new root

CS 277 – Spring 2002Notes 472 (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf Deletion from B+tree

CS 277 – Spring 2002Notes 473 (b) Coalesce with sibling –Delete n=4 40

CS 277 – Spring 2002Notes 474 (c) Redistribute keys –Delete n=4 35

CS 277 – Spring 2002Notes (d) Non-leaf coalese –Delete 37 n= new root

CS 277 – Spring 2002Notes 476 B+tree deletions in practice –Often, coalescing is not implemented –Too hard and not worth it!

CS 277 – Spring 2002Notes 477 Comparison: B-trees vs. static indexed sequential file Ref #1: Held & Stonebraker “B-Trees Re-examined” CACM, Feb. 1978

CS 277 – Spring 2002Notes 478 Ref # 1 claims: - Concurrency control harder in B-Trees - B-tree consumes more space For their comparison: block = 512 bytes key = pointer = 4 bytes 4 data records per block

CS 277 – Spring 2002Notes 479 Example: 1 block static index 127 keys (127+1)4 = 512 Bytes -> pointers in index implicit!up to 127 blocks k1 k2 k3 k1k2k3 1 data block

CS 277 – Spring 2002Notes 480 Example: 1 block B-tree 63 keys 63x(4+4)+8 = 512 Bytes -> pointers needed in B-treeup to 63 blocks because index isblocks not contiguous k1 k2... k63 k1k2k3 1 data block next -

CS 277 – Spring 2002Notes 481 Size comparison Ref. #1 Static Index B-tree # data blocks height 2 -> > > 16, > ,130 -> 2,048, > 250, ,048 -> 15,752,961 5

CS 277 – Spring 2002Notes 482 Ref. #1 analysis claims For an 8,000 block file, after 32,000 inserts after 16,000 lookups  Static index saves enough accesses to allow for reorganization Ref. #1 conclusion Static index better!!

CS 277 – Spring 2002Notes 483 Ref #2: M. Stonebraker, “Retrospective on a database system,” TODS, June 1980 Ref. #2 conclusion B-trees better!!

CS 277 – Spring 2002Notes 484 DBA does not know when to reorganize DBA does not know how full to load pages of new index Ref. #2 conclusion B-trees better!!

CS 277 – Spring 2002Notes 485 Buffering –B-tree: has fixed buffer requirements –Static index: must read several overflow blocks to be efficient (large & variable size buffers needed for this) Ref. #2 conclusion B-trees better!!

CS 277 – Spring 2002Notes 486 Speaking of buffering… Is LRU a good policyfor B+tree buffers?  Of course not!  Should try to keep root in memory at all times (and perhaps some nodes from second level)

CS 277 – Spring 2002Notes 487 Interesting problem: For B+tree, how large should n be? … n is number of keys / node

CS 277 – Spring 2002Notes 488 Sample assumptions: (1) Time to read node from disk is ( n) msec. (2) Once block in memory, use binary search to locate key: (a + b LOG 2 n) msec. For some constants a,b; Assume a << 70 (3) Assume B+tree is full, i.e., # nodes to examine is LOG n N where N = # records

CS 277 – Spring 2002Notes 489  Can get: f(n) = time to find a record f(n) n opt n

CS 277 – Spring 2002Notes 490  FIND n opt by f’(n) = 0 Answer is n opt = “few hundred” (see homework for details)  What happens to n opt as Disk gets faster? CPU get faster?

CS 277 – Spring 2002Notes 491 Variation on B+tree: B-tree (no +) Idea: –Avoid duplicate keys –Have record pointers in non-leaf nodes

CS 277 – Spring 2002Notes 492 to record to record to record with K1 with K2 with K3 to keys to keys to keys to keys k3 K1 P1K2 P2K3 P3

CS 277 – Spring 2002Notes 493 B-tree examplen= sequence pointers not useful now! (but keep space for simplicity)

CS 277 – Spring 2002Notes 494 Note on inserts Say we insert record with key = n=3 leaf 10 – 20 – Afterwards:

CS 277 – Spring 2002Notes 495 So, for B-trees: MAXMIN Tree Rec Keys PtrsPtrs Ptrs Ptrs Non-leaf non-root n+1nn  (n+1)/2   (n+1)/2  -1  (n+1)/2  -1 Leaf non-root 1nn 1  (n+1)/2   (n+1)/2  Root non-leaf n+1nn Root Leaf 1nn 1 1 1

CS 277 – Spring 2002Notes 496 Tradeoffs: B-trees have faster lookup than B+trees  in B-tree, non-leaf & leaf different sizes  in B-tree, deletion more complicated  B+trees preferred!

CS 277 – Spring 2002Notes 497 But note: If blocks are fixed size (due to disk and buffering restrictions) Then lookup for B+tree is actually better!!

CS 277 – Spring 2002Notes 498 Example: - Pointers 4 bytes - Keys4 bytes - Blocks100 bytes (just example) - Look at full 2 level tree

CS 277 – Spring 2002Notes 499 Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 100 bytes B-tree: Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 100 bytes 2-level B-tree, Max # records = 12x9 + 8 = 116

CS 277 – Spring 2002Notes 4100 Root has 12 keys + 13 son pointers = 12x4 + 13x4 = 100 bytes B+tree: Each of 13 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 100 bytes 2-level B+tree, Max # records = 13x12 = 156

CS 277 – Spring 2002Notes 4101 So... ooooooooooooo ooooooooo 156 records 108 records Total = 116 B+ B 8 records Conclusion: –For fixed block size, –B+ tree is better because it is bushier

CS 277 – Spring 2002Notes 4102 Outline/summary Conventional Indexes Sparse vs. dense Primary vs. secondary B trees B+trees vs. B-trees B+trees vs. indexed sequential Hashing schemes-->Next