CS 345: Topics in Data Warehousing Thursday, October 28, 2004.



Review of Tuesday’s Class
Bitmap compression with BBC codes
– Gaps and Tails
– Variable byte-length encoding of lengths
– Special handling of lone bits
Speeding up star joins
– Cartesian product of dimensions
– Semi-join reduction
– Early aggregation

Outline of Today’s Class
Join Indexes
Projection Indexes
– Horizontal vs. Vertical decomposition
Bit-Sliced Indexes
– Fast bitmap counts and sums
– Range queries
Bit Vector Filtering
* O’Neil and Quass, “Improved Query Performance with Variant Indexes”, 1997

Join Indexes
Generally, indexes are built on the tuples of a single relation
Join indexes include attributes from more than one relation
– Or else index entries consist of attributes from one relation and RIDs from another
Example:
– Standard index on Customer.State
    – Index entry consists of a (State, Customer RID) pair
– Join index on Customer.State & Sales fact
    – Index entry consists of a (State, Sales fact RID) pair

Defining Join Indexes
Join indexes are like indexes on a materialized view
– Except that the view isn’t actually materialized…
Oracle syntax (Oracle requires an index name; sales_cust_state_idx is an illustrative one):
– CREATE BITMAP INDEX sales_cust_state_idx ON Sales(Customer.State) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key
Creates an index with (State, Sales fact RID) entries
Join condition used in joining the tables is specified at index creation time

Why Join Indexes?
Consider a star query plan using semi-join reductions
– For each dimension:
    – Determine keys of dimension rows that satisfy filter conditions
    – Join list of dimension keys to fact table (using index)
    – Result = list of fact RIDs
– Merge lists of fact RIDs from all dimensions
– Retrieve matching fact rows
– Join back to grouping dimensions
– Perform grouping and aggregation
Join indexes allow the join step to be skipped
– Join is precomputed when the join index is created
– Fact RID list stored directly in join index
Bitmap join indexes are particularly common
– Other types are possible (B-Tree, etc.)
– Bitmap join indexes facilitate RID list merging / index intersection
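
The idea of precomputing the join can be sketched in a few lines of Python. This is a minimal illustration, not Oracle's implementation: the toy `customer` and `sales` tables, the function names, and the use of a Python integer as a bitmap (bit r set = fact RID r qualifies) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy data: customer_key -> state, and fact rows (customer_key, quantity).
# A fact row's position in the list plays the role of its RID.
customer = {1: "CA", 2: "NY", 3: "CA"}
sales = [(1, 5), (2, 3), (3, 7), (1, 2), (2, 4)]

def build_join_index(sales, customer):
    """Precompute the Sales-Customer join: state -> bitmap of fact RIDs."""
    index = defaultdict(int)
    for rid, (customer_key, _quantity) in enumerate(sales):
        index[customer[customer_key]] |= 1 << rid  # set bit `rid` for this state
    return dict(index)

def rids(bitmap):
    """Expand a bitmap into a sorted list of RIDs."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]

state_index = build_join_index(sales, customer)

# Query time: no join to Customer is needed to find the CA fact rows.
ca_rids = rids(state_index["CA"])
total_quantity = sum(sales[r][1] for r in ca_rids)
```

Because each entry is already a bitmap of fact RIDs, filters on several dimensions could be combined with a bitwise AND of the corresponding bitmaps before any fact row is fetched.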

Limitations of Join Indexes
Index creation cost is high
– Need to compute join result to create index
Index maintenance cost is high
– Usually not an issue in data warehouses
– Data warehouse is “read-mostly”
– Drop index before warehouse load begins
– Re-create after load completes
Can’t access non-indexed columns of dimension table
– SELECT Customer.Age, SUM(Quantity) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key AND Customer.State = 'CA' GROUP BY Customer.Age
– Using ordinary index on State:
    – Get dimension table RIDs using index
    – Look up dimension rows to learn Age for each
    – Join to fact table to learn Quantity
– Using join index on State:
    – Get fact RIDs using index
    – Look up fact rows to learn Quantity
    – Join back to dimension table to learn Age

Horizontal vs. Vertical Decomposition
Relations have rows and columns
– “Two-dimensional” structure
Disk access model is one-dimensional stream
– Disk spins fast, disk arm moves slowly
– Sequential I/O = leave the disk arm in one place
– Essentially, disk provides one-dimensional locality
Need to store the bytes in some particular order
– Embedding 2D structure in 1D space
Two options:
– Horizontal decomposition (“row-major” order)
– Vertical decomposition (“column-major” order)

Horizontal vs. Vertical
Horizontal decomposition
– All attributes of a record are stored together
– Values for the same attribute but different records are separated
– Standard DBMS storage model
Vertical decomposition
– All values of an attribute are stored together
– Values for the same record but different attributes are separated
[Figure: the same table (Col1, Col2, Col3) laid out in horizontal and in vertical order]

Horizontal vs. Vertical
Horizontal decomposition
– Good locality when:
    – Multiple attributes from same record are co-accessed
    – Not too many records are co-accessed
– For example: Inserts and deletes
    – Entire record is in one place
    – Perform insert or delete with a single I/O
Vertical decomposition
– Good locality when:
    – Multiple values of the same attribute are co-accessed
    – Not too many attributes are co-accessed
– Frequently the case for OLAP queries

Vertical Decomposition Example
SELECT customer_key, income FROM Customer WHERE state = 'CA' AND age > 50
– Subplan for OLAP query that filters on state, age and groups on income
– Suppose Customer has 100 attributes
Horizontally decomposed data:
– Scan table, one record at a time
– Ignore 96 attributes of each record
– Use 4 attributes (state, age, income, customer_key)
– Lots of wasted I/O bandwidth
Vertically decomposed data:
– Parallel scans of customer_key, income, state, and age columns
– If the ith state = 'CA' and the ith age > 50, then output the ith customer_key and ith income
– No redundant attributes are read from disk
– I/O is kept to a minimum
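
The lockstep column scan can be sketched as follows. This is an in-memory illustration only: the `columns` dictionary, its sample values, and the function name are assumptions, and a real column store would stream each column from its own disk pages.

```python
# Hypothetical column-major storage: one list per attribute, aligned by position.
# A real Customer table would have ~100 such columns; only these four are touched.
columns = {
    "customer_key": [101, 102, 103, 104],
    "state": ["CA", "NY", "CA", "CA"],
    "age": [62, 55, 34, 51],
    "income": [80000, 65000, 72000, 91000],
}

def select_ca_over_50(cols):
    """Scan the four needed columns in parallel; the ith entries form the ith record."""
    scan = zip(cols["customer_key"], cols["income"], cols["state"], cols["age"])
    return [(key, income) for key, income, state, age in scan
            if state == "CA" and age > 50]

result = select_ca_over_50(columns)
```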

Projection Index
The idea behind projection indexes
– Databases usually store data in horizontal format
– Vertical format is more efficient for many analysis queries
– Why not do both?
Projection index
– Logically: Index entries are (RID, Value) pairs
– Stored in same order as records in relation
    – i.e. sorted by RID instead of sorted by Value
– In practice: Storing RID is unnecessary
    – Array storage format
    – Array index determined from RID
Incremental approach to vertical decomposition
– Vertical decomposition = complete set of projection indexes
– A few projection indexes = partial vertical decomposition

Using Projection Indexes
Consider a star query via semi-join decomposition
– For each dimension:
    – Determine keys of dimension rows that satisfy filter conditions
    – Join list of dimension keys to fact table (using index)
    – Result = list of fact RIDs
– Merge lists of fact RIDs from all dimensions
– Retrieve matching fact rows
– Join back to grouping dimensions
– Perform grouping and aggregation

Using Projection Indexes
“Retrieve matching fact rows” step can be expensive
– All we care about are values of aggregated columns
– Other attributes are irrelevant (e.g. dimension foreign keys)
– Poor packing of relevant information into disk pages
    – Relevant and irrelevant data is clumped together
Using a projection index is often faster
– List of fact RIDs tells us which index entries to read
– Many index entries packed into same disk page
    – No “clutter” from irrelevant fields
– Fewer I/Os required
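
A projection index stored as a dense array might look like the sketch below. The fact-table layout, the sample rows, and the names `quantity_index` and `sum_quantity` are assumptions for illustration; the point is only that the RID doubles as the array position, so no RID needs to be stored.

```python
from array import array

# Hypothetical fact rows: (date_key, customer_key, product_key, quantity).
# List position = RID.
fact_rows = [(1, 10, 7, 5), (1, 11, 8, 3), (2, 10, 7, 7), (2, 12, 9, 2)]

# Projection index on quantity: a packed array of 64-bit integers in RID order.
quantity_index = array("q", [row[3] for row in fact_rows])

def sum_quantity(fact_rids, index):
    """Aggregate using only the projection index; full fact rows are never fetched."""
    return sum(index[rid] for rid in fact_rids)

foundset = [0, 2, 3]  # fact RIDs that survived the dimension filters
total = sum_quantity(foundset, quantity_index)
```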

Bit-Sliced Indexes
Bit-sliced indexes
– Generally used for measurement columns
– Allow for:
    – Efficient aggregation
    – Efficient range filtering (particularly for large ranges)
– Most suitable for bitmap-based plans
– Requires positive integer-valued column
    – Fixed-precision decimals OK
    – Example: Interpret $5.67 as 567 cents
Bit-sliced index on attribute A
– Treat A as multiple logical binary-valued columns
    – Column A1 = Least significant bit of A
    – Column A2 = 2nd least significant bit of A
    – Etc.
    – Number of logical columns determined by max value
– Store each column as a separate bitmap

Bit-Sliced Index Example
[Table: Amount values, their binary representations, and the resulting bit slices B4, B3, B2, B1 of the bit-sliced index]
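
Building the slices can be sketched as below. The `amounts` values are hypothetical, and each bitmap is held in a Python integer (bit r = row r). Note the indexing assumption: the code numbers slices from 0, so `slices[0]` corresponds to the slide's B1 (least significant bit).

```python
def build_bit_sliced_index(values):
    """Return slices where slices[i] is a bitmap of bit i of every value.

    Bit r of slices[i] is 1 exactly when bit i of values[r] is 1.
    """
    if any(v < 0 for v in values):
        raise ValueError("bit-sliced index needs non-negative integers")
    # Number of slices is determined by the maximum value (at least one slice).
    n_slices = max(max(values, default=0).bit_length(), 1)
    slices = [0] * n_slices
    for rid, value in enumerate(values):
        for i in range(n_slices):
            if (value >> i) & 1:
                slices[i] |= 1 << rid
    return slices

# Hypothetical amounts: 13 = 1101, 7 = 0111, 6 = 0110, 4 = 0100, 3 = 0011.
amounts = [13, 7, 6, 4, 3]
slices = build_bit_sliced_index(amounts)
```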

Fast Bitmap Aggregation
Take advantage of word-level parallelism
– Implicit parallelism arising from SIMD operations in modern computer architectures
– SIMD: Single Instruction, Multiple Data
– Processor can compute bitwise operations on all bits in a word at the same time
– Index intersection via bitmap merge takes advantage of this fact
– It’s one reason for byte-aligned compression

Fast Bitmap Count
Count the number of 1’s in a bitmap
– Treat the bitmap as a byte array
– Pre-compute lookup table with number of 1’s in each byte
– Cycle through bitmap one byte at a time, accumulating count using lookup table
Pseudocode
– count = 0; for (int i = 0; i < n/8; i++) count += numSetBits[bitmap[i]];
– numSetBits[0] = 0, numSetBits[7] = 3, etc.
Treating bitmap as short int array → even faster
– Lookup table has 65,536 entries instead of 256
– Bitmap of n bits → only add n/16 numbers
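
A runnable version of the byte-at-a-time count, as a sketch. The names `NUM_SET_BITS` and `bitmap_count` are illustrative, and the bitmap is assumed to be a `bytes` object.

```python
# Precompute the number of 1 bits in every possible byte value (256 entries).
NUM_SET_BITS = [bin(b).count("1") for b in range(256)]

def bitmap_count(bitmap: bytes) -> int:
    """Count the 1 bits in a bitmap, one table lookup per byte."""
    count = 0
    for byte in bitmap:  # iterating over bytes yields integers 0..255
        count += NUM_SET_BITS[byte]
    return count

example = bytes([0b00000111, 0xFF, 0x00])
ones = bitmap_count(example)
```

The 16-bit variant from the slide would use the same loop with a 65,536-entry table and half as many additions.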

SUM using Bit-Sliced Index
Suppose B_f represents the foundset
– foundset = List of fact RIDs that pass all filters
For each bit slice B_i:
– Compute B_f AND B_i
– Do fast bitmap count of resulting bitmap
– Multiply count by 2^(i-1) (B_1 is the least significant slice, with weight 2^0)
Total = sum of weighted counts for all slices

SUM Example
[Table: Amount values and their bit slices B4, B3, B2, B1 of the bit-sliced index]
Count of B4: 1
Count of B3: 4
Count of B2: 3
Count of B1: 3
1*2^3 + 4*2^2 + 3*2^1 + 3*2^0 = 1*8 + 4*4 + 3*2 + 3*1 = 8 + 16 + 6 + 3 = 33
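
The SUM procedure can be sketched as follows. The slice bitmaps below encode hypothetical amounts 13, 7, 6, 4, 3, chosen so the per-slice counts match the 1, 4, 3, 3 of the example; the original table's actual values are not known. Slices are numbered from 0 here, so `slices[i]` carries weight 2^i.

```python
def bit_sliced_sum(slices, foundset):
    """Sum the indexed column over the rows whose bits are set in `foundset`."""
    total = 0
    for i, b_i in enumerate(slices):
        count = bin(foundset & b_i).count("1")  # fast bitmap count of B_f AND B_i
        total += count << i                     # weight the count by 2**i
    return total

# Bitmaps are Python ints (bit r = row r) for amounts [13, 7, 6, 4, 3]:
# slices[0] = least significant bit, ..., slices[3] = most significant bit.
slices = [0b10011, 0b10110, 0b01111, 0b00001]

all_rows = 0b11111
total_all = bit_sliced_sum(slices, all_rows)
total_first_two = bit_sliced_sum(slices, 0b00011)  # rows 0 and 1 only
```

No individual amount is ever reconstructed; the sum comes entirely from ANDs and bit counts.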

Range Filtering with Bit-Sliced Indexes
Bit-sliced indexes allow range filtering
Cost of applying range predicate independent of size of range
– Not true for bitmap indexes, B-Trees
We’ll give the algorithm for “A < c”
– A is the attribute that is indexed
– c is some constant
– Other operations (>, =, etc.) are similar

Pseudocode for “A < c”
Set B_LT = all zeros.
Set B_EQ = all ones.
For each bit slice B_i, from most to least significant:
– If bit i of constant c is 1:
    – B_LT = B_LT OR (B_EQ AND NOT(B_i))
    – B_EQ = B_EQ AND B_i
– If bit i of constant c is 0:
    – B_EQ = B_EQ AND NOT(B_i)
Return B_LT.
Why does it work?
Invariant: B_EQ = 1 for all rows that match c on the most significant bits processed so far (and only those rows)
A value x is less than c iff for some bit i:
– x and c agree on all bits more significant than i
– The ith bit of x is 0, and the ith bit of c is 1
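
A direct Python rendering of the pseudocode, as a sketch under a few assumptions: bitmaps are Python integers (bit r = row r), slices are numbered from 0 = least significant, and the two guard clauses for constants outside the indexed range are additions not stated on the slide. The sample slices encode hypothetical amounts 13, 7, 6, 4, 3.

```python
def rows_less_than(slices, c, n_rows):
    """Return a bitmap of the rows whose indexed value is < c."""
    all_rows = (1 << n_rows) - 1
    if c <= 0:
        return 0                      # no non-negative value is below c
    if c >= (1 << len(slices)):
        return all_rows               # c exceeds every representable value
    b_lt, b_eq = 0, all_rows
    for i in reversed(range(len(slices))):   # most to least significant
        b_i = slices[i]
        if (c >> i) & 1:
            b_lt |= b_eq & ~b_i       # tied so far, and this bit is 0 where c has 1
            b_eq &= b_i
        else:
            b_eq &= ~b_i
    return b_lt

# Slices for amounts [13, 7, 6, 4, 3]; slices[0] is the least significant bit.
slices = [0b10011, 0b10110, 0b01111, 0b00001]
below_7 = rows_less_than(slices, 7, n_rows=5)
```

The loop runs once per slice regardless of c, which is why the cost does not depend on the size of the range.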

“Amount < 7” Example
[Figure: bit slices B4, B3, B2, B1 of the Amount column, the constant 7 = 0111, and the B_LT and B_EQ bitmaps]

Bit-Sliced vs. Projection Index
Both benefit from vertical decomposition
– Values for 1 column are packed onto as few pages as possible
Both allow for fast aggregation
Both allow range filtering
Bit-sliced indexes can be faster when:
– Attribute values don’t use full data type precision
– Query is CPU-bound
    – Fewer machine instructions due to SIMD parallelism
Bit-sliced indexes are more complicated

Bit Vector Filtering
Sometimes called “Bloom Join”
– From “Bloom Filters” [Bloom 1970]
Bloom filter = cheap, approximate semi-join
– Store bitmap with 1 bit per hash bucket
– Initialize bitmap to all zeros
– For each record in relation A:
    – Compute hash value of join key
    – Set the bit for the appropriate hash bucket to 1
In distributed database, send bitmap to Server 2.
For each record in relation B:
– Compute hash value of join key
– Check whether bit for the appropriate bucket is set
    – Yes → Record might join to something in A
    – No → Record definitely doesn’t join to anything in A
Send qualifying B records to Server 1 & compute join
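
The two phases can be sketched as below, using a single hash function as on the slide. The choice of `zlib.crc32` as the hash, the bucket count of 64, and the sample keys and records are assumptions for illustration.

```python
import zlib

def bucket(key, n_buckets):
    """Map a join key to a hash bucket (CRC32 is an arbitrary, deterministic choice)."""
    return zlib.crc32(str(key).encode()) % n_buckets

def build_bit_vector(keys_a, n_buckets):
    """Phase 1 (Server 1): set one bit per hash bucket hit by a join key of A."""
    bits = 0
    for key in keys_a:
        bits |= 1 << bucket(key, n_buckets)
    return bits

def filter_b(records_b, bits, n_buckets, key=lambda record: record[0]):
    """Phase 2 (Server 2): keep only B records whose bucket bit is set."""
    return [r for r in records_b if (bits >> bucket(key(r), n_buckets)) & 1]

keys_a = [10, 20, 30]
records_b = [(10, "x"), (25, "y"), (30, "z"), (99, "w")]
bits = build_bit_vector(keys_a, n_buckets=64)
survivors = filter_b(records_b, bits, n_buckets=64)
```

Records with keys 10 and 30 are guaranteed to survive. Whether 25 or 99 also survive depends on hash collisions, which is the sense in which the semi-join is approximate: false positives are possible, false negatives are not.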

Bit Vector Filtering
Also useful in non-distributed database
Applies to multi-pass hash or merge join
Hash join
– Partition relation A
– Partition relation B
– Join each A partition with matching B partition
Merge join
– Generate sorted runs for relation A
– Generate sorted runs for relation B
– Merge A’s runs, Merge B’s runs, Merge A & B
Bit vector filtering
– While pre-processing A, generate Bloom filter
– While pre-processing B, discard records that don’t match filter
– Significantly reduce size of B at low cost
– Since A is already being scanned, generating the Bloom filter is “free”
– Bloom filter can be made as small or large as memory permits
    – Fewer buckets → more collisions → less effective filtering
    – Fewer buckets → less memory to store bitmap
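
The bucket-count trade-off can be made concrete with a small calculation. This is a sketch that assumes a single hash function spreading keys uniformly over the buckets; the function name and the sample sizes are illustrative. Under that assumption, a B key that matches nothing in A still passes the filter with probability 1 - (1 - 1/m)^n, where m is the number of buckets and n the number of distinct join keys in A.

```python
def pass_rate_for_non_matching(n_buckets, n_distinct_keys_a):
    """Chance that a B key absent from A still lands on a set bit.

    Assumes one hash function and uniform hashing.
    """
    return 1.0 - (1.0 - 1.0 / n_buckets) ** n_distinct_keys_a

# More buckets (more memory) means fewer useless B records survive the filter.
for n_buckets in (1_000, 10_000, 100_000):
    rate = pass_rate_for_non_matching(n_buckets, 5_000)
    print(f"{n_buckets:>7} buckets: {rate:.3f} of non-matching B records pass")
```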

Next week: Physical DB Design
This concludes the query processing topic
Next week, we’ll begin physical database design
– Selection of indexes and materialized views
– Partitioning and data layout
– RAID and hardware considerations
– Physical design trade-offs
– Database tuning
Note on course project:
– Start thinking about topics
– We’ll discuss the project in more detail next class.