July, 2001 High-dimensional indexing techniques Kesheng John Wu Ekow Otoo Arie Shoshani.



The big picture
[Diagram; surviving labels: grid, storage, MPI-IO, file, Request Interpreter, dataset, Data mining, Distributed, Large]

The big picture
[Diagram; surviving labels: Request interpreter, Logical request, Qualified objects, Request planning/execution, Execution services, grid, LBNL, PPDG, MPI-IO, …, Sub-task schedule]

Problem statement
- Main objective: map a logical request to qualified objects
  - A logical request: <=eventTime & 200<energy<300 …
  - Objects: a set of object IDs; the set of files containing the objects; offsets within the files, …
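The mapping from a logical request to qualified objects can be sketched with a brute-force baseline. This is only an illustration: the column values, the bounds on `eventTime`, and the function name `qualified_objects` are hypothetical and do not come from the slides.

```python
# Hypothetical column-wise data: one list per attribute, indexed by object ID.
columns = {
    "eventTime": [10, 25, 31, 47, 52, 68],
    "energy":    [150, 250, 210, 320, 299, 201],
}

def qualified_objects(columns, predicates):
    """Return the IDs of objects that satisfy every range predicate.

    predicates maps an attribute name to an open interval (low, high).
    Attributes that are not mentioned stay unconstrained, which is what
    makes this a partial range query.
    """
    n = len(next(iter(columns.values())))
    hits = []
    for oid in range(n):
        if all(low < columns[attr][oid] < high
               for attr, (low, high) in predicates.items()):
            hits.append(oid)
    return hits

# Hypothetical request: 20 < eventTime < 60 and 200 < energy < 300.
request = {"eventTime": (20, 60), "energy": (200, 300)}
result = qualified_objects(columns, request)
```

This version touches every value of every constrained attribute, which is the cost an index is meant to avoid.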

Requirements & Status
General requirements
- Users request data in terms of their scientific domain, not file names or offsets in files
- Each object may be described by hundreds of attributes
- Each request is expressed as range predicates on a handful of attributes (partial range query)
Status
- Initially motivated by a HENP experiment: STAR
- Software was originally developed under GC and is currently in use at BNL

Large high-dimensional datasets
- Number of attributes / columns: 200 – 500
- Number of objects / events: 10^8 – 10^9
- File containing one attribute: 400MB – 4GB
- Total size over all attributes: 80GB – 2TB
[Table sketch: columns A1, A2, A3, A4, …; one row per object ID]
Goal: develop an index so that we
- Read as little as possible from disk
- Minimize computation in memory
Challenge: the curse of dimensionality
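The size figures on this slide are consistent with each other, as a quick check shows. The sketch assumes decimal units (1 MB = 10^6 bytes) and pairs the smallest file size with the smallest object count; both are assumptions, since the slide does not say.

```python
MB = 10**6
GB = 10**9
TB = 10**12

def total_size(num_attributes, bytes_per_attribute_file):
    """Total bytes when each attribute is stored in its own file."""
    return num_attributes * bytes_per_attribute_file

low = total_size(200, 400 * MB)    # 200 attributes x 400 MB
high = total_size(500, 4 * GB)     # 500 attributes x 4 GB

# Implied storage per value if a 400 MB file holds 10^8 values
# (and a 4 GB file holds 10^9 values).
bytes_per_value_small = (400 * MB) // 10**8
bytes_per_value_large = (4 * GB) // 10**9
```

Under these assumptions the totals come to 80 GB and 2 TB, and each attribute value occupies 4 bytes.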

Well known indexing methods
B-tree based indices
- One or a small number of attributes
- Index size may be up to 3 times the data size
R-tree based indices
- Small number of attributes, say, < 10
UB-tree
- Uses space-filling curves to map high-dimensional data to one dimension
- One range query is mapped into many queries on the B-tree based index
Even sequential scan
- Better than B-tree and R-tree if dimension > 10
- Simply reads all data and compares => takes too long
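The UB-tree point can be made concrete with a small sketch. It assumes the space-filling curve is the Z-order (Morton) curve and works in two dimensions only; the function names are illustrative. It shows how a single box query breaks into several separate one-dimensional intervals.

```python
def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_intervals(x_lo, x_hi, y_lo, y_hi):
    """Split an inclusive 2-D box into maximal runs of consecutive Z-values."""
    zs = sorted(morton2d(x, y)
                for x in range(x_lo, x_hi + 1)
                for y in range(y_lo, y_hi + 1))
    runs = []
    for z in zs:
        if runs and z == runs[-1][1] + 1:
            runs[-1][1] = z          # extend the current run
        else:
            runs.append([z, z])      # start a new run
    return [tuple(r) for r in runs]

# The 2 x 2 box 1 <= x <= 2, 1 <= y <= 2 covers four cells,
# none of which are adjacent on the curve.
intervals = z_intervals(1, 2, 1, 2)
```

Here a query covering only four cells already needs four separate lookups on the one-dimensional index, and the fragmentation grows with the number of dimensions.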

Another class of indexes: Bitmap index
Example queries on an attribute, say, A
- One-sided range query: A < 2
  - b0 OR b1
- Two-sided range query: 2 < A < 5
  - b3 OR b4
Basic steps of building a bitmap index
- Binning
- Encoding
- Compressing
[Figure: a column of data values and six bitmaps b0 … b5, one per value (=0, =1, =2, =3, =4, =5)]
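The two example queries can be reproduced with a minimal sketch of an uncompressed bitmap index that keeps one bitmap per value. Using a Python integer as the bit vector, the sample data, and the helper names are all illustrative assumptions.

```python
def build_equality_bitmaps(values, num_values):
    """One bitmap per distinct value.

    Bit i of bitmap v is set when values[i] == v. Each bitmap is a Python
    int used as a bit vector.
    """
    bitmaps = [0] * num_values
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row
    return bitmaps

def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

data = [0, 1, 5, 3, 1, 2, 0, 4, 1]     # hypothetical values of attribute A
b = build_equality_bitmaps(data, 6)    # b[0] .. b[5]

one_sided = b[0] | b[1]                # A < 2
two_sided = b[3] | b[4]                # 2 < A < 5
```

Both queries are answered with bitwise OR operations alone, without reading the data values again.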

How many bins?
[Figure: Range(x) and Range(y), with an edge bin marked]
More bins => fewer objects in edge bins

How to encode
- Equality encoding
- Range encoding
- Interval encoding
[Figure: the three encodings illustrated with 6 bins]
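One way to picture the three encodings for 6 bins is sketched below. The definitions follow a common formulation and are assumptions here, since the figure did not survive: range encoding keeps a bitmap for "bin j or lower", and interval encoding keeps a bitmap for a window of consecutive bins. The window width of 3 is also an assumption.

```python
NUM_BINS = 6

def equality_encode(bin_no):
    """Bit j is set only when the value is in bin j (6 bitmaps)."""
    return [int(bin_no == j) for j in range(NUM_BINS)]

def range_encode(bin_no):
    """Bit j is set when the value is in bin j or lower.

    The last bitmap would be all ones, so only 5 bitmaps are kept.
    """
    return [int(bin_no <= j) for j in range(NUM_BINS - 1)]

def interval_encode(bin_no, width=3):
    """Bit j is set when the value is in bins j .. j + width - 1 (3 bitmaps)."""
    return [int(j <= bin_no < j + width) for j in range(NUM_BINS // 2)]

def two_sided_query(bin_no):
    """Evaluate 2 < A < 5 for one row under each encoding."""
    e = equality_encode(bin_no)
    r = range_encode(bin_no)
    i = interval_encode(bin_no)
    via_equality = e[3] | e[4]          # bin 3 OR bin 4
    via_range = r[4] & (1 - r[2])       # (A <= 4) AND NOT (A <= 2)
    via_interval = i[2] & (1 - i[0])    # in bins 2..4 AND NOT in bins 0..2
    return via_equality, via_range, via_interval
```

All three encodings give the same answer for every bin. They differ in how many bitmaps are stored and how many must be read per query.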

Advantages of bitmap indices
- Fast operations
  - The most common operations are bitwise logical operations
  - They are well supported by hardware
- Easy to compress, potentially small index size
- Each individual bitmap is small, and frequently used ones can be cached in memory
- Efficient for read-mostly data: data produced by scientific experiments can be appended in large groups
- Available in most major commercial DBMS

Why our own bitmap index
- Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)
- Vertical partition: allows one to read only the data of the attributes involved in a query
- New compression method
  - Best known: Byte-aligned Bitmap Code (BBC)
  - Developed 2 word-aligned schemes: WAH, WBC
- Different encoding schemes under compression
  - Equality encoding: used in ORACLE and others
  - Range encoding: one-sided range queries
  - Interval encoding: two-sided range queries
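The idea behind a word-aligned scheme can be sketched with a simplified run-length encoder over 32-bit words. The word layout used here (31 payload bits per literal word; a fill word with a flag bit, a fill bit, and a 30-bit group count) follows the usual description of WAH, but the code is an illustrative assumption and not the authors' implementation. It only accepts inputs whose length is a multiple of 31.

```python
WORD_BITS = 32
GROUP = WORD_BITS - 1            # 31 payload bits per word
FILL_FLAG = 1 << 31              # most significant bit marks a fill word
FILL_VALUE = 1 << 30             # next bit holds the fill bit (0 or 1)
MAX_COUNT = (1 << 30) - 1        # low 30 bits count the 31-bit groups

def wah_encode(bits):
    """Encode a list of 0/1 values whose length is a multiple of 31."""
    assert len(bits) % GROUP == 0
    words = []
    for start in range(0, len(bits), GROUP):
        group = bits[start:start + GROUP]
        if all(b == group[0] for b in group):
            fill = FILL_FLAG | (FILL_VALUE if group[0] else 0)
            same_fill = words and (words[-1] & ~MAX_COUNT) == fill
            if same_fill and (words[-1] & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1               # extend the previous fill word
            else:
                words.append(fill | 1)       # new fill word covering 1 group
        else:
            literal = 0
            for b in group:
                literal = (literal << 1) | b
            words.append(literal)            # literal word, top bit stays 0
    return words

def wah_decode(words):
    """Expand the words back into the original list of bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_VALUE else 0
            bits.extend([bit] * (GROUP * (w & MAX_COUNT)))
        else:
            bits.extend((w >> i) & 1 for i in range(GROUP - 1, -1, -1))
    return bits

# 62 zeros, one mixed group, then 31 ones: 124 bits become 3 words.
sample = [0] * 62 + [1] + [0] * 30 + [1] * 31
encoded = wah_encode(sample)
```

Because every word is either a whole literal or a whole fill, logical operations can work one word at a time instead of unpacking bytes, which is the property the word-aligned schemes rely on for speed.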

Information about the test machines
Hardware and system
- Sun Enterprise 450 (UltraSPARC II, 400MHz)
- 4GB RAM
- VERITAS volume manager (striped disk)
Real application data from STAR
- More than 2 million objects
- Picked 12 attributes with varying distributions
Measures
- Logical operation time without IO
- Logical operation time with IO
- Query processing time

Logical operation time (no IO)

Logical operation time (including IO)

New compression schemes
- Overall, they use about 50% more space than BBC
- On average, they are 12 times faster than BBC
- Faster than the uncompressed scheme in more cases:
  - The new schemes are faster than the uncompressed scheme when the compression ratio is less than 0.3
  - BBC is faster than the uncompressed scheme when the compression ratio is less than 0.03

Sizes of bitmap indices
Conclusion:
- Equality encoding is the most space efficient
- Compression gain is at least a factor of 2.5

Average query processing time
Conclusion:
- Interval and range encoding are the best
- For these cases, there is practically no penalty for compression

Interval encoding is better overall
Sequential scan time: sec

Summary
- Better compression scheme
  - 50% more space, but much faster
- Among the different encoding schemes
  - Interval encoding is better than equality encoding and range encoding
- The number of bins determines bitmap index size and operation efficiency. For example:
  - 10% of data size => 3 x speed of sequential scan
  - 20% of data size => 6 x speed of sequential scan
- Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.

Future work
- Support NULL values and categorical values
- On-line update: add new data and update the index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology
- Study different non-uniform binning strategies
- Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end