
Using Bitmap Index to Speed up Analyses of High-Energy Physics Data
John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art Poskanzer (Lawrence Berkeley National Laboratory)
Wei-Ming Zhang (Kent State University)
Jerome Lauret (Brookhaven National Laboratory)

Outline
- Overview of the bitmap index
- Introduction to FastBit
- Overview of the Grid Collector
- Two use cases: "common" jobs and "exotic" jobs

Basic Bitmap Index
- Compact: one bit per distinct value per object
- Easy to build: faster to construct than common B-trees
- Efficient to query: only bitwise logical operations are needed, e.g. A < 2 is answered as b0 OR b1, and 2 < A < 5 as b3 OR b4
- Efficient for multi-dimensional queries: bitwise operations combine the partial results from the individual attributes
[Figure: example table of data values 0 through 5 and the corresponding bitmaps b0 through b5, one bitmap per distinct value]
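As a concrete illustration of the scheme on this slide, the sketch below builds one bitmap per distinct value of an integer attribute and answers the range condition 2 < A < 5 by OR-ing bitmaps b3 and b4; a second attribute shows how partial results from different attributes are combined with AND. This is a minimal Python sketch of the general technique under simplified assumptions, not FastBit's actual API.

```python
# Minimal sketch of a basic (equality-encoded) bitmap index.
# Each distinct value v of an attribute gets one bitmap b_v,
# stored here as a Python integer used as a bit vector.

def build_bitmap_index(values):
    """Return {value: bitmap} with bit i set in bitmap[v] iff values[i] == v."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

def range_query(bitmaps, lo, hi):
    """Bitmap of rows with lo < value < hi: OR of the bitmaps in between."""
    result = 0
    for v, b in bitmaps.items():
        if lo < v < hi:
            result |= b
    return result

# Attribute A with values 0..5, attribute B with a few values (toy data).
A = [1, 4, 3, 0, 5, 2, 4, 3]
B = [7, 7, 2, 7, 2, 2, 7, 2]

idx_A = build_bitmap_index(A)
idx_B = build_bitmap_index(B)

hits_A = range_query(idx_A, 2, 5)       # 2 < A < 5  ->  b3 OR b4
hits_AB = hits_A & idx_B.get(7, 0)      # ... AND B == 7 (multi-dimensional query)

print([i for i in range(len(A)) if (hits_AB >> i) & 1])   # rows satisfying both
```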

An Efficient Compression Scheme: Word-Aligned Hybrid (WAH) Code
- Groups the bits of each bitmap into 31-bit groups (for 32-bit words)
- Encodes each group with one machine word: a literal word stores a mixed group verbatim; neighboring groups whose bits are identical (all 0s or all 1s) are merged into a single fill word that records the run length
- Example: 31 literal bits, followed by a run of 63 identical 31-bit groups (63*31 bits), followed by another 31 literal bits, is encoded in three WAH words: literal word, fill word with run length 63, literal word
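To make the grouping concrete, here is a small, hedged sketch of WAH-style encoding, assuming 32-bit words (so 31 payload bits per group): mixed groups become literal words, and consecutive all-0 or all-1 groups are merged into fill words carrying a run count. FastBit's real encoder is more involved (decoding, tail bits, and in-place logical operations are not shown).

```python
# Sketch of Word-Aligned Hybrid (WAH) encoding, assuming 32-bit words.
# bits: a sequence of 0/1 values whose length is a multiple of 31.

GROUP = 31                      # payload bits per word (word size 32, 1 flag bit)
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Return a list of 32-bit WAH words.
    Literal word : flag bit 0, then 31 payload bits
    Fill word    : flag bit 1, fill bit, 30-bit run length (in 31-bit groups)
    """
    words = []
    for g in range(0, len(bits), GROUP):
        group = 0
        for b in bits[g:g + GROUP]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):                      # candidate for a fill word
            fill_bit = 1 if group == ALL_ONES else 0
            if words and (words[-1] >> 31) == 1 and ((words[-1] >> 30) & 1) == fill_bit:
                words[-1] += 1                          # extend previous fill's run length
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(group)                         # literal word
    return words

# 31 mixed bits + 63*31 zero bits + 31 mixed bits  ->  3 WAH words
bits = [1, 0] * 15 + [1] + [0] * (63 * 31) + [0, 1] * 15 + [1]
print(len(wah_encode(bits)))    # 3
```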

Compressed Bitmap Index Is Compact
- The expected index size for a uniform random attribute (measured in number of words) is smaller than that of typical B-trees (3N to 4N words), where N is the number of rows, w is the number of bits per word, and c is the number of distinct values, i.e., the attribute cardinality
[Figure: measured index sizes on a 100 million row synthetic data set and a 25 million row combustion data set]
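The exact size formula on the original slide is an image and is not reproduced here, but the rough argument behind the claim can be sketched, assuming a uniformly random attribute whose cardinality c is much smaller than N: each of the c bitmaps contains about N/c set bits scattered at random, and under WAH each isolated set bit costs at most one literal word plus one fill word. A hedged back-of-envelope bound:

```latex
% Back-of-envelope bound (not the exact formula from the slide):
% each bitmap has ~N/c randomly scattered 1s, each costing at most 2 words under WAH,
% so the whole index of c bitmaps needs roughly
\text{index size} \;\lesssim\; c \cdot 2\,\frac{N}{c} \;=\; 2N \ \text{words},
% which is below the 3N--4N words typically needed by a B-tree.
```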

Compressed Bitmap Index Is Optimal For One-dimensional Queries
- Compressed bitmap indices are optimal for one-attribute range conditions: query processing time is at worst proportional to the number of hits
- Only a small number of the most efficient indexing schemes, such as B-trees, have this property
- Bitmap indices are also efficient for multi-dimensional queries

Compressed Bitmap Index Is Efficient For Multi-dimensional Queries
[Figure: log-log plot of query processing time for queries of different sizes]
- The compressed bitmap index is at least 10X faster than a B-tree and 3X faster than the projection index

Data Analysis Process In STAR
- Users want to analyze "some" (not all) events
- Events are stored in millions of files, distributed over many storage systems
- To perform an analysis, a user needs to:
  - Prepare the analysis: write the analysis code and specify the events of interest
  - Run the analysis:
    1. Locate the files containing the events of interest
    2. Prepare disk space for the files
    3. Transfer the files to the disks
    4. Recover from any errors
    5. Read the events of interest from the files
    6. Remove the files

Components of the Grid Collector (legend: red = new components, purple = existing components)
1. Locate the files containing the events of interest: Event Catalog, file and replica catalogs
2. Prepare disk space and transfer the files:
   a) Prepare disk space for the files: Disk Resource Manager (DRM)
   b) Transfer the files to the disks: Hierarchical Resource Manager (HRM) to access HPSS, with on-demand transfers from HRM to DRM
   c) Recover from any errors: HRM recovers from HPSS failures, DRM recovers from network transfer failures
3. Read the events of interest from the files: Event Iterator with fast-forward capability
4. Remove the files: DRM performs garbage collection using pinning and lifetimes
- Consistent with other SRM-based strategies and tools (a workflow sketch follows below)
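The steps above can be read as a small orchestration loop. The sketch below is purely illustrative: the service objects and method names (event_catalog.query, file_locator.locate, drm, hrm, and so on) are hypothetical stand-ins for the components named on this slide, not the actual Grid Collector or SRM APIs.

```python
# Hypothetical sketch of the Grid Collector workflow described above.
# All service objects and method names are illustrative, not real APIs.

def grid_collector_run(conditions, analysis, event_catalog, file_locator, drm, hrm):
    # 1. Locate the files (and event IDs) that satisfy the user's conditions.
    files_and_events = event_catalog.query(conditions)   # -> {logical_file: [event_ids]}

    for logical_file, event_ids in files_and_events.items():
        physical = file_locator.locate(logical_file)      # logical name -> physical location

        # 2a/2b. Prepare disk space and transfer the file on demand (HRM -> DRM).
        local_path = drm.allocate_space(physical.size)
        try:
            hrm.transfer(physical, local_path)            # may pull the file from HPSS
        except Exception:
            hrm.retry(physical, local_path)               # 2c. recover from transfer failures

        # 3. Read only the selected events, fast-forwarding past the rest.
        for event in analysis.event_iterator(local_path, event_ids):
            analysis.process(event)

        # 4. Release the file; the DRM garbage-collects unpinned files later.
        drm.release(local_path)
```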

Grid Collector: Architecture
[Architecture diagram showing clients and servers]
- Clients: the user's analysis code with an Event Iterator, submitting new queries
- Grid Collector servers: Event Catalog (in: conditions; out: logical files, event IDs), File Locator (in: logical name; out: physical location), File Scheduler (in: physical file), DRM, Administrator
- Index Builder (in: STAR tag file; out: bitmap index), with operations: fetch tag file, load subset, rollback, commit
- Storage and catalogs: NFS/local disk, Replica Catalog, HRM 1, HRM 2

FastBit Index For the Event Catalog
- For 13 million events in a 62 GeV production (STAR, 2004)
- Event Catalog size (including base data and bitmap indices): 27 GB
- tags: 6.0 GB (part of the base data of the Event Catalog)
- MuDST: 4.1 TB
- event: 8.6 TB
- raw: 14.6 TB
- Time to produce tags, MuDST, and event files from raw data: 3.5 months on 300+ CPUs
- Time to build the catalog: 5 days on one CPU

Grid Collector Speeds Up Reading
- Test machine: 2.8 GHz Xeon, 27 MB/s read speed
- Without the Grid Collector, an analysis job reads all events
- Speedup = time to read all events / time to read only the selected events with the Grid Collector
- Observed speedup ≥ 1
- When searching for rare events, say selecting one event out of 1000, using the Grid Collector is 20 to 50 times faster

Grid Collector Speeds Up Actual Analysis
- Speedup = time used with the existing filtering mechanism / time used with the Grid Collector selecting the same events
- Tested on flow analysis jobs
- Test data set: 51 MuDST files, 8 GB, 25,000 events (P04ij)
- The test data uses an efficient organization that favors the existing filtering mechanism, which reads only part of the event data for filtering
- Real analysis jobs typically include their own filtering mechanisms and may also spend a significant amount of time performing computation
- On a set of "real" analysis jobs that typically select about 10% of the events, using the Grid Collector gives a speedup of 2 in CPU time and 1.4 in elapsed time; speeding up all jobs by 1.4X means the same computer center can accommodate 40% more analysis jobs

Grid Collector Enables Hard Analysis Jobs
- Searching for anti-3He (Lee Barnby, Birmingham): an initial study identified collision events that possibly contain anti-3He and need further analysis (2000)
- Searching for strangelets (Aihong Tang, BNL): an initial study identified events that may indicate the existence of strangelets and need further investigation (2000)
- Without the Grid Collector, one would have to retrieve every file from HPSS and scan it for the wanted events; this may take weeks or months, and no one wants to do it
- With the Grid Collector, both analyses completed in a day

Summary: Grid Collector
- Makes use of two distinct technologies: FastBit and SRM (Storage Resource Manager)
- Speeds up common analysis jobs where the files are already on disk
- Enables difficult analysis jobs where some files may not be on disk
- Contact information: John Wu, Jerome Lauret