Indexing and Selection of Data Items Using Tag Collections Sebastien Ponce CERN – LHCb Experiment EPFL – Computer Science Dpt Pere Mato Vila CERN – LHCb.

Slides:



Advertisements
Similar presentations
Final Project of Information Retrieval and Extraction by d 吳蕙如.
Advertisements

23/04/2008VLVnT08, Toulon, FR, April 2008, M. Stavrianakou, NESTOR-NOA 1 First thoughts for KM3Net on-shore data storage and distribution Facilities VLV.
1 Overview of Storage and Indexing Chapter 8 (part 1)
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Sensor Data Management with Model-based View LSIR, EPFL.
ObjectStore Martin Wasiak. ObjectStore Overview Object-oriented database system Can use normal C++ code to access tuples Easily add persistence to existing.
Panel Summary Andrew Hanushevsky Stanford Linear Accelerator Center Stanford University XLDB 23-October-07.
Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.
Spatial and Temporal Databases Efficiently Time Series Matching by Wavelets (ICDE 98) Kin-pong Chan and Ada Wai-chee Fu.
Module 8: Monitoring SQL Server for Performance. Overview Why to Monitor SQL Server Performance Monitoring and Tuning Tools for Monitoring SQL Server.
CERN/IT/DB Multi-PB Distributed Databases Jamie Shiers IT Division, DB Group, CERN, Geneva, Switzerland February 2001.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
Computing and LHCb Raja Nandakumar. The LHCb experiment  Universe is made of matter  Still not clear why  Andrei Sakharov’s theory of cp-violation.
XML as a Boxwood Data Structure Feng Zhou, John MacCormick, Lidong Zhou, Nick Murphy, Chandu Thekkath 8/20/04.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Master Thesis Defense Jan Fiedler 04/17/98
Data Compression By, Keerthi Gundapaneni. Introduction Data Compression is an very effective means to save storage space and network bandwidth. A large.
Copyright © 2000 OPNET Technologies, Inc. Title – 1 Distributed Trigger System for the LHC experiments Krzysztof Korcyl ATLAS experiment laboratory H.
Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008.
Intro – Part 2 Introduction to Database Management: Ch 1 & 2.
4/5/2007Data handling and transfer in the LHCb experiment1 Data handling and transfer in the LHCb experiment RT NPSS Real Time 2007 FNAL - 4 th May 2007.
Using Bitmap Index to Speed up Analyses of High-Energy Physics Data John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art Poskanzer Lawrence Berkeley National.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
HPDC 2013 Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices Yu Su*, Gagan Agrawal*, Jonathan Woodring # Kary Myers #, Joanne Wendelberger.
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
LOGO Development of the distributed computing system for the MPD at the NICA collider, analytical estimations Mathematical Modeling and Computational Physics.
The LHCb CERN R. Graciani (U. de Barcelona, Spain) for the LHCb Collaboration International ICFA Workshop on Digital Divide Mexico City, October.
CCGrid, 2012 Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
CS848 Similarity Search in Multimedia Databases Dr. Gisli Hjaltason Content-based Retrieval Using Local Descriptors: Problems and Issues from Databases.
Web Design and Development. World Wide Web  World Wide Web (WWW or W3), collection of globally distributed text and multimedia documents and files 
Integration of the ATLAS Tag Database with Data Management and Analysis Components Caitriana Nicholson University of Glasgow 3 rd September 2007 CHEP,
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
Jean-Roch Vlimant, CERN Physics Performance and Dataset Project Physics Data & MC Validation Group McM : The Evolution of PREP. The CMS tool for Monte-Carlo.
Some Ideas for a Revised Requirement List Dirk Duellmann.
Predrag Buncic Future IT challenges for ALICE Technical Workshop November 6, 2015.
ESG-CET Meeting, Boulder, CO, April 2008 Gateway Implementation 4/30/2008.
Computing for LHC Physics 7th March 2014 International Women's Day - CERN- GOOGLE Networking Event Maria Alandes Pradillo CERN IT Department.
Andrea Valassi (CERN IT-DB)CHEP 2004 Poster Session (Thursday, 30 September 2004) 1 HARP DATA AND SOFTWARE MIGRATION FROM TO ORACLE Authors: A.Valassi,
ORACLE & VLDB Nilo Segura IT/DB - CERN. VLDB The real world is in the Tb range (British Telecom - 80Tb using Sun+Oracle) Data consolidated from different.
Computing Issues for the ATLAS SWT2. What is SWT2? SWT2 is the U.S. ATLAS Southwestern Tier 2 Consortium UTA is lead institution, along with University.
LHCbComputing Computing for the LHCb Upgrade. 2 LHCb Upgrade: goal and timescale m LHCb upgrade will be operational after LS2 (~2020) m Increase significantly.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Joint Institute for Nuclear Research Synthesis of the simulation and monitoring processes for the data storage and big data processing development in physical.
Meeting with University of Malta| CERN, May 18, 2015 | Predrag Buncic ALICE Computing in Run 2+ P. Buncic 1.
LHCb 2009-Q4 report Q4 report LHCb 2009-Q4 report, PhC2 Activities in 2009-Q4 m Core Software o Stable versions of Gaudi and LCG-AA m Applications.
ISC321 Database Systems I Chapter 2: Overview of Database Languages and Architectures Fall 2015 Dr. Abdullah Almutairi.
Scalability of Local Image Descriptors Björn Þór Jónsson Department of Computer Science Reykjavík University Joint work with: Laurent Amsaleg (IRISA-CNRS)
The purpose of a CPU is to process data Custom written software is created for a user to meet exact purpose Off the shelf software is developed by a software.
ATLAS – statements of interest (1) A degree of hierarchy between the different computing facilities, with distinct roles at each level –Event filter Online.
Information Retrieval in Practice
Philippe Charpentier CERN – LHCb On behalf of the LHCb Computing Group
Conditions Data access using FroNTier Squid cache Server
New developments on the LHCb Bookkeeping
Introduction to Query Optimization
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
File Organizations Chapter 8 “How index-learning turns no student pale
Spatial Online Sampling and Aggregation
Data Lifecycle Review and Outlook
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
File Organizations and Indexing
File Organizations and Indexing
Selected Topics: External Sorting, Join Algorithms, …
Relational Query Optimization (this time we really mean it)
Event Storage GAUDI - Data access/storage Framework related issues
Planning next release of GAUDI
Fast Accesses to Big Data in Memory and Storage Systems
File Organizations and Indexing
Presentation transcript:

Indexing and Selection of Data Items Using Tag Collections Sebastien Ponce CERN – LHCb Experiment EPFL – Computer Science Dpt Pere Mato Vila CERN – LHCb Experiment Roger D. Hersch EPFL – Computer Science Dpt March 27, 2002

2 Overview  The context : LHCb problems  A new indexing Schema  Selection Process  Theoretical Performance  First measurements

3 Context  Work developed as part of the LHCb experiment at CERN (European Organization for Nuclear Research)  Final aim is high energy particle physics, to study the behavior of the B-Meson and the CP-violation.  Tool : the LHCb detector, being built on the future CERN accelerator : the LHC (Large Hadron Collider)  The principle is to look at billions of particle collisions every second and understand what’s happening

4 The LHC length : 27 km depth : m Near Geneva On the French- Swiss border

5 LHCb Width : 18m Length : 12m Height : 12m Weight : 4.3t

6 Some Figures  Particle collision every 25 ns (40 millions per second).  channels  1 MB of data for each collision  Net result of 40 TB/s of output data  24h a day, 6 months per year (15 millions seconds each year)  + simulations & reconstructed data  * 3 BUT  Interesting physics phenomena are really seldom A very efficient three levels trigger system removes % of the collisions (keeps 200 events per second)  Only 100 KB are kept for each event “Only” 20 MB/s or ~.3 PB/year are stored for real data  Still ~ 1PB/year in total

7 Data Content  The basic item is an event  Events are independent one from the other  A “per event” indexing is needed in order to make a selection among the events (real + simulated + reconstructed)  The content of an event is a mix of booleans, strings, numbers  Size and content of an event may vary

8 Data Selection Needs  Typical physics analysis : selection of interesting events download these events compute some histogram modify the criteria and restart Selection is highly important  Selection characteristics : many variables (up to 30, typically 10-15) mixture of types (boolean, numbers, strings) complicated rules, that may need a structured language

9 Previous Solution  Sequential scan of the whole database.  Every item was converted to a C ++ structure and the selection was carried out in the code  Weber et al (1) demonstrated that this approach was the best one in high dimension The goal is to optimize this sequential scan (1)R. Weber, H.-J. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity- Search Methods in High-Dimensional Spaces. VLDB’98

10 Tags  A tag contains : a subset of the data item it represents a “pointer” to this item  The subset of the item contains few values that will be available for fast selection criteria  A tag is a small, well- structured entity that can be easily stored in a relational database Event blablabla Energy blablabla blablabla blablabla blablabla blablabla blablabla NbOfTracks blabla InteractionType blablabla blablabla blablabla blablabla blablabla blabla MuonChamberDeposit floatEnergy intNbOfTracks intInteractionType floatMuonChamberDeposit stringPointer to Event Tag

11 Tag Types  Several types of tags can be defined for a single event  Their content depends on the type of analysis Event floatEnergy intNbOfTracks intInteractionType floatMuonChamberDeposit stringPointer to Event Tag2 floatEnergy floatCaloEfficiency floatCaloDeposit floatCaloNoiceLevel stringPointer to Event Tag1 blablabla Energy blablabla CaloEfficiency CaloDeposit blablabla blablabla blablabla blablabla blablabla NbOfTracks blablabla InteractionType blablabla blablabla blabla blablabla blablabla CaloNoiceLevel blabla blablabla blablabla MuonChamberDeposit

12 Tag Collections  A Tag Collection is a list of tags of the same type.  There may be many collections with the same type. item 101 item 102 … item n1 Data items item 201 item 202 … item n2 Data items ptr x1 … xn … Tag Collection “TC1” ptr y1 … yp … Tag Collection “TC2” name location type … TC1 … TCn List of Tag Collections name description type1 x1, …, xn … typen y1, …, yp Tag Types

13 Selection Process  The selection process is very flexible :  Selection of the tag collection implies a reduction of the number of data items of interest  Server-side preselection on tags using SQL- like criteria  Client-side refinement on tags using a high level programming language to maximize the preselection efficiency  Carry out the final refinement by reading selected full data items (high level programming language)

14 Selection Process (2) Accessing tags Accessing Data Client Tag Server Data Server tags GB data > 1PB Query a given tag collection Data retrieval (Full Event) Matching tags sent back Local refinement Request for data items Data items sent back Last refinements Server side processing (DB)

15 Performance  The performance of the new retrieval schema can be evaluated by comparing it with a sequential scan  Approximations : data contains only integers no optimizations at all (no pipelining, sequential scans...) no local refinement step  Performances are given under the form of ratios :

16 Processing Time Ratio  Slightly better than   Main improvement : use of reduced size tag collection ~1 tag size proportion of tags fulfilling SQL criteria proportion of items present in tag collection. size of values tested but not in the tag (last local refinement step)

17 Network Load Ratio   is due to the use of a tag collection (subset of events).   is the tag selection ratio  2 is a maximum. Depending on the latency, it can go down to 1 + ,  being the tag size versus the data item size  In practice,  << 1 (~1% in LHCb) and r NET <<  proportion of items present in tag collection. proportion of tags fulfilling SQL criteria

18 Retrieval Ratio (From Disk)   is due to loading small tags instead of larger items   is the tag selection ratio   is due to the use of a tag collection (subset of events).  usually  << 1 and  << 1 (10 -4 and in LHCb) thus r DR <<   Tag size versus selection efficiency can be optimized proportion of items present in tag collection. proportion of tags fulfilling SQL criteria tag size versus data item size Reading Tags Reading Data Items

19 Net Gain for LHCb  Typical values for are : Proportion of items in a collection :  ~ Tag size versus item size :  ~ Proportion of tags fulfilling SQL criteria :  ~  Typical gains are CPU time : r CPU ~10 -4 Network load : r NET ~ Retrieval time : r DR ~10 -6

20 First Measurements  The proposed schema is implemented within Gaudi (C ++ LHCb event computation framework)  Measurement conditions : MySQL as a database. Items of 160 KB, tags reduced to 15 B (  ~ ) Only 5000 events in total (~800 MB) No network, few CPU needed Bottleneck is the retrieval from hard disk  Overall ratio essentially equal to   : proportion of items present in tag collection  proportion of items present in tag collection proportion of tags fulfilling SQL criteria  proportion of tags fulfilling SQL criteria

21 Dependence on  (Tag Collection Size)  N = 5000  No SQL selection  Measured dependency on  is linear as expected

22 Dependence on  (Selectivity of the SQL query)  N = 5000  Tag Collection containing all data items  Dependency on  is linear as expected

23 Conclusions  Tag collections based indexing allows : various and powerful preselections (tag collection, SQL, high level programming language) optimized network load (of the order of loading only matching items) Large global gains (at least 10 4 for LHCb)  Although developed as solution to a specific problem, the method is generic : adapted to data selection problems with highly selective multidimensional criteria, making use of a small subset of the data items  Tag collections may be accessed more efficiently by using existing indexing techniques on tags.

24 Future Work  The data selection schema will be parallelized : retrieving of tags/data items in parallel carrying out I/O and local refinement as a pipeline  Interface with Grid software is foreseen : storage of data items in world-wide distributed databases replication of the tag collections on different sites