Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics.

Slides:



Advertisements
Similar presentations
System Integration and Performance
Advertisements

Computer System Organization Computer-system operation – One or more CPUs, device controllers connect through common bus providing access to shared memory.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Aug Arie Shoshani Particle Physics Data Grid Request Management working group.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
Grid Collector: Enabling File-Transparent Object Access For Analysis Wei-Ming Zhang Kent State University John Wu, Alex Sim, Junmin Gu and Arie Shoshani.
ICNP'061 Benefit-based Data Caching in Ad Hoc Networks Bin Tang, Himanshu Gupta and Samir Das Computer Science Department Stony Brook University.
CS 104 Introduction to Computer Science and Graphics Problems
Midterm Tuesday October 23 Covers Chapters 3 through 6 - Buses, Clocks, Timing, Edge Triggering, Level Triggering - Cache Memory Systems - Internal Memory.
The new The new MONARC Simulation Framework Iosif Legrand  California Institute of Technology.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
CS 524 (Wi 2003/04) - Asim LUMS 1 Cache Basics Adapted from a presentation by Beth Richardson
STACS STACS: Storage Access Coordination of Tertiary Storage for High Energy Physics Applications Arie Shoshani, Alex Sim, John Wu, Luis Bernardo*, Henrik.
1 Computer System Overview Chapter 1 Review of basic hardware concepts.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
1 Computer System Overview Chapter 1. 2 n An Operating System makes the computing power available to users by controlling the hardware n Let us review.
Grid Data Management A network of computers forming prototype grids currently operate across Britain and the rest of the world, working on the data challenges.
Basic Structure of Computer Computer Architecture Lecture – 2.
Lionel F. Lovett, II Jackson State University Research Alliance in Math and Science Computer Science and Mathematics Division Mentors: George Ostrouchov.
Chapter 4 Memory Management.
ICS 321 Fall 2011 Overview of Storage & Indexing (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 11/9/20111Lipyeow.
September, 2002 Efficient Bitmap Indexes for Very Large Datasets John Wu Ekow Otoo Arie Shoshani Lawrence Berkeley National Laboratory.
SDM-Center talk Optimizing shared access to tertiary storage Arie Shoshani Alex Sim Alex Sim July 10, 2001 Scientific Data Management Group Computing Science.
1 GCA Application in STAR GCA Collaboration Grand Challenge Architecture and its Interface to STAR Sasha Vaniachine presenting for the Grand Challenge.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Scientific Data Management Research Group National Energy Research Scientific Computing Center, L B N L 1 Henrik Nordberg, June 1998 Query Estimator Henrik.
STAR Collaboration, July 2004 Grid Collector Wei-Ming Zhang Kent State University John Wu, Alex Sim, Junmin Gu and Arie Shoshani Lawrence Berkeley National.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
CS4432: Database Systems II Query Processing- Part 2.
CSCI1600: Embedded and Real Time Software Lecture 33: Worst Case Execution Time Steven Reiss, Fall 2015.
CE Operating Systems Lecture 2 Low level hardware support for operating systems.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Super Computing 2000 DOE SCIENCE ON THE GRID Storage Resource Management For the Earth Science Grid Scientific Data Management Research Group NERSC, LBNL.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,
PPDG meeting, July 2000 Interfacing the Storage Resource Broker (SRB) to the Hierarchical Resource Manager (HRM) Arie Shoshani, Alex Sim (LBNL) Reagan.
Memory Management OS Fazal Rehman Shamil. swapping Swapping concept comes in terms of process scheduling. Swapping is basically implemented by Medium.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
DMBS Architecture May 15 th, Generic Architecture Query compiler/optimizer Execution engine Index/record mgr. Buffer manager Storage manager storage.
Copyright ©: Nahrstedt, Angrave, Abdelzaher1 Memory.
Memory Management Chapter 5 Advanced Operating System.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
The HENP Grand Challenge Project and initial use in the RHIC Mock Data Challenge 1 D. Olson DM Workshop SLAC, Oct 1998.
Dense-Region Based Compact Data Cube
Chapter 2 Memory and process management
Lecture 16: Data Storage Wednesday, November 6, 2006.
Are they better or worse than a B+Tree?
Architecture Background
CSCE 990: Advanced Distributed Systems
CSCI1600: Embedded and Real Time Software
Memory Management Lectures notes from the text supplement by Siberschatz and Galvin Modified by B.Ramamurthy Chapter 8 11/24/2018.
Memory Management Lectures notes from the text supplement by Siberschatz and Galvin Modified by B.Ramamurthy Chapter 9 12/1/2018.
CS399 New Beginnings Jonathan Walpole.
Overview of Query Evaluation
Lecture 13: Query Execution
Memory Management Lectures notes from the text supplement by Siberschatz and Galvin Modified by B.Ramamurthy Chapter 9 4/5/2019.
CSC3050 – Computer Architecture
Evaluation of Relational Operations: Other Techniques
Parallel Programming in C with MPI and OpenMP
CSCI1600: Embedded and Real Time Software
Parallel Feature Identification and Elimination from a CFD Dataset
Presentation transcript:

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics Applications Arie Shoshani Doron Rotem Henrik Nordberg Luis Bernardo ( Scientific Data Management R&D Group Lawrence Berkeley National Laboratory October, 1997

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 2 Events Clustering and Access: Main Components Tape reorganization module Multidimensional Clustering Analyzer Cluster index Access Patterns Input Dataset description Hardware Characteristics Clustering Spec Application Program Storage Manager Original input DST Restructured output DST data path interaction path n-tuple Properties Data Manager’s Workbench To disk cache M-dim query request extracted micro-DST

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 3 Main Tasks °Discover event clusters –based on natural distribution - Data Mining –based on access patterns - consult physicists –simulate performance - data manager’s workbench Manage cluster access –given a query, determine clusters to access, use multi-dimensional indexes to select events that qualify Reorganize DST tapes according to clusters –long process - done initially, then rarely –flow control - restart after interruption Cache management –determine if in cache, which incremental clusters to cache, which clusters to purge from cache

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 4 Discover events clusters Top down approach –partition each dimension into “bins” (e.g. 1-2 GEV,..., 1-3 pions,...) –select subset of dimensions based on physicist’s experience –analyze which events fall into the same “cell” (i.e. m-dim rectangles formed by the bins) –eliminate empty cells –combine cells to form similar size clusters Assumption –most queries are “range queries”

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 5 Top Down Cell Management Assume: 7 dimensions, 10 bins each -- Number of cells: 10 7, 4 byte counters -- Number of bytes: 4x10 7, 40 MB For e.g. small dataset 97% of cells are empty: -- store only populated cells -- use hash tables to locate existing cells -- use 2 bytes for bin_id per property: ratio for p% full is: 200/ (n+2)p -- No of bytes: for 7 dim, 3% => 5.4 MB Sort cells by size (number of events) no of events cell #

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 6 Cluster discovery sort cells by size pick larger cell to start forming a cluster find all neighbors of “Manhattan distance” equal to 1 include cells above a threshold iterate for all cells in cluster when no more cells above threshold, pick larger remaining cell and start forming a new cluster Display cluster distributions

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 7 The effect of bin sizes on events distribution

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 8 Blowup of cell histogram 1331 cells 3328 cells 7056 cells 33,046 cells 202,581 cells 1,030,301 cells 8,120,601 cells random data

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 9 Cluster Stability number of cells number of clusters bend Flat region We apply the clustering algorithm while increasing the number of bins (and thus the number of cells) Initially the number of clusters is small (cells are too large) The flat region shows stability in the number of clusters When cells are too small the number of cell grows The lower bend is the largest cell size that shows clusters

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 10 Cache Management Hardware scenarios tape library disk RAID HPSS CPU farm tape library parallel disk system HPSS CPU farm scenario A scenario B

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 11 Cache Management Issues Scenario A –RAID more expensive than a Parallel Disk System (factor 2-3) –but, rely on HPSS to manage disk –storage management simplified Scenario B –a Parallel Disk System is cheaper, does not depend on RAID vendor –but, need to manage disk allocation –has control over placement of events on cache Planned initial pilot –Scenario A under NERSC

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 12 Situation at NERSC To test scenario A (will be done first) –tertiary storage AND cache under HPSS –use for “event reorganization module” only –dedicated experimental environment rs6000 connected to HPSS, dedicated RAID partition experiment with “stage”, “migrate”, and “purge” To test scenario B (later) –tertiary storage under HPSS, cache: DPSS –link not set up yet

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 13 Situation at NERSC (Con’d) To test Analysis Framework communication with Storage Manager –go over the net from PDSF to NT-machine using CORBA To test Analysis Framework interacting with a cache (which is controlled by the Storage Manager) –can’t run analysis framework on T3E and C90 –must use PDSF - DPSS

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 14 Query Estimator Cache Manager Query Monitor index cache map (1) Query estimation request (2) Query execution request (3) Notify that cell was cached or removed (5) Finished with cell / abort (6) Ask if finished with cell (4) Notify that cell was cached Object Manager Query Object Event Iterator Analysis Framework (7) Request to re-cache cell Storage Manager Communication between The Analysis Framework and the Storage Manager

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 15 Typical Scenario Query Estimation request (can be repeated many times) Query Execution request Query Monitor makes decision which cells to cache Cache Manager checks if cell is in cache If so notifies process (event iterator) If not it schedules a “stage cell”, “purge” unused cells When cells is cached, Cache Manager notifies Object Manager, as well as Event Iterator Event Iterator notifies Query Monitor when it is done with a cell or when it aborts

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 16 Process Misbehavior Use of a time out mechanism by the Query Monitor Query Monitor asks Event Iterator “are you alive” If no response within some time limit, all cells are removed for this process (if no other processes need them) Event Iterator can request a re-cache of a cell => status of queries that are “non-responsive” is saved for a period of time

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 17 Query Estimator Response Approximate Estimate (quick response) (index in memory) –No_of_events: (min,max) -- nearest bin boundaries –no_of_cells to be cached –total_MBs_to_be_moved –%_of_events_in_cells that qualify for a query (max) –no_of_events_in_cache –time_to_process_query Precise estimate (slower response) (Index on disk) –Precise No_of_events that qualify –ALL other measures the same

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 18 Cache Management Strategy Number of cells to fetch ahead - parameter –e.g. 2 cells => new cell cached only after process informs that it is finished with one of the 2 cells Priority given to cells needed by multiple processes Priority given to cells that reside on the same tape Algorithm should not starve a process