

Organizing the Extremely Large LSST Database for Real-Time Astronomical Processing
ADASS, London, UK, September 23-26, 2007
Jacek Becla (1), Kian-Tat Lim (1), Serge Monkewitz (2), Maria Nieto-Santisteban (3), Ani Thakar (3)
(1) Stanford Linear Accelerator Center
(2) California Institute of Technology
(3) Johns Hopkins University

Expected LSST Database Volumes
– 16x10^12 rows
– 6 petabytes (data only)
– 14 petabytes (data and indexes)
SDSS
– 68x10^9 rows
– 0.02 petabyte
LSST 700x larger, but much later

Real-Time Alert Generation
Goal: generate transient alerts in under 60 sec
– Association <10 sec
Approach:
– Minimize I/O
– Maximize computational efficiency
Pipeline diagram labels: Transfer to base, Source detection, Association, Alert generation, Image processing (<60 sec goal)

Association Steps
Inputs (from diagram):
– Difference Source Collection: new detections, average 40K, max 100K per "visit"
– Object Catalog: historical data, ~50x10^9
– MOPS Catalog: known moving objects with predicted positions
Steps (diagram labels):
– cross-correlate with Object Catalog, mark matched sources, update (~40K)
– cross-correlate with MOPS Catalog, mark matched sources
– add new objects for unmatched sources (~1K)
– persist new sources
– persist new objects and updates
Updates must be available when subsequent observations are processed

I/O: Large Volumes, Sparse Updates
DIASource Catalog
– save detected sources
Object Catalog
– locate, then read objects for given FOV, 4x10^6 on average
– update matching objects, ~40K
– add ~1K objects
Potentially large volume of data accessed
Many small, sparse updates and writes

Minimizing and Sequentializing I/O
Match only against objects in or near processed FOV
– reduces problem from 100K x 50x10^9 to 100K x 4x10^6
Fetch only columns used during cross-match
– reduces problem from 4x10^6 x 1,658 to 4x10^6 x 24 = ~100MB
Cluster together data accessed together (spatially)
Maintain specialized copy of data used by cross-match
– RA, Decl, objectId for objects
Partition data, round-robin across several disks
Serve data from memory during time-critical period
– removes dependency on slow disks
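The two reductions quoted on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes the figures 1,658 and 24 are bytes per object row (which is consistent with the slide's "~100MB"), and the variable names are illustrative, not from the talk.

```python
# Figures taken from the slide; the "bytes per row" reading is an assumption.
SOURCES_PER_VISIT = 100_000        # max difference sources per visit
OBJECTS_TOTAL = 50 * 10**9         # whole Object Catalog
OBJECTS_IN_FOV = 4 * 10**6         # objects in or near one field of view
FULL_ROW_BYTES = 1_658             # assumed: full object row width
MATCH_ROW_BYTES = 24               # assumed: RA, Decl, objectId only

# Candidate pairs before and after restricting the match to the FOV.
pairs_naive = SOURCES_PER_VISIT * OBJECTS_TOTAL
pairs_fov = SOURCES_PER_VISIT * OBJECTS_IN_FOV

# Bytes read for one FOV with full rows versus cross-match columns only.
bytes_full = OBJECTS_IN_FOV * FULL_ROW_BYTES
bytes_match = OBJECTS_IN_FOV * MATCH_ROW_BYTES

print(f"pair reduction factor: {pairs_naive // pairs_fov}")
print(f"full rows: {bytes_full / 1e9:.2f} GB, match columns: {bytes_match / 1e6:.0f} MB")
```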

Minimizing and Sequentializing I/O (diagram)
Diagram labels: Scan whole catalog (naive), Match against processed FOV, Fetch used columns only, Cluster data spatially, Keep specialized copy, Partition data, Serve from memory

Partitioning Data
Divide difference sources and objects into declination stripes and physically partition into chunks
– optimal #partitions: O(300,000)
Stripes are a fraction of a field-of-view high (less than one to minimize wasted I/O around the circular FOV)
Chunks are one stripe height wide
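A minimal sketch of how a position could be mapped to a (stripe, chunk) pair under this scheme. The slide does not give the exact formula, so everything here is an assumption: the function names, the choice to size chunks using the stripe edge nearest the equator, and the stripe height passed in as a parameter.

```python
import math

def stripe_index(dec_deg, stripe_height_deg):
    """Index of the declination stripe containing dec_deg (0 at the south pole)."""
    num_stripes = int(math.ceil(180.0 / stripe_height_deg))
    idx = int(math.floor((dec_deg + 90.0) / stripe_height_deg))
    return min(idx, num_stripes - 1)  # keep dec = +90 inside the last stripe

def chunks_in_stripe(stripe, stripe_height_deg):
    """Number of chunks in a stripe, each about one stripe height wide on the sky."""
    dec_lo = -90.0 + stripe * stripe_height_deg
    dec_hi = dec_lo + stripe_height_deg
    # The stripe is widest at its edge closest to the equator.
    closest = 0.0 if dec_lo <= 0.0 <= dec_hi else min(abs(dec_lo), abs(dec_hi))
    circumference_deg = 360.0 * math.cos(math.radians(closest))
    return max(1, int(math.floor(circumference_deg / stripe_height_deg)))

def chunk_id(ra_deg, dec_deg, stripe_height_deg):
    """Return (stripe, chunk) for a sky position; a hypothetical partition key."""
    stripe = stripe_index(dec_deg, stripe_height_deg)
    n = chunks_in_stripe(stripe, stripe_height_deg)
    chunk = int(math.floor((ra_deg % 360.0) / 360.0 * n))
    return stripe, min(chunk, n - 1)
```

Because chunks narrow in RA toward the poles only through the per-stripe chunk count, every chunk covers a roughly similar area, which is what makes round-robin placement across disks reasonably balanced.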

Serving Data From Memory
Pre-fetch variable objects to memory prior to processing, once the FOV is known
Time-critical period:
– look up newly variable objects on disk (<1,000)
– all other reads/writes from RAM (98+% of I/O)
Flush updates to disk after processing
Regain fault tolerance by mirroring
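The pre-fetch, serve-from-RAM, flush-afterwards cycle can be illustrated with a toy write-back cache. This is a sketch under stated assumptions, not the LSST implementation: `FovCache` is a hypothetical name, a plain dict stands in for the on-disk chunk store, and mirroring is not modeled.

```python
class FovCache:
    """Toy write-back cache: chunk_id -> {object_id: row}."""

    def __init__(self, disk):
        self.disk = disk      # stand-in for on-disk chunks (assumption)
        self.ram = {}         # chunks held in memory for the current FOV
        self.dirty = set()    # chunks modified since the last flush

    def prefetch(self, chunk_ids):
        # Done before the time-critical period, once the FOV is known.
        for cid in chunk_ids:
            self.ram[cid] = dict(self.disk.get(cid, {}))

    def read(self, cid, object_id):
        # Time-critical reads are served from memory only.
        return self.ram[cid][object_id]

    def write(self, cid, object_id, row):
        # Updates stay in memory and are marked for a later flush.
        self.ram[cid][object_id] = row
        self.dirty.add(cid)

    def flush(self):
        # After processing: write modified chunks back; return how many.
        for cid in self.dirty:
            self.disk[cid] = dict(self.ram[cid])
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed
```

Until `flush()` runs, updates exist only in memory, which is why the slide pairs this approach with mirroring to recover fault tolerance.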

Maximizing Computational Efficiency: Algorithm
Generate declination zones dynamically in-memory
– Zones are several search radii high (more than one, to minimize index overhead)
Sort data within zones by RA on the fly
Match difference sources and objects within zones
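A minimal sketch of a zone-based cross-match along these lines: objects are bucketed into declination zones, each zone is sorted by RA, and each source is compared only against the RA window of the zones its search circle overlaps. The function names and tuple layout are illustrative assumptions; for brevity the sketch ignores RA wraparound at 0/360 degrees and uses a simple 1/cos(dec) widening of the RA window, which is assumed adequate for small radii away from the poles.

```python
import bisect
import math
from collections import defaultdict

def ang_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def zone_of(dec, zone_height):
    return int(math.floor((dec + 90.0) / zone_height))

def build_zones(objects, zone_height):
    """Bucket (id, ra, dec) rows by zone and sort each zone by RA."""
    zones = defaultdict(list)
    for obj_id, ra, dec in objects:
        zones[zone_of(dec, zone_height)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones

def cross_match(sources, objects, radius, zone_height):
    """Return (source_id, object_id) pairs closer than radius (degrees)."""
    zones = build_zones(objects, zone_height)
    matches = []
    for src_id, ra, dec in sources:
        # RA window widened by 1/cos(dec); capped to stay finite near the poles.
        widest_dec = min(abs(dec) + radius, 89.9)
        dra = radius / math.cos(math.radians(widest_dec))
        z_lo = zone_of(dec - radius, zone_height)
        z_hi = zone_of(dec + radius, zone_height)
        for z in range(z_lo, z_hi + 1):
            rows = zones.get(z)
            if not rows:
                continue
            # Binary search the RA-sorted zone for the candidate window.
            lo = bisect.bisect_left(rows, (ra - dra,))
            hi = bisect.bisect_right(rows, (ra + dra, float("inf")))
            for o_ra, o_dec, obj_id in rows[lo:hi]:
                if ang_sep_deg(ra, dec, o_ra, o_dec) <= radius:
                    matches.append((src_id, obj_id))
    return matches
```

Making zones several search radii high means a source usually touches one or two zones, so the per-zone bookkeeping stays small relative to the distance tests.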

Maximizing Computational Efficiency: Implementation
Push computation from database to application
– specialized algorithm faster than generic database approach
– moving data to computation (not computation to data) works because we significantly reduced the accessed data volume
– result: significant speedup
Parallelize computation
– by zone

Results
Average, real LSST-scale test
– ~40K sources vs ~4 million objects
Using a single 2002-era server
– SunFire V240 UltraSparc IIIi 1.5GHz
Results
– read and index object data: 9.4 sec
– read and index source data: 0.91 sec
– perform cross-match: 0.14 sec
Requirements shown alongside the results: <30 sec; <10 sec

Summary
LSST Real-Time spatial cross-match
– large scale
– stringent timing requirements
– still easily doable on a single server with appropriate setup and data organization