Cross-matching the sky with database server cluster

Presentation transcript:

Cross-matching the sky with database server cluster László Dobos Eötvös University Dept. of Physics of Complex Systems

Data tsunami. 1959: 4 GB of punched cards

Driving force: Moore’s law

Exponential growth: electronics, detectors, data size

Tractable algorithms. Data size ∝ e^t, computing power ∝ e^t. Algorithms must be O(N), or O(N log N) at most!
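The reasoning behind that limit can be written out as a short worked derivation. This is a sketch only: it assumes that data size and computing power grow with the same time constant \tau (the slide writes both as proportional to e^t), and the symbols N_0, P_0 and \tau are introduced here for illustration. T_f(t) is the running time at time t of an algorithm whose cost is f(N).

```latex
\begin{align}
N(t) &= N_0\, e^{t/\tau}, \qquad P(t) = P_0\, e^{t/\tau} \\
T_{N}(t) &= \frac{N(t)}{P(t)} = \frac{N_0}{P_0} \\
T_{N \log N}(t) &= \frac{N(t)\,\ln N(t)}{P(t)}
                 = \frac{N_0}{P_0}\left(\ln N_0 + \frac{t}{\tau}\right) \\
T_{N^2}(t) &= \frac{N(t)^2}{P(t)} = \frac{N_0^2}{P_0}\, e^{t/\tau}
\end{align}
```

Under these assumptions a linear algorithm keeps a constant running time, an N log N algorithm slows down only linearly with time, and a quadratic algorithm falls behind exponentially.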

Reading a 1 TB disk. Sequentially: 4.5 hours. Random access: 15-150 days
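A back-of-the-envelope calculation shows how figures of this order can arise. The slide does not state its inputs, so every rate below (sequential throughput, read size, cost per random read) is an assumed, illustrative value rather than the author's.

```python
# Rough estimate of the time needed to read a 1 TB disk.
# All rates are assumptions chosen for illustration; the slide gives none.
DISK_BYTES = 1e12               # 1 TB
SEQ_BYTES_PER_S = 62e6          # assumed sequential throughput (62 MB/s)
PAGE_BYTES = 8192               # assumed size of one random read (8 KiB)
SEEK_SECONDS = (0.010, 0.100)   # assumed cost per random read: best, worst

# Sequential scan: total bytes divided by throughput.
seq_hours = DISK_BYTES / SEQ_BYTES_PER_S / 3600

# Random access: one seek per page, for every page on the disk.
pages = DISK_BYTES / PAGE_BYTES
rand_days = tuple(pages * s / 86400 for s in SEEK_SECONDS)

print(f"sequential: {seq_hours:.1f} hours")
print(f"random access: {rand_days[0]:.0f} to {rand_days[1]:.0f} days")
```

With these assumptions the gap between sequential and random access is about two to three orders of magnitude, which is the point the slide makes.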

Main limitation: Disk = tape = 

Solution: parallelize

Relational databases: optimized disk access; declarative programming model; ad hoc query support; user-defined functions. But: interpreted.

RDBMS: scale-up

RDBMS: scale-out


Algorithms: scan, sort, join, random seek. N log N. Only local! (sorted in the same order)

Sky surveys, catalogs

The multiwavelength sky: infrared (2MASS), visible (DSS), ultraviolet (Galex)

Crossmatching. Astronomical catalogs today: stored in RDBMS; O(100 million) objects; O(1 TB – 10 TB) DB size. Match by coordinates: RA, Dec; astrometric error; different sky coverage; different wavelength range; moving objects; etc.

Sample XMatch query (slide annotations: standard SQL, probabilistic crossmatch, spatial constraint)

    SELECT s.objId, g.objID, t.objID,
           s.ra, s.dec, g.ra, g.dec, t.ra, t.dec,
           x.ra, x.dec
    FROM SDSSDR7:Galaxies AS s
         WITH (POINT(s.cx, s.cy, s.cz), ERROR(0.1))
    CROSS JOIN Galex:Galaxies AS g
         WITH (POINT(g.cx, g.cy, g.cz), ERROR(0.2))
    CROSS JOIN TwoMASS:ExtendedSources AS t
         WITH (POINT(t.ra, t.dec), ERROR(0.5))
    XMATCH BAYESIAN AS x
         MUST EXIST g, MAY EXIST t
         HAVING LIMIT 1e3
    REGION 'CIRCLE J2000 165.7 0.3 60'
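The query gives positions either as three columns (cx, cy, cz) or as two (ra, dec). The sketch below assumes that cx, cy, cz hold the Cartesian unit vector of the (RA, Dec) direction, which the slide does not state explicitly. The function names are hypothetical and are not part of SkyQuery.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Convert equatorial coordinates in degrees to a Cartesian unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def separation_arcsec(a, b):
    """Angular separation of two unit vectors, in arcseconds.

    Uses the chord length between the vectors, which stays numerically
    stable for the very small angles typical of a crossmatch.
    """
    chord = math.dist(a, b)
    return math.degrees(2.0 * math.asin(min(1.0, chord / 2.0))) * 3600.0

# Two points one arcsecond apart in declination, near the region center
# used in the sample query.
p = unit_vector(165.7, 0.3)
q = unit_vector(165.7, 0.3 + 1.0 / 3600.0)
print(separation_arcsec(p, q))
```

Working with unit vectors avoids special cases at the RA wrap-around and near the poles, which is one plausible reason a catalog would store them alongside RA and Dec.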

How to extend SQL. SQL is declarative: everything that can be expressed can be executed, so extensions must be executable in any case. Query optimization is hard: design the language with easy optimization in mind; constrain at the level of the grammar; use custom clauses instead of complex WHERE clause logic.

Scaling out: query partitioning. Zone algorithm to index the sphere. Mirror everything: go for speed, old catalogs are small; be able to partition small queries by RA; avoid "sweet spot" problems. Statistics gathered on the fly: use sampled data with cardinality of ∛N; apply WHERE clause and REGION.
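One way to read the sampling idea is to draw roughly ∛N objects, sort them by RA, and use the sample quantiles as partition boundaries so that each server gets a similar share of the work. The sketch below illustrates that reading only: the function name, the quantile rule and the use of a plain in-memory list are assumptions, not the Graywulf implementation. In the real system the WHERE clause and REGION would be applied before sampling.

```python
import random

def ra_partition_bounds(ra_values, n_partitions, seed=0):
    """Return interior RA boundaries splitting the data into n_partitions
    slices of roughly equal object count, estimated from a small sample.

    ra_values must be a non-empty list of right ascensions in degrees.
    """
    n = len(ra_values)
    # Sample size of about the cube root of N, but never smaller than
    # the number of partitions and never larger than the data itself.
    sample_size = min(n, max(n_partitions, round(n ** (1.0 / 3.0))))
    rng = random.Random(seed)
    sample = sorted(rng.sample(ra_values, sample_size))
    # Evenly spaced sample quantiles become the partition boundaries.
    return [sample[(i * sample_size) // n_partitions]
            for i in range(1, n_partitions)]

# Hypothetical catalog of 27,000 RA values spread evenly over the sky.
ras = [i * 360.0 / 27000 for i in range(27000)]
print(ra_partition_bounds(ras, 4))
```

Because the boundaries come from the data rather than from a fixed grid, a dense region of the sky does not overload a single partition.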

Zone algorithm. Find close matches in catalog B of every object in catalog A. Distance calculation is expensive, so filter out distant objects using a cheap algorithm. Stripe the sphere by Dec into zones; stripe height is slightly bigger than the typical match radius. Divide stripes by RA into cells. Filter by cells first; calculate distances within cells only. nN ≪ N², because n ≪ N.
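The steps above can be sketched in a few lines of Python. This is a simplified in-memory variant, not the SQL implementation used in SkyQuery. Catalog B is bucketed by Dec zone and sorted by RA within each zone, and the RA cell is replaced by a sliding RA window found by binary search. The function names and the 1% zone-height margin are assumptions made for illustration.

```python
import bisect
import math
from collections import defaultdict

def unit_vector(ra_deg, dec_deg):
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def zone_crossmatch(cat_a, cat_b, radius_deg):
    """Return pairs (i, j) where cat_a[i] lies within radius_deg of cat_b[j].

    Catalogs are lists of (ra_deg, dec_deg) tuples.
    """
    zone_height = 1.01 * radius_deg  # slightly bigger than the match radius

    def zone_of(dec_deg):
        return int((dec_deg + 90.0) // zone_height)

    # Bucket catalog B by zone; keep each bucket sorted by RA.
    zones = defaultdict(list)
    for j, (ra, dec) in enumerate(cat_b):
        zones[zone_of(dec)].append((ra % 360.0, j))
    for bucket in zones.values():
        bucket.sort()

    sin_r = math.sin(math.radians(radius_deg))
    max_chord = 2.0 * math.sin(math.radians(radius_deg) / 2.0)
    matches = []
    for i, (ra, dec) in enumerate(cat_a):
        ra %= 360.0
        # Conservative RA half-width: it widens towards the poles.
        cos_d = math.cos(math.radians(min(90.0, abs(dec) + radius_deg)))
        half = 180.0 if sin_r >= cos_d else math.degrees(math.asin(sin_r / cos_d))
        lo, hi = ra - half, ra + half
        if half >= 180.0:
            windows = [(0.0, 360.0)]
        elif lo < 0.0:                       # window wraps below RA = 0
            windows = [(0.0, hi), (lo + 360.0, 360.0)]
        elif hi >= 360.0:                    # window wraps above RA = 360
            windows = [(lo, 360.0), (0.0, hi - 360.0)]
        else:
            windows = [(lo, hi)]
        va = unit_vector(ra, dec)
        z = zone_of(dec)
        # A match can only sit in the same zone or a neighboring one.
        for zz in (z - 1, z, z + 1):
            bucket = zones.get(zz)
            if not bucket:
                continue
            for w_lo, w_hi in windows:
                start = bisect.bisect_left(bucket, (w_lo, -1))
                stop = bisect.bisect_right(bucket, (w_hi, len(cat_b)))
                # Exact (expensive) distance test on the few survivors only.
                for _, j in bucket[start:stop]:
                    if math.dist(va, unit_vector(*cat_b[j])) <= max_chord:
                        matches.append((i, j))
    return matches
```

The expensive distance test runs only on the n candidates that survive the zone and RA filters, so the total cost is on the order of nN rather than N².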

New generation SkyQuery

Graywulf software stack: generic infrastructure based on MSSQL; distributed queries for scientific data; import, export, scheduling, web access; SQL parser and name resolver; schema browser, extended metadata.

SkyQuery features (ongoing work). Fetch data from remote sources: protocol plug-ins; pre-filtering using WHERE clause; works with MySQL and Postgres, no VO TAP yet. VO compliance via plug-ins: keep the system generic and domain independent. Region-based filtering. Drop-out detection.

SkyQuery (ongoing work). Storage optimizations: access to data columns is not equally distributed; move less-accessed data to big but slow storage; lazy joins. Full SQL support: currently limited to single SELECT queries; add support for full T-SQL scripting.

This project was carried out within the framework of OTKA project 103244. http://voservices.net/skyquery dobos@complex.elte.hu