1 Database Management Systems: part of the solution or part of the problem? Clive Page 2004 April 28.

Presentation transcript:

1 Database Management Systems: part of the solution or part of the problem? Clive Page 2004 April 28

2 Cone search
Easy to provide cone-search using either:
–Spatial index, e.g. R-tree, available with most DBMS
–Pixel-code method, e.g. HTM or HEALPix.
Problems:
–Scalability: search for Brown Dwarfs in Hyades had initial SELECT returning 8 million rows.
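As a rough illustration of what a cone search computes, the sketch below filters an in-memory list of sources by great-circle distance from a centre. It is not how a DBMS with an R-tree or HTM index would do it: the declination cut merely stands in for the index, and the function names, the tuple layout and the sample coordinates are all made up for this example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(rows, ra0, dec0, radius_deg):
    """Return the rows (ra, dec, label) lying within radius_deg of (ra0, dec0)."""
    hits = []
    for row in rows:
        ra, dec = row[0], row[1]
        # Cheap declination cut first; a spatial index would play this role in a DBMS.
        if abs(dec - dec0) > radius_deg:
            continue
        if angular_separation_deg(ra0, dec0, ra, dec) <= radius_deg:
            hits.append(row)
    return hits

# Hypothetical sources: (ra, dec, label), all in degrees.
catalogue = [(66.0, 16.0, "a"), (66.5, 16.2, "b"), (80.0, 16.0, "c"), (66.0, 30.0, "d")]
print(cone_search(catalogue, 66.0, 16.0, 1.0))
```

The scalability problem noted on the slide shows up here as the length of the returned list: a wide cone on a large catalogue returns most of the work to the caller.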

3 Distributed Cone Search
Several “Standard” protocols for cone-search over web:
–CGI-based: GLU from CDS: US-NVO protocol:
–XML-based: ADQL has function called REGION
Results hard to combine or merge because of lack of compatible metadata – UCDs not yet widespread.

4 Cross-matching Catalogues
Easy with true spatial indexing, e.g. R-trees, but join syntax is very DBMS-specific. Can write a range of back-ends.
Harder if only pixel-code method (e.g. HTM) in use:
–Algorithm more complex
–User needs to have privilege to CREATE INDEX
–Slower and less scalable than join with spatial index.
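To make the cross-match operation concrete, here is a minimal sketch that pairs sources from two small catalogues. It uses neither an R-tree nor HTM: as a swapped-in simplification, one catalogue is sorted by declination and a binary search narrows the candidates before the exact distance test. The names `cross_match` and `separation_deg`, the tuple layout and the sample data are assumptions made for this example.

```python
import bisect
import math

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(cat_a, cat_b, radius_deg):
    """Pair each (id, ra, dec) in cat_a with every cat_b source within radius_deg."""
    b_sorted = sorted(cat_b, key=lambda s: s[2])   # sort once by declination
    b_decs = [s[2] for s in b_sorted]
    pairs = []
    for id_a, ra_a, dec_a in cat_a:
        # Only sources in the declination strip around dec_a can match.
        lo = bisect.bisect_left(b_decs, dec_a - radius_deg)
        hi = bisect.bisect_right(b_decs, dec_a + radius_deg)
        for id_b, ra_b, dec_b in b_sorted[lo:hi]:
            if separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                pairs.append((id_a, id_b))
    return pairs

# Hypothetical catalogues: (id, ra, dec) in degrees.
cat_a = [("a1", 10.0, 20.0), ("a2", 50.0, -5.0)]
cat_b = [("b1", 10.0002, 20.0001), ("b2", 10.0, 21.0), ("b3", 200.0, -5.0)]
matches = cross_match(cat_a, cat_b, 1.0 / 3600.0)   # 1 arcsecond radius
print(matches)
```

A fixed match radius is used here for brevity; a per-source error box or a probability criterion would replace the final distance test.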

5 Cross-match cases
Cross-match of user’s own table with standard catalogues
–Needs table upload: formats? FITS, VOTable, TSV…
–Need to generate error-box columns, e.g. using
    ALTER TABLE t ADD COLUMN (errbox box);
    UPDATE t SET errbox = box(whatever);
Cross-match of two standard catalogues, or sample from one catalogue cross-matched with another
–Easy if stored within same DBMS
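The error-box step can be tried out with SQLite, which has no `box` type, so this sketch substitutes four plain columns for the single `errbox` column on the slide. The column names, the `err_arcsec` input column and the registered helper `cosd` are assumptions for this example, and RA wrap-around at 0/360 and the poles are ignored.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite builds do not always include trigonometric functions, so register one.
conn.create_function("cosd", 1, lambda deg: math.cos(math.radians(deg)))

conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, ra REAL, decl REAL, err_arcsec REAL)")
conn.executemany(
    "INSERT INTO t (ra, decl, err_arcsec) VALUES (?, ?, ?)",
    [(10.0, 0.0, 3.6), (120.0, 60.0, 3.6)],   # hypothetical uploaded sources
)

# Equivalent of ALTER TABLE ... ADD COLUMN (errbox box): four bounds instead of one box.
for col in ("ra_min", "ra_max", "dec_min", "dec_max"):
    conn.execute(f"ALTER TABLE t ADD COLUMN {col} REAL")

# Equivalent of UPDATE t SET errbox = box(...): widen RA by 1/cos(dec).
conn.execute("""
    UPDATE t SET
        ra_min  = ra   - (err_arcsec / 3600.0) / cosd(decl),
        ra_max  = ra   + (err_arcsec / 3600.0) / cosd(decl),
        dec_min = decl - (err_arcsec / 3600.0),
        dec_max = decl + (err_arcsec / 3600.0)
""")

boxes = conn.execute(
    "SELECT id, ra_min, ra_max, dec_min, dec_max FROM t ORDER BY id"
).fetchall()
print(boxes)
```

Note that both statements modify the table, which is why a user who may only run SELECT cannot carry out this step.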

6 Distributed Cross-match
Often need all or most of information in the smaller table to be included in the results – easier to copy table as a unit.
DBMS join algorithms need a lot of network activity, so they do not perform well over the wide-area network
–Tests using Postgres+dblink show speeds about 7 times lower than when both tables are in the same DBMS
Could be done with a series of cone-searches plus merger of the results. Not yet tested, but also bound to be slow.

7 Analysis of Brown Dwarf Search in Hyades Cluster
Most naturally done incrementally, e.g.
1. Select stars in the right patch of sky (cone-search) from USNO-B
2. Select from them stars with proper-motion vector in the right range
3. Compute projected proper-motion
4. Cross-match with 2MASS
5. Select stars with appropriate redness.
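The incremental style can be sketched with SQLite by materialising each step as its own table and selecting from the previous one. Only steps 1 and 2 are shown. The table names `step1` and `step2`, the rectangular sky cut standing in for a real cone search, and all the sample rows are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usnob (id INTEGER PRIMARY KEY, ra REAL, decl REAL, pmra REAL, pmdec REAL)"
)
conn.executemany(
    "INSERT INTO usnob VALUES (?, ?, ?, ?, ?)",
    [   # hypothetical rows, not real USNO-B data
        (1, 66.5, 16.0, -100.0, 30.0),
        (2, 67.0, 15.5, -5.0, 2.0),
        (3, 150.0, 16.0, -100.0, 30.0),
        (4, 66.0, 17.0, -110.0, 25.0),
    ],
)

# Step 1: patch of sky (a simple box here instead of a true cone search).
conn.execute("""
    CREATE TABLE step1 AS
    SELECT * FROM usnob
    WHERE ra BETWEEN 61.0 AND 72.0 AND decl BETWEEN 11.0 AND 21.0
""")

# Step 2: keep only stars whose proper-motion components fall in range.
conn.execute("""
    CREATE TABLE step2 AS
    SELECT * FROM step1
    WHERE pmra BETWEEN -150 AND -50 AND pmdec BETWEEN 0 AND 100
""")

counts = [
    conn.execute(f"SELECT count(*) FROM {name}").fetchone()[0]
    for name in ("usnob", "step1", "step2")
]
survivors = [row[0] for row in conn.execute("SELECT id FROM step2 ORDER BY id")]
print(counts, survivors)
```

Each intermediate table can be inspected before the next step is written, which is the main attraction over a single combined query.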

8 SQL for steps (1) and (2)
SELECT * FROM usnob
WHERE REGION('Circle J ')
  AND pmra BETWEEN -150 AND -50
  AND pmdec BETWEEN 0 AND 100
  AND acos(
        (sin(radians(7.6)) - sin(radians(decl)) *
           cos(sin(radians(decl)) * sin(radians(7.6)) +
               cos(radians(decl)) * cos(radians(7.6)) * cos(radians(94.2-ra))))
        /
        (cos(radians(decl)) *
           sin(acos(sin(radians(decl)) * sin(radians(7.6)) +
                    cos(radians(decl)) * cos(radians(7.6)) * cos(radians(94.2-ra)))))
      )
      BETWEEN atan2(pmra,pmdec) * sqrt(ra_err*ra_err + dec_err*dec_err) / sqrt(pmra*pmra + pmdec*pmdec)
          AND atan2(pmra,pmdec) * sqrt(ra_err*ra_err + dec_err*dec_err) / sqrt(pmra*pmra + pmdec*pmdec);
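The long acos(...) term in the query has the shape of the standard spherical-trigonometry relation for the position angle of one point as seen from another, cos(PA) = (sin(dec0) - sin(dec) cos(d)) / (cos(dec) sin(d)), where d is the angular distance between the two points. Reading it that way is an interpretation, not something the slide states. Under that assumption, the sketch below computes the same quantity in a small function, which is easier to check than the inline SQL. The function name is made up, and 94.2 and 7.6 are simply the constants that appear in the query.

```python
import math

def position_angle_deg(ra, dec, ra0, dec0):
    """Angle (0..180 degrees, measured from north) at (ra, dec) towards (ra0, dec0).

    acos cannot distinguish east from west, so the result is unsigned.
    """
    ra, dec, ra0, dec0 = map(math.radians, (ra, dec, ra0, dec0))
    # Cosine and sine of the angular distance d between the two points.
    cos_d = (math.sin(dec) * math.sin(dec0)
             + math.cos(dec) * math.cos(dec0) * math.cos(ra0 - ra))
    sin_d = math.sqrt(max(0.0, 1.0 - cos_d * cos_d))
    denominator = math.cos(dec) * sin_d
    if denominator == 0.0:
        raise ValueError("position angle undefined (coincident points or pole)")
    cos_pa = (math.sin(dec0) - math.sin(dec) * cos_d) / denominator
    # Clamp against rounding error before taking the arc cosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_pa))))

# Reference point taken from the constants in the query: RA 94.2, Dec 7.6 degrees.
print(position_angle_deg(94.2, 0.0, 94.2, 7.6))    # reference point due north
print(position_angle_deg(4.2, 0.0, 94.2, 0.0))     # reference point due east on the equator
print(position_angle_deg(94.2, 10.0, 94.2, 7.6))   # reference point due south
```

If the DBMS allows user-defined functions, wrapping the calculation this way would shrink the WHERE clause to a single call.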

9 User Facilities Needed
SELECT INTO new-table (within DBMS)
UPDATE table SET column = expression
CREATE and DROP table
Probably: CREATE and DROP index
Separate namespace for each user
–Can be done with Schemas of DB2 and Postgres
Export of tables to MySpace or user’s computer in a suitable format
Do we want some query builder to aggregate simple selections into monster SQL?
Problems:
–ADQL only supports SELECT so far
–But JDBC (probably) supports all these statements
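The statements listed above can be exercised end to end in SQLite. SQLite has no per-user schemas of the DB2 or Postgres kind, so this sketch substitutes an attached database as the private namespace, and `CREATE TABLE ... AS SELECT` plays the part of SELECT INTO. The namespace name `user_alice`, the table and column names and the data are hypothetical.

```python
import sqlite3

# Autocommit mode, so ATTACH is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS user_alice")   # stand-in for a user schema

conn.execute("CREATE TABLE catalogue (id INTEGER PRIMARY KEY, bmag REAL, vmag REAL)")
conn.executemany(
    "INSERT INTO catalogue VALUES (?, ?, ?)",
    [(1, 15.0, 13.0), (2, 12.0, 11.8), (3, 18.5, 16.5)],
)

# SELECT INTO new-table: the result lands in the user's own namespace.
conn.execute("""
    CREATE TABLE user_alice.red AS
    SELECT id, bmag, vmag FROM catalogue WHERE bmag - vmag > 1.5
""")

# UPDATE table SET column = expression, on a newly added column.
conn.execute("ALTER TABLE user_alice.red ADD COLUMN colour REAL")
conn.execute("UPDATE user_alice.red SET colour = bmag - vmag")

# CREATE index, use the table, then DROP index and table again.
conn.execute("CREATE INDEX user_alice.red_colour ON red (colour)")
red_rows = conn.execute("SELECT id, colour FROM user_alice.red ORDER BY id").fetchall()
conn.execute("DROP INDEX user_alice.red_colour")
conn.execute("DROP TABLE user_alice.red")

leftover = conn.execute("SELECT name FROM user_alice.sqlite_master").fetchall()
print(red_rows, leftover)
```

Nothing in the shared `catalogue` table is modified, which is the property a multi-user service would need from the namespace.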

10 Non-positional Queries
Simple example:
–SELECT * FROM table WHERE (bmag - vmag) > 1.5;
Fast only if index exists on (bmag - vmag)
–Infeasible in tables with many parameter combinations
A great many such queries will need a scan of the whole table.
For tables the size of USNO-B or 2MASS this takes around 30–60 minutes even on a fast DBMS.
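The effect of an index on the expression itself can be observed in SQLite, which supports indexes on expressions. The sketch below prints the query plan before and after creating such an index. The table name `stars`, the index name and the synthetic magnitudes are assumptions, and the exact wording of the plan varies between SQLite versions, so it is printed and not interpreted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (id INTEGER PRIMARY KEY, bmag REAL, vmag REAL)")
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?)",
    # Synthetic data: colour index cycles through 0.0, 0.5, ..., 3.0.
    [(i, 10.0 + (i % 7) * 0.5, 10.0) for i in range(1000)],
)

query = "SELECT count(*) FROM stars WHERE (bmag - vmag) > 1.5"

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(query)                     # expected to be a full-table scan
conn.execute("CREATE INDEX stars_bv ON stars (bmag - vmag)")
plan_after = plan(query)                      # may now mention the index stars_bv

n_red = conn.execute(query).fetchone()[0]
print(plan_before, plan_after, n_red)
```

One such index is needed per expression, which is why the approach does not extend to the many parameter combinations found in a large catalogue.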

11 Speeding up scanning queries
Two obvious ways of storing data in a table:
–Row-based – makes transactions easy: used by all RDBMS
–Column-based – better for read-only tables and queries involving only a few columns out of many: used by hardly any packages.
Store data in binary file not in DBMS
–Do both: reduces time to scan whole of one column of USNO-B by factor of 80, e.g. from an hour to under one minute.
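The difference between the two layouts can be shown with plain binary buffers. In this sketch the row store packs every source as one fixed-length record, while the column store keeps one homogeneous array per column, so a query on a single column touches far fewer bytes. The record layout, the column names and the synthetic values are assumptions, and this is not the file format used for the timings quoted on the slide.

```python
import array
import struct

# Synthetic table: (id, ra, dec, mag).
rows = [(i, 10.0 + i * 0.001, 20.0 - i * 0.001, 15.0 + (i % 50) * 0.1) for i in range(10000)]

# Row-based: one fixed-length binary record per source (int32 + 3 doubles, no padding).
ROW = struct.Struct("<i3d")
row_store = b"".join(ROW.pack(*r) for r in rows)

# Column-based: one contiguous binary array per column.
col_store = {
    "id": array.array("i", (r[0] for r in rows)),
    "ra": array.array("d", (r[1] for r in rows)),
    "dec": array.array("d", (r[2] for r in rows)),
    "mag": array.array("d", (r[3] for r in rows)),
}

def count_faint_rows(buf, limit):
    """Scan every record, although only the magnitude field is needed."""
    return sum(1 for rec in ROW.iter_unpack(buf) if rec[3] > limit)

def count_faint_cols(cols, limit):
    """Scan only the magnitude column."""
    return sum(1 for mag in cols["mag"] if mag > limit)

bytes_row_scan = len(row_store)
bytes_col_scan = len(col_store["mag"]) * col_store["mag"].itemsize
print(count_faint_rows(row_store, 19.0), count_faint_cols(col_store, 19.0))
print(bytes_row_scan, bytes_col_scan)
```

With four columns the saving in bytes read is modest; in a catalogue with many more columns per source the ratio grows accordingly.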

12 Sybase-IQ
Sybase-IQ uses column-based storage, efficient data formats, advanced indexing methods e.g. bit-mask indices, and supports the same SQL as regular Sybase-ASE. But…
–Not yet available on Linux, only Solaris
–Speed apparently only a few times better than RDBMS
–Has no spatial indexing
–List price £35,000/cpu, or AstroGrid site licence for “under £2M”.

13 Use of parallel hardware?
Most DBMS not designed to exploit simple PC clusters
ORACLE/RAC needs special “shared-everything” cluster hardware configuration.
Many DBMS support replication but only to improve resilience by fail-over to another node
–Transactions very hard to distribute across a cluster.
–Read-only databases not of much commercial interest.
It would be interesting to try do-it-yourself parallelism, e.g.
–Install DBMS on each node of a Beowulf cluster
–Load section of large catalogue on each node
–Gather and merge results from distributed queries on master node.
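The do-it-yourself scheme can be mocked up on one machine. In the sketch below each "node" is just a worker thread with its own in-memory SQLite database holding one section of the catalogue, and the "master" merges the sorted partial results. Threads stand in for cluster nodes only to show the data flow; the function names, the round-robin partitioning and the synthetic catalogue are assumptions.

```python
import heapq
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query_node(partition, mag_limit):
    """Stand-in for one cluster node: load its section, run the query locally."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cat (id INTEGER PRIMARY KEY, mag REAL)")
    conn.executemany("INSERT INTO cat VALUES (?, ?)", partition)
    rows = conn.execute(
        "SELECT mag, id FROM cat WHERE mag < ? ORDER BY mag, id", (mag_limit,)
    ).fetchall()
    conn.close()
    return rows

def distributed_query(catalogue, n_nodes, mag_limit):
    """Split the catalogue across nodes, query each, merge on the master."""
    partitions = [catalogue[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        per_node = list(pool.map(lambda part: query_node(part, mag_limit), partitions))
    # Each partial result is already sorted, so a k-way merge suffices.
    return list(heapq.merge(*per_node))

# Synthetic catalogue: (id, mag).
catalogue = [(i, (i * 37 % 100) / 10.0) for i in range(200)]
merged = distributed_query(catalogue, 4, 2.0)
print(len(merged), merged[:3])
```

This works because the query is a simple selection; a cross-match between two partitioned catalogues would also need both to be partitioned on the same sky regions.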

14 Other DBMS problems
RDBMS have very limited statistical functionality, no graphical output or facilities for visualisation.
–Can solve by exporting data to other packages, but this is awkward, slow, and loses metadata.

15 Other ADQL enhancements needed
Syntax for cross-match is not yet mature:
–Match criterion uses N “sigma”, should use probability.
–Cannot specify outer join (report unmatched sources).
Need to support physical units: SELECT FROM table WHERE properMotion > ;
–Have standard units for each UCD?
–Units to be specified with each constant?
Provide users with estimate of running-time, e.g. using commonly-provided EXPLAIN statement.
Need time-out on long-running queries.
Need exception-handling mechanism.
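A time-out combined with exception handling might look like the following sketch, which uses the progress-handler hook of Python's sqlite3 module to abort a statement once a deadline has passed. How a production DBMS or an ADQL service would expose this is not specified on the slide; the helper name `run_with_timeout` and the deliberately endless test query are assumptions for this example.

```python
import sqlite3
import time

def run_with_timeout(conn, sql, timeout_s):
    """Run sql, raising TimeoutError if it is still running after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    # The handler runs every 1000 virtual-machine steps; a non-zero return aborts the query.
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"query exceeded {timeout_s} s") from exc
        raise   # some other SQL error: pass it on unchanged
    finally:
        conn.set_progress_handler(None, 1000)   # remove the handler again

# A query that never finishes on its own: an unbounded recursive counter.
ENDLESS_QUERY = """
    WITH RECURSIVE counter(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM counter)
    SELECT count(*) FROM counter
"""

conn = sqlite3.connect(":memory:")
print(run_with_timeout(conn, "SELECT 1 + 1", 5.0))
try:
    run_with_timeout(conn, ENDLESS_QUERY, 0.1)
except TimeoutError as exc:
    print("aborted:", exc)
```

Turning the low-level error into a specific exception type gives the caller something to report to the user, which is the kind of mechanism the last two points on the slide ask for.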