Presentation transcript:

Promised Abstract
Can database technology help manage and mine scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex Szalay). We have had some success but still face many problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. That segment will end with a discussion of how the integration of SQL with the CLR (Common Language Runtime) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture of SkyQuery, a prototype portal to several archives, each of which is a web service. Urls:

Yukon Features For SkyServer Database
Jim Gray: Microsoft
Alex Szalay (and friends): Johns Hopkins
Help from: Cathan Cook (personal SkyServer), Maria A. Nieto-Santisteban (image cutout service)

SkyServer Overview (10 min)
10 minute SkyServer tour
  Pixel space
  Record space:
  Doc space: Ned
  Set space:
  Web & Query Logs
  Dr1 WebService
You can download (thanks to Cathan Cook)
  Data + Database code:
  Website:
Data Mining the SDSS SkyServer Database, MSR-TR
    select top 10 * from weblog..weblog where yy = 2003 and mm = 7 and dd = 25 order by seq desc
    select top 10 * from weblog..sqlLog order by theTime desc

Cutout Service (10 min) A typical web service Show it Show WSDL Show fixing a bug Rush through code. You can download it. Maria A. Nieto-Santisteban did most of this (Alex and I started it)

SkyQuery: Distributed Query tool using a set of web services
Fifteen astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)…
Feasibility study, built in 6 weeks
  Tanu Malik (JHU CS grad student)
  Tamas Budavari (JHU astro postdoc)
  With help from Szalay, Thakar, Gray
Implemented in C# and .NET
Allows queries like:
    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t) < 3.5
      AND AREA(181.3, -0.76, 6.5)
      AND o.type = 3 AND (o.I - t.m_j) > 2
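
The slide does not define XMATCH, so the following is only a rough sketch of what a positional cross-match between two catalogs involves. It substitutes a plain angular-separation threshold (assumed here to be in arcseconds) and a nested loop; the catalog rows and the helper names `angular_sep_arcsec` and `cross_match` are made up for illustration and are not part of SkyQuery.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation in arcseconds between two (ra, dec) positions given in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for very small separations.
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0

def cross_match(left, right, max_arcsec):
    """Nested-loop match: (left objId, right objId) pairs closer than max_arcsec."""
    return [(a["objId"], b["objId"])
            for a in left for b in right
            if angular_sep_arcsec(a["ra"], a["dec"], b["ra"], b["dec"]) < max_arcsec]

# Hypothetical rows standing in for two archives.
sdss = [{"objId": 1, "ra": 181.3000, "dec": -0.7600},
        {"objId": 2, "ra": 181.4000, "dec": -0.7000}]
twomass = [{"objId": 10, "ra": 181.3005, "dec": -0.7600},  # about 1.8 arcsec from object 1
           {"objId": 11, "ra": 185.0000, "dec": 2.0000}]

matches = cross_match(sdss, twomass, 3.5)
print(matches)
```

A real federation would push the match to the archives rather than loop over every pair; the nested loop is only there to make the predicate concrete.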

SkyQuery Structure
Each SkyNode publishes
  Schema Web Service
  Database Web Service
Portal
  Plans Query (2 phase)
  Integrates answers
  Is itself a web service
[Figure: SkyQuery Portal, Image Cutout service, and the SkyNodes 2MASS, INT, SDSS, FIRST]

Four Database Topics
Sparse tables: column vs row store
  tag and index tables
  pivot
Maplist (cross apply)
Bookmark bug
Object Relational has arrived.

Column Store Pyramid
Users see fat base tables (universal relation)
Define popular columns index: tag table, 10% ~ 100 columns
Make many skinny indices: 1% ~ 10 columns
Query optimizer picks right plan
Automate definition & use
Fast read, slow insert/update: data warehouse
Note: prior to Yukon, an index had a 16 column limit. A bane of my existence.
[Figure: pyramid of BASE table, TAG table, and INDICES; query classes labeled simple, typical, semi-join, fat query, obese query]

Examples
    create table base (id bigint, f1 int primary key, f2 int, …, f1000 int)
    create index tag on base (id) include (f1, …, f100)
    create index skinny on base (f2, … f17)
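
To see the tag-index idea in action, here is a small sketch using SQLite from Python rather than Yukon. SQLite has no `include` clause, so the "tag" index below simply lists the popular columns as key columns; the table shape, the column `pad`, and the helper `plan` are assumptions made for the demo. The point is only that a query touching nothing but indexed columns can be answered from the skinny index, while a query that needs a fat column must go back to the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (id INTEGER PRIMARY KEY, f1 INT, f2 INT, f3 INT, pad TEXT)")
conn.executemany(
    "INSERT INTO base VALUES (?, ?, ?, ?, ?)",
    [(i, i % 7, i % 11, i % 13, "x" * 200) for i in range(1000)],
)
# "Tag" index over the popular columns (key columns here, since SQLite lacks INCLUDE).
conn.execute("CREATE INDEX tag ON base (f1, f2)")

def plan(sql):
    """Return the optimizer's plan description for a query as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

skinny_plan = plan("SELECT f2 FROM base WHERE f1 = 3")   # answered from the index alone
fat_plan = plan("SELECT pad FROM base WHERE f1 = 3")     # index lookup, then base-table fetch
print(skinny_plan)
print(fat_plan)
```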

A Semi-Join Example
    create table fat (a int primary key, b int, c int, fat char(988))
    -- the loop variable name and the row count were lost from this transcript;
    -- @i is a placeholder name and […] marks the missing bound
    declare @i int; set @i = 0
    again: insert fat values (@i, cast(100*rand() as int), cast(100*rand() as int), ' ')
    set @i = @i + 1; if (@i < […]) goto again
    create index ab on fat(a,b)
    create index ac on fat(a,c)
    dbcc dropcleanbuffers with no_infomsgs
    select count(*) from fat with(index (0)) where c = b
    -- Table 'fat'. Scan 3, reads 137,230, CPU: 1.3 s, elapsed 31.1 s.
    dbcc dropcleanbuffers with no_infomsgs
    select count(*) from fat where b = c
    -- Table 'fat'. Scan 2, reads: 3,482, CPU 1.1 s, elapsed: 1.4 s.
[Figure: evaluating b=c by scanning the 1 GB table costs 137 K IO and 31 sec; joining the 8 MB indices ab and ac costs 3.4 K IO and 1.4 sec]
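
The trick in this slide is that the predicate b = c can be answered by joining the two skinny indexes on their shared key instead of reading the fat rows. A minimal pure-Python sketch of that idea follows; the row count of 10,000 and the function name `count_b_equals_c` are arbitrary choices for the demo, and sorted lists stand in for the B-tree indexes.

```python
import random

random.seed(0)
# Each "fat" row: key a, two small ints b and c, plus 988 bytes of padding.
rows = [(a, random.randrange(100), random.randrange(100), " " * 988)
        for a in range(10_000)]

# Skinny "indexes": sorted (a, b) and (a, c) pairs that leave the padding behind.
index_ab = sorted((a, b) for a, b, _c, _pad in rows)
index_ac = sorted((a, c) for a, _b, c, _pad in rows)

def count_b_equals_c(ab, ac):
    """Merge-join the two indexes on key a and count the entries where b == c."""
    i = j = hits = 0
    while i < len(ab) and j < len(ac):
        if ab[i][0] < ac[j][0]:
            i += 1
        elif ab[i][0] > ac[j][0]:
            j += 1
        else:
            hits += ab[i][1] == ac[j][1]
            i += 1
            j += 1
    return hits

via_indexes = count_b_equals_c(index_ab, index_ac)
via_scan = sum(1 for _a, b, c, _pad in rows if b == c)
print(via_indexes, via_scan)
```

Both routes give the same count; the index route reads two small integers per entry where the scan reads about a kilobyte per row, which is where the IO gap on the slide comes from.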

Moving From Rows to Columns: Pivot & UnPivot
What if the table is sparse?
LDAP has 7 mandatory and 1,000 optional attributes
Store row, col, value
    create table Features (object varchar, attribute varchar, value varchar,
                           primary key (object, attribute))
    select * from (features pivot value on attribute in (year, color)) as T
    where object = '4PNC450'
Features
    object   attribute  value
    4PNC450  year       […]
    4PNC450  color      white
    4PNC450  make       Ford
    4PNC450  model      Taurus
T
    object   year  color
    4PNC450  […]   white
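
For readers without a pivot operator to hand, the same reshaping can be sketched with conditional aggregation. The example below uses SQLite from Python; the year value "2000" is made up, since the value on the slide did not survive, and the `MAX(CASE …)` form is a common portable substitute rather than the syntax shown in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Features (
    object TEXT, attribute TEXT, value TEXT,
    PRIMARY KEY (object, attribute))""")
conn.executemany("INSERT INTO Features VALUES (?, ?, ?)", [
    ("4PNC450", "year", "2000"),    # hypothetical value; the slide's year was lost
    ("4PNC450", "color", "white"),
    ("4PNC450", "make", "Ford"),
    ("4PNC450", "model", "Taurus"),
])

# One output row per object; each chosen attribute becomes a column.
pivoted = conn.execute("""
    SELECT object,
           MAX(CASE WHEN attribute = 'year'  THEN value END) AS year,
           MAX(CASE WHEN attribute = 'color' THEN value END) AS color
    FROM Features
    WHERE object = ?
    GROUP BY object""", ("4PNC450",)).fetchall()
print(pivoted)
```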

Maplist Meets SQL – cross apply
Your table-valued function F(a,b,c) returns all objects related to a,b,c:
  spatial neighbors, sub-assemblies, members of a group, items in a folder,…
Apply this function to each row
Classic drill-down
Use outer apply if f() may be null
    select p.*, q.*
    from parent as p cross apply f(p.a, p.b, p.c) as q
    where p.type = 1
[Figure: parent rows p1, p2, …, pn, each paired with the rows of f(p1), f(p2), …, f(pn)]
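
The difference between cross apply and outer apply is easy to see when written as two small generators. This is a sketch in Python, not the engine's implementation; the folder data and the function `items_in` are invented for the example.

```python
def cross_apply(parents, f):
    """Inner form: emit (parent, child) for every child; parents with no children vanish."""
    for p in parents:
        for q in f(p):
            yield p, q

def outer_apply(parents, f):
    """Outer form: a parent with no children is kept once, paired with None."""
    for p in parents:
        children = list(f(p))
        if not children:
            yield p, None
        for q in children:
            yield p, q

# Hypothetical parent rows and a hypothetical table-valued function.
folders = [{"name": "docs", "type": 1},
           {"name": "empty", "type": 1},
           {"name": "tmp", "type": 2}]
contents = {"docs": ["a.txt", "b.txt"], "tmp": ["c.log"]}

def items_in(folder):
    return contents.get(folder["name"], [])

selected = [p for p in folders if p["type"] == 1]   # where p.type = 1
inner = [(p["name"], q) for p, q in cross_apply(selected, items_in)]
outer = [(p["name"], q) for p, q in outer_apply(selected, items_in)]
print(inner)
print(outer)
```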

The Bookmark Bug SQL is a non-procedural language. The compiler/optimizer picks the procedure based on statistics. If the stats are wrong or missing…. Bad things happen. Queries can run VERY slowly. Strategy 1: allow users to specify plan. Strategy 2: make the optimizer smarter (and accept hints from the user.)

An Example of the Problem
A query selects some fields of an index and of a huge table.
Bookmark plan: look in the index for a subset, then look up that subset in the fat table.
This is great if subset << table, terrible if subset ~ table.
If statistics are wrong, or if predicates are not independent, you get the wrong plan.
How to fix the statistics?
[Figure: Index pointing into Huge table]

A Fix: Let user ask for stats
Create Statistics on View(f1,..,fn)
Then the optimizer has the right data and picks the right plan.
Statistics on Views, C. Galindo-Legaria, M. Joshi, F. Waas, M. Wu, VLDB 2003
Q3: Select count(*) from Galaxy where r […]
  Bookmark: 34 M random IO, 520 minutes
  Create Statistics on Galaxy(objID)
  Scan: 5 M sequential IO, 18 minutes
Ultimately this should be automated, but for now it's a step in the right direction.
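
The two timings on this slide imply very different IO rates, and a back-of-the-envelope model shows why a wrong row estimate is so costly. The sketch below derives the rates from the slide's own numbers and assumes a deliberately crude cost model (one random IO per matching row for the bookmark plan, one sequential IO per page for the scan); the function names are illustrative only, and a real optimizer weighs far more than this.

```python
def minutes(io_count, io_per_second):
    """Elapsed minutes for a given number of IOs at a given rate."""
    return io_count / io_per_second / 60.0

# Numbers quoted on the slide.
bookmark_ios, bookmark_min = 34e6, 520
scan_ios, scan_min = 5e6, 18

random_rate = bookmark_ios / (bookmark_min * 60)   # roughly 1,090 random IO/s
seq_rate = scan_ios / (scan_min * 60)              # roughly 4,630 sequential IO/s

def cheaper_plan(matching_rows, table_pages):
    """Crude model: bookmark pays one random IO per row, scan one sequential IO per page."""
    bookmark = minutes(matching_rows, random_rate)
    scan = minutes(table_pages, seq_rate)
    return "bookmark" if bookmark < scan else "scan"

# Row count at which the bookmark plan takes as long as the 18-minute scan.
break_even_rows = scan_min * 60 * random_rate
print(round(random_rate), round(seq_rate), round(break_even_rows))
```

Under these assumptions the bookmark plan only wins below roughly 1.2 million matching rows, so an estimate that is off by an order of magnitude flips the right answer.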

Object Relational Has Arrived
VMs are moving inside the DB
Yukon includes the Common Language Runtime (Oracle & DB2 have similar mechanisms).
So C++, VB, C#, and Java are co-equal with Transact-SQL.
You can define classes and methods
  SQL will store the instances
  Access them via methods
You can put your analysis code INSIDE the database.
Minimizes data movement.
  You can't move petabytes to the client
  But we will soon have petabyte databases.
[Figure: data and code kept apart, versus data with code alongside it]

And..
Fully-async and synchronous (blocking) calls and multi-concurrent-result sets per connection (transaction)
Queues built in (service broker): fire-and-forget asynchronous processing
It listens to port 80 for SOAP calls: TP-lite is back, it's a web service
Notification service and data mining and olap and reporting and xml and xquery and....
But, back to OR.

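Some Background: Table valued functions
SQL operates on tables. If you can make tables, you can extend SQL.
This is the idea behind OLE/DB
    -- the parameter and table-variable names were lost from this transcript;
    -- @n and @t are placeholder names
    create function Evens(@n int) returns @t table (a int)
    begin
      while (@n > 0)
      begin
        if (@n % 2 = 0) insert @t values (@n)
        set @n = @n - 1
      end
      return
    end
    select * from Evens(10)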
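
A table-valued function is close in spirit to a generator that yields rows. As a rough analogue of `Evens(10)`, the sketch below yields one-column rows from Python and then hands them to SQLite so they can be queried like any other table; the temp-table step is an assumption of this demo, since SQLite cannot call the generator directly.

```python
import sqlite3

def evens(max_val):
    """Yield one-column rows (a,) for each even number from max_val down to 2."""
    while max_val > 0:
        if max_val % 2 == 0:
            yield (max_val,)
        max_val -= 1

rows = list(evens(10))

# Once the rows form a table, ordinary SQL applies to them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMP TABLE Evens (a INT)")
conn.executemany("INSERT INTO Evens VALUES (?)", evens(10))
total = conn.execute("SELECT SUM(a) FROM Evens").fetchone()[0]
print(rows, total)
```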

Using table Valued Functions For Spatial Search
Use function to return likely key ranges.
Use filter predicate to eliminate objects outside the query box.
    Select objID
    From Objects O join […](…, @radius) R
      on O.htmID between R.begin and R.end
    where abs(o.Lat […]) + abs(o.Lon […]) […]
Table valued function returns candidate ranges of some space-filling curve.
Filter discards false positives.
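
The range-then-filter pattern can be tried out on a toy scale. The sketch below is not HTM: it assumes a flat 100 x 100 plane cut into 10 x 10 cells numbered row by row, which plays the role of the space-filling-curve key, and it filters with a Euclidean distance test because the slide's own filter predicate is incomplete. The names `cell_id`, `cover`, and `search` are hypothetical.

```python
import random
import sqlite3

CELL = 10.0   # cell edge length on a 100 x 100 plane
GRID = 10     # cells per row

def cell_id(x, y):
    """Row-major cell number, a crude stand-in for a space-filling-curve key."""
    return int(y // CELL) * GRID + int(x // CELL)

def cover(x, y, radius):
    """Yield (lo, hi) cell-id ranges covering the bounding box of the query circle."""
    col_lo = max(0, int((x - radius) // CELL))
    col_hi = min(GRID - 1, int((x + radius) // CELL))
    row_lo = max(0, int((y - radius) // CELL))
    row_hi = min(GRID - 1, int((y + radius) // CELL))
    for row in range(row_lo, row_hi + 1):
        yield row * GRID + col_lo, row * GRID + col_hi

random.seed(1)
points = [(i, random.uniform(0, 100), random.uniform(0, 100)) for i in range(2000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Objects (objID INTEGER PRIMARY KEY, x REAL, y REAL, cell INT)")
conn.execute("CREATE INDEX obj_cell ON Objects (cell)")
conn.executemany("INSERT INTO Objects VALUES (?, ?, ?, ?)",
                 [(i, x, y, cell_id(x, y)) for i, x, y in points])
conn.execute("CREATE TEMP TABLE R (lo INT, hi INT)")

def search(x, y, radius):
    """Join candidate key ranges to the index, then filter out the false positives."""
    conn.execute("DELETE FROM R")
    conn.executemany("INSERT INTO R VALUES (?, ?)", cover(x, y, radius))
    sql = """SELECT O.objID
             FROM R JOIN Objects O ON O.cell BETWEEN R.lo AND R.hi
             WHERE (O.x - ?) * (O.x - ?) + (O.y - ?) * (O.y - ?) <= ? * ?
             ORDER BY O.objID"""
    return [row[0] for row in conn.execute(sql, (x, x, y, y, radius, radius))]

found = search(50.0, 50.0, 7.5)
brute = sorted(i for i, x, y in points
               if (x - 50.0) * (x - 50.0) + (y - 50.0) * (y - 50.0) <= 7.5 * 7.5)
print(len(found), found == brute)
```

The cover touches four cells here, so the index narrows 2,000 objects to a few dozen candidates before the exact distance test runs.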

The Pre CLR design
Transact SQL sp_HTM (20 lines)
469 lines of glue looking like:
    // Get Coordinates param datatype, and param length information of
    if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL)
        ErrorExit("srv_paraminfo failed...");
    // Is Coordinate param a character string
    if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR &&
        bType1 != SRVVARCHAR && bType1 != SRVCHAR)
        ErrorExit("Coordinate param should be a string.");
    // Is Coordinate param non-null
    if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1)
        ErrorExit("Coordinate param is null.");
    // Get pointer to Coordinate param
    pzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1);
    if (pzCoordinateSpec == NULL)
        ErrorExit("Coordinate param is null.");
    pzCoordinateSpec[cbActualLen1] = 0;
    // Get OutputVector datatype, and param length information
    if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL)
        ErrorExit("Failed to get type info on HTM Vector param...");
The HTM code body

The glue CLR design
Discard 450 lines of UGLY code
C# SQL sp_HTM (50 lines)
The HTM code body
    using System;
    using System.Data;
    using System.Data.SqlServer;
    using System.Data.SqlTypes;
    using System.Runtime.InteropServices;
    namespace HTM {
      public class HTM_wrapper {
        [DllImport("SQL_HTM.dll")]
        static extern unsafe void * xp_HTM_Cover_get (byte *str);
        public static unsafe void HTM_cover_RS(string input) {
          // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown)
          byte * inputBytes;
          byte * output;
          // invoke the HTM routine
          output = (byte *)xp_HTM_Cover_get(inputBytes);
          // Convert the array to a table
          SqlResultSet outputTable = SqlContext.GetReturnResultSet();
          if (output[0] == 'O') {                  // if Output is OK
            uint c = *(UInt32 *)(output + 4);      // cast results as dataset
            Int64 * r = (Int64 *)(output + 8);     // Int64 r[c-1,2]
            for (int i = 0; i < c; ++i) {
              SqlDataRecord newRecord = outputTable.CreateRecord();
              newRecord.SetSqlInt64(0, r[0]);
              newRecord.SetSqlInt64(1, r[1]);
              r++; r++;
              outputTable.Insert(newRecord);
            }
          }
          // return outputTable;
        }
      }
    }
Thanks!!! To Peter Kukol (who wrote this)

The Clean CLR design
Discard all glue code
Return array cast as table
    CREATE ASSEMBLY HTM_A FROM '\\localhost\HTM\HTM.dll'
    CREATE FUNCTION […] NVARCHAR(100) ) […] TABLE (
        HTM_ID_START BIGINT NOT NULL PRIMARY KEY,
        HTM_ID_END BIGINT NOT NULL
    ) AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover
Your/My code goes here:
    using System;
    using System.Data;
    using System.Data.Sql;
    using System.Data.SqlServer;
    using System.Data.SqlTypes;
    using System.Runtime.InteropServices;
    namespace HTM_NS {
      public class HTM_C {
        public static Int64[,2] HTM_cover(string input) {
          // invoke the HTM routine
          return (Int64[,2]) xp_HTM_Cover(input);
          // the actual HTM C# or C++ or Java or VB code goes here.
        }
      }
    }

Performance (Beta1) on a 2.2 GHz Xeon
  Call a Transact SQL function: 33 μs
  Call a C# function: 50 μs
  Table valued function: […] μs per row
  Table valued function: 1,580 μs + 42 μs per row
  Array (== table) valued function: 200 μs + 27 μs per row
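
The last two measurements have the shape "fixed cost plus per-row cost", so the gap between them depends on how many rows a call returns. A small worked calculation, using only the figures quoted on the slide and arbitrarily chosen row counts:

```python
def call_cost_us(rows, fixed_us, per_row_us):
    """Total microseconds for one call that returns `rows` rows."""
    return fixed_us + per_row_us * rows

TABLE_VALUED = (1580, 42)   # fixed μs, μs per row (from the slide)
ARRAY_VALUED = (200, 27)

# Row counts below are illustrative, not from the talk.
costs = {n: (call_cost_us(n, *TABLE_VALUED), call_cost_us(n, *ARRAY_VALUED))
         for n in (1, 1000)}
for n, (table_us, array_us) in costs.items():
    print(n, table_us, array_us, round(table_us / array_us, 2))
```

For a single row the fixed cost dominates and the array form is about seven times cheaper; at a thousand rows the per-row cost dominates and the ratio settles near 1.6.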

The Code
Function written in C# inside the DB:
    using System;
    using System.Data;
    using System.Data.SqlServer;
    using System.Data.SqlTypes;
    using System.Runtime.InteropServices;
    namespace ReturnOneNS {
      public class ReturnOneC {
        public static int ReturnOne_Int(int input) { return input; }
      }
    }
Program in DB in different language (Tsql) calling function:
    CREATE ASSEMBLY ReturnOneA FROM '\\localhost\C:\ReturnOne.dll'
    GO
    CREATE FUNCTION […] INT) RETURNS INT
      AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int
    GO
    -- time echo an integer
    […] float […] datetime
    […] = 0
    […] = […] = current_Timestamp
    […] > 0) begin […] end
    […] = current_Timestamp
    […] = datediff(ms, […] / 10.0
    […] = […] = current_Timestamp
    […] > 0) begin […] = […] end
    […] = current_Timestamp
    […] = datediff(ms, […] / 10.0
    print 'average cpu time for 1,000 calls to ReturnOne_Int was ' + […] + ' micro seconds'

What Is the Significance? No more inside/outside DB dichotomy. You can put your code near the data. Indeed, we are letting users put personal databases near the data archive. This avoids moving large datasets. Just move questions and answers.

Meta-Message Trying to fit science data into databases When it does not fit, something is wrong. Look for solutions Many solutions come from OR extensions Some are fundamental engine changes More structure in DB Richer operator sets Better statistics

© 2002 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
