Promised Abstract Can database technology help manage and mine scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex Szalay.) We have had some success but still face many problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. That segment will end with a discussion of how the integration of SQL with the CLR (common language run time) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture of SkyQuery a prototype portal to several archives each of which is a web service. Urls: Can database technology help manage and mine scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex Szalay.) We have had some success but still face many problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. That segment will end with a discussion of how the integration of SQL with the CLR (common language run time) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture of SkyQuery a prototype portal to several archives each of which is a web service. Urls:
Session Code Yukon Features For SkyServer Database Jim Gray: Microsoft Alex Szalay (and friends): Johns Hopkins Help from: Cathan Cook (personal SkyServer), Maria A. Nieto-Santisteban (image cutout service)
SkyServer Overview (10 min) 10 minute SkyServer tour Pixel space Record space: Doc space: Ned Set space: Web & Query Logs Dr1 WebService You can download (thanks to Cathan Cook ) Data + Database code: Website: Data Mining the SDSS SkyServer Database Data Mining the SDSS SkyServer Database MSR-TR Data Mining the SDSS SkyServer Database select top 10 * from weblog..weblog where yy = 2003 and mm=7 and dd =25 order by seq desc select top 10 * from weblog..sqlLog order by theTime Desc
Cutout Service (10 min) A typical web service Show it Show WSDL Show fixing a bug Rush through code. You can download it. Maria A. Nieto-Santisteban did most of this (Alex and I started it)
SkyQuery: Distributed Query tool using a set of web services Fifteen astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)… Feasibility study, built in 6 weeks Tanu Malik (JHU CS grad student) Tamas Budavari (JHU astro postdoc) With help from Szalay, Thakar, Gray Implemented in C# and.NET Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2
2MASS INT SDSS FIRST SkyQuery Portal Image Cutout SkyQuery Structure Each SkyNode publishes Schema Web Service Database Web Service Portal is Plans Query (2 phase) Integrates answers Is itself a web service
Four Database Topics Sparse tables: column vs row store tag and index tables pivot Maplist (cross apply) Bookmark bug Object Relational has arrived.
Column Store Pyramid Users see fat base tables (universal relation) Define popular columns index tag table 10% ~ 100 columns Make many skinny indices 1% ~ 10 columns Query optimizer picks right plan Automate definition & use Fast read, slow insert/update Data warehouse Note: prior to Yukon, index had 16 column limit. A bane of my existence. Simpl e Typical Semi- join Fat quer y Obese query BASE INDICIES TAG
Examples create table base ( id bigint, f1 int primary key, f2 int, …,f1000 int) create index tag on base (id) include (f1, …, f100) create index skinny on base(f2,…f17) Simpl e Typical Semi-join Fat quer y Obese query BASE INDICIES TAG
A Semi-Join Example create table fat(a int primary key, b int, c int, fat char (988)) int; = 0 again: insert fat cast(100*rand() as int), cast (100*rand() as int), ' ') + 1; if < ) goto again create index ab on fat(a,b) create index ac on fat(a,c) dbcc dropcleanbuffers with no_infomsgs select count(*) from fat with(index (0)) where c = b -- Table 'fat'. Scan 3, reads 137,230, CPU : 1.3 s, elapsed 31.1s. dbcc dropcleanbuffers with no_infomsgs select count(*) from fat where b=c -- Table 'fat'. Scan 2, reads: 3,482 CPU 1.1 s, elapsed: 1.4 s. 1GB 8MB b=c 3.4K IO 1.4 sec abac b=c 137 K IO 31 sec
Moving From Rows to Columns Pivot & UnPivot What if the table is sparse? LDAP has 7 mandatory and 1,000 optional attributes Store row, col, value create table Features (object varchar, attribute varchar, value varchar, primary key (object, attribute)) select * from (featurespivot value on attribute in (year, color) ) as T where object = 4PNC450 Features object attribute value 4PNC450 year PNC450 color white 4PNC450 make Ford 4PNC450 model Taurus T Object year color 4PNC white
Maplist Meets SQL – cross apply Your table-valued function F(a,b,c) returns all objects related to a,b,c. spatial neighbors, sub-assemblies, members of a group, items in a folder,… Apply this function to each row Classic drill-down use outer apply if f() may be null select p.*, q.* from parent as p cross apply f(p.a, p.b, p.c) as q where p.type = 1 p1 f(p1) p2 f(p2) pn f(pn)
The Bookmark Bug SQL is a non-procedural language. The compiler/optimizer picks the procedure based on statistics. If the stats are wrong or missing…. Bad things happen. Queries can run VERY slowly. Strategy 1: allow users to specify plan. Strategy 2: make the optimizer smarter (and accept hints from the user.)
An Example of the Problem A query selects some fields of an index and of huge table. Bookmark plan: look in index for a subset. Lookup subset in Fat table. This is great if subset << table. terrible if subset ~ table. If statistics are wrong, or if predicates not independent, you get the wrong plan. How to fix the statistics? Index Huge table
A Fix: Let user ask for stats Create Statistics on View(f1,..,fn) Then the optimizer has the right data Picks the right plan. Statistics on Views, C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003, Q3: Select count(*) from Galaxy where r Bookmark: 34 M random IO, 520 minutes Create Statistics on Galaxy(objID ) Scan: 5 M sequential IO 18 minutes Ultimately this should be automated, but for now,… its a step in the right direction.
Object Relational Has Arrived VMs are moving inside the DB Yukon includes Common Language Runtime (Oracle & DB2 have similar mechanisms). So, C++, VB, C# and Java are co-equal with TransactSQL. You can define classes and methods SQL will store the instances Access them via methods You can put your analysis code INSIDE the database. Minimizes data movement. You cant move petabytes to the client But we will soon have petabyte databases. data code data code +code
And.. Fully-async and synchronous (blocking) calls and multi-concurrent-result sets per connection (transaction) Queues built in (service broker): Fire-and forget asynchronous processing It listens to Port 80 for SOAP calls : TP-lite is back Its a web service Notification service and data mining and olap and reporting and xml and xquery and.... ) But, back to OR.
Some Background Table valued functions SQL operates on tables. If you can make tables, you can extend SQL This is the idea behind OLE/DB create function int) table (a int) begin while > 0) begin if % 2 = 0) -1 end return end select * from Evens(10) a
Using table Valued Functions For Spatial Search Use function to return likely key ranges. Use filter predicate to eliminate objects outside the query box. Select objID From Objects O @radius) R on O.htmID between R.begin and R.end where abs(o.Lat + abs(o.Lon Table valued function returns candidate ranges of some space-filling curve. Filter discards false positives.
The Pre CLR design Transact SQL sp_HTM (20 lines) 469 lines of glue looking like: // Get Coordinates param datatype, and param length information of if (srv_paraminfo(pSrvProc, 1, &bType1, &cbMaxLen1, &cbActualLen1, NULL, &fNull1) == FAIL) ErrorExit("srv_paraminfo failed..."); // Is Coordinate param a character string if (bType1 != SRVBIGVARCHAR && bType1 != SRVBIGCHAR && bType1 != SRVVARCHAR && bType1 != SRVCHAR) ErrorExit("Coordinate param should be a string."); // Is Coordinate param non-null if (fNull1 || cbActualLen1 < 1 || cbMaxLen1 <= cbActualLen1) ErrorExit("Coordinate param is null."); // Get pointer to Coordinate param pzCoordinateSpec = (char *) srv_paramdata (pSrvProc, 1); if (pzCoordinateSpec == NULL) ErrorExit("Coordinate param is null."); pzCoordinateSpec[cbActualLen1] = 0; // Get OutputVector datatype, and param length information if (srv_paraminfo(pSrvProc, 2, &bType2, &cbMaxLen2, &cbActualLen2, NULL, &fNull2) == FAIL) ErrorExit("Failed to get type info on HTM Vector param..."); The HTM code body
The glue CLR design Discard 450 lines of UGLY code The HTM code body C# SQL sp_HTM (50 lines) using System; using System.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM { public class HTM_wrapper { [DllImport("SQL_HTM.dll")] static extern unsafe void * xp_HTM_Cover_get (byte *str); public static unsafe void HTM_cover_RS(string input) { // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown) byte * input; byte * output; // invoke the HTM routine output = (byte *)xp_HTM_Cover_get(input); // Convert the array to a table SqlResultSet outputTable = SqlContext.GetReturnResultSet(); if (output[0] == 'O') {// if Output is OK uint c = *(UInt32 *)(s + 4); // cast results as dataset Int64 * r = ( Int64 *)(s + 8); // Int64 r[c-1,2] for (int i = 0; i < c; ++i) { SqlDataRecord newRecord = outputTable.CreateRecord(); newRecord.SetSqlInt64(0, r[0]); newRecord.SetSqlInt64(1, r[1]); r++;r++; outputTable.Insert(newRecord); }} // return outputTable; } } } Thanks!!! To Peter Kukol (who wrote this)
The Clean CLR design Discard all glue code return array cast as table CREATE ASSEMBLY HTM_A FROM '\\localhost\HTM\HTM.dll' CREATE FUNCTION NVARCHAR(100) ) TABLE ( HTM_ID_START BIGINT NOT NULL PRIMARY KEY, HTM_ID_END BIGINT NOT NULL ) AS EXTERNAL NAME HTM_A:HTM_NS.HTM_C::HTM_cover using System; using System.Data; using System.Data.Sql; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace HTM_NS { public class HTM_C { public static Int64[,2] HTM_cover(string input) { // invoke the HTM routine return (Int64[,2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here. } } } Your/My code goes here
Performance (Beta1) On a 2.2 Ghz Xeon Call a Transact SQL function33μs Call a C# function50μs Table valued function μs per row Table valued function 1,580 μs + per row 42 μs Array (== table) valued function 200 μs + per row 27 μs
CREATE ASSEMBLY ReturnOneA FROM '\\localhost\C:\ReturnOne.dll' GO CREATE FUNCTION INT) RETURNS INT AS EXTERNAL NAME ReturnOneA:ReturnOneNS.ReturnOneC::ReturnOne_Int GO time echo an integer float datetime = 0 = = current_Timestamp > 0) begin end = current_Timestamp = datediff(ms, / 10.0 = = current_Timestamp > 0) begin = end = current_Timestamp = datediff(ms, / 10.0 print 'average cpu time for 1,000 calls to ReturnOne_Int was ' + ' micro seconds' The Code using System; using System.Data; using System.Data.SqlServer; using System.Data.SqlTypes; using System.Runtime.InteropServices; namespace ReturnOneNS { public class ReturnOneC { public static int ReturnOne_Int(int input) { return input; } Function written in C# inside the DB Program in DB in different language (Tsql) calling function
What Is the Significance? No more inside/outside DB dichotomy. You can put your code near the data. Indeed, we are letting users put personal databases near the data archive. This avoids moving large datasets. Just move questions and answers.
Meta-Message Trying to fit science data into databases When it does not fit, something is wrong. Look for solutions Many solutions come from OR extensions Some are fundamental engine changes More structure in DB Richer operator sets Better statistics
© 2002 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
evaluations
© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.