Prototype Web Services Using SDSS DR1

Presentation transcript:

Prototype Web Services Using SDSS DR1
Alex Szalay, Tamas Budavari, Sam Carliles, Jim Gray, Vivek Haridas, Nolan Li, Tanu Malik, Maria Nieto-Santisteban, Wil O’Mullane, Ani Thakar

NVO: How Will It Work?
- Define commonly used ‘core’ services
- Build higher-level toolboxes/portals on top
- We do not build ‘everything for everybody’
- Use the 90-10 rule:
  - Define the standards and interfaces
  - Build the framework
  - Build the 10% of services that are used by 90%
  - Let the users build the rest from the components

Using SDSS DR1
- SDSS DR1 (Data Release 1) is now publicly available: http://skyserver.pha.jhu.edu/dr1/
- About 1TB of catalog data
- Using MS SQL Server 2000
- Complex schema (72 tables)
- About 80 million photometric objects
- Two versions (TARGET/BEST)
- Automated documentation
- Raw data at FNAL file server with URL access

Loading DR1
- Automated table-driven workflow system for loading
  - Included lots of verification code
  - Over 16K lines of SQL code
- Loading process was extremely painful
  - Lack of systems engineering for the pipelines
  - Poor testing (lots of foreign key mismatches)
  - Detected data bugs even a month ago
  - Most of the time spent on scrubbing data
  - Fixing corrupted files (RAID5 disk errors)
- Once data was clean, everything loaded in 3 days
- Neighbors calculation took about 10 hours
- Reorganization of data took about 1 week of experiments in partitioning/layouts

Reorganization
- Introduced partitions and filegroups
  - Photo, Tag, Neighbors, Spectro, Frame, Other, Profiles
- Keep partitions under 100GB
- Vertical partitioning: tried and abandoned
- Both partitioning and index build now table driven
  - Stored procedures to create/drop indices at various granularities
- Tremendous improvement in performance when doing this on a large memory machine (24GB)
- Also much better performance afterwards

Spatial Features
- Precomputed Neighbors
  - All objects within 30”
- Boundaries, Masks and Outlines
  - Stored as spatial polygons
- Time Domain: Precomputed Match
  - All objects within 1”, observed at different times
  - Found duplicates due to telescope tracking errors
  - Manual fix, recorded in the database
- MatchHead
  - The first observation in the linked list, used as the unique id for the chain of observations of the same object

Spatial Algorithms
- Updated HTM library
  - Automated depth for HTM_Cover
  - Output vertices
  - Simplify polygon
  - Boolean operations on regions
  - Part of VO data model (A. Rots)
- Zones
  - Much better performance for bulk neighbors at a fixed radius
- Footprint service in progress
  - Bool Contains(point)
  - Region Intersect(region)

Web Services in Progress
- Registry
  - Harvesting and querying
- Data Delivery
  - Query driven
  - Queue management
- Graphics and visualization
  - Query driven vs interactive
  - Show spatial objects (Chart/Navi/List)
- Footprint/intersect
  - It is a “fractal”
- Cross-matching
  - SkyQuery and SkyNode
  - Ferris-wheel
  - Distributed vs parallel

Registry: Easy Clients
- Just use a SOAP toolkit (T. McGlynn & J. Lee have done a Perl client)
- Easy in Java:

    java org.apache.axis.wsdl.WSDL2Java "http://skyservice.pha.jhu.edu/devel/registry/registry.asmx?wsdl"

  - Gives a set of classes for accessing the service
  - Gives classes for the XML which is returned (i.e. SimpleResource)
- Still need to write a client, like:

    RegistryLocator loc = new RegistryLocator();
    RegistrySoap reg = loc.getRegistrySoap();
    ArrayOfSimpleResource reses = null;
    reses = reg.queryRegistry(args[0]);

- http://skyservice.pha.jhu.edu/devel/registry/index.aspx

Generic Catalog Access
- After 2 years of SDSS EDR and 6 months of DR1 usage, access patterns start to emerge
  - Lots of small users, requiring instant response
  - 1/f distribution of request sizes (tail of the lognormal)
- How to make everybody happy?
  - No clear business model…
- We need separate interactive and batch servers
- We also need access to full SQL with extensions
- Users want to access services via browsers
- Other services will need SOAP access

Data Formats
- Different data formats requested: HTML, CSV, FITS binary, VOTABLE, XML, graphics
- Quick browsing and exploration
  - Small requests, need to be nicely rendered
  - Needs good random access performance
  - Also simple 2D scatter plots or density plots required
- Heavy duty statistical use
  - Aggregate functions on complex joins, lots of scans but small output, mostly want CSV
- Successive Data Filter
  - Multi-step non-indexed filtering of the whole database, mostly want FITS binary

Data Delivery
- Small requests (<100MB)
  - Putting data on the stream
- Medium requests (<1GB)
  - Use DIME attachments to SOAP messages
- Large requests (>1GB)
  - Save data in scratch area and use asynch delivery
  - Only practical for large/long queries
- Iterative requests
  - Save data in temp tables in user space
  - Let user manipulate via web browser
- Paradox: if we use a web browser to submit, users want an immediate response from batch-size queries
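The size bands above can be sketched as a small routing function. This is only an illustration: the class name `DeliveryPlanner`, the enum values, and the choice to send a result of exactly 1GB to the asynchronous path are assumptions, not part of the services described in the slides.

```java
// Hypothetical sketch: choose a delivery mode from an estimated result size.
// Only the size bands (<100MB, <1GB, >1GB) come from the slide; all names are illustrative.
final class DeliveryPlanner {
    enum Mode { STREAM, DIME_ATTACHMENT, ASYNC_SCRATCH }

    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    static Mode choose(long estimatedBytes) {
        if (estimatedBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (estimatedBytes < 100 * MB) {
            return Mode.STREAM;            // small: put the data on the response stream
        }
        if (estimatedBytes < GB) {
            return Mode.DIME_ATTACHMENT;   // medium: attach to the SOAP message
        }
        return Mode.ASYNC_SCRATCH;         // large: write to scratch, deliver asynchronously
    }
}
```

In practice the size is not known before the query runs, so a planner like this would have to work from an estimate, which is one face of the paradox noted on the slide.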

How To Provide a UserDB
- Goal: through several search/filter operations reduce data transfer to manageable sizes (1-100MB)
- Today: people download tens of millions of rows, and then do their next filtering on the client side, using F77
- Could be much better done in the database
- But: users need to create/manage temporary tables
  - DoS attacks, fragmentation, who pays for it
  - Security, who can see my data (group access)?
  - Follow progress of long jobs
  - Who does the cleanup?

Query Management Service
- Enable fast, anonymous access to small requests
- Enable large queries, with ability to manage
- Enable creation of temporary tables in user space
- Create multiple ways to get query output
- Needs to support multiple mirrors/load balancing
- Do all this without logging in to Windows
- Also need support for machine clients
- Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/
- Two request categories:
  - Quick
  - Batch

Queue Management
- Need to register batch ‘power users’
- Query output goes to ‘MyDB’
- Can be joined with source database
- Results are materialized from MyDB upon request
- Users can do:
  - Insert, Drop, Create, Select Into, Functions, Procedures
  - Publish their tables to a group area
- Data delivery via the CASService (C# WS)
  http://skyservice.pha.jhu.edu/devel/CasService/CasService.asmx

Graphics Tools
- Simple xy plots
  http://skyservice.pha.jhu.edu/nli/wplot/
- Density plot
  http://skyservice.pha.jhu.edu/devel/DensityMap/AllSkyView.aspx
  http://skyservice.pha.jhu.edu/devel/DensityMap/PlotQuery.aspx
- Chart/Navi/List
  http://skyservice.pha.jhu.edu/dr1/imgcutout/getjpeg.asmx
- Can be built into various applications

Archive Footprint
- Footprint is a ‘fractal’
  - Result depends on context: all sky, degree scale, pixel scale
- Translate to web services
  - Footprint(): returns a single region that contains the archive
  - Intersection(region, tolerance): takes a region and returns its intersection with the archive footprint
  - Contains(point): returns yes/no (maybe fuzzy) depending on whether the point is inside the archive footprint
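A minimal sketch of how the three operations might look as a Java interface. Everything here is an assumption made for illustration: the type names (`SkyBox`, `FootprintService`, `BoxFootprint`), the use of a flat (ra, dec) box instead of a spherical polygon, and the reading of `tolerance` as a margin in degrees by which the footprint is grown (the slides do not define it).

```java
import java.util.Optional;

/** Toy region: an axis-aligned box in (ra, dec) degrees. No RA wraparound, no spherical geometry. */
final class SkyBox {
    final double raMin, raMax, decMin, decMax;

    SkyBox(double raMin, double raMax, double decMin, double decMax) {
        if (raMin > raMax || decMin > decMax) {
            throw new IllegalArgumentException("empty box");
        }
        this.raMin = raMin; this.raMax = raMax;
        this.decMin = decMin; this.decMax = decMax;
    }

    boolean contains(double ra, double dec) {
        return ra >= raMin && ra <= raMax && dec >= decMin && dec <= decMax;
    }

    SkyBox grow(double margin) {
        return new SkyBox(raMin - margin, raMax + margin, decMin - margin, decMax + margin);
    }

    Optional<SkyBox> intersect(SkyBox o) {
        double a = Math.max(raMin, o.raMin), b = Math.min(raMax, o.raMax);
        double c = Math.max(decMin, o.decMin), d = Math.min(decMax, o.decMax);
        return (a <= b && c <= d) ? Optional.of(new SkyBox(a, b, c, d)) : Optional.empty();
    }
}

/** Hypothetical Java rendering of the three operations named on the slide. */
interface FootprintService {
    SkyBox footprint();
    Optional<SkyBox> intersection(SkyBox region, double tolerance);
    boolean contains(double ra, double dec);
}

final class BoxFootprint implements FootprintService {
    private final SkyBox archive;

    BoxFootprint(SkyBox archive) { this.archive = archive; }

    public SkyBox footprint() { return archive; }

    // Assumption: tolerance is a non-negative margin (degrees) added around the footprint.
    public Optional<SkyBox> intersection(SkyBox region, double tolerance) {
        if (tolerance < 0) throw new IllegalArgumentException("tolerance must be >= 0");
        return archive.grow(tolerance).intersect(region);
    }

    public boolean contains(double ra, double dec) { return archive.contains(ra, dec); }
}
```

A real service would return a multi-resolution region description rather than a single box, since the slide stresses that the answer depends on the scale of the question.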

Cross-Matching
- SkyQuery – SkyNode
- Currently lots of proprietary features
  - Data transmitted via .NET DataSet => VOTable
  - Query plan written in MS T-SQL => ADQL
  - Spatial operator restricted to a cone => VORegion
  - Made up metadata delivery => VORegistry
  - Data delivery in XML/HTML => VOTable
- Catalogs in the near future
  - SDSS DR1, FIRST, 2MASS, INT
  - POSS-1, GSC-2, HST, ROSAT, 2dF
  - GALEX, IRAS, PSCZ

Spatial Cross-Match
- For small areas HTM is close to optimal, but needs more speed
- For all-sky surveys the zone algorithm is best
- Current heuristic is a linear chain of all nodes
- Easy to generalize to include precomputed neighbors
- But for all-sky queries this means a very large number of random reads instead of sequential ones
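The idea behind zone-based matching can be sketched in a few lines: bucket one catalog into declination strips as tall as the match radius, then compare each object only against the three strips around it. This is an in-memory illustration under stated assumptions, not the SQL implementation used in SkyQuery: the class and method names are hypothetical, and the sketch scans whole strips instead of also restricting by an RA range within each strip as an indexed database version would.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ZoneMatch {
    /** Angular separation in degrees (haversine formula); inputs in degrees. */
    static double separation(double ra1, double dec1, double ra2, double dec2) {
        double d1 = Math.toRadians(dec1), d2 = Math.toRadians(dec2);
        double sDec = Math.sin((d2 - d1) / 2);
        double sRa = Math.sin(Math.toRadians(ra2 - ra1) / 2);
        double h = sDec * sDec + Math.cos(d1) * Math.cos(d2) * sRa * sRa;
        return Math.toDegrees(2 * Math.asin(Math.min(1.0, Math.sqrt(h))));
    }

    /** Index of the declination strip containing dec. */
    static int zoneOf(double dec, double zoneHeight) {
        return (int) Math.floor((dec + 90.0) / zoneHeight);
    }

    /** Returns {i, j} pairs with a[i] and b[j] within radiusDeg. Points are {ra, dec} in degrees. */
    static List<int[]> match(double[][] a, double[][] b, double radiusDeg) {
        if (radiusDeg <= 0) throw new IllegalArgumentException("radius must be positive");
        double h = radiusDeg; // zone height equal to the match radius
        Map<Integer, List<Integer>> zones = new HashMap<>();
        for (int j = 0; j < b.length; j++) {
            zones.computeIfAbsent(zoneOf(b[j][1], h), k -> new ArrayList<>()).add(j);
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < a.length; i++) {
            int z = zoneOf(a[i][1], h);
            // A match can differ in dec by at most the radius, so only adjacent zones matter.
            for (int dz = -1; dz <= 1; dz++) {
                List<Integer> candidates = zones.getOrDefault(z + dz, Collections.emptyList());
                for (int j : candidates) {
                    if (separation(a[i][0], a[i][1], b[j][0], b[j][1]) <= radiusDeg) {
                        pairs.add(new int[] {i, j});
                    }
                }
            }
        }
        return pairs;
    }
}
```

Because each zone is a contiguous declination strip, a database version can store objects clustered by (zone, ra) and read candidates sequentially, which is where the bulk-neighbor speedup comes from.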

Ferris-Wheel
- Sky split into buckets/zones
- All archives scan in sync
- Queries enter at bottom
- Results come back after full circle
- Only sequential access => buckets get into cache, then queries processed
[Diagram labels: Portal, SDSS]
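A toy simulation of the scheme, to make the control flow concrete. It is a sketch under assumptions: rows are plain integers, a query is a hypothetical `IntPredicate` filter, and a query may join at whatever bucket is current rather than only "at the bottom". Each call to `tick()` reads one bucket once and runs every active query against it, and a query completes after it has seen every bucket.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntPredicate;

final class FerrisWheel {
    /** A query is a row filter plus the rows it has collected so far. */
    static final class Query {
        final IntPredicate filter;
        final List<Integer> results = new ArrayList<>();
        int bucketsSeen = 0;

        Query(IntPredicate filter) { this.filter = filter; }
    }

    private final int[][] buckets;
    private final List<Query> active = new ArrayList<>();
    private int position = 0;

    FerrisWheel(int[][] buckets) {
        if (buckets.length == 0) throw new IllegalArgumentException("need at least one bucket");
        this.buckets = buckets;
    }

    void submit(Query q) { active.add(q); }

    /** Reads the next bucket once, feeds it to all active queries, returns those that finished. */
    List<Query> tick() {
        int[] bucket = buckets[position];
        position = (position + 1) % buckets.length;
        List<Query> done = new ArrayList<>();
        Iterator<Query> it = active.iterator();
        while (it.hasNext()) {
            Query q = it.next();
            for (int row : bucket) {
                if (q.filter.test(row)) q.results.add(row);
            }
            q.bucketsSeen++;
            if (q.bucketsSeen == buckets.length) { // full circle completed
                done.add(q);
                it.remove();
            }
        }
        return done;
    }
}
```

The cost of one bucket read is shared by all queries in flight, at the price of latency: every query waits a full rotation regardless of how selective it is.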

Utilities
- FITSLIB 1.10
  - C# library around the CFITSIO package
  - http://www.cs.jhu.edu/~haridas/tech/Fits/
- MIRAGE
  - Java wrapper around Mirage, can directly access the VORegistry and ConeSearch
  - http://skyservice.pha.jhu.edu/develop/vo/mirage/mirage.html
- HTM2.0
  - Updated HTM library, conforming to the new Region specification
  - http://www.sdss.jhu.edu/htm/
- ADQL
  - Prototype service to convert back and forth between ADQL and SQL
  - http://skyservice.pha.jhu.edu/vivek/msdev/AstroDql/ws/
  - http://skyservice.pha.jhu.edu/vivek/msdev/AstroDql/ws/Archive.asmx
- SDSSQA
  - Java application, emulating MS Query Analyzer

Summary
- Web Services have been remarkably easy to use
- Now different platforms are interoperable
- We have invested a lot of energy to develop various interface libraries (FITS, VOTable)
- Integrating graphics into web services was very easy
- Next:
  - Parallel queries
  - Finish query queue management
  - Upgrade SkyQuery
  - Bring in more archives
  - Ferris-Wheel experiment
  - On-demand database creation
  - 100TB parallel data access layer

http://skyservice.pha.jhu.edu/develop/