Jim Gray Alex Szalay SLAC Data Management Workshop

1 Managing Data for the World Wide Telescope, aka: The Virtual Observatory
Jim Gray, Alex Szalay
SLAC Data Management Workshop

2 The Evolution of Science
Observational Science
  Scientist gathers data by direct observation
  Scientist analyzes data
Analytical Science
  Scientist builds analytical model
  Makes predictions
Computational Science
  Simulate analytical model
  Validate model and make predictions
Data Exploration Science
  Data captured by instruments, or data generated by simulator
  Processed by software
  Placed in a database / files
  Scientist analyzes database / files

3 Information Avalanche
In science, industry, government, ...
  better observational instruments and
  better simulations
  are producing a data avalanche
Examples
  BaBar: grows 1 TB/day; 2/3 simulation information, 1/3 observational information
  CERN: LHC will generate 1 GB/s, ~10 PB/y
  VLBA (NRAO) generates 1 GB/s today
  Pixar: 100 TB/movie
New emphasis on informatics: capturing, organizing, summarizing, analyzing, visualizing
Image courtesy C. Meneveau & A. Szalay @ JHU BaBar, Stanford P&E Gene Sequencer From Space Telescope

4 The Big Picture
[Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in and answers come out]
The Big Problems
  Data ingest
  Managing a petabyte
  Common schema
  How to organize it?
  How to reorganize it?
  How to coexist with others?
  Query and Vis tools
  Support/training
  Performance
    Execute queries in a minute
    Batch query scheduling

5 FTP - GREP
Download (FTP and GREP) are not adequate
  You can GREP 1 MB in a second
  You can GREP 1 GB in a minute
  You can GREP 1 TB in 2 days
  You can GREP 1 PB in 3 years
  Oh!, and 1 PB ~ 3,000 disks
At some point we need
  indices to limit search
  parallel data search and analysis
This is where databases can help
Next generation technique: Data Exploration
  Bring the analysis to the data!
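The scan times above follow from a simple rule: a sequential search takes time proportional to the size of the data. A minimal sketch of that arithmetic, assuming a single sustained throughput of 10 MB/s (an illustrative value, not one stated on the slide, whose figures are rounded):

```python
# Rough scan-time estimate for a sequential, grep-style pass over a dataset.
# ASSUMPTION: one sustained throughput of 10 MB/s; the slide does not state a rate.
ASSUMED_BYTES_PER_SECOND = 10e6

SIZES = {"1 MB": 1e6, "1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}


def scan_seconds(size_bytes, bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Time to read every byte once at a fixed rate."""
    return size_bytes / bytes_per_second


def humanize(seconds):
    """Render a duration in the largest convenient unit."""
    for unit, length in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for label, size in SIZES.items():
        print(f"{label}: {humanize(scan_seconds(size))}")
```

At the assumed rate a petabyte takes about three years to scan on one stream, which is why indices and parallel search become necessary.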

6 The Speed Problem
Many users want to search the whole DB
  ad hoc queries, often combinatorial
  Want ~1 minute response
Brute force (parallel search):
  1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
Indices (limit search, do column store)
  1,000x less equipment: 1M$/PB
Pre-compute answer
  No one knows how to do it for all questions.
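The brute-force estimate can be reproduced with a back-of-the-envelope calculation: divide the data volume by what one disk can read within the response deadline. The sketch below uses the slide's 50 MBps and ~1 minute; treating "1,000x less equipment" as "each query touches 1/1000 of the data" is an assumption made here for illustration.

```python
import math


def disks_for_deadline(data_bytes, disk_bytes_per_second, deadline_seconds):
    """Disks that must scan in parallel to read data_bytes within the deadline."""
    per_disk = disk_bytes_per_second * deadline_seconds
    return math.ceil(data_bytes / per_disk)


PETABYTE = 1e15
DISK_RATE = 50e6  # 50 MBps per disk, from the slide
DEADLINE = 60     # ~1 minute response, from the slide

brute = disks_for_deadline(PETABYTE, DISK_RATE, DEADLINE)
# ASSUMPTION: an index or column store lets a query read 1/1000 of the data.
indexed = disks_for_deadline(PETABYTE / 1000, DISK_RATE, DEADLINE)

if __name__ == "__main__":
    print(f"brute force: {brute:,} disks; with indices: {indexed:,} disks")
```

Under these inputs the brute-force estimate is roughly 3.3 x 10^5 disks. The slide quotes a rounder ~1M, so the sketch illustrates the method rather than reproducing the exact figure.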

7 Next-Generation Data Analysis
Looking for
  Needles in haystacks – the Higgs particle
  Haystacks: dark matter, dark energy
Needles are easier than haystacks
Global statistics have poor scaling
  Correlation functions are N^2, likelihood techniques N^3
As data and computers grow at the same rate, we can only keep up with N log N
A way out?
  Relax notion of optimal (data is fuzzy, answers are approximate)
  Don’t assume infinite computational resources or memory
Combination of statistics & computer science

8 Analysis and Databases
Much statistical analysis deals with
  Creating uniform samples – data filtering
  Assembling relevant subsets
  Estimating completeness
  Censoring bad data
  Counting and building histograms
  Generating Monte-Carlo subsets
  Likelihood calculations
  Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a database
Move Mohamed to the mountain, not the mountain to Mohamed.
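As an illustration of doing this work inside the database, the sketch below censors bad data and builds a histogram in a single SQL query, so only the summary leaves the database. It uses Python's built-in sqlite3 module; the table name, columns, and rows are hypothetical and not taken from any real archive.

```python
import sqlite3

# Hypothetical catalog: id, magnitude, and a quality flag (0 = good).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
rows = [(1, 14.2, 0), (2, 14.8, 0), (3, 15.1, 0),
        (4, 15.9, 1), (5, 16.3, 0), (6, 16.4, 0)]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)

# Drop flagged rows and count objects per 1-magnitude bin, all server-side.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS mag_bin, COUNT(*) AS n
    FROM objects
    WHERE flag = 0
    GROUP BY mag_bin
    ORDER BY mag_bin
    """
).fetchall()

if __name__ == "__main__":
    for mag_bin, n in histogram:
        print(f"mag {mag_bin}-{mag_bin + 1}: {n}")
```

The client receives three rows instead of the whole table, which is the point of moving the analysis to the data.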

9 Organization & Algorithms
Use of clever data structures (trees, cubes):
  Up-front creation cost, but only N log N access cost
  Large speedup during the analysis
  Tree-codes for correlations (A. Moore et al 2001)
  Data Cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
  No need to be more accurate than cosmic variance
  Fast CMB analysis by Szapudi et al (2001)
    N log N instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
  Controlled level of accuracy
  Best result in a given time, given our computing resources
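To make the data-structure argument concrete, the sketch below counts close pairs of points, the core operation of a two-point correlation estimate, in two ways: by testing all N(N-1)/2 pairs, and by first binning points into a uniform grid so that only neighbouring cells are compared. The grid is a deliberately simple stand-in chosen here for illustration; it is not the tree code of Moore et al. or the CMB method of Szapudi et al.

```python
import random
from collections import defaultdict
from itertools import combinations


def pairs_brute_force(points, r):
    """Count pairs closer than r by testing all N*(N-1)/2 pairs."""
    r2 = r * r
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 < r2
    )


def pairs_grid(points, r):
    """Count the same pairs, comparing only points in the same or adjacent cells."""
    r2 = r * r
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // r), int(y // r))].append((x, y))  # up-front build cost

    count = 0
    for (cx, cy), members in cells.items():
        # Pairs inside one cell.
        for (x1, y1), (x2, y2) in combinations(members, 2):
            if (x1 - x2) ** 2 + (y1 - y2) ** 2 < r2:
                count += 1
        # Pairs with "forward" neighbour cells, so each cell pair is visited once.
        for dx, dy in ((1, -1), (1, 0), (1, 1), (0, 1)):
            for x2, y2 in cells.get((cx + dx, cy + dy), ()):
                for x1, y1 in members:
                    if (x1 - x2) ** 2 + (y1 - y2) ** 2 < r2:
                        count += 1
    return count


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(2000)]
    print(pairs_grid(pts, 0.02), pairs_brute_force(pts, 0.02))
```

Both functions return the same count. For roughly uniform data and a small radius, the grid version inspects only a small fraction of the pairs that the brute-force version does.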

10 World Wide Telescope Virtual Observatory http://www.ivoa.net/
Premise: Most data is (or could be) online
The Internet is the world’s best telescope:
  It has data on every part of the sky
  In every measured spectral band: optical, x-ray, radio, ...
  As deep as the best instruments (2 years ago).
  It is up when you are up.
  The “seeing” is always great (no working at night, no clouds, no moons, no ...).
It’s a smart telescope: links objects and data to literature on them.

11 Why Astronomy?
Community has lots of data
  Data is real and well documented
  High-dimensional (with confidence intervals)
  Spatial, temporal
Diverse and distributed
  Many different instruments from many different places and many different times
Community wants to share/cross compare
  Can freely share data and algorithms.
  “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
They are well organized
Community is small and homogeneous
No commercial or privacy concerns
All the problems are technical or social.

12 The WWT Components
Data Sources
  Literature
  Archives
Unified Definitions
  Units, Semantics/Concepts/Metrics, Representations, Provenance
Object model
  Classes and methods
Portals

13 Data Sources
Literature online and cross indexed
  Simbad, ADS, NED
Many curated archives online
  FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, …
  Typically files with English meta-data and some programs
Groups, Researchers, Amateurs
  Publish datasets online in various formats
  Data publications are ephemeral (may disappear)
  Many have unknown provenance
  Documentation varies; some good and some none.

14 Unified Definitions
Universal Content Definitions
  Collated all table heads from all the literature
  100,000 terms reduced to ~1,500
  Rough consensus that this is the right thing.
  Refinement in progress as people use UCDs
Defines
  Units: gram, radian, second, jansky, ...
  Semantic Concepts / Metrics: Std error, Chi2 fit, magnitude, passband, velocity, ...

15 Provenance
Most data will be derived.
To do science, need to trace derived data back to source.
So programs and inputs must be registered.
Must be able to re-run them.
Example: Space Telescope Calibrated Data
  Run on demand
  Can specify software version (to get old answers)
Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).

16 Object Model
General acceptance of XML
Recent acceptance of XML Schema (XSD over DTD)
Wait-and-See about SOAP/WSDL/…
  “Web Services are just Corba with angle brackets.”
  FTP is good enough for me.
Personal opinion: Web Services are much more than “Corba + <>”
  Huge focus on interop
  Huge focus on integrated tools
But the community says “Show me!”
  Many technologists convinced, but not yet the astronomers
[Diagram: your program gets a web page from a web server over http; your program calls a web service over soap and receives an object in xml as data in your address space]

17 Classes and Methods
[Diagram: your program calls a web service over soap; the object in xml becomes data in your address space]
First Class: VO table
  Represents an answer set in XML
  Defined by an XML Schema (XSD)
  Metadata (in terms of UCDs)
  Data representation (numbers and text)
First method
  Cone Search: Get objects in this cone
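The selection that a cone search performs can be sketched as a filter on angular distance: keep every object whose separation from the cone's centre is within the search radius. The code below is an in-memory illustration only; the catalog entries and function names are hypothetical, and it does not call any real web service.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the objects lying inside the cone centred on (ra, dec)."""
    return [
        obj for obj in catalog
        if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg
    ]


# Hypothetical catalog entries, positions in degrees.
CATALOG = [
    {"id": "a", "ra": 180.00, "dec": 0.0},
    {"id": "b", "ra": 180.05, "dec": 0.0},
    {"id": "c", "ra": 181.00, "dec": 0.5},
]

if __name__ == "__main__":
    print([obj["id"] for obj in cone_search(CATALOG, 180.0, 0.0, 0.1)])
```

A real service would run this selection against an indexed database and return the answer set as a VO table rather than a Python list.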

18 Other Classes
[Diagram: your program calls a web service over soap; the object in xml becomes data in your address space]
Space-Time class
Image Class (returns pixels)
  SdssCutout
  Simple Image Access Protocol
  HyperAtlas
Spectral
  Simple Spectral Access Protocol
  500K spectra available at
Query Services
  ADQL and SkyNode
And Registry: see below

19 The Registry
UDDI seemed inappropriate
  Complex
  Irrelevant questions
  Relevant questions missing
Evolved Dublin Core
  Represent Datasets, Services, Portals
  Needs to be machine readable
  Federation (DNS model)
  Push & Pull: register then harvest

20 Demo
SkyServer:
  navigator showing cutout web service
  List: showing many calls and variant use.
SkyQuery:
  Show integration of various archives.
  Explain spatial join xMatch operator.

21 SkyServer.SDSS.org
A modern Astronomy archive
  Raw Pixel data lives in file servers
  Catalog data (derived objects) lives in Database
  Online query to any and all
Also used for education
  150 hours of online Astronomy
  Implicitly teaches data analysis
Interesting things
  Spatial data search
  Client query interface via Java Applet
  Query interface via Emacs
  Popular
  Cloned by other surveys (a template design)
  Web services are core of it.

22 SkyQuery A Prototype WWT
Started with SDSS data and schema
Imported 12 other datasets into that spine schema (a day per dataset plus load time)
Unified them with a portal
Implicit spatial join among the datasets.
All built on Web Services
  Pure XML
  Pure SOAP
  Used .NET toolkit
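A spatial join pairs up objects from different catalogs that lie at nearly the same sky position. The sketch below is a deliberately simplified version, assumed here for illustration: for each object in one catalog it picks the nearest object in the other and accepts the pair if it falls within a fixed tolerance. It is not the actual SkyQuery implementation, and the catalog entries and identifiers are hypothetical.

```python
import math


def separation_arcsec(a, b):
    """Great-circle separation in arcseconds between two objects with ra/dec in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (a["ra"], a["dec"], b["ra"], b["dec"]))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600


def cross_match(left, right, tolerance_arcsec):
    """Pair each left object with its nearest right object, if within tolerance."""
    matches = []
    for obj in left:
        # Brute-force nearest neighbour; a real archive would use a spatial index.
        best = min(right, key=lambda other: separation_arcsec(obj, other), default=None)
        if best is not None and separation_arcsec(obj, best) <= tolerance_arcsec:
            matches.append((obj["id"], best["id"]))
    return matches


# Hypothetical entries from two surveys, positions in degrees.
SURVEY_A = [{"id": "A1", "ra": 10.0, "dec": 20.0}, {"id": "A2", "ra": 50.0, "dec": -5.0}]
SURVEY_B = [{"id": "B1", "ra": 10.0002, "dec": 20.0001}, {"id": "B2", "ra": 120.0, "dec": 30.0}]

if __name__ == "__main__":
    print(cross_match(SURVEY_A, SURVEY_B, tolerance_arcsec=2.0))
```

Only A1 and B1 are paired here: they are under an arcsecond apart, while A2 has no counterpart within the tolerance.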

23 Federation: SkyQuery.Net
Combine 4 archives initially
Added 9 more
Send query to portal, portal joins data from archives.
Problem: want to do multi-step data analysis (not just single query).
Solution: Allow personal databases on portal
Problem: some queries are monsters
Solution: “batch schedule” on portal server, deposits answer in personal database.

24 SkyQuery Structure
Each SkyNode publishes
  Schema Web Service
  Database Web Service
Portal
  Plans Query (2 phase)
  Integrates answers
  Is a web service
[Diagram: SkyQuery Portal connected to 2MASS, INT, SDSS, FIRST, and Image Cutout]

25 MyDB http://skyservice.pha.jhu.edu/devel/casjobs/
Portal allows federation of data but…
  Intermediate results may be large.
  Intermediate results feed into next analysis step.
  Sending them back-and-forth to client is costly and sometimes infeasible.
Solution: create a working DB for client at Portal: MyDB

26 MyDB http://skyservice.pha.jhu.edu/devel/casjobs/
Anyone can create a personal DB at SkyServer portal.
  It is about 100 MB
  It is private
Simple queries done immediately
Complex queries done by batch scheduler
All queries can create/read/write MyDB tables
Very popular with “serious” users.
MyDB will be sharable by a group.
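The MyDB idea can be illustrated with two databases on one connection: a shared archive and a private working database. An intermediate result is written into the working database and the next analysis step reads it there, so nothing bulky travels to the client. The sketch uses Python's built-in sqlite3 module as a stand-in; the table names, columns, and rows are hypothetical and do not reflect the real SkyServer schema.

```python
import sqlite3

# Shared archive, standing in for the catalog at the portal.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photo_obj (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, mag REAL)"
)
conn.executemany(
    "INSERT INTO photo_obj VALUES (?, ?, ?, ?)",
    [(1, 180.0, 0.0, 14.5), (2, 180.1, 0.1, 15.5),
     (3, 180.2, 0.2, 17.0), (4, 180.3, 0.3, 18.2)],
)

# Private working database, standing in for a user's MyDB.
conn.execute("ATTACH DATABASE ':memory:' AS mydb")

# Step 1: keep the intermediate result next to the data.
conn.execute(
    "CREATE TABLE mydb.bright AS "
    "SELECT id, ra_deg, dec_deg, mag FROM photo_obj WHERE mag < 16"
)

# Step 2: a later analysis step reads the saved table; only a summary is returned.
summary = conn.execute("SELECT COUNT(*), MIN(mag), MAX(mag) FROM mydb.bright").fetchone()

if __name__ == "__main__":
    print(summary)
```

The design choice is the same one the slide describes: the client exchanges small queries and summaries with the portal, while large intermediate tables stay server-side.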

27 Open SkyQuery
SkyQuery being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).
SkyNode basic archive object
SkyQuery Language (VoQL) is evolving.

28 The WWT Components
Outline
  Data Sources
    Literature
    Archives
  Unified Definitions
    Units, Semantics/Concepts/Metrics, Representations, Provenance
  Object model
    Classes and methods
  Portals
WWT is a poster child for the Data Grid.
What we learned
  Astro is a community of 10,000
    Homogeneous & Cooperative
    If you can’t do it for Astro, do not bother with 3M bio-info.
  Agreement
    Takes time
    Takes endless meetings
  Big problems are non-technical
    Legacy is a big problem.
  Plumbing and tools are there
    But… What is the object model?
    What do you want to save?
    How to document provenance?

