Data-Intensive Science at Johns Hopkins University Institute for Data-Intensive Engineering and Science (IDIES) Johns Hopkins University Jordan Raddick.

Case Study: Astronomy
In the “old days” astronomers took photos.
New instruments are digital (< 100 GB/night)
Detectors are following Moore’s law.
Data avalanche: volume doubles every 2 years
Figure: total area of 3m+ telescopes in the world in m², and total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.
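
The growth factors quoted for the figure can be turned into doubling times. The sketch below assumes smooth exponential growth over the 25 years, which the slide does not state; the function name is illustrative.

```python
import math

def doubling_time_years(growth_factor: float, span_years: float) -> float:
    """Years needed to double, given total growth over a span of years.

    Assumes steady exponential growth across the whole span.
    """
    return span_years * math.log(2) / math.log(growth_factor)

# Figures quoted on the slide: growth over 25 years.
glass = doubling_time_years(30, 25)      # telescope collecting area
pixels = doubling_time_years(3000, 25)   # CCD pixel count

print(f"glass doubles every {glass:.1f} years")
print(f"pixels double every {pixels:.1f} years")
```

Under that assumption the pixel count doubles roughly every two years, in line with the "doubles every 2 years" statement, while collecting area doubles about every five.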

Why astronomy?
Because “it’s worthless” (Jim Gray)
  –No privacy restrictions
  –No intellectual property
  –No one wants to sell you data
Because it’s intrinsically interesting
And because there’s a lot of it

Astronomy Data
Astronomers have a few Petabytes now
Data doubles every 2 years
Data is public after 2 years
So, 50% of the data is public
But… how do I get at that 50% of the data?
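
The 50% figure follows from the two numbers on the slide: if the total doubles every 2 years, then whatever existed 2 years ago is half of what exists today. A minimal sketch, assuming smooth exponential growth and a fixed proprietary period (the function name is illustrative):

```python
def public_fraction(doubling_years: float, embargo_years: float) -> float:
    """Fraction of all data older than the embargo period.

    Assumes the total grows as D(t) = D0 * 2 ** (t / doubling_years),
    so D(t - embargo) / D(t) = 2 ** (-embargo / doubling_years).
    """
    return 2.0 ** (-embargo_years / doubling_years)

print(public_fraction(2, 2))   # values from the slide
print(public_fraction(2, 1))   # hypothetical shorter embargo, for comparison
```

With the slide's values the function returns one half; a shorter embargo would make a larger share public.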

Science is breaking
You can…
  –search 1 MB in 1 second, transfer for < 1¢
  –search 1 GB in 1 minute, transfer for $1
  –search 1 TB in 2 days, transfer for $1,000
  –search 1 PB in 3 years, transfer for $1,000,000
  –…and 1 PB is 5,000 disks
“Data avalanche” in all fields of science
  –New approach needed
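
The search times and transfer costs above imply a roughly constant scan rate and a constant price per gigabyte, which is why the totals become unmanageable at petabyte scale. The sketch below works this out, assuming decimal units (1 GB = 10^9 bytes) and a 365-day year; neither is stated on the slide.

```python
DAY = 86_400
YEAR = 365 * DAY

# (label, size in bytes, search time in seconds, transfer cost in dollars)
rows = [
    ("1 MB", 1e6, 1, 0.01),              # slide says "< 1 cent"; 0.01 is an upper bound
    ("1 GB", 1e9, 60, 1.0),
    ("1 TB", 1e12, 2 * DAY, 1_000.0),
    ("1 PB", 1e15, 3 * YEAR, 1_000_000.0),
]

def implied_rate_mb_s(size_bytes: float, seconds: float) -> float:
    """Sequential scan rate implied by a size and a search time."""
    return size_bytes / seconds / 1e6

def cost_per_gb(size_bytes: float, dollars: float) -> float:
    """Transfer price per gigabyte implied by a size and a total cost."""
    return dollars / (size_bytes / 1e9)

for label, size, secs, dollars in rows:
    rate = implied_rate_mb_s(size, secs)
    price = cost_per_gb(size, dollars)
    print(f"{label}: {rate:6.1f} MB/s implied, ${price:.2f} per GB")
```

From 1 GB upward the implied price is $1 per GB and the implied scan rate stays within about 6 to 17 MB/s, so both time and cost grow linearly with volume.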

Data Intensive Science Tools
Databases instead of files (DBMSs)
  –Data integrity ensured
  –Optimized data access (DB indices)
  –Ability to define methods on data
  –Do the science inside the database!
  –Bring the analysis to the data, not vice-versa
Scalable (parallel) data access
High-speed transport protocols
  –Move the data rapidly when necessary
Asynchronous Web services
  –Web browsers cannot handle data volumes
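
"Bring the analysis to the data" can be illustrated with any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in (SDSS itself uses SQL Server), with a made-up four-row table; the table name, column names, and values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, kind TEXT, mag REAL)")
conn.executemany(
    "INSERT INTO objects (kind, mag) VALUES (?, ?)",
    [("galaxy", 17.2), ("star", 14.1), ("galaxy", 18.4), ("star", 15.3)],
)
# An index lets the engine find matching rows without scanning the whole table.
conn.execute("CREATE INDEX idx_objects_kind ON objects (kind)")

# Analysis inside the database: only one small summary row comes back.
in_db = conn.execute(
    "SELECT COUNT(*), AVG(mag) FROM objects WHERE kind = ?", ("galaxy",)
).fetchone()

# Analysis outside the database: every row is shipped to the client first.
rows = conn.execute("SELECT kind, mag FROM objects").fetchall()
mags = [mag for kind, mag in rows if kind == "galaxy"]
client_side = (len(mags), sum(mags) / len(mags))

print(in_db, client_side)
```

Both paths give the same answer, but the first moves one row over the wire regardless of table size, while the second moves the entire table.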

Gray’s Laws of Data Engineering
Jim Gray
  –Scientific computing is revolving around data
  –Need scale-out solution for analysis
  –Take the analysis to the data!
  –Start with “20 queries” use cases

Virtual Observatory
Premise: Most data is (or could be) online
So, the Internet is the world’s best telescope:
  –It has data on every part of the sky
  –In every measured spectral band: optical, x-ray, radio...
  –As deep as the best instruments (2 years ago).
  –It is up when you are up. The observing conditions are always great (no working at night, no clouds, no moons, no...).
  –It’s a smart telescope: links objects and data to literature on them.

Sloan Digital Sky Survey
A map of the universe
Telescope at Apache Point Observatory (Sunspot, NM)
  –2.5 meter reflector
  –~3 degree field of view
  –Drift scanning (telescope stationary, sky appears to move)

An Ambitious Survey
A thousand-fold increase in the amount of data!
Info content > US Library of Congress
Before SDSS, total number of galaxies with measured parameters ~ 200k
With SDSS, we already have detailed parameters for over 200 Million galaxies!!

SDSS Data Overview
Data Archive Server (DAS)
  –FITS files (raw data)
  –Images, spectra, corrected frames, atlas images, binned images, masks
  –Online form-based access
  –Rsync and wget file retrieval
Catalog Archive Server (CAS)
  –Science parameters extracted to catalogs
  –Stuffed into relational DBMS (SQL Server)
  –Heavily indexed, optimized
  –Online access via SkyServer
  –Several levels of access, query tools
SDSS Data Release
  –DAS: das.sdss.org/DRx-cgi-bin/DAS
  –CAS: cas.sdss.org, skyserver.sdss.org

The CAS Databases
Processed data is stuffed into a commercial relational DBMS
  –Microsoft SQL Server 2000
Allows fast exploration and analysis - Data Mining
Two versions of the sky: Best and Target
  –Target is version of sky on which spectroscopic targets were chosen
  –Best is latest, greatest processing of the data
  –2 DBs for each release, e.g. BestDR3 and TargDR3
Heavily indexed to speed up access
  –HTM + DB Indices
Short queries can run interactively.
Long queries (> 1 hr) require a custom Batch Query System.
SDSS Pipelines

SkyServer Web Access
Supports several levels of SQL access
  –Novice/casual users: Radial (cone) and Rectangular search
  –Intermediate/astro users: Imaging and Spectro Form Query
  –Expert: Free-form SQL, Object Crossid (upload RA/dec list), CasJobs workbench environment (MyDB)
Visual tools
  –ImageCutout service: Finding Chart, Navigate/browse images, image lists
  –Explore tool: Detailed info for each image object, including spectrum
Downloadable interfaces
  –Emacs, Java tool (sdssQA), sqlcl (command-line)
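
A radial (cone) search returns every object within a given angular distance of a sky position. The sketch below shows the idea with a brute-force scan over a tiny hypothetical catalog, using the haversine formula for the separation; it is not SkyServer's implementation, which relies on spatial indexing (HTM) inside the database.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return all catalog entries within radius_deg of (ra, dec)."""
    return [obj for obj in catalog
            if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg]

# Hypothetical objects, for illustration only.
catalog = [
    {"name": "A", "ra": 180.00, "dec": 0.00},
    {"name": "B", "ra": 180.05, "dec": 0.02},
    {"name": "C", "ra": 185.00, "dec": 2.00},
]
hits = cone_search(catalog, ra=180.0, dec=0.0, radius_deg=0.1)
print([obj["name"] for obj in hits])
```

A scan like this touches every row, which is exactly what a spatial index avoids for a catalog of hundreds of millions of objects.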

CAS Workload Management
Query execution time follows power law
Vast majority of queries under a minute
Separate short and long queries
Execute long queries in batch mode
Short and long queues (short=1min, long=8hrs)
Strictly limit time of query in a queue
Provide user workspace DB – MyDB
  –Reduce network traffic from repeat and intermediate results
  –Allow sharing of query results between user groups
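
The two-queue policy can be sketched as a routing rule plus a hard time limit per queue. The limits come from the slide; the routing function and the idea of an estimated run time are assumptions made for illustration, not a description of how CAS decides.

```python
SHORT_LIMIT_S = 60             # short queue: 1 minute (from the slide)
LONG_LIMIT_S = 8 * 60 * 60     # long queue: 8 hours (from the slide)
LIMITS = {"short": SHORT_LIMIT_S, "long": LONG_LIMIT_S}

def choose_queue(estimated_runtime_s: float) -> str:
    """Hypothetical routing rule: anything expected to exceed the short limit goes to batch."""
    return "short" if estimated_runtime_s <= SHORT_LIMIT_S else "long"

def outcome(queue: str, actual_runtime_s: float) -> str:
    """A query is aborted once it exceeds the limit of the queue it runs in."""
    return "completed" if actual_runtime_s <= LIMITS[queue] else "aborted"

# (estimated seconds, actual seconds) for four made-up queries
for est, actual in [(5, 4), (30, 90), (3600, 4000), (3600, 30_000)]:
    q = choose_queue(est)
    print(f"estimated {est:>5}s -> {q:5} queue, ran {actual:>6}s: {outcome(q, actual)}")
```

Because most queries finish in under a minute, a strict short queue keeps interactive use responsive while the few long queries wait in batch.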

CasJobs and MyDB
Batch Query Workbench for SDSS CAS
Queries are queued for batch execution
  –Load balancing – queues on multiple servers
  –Limit of 2 simultaneous queries per server
Short (1 minute) queue for immediate mode
  –Query aborted after 1 minute
MyDB personal database
  –1 GB (more on demand) SQL DB for each user
  –Long queries write to MyDB table by default
  –User can extract output (download) when ready
  –Ability to share MyDB tables with other users via group visibility
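
Load balancing with a cap of 2 running queries per server can be sketched as a small scheduler. Only the cap comes from the slide; the least-loaded placement rule, the class, and the server names are hypothetical and are not taken from CasJobs.

```python
from collections import deque

MAX_RUNNING_PER_SERVER = 2   # limit quoted on the slide

class BatchScheduler:
    """Toy scheduler: send each waiting query to the least-loaded server."""

    def __init__(self, servers):
        self.running = {name: [] for name in servers}
        self.waiting = deque()

    def submit(self, query_id):
        self.waiting.append(query_id)
        self._dispatch()

    def finish(self, server, query_id):
        self.running[server].remove(query_id)
        self._dispatch()

    def _dispatch(self):
        while self.waiting:
            server = min(self.running, key=lambda s: len(self.running[s]))
            if len(self.running[server]) >= MAX_RUNNING_PER_SERVER:
                return  # every server is full; the rest keep waiting
            self.running[server].append(self.waiting.popleft())

sched = BatchScheduler(["cas1", "cas2"])   # hypothetical server names
for q in ["q1", "q2", "q3", "q4", "q5"]:
    sched.submit(q)
print(sched.running, list(sched.waiting))
```

With two servers, four queries run and the fifth waits until a slot frees up, at which point `finish` dispatches it automatically.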

SkyServer
Entire database of the Sloan Digital Sky Survey
  –Free to anyone
  –4 TB of data
  –287 million sky objects
  –1.2 million spectra

Why data to the public?
All SDSS data available to general public
  –Commitment to data sharing
  –People can look at any star or galaxy
  –Inquiry learning known to be effective (learners answer a question themselves, with their own design)

Browse Data

Explore Data

Search for Data

From Access to Learning
Not enough just to give public access to data
How are they using it?
Are they learning anything from it?
  –Do they understand what they’re doing and why?*
*Understanding = long-term memory, transfer to new situations (How People Learn)

Formal Education: Projects

SkyServer projects
Three levels
  –Kids (K-8)
  –Basic (high school or Astro 101)
  –Advanced (skilled and motivated students)
Research challenges
  –Independent research (open inquiry) with data
Activities created by users

SkyServer projects

Use of educational projects
2 of top 20 IPs to hit entire site are K-12 districts
  –Orlando, FL & Los Angeles

Citizen science: Galaxy Zoo
Background: galaxies come in different shapes
Classifying is hard for computers, easy for people
So get the public to help!
Image labels: spiral, elliptical

Galaxy Zoo

Galaxy Zoo Discovery
Chart labels: clockwise (48%), counter-clockwise (52%)
But if you mirror galaxies, it’s still 48/52!
Counter-clockwise excess due to perception
Unintentional psychology discovery!

Galaxy Zoo Discovery
“Hanny’s Voorwerp” (“Voorwerp” = object)
Strange blue thing found by Dutch teacher Hanny van Arkel
Most likely “light echo” from a long-dead quasar
Follow-up time on HST
Galaxy Zoo great for finding the unexpected!

Galaxy Zoo and learning
Even people who love science often don’t understand peer review, proposals, etc.
Want to use Galaxy Zoo to show process of science
As we write science papers, we share results with volunteers

Galaxy Zoo blog

Other Science: Sensors
Small, self-contained devices to measure soil temperature, etc.
Deployed around Maryland
Study carbon cycle, nesting of turtles, etc.
Wide range of applications

Other science: Turbulence
Large-scale simulation of isotropic turbulence in fluid
  –Imagine shaking a box of water
FORTRAN code
1024 x 1024 x 1024 mesh
10,240 timesteps
Every 10th timestep written to SQL database
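
The mesh size and timestep counts fix how many snapshots are stored and how many mesh points each holds. The sketch below adds two assumptions that are not on the slide (4 stored values per mesh point, 4 bytes each) to get a rough storage estimate; the real database layout may differ.

```python
GRID = 1024            # mesh points per side (from the slide)
TIMESTEPS = 10_240     # simulated timesteps (from the slide)
WRITE_EVERY = 10       # every 10th timestep is stored (from the slide)

# Assumptions, not from the slide: 4 values per mesh point
# (3 velocity components + pressure), stored as 4-byte floats.
FIELDS_PER_POINT = 4
BYTES_PER_VALUE = 4

points = GRID ** 3
snapshots = TIMESTEPS // WRITE_EVERY
bytes_per_snapshot = points * FIELDS_PER_POINT * BYTES_PER_VALUE
total_bytes = snapshots * bytes_per_snapshot

print(f"{points:,} mesh points, {snapshots} stored snapshots")
print(f"{bytes_per_snapshot / 2**30:.0f} GiB per snapshot, "
      f"{total_bytes / 2**40:.0f} TiB total (under the stated assumptions)")
```

Even with modest per-point storage the archive lands in the tens of terabytes, which is the scale at which shipping files to users stops being practical.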

Other science: Turbulence
Database accessible online through web services
Can call database from FORTRAN or C code
Like having a supercomputer simulation on your laptop!

Graywulf
Data-intensive computing architecture
  –40 worker nodes, 1440 MB/s each
  –6 head nodes, 2100 MB/s each
  –DDR InfiniBand interconnect
Deployed mid-2008
Architecture for future projects
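
The per-node rates above add up to an aggregate throughput, which in turn gives a rough time to scan a petabyte. The sketch assumes every node sustains its quoted rate at once, the data is perfectly partitioned, and units are decimal; these are idealizations, not measured results.

```python
WORKERS, WORKER_MB_S = 40, 1440   # from the slide
HEADS, HEAD_MB_S = 6, 2100        # from the slide

aggregate_mb_s = WORKERS * WORKER_MB_S + HEADS * HEAD_MB_S

# Idealized time to read 1 PB (1e15 bytes) once at the aggregate rate.
scan_1pb_hours = 1e15 / (aggregate_mb_s * 1e6) / 3600

print(f"aggregate: {aggregate_mb_s:,} MB/s (~{aggregate_mb_s / 1000:.0f} GB/s)")
print(f"1 PB scan: ~{scan_1pb_hours:.1f} hours under these assumptions")
```

Compared with the "search 1 PB in 3 years" figure from the earlier slide, a scale-out design of this kind brings a full petabyte scan down to the order of hours.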

Graywulf at SC08

Contact Information
Jordan Raddick