The Pan-STARRS Data Challenge
Jim Heasley, Institute for Astronomy, University of Hawaii
ICS 624 – 28 March 2011

What is Pan-STARRS?
- Pan-STARRS is a new telescope facility: four smallish (1.8 m) telescopes, but with an extremely wide field of view
- It can scan the sky rapidly and repeatedly, and can detect very faint objects, giving it a unique time-resolution capability
- The project is led by the IfA with help from the Air Force, the Maui High Performance Computing Center, and MIT Lincoln Laboratory
- The prototype, PS1, will be operated by an international consortium

Pan-STARRS Overview
- Time domain astronomy: transient objects, moving objects, variable objects
- Static sky science, enabled by stacking repeated scans to form a collection of ultra-deep static-sky images
- Pan-STARRS observatory specifications:
  - Four 1.8 m Ritchey-Chrétien telescopes with correctors
  - 7 square degree field of view; 1.4-gigapixel cameras
  - Sited in Hawaii
  - Etendue AΩ = 50 m² deg²
  - R ~ 24 in a 30 s integration
  - About 7000 square degrees covered per night
  - All-sky plus deep-field surveys in g, r, i, z, y

The Published Science Products Subsystem

Front of the Wave
Pan-STARRS is only the first of a new generation of astronomical data programs that will generate very large volumes of data:
- SkyMapper, a southern-hemisphere optical survey
- VISTA, a southern-hemisphere IR survey
- LSST, an all-sky survey like Pan-STARRS
Eventually, these data sets will be useful for data mining.

PS1 Data Products
Detections: measurements obtained directly from processed image frames
- Detection catalogs
- "Stacks" of the sky images and their source catalogs
- Difference catalogs: high-significance transient events (> 5σ) and low-significance transients (between 3σ and 5σ)
- Other image stacks (Medium Deep Survey)
Objects: aggregates derived from detections
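To make the detection/object split concrete, here is a minimal relational sketch. All table and column names below are hypothetical stand-ins, not the actual PSPS schema, which is far richer.

-- Hypothetical, simplified tables illustrating the detection/object split.
CREATE TABLE Object (
    objID  BIGINT NOT NULL PRIMARY KEY,   -- aggregate derived from detections
    ra     FLOAT  NOT NULL,               -- mean position (degrees)
    [dec]  FLOAT  NOT NULL
);

CREATE TABLE Detection (
    detectID BIGINT NOT NULL PRIMARY KEY,           -- one measurement from one frame
    objID    BIGINT NULL REFERENCES Object(objID),  -- linkage to the parent object
    frameID  BIGINT NOT NULL,                       -- processed image frame of origin
    mag      REAL   NOT NULL,                       -- calibrated magnitude
    signif   REAL   NOT NULL                        -- detection significance (sigma)
);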

What's the Challenge?
At first blush, this looks pretty much like the Sloan Digital Sky Survey... BUT:
- Size: over its 3-year mission, PS1 will record over 150 billion detections for approximately 5.5 billion sources
- Dynamic nature: new data will always be coming into the database system, both for things we've seen before and for new discoveries

Book Learning
The books on database design tell you to:
- Interview your users to determine what they want to use the database for
- Determine the most common queries your users are going to ask
- Organize your data into a normalized logical schema
- Select a physical schema appropriate to your problem

Real World
- The infamous "20 Queries" approach of Alex Szalay (JHU) in designing the SDSS
- Normalized schemas are good but can carry very big performance penalties
- Money talks: in the real world you are constrained by a budget, and not all physical implementations of your database may be affordable (for one reason or another)!

PSPS Top Level Requirements
- The PSPS shall be able to ingest a total of 1.5×10^11 P2 detections, 8.3×10^10 cumulative-sky detections, and 5.5×10^9 celestial objects, together with their linkages.

PSPS Top Level Requirements
- The PSPS shall be able to ingest the observational metadata for up to a total of 1.1×10^10 observations.
- The PS1 PSPS shall be capable of archiving up to ~100 terabytes of data.

PSPS Top Level Requirements
- The PSPS shall archive the PS1 data products.
- The PSPS shall possess a computer security system to protect potentially vulnerable subsystems from malicious external actions.

PSPS Top Level Requirements
- The PSPS shall provide end-users access to detections of objects in the Pan-STARRS databases.
- The PSPS shall provide end-users access to the cumulative stationary-sky images generated by Pan-STARRS.

PSPS Top Level Requirements
- The PSPS shall provide end-users with the metadata required to interpret the observational legacy and processing history of the Pan-STARRS data products.
- The PSPS shall provide end-users with Pan-STARRS detections of objects in the Solar System for which attributes can be assigned.

PSPS Top Level Requirements
- The PSPS shall provide end-users with derived Solar System objects deduced from Pan-STARRS attributed observations and observations from other sources.
- The PSPS shall provide the capability for end-users to construct queries that search the Pan-STARRS data products over space and time and examine magnitudes, colors, and proper motions.
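For a sense of what that last requirement implies, here is the flavor of end-user query involved. The table and column names (Object, gMag, rMag, pmRA, pmDec) are hypothetical, and the cone-search helper follows the SDSS SkyServer convention (ra and dec in degrees, radius in arcminutes) rather than being a confirmed PSPS function.

-- Hypothetical query: red, fast-moving objects in a 10-arcminute cone.
SELECT o.objID, o.ra, o.[dec],
       o.gMag - o.rMag AS gr_color,               -- color
       o.pmRA, o.pmDec                            -- proper motions (mas/yr)
FROM dbo.Object AS o
JOIN dbo.fGetNearbyObjEq(185.0, 15.5, 10.0) AS n  -- SDSS-style cone search
  ON o.objID = n.objID
WHERE o.gMag < 22.0                               -- magnitude cut
  AND o.gMag - o.rMag > 1.2                       -- color cut
  AND SQRT(o.pmRA * o.pmRA + o.pmDec * o.pmDec) > 50.0;  -- proper-motion cut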

PSPS Top Level Requirements
- The PSPS shall provide a mass storage system with a reliability requirement of 99.9% (TBR).
- The PSPS baseline configuration should accommodate future additions of databases (i.e., be expandable).

How to Approach This Challenge
There are many possible approaches to this data challenge. Shared what?
- Shared memory
- Shared disk
- Shared nothing
Not all of these approaches are created equal, in cost and/or performance (DeWitt & Gray, 1992, "Parallel Database Systems: The Future of High Performance Database Systems").

Conversation with the Pan-STARRS Project Manager
Jim: "Tom, what are we going to do if the solution proposed by SAIC is more than you can afford?"
Tom: "Jim, I'm sure you'll think of something!"
Not long after that, SAIC did give us a hardware/software plan we couldn't afford. Not long after, Tom resigned from the project to pursue other activities…

The SAIC ODM Architecture Proposal
[Diagram: a "left brain" and a "right brain". The left brain, a single multi-processor machine with high-performance storage, holds the objects and serves both ingest and query; the right brain, clustered small-processor machines with high-capacity storage, holds the published detections. Incoming detections are ingested through a staging area and then published.]

The SAIC ODM Architecture Proposal
[The same left-brain/right-brain diagram, annotated with a "$": the hardware came with a price tag.]

Conversation with the Pan-STARRS Project Manager
The Pan-STARRS project teamed up with Alex Szalay and his database team at JHU, as they were the only game in town with real experience building large astronomical databases.

Building upon the SDSS Heritage
- In teaming with the group at JHU, we hoped to build upon the experience and software developed for the SDSS.
- A key question was how we could scale the system to deal with the volume of data expected from PS1 (more than 10x SDSS in the first year alone).
- A second key question was whether the system could keep up with the data flow.
- The heritage is more one of philosophy than of recycled software: to deal with the challenges posed by PS1 we have had to write a great deal of new code.

High-Level Organization [diagram]

Data Storage Logical Schema [diagram]

The Object Data Manager
- The Object Data Manager (ODM) was considered the "long pole" in the development of the PS1 PSPS.
- Parallel database systems can provide both data redundancy and a way to spread very large tables that can't fit on a single machine across multiple storage volumes.
- For PS1 (and beyond) we need both.

Distributed Architecture
- The bigger tables will be spatially partitioned across servers called slices
- Using slices improves system scalability
- Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges
- ObjectID boundaries are selected so that each slice has a similar number of objects
- Distributed Partitioned Views "glue" the data back together
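Distributed Partitioned Views are a stock SQL Server construct. A minimal sketch of how a head node might glue per-slice tables together follows; the linked-server names (Slice01, Slice02, ...) and the PS1 database name are hypothetical.

-- On each slice server, the member table carries a CHECK constraint on the
-- partitioning key so the optimizer can skip non-matching slices, e.g.:
--   ALTER TABLE dbo.Detection ADD CONSTRAINT chkSliceRange
--     CHECK (objID >= CAST(0x0100000000000000 AS BIGINT)
--        AND objID <  CAST(0x0200000000000000 AS BIGINT));

-- On the head node, a view unions the remote member tables (reached via
-- linked servers) into one logical table:
CREATE VIEW dbo.Detection AS
    SELECT * FROM Slice01.PS1.dbo.Detection
    UNION ALL
    SELECT * FROM Slice02.PS1.dbo.Detection
    UNION ALL
    SELECT * FROM Slice03.PS1.dbo.Detection;

-- A WHERE clause on objID then touches only the slices whose CHECK
-- ranges overlap the predicate.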

Data Storage Logical Schema [diagram, shown again]

Design Decisions: ObjID
- Objects have their positional information encoded in their objID: fGetPanObjID(ra, dec, zoneH)
- The zoneID is the most significant part of the ID
- objID is the primary key
- Objects are organized (clustered index) so that objects near each other on the sky are also stored near each other on disk
- This gives good search performance, spatial functionality, and scalability
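A minimal sketch of the idea behind such an encoding, not the actual fGetPanObjID: the bit layout, zone height, and scale factors below are illustrative only.

-- Hypothetical encoding: the declination zone occupies the most significant
-- digits and an RA-derived offset the least significant, so a clustered
-- index on objID keeps objects at similar declinations and RAs close on disk.
CREATE FUNCTION dbo.fGetPanObjID_sketch (@ra FLOAT, @dec FLOAT, @zoneHeight FLOAT)
RETURNS BIGINT
AS
BEGIN
    -- Zone index from declination; zone 0 starts at dec = -90 degrees.
    DECLARE @zoneID BIGINT = CAST(FLOOR((@dec + 90.0) / @zoneHeight) AS BIGINT);
    -- RA in micro-degrees (at most 3.6e8, so it fits below the 1e11 multiplier).
    DECLARE @raBits BIGINT = CAST(FLOOR(@ra * 1000000.0) AS BIGINT);
    RETURN @zoneID * CAST(100000000000 AS BIGINT) + @raBits;
END;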

[Diagram: Pan-STARRS Data Flow / The Pan-STARRS Science Cloud. CSV files flow from the telescope through the Image Processing Pipeline (IPP) into a load workflow and load database; a merge workflow moves data into cold and warm slice databases, and a flip workflow promotes hot slice databases behind the MainDB distributed views. Astronomers (data consumers) reach the data through the CasJobs query service and MyDB. Data flows in one direction, except for error recovery, which is handled by a slice fault recovery workflow; validation and exception notification run as "data valet" workflows across the admin, load-merge, and production machines.]

[Diagram: Pan-STARRS Data Layout. CSV files from the image pipeline (L1 and L2 data) are loaded on six load-merge nodes, which also hold the cold copies of the sixteen logical slices (S1–S16). Eight slice nodes each host hot and warm copies of the slices, with the warm copy of a slice placed on a different node than its hot copy for fault tolerance. Two head nodes serve the main distributed view over the slices.]

The ODM Infrastructure
- Much of our software development has gone into extending the ingest pipeline developed for SDSS.
- Unlike SDSS, we don't have "campaign" loads but a steady flow of data from the telescope through the Image Processing Pipeline to the ODM.
- We have constructed data workflows to handle both the regular data flow into the ODM and anticipated failure modes (lost disks, RAID arrays, and various server nodes).
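As one example of what such a workflow executes, here is a minimal sketch of a set-based load-merge step. The database names (LoadDB, ColdSlice01), column list, and objID range are hypothetical stand-ins for the real workflow's parameters.

-- Fold a validated batch of detections from the load database into the
-- cold copy of one slice; only rows in this slice's objID range move.
DECLARE @sliceLo BIGINT = CAST(0x0100000000000000 AS BIGINT);
DECLARE @sliceHi BIGINT = CAST(0x0200000000000000 AS BIGINT);

INSERT INTO ColdSlice01.PS1.dbo.Detection (detectID, objID, frameID, mag, signif)
SELECT d.detectID, d.objID, d.frameID, d.mag, d.signif
FROM LoadDB.dbo.Detection AS d
WHERE d.objID >= @sliceLo AND d.objID < @sliceHi
  AND NOT EXISTS (SELECT 1
                  FROM ColdSlice01.PS1.dbo.Detection AS t
                  WHERE t.detectID = d.detectID);   -- idempotent if the workflow retries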

Pan-STARRS Object Data Manager Subsystem
[Diagram: "Pan-STARRS Cloud Services for Astronomers". The Pan-STARRS telescope feeds the Image Processing Pipeline, which extracts objects such as stars and galaxies from telescope images (~1 TB input/week). Loaded astronomy databases see ~70 TB of transfer per week; deployed astronomy databases grow by ~70 TB of storage per year. A Query Manager provides science queries and MyDB for results. System and administration workflows orchestrate all cluster changes, such as data loading and fault tolerance, supported by configuration, health, and performance monitoring for cluster deployment and operations; internal data flow and state logging; tools for workflow authoring and execution; and UIs for system operation, system health monitoring, and query performance. Data flow and control flow are shown separately.]

What Next?
- Will this approach scale to our needs?
  - PS1: yes. But we already see the need for better parallel-processing query plans.
  - PS4: unclear! Even though I'm not from Missouri, "show me!" One year of PS4 produces more data than the entire 3-year PS1 mission!
- Column-based databases? Cloud computing?
  - How can we test issues like scalability without actually building the system?
  - Does each project really need its own data center?
  - Having these databases "in the cloud" may greatly facilitate data sharing and mining.