
Pan-STARRS: Learning to Ride the Data Tsunami
María A. Nieto-Santisteban 1, Yogesh Simmhan 3, Roger Barga 3, Tamás Budavári 1, László Dobos 1, Nolan Li 1, Jim Heasley 2, Conrad Holmberg 2, Michael Shipway 1, Alexander Szalay 1, Ani Thakar 1, Michael Thomassy 4, Jan Vandenberg 1, Catharine van Ingen 3, Suzanne Werner 1, Richard Wilton 1, Alainna Wonders 1
1. Johns Hopkins University 2. University of Hawaii 3. Microsoft Research 4. Microsoft SQL Server CAT
Microsoft eScience Workshop, December 8th, 2008, Indianapolis, IN, U.S.A.

Agenda
– GrayWulf: Software Architecture
– Pan-STARRS: Riding the Data Tsunami
María Nieto Santisteban, 2008 Microsoft eScience Workshop

Motivation
The nature of scientific computing is changing: CPU cycles/s → IO/s
– Data storage and datasets grow faster than Moore's Law
– Computation must be moved to the data, as the data cannot be moved to the computation
– The scientific community needs scalable solutions for data-intensive computing

Requirements
– CHEAP!
– Scalability: ability to grow & evolve over time
– Performance
– Fault tolerance and recovery
– Easy to operate, easy to use, easy to share

GrayWulf
(Gray) database & HPC technologies + (Wulf) commodity hardware
Photo: Jim, Peter, Alex, and María at Super Computing 2003 – SDSS DR1, 1 TB!

Architecture
(architecture diagram)

Pan-STARRS
Panoramic Survey Telescope and Rapid Response System
– Primary goal: discover and characterize Earth-approaching objects, both asteroids and comets, that might pose a danger to our planet
– The collected data is rich in scientific value and will be used for Solar System and cosmology studies
Images: M51, the Whirlpool Galaxy, as imaged by the central 4 OTAs in the PS1 camera; Comet Holmes, GPC1 camera in the PS1 telescope; the PS1 telescope on Haleakala (Maui)

Pan-STARRS PS1
Digital map of the sky in 5 colors
– CCD mosaic = 1.4 gigapixels
– 2/3 of the sky, 3 times per month
– Starts at the end of 2008 and will run for 3.5 years
Data volume
– >1 PB/year of raw image data (2.5 TB/night)
– 5.5×10^9 celestial objects
– 350×10^9 detections
– ~100 TB SQL Server database built at JHU (PSPS-ODM)
– 3 copies for performance, fault tolerance, and recovery: 300 TB, the largest astronomy database in the world!
Operated by the Pan-STARRS Science Consortium (PS1SC)
– Hawaii + JHU + Harvard/CfA + Edinburgh/Durham/Belfast + Max Planck Society
Image: raw image of Comet Holmes taken with the GPC1 camera early in the commissioning of the PS1 telescope
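The volume figures above are mutually consistent; a quick back-of-the-envelope check (only the nightly rate and the database sizes are taken from the slide, the rest is arithmetic):

```python
# Sanity-check the PS1 data-volume figures quoted on this slide.
NIGHTLY_RAW_TB = 2.5            # raw image data per night (from the slide)
NIGHTS_PER_YEAR = 365

raw_tb_per_year = NIGHTLY_RAW_TB * NIGHTS_PER_YEAR   # ~912 TB
raw_pb_per_year = raw_tb_per_year / 1000.0

# ~0.9 PB/year from imaging alone; with calibration and other
# overheads this supports the ">1 PB/year" figure.
print(f"raw image data: ~{raw_pb_per_year:.2f} PB/year")

# Three copies of the ~100 TB PSPS-ODM database:
db_footprint_tb = 3 * 100
print(f"database footprint: {db_footprint_tb} TB")   # 300 TB
```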

Pan-STARRS PS4
PS1 is phase 1; PS4: 4 identical telescopes in year 2012
– 4 PB/year in image data
– 400 TB/year into a database
Image: PS4 enclosure concept

Pan-STARRS System Architecture
(diagram of components: DRL, WBI, human and other software clients, ODM, SSDM, other DMs, IPP, MOPS)

SDSS/SkyServer
Pioneer in very large data publishing
– 470 million web hits in 6 years
– 930,000 distinct users vs. 15,000 astronomers
– Delivered 50,000 hours of lectures to high schools
– Delivered >100B rows of data
– Everything is a power law
– Over 2,000 'power users'; more than 2,000 refereed papers

Similarities with SDSS
Strong requirements on spatial queries
– HTM, Zones, algebra on regions in SQL Server
The "20 queries" design methodology
– Queries from Pan-STARRS scientists form the 20 sample queries
– Two objectives: find potential holes/issues in the schema; serve as test queries for integrity and performance
Query Manager (PS1 Data Retrieval Layer of the database), MyDB
– Astronomers and scientists get their own sandbox
– Data is available over the internet
– Provides a place to share
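The Zones technique mentioned above buckets the sky into fixed-height declination stripes so that a spatial match only has to scan a few adjacent zones instead of the whole table; a minimal Python sketch of the idea (the zone height here is illustrative, not the PS1 value):

```python
import math

ZONE_HEIGHT_DEG = 0.5  # illustrative zone height in degrees of declination

def zone_id(dec_deg):
    """Map a declination in [-90, +90] to an integer zone number."""
    return int(math.floor((dec_deg + 90.0) / ZONE_HEIGHT_DEG))

def candidate_zones(dec_deg, radius_deg):
    """Zones that can contain matches within radius_deg of dec_deg."""
    lo = zone_id(max(-90.0, dec_deg - radius_deg))
    hi = zone_id(min(90.0, dec_deg + radius_deg))
    return range(lo, hi + 1)

# A cross-match near the equator only probes a handful of zones:
print(zone_id(0.0))                       # 180
print(list(candidate_zones(0.0, 0.6)))    # [178, 179, 180, 181]
```

In SQL Server the zone number becomes a leading index column, turning a cone search into a small set of range scans.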

Differences with SDSS
Time dimension
– Most regions will be observed multiple times (~60)
– Must store and provide access to temporal history for variability studies
Goes deeper and fainter: contains information from co-added images
Loading process is demanding and sophisticated
– Higher data volume to ingest
– Must add ~120 million new detections per day
– Must provide updated object information monthly (we plan on weekly)
Database size = 100 TB
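The ingest requirement above implies a substantial sustained row rate, which is what makes the loading pipeline "demanding"; a quick check:

```python
# Sustained detection-ingest rate implied by the slide's figure.
DETECTIONS_PER_DAY = 120_000_000   # ~120 million new detections per day
SECONDS_PER_DAY = 86_400

rate = DETECTIONS_PER_DAY / SECONDS_PER_DAY
# ~1389 rows/s sustained, around the clock, before any merge or
# index-maintenance overhead.
print(f"~{rate:.0f} detections/s sustained")
```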

Pan-STARRS PS1 – monolithic table
(table of projected sizes in TB, by table and by year: Objects, StackDetection, StackApFlx, StackModelFits, P2Detection, StackHighSigDelta, other tables, indexes (+20%), and totals, for Year 1, Year 2, Year 3, and Year 3.5; only the Objects value, 2.03, survives in this transcript)
Sizes are in TB

Distributed Architecture
– The bigger tables are spatially distributed across servers called Slices
– Using slices improves system scalability
– Spatial information is embedded in the ObjectID
– Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges
– ObjectID boundaries are selected so that each slice has a similar number of objects
– Distributed Partitioned Views "glue" the data together
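Because the ObjectID encodes position, routing a row to its slice is just a range lookup against the chosen boundaries; a minimal sketch of the idea (the encoding and the boundary values below are illustrative, not the actual PS1 scheme):

```python
import bisect

# Illustrative ObjectID: the high digits encode a declination zone, so
# sorting by ObjectID also sorts by declination band (the real PS1
# encoding differs).
def make_object_id(dec_deg, serial):
    zone = int((dec_deg + 90.0) * 100)        # 0..18000
    return zone * 10_000_000 + serial

# Slice boundaries are chosen offline so each slice holds a similar
# number of objects; here they are simply fixed declinations.
SLICE_BOUNDARIES = [make_object_id(d, 0) for d in (-30.0, 0.0, 30.0, 60.0)]

def slice_for(object_id):
    """Index of the slice server responsible for this ObjectID."""
    return bisect.bisect_right(SLICE_BOUNDARIES, object_id)

oid = make_object_id(12.5, 42)
print(slice_for(oid))   # 2 – the slice covering dec 0..30
```

The distributed partitioned view plays the same role on the query side: SQL Server uses CHECK constraints on each slice's ObjectID range to route a query only to the slices that can contain matching rows.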

Design decisions
– Objects are distributed across slices
– Objects and metadata tables are duplicated in the slices to parallelize queries and facilitate recovery
– Detections belong in their object's slice
– Orphans belong to the slice where their position would place them; orphans near slice boundaries will need special treatment
– Objects keep their original object identifier, even when positional refinement would imply a different objectID

Pan-STARRS
(diagram: the Web Based Interface (WBI) talks to the Query Manager (QM) and, via the DRL, to the PS1 head database, which holds full Objects, Detections, and Meta tables plus partitioned views over the sliced tables [Objects_s1..sm] and [Detections_s1..sm] on the slice databases P_s1..P_sm, connected through linked servers; a Workflow Manager (WFM), Cluster Manager (CSM), and performance monitor manage the cluster. Legend: Database, Full table, [Sliced_table], Partitioned View)

Partitioning  Objects and Detections are partitioned by ObjectID  Object partitions in main server correspond to whole slices PS1 S1S1 SmSm Web Based Interface (WBI) Linked servers PS1 database Query Manager (QM) 19

Layered Architecture
– Layer 0: Staging servers – data transformation
– Layer 1: Loading/Merger servers – scrubbing, loading, merging, …
– Layer 2: Slice and Head servers – user queries
– Layer 3: Admin, Query Manager, and MyDB servers – workflows, cluster management, user databases, …

The Cold, the Hot, and the Warm
(diagram: FITS files flow from the IPP through a DX layer into staging servers (L0), where they are converted to CSV; loading/merger servers LM1..LM6 (L1) scrub and merge them into slice databases s1..s16, which are deployed to slice servers S1..S8 (L2) in the ODM)

Pan-STARRS Layer 2
(diagram: two head servers, H1 and H2, each holding a PS1 head database, in front of slice servers S1..S8 hosting the slice databases s1..s16)

Pan-STARRS Object Data Manager Subsystem
Pan-STARRS: cloud services for astronomers
– Pan-STARRS telescope → Image Processing Pipeline: extracts objects like stars and galaxies from telescope images (~1 TB input/week)
– Loaded astronomy databases: ~70 TB transfer/week
– Deployed astronomy databases: ~70 TB storage/year
– Query Manager: science queries and MyDB for results
– System & administration workflows: orchestrate all cluster changes, such as data loading or fault tolerance
– Configuration, health & performance monitoring: cluster deployment and operations, with System Operation, System Health Monitor, and Query Performance UIs
– MSR Trident Scientific Workbench: tools for supporting workflow authoring and execution
(diagram distinguishes data flow from control flow between these components)

Conclusion
The Pan-STARRS PS1 database will be ~20 times bigger than the Sloan SkyServer (SDSS): a 10-year project in 10 days
– A distributed solution is required; the monolithic solution will not work
– A highly scalable distributed database system has been designed to take advantage of current SQL Server 2008 capabilities, Windows Workflow Foundation, Windows HPC, and Trident
Moore's Law & exponential data growth: petabytes/year by 2010
– Need scalable solutions; move analysis to the data; spatial and temporal features are essential
The same thing is happening in all sciences
– High-energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, …
Application of the Pan-STARRS pattern: 100+ TB becoming common