Introduction to LSST Data Management
Jeffrey Kantor, Data Management Project Manager

LSST Data Management Principal Responsibilities

- Archive Raw Data: Receive the incoming stream of images generated by the Camera system and archive the raw images.
- Process to Data Products: Detect and alert on transient events within one minute of visit acquisition. Approximately once per year, create and archive a Data Release: a static, self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for that Data Release.
- Publish: Make all LSST data available through an interface that uses community-accepted standards, and facilitate user data analysis and production of user-defined data products at Data Access Centers (DACs) and external sites.

LSST From the User's Perspective

- A stream of ~10 million time-domain events per night, detected and transmitted to event distribution networks within 60 seconds of observation (Level 1).
- A catalog of orbits for ~6 million Solar System bodies (Level 1).
- A catalog of ~37 billion objects (20B galaxies, 17B stars), ~7 trillion observations ("sources"), and ~30 trillion measurements ("forced sources"), produced annually and accessible through online databases (Level 2).
- Deep co-added images (Level 2).
- Services and computing resources at the Data Access Centers to enable user-specified custom processing and analysis (Level 3).
- Software and APIs enabling development of analysis codes (Level 3).
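As an illustration of what "accessible through online databases" could look like for a user, here is a minimal Python sketch of a catalog query through a Virtual Observatory TAP service using pyvo. The service URL, table name, and column names are assumptions for illustration only, not the actual LSST endpoints or schema.

    # Minimal sketch of querying an LSST-style Object catalog through a
    # Virtual Observatory TAP service with pyvo.  The URL, table, and
    # columns below are illustrative assumptions.
    import pyvo

    tap = pyvo.dal.TAPService("https://data.example.org/tap")  # hypothetical URL

    adql = """
    SELECT objectId, ra, decl, gMag
    FROM   Object
    WHERE  ra   BETWEEN 150.0 AND 150.5
      AND  decl BETWEEN 2.0   AND 2.5
      AND  gMag < 24.0
    """
    results = tap.search(adql)     # synchronous query
    table = results.to_table()     # astropy Table of matching objects
    print(len(table), "objects returned")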

Data Management System Architecture (Data Management System Design, LDM-148)

Application Layer (LDM-151) - Scientific Layer
- Pipelines constructed from reusable, standard "parts", i.e. the Application Framework
- Data Product representations standardized; metadata extendable without schema change
- Object-oriented Python and C++; custom software

Middleware Layer (LDM-152)
- Portability to clusters, grid, and other platforms
- Provides standard services so applications behave consistently (e.g. provenance)
- Preserves performance (<1% overhead)
- Custom software on top of open source, off-the-shelf software

Infrastructure Layer (LDM-129) - Distributed Platform
- Different sites specialized for real-time alerting, data release production, and peta-scale data access
- Off-the-shelf, commercial hardware and software, with custom integration

Major components in the design: Data Access Services; Processing Middleware; Infrastructure Services (System Administration, Operations, Security); Long-Haul Communications and Physical Plant; Base Site; Archive Site; Science Data Archive (Images, Alerts, Catalogs); Alert, SDQA, Calibration, and Data Release Productions/Pipelines; Application Framework; SDQA and Science Pipeline Toolkits; Science User Interface and Analysis Tools.
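To make the layering concrete, here is a minimal Python sketch, under assumed interfaces (this is not the actual LSST Application Framework API), of how an application-layer pipeline stage can be written against an abstract middleware data-access service so that the science code never touches hardware or storage details.

    # Illustrative sketch of the layered separation described above.
    # Class and dataset names are hypothetical.
    from abc import ABC, abstractmethod

    class DataStore(ABC):
        """Middleware-layer abstraction: hides where and how data are stored."""
        @abstractmethod
        def get(self, dataset_type, data_id): ...
        @abstractmethod
        def put(self, obj, dataset_type, data_id): ...

    class PipelineStage(ABC):
        """Application-framework base class for a reusable pipeline 'part'."""
        @abstractmethod
        def run(self, store: DataStore, data_id: dict): ...

    class InstrumentSignatureRemoval(PipelineStage):
        def run(self, store, data_id):
            raw = store.get("raw", data_id)           # fetched via middleware
            corrected = raw                           # (real ISR algorithm omitted)
            store.put(corrected, "postISR", data_id)  # persisted via middleware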

Mapping Data Products into Pipelines (Data Management Applications Design, LDM-151)

- Data Quality Assessment Pipelines
- Calibration Products Production Pipelines
- Instrumental Signature Removal Pipeline
- Single-Frame Processing Pipeline
- Image Differencing Pipeline
- Alert Generation Pipeline
- Moving Object Pipeline
- Coaddition Pipeline
- Association and Detection Pipelines
- Object Characterization Pipeline
- PSF Estimation
- Science Pipeline Toolkit
- Common Application Framework

The pipelines are grouped by the Level 1, Level 2, and Level 3 data products they produce.
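The sketch below illustrates, with assumed names and a hypothetical chaining API (not LSST code), how the nightly pipelines listed above would be run in sequence for a single visit.

    # Hedged sketch of chaining the nightly (Level 1) pipelines in order.
    LEVEL1_CHAIN = [
        "InstrumentSignatureRemoval",
        "SingleFrameProcessing",
        "ImageDifferencing",
        "AlertGeneration",
        "MovingObject",
    ]

    def run_visit(visit_id, stages, run_stage):
        """Run each nightly stage in order for one visit.

        `run_stage` is a caller-supplied callable (e.g. a middleware
        orchestration hook) that executes a named stage for a visit.
        """
        for stage in stages:
            run_stage(stage, visit_id)

    # Example: dry run that just prints the execution order.
    run_visit(20250401, LEVEL1_CHAIN, lambda s, v: print(f"visit {v}: {s}"))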

Infrastructure: Petascale Computing, Gbps Networks

- The computing cluster at the LSST Archive Site at NCSA will run the processing pipelines: a single-user, single-application data center built from commodity computing clusters.
- A distributed file system provides scaling and hierarchical storage; local-attached, shared-nothing storage is used where high bandwidth is needed.
- Long-haul networks transport data from Chile to the U.S.: 2x100 Gbps from the Summit to La Serena (new fiber) and 2x40 Gbps from La Serena to Champaign, IL (path-diverse, existing fiber).
- Sites: Base Site and Chilean Data Access Center (La Serena, Chile); Archive Site and U.S. Data Access Center (NCSA, Champaign, IL).
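A back-of-the-envelope check of the long-haul capacity, as a Python sketch; the per-visit data volume used here is an illustrative assumption (a 3.2-gigapixel image at 2 bytes per pixel, before compression or overheads), not an official LSST figure.

    # Rough transfer-time estimate over the La Serena -> Champaign link.
    visit_bytes = 3.2e9 * 2          # assumed raw visit size (~6.4 GB)
    link_bps    = 2 * 40e9           # 2 x 40 Gbps international link
    transfer_s  = visit_bytes * 8 / link_bps
    print(f"~{transfer_s:.2f} s to move one visit at full 2x40 Gbps")  # ~0.64 s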

Middleware Layer: Isolating Hardware, Orchestrating Software

Enabling execution of science pipelines on hundreds of thousands of cores:
- Frameworks to construct pipelines out of basic algorithmic components
- Orchestration of execution on thousands of cores
- Control and monitoring of the whole DM System

Isolating the science pipelines from details of the underlying hardware:
- Services used by applications to access and produce data and to communicate
- "Common denominator" interfaces handle changing underlying technologies

Data Management Middleware Design (LDM-152)
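The following minimal Python sketch, using assumed interfaces rather than the actual orchestration middleware, shows the kind of fan-out the middleware layer performs: independent per-CCD units of work are distributed across workers while the science code remains unaware of the hardware.

    # Sketch of parallel per-CCD execution for one visit (assumed interfaces).
    from concurrent.futures import ProcessPoolExecutor

    def process_ccd(work):
        visit, ccd = work
        # placeholder for the real per-CCD pipeline invocation
        return (visit, ccd, "ok")

    def orchestrate(visit, n_ccds=189, max_workers=8):
        """Run one visit's CCDs in parallel (189 science CCDs per LSST focal plane)."""
        work = [(visit, ccd) for ccd in range(n_ccds)]
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(process_ccd, work))

    if __name__ == "__main__":
        statuses = orchestrate(visit=20250401)
        print(sum(s == "ok" for _, _, s in statuses), "CCDs processed")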

Database and Science UI: Delivering to Users

- A massively parallel, distributed, fault-tolerant relational database, built on existing, robust, well-understood technologies (MySQL and xrootd), running on commodity hardware and open source software; an advanced prototype (qserv) already exists.
- A Science User Interface enabling access to and analysis of LSST data: web and machine interfaces to the LSST databases, plus visualization and analysis capabilities.

More: talks by Becla, Van Dyk
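To illustrate the idea behind a spatially partitioned catalog database such as qserv, here is a simplified Python sketch of query decomposition: the sky is divided into chunks, a spatially constrained query is rewritten against only the chunks it touches, and the per-chunk results are merged. The chunking scheme and table naming are simplifying assumptions, not the real qserv implementation.

    # Toy spatial-sharding sketch: decompose one box query into per-chunk SQL.
    CHUNK_DEG = 5.0   # assumed chunk size: 5-degree tiles in RA and Dec

    def chunks_for_box(ra_min, ra_max, dec_min, dec_max):
        """Return (ra_tile, dec_tile) ids overlapping a search box."""
        return [(i, j)
                for i in range(int(ra_min // CHUNK_DEG), int(ra_max // CHUNK_DEG) + 1)
                for j in range(int((dec_min + 90) // CHUNK_DEG),
                               int((dec_max + 90) // CHUNK_DEG) + 1)]

    def per_chunk_queries(ra_min, ra_max, dec_min, dec_max):
        """Rewrite one user query into subqueries against per-chunk tables."""
        template = ("SELECT objectId, ra, decl FROM Object_{i}_{j} "
                    "WHERE ra BETWEEN {r0} AND {r1} AND decl BETWEEN {d0} AND {d1}")
        return [template.format(i=i, j=j, r0=ra_min, r1=ra_max, d0=dec_min, d1=dec_max)
                for i, j in chunks_for_box(ra_min, ra_max, dec_min, dec_max)]

    for q in per_chunk_queries(150.0, 150.5, 2.0, 2.5):
        print(q)  # each subquery would run on the node holding that chunk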

Critical Prototypes: Algorithms and Technologies

- Petascale Database Design: conducted parallel database tests up to 300 nodes and 100 TB of data, 100% of the scale needed for operations year 1.
- Petascale Computing Design: executed in parallel on up to 10k cores (TeraGrid/XSEDE and NCSA Blue Waters hardware) with scalable results.
- Algorithm Design: approximately 60% of the software functional capability has been prototyped; over 350,000 lines of C++ and Python have been coded, unit tested, integrated, and run in production mode; three terabyte-scale datasets have been released, including single-frame measurements and point-source and galaxy photometry; precursor projects leveraged include Pan-STARRS, SDSS, and HSC.
- Gigascale Network Design: currently testing at up to 1 Gbps; agreements in principle are in hand with key infrastructure providers (NCSA, FIU/AmPath, REUNA, IN2P3).

Data Management Scope is Defined and Requirements are Established

- Data Product requirements have been vetted with the Science Collaborations multiple times and successfully passed review (Jul '13).
- Data quality and algorithmic assessments are far advanced; we understand the risks and successfully passed review (Sep '13).
- Hardware sizing has been refreshed based on the latest scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy.
- Interfaces are defined to the Phase 2 level.
- Requirements and Final Design have been baselined (Data Management Technical Control Team).
- Traceability from the OSS to the DMSR has been verified.
- All WBS elements have been estimated and scheduled in PMCS with scope and basis of estimate documented.

Data Management ICDs needed for Construction start are at Phase 2 Level

[The slide shows a status matrix of interface control documents: most are under formal change control, a few are in progress (Phase 1).]

ICDs on Confluence: Docushare:

Going Where the Talent is: Distributed Team

Team areas: Infrastructure; Middleware; Science Pipelines; Database; User Interfaces; Management, I&T, and Science QA.

Data Management Organization (document-139): LSST DM Leadership

- Project Manager: J. Kantor
- Project Scientist: M. Juric
- System Architecture: K-T. Lim, G. Dubois-Felsmann (SLAC)
- Science User Interface & Tools: X. Wu, D. Ciardi (IPAC)
- Survey Science Group: SSG Lead Scientist TBD; F. Economou (LSST)
- Alert Production: A. Connolly (UW/OPEN)
- International Comms/Base Site: R. Lambert (NOAO)
- Processing Services & Site Infrastructure: D. Petravick (NCSA)
- Science Database & Data Access Services: J. Becla (SLAC)
- Data Release Production: R. Lupton, J. Swinbank (Princeton)

DM lead institutions are integrated into one project and are performing in their construction roles/responsibilities.

Leveraging National and International Investments

NSF/OCI funded:
- Formal relationships continue with the IRNC-funded AmLight project, the lead entity in securing Chile-US network capacity for LSST.
- We have leveraged significant XSEDE and Blue Waters service-unit and storage allocations for critical R&D-phase prototypes and productions.
- Our LSST Archive Center and US Data Access Center will be hosted in the National Petascale Computing Facility at NCSA.
- A strong relationship has been established with the Condor Group at the University of Wisconsin, and HTCondor is now in our processing middleware baseline.
- We have reused a wide range of open source software libraries and tools, many of which received seed funding from the NSF.

Other national/international funded:
- We have participated in joint development of astronomical software with Pan-STARRS and HSC.
- We have fostered collaborative development of scientific database technology via the eXtremely Large Data Base (XLDB) conferences and collaborations with database developers (e.g. SciDB, MySQL, MonetDB).
- We have a deep process of community engagement to deliver products that are needed, and an architecture that allows the community to deliver their own tools.

Data Management is Construction Ready

The Data Management System is scoped and credibly estimated:
- Requirements have been baselined and are achievable (LSE-61).
- The Final Design is baselined (LDM-148, -151, -152, -129, -135).
- Approximately 60% of the software functional capability has been prototyped.
- Data and algorithmic assessments are far advanced and we understand the risks.
- Hardware sizing has been done based on scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy.
- All lowest-level WBS elements have been estimated and scheduled in PMCS with scope and basis of estimate documented.

All lead institutions are demonstrably integrated into one project and are performing in their construction roles/responsibilities:
- Core lead technical personnel are on board at all institutions.
- Agreements in principle are in hand with key technology and center providers (NCSA, NOAO, FIU/AmPath, REUNA).

The software development process has been exercised fully:
- Eight software and data releases have been successfully executed.
- Standard, formal processes, tools, and environments have been exercised repeatedly and refined.
- An automated build and test environment is configured and exercised nightly/weekly.

Data Management PMCS plans are current and complete.