National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Challenges of Analyzing Large Environmental Data Sets

Presentation transcript:

National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California
Challenges of Analyzing Large Environmental Data Sets
Dan Crichton, Program Manager, Earth and Planetary Science Data Systems
Amy Braverman, Senior Statistician
NASA/JPL

Massive Data Sets in the Environmental Sciences
Environmental science areas (not exhaustive):
– Climate change science/climate modeling
    Global
    Regional
– Environmental quality
    Pollution
    Epidemiology
    Land use and natural resource management
– Decision support and disaster management
    Climate change impacts
    Policy decisions and treaty enforcement
    Disaster response (flooding, drought, volcanoes, etc.)

Massive Data Sets in Climate Science
Climate model output:
– originally intended as laboratory experiments to play what if (explore the physics by twiddling knobs and seeing what happens)
– now have greater policy implications wrt predictions into the future, attribution of causes, and characterizing uncertainties
Observations:
– Improve process understanding and formulate hypotheses through exploratory data analysis
– Improve parameterizations (statistical description of sub-grid-scale processes)
– Establishment of long term data records
– Model evaluation
    comparison of model output against observations
    weighting multi-model ensemble members

Architecture Drivers: Data Intensive Science
Increasing data volumes requiring new approaches for data production, validation, processing, discovery and data transfer/distribution (e.g., scalability relative to available resources)
– Roughly doubling in size every two years
– Shift from compute to data intensive
Increased emphasis on usability of the data (e.g., discovery, access and analysis)
Increasing diversity of data sets and complexity for integrating across missions/experiments (e.g., common information model for describing the data)
– the benefits to science of bringing together and creating fused data products from multiple sources are critical in areas such as climatology where baseline data records are needed across measurements
Increasing distribution of coordinated processing, operations and analysis (e.g., federation)
On the fly analysis
Increased pressure to reduce cost of supporting new missions
Increasing desire for PIs to have integrated tool sets to work with data products within their own environments (e.g., perform their own generation and distribution)
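
As a rough worked reading of the "doubling every two years" bullet (assuming smooth exponential growth, which the slide does not state; V_0 is a hypothetical starting volume and t is in years), the implied yearly growth factor and the time to a tenfold increase would be:

```latex
V(t) = V_0 \cdot 2^{t/2}, \qquad
\frac{V(t+1)}{V(t)} = 2^{1/2} \approx 1.41, \qquad
t_{10\times} = 2 \log_2 10 \approx 6.6 \ \text{years}
```

Under that assumption, holdings grow by about 41% per year and by an order of magnitude in under seven years.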

NASA Earth Science Data Pipeline
[Diagram. Recoverable labels:
– Data Acquisition and Command: Mission Operations, TDRS Network, On Board Processing
– Instrument Operations, EDOS/Ground Data Systems (EDOS/GDS), L0A Processing
– Science Data Systems (SDS): Science Data Processing (L0B, L1, L2, L3, L4)
– EOSDIS Data Centers (EOSDIS DAAC): Science Data Management, Archive & Distribution
– Science Teams, Research, Outreach]

EOSDIS DAACs
Earth Observing System Data and Information System Distributed Active Archive Centers

Using Satellite Observations to Enable Climate Model Evaluation
How to bring as much observational scrutiny as possible to the IPCC process?
How to best utilize the wealth of NASA Earth observations for the IPCC process?
Next Target: IPCC AR5
– Model Output Available for Analysis: Spring 2011
– Papers Due: ~Late 2011/Early 2012
– Report Completion: 2013

Earth System Grid Federation
DOE-funded federation to distribute climate model output to the climate modeling community
Common services for access to repositories and portals/gateways
Highly decoupled
Open source framework (software packaged and distributed) mandated by DOE SciDAC Program
A recent question: how do you link observations and climate model output?

ESG – NASA Integration

Moving to Data Intensive Science
Traditional Pipelines vs. Online Dynamic Services
– Convergence between static pipelines and on-the-fly data processing and services
Analysis of Distributed Data through Distributed Computational Services
– Push computational services to data
Fused Data Products
– Generate new, fused data products
Virtual Research Networks
– Provide a computing infrastructure for collaborative research

Traditional Analysis Approach
User program must encode all functionality beyond gross-level access.
Requires knowledge of specific instrument characteristics such as retrieval methods, format, measurement error characteristics and biases, etc.
Difficulties multiply with more than one data source.
Credit: Braverman, Mattmann, Crichton

Emerging Paradigm for Analysis
Push as much computation as possible to locations where the data reside; minimize data movement
Deploy simple services to data centers that provide access and the computational functions to enable model-to-data analysis
– Embrace service-oriented style of architecture
Credit: Braverman, Mattmann, Crichton
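
The idea of pushing computation to the data can be sketched in a few lines. This is a toy, in-process illustration only: `DataCenter`, `partial_sum`, and `federated_mean` are hypothetical names, not part of any NASA or ESG service, and a real deployment would expose the local computation over a network interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataCenter:
    """Hypothetical stand-in for an archive that holds its own observations."""
    name: str
    values: List[float]  # toy local holdings

    def partial_sum(self, lo: float, hi: float) -> Tuple[float, int]:
        """Run the computation where the data live; return only (sum, count)."""
        selected = [v for v in self.values if lo <= v <= hi]
        return sum(selected), len(selected)


def federated_mean(centers: List[DataCenter], lo: float, hi: float) -> float:
    """Combine small per-center summaries instead of downloading raw values."""
    total, count = 0.0, 0
    for center in centers:
        s, n = center.partial_sum(lo, hi)
        total += s
        count += n
    if count == 0:
        raise ValueError("no values in the requested range")
    return total / count


# Toy holdings at two hypothetical centers.
centers = [
    DataCenter("center_a", [280.0, 285.0, 290.0]),
    DataCenter("center_b", [275.0, 295.0]),
]
mean_all = federated_mean(centers, 270.0, 300.0)
```

Each center returns two numbers regardless of how many values it holds, which is the data-movement saving the slide argues for.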

Data Integration
Combining AIRS and MLS requires:
– Rectifying horizontal, vertical and temporal mismatch
– Assessing and correcting for the instruments' scene-specific error characteristics (see left diagram)

Model Intercomparison: Regional Example
[Workflow diagram for RCMET. Inputs: Model file, Observations. The diagram also carries an "optional" label. Steps:
– Collect User Choices (GUI / command line)
– Load model data
– Retrieve obs from database
– Spatial re-gridding onto common grid
– Time averaging (e.g. calculate monthly means from daily data)
– Area-averaging (e.g. calculate area-weighted mean over user defined masked region)
– Annual cycle compositing (e.g. calculate means of all Januarys, all Februarys etc)
– Metric Calculation (e.g. calculate bias, RMS error etc)
– Plot production (e.g. map, time series plot, Taylor diagram)]
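
Several of the workflow steps above reduce to short calculations. The following is a minimal sketch in plain Python, not RCMET code: the function names are hypothetical, the inputs are toy values assumed to be already on a common grid, and cos(latitude) weighting is an assumed stand-in for true grid-cell area.

```python
import math
from typing import Dict, List, Sequence


def area_weighted_mean(values: Sequence[float], lats_deg: Sequence[float]) -> float:
    """Mean weighted by cos(latitude), a common proxy for grid-cell area."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def annual_cycle(monthly: Sequence[float], first_month: int = 1) -> Dict[int, float]:
    """Average all Januarys, all Februarys, etc. from a monthly series."""
    buckets: Dict[int, List[float]] = {}
    for i, value in enumerate(monthly):
        month = (first_month - 1 + i) % 12 + 1
        buckets.setdefault(month, []).append(value)
    return {m: sum(v) / len(v) for m, v in sorted(buckets.items())}


def bias(model: Sequence[float], obs: Sequence[float]) -> float:
    """Mean of (model - observation)."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def rms_error(model: Sequence[float], obs: Sequence[float]) -> float:
    """Root-mean-square of (model - observation)."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


# Toy inputs: two years of monthly values; the "model" is offset by +1.0.
obs_monthly = [float(m) for m in range(1, 13)] + [float(m + 2) for m in range(1, 13)]
model_monthly = [v + 1.0 for v in obs_monthly]

obs_cycle = annual_cycle(obs_monthly)
model_cycle = annual_cycle(model_monthly)
months = sorted(obs_cycle)
cycle_bias = bias([model_cycle[m] for m in months], [obs_cycle[m] for m in months])
cycle_rmse = rms_error([model_cycle[m] for m in months], [obs_cycle[m] for m in months])

# Two grid rows at the equator and at 60 degrees north.
zonal_mean = area_weighted_mean([2.0, 4.0], [0.0, 60.0])
```

Re-gridding, database retrieval, and plotting are left out; they depend on data formats and libraries the slide does not specify.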

Computational Vision
[Diagram. Recoverable labels:
– Data Acquisition and Command: Mission Operations, TDRS Network, On Board Processing
– Instrument Operations, EDOS/Ground Data Systems (EDOS/GDS), L0A Processing
– Network w/ Cloud Storage & Computation
– NASA Mission/Multi-Mission Data & Science Centers: Science Data Processing, Science Data Management
– Other Data Systems (e.g. NOAA)
– Analysis, Modeling and Application Environments/Gateways
– Applications, Decision Support, Research, Science Teams]

Research Challenges for Statistics
What architectural design produces the most efficient system topology for the types of data movement that will be required given scientific objectives? Can we study this as an optimization problem?
How do we design computational methods that exploit the system topology and its distributed nature?
Need algorithms that operate on distributed data to produce statistics of interest, or approximations. Study this trade-off.
Data analysis choreography: how to assemble algorithms most efficiently given a set of analysis goals? How to optimize the movement of data?
How can statistics and other disciplines (e.g., computer science) education be better aligned?
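
One way to picture the exact-versus-approximate trade-off is a quantile computed from merged histograms. This is an illustrative sketch only, not a method from the talk: the function names, the fixed bin layout, and the toy site data are all assumptions. Each site ships a short vector of counts rather than its raw values, and the answer is accurate to roughly one bin width.

```python
from typing import List, Sequence


def local_histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Computed at each site: counts per equal-width bin over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


def merge_histograms(histograms: Sequence[Sequence[int]]) -> List[int]:
    """Computed centrally: bin-wise sum of the per-site counts."""
    return [sum(column) for column in zip(*histograms)]


def approx_quantile(counts: Sequence[int], lo: float, hi: float, q: float) -> float:
    """Estimate the q-quantile by linear interpolation inside the target bin."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    width = (hi - lo) / len(counts)
    target = q * total
    running = 0
    for i, c in enumerate(counts):
        if c > 0 and running + c >= target:
            frac = (target - running) / c
            return lo + (i + frac) * width
        running += c
    return hi


# Toy data held at two hypothetical sites; only the counts are moved.
site_a = [1.0, 2.0, 3.0, 4.0]
site_b = [6.0, 7.0, 8.0, 9.0]
merged = merge_histograms([
    local_histogram(site_a, 0.0, 10.0, 10),
    local_histogram(site_b, 0.0, 10.0, 10),
])
median_estimate = approx_quantile(merged, 0.0, 10.0, 0.5)
```

More bins cost more bytes per site and buy a tighter estimate; an exact median would in general require moving or repeatedly querying the raw values.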

Summary
Significant efficiencies may be achieved by thinking of data analysis and data access together rather than thinking of them as serial operations.
In this paradigm, data sets are not static entities. They are virtual, possibly streaming data structures flowing across the internet, manipulated and combined on-the-fly as necessary for specific analyses.
We need new statistical methods and algorithms optimized for this type of environment.