Published by Hilda Blake. Modified over 9 years ago.
1
Understanding and Comparing Remote Sensing Data to Model Output
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
2
Roadmap
Motivation
Background
Earth System Grid, NASA
Inserting observations into AR5
Why is this so difficult?
Data management issues
Architectural issues
Approaches for dealing with observations and models
Approaches for comparing observations to models
Architectural patterns
Example: AIRS Level 2 data to NCAR CCSM model output
Tool support
Wrap-up
25-Mar-11 CORDEX-MATTMANN
3
And you are?
Apache Member involved in
– OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor), Gora (Champion), and MRUnit (Mentor)
Senior Computer Scientist at NASA JPL in Pasadena, CA, USA
Software Architecture/Engineering Prof. at Univ. of Southern California
4
Motivation
How to bring as much observational scrutiny as possible to the IPCC process?
How to best utilize the wealth of NASA Earth science information for the IPCC process?
Credit: Waliser, Teixeira, Crichton, Ferraro
5
Inserting Observations in the IPCC
Observations play a critical role in climate research
– Process understanding: exploratory data analysis, hypothesis formulation
– Parameterization and model development: statistical description of sub-grid-scale processes, hypothesis testing
– Model evaluation (IPCC): comparison of model output against observations, weighting multi-model ensemble members ("scoring")
NASA is at a critical juncture in inserting observations into AR5
– Climate research community recognizes the importance of comparing models to data
– The infrastructures, different formats, etc. make this a challenging problem
– Time, however, is limited
Credit: Amy Braverman
6
DOE Earth System Grid
Purpose
– Provide climate researchers worldwide with access to the data, information, models, analysis tools, and computational resources required to make sense of enormous climate simulation datasets
Scope
– Petabyte-scale data volumes
– Gateway to climate change data products, model outputs and informational sites (i.e., globally federated sites)
– Comprehensive registry of climate change Earth science research results and components
– Support for climate change scientists and their partner scientists, analysts, data managers, educators and decision makers
– Resource for national and international science and societal benefit initiatives
– Access to climate change data products through interoperable web services and climate analysis tools
Credit: Dean Williams
7
ESG Principal Sites
Credit: Dean Williams
8
ESG Conceptual Overview
[Figure labels: Standard Browser, Web Services]
Credit: Dean Williams
9
The Next-generation ESG
Independent gateways federating metadata, users.
Individual data nodes responsible for publishing services.
Designed for model output data sets.
10
ESG Gateways and Nodes
Federated architecture
– Federation is a virtual trust relationship among independent management domains that have their own set of services. Users authenticate once to gain access to data across multiple systems and organizations.
Gateways
– Where data is discovered, requested
– Portals, search capability, distributed metadata, registration and user management
– May be customized to an institution's requirements, topical focus
– More complex architecture than nodes, fewer sites
– Initially PCMDI, NCAR, ORNL, eventually GFDL
Nodes
– Where data is stored and published
– Data may be on disk or tertiary mass store
– Each data node can publish to any gateway (facilitates topical gateways)
– Data reduction/analysis
– Less complex architecture, including possible minimalist deployment w/o services
– Anticipate ~20 data nodes for CMIP5, many others have expressed interest
Sites
– A site can be both a gateway and a data node
Credit: Dean Williams
11
NASA Distributed Active Archive Centers (DAACs)
12
NASA Earth Science Data: Broader Picture
13
Observations in AR5
In AR4, the Earth System Grid played an input role in providing models for climate research
In AR5, the ESG is being extended as a fully distributed online data system to support access to climate models via the ESG portals
What is needed, however, is the link to satellite observations and the convergence between the observational and modeling communities
"The reliability of projections could be improved if the models were weighted according to some measure of skill... Since there is no verification for a climate forecast on timescales of decades to centuries, the skill or performance of the models needs to be defined, for example, by comparing simulated patterns of present day climate to observations."
Scoping of the IPCC 5th Assessment Report, IPCC Working Group, April 2009
14
Long Term Objective
Establish a NASA-wide capability for the climate modeling community to support model-to-data intercomparison:
– Ensure observations are available alongside models
– Develop a common approach for sharing observations with the climate research community
– Leverage existing data systems within NASA and ESG
– Ensure that NASA R&A programs have the necessary infrastructure to support model-to-data verification and data analysis
– Provide phased capabilities for AR5 and AR6
Develop a strong collaboration between observation and modeling communities (both science and technical)
– JPL and PCMDI have a very good working relationship
15
Challenges with Observational Data
Massive
– They entail detailed information about processes through multivariate distributions on multiple spatial and temporal scales
Heterogeneous
– Have a variety of organizational structures, retrieval methods, sampling characteristics, and meaning (not like model output!)
Distributed
– Are stored all over the country and the world, with EOSDIS being a principal infrastructure
Analysis
– Access and computational capabilities are needed to assemble and perform analysis "on-the-fly"
16
Traditional Paradigm
User program must encode all functionality beyond gross-level access.
Requires knowledge of specific instrument characteristics such as retrieval methods, format, measurement error characteristics and biases, etc.
Difficulties multiply with more than one data source.
Credit: Braverman, Mattmann, Crichton
17
Emerging Paradigm
Push as much computation as possible to locations where the data reside; minimize data movement
Deploy simple services to data centers that provide access and the computational functions to enable model-to-data analysis
– Embrace service-oriented style of architecture
Credit: Braverman, Mattmann, Crichton
18
Science Data File Formats
Hierarchical Data Format (HDF)
– http://www.hdfgroup.org
– Versions 4 and 5
– Lots of NASA data is in 4, newer NASA data in 5
– Encapsulates observations (scalars, vectors, matrices, NxMxZ…) and metadata (summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages: C/C++, Python, Java
– Most NASA observational data is in HDF format
19
Science Data File Formats
Network Common Data Form (netCDF)
– www.unidata.ucar.edu/software/netcdf/
– Versions 3 and 4
– Heavily used in DOE, NOAA, etc.
– Encapsulates observations (scalars, vectors, matrices, NxMxZ…) and metadata (summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages: C/C++, Python, Java
– Not a hierarchical representation: all flat (true of the classic version 3 model; netCDF-4 adds groups)
– Most climate model output is in netCDF
20
Tools to extract data from scientific data formats?
There are actually quite a few, ranging from…
– GUIs and higher-level (more sophisticated) software: R, Matlab, IDL, NCL, etc.
– Intermediate APIs: NetCDF-Java, NetCDF C API, HDF4/5 API
– Low-level command-line tools: UNIX strings command
They share one concern: break down the binary file format and give you
– Metadata (start/end date-time boundaries, spatial boundaries, abstract, investigator name, mission name, etc.)
– The actual data
Let's take an example: Apache Tika (metadata)
21
Apache Tika is…
A content analysis and detection toolkit
A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
A rich Metadata API for representing different Metadata models
A command line interface to the underlying Java code
A GUI interface to the Java code
http://tika.apache.org
22
Bootstrapping
Download Tika from:
– http://tika.apache.org/download.html
Grab tika-app-0.9.jar
– http://repo1.maven.org/maven2/org/apache/tika/tika-app/0.9/tika-app-0.9.jar
Set up an alias (csh/tcsh syntax; in bash use alias tika='java -jar tika-app-0.9.jar'):
    alias tika "java -jar tika-app-0.9.jar"
Extract text and metadata, where INPUT_FILE is the file you want to analyze:
    tika < INPUT_FILE > extracted-text.xhtml
    tika -m < INPUT_FILE > extracted.met
Works on Windows too (alias only on UNIX)
23
A quick NASA dataset
Atmospheric Infrared Sounder (AIRS) mission
– Level 2 Cloud-Cleared Radiance Product
– Grab it from here: ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
– Just grab the first file
    java -jar tika-app-0.9.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
– Hopefully this worked for you; if not, blame Bruce
– And Windows
– And Bill Gates
24
So you can get info from the file, what to do with it?
You guys know plenty more about that than me!
However…
– Let's take an example where we want to extract a time series of temperature profile information from AIRS Level 2 datasets
– …and then compare it with model output from the NCAR Community Climate System Model (CCSM)
– "Compare" here means computing some statistic, e.g., averages, that we can then compare between measured and predicted values
25
Some initial parameters
AIRS Level 2 Standard Products
– HDF4, with HDF-EOS metadata
– Housed in several places: AIRS TLSCF (JPL, Pasadena, West Coast), NASA GES DISC (Goddard, Maryland, East Coast)
NCAR CCSM model output
– NetCDF, with CF metadata
– Housed in several places; canonical source is the Earth System Grid at Lawrence Livermore National Laboratory (LLNL), Livermore, CA
26
What's the process?
27
Step 1: AIRS data
Decide on some set of AIRS data to select
– Time bounds (e.g., January 2007)
– Spatial bounds (lat/lon box)
Understand AIRS data
– 240 files per day, broken down into 6-minute granules
– Each file is in HDF4 format, with measured values for each variable that is part of the Level 2 standard product
– Understand the variable name: TAirStd
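The 240-files-per-day layout means a granule's time window can be estimated from its file name alone, which helps when selecting by time bounds. A minimal sketch, assuming the name follows the pattern of the AIRS file used earlier in the talk (`AIRS.YYYY.MM.DD.GGG...`) and that granule 1 starts at midnight UTC; real granule boundaries may be offset from midnight by a few minutes, and `nominal_granule_window` is a made-up helper name.

```python
from datetime import datetime, timedelta

GRANULES_PER_DAY = 240                              # from the slide
GRANULE_MINUTES = 24 * 60 // GRANULES_PER_DAY       # 6 minutes per granule


def nominal_granule_window(filename):
    """Return (start, end) of the granule's nominal 6-minute window.

    Assumes names like AIRS.2007.01.05.001.L2... where the fifth
    dot-separated field is the granule number (1..240).
    """
    parts = filename.split(".")
    year, month, day, granule = (int(p) for p in parts[1:5])
    if not 1 <= granule <= GRANULES_PER_DAY:
        raise ValueError(f"granule number out of range: {granule}")
    # Assumption: granule 1 starts at 00:00 UTC (only approximately true).
    start = datetime(year, month, day) + timedelta(
        minutes=(granule - 1) * GRANULE_MINUTES
    )
    return start, start + timedelta(minutes=GRANULE_MINUTES)


start, end = nominal_granule_window(
    "AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf"
)
print(start, end)
```

For exact times, read the time variable stored in the granule itself rather than relying on the file name.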
28
Step 1a: Obtain AIRS data
Some options
– Go to the GES DISC and get the AIRS data from their FTP server: boo!
– Get just the AIRS data you need from a web service (OPeNDAP), i.e., subset it: better!
– Subset out the TAirStd 45x30 matrix, and only the part of that matrix that corresponds to your spatial region of interest
– Requires that you know which variables are used for lat, lon, and time (stored in separate 45x30 matrices)
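With OPeNDAP, subsetting is expressed in the request URL as a constraint: each variable name is followed by one `[start:stride:stop]` index range per dimension. A minimal sketch of building such a URL follows. The server address and file name are hypothetical, `Latitude`, `Longitude` and `Time` are assumed variable names, and each variable is treated as a 2-D 45x30 matrix as on the slide.

```python
def hyperslab(variable, *index_ranges):
    """Return e.g. 'TAirStd[10:1:20][5:1:15]' (start:stride:stop, stop inclusive)."""
    return variable + "".join(
        f"[{start}:1:{stop}]" for start, stop in index_ranges
    )


def subset_url(dataset_url, variables, row_range, col_range, response="ascii"):
    """Build a request URL asking for the same index box from each variable."""
    constraint = ",".join(hyperslab(v, row_range, col_range) for v in variables)
    return f"{dataset_url}.{response}?{constraint}"


# Hypothetical server and file name, for illustration only.
url = subset_url(
    "http://example.org/opendap/AIRS_granule.hdf",
    ["TAirStd", "Latitude", "Longitude", "Time"],
    row_range=(10, 20),
    col_range=(5, 15),
)
print(url)
```

Because swath data is not on a regular grid, you typically have to fetch the lat/lon matrices first to find out which index ranges cover your region.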
29
Step 1b: So you've got 240 * 31 files = 7440 files
Each one of these is pretty big (order of gigabytes)
– Let's assume 2 GB per file
– That would mean you need ~15 TB of space (7440 × 2 GB ≈ 14.9 TB) just to get your obs data: eeep!
Better idea:
– Many of those 7440 files aren't over your region of interest, so discard the ones that aren't
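The arithmetic is worth checking: with the assumed 2 GB per file, 7440 files come to about 14.9 TB. The sketch below works that out and then shows one way to discard granules that miss the region of interest. It assumes each granule comes with a known bounding box; the granule names and boxes are made up, and dateline-crossing granules are not handled.

```python
FILES_PER_DAY = 240
DAYS_IN_JANUARY = 31
GB_PER_FILE = 2          # the slide's assumption

total_files = FILES_PER_DAY * DAYS_IN_JANUARY       # 7440
total_tb = total_files * GB_PER_FILE / 1000         # 14.88 (decimal TB)


def overlaps(granule_box, region_box):
    """True if two (lat_min, lat_max, lon_min, lon_max) boxes intersect."""
    g_lat0, g_lat1, g_lon0, g_lon1 = granule_box
    r_lat0, r_lat1, r_lon0, r_lon1 = region_box
    return (g_lat0 <= r_lat1 and r_lat0 <= g_lat1
            and g_lon0 <= r_lon1 and r_lon0 <= g_lon1)


def keep_relevant(granules, region_box):
    """Keep only the names of granules whose box touches the region."""
    return [name for name, box in granules if overlaps(box, region_box)]


# Made-up granules: (name, (lat_min, lat_max, lon_min, lon_max)).
granules = [
    ("granule_001", (-10.0, 5.0, 20.0, 40.0)),
    ("granule_002", (50.0, 65.0, -120.0, -100.0)),
]
region = (-5.0, 10.0, 25.0, 35.0)
kept = keep_relevant(granules, region)
print(total_files, total_tb, kept)
```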
30
What's the process?
31
Step 2
Given a subset list of those 7440 files (let's say 1500 or so)
For each file
– Subset out each TAirStd 45x30 matrix from the file (and believe it or not, you may not even need all of those 45x30 matrices either), which results in a set of data points X = (v)
– Subset out lat, lon and time and attach them to the corresponding value to yield a 4-tuple X = (v, t, lat, lon)
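Once the value, lat, lon and time matrices have been read from a file, pairing them up is a plain element-by-element walk. A minimal sketch, assuming the four matrices are already in memory as equally shaped nested lists; the fill value of -9999.0 used to mark missing data is an assumption, and `to_tuples` is a hypothetical helper name.

```python
def to_tuples(values, lats, lons, times, region, fill_value=-9999.0):
    """Flatten same-shaped 2-D matrices into (v, t, lat, lon) tuples.

    region is (lat_min, lat_max, lon_min, lon_max); points outside it,
    and points equal to fill_value, are dropped.
    """
    lat_min, lat_max, lon_min, lon_max = region
    points = []
    for i, row in enumerate(values):
        for j, v in enumerate(row):
            if v == fill_value:
                continue                      # missing retrieval
            lat, lon = lats[i][j], lons[i][j]
            if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
                points.append((v, times[i][j], lat, lon))
    return points


# Tiny 2x2 stand-in for a 45x30 granule.
values = [[280.0, -9999.0], [285.0, 290.0]]
lats = [[1.0, 1.0], [2.0, 50.0]]
lons = [[10.0, 11.0], [10.0, 11.0]]
times = [[0.0, 0.0], [1.0, 1.0]]
points = to_tuples(values, lats, lons, times, region=(0.0, 5.0, 5.0, 15.0))
print(points)
```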
32
Step 2a
Hidden assumption
– Step 2 is easy: IT'S NOT
In fact, Step 2 is usually one of the hardest parts, since not all of these NASA or NOAA datasets include a subset function
The datasets themselves may have different temporal properties (compared to models)
– AIRS data relevant only at 1:30am and 1:30pm local time
Different spatial properties too: 500m level
33
Sample GHRSST L2 Data Set Image
Notice that the lines of longitude and latitude are not perfectly straight. This makes it more difficult to locate equator crossings.
34
What's the process?
35
Step 3
Given a set of data point tuples X = (v, t, lat, lon)
– Build up a cube of the form lon by lat by time
– "Regrid" the resultant satellite data onto this cube
– Make this cube match up to the gridding properties of your model
– Maybe a 1 deg by 1 deg grid box over the area that you care about
– Maybe daily, monthly, hourly: your model will dictate this!
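One simple way to do the regridding is bin averaging: drop every tuple into the grid cell and time slot it falls in, then average each bin. The sketch below does that for a regular grid. It is only one of several regridding approaches and not necessarily what any particular tool uses; time is assumed to be a plain number (e.g. days since the start), and `regrid` and its parameters are illustrative names.

```python
import math
from collections import defaultdict


def regrid(points, lat0, lon0, cell_deg, n_lat, n_lon, t0, t_step, n_time):
    """Bin-average (v, t, lat, lon) tuples onto a lon x lat x time cube.

    Returns cube[i][j][k] (lon, lat, time index); empty cells hold None.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t, lat, lon in points:
        i = math.floor((lon - lon0) / cell_deg)
        j = math.floor((lat - lat0) / cell_deg)
        k = math.floor((t - t0) / t_step)
        if 0 <= i < n_lon and 0 <= j < n_lat and 0 <= k < n_time:
            sums[(i, j, k)] += v
            counts[(i, j, k)] += 1
    cube = [[[None] * n_time for _ in range(n_lat)] for _ in range(n_lon)]
    for (i, j, k), n in counts.items():
        cube[i][j][k] = sums[(i, j, k)] / n
    return cube


# Three made-up points on a 1 deg grid with daily time steps.
points = [
    (280.0, 0.2, 10.5, 20.5),
    (282.0, 0.7, 10.2, 20.9),
    (290.0, 1.5, 11.5, 20.5),
]
cube = regrid(points, lat0=10.0, lon0=20.0, cell_deg=1.0,
              n_lat=2, n_lon=1, t0=0.0, t_step=1.0, n_time=2)
print(cube)
```

The grid origin, cell size and time step should be taken from the model you are comparing against.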
36
Step 3a
Given a satellite data "regridded cube", it's fairly trivial to compute stats on that cube that match up to the model
– Averages/time: sum each lat/lon 2-D sheet, one sheet per time step (time is the z axis in the cube)
– Means/time: derive the mean for each lat/lon 2-D sheet over time (the z axis in the cube)
– Etc. etc.
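Two common reductions on such a cube are a mean per time step (collapsing each lat/lon sheet to one number) and a mean per grid cell (collapsing the time axis). A minimal sketch of both, assuming the cube is a nested list indexed `cube[lon][lat][time]` with `None` marking empty cells; the means are unweighted, so no area weighting is applied here.

```python
def _mean(values):
    """Mean of a list, or None if the list is empty."""
    return sum(values) / len(values) if values else None


def sheet_means(cube):
    """One mean per time step, taken over the whole lat/lon sheet."""
    n_time = len(cube[0][0])
    result = []
    for k in range(n_time):
        sheet = [series[k] for plane in cube for series in plane
                 if series[k] is not None]
        result.append(_mean(sheet))
    return result


def time_means(cube):
    """One mean per grid cell, taken along the time (z) axis."""
    return [[_mean([v for v in series if v is not None]) for series in plane]
            for plane in cube]


# 1 lon x 2 lat x 2 time steps, with one missing cell.
cube = [[[1.0, 3.0], [3.0, None]]]
print(sheet_means(cube))   # mean of each sheet, per time step
print(time_means(cube))    # mean over time, per cell
```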
37
OK the schedule says I'll talk about a tool
…so OK, I'll mention one that we are building at JPL called the Regional Climate Model Evaluation System (RCMES)
– RCMED and RCMET
Caveats
– Certain parts of this tool are still in development
– Portions of the tool are difficult to install
– Things I hate… love: NCL, PyNIO, PyNGL, matplotlib, scipy, numpy
Good news
– We're trying to make the tool easier to install
– We are building the tool as an open source system
38
RCMED
[Architecture diagram. Labels: TRMM, ERA-Int, MODIS, CRU, AIRS; Extractors; RCMED Observation database; www; RCMET Evaluation tool front-end; Model file; server-side (hosted at JPL); client-side (user's local machine)]
39
Datasets included
MODIS (satellite cloud fraction): [daily 2000 – 2010]
TRMM (satellite precipitation): 3B42 [daily 1998 – 2010]
AIRS (satellite surface + profile retrievals): [daily 2002 – 2010]
ERA-Interim (reanalysis): [daily 1989 – 2010]
NCEP Unified Rain gauge Database (gridded precipitation): [daily 1948 – 2010]
CRU TS 3.0: precipitation, Tavg, Tmax, Tmin [monthly 1901 – 2006]
Variable lists shown on the slide: Level 3: T(2m), T(p), z(p); T(2m), Td(2m), T(p), z(p)
40
How do RCMET and RCMED talk?
[Architecture diagram. Labels: TRMM, ERA-Int, MODIS, CRU, AIRS; Extractors; RCMED Observation database; www; RCMET Evaluation toolkit; Model file; server-side (hosted at JPL); client-side (user's local machine)]
41
Programmatic Access
The RCMED API:
- Search the entire database
- Space/time box
- Simple RESTful URL
- Simple ASCII result format
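A space/time box query over a RESTful URL usually boils down to a handful of query-string parameters. The sketch below shows the general shape only: the endpoint, the dataset identifier and every parameter name are hypothetical placeholders, not the actual RCMED API.

```python
from urllib.parse import urlencode


def build_query_url(base_url, dataset, lat_min, lat_max, lon_min, lon_max,
                    time_start, time_end):
    """Build a space/time box query URL (all parameter names are made up)."""
    params = {
        "dataset": dataset,
        "latMin": lat_min,
        "latMax": lat_max,
        "lonMin": lon_min,
        "lonMax": lon_max,
        "timeStart": time_start,
        "timeEnd": time_end,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and dataset name, for illustration only.
url = build_query_url(
    "http://example.org/rcmed/query",
    dataset="example_dataset",
    lat_min=-10, lat_max=10, lon_min=20, lon_max=40,
    time_start="2007-01-01", time_end="2007-01-31",
)
print(url)
```

A script would then fetch that URL and parse the ASCII response line by line; the response layout has to be taken from the service's own documentation.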
42
Recall: this would be what you need for step 2.5
43
RCMED Web-Based Access
The RCMED Data Portal:
- Database statistics
- Project information
- Advanced search options
- Data product download
- Query API for 3rd-party scripts
44
RCMET
[Architecture diagram. Labels: TRMM, ERA-Int, MODIS, CRU, AIRS; Extractors; RCMED Observation database; www; RCMET Evaluation tool front-end; Model file; server-side (hosted at JPL); client-side (user's local machine)]
45
RCMET
[Processing flow diagram. Inputs: model file, RCMED observation database. One label reads "optional".]
Collect user choices (GUI / command line)
Load model data
Retrieve obs from database
Spatial re-gridding onto common grid
Time averaging, e.g. calculate monthly means from daily data
Area-averaging, e.g. calculate area-weighted mean over user-defined masked region
Annual cycle compositing, e.g. calculate means of all Januarys, all Februarys, etc.
Metric calculation, e.g. calculate bias, RMS error, etc.
Plot production, e.g. map, time series plot, Taylor diagram
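The three averaging steps in that flow are easy to picture in code. The sketch below is a plain-Python illustration, not RCMET's implementation: monthly means from a daily series, an annual cycle composite from those monthly means, and an area-weighted mean on a regular lat/lon grid using cos(latitude) as the weight, with an optional mask. All function names are made up.

```python
import math
from collections import defaultdict
from datetime import date


def _mean(values):
    return sum(values) / len(values)


def monthly_means(daily):
    """[(date, value), ...] -> {(year, month): mean of that month's values}."""
    groups = defaultdict(list)
    for day, value in daily:
        groups[(day.year, day.month)].append(value)
    return {key: _mean(vals) for key, vals in sorted(groups.items())}


def annual_cycle(monthly):
    """{(year, month): value} -> {month: mean over all years}."""
    groups = defaultdict(list)
    for (_, month), value in monthly.items():
        groups[month].append(value)
    return {month: _mean(vals) for month, vals in sorted(groups.items())}


def area_weighted_mean(values, lats, mask=None):
    """Mean of values[j][i] with each row weighted by cos(lats[j]).

    mask[j][i], if given, selects the cells to include (truthy = keep).
    """
    weighted_sum = weight_total = 0.0
    for j, row in enumerate(values):
        weight = math.cos(math.radians(lats[j]))
        for i, value in enumerate(row):
            if mask is not None and not mask[j][i]:
                continue
            weighted_sum += weight * value
            weight_total += weight
    return weighted_sum / weight_total


daily = [(date(2007, 1, 1), 1.0), (date(2007, 1, 2), 3.0),
         (date(2008, 1, 1), 4.0)]
monthly = monthly_means(daily)
cycle = annual_cycle(monthly)
awm = area_weighted_mean([[1.0], [3.0]], lats=[0.0, 60.0])
print(monthly, cycle, awm)
```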
46
What we're working on
Annual cycle compositing
Area-averaging:
– Full domain
– User-defined lat/lon bounding box
– User-supplied mask in netCDF file
Metrics: Mean error (bias), RMS error, Mean Absolute Error, Pattern Correlation, Anomaly Correlation, Probability Distribution Function
Plots:
– Time series
– Map plots
– Taylor diagram
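Several of these metrics are one-liners once model and observations sit on the same grid. A minimal sketch using their textbook definitions, assuming both fields have been flattened to equal-length lists with no missing values; the sign convention (model minus observation) and the use of an unweighted Pearson correlation for pattern correlation are assumptions, not a statement of how RCMET computes them.

```python
import math


def bias(model, obs):
    """Mean error: average of (model - obs)."""
    return sum(m - o for m, o in zip(model, obs)) / len(model)


def mean_absolute_error(model, obs):
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(model)


def rms_error(model, obs):
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(model))


def pattern_correlation(model, obs):
    """Unweighted Pearson correlation between the two fields."""
    m_mean = sum(model) / len(model)
    o_mean = sum(obs) / len(obs)
    cov = sum((m - m_mean) * (o - o_mean) for m, o in zip(model, obs))
    m_var = sum((m - m_mean) ** 2 for m in model)
    o_var = sum((o - o_mean) ** 2 for o in obs)
    return cov / math.sqrt(m_var * o_var)


# Made-up values on a common three-cell grid.
model = [1.0, 2.0, 3.0]
obs = [2.0, 4.0, 6.0]
print(bias(model, obs), mean_absolute_error(model, obs),
      rms_error(model, obs), pattern_correlation(model, obs))
```

On a real lat/lon grid the sums would normally be area-weighted, and cells with missing data masked out first.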
47
Demo
If this doesn't work I have backup slides
– Cross your fingers
– And if it doesn't work, I blame Bruce, Chris, Richard, Bill, Hassan et al. for keeping me out last night
48
Lessons Learned
Separating out RCMED and RCMET
– = GOOD
– Allows each to evolve independently
– Keep adding satellite observations; the analysis tool just reaps the benefits without having to know or care about formats, temporal differences, spatial differences, etc.
RCMET installation on client machine
– …ehhh, not always so good
– RCMET has a tightly coupled dependency on RCMED
49
Thoughts
Bandwidth is limited in Africa
Option 1: Couple RCMED and an RCMET-like system closely together
– Stand up RCMES (coupled system)
– Easily add new datasets, new plots, new stats, etc.
– Bandwidth limitation more easily dealt with due to closeness
Option 2: Provision RCMES as a web UI near a data center with lots of bandwidth
– Allows for true "thinlet" apps, either browser or phone
50
Alright, I'll shut up now
Any questions?
THANK YOU!
– mattmann@apache.org
– chris.a.mattmann@nasa.gov
– @chrismattmann on Twitter
51
Acknowledgements
CORDEX Team
– For inviting us out here, thank you!
NASA Jet Propulsion Laboratory
– RCMES Team
– CDX Team
– OODT Team
Andrew Hart, Peter Lean, Cameron Goodale, Jinwon Kim, Dan Crichton, Duane Waliser, Amy Braverman
52
Backup