Chris Lynnes, NASA Goddard Space Flight Center
NASA/Goddard Earth Sciences Data and Information Services Center (GES DISC)


Analyzing a 35-Year Hourly Data Record: Why So Difficult?
If Big Data is hard for data center systems, what about the average science user?

Big Data at the GES DISC
- The Goddard Earth Sciences Data and Information Services Center offers Earth science data to the science community.
- GES DISC provides the Giovanni tool for exploratory data analysis.
- How much data can we process for a Giovanni visualization? We will then incorporate limits into the Giovanni User Interface to prevent workflow failures.

North American Land Data Assimilation System (NLDAS)
- Land hydrology model output: soil moisture and temperature, runoff, evapotranspiration, ...
- Available in netCDF, a community-standard, random-access format.
- Archived as 1 file per timestep.
- ~5000 users in 2013.
IN21B-3704, AGU 2014 Fall Meeting, San Francisco, CA

A Simple, but Popular, Task: Get the Time Series for a Point
NCO (netCDF Operators): a fast tool for processing netCDF data files.

Try #1: Wildcard on the command line
  go> ncrcat -d lat,40.02,... -d lon,...,... NLDAS_NOAH0125_H_002_soilm0_100cm/*/*.nc out.nc
  -bash: /tools/share/COTS/nco/bin/ncrcat: Argument list too long

Try #2: Read the file list from standard input (an NCO feature for working with lots of files)
  go> cat in.ncrcat | ncrcat -d lat,40.02,... -d lon,...,... -o out.nc
  ncrcat: ERROR Total length of fl_lst_in from stdin exceeds ... characters. Possible misuse of feature.
  If your input file list is really this long, post request to developer's forum (...) to expand FL_LST_IN_MAX_LNG*

Try #3: Create local, short symlinks, then read the list from stdin
  ln/1979/n....nc -> /var/scratch/clynnes/agu2014/data/1979/scrubbed.NLDAS_NOAH0125_H_002_soilm0_100cm....T0100.nc
  SUCCESS!
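Try #3, the symlink workaround, can be sketched generically in shell. The directory layout, granule name, and symlink name below are illustrative stand-ins rather than the poster's actual paths, and the final ncrcat call is left as a comment since it requires NCO and the real data:

```shell
#!/bin/sh
# Sketch of Try #3: replace each long archive path with a short local symlink,
# so the file list fed to ncrcat on stdin stays under NCO's length limit.
# All names below are illustrative stand-ins for the poster's real layout.
set -e
mkdir -p demo/data/1979 demo/ln/1979
# stand-in for one long-named hourly NLDAS granule
long="demo/data/1979/scrubbed.NLDAS_NOAH0125_H_002_soilm0_100cm.19790101T0100.nc"
touch "$long"
# short symlink (n0100.nc) pointing back at the long path
ln -sf "$PWD/$long" demo/ln/1979/n0100.nc
# build the (much shorter) file list from the symlink names
ls demo/ln/1979/*.nc > in.ncrcat
# with NCO installed, the real extraction would then be:
#   cat in.ncrcat | ncrcat -d lat,40.02,... -d lon,...,... -o out.nc
```

The symlinks change nothing about the data; they only shrink the number of characters per entry in the file list.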
Computing an Area Average Time Series (over Colorado)
Pre-concatenation of data can be helpful*:
- Use ncrcat to concatenate hourly files into daily or monthly files.
- But it makes some analyses (e.g., diurnal) unwieldy.
*Thanks to R. Strub and W. Teng for the idea.

Parallel Processing Made Slightly Less Difficult
- Split the input file list into sections, process the sections in parallel, then stitch the results together.
- GNU make: a parallel job execution and control engine (the -j option) for the common folk.
- To run 2 jobs in parallel: make split && make -j 2 -l 9. all

Recommendations
To Data Product Designers / Producers:
- Include the time dimension, even for single-time-step files; it makes reorganizing time segments easy (with NCO's ncrcat).
To Data Centers:
- Provide more remote analysis capability for scientists: GrADS Data Server, Live Access Server, Giovanni (federated Giovanni in progress; Talk IN52A-05, MW-2020, Friday 11:20).
- Provide users with tips and tricks for Big Data processing; no reason users need to re-learn our lessons the hard way.
- Provide or point to tools and recipes for splitting and stitching data.
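The split / process-in-parallel / stitch pattern can be sketched with plain shell background jobs standing in for GNU make's -j (a simplification of the poster's Makefile approach), and with `wc -l` standing in for the real per-chunk ncrcat extraction:

```shell
#!/bin/sh
# Split / process-in-parallel / stitch, sketched with shell background jobs
# instead of the poster's GNU make -j; wc -l stands in for the real
# per-chunk ncrcat step.
set -e
seq 1 100 > in.symlink            # stand-in for the 311,135-entry file list
split -d -l 25 in.symlink in.     # 1. split into chunks: in.00 ... in.03
for f in in.??; do                # 2. process each chunk concurrently
  wc -l < "$f" > "chunk.$f" &
done
wait                              # block until every chunk job has finished
cat chunk.in.?? > all.out         # 3. stitch the per-chunk results together
```

Unlike this sketch, make -j also enforces the dependency order between chunks and the final stitch, and its -l option throttles new jobs by load average.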
[Figure: Boulder, CO soil moisture (0-100 cm) time series]

# Makefile example: extract a time series* over Boulder, CO from hourly files
SPLITFILES = $(shell split -d -l ... in.symlink in. && ls -c1 in.??)
OUTSPLIT = $(SPLITFILES:in.%=chunk.%.nc)
BBOX_ARGS = -d lat,40.02,... -d lon,...,...
all: split boulder.nc $(SPLITFILES)
chunk.%.nc: in.%
	sort $^ | ncrcat -O -o $@ $(BBOX_ARGS)
boulder.nc: $(OUTSPLIT)
	ncrcat -O -o $@ $^

1. Split the list of input files (in.symlink) into chunks: in.00, in.01, ...
2. Extract and concatenate each chunk.
3. When all sections are done, stitch them together.
*Later expanded after a post to the developer's forum.

NLDAS data volume:
  Spatial resolution        0.125 x 0.125 deg
  Grid dimensions           464 lon. x 224 lat.
  Grid cells per timestep   103,936
  Files / day               24
  Days / yr                 365
  Years                     >35
  Total timesteps           >310,000
  Values per variable       ~32,000,000,000

Boulder, CO soil moisture (0-100 cm): 311,135 files, 1 per timestep.
- 311,135 full pathnames ⇒ 21,796,250 characters
- 311,135 short symlinks ⇒ 7,165,105 characters (whew!)

[Figure: elapsed time (min.) vs. number of parallel jobs]
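The file-list sizes quoted above work out to roughly 70 characters per full pathname versus about 23 per symlink path, which is why the shortened list fits where the original could not. A quick check (ARG_MAX is the kernel's command-line length limit that Try #1 tripped over; its value varies by system):

```shell
#!/bin/sh
# Back-of-envelope check of the file-list sizes quoted on the poster.
echo $((21796250 / 311135))   # chars per full pathname -> 70
echo $((7165105 / 311135))    # chars per short symlink path -> 23
getconf ARG_MAX               # the per-system limit Try #1 exceeded
```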
[Figure: Colorado area average time series (monthly files), netCDF-3 vs. netCDF-4]

Parallelization is not quite a silver bullet:
- "-j 2" runs 2 jobs at a time.
- "-l 9." defers new jobs until the load average drops below 9.

  Time length of file   Avg file size   Elapsed time (min): processing / concatenation
  1 hour                0.4 MB          ... / ...
  1 day                 9.1 MB          ... / ...
  1 month               285 MB          42 / 74

Also, Divide and Conquer
- Split the input file list into sections, process, and stitch the results together to reduce memory and temporary disk use.

*The actual Makefile for the area average time series is more complicated than this point time series example.
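As a sanity check, the data-volume figures quoted above are internally consistent; the grid and timestep counts multiply out to the quoted ~32 billion values per variable:

```shell
#!/bin/sh
# Sanity check of the NLDAS data-volume figures quoted above.
echo $((464 * 224))           # grid cells per timestep -> 103936
echo $((24 * 365 * 35))       # hourly steps in exactly 35 years -> 306600
                              # (the record is a bit longer: >310,000 timesteps)
echo $((103936 * 311135))     # values per variable -> 32338127360, ~3.2e10
```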