AGDC Current Status Challenges and Plans

Slides:



Advertisements
Similar presentations
Future Directions and Initiatives in the Use of Remote Sensing for Water Quality.
Advertisements

Integrating NOAA’s Unified Access Framework in GEOSS: Making Earth Observation data easier to access and use Matt Austin NOAA Technology Planning and Integration.
USGS Core Data Stream Report Session 7: Baseline Global Observation Scenario SDCG-7 Sydney, Australia March 4 th – 6 th 2015.
Astrophysics, Biology, Climate, Combustion, Fusion, Nanoscience Working Group on Simulation-Driven Applications 10 CS, 10 Sim, 1 VR.
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
Mike Smorul Saurabh Channan Digital Preservation and Archiving at the Institute for Advanced Computer Studies University of Maryland, College Park.
CEOS Data Cube Concept and Prototype Project Plans
Virtual Geophysics Laboratory (VGL) VGL v1.2 NeCTAR Project Close R.Fraser, T.Rankine, J.Vote, L.Wyborn, B.Evans, R.Woodcock, C.Kemp July 2013 CSIRO |
System Design/Implementation and Support for Build 2 PDS Management Council Face-to-Face Mountain View, CA Nov 30 - Dec 1, 2011 Sean Hardman.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
Australian Geoscience Data Cube
CEOS Data Cube Open Source Software Status Brian Killough CEOS Systems Engineering Office (SEO) WGISS-40 Harwell, Oxfordshire, UK September 30, 2015 (remote.
AN ORGANISATION FOR A NATIONAL EARTH SCIENCE INFRASTRUCTURE PROGRAM Virtual Geophysics Laboratory (VGL): Scientific workflows Exploiting the Cloud Josh.
CEOS Working Group on Information System and Services (WGISS) Data Access Infrastructure and Interoperability Standards Andrew Mitchell - NASA Goddard.
CEOS Data Cubes A briefing for the GEO Secretariat Brian Killough CEOS Systems Engineering Office (SEO) April 1, 2016.
The future of Delft-FEWS
What we mean by Big Data and Advanced Analytics
Incoming Themes for 2017 Frank Kelly, USGS
Future Data Access and Analysis Architectures – Discussion
NASA Earth Science Data Stewardship
2nd GEO Data Providers workshop (20-21 April 2017, Florence, Italy)
USGS EROS LCMAP System Status Briefing for CEOS
Brian Johnson and Doug Young
Database management system Data analytics system:
SEO Capacity Building Agenda #15 Brian Killough
LSI-Related Activities
INTAROS WP5 Data integration and management
Chapter 12: File System Implementation
Geoscience Australia – data cube progress
Analysis Ready Data July 18, 2016 John Dwyer Leo Lymburner
Future Data Architectures (FDA) Pilot Project Summary
Joseph JaJa, Mike Smorul, and Sangchul Song
The Swiss Data Cube (SDC)
Analysis Ready Data ..
SEO Report to WGISS Brian Killough CEOS Systems Engineering Office (SEO) WGISS-42 Meeting September 19, 2016.
Space Data Services Session 2: Space Data Country Outreach and Delivery Agenda Item #3 Brian Killough CEOS Systems Engineering Office (SEO) February 25,
AOGEOSS Task 11. Develop Regional GEOSS Data set.
USGS Status Frank Kelly, USGS EROS CEOS Plenary 2017 Agenda Item #4.14
Committee on Earth Observation Satellites
CSIRO Agency Update Dr. Robert Woodcock.
WGISS-WGCV Joint Session
Towards achieving continental scale field validation and multi-sensor interoperability of satellite derived surface reflectance in Australia Medhavy Thankappan1,
Potential Landsat Contributions
Land Imagery Data Architectures
Open Data Cubes Cloud Services Experiences and Lessons Learned
Future Data Architectures Big Data Workshop – April 2018
Shell & Kernel Concepts in Operating System
Medhavy Thankappan and Lan-Wei Wang
Technology Exploration Cloud Hosting at USGS
Data Asset Enhancement Synergies with FDA
SAR-related ARD activities
FDA Objectives and Implementation Planning
Storage Structure and Efficient File Access
What's New in eCognition 9
SESSION 2: GLOBAL DATA FLOWS STUDY
Laura Bright David Maier Portland State University
Chapter 2: Operating-System Structures
FDA Topics Going Forward…???
Introduction to Operating Systems
Agency Reports – USGS Jenn Lacey LSI-VC-5 Agenda Item #2 February 2018
Overview of Workflows: Why Use Them?
SEO Report Brian Killough, NASA, CEOS SEO CEOS Plenary 2018
Chapter 2: Operating-System Structures
Advanced Geospatial Techniques: Aiding Earth Observation Applications
What's New in eCognition 9
What's New in eCognition 9
Open Data Cube Demo and FDA Interfaces
Roadmap and short term activities on interoperability of Data Cubes
SEO Report to WGISS-48 Brian Killough
Presentation transcript:

AGDC Current Status Challenges and Plans CEOS WGISS 43 Prepared by Simon Oliver, Matt Paget, Peter Wang Presented by Robert Woodcock

In 10 minutes! Overview Open Data Cube - the code has a new home! Archive Reprocessing Collection Management Coherence / Completeness / Operations tools Product support / MODIS / rainfall / SRTM / radiometrics Pipelines beta / S2 NBAR / S1 gamma0 / Landsat SBT Temporal stacks Stackability Ingest - 3D NetCDF creation AWS Deployment - Datacube with S3 AWS Hyderabad Cube / AMI / On-demand DC / LEDAPS support Grid Workflow Temporal Statistics PiPy Communication channels Challenges In 10 minutes!

CEOS Open Data Cube The Open Data Cube, is a comprehensive solution that builds the capacity of global users to apply CEOS satellite data. Based on Analysis Ready Data (ARD) products that contribute to the increased uptake and impact of growing data volumes Coordination: GA, CSIRO, NASA, USGS. Open source software repository: https://github.com/opendatacube Current prototype testing in Colombia, Switzerland and Vietnam, with more planned. Free/Open demo available on Amazon AWS with 14 sample data cubes and 7 applications: http://www.tinyurl.com/datacubeui

Archive reprocessing Complete reprocessing of Landsat archive Level 1 NBAR / T (+T = new product ) PQ Moved scene processes to datacube Fractional Cover NDVI Dashboard for visualising data holdings and dataset metadata

Collection Management Providing users with a view of the archive content Completeness Of the potential datasets, how many exist Consistency Duplication (handling upstream production quirks) Coherence Orphan checking / descendents and ascendants Syncing to ensuring files on disk match datacube content So the datacube core currently tracks what we have. What datasets we know about (that existed at any time), where they are etc. It says what there is, but not what there should be. You can tell the datacube about a dataset (I have a level1 here->) and it will keep record of it for you. But keeping it consistent is done manually (is this replacing an old scene? Need to archive the old one when you add the new one). During tumultuous beta period for a collection, datasets can be deleted and replaced, and our diverse ecosystem of software doesn't all know to tell us that it changed the dataset. You can add three reprocessed copies of the same scene and it will happily keep track of them all for you, but we probably don’t want all three to be returned in normal search results. The cube tracks what there is, but not yet what there should be. The coherency effort is to start looking at automating the maintenance of what should be. If we have three copies of that scene, which is the one that should be used? And importantly, we’ve done this manually up to now with ad-hoc scripts, but we want something that’s trustworthy enough to run automatically and hands-off, and to remain maintainable as the system evolves. First level: Ensuring the index matches the storage (examples of edge cases we find/fix) Second: We may have multiple copies of datasets from different sources. Find the logically equivalent datasets (eg. of the same scene acquisition), and decide on which one is best. (examples of edge cases we find) Third: Going up the tree. Do all nbars have a level1, do all level1s have telemetry etc. [Going down the tree would be completion, not covered here]. Define what/where we expect of a collection. Introduced a new ’sync’ command. Designed to be hands-off, so we can run it automatically.

Collection Management Collection Management Interface Product Definition Collection Management Product version Business Managed allocation of Major Minor Patch designation based on principles of collection scope and significance evidence through assessment of impacts of change Major, Minor, Patch 1.1.1 1.2.2 x.x.x Version controlled software AGDCv2 Core Code v1.1.12 Landsat Level 2 processor v2.2.2 EO Dataset scene based packager v 2.1.1 Landsat Level 1 production system v1.1.1 LPGS v3.3.4 Version controlled inputs Version of inputs and code repositories in provenance Collection coherence validation Ancillary 1 v1.1.1 Ancillary n v1.1.1 AGDC Database Datasets v x.x.x

Product support – not data cube native Leveraging existing datasets on NCI systems MODIS MCD43A1-4 indexed BoM Rainfall indexed Radiometrics Shuttle Radar Topography Mission Feedback from users on potential new data sources Enabling alternate protocols HTTP/S3 indexing for Sentinel-2 and Landsat 8 L1T files on AWS Support for re-projection on Data Cube Load

Temporal Stacking Ingest – netCDF Generation of XYT NetCDF storage objects + chunking. important part of enhancing data access and in managing inode counts and relates to the combination of a number of datasets into a single file. The internal chunk size can be modified at stack creation time to tailor to a give mode of access (temporal or spatial) Stacking is an important part of enhancing data access and in managing inode counts and relates to the combination of a number of datasets into a single file. The internal chunk size can be modified at stack creation time to tailor to a give mode of access (temporal or spatial). With stacking comes a number of challenges in an operational setting in terms of dealing with changes to the content of that stacked data.csiro/s3-driver -> csiro/s3-driver Improvements currently in csiro/s3-driver. Improved ingester to support generation of 3D NetCDF storage objects with 3D chunking support. Keeps same indexing/database structure. One Dataset per time slice in the 3D NetCDF file. Provenance is per Dataset/timeslice in the NetCDF file. This will be extended to support s3 storage. 3D NetCDF in testing phase. 3D S3 in development.

Data Cubes and AWS with native S3 support AWS for exploratory analysis and market access Fast, scalable and accessible Address multi-platform criteria New data structures for EO Aligned with S3 principles Analytics to take advantage of I/O concurrency Data Cube with Native S3 object storage Usual Data cube API but no files. Store just the raw bytes in S3. Data Cube Storage Drivers (pluggable storage with index) Choose NetCDF storage, S3 storage, or OPeNDAP or ... Use S3 like main memory for raw array storage and access. Ingest to 3D storage objects with 3D sharding across S3 objects. Parallel reads/writes to S3 and numpy ndarray. Expected to scale linearly to network/instance(s) limit. ndarray read/write (e.g. slice) operate on S3 objects. Less translations required to read/write nd array data. Data stored on s3 is compute ready. Stored data is ready to be consumed by CPU/GPU on the low- level.

Statistics Tools Adds support for deep temporal analysis statistics Keeps full provenance of source datasets Splits processing into small spatial chunks, never be limited by available memory Out of the box support for standard statistical means/percentiles/etc as well as high dimensional stats like medoid and geomedian (soon) Has been developed in a separate repository, but will be merged into core in the next month https://github.com/GeoscienceAustralia/agdc_statistics

NBAR-T Geometric Median Murray Darling Basin 1,061,469 km2 Geometric Median Preserves the spectral character Terrain corrected It is the geometric median of Landsat 8 observations for the 2015 calendar year over the Murray Darling Basin, displayed as band 6, 5, 3 false colour (that is SWIR on red channel, NIR on green channel, Green on blue channel). The geometric median is a high dimensional median that preserved the spectral character of the observations from which they are derived and provides a summary of median conditions for the entire basin. Without ARD this would show scene edges, variance in illumination and other conditions that impact the creation of composites.

Sensor Ignorance Require no sensor understanding Spectral equivalence measures Below using wiki definitions of red, green and blue to find equivalent sensor bands (Landsat TM MODIS) Removing the requirement for sensor understanding to be able utilise data from a given platform, is seen as a major obstacle to greater adoption of remote sensing by the masses. Sensor Ignorance is a concept (still under development) to allow a user to match sensors to a given spectral response (band matching or an idealised band i.e. “blue from wikipedia”) This example demonstrates uses the datacube to match Landsat bands to the relevant MODIS bands, purely by examining and evaluation the spectral responses store in the datacube metadata.

Earth Analytics Industry Hub (CSIRO) Public good component using free satellite data A global network of interoperable data cubes Commercial component Business processes and engagement to sustain science and data pipelines “App Store” of analytics routines EO and geoscience researchers delivering to minerals, agriculture and environment domains Multi-platform Data cube applications work seamlessly across Cloud, HPC and your desktop Who wants in? We’re keen to partner with you.