Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Presentation transcript:

Foundations VII: Data life-cycle, Mining and Knowledge Discovery. Deborah McGuinness and Joanne Luciano, with Peter Fox and Li Ding. CSCI Week 13, November 29, 2010

Contents: Review assignment. More advanced topics: life cycle, mining and adding to your knowledge base. Summary. Next week (your presentations).

Semantic Web Methodology and Technology Development Process: Establish and improve a well-defined methodology vision for Semantic Technology based application development. [Diagram labels: Leverage controlled vocabularies, etc.; Use Case; Small Team, mixed skills; Analysis; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Use Tools; Science/Expert Review & Iteration; Develop model/ontology; Evaluation]

Data -> Information -> Knowledge

Data Life Cycle: Life cycle (we will define these shortly) –Acquisition, curation, preservation –Long term stewardship. Data and information (we use this to get to the discussion of knowledge) –Content: the values –Context: the background, setting, etc. –Structure: organization and form. Representation/storage –Analog –Digital (and born digital)

Why it is important: 1976 NASA Viking mission to Mars (A. Hesseldahl, “Saving Dying Data,” Sep. 12, 2002, Forbes. [Online]. Available:). BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, [Online]. Available:). R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt

Why (cont’d): E-science aims to derive new knowledge from (possibly) multiple data sources. The data need to be persistent, available and usable. The rate of creation of knowledge representations (KR) is increasing; they are a representation of the known ‘facts’ based on the data. We studied KR creation, engineering, evolution and iteration. Knowledge needs a life cycle as well.

At the heart of it: Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc. Inability to know the inter-relations, assumptions and missing information. We’ll look at a (data) use case for this shortly. But first we will look at what, how and who in terms of the full life cycle.

What to collect? Documentation –Metadata –Provenance. Ancillary information. Knowledge.

Who does this? Roles: –Data creator –Data analyst –Data manager –Data curator

How it is done

Acquisition

Curation

Preservation: Usually refers to the full life cycle. Archiving is a component. Stewardship is the act of preservation. The intent is that ‘you can open it any time in the future’ and that ‘it will be there’. This involves steps that may not be conventionally thought of. Think 10, 20, 50, 200 years... Looking historically gives some guide to future considerations.

Some examples and experience: NASA. NOAA. Library community. Note: –Mostly in relation to publications, books, etc., but some for data –Knowledge is in publications, but the structure and form are meant for humans, not computers, despite advances in text analysis –Very little for the type of knowledge we are considering: in machine-accessible form

Back in the day... SEEDS Working Group on Data Lifecycle Second Workshop Report o Many LTA recommendations. Earth Sciences Data Lifecycle Report o Many lessons learned from USGS experience, plus some recommendations. SEEDS Final Report (2003), Section 4 o Final recommendations vis-a-vis data lifecycle. MODIS Pilot Project: GES DISC, MODAPS, NOAA/CLASS, ESDIS effort. Transferred some MODIS Level 0 data to CLASS.

Mostly Technical Issues: Data Preservation o Bit-level integrity o Data readability. Documentation. Metadata. Semantics. Persistent Identifiers. Virtual Data Products. Lineage Persistence. Required ancillary data. Applicable standards.
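Bit-level integrity is commonly checked by recording a checksum at ingest and recomputing it later (often called a fixity check). The following is a minimal sketch of that idea, not a description of any NASA archive's procedure; the function names `sha256_of` and `verify_fixity` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with the one recorded when the file was archived."""
    return sha256_of(path) == recorded_digest.strip().lower()
```

A digest mismatch only says that the bits changed; it does not address the separate "data readability" issue of whether the format can still be interpreted.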

Mostly Non-Technical Issues: Policy (constrained by money...). Front end of the lifecycle o Long-term planning, data formats, documentation... Governance and policy. Legal requirements. Archive to archive transitions. Money (intertwined with policy). Cost-benefit trades. Long-term needs of NASA Science Programs. User input o Identifying likely users. Levels of service. Funding source and mechanism.

HDF4 Format "Maps" for Long Term Readability. C. Lynnes, GES DISC; R. Duerr and J. Crider, NSIDC; M. Yang and P. Cao, The HDF Group. Use case: a real live one; deals mostly with structure and (some) content. HDF = Hierarchical Data Format. NSIDC = National Snow and Ice Data Center. GES = Goddard Earth Sciences. DISC = Data and Information Services Center.

In the year A user of HDF-4 data will run into the following likely hurdles: The HDF-4 API and utilities are no longer supported... o ...now that we are at HDF-7. The archived API binary does not work on today's OS's o ...like Android 3.1. The source does not compile on the current OS o ...or is it the compiler version, gcc v. 7.x? The HDF spec is too complex to write a simple read program... o ...without re-creating much of the API. What to do?

HDF Mapping Files: Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now). XML. Stored separately from, but close to, the data files. Includes o internal metadata o variable info o chunk-level info: byte offsets and length, linked blocks, compression information. Task funded by ESDIS project. The HDF Group, NSIDC and GES DISC.
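The point of a map is that a reader needs only the raw bytes plus the offsets and lengths recorded in the XML, not the HDF-4 library. The sketch below illustrates that idea under loud assumptions: the element and attribute names (`map`, `dataset`, `chunk`, `offset`, `length`) are invented for this example and are not the actual HDF4 map schema, and compression and linked blocks are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical map: NOT the real HDF4 map schema, just enough structure to
# show "byte offsets and length" being used without any HDF library.
EXAMPLE_MAP = """<map>
  <dataset name="temperature">
    <chunk offset="4" length="3"/>
    <chunk offset="10" length="2"/>
  </dataset>
</map>"""


def read_dataset_bytes(raw: bytes, map_xml: str, name: str) -> bytes:
    """Concatenate the byte ranges that the map lists for one named dataset."""
    root = ET.fromstring(map_xml)
    for dataset in root.iter("dataset"):
        if dataset.get("name") != name:
            continue
        parts = []
        for chunk in dataset.iter("chunk"):
            offset = int(chunk.get("offset"))
            length = int(chunk.get("length"))
            parts.append(raw[offset:offset + length])
        return b"".join(parts)
    raise KeyError(f"dataset {name!r} is not described in the map")
```

In practice `raw` would be the contents of the archived data file; a real reader would also have to apply the recorded compression and data type information before the values are usable.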

Map sample (extract)

Status and Future. Status: Map creation utility (part of HDF). Prototype read programs o C o Perl. Paper in TGRS special issue. Inventory of HDF-4 data products within EOSDIS. Possible Future Steps: Revise XML schema. Revise map utility and add to HDF baseline. Implement map creation and storage operationally o e.g., add to ECS or S4PA metadata files.

Examples of NASA context

Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Contextual Information: Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.). Instrument/sensor calibration data and method. Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product). Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product. 7th Joint ESDSWG meeting, October 22, Philadelphia, PA. Data Lifecycle Workshop sponsored by the Technology Infusion Working Group.

Contextual Information (continued): Processing history, including versions of processing source code corresponding to versions of the data set or derived product held in the archive. Quality assessment information. Validation record, including identification of validation data sets. Data structure and format, with definition of all parameters and fields. In the case of earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record. A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set. Information received back from users of the data set or product.

However… Even groups like NASA do not have a governance model for this work. Governance: definition. Stakeholders: –NASA for integrity of their data holdings (is it their responsibility?) –Public for value for and return on investment –Scientists for future use (intended and unintended) –Historians

NOAA

Library community: OAIS. OAI (PMH and ORE).

Metadata Standards - PREMIS: Provide a core preservation metadata set with broad applicability across the digital preservation community. Developed by an OCLC and RLG sponsored international working group –Representatives from libraries, museums, archives, government, and the private sector. Based on the OAIS reference model.

Metadata Standards - PREMIS: Maintained by the Library of Congress. Editorial board with international membership. User community consulted on changes through the PREMIS Implementers Group. Version 1 was released in June 2005. Version 2 was just released.

PREMIS - Entity-Relationship Diagram. Intellectual Entities: “a coherent set of content that is reasonably described as a unit”, for example, a web site, data set or collection of data sets. Objects: “a discrete unit of information in digital form”, for example, a data file. Rights: “assertions of one or more rights or permissions pertaining to an object or an agent”, e.g., copyright notice, legal statute, deposit agreement. Events: “an action that involves at least one object or agent known to the preservation repository”, e.g., created, archived, migrated. Agents: “a person, organization, or software program associated with preservation events in the life of an object”, e.g., Dr. Spock donated it.

PREMIS - Types of Objects: Representation - “the set of files needed for a complete and reasonable rendition of an Intellectual Entity”. File. Bitstream - “contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes”.
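The five PREMIS entities and the three object types can be pictured as a small data model. The sketch below is only an illustration of how the entities relate; the class and field names are hypothetical and do not follow the PREMIS XML schema or its semantic unit names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class ObjectKind(Enum):
    """The three PREMIS object types named on the slide."""
    REPRESENTATION = "representation"
    FILE = "file"
    BITSTREAM = "bitstream"


@dataclass
class Agent:
    name: str  # a person, organization, or software program


@dataclass
class Event:
    event_type: str  # e.g. "created", "archived", "migrated"
    agents: List[Agent] = field(default_factory=list)


@dataclass
class RightsStatement:
    basis: str  # e.g. "copyright notice", "deposit agreement"


@dataclass
class PreservedObject:
    identifier: str
    kind: ObjectKind
    events: List[Event] = field(default_factory=list)
    rights: List[RightsStatement] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    title: str  # e.g. a web site, data set or collection of data sets
    objects: List[PreservedObject] = field(default_factory=list)


def event_history(entity: IntellectualEntity) -> List[Tuple[str, str]]:
    """List (object identifier, event type) pairs in recorded order."""
    return [(obj.identifier, ev.event_type)
            for obj in entity.objects
            for ev in obj.events]
```

Keeping events attached to objects, with agents attached to events, mirrors the diagram's idea that the history of an object is reconstructed from who did what to it.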

Metadata Standards - METS: Metadata Encoding and Transmission Standard. An initiative of the Digital Library Federation. Based on the Making of America II project.

METS - What’s Its Purpose? Provides the means to convey the metadata necessary for –management of digital objects within a repository –exchange of objects between repositories (or between repositories and their users). Designed to facilitate –shared development of information management tools/services –interoperable exchange of digital materials

METS - What’s its status? Version 1.6 was released in Sept. Maintained by the Library of Congress. International Editorial Board. NISO registration as of 2006.

Backup Materials - MODIS Contextual Info

Instrument/sensor characteristics

Processing Algorithms & Scientific Basis

Ancillary Data

Processing History including Source Code

Quality Assessment Information

Validation Information

Other Factors that can Influence the Record

Bibliography

Information from users: Data errors found. Quality updates. Things that need further explanation. Metadata updates/additions? Community contributed metadata????

Back to why you need to… E-science uses data, and the data need to be around when what you create goes into service and you go on to something else. That’s why someone on the team must address life cycle (data, information and knowledge; we’ll get to the latter shortly) and work with other team members to implement organizational, social and technical solutions to the requirements.

What would you need to do?

(Digital) Object Identifiers: Object is used here so as not to pre-empt an implementation, e.g. resource, sample, data, catalog. Examples: –DOI –URI –XRI
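As a small illustration of how two of the identifier types relate, a DOI can be turned into a resolvable URI by prefixing it with the doi.org resolver address. This is a minimal sketch; the helper name `doi_to_url` and its validation rules are assumptions for illustration, and the DOI in the usage note is only an example value.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Turn a DOI such as '10.1000/182' (optionally 'doi:'-prefixed) into a resolver URL."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # DOIs consist of a prefix starting with "10." and a suffix, separated by "/".
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are not safe in a URL path, keeping "/".
    return "https://doi.org/" + quote(cleaned, safe="/")
```

For example, `doi_to_url("doi:10.1000/182")` yields `https://doi.org/10.1000/182`. The identifier stays stable while the location it resolves to can change, which is the property that matters for long-term stewardship.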

Versioning

Mining: We will start with data, but the ideas apply to information and knowledge bases as well. Definition. History. Our interest.

SAM: Smart Assistant for Earth Science Data Mining PI: Rahul Ramachandran Co-I: Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair

Science Motivation: Study the impact of natural iron fertilization processes, such as dust storms, on plankton growth and subsequent DMS production –Plankton plays an important role in the carbon cycle –Plankton growth is strongly influenced by nutrient availability (Fe/Ph) –Dust deposition is an important source of Fe over the ocean –Satellite data are an effective tool for monitoring the effects of dust fertilization. Analysis entails –Mining MODIS L1B data for dust storm events and identifying the swath of area influenced by the passage of the dust storms –Examining correlations between fertilization, plankton growth and DMS production

Current Analysis Process: MODIS aerosol products don’t provide speciation. Locate and download all the data to their local machine. Write code to classify and detect dust accurately [3-4 month effort]. Write code to classify and detect other dust aerosols [3-4 month effort]. Write code to segment the detected region in order to account for advection effect and correlation coefficient [2 month effort].

Analysis with SAM: Create a workflow to perform classification using many different state-of-the-art classifiers on distributed data. Create a workflow to segment detected regions using image processing services on distributed data. Bottom line: the scientist does not have to write all the code to perform the analysis, can compose workflows that utilize distributed data/services, and can share the workflow with others to collaborate, reuse and modify.

Conducting Science using Internet as the Primary Computer

Mash-ups Example: Yahoo Pipes

Data Mining in the ‘new’ Distributed Data/Services Paradigm

Too many choices!! And that’s only part of the toolkit. The ADaM-IVICS toolkit has 100+ algorithms.

SAM Objectives Improve usability of Earth Science data by existing data mining services for research, by incorporating semantics into the workflow composition process. –Semantic search capable of mapping a conceptual task –Assistance in mining workflow composition –Verification that services are connected in a semantically correct fashion

Ontology Use

Semi-automated Workflow Composition Filtering services based on data format

Semi-automated Workflow Composition Filtering service options based on both data format and task selected

Semi-automated Workflow Composition Final Workflow
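The filtering steps in the three composition slides can be sketched as follows. This is a simplified stand-in, not SAM's implementation: SAM is described as using an ontology, whereas this sketch matches plain strings, and every service name, task label and format label in the catalog is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Service:
    name: str
    task: str            # what the service does, e.g. "classify"
    input_format: str    # data format the service accepts
    output_format: str   # data format the service produces


# Hypothetical catalog; none of these entries describes a real service.
CATALOG = [
    Service("hdf_subsetter", "subset", "HDF4", "HDF4"),
    Service("hdf_to_arff", "convert", "HDF4", "ARFF"),
    Service("bayes_classifier", "classify", "ARFF", "ARFF"),
    Service("kmeans_clusterer", "cluster", "ARFF", "ARFF"),
    Service("image_segmenter", "segment", "GeoTIFF", "GeoTIFF"),
]


def candidates(catalog: List[Service], data_format: str,
               task: Optional[str] = None) -> List[Service]:
    """Filter services by accepted data format and, optionally, by task."""
    return [s for s in catalog
            if s.input_format == data_format and (task is None or s.task == task)]


def is_well_connected(workflow: List[Service]) -> bool:
    """Check that each service's output format matches the next service's input."""
    return all(a.output_format == b.input_format
               for a, b in zip(workflow, workflow[1:]))
```

Filtering by format alone narrows the choice of next step; adding the task narrows it further, and the connection check catches a workflow in which adjacent services cannot exchange data.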

Science Motivation: Study the impact of natural iron fertilization processes, such as dust storms, on plankton growth and subsequent DMS production –Plankton plays an important role in the carbon cycle –Plankton growth is strongly influenced by nutrient availability (Fe/Ph) –Dust deposition is an important source of Fe over the ocean –Satellite data are an effective tool for monitoring the effects of dust fertilization

Hypothesis: In remote ocean locations there is a positive correlation between the area-averaged atmospheric aerosol loading and oceanic chlorophyll concentration. There is a time lag between oceanic dust deposition and the photosynthetic activity.
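One way to examine both parts of the hypothesis is to compute the Pearson correlation between the aerosol series and the chlorophyll series shifted by a range of lags. The sketch below is an assumption about how such a test could be set up, not the method used in the study; the function names are hypothetical and the series are taken to be equally spaced, area-averaged time series.

```python
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(aerosol: Sequence[float], chlorophyll: Sequence[float],
                       lag: int) -> float:
    """Correlate aerosol at time t with chlorophyll at time t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(aerosol, chlorophyll)
    return pearson(aerosol[:-lag], chlorophyll[lag:])


def best_lag(aerosol: Sequence[float], chlorophyll: Sequence[float],
             max_lag: int) -> Tuple[int, Dict[int, float]]:
    """Return the lag with the highest correlation, plus all per-lag scores."""
    scores = {lag: lagged_correlation(aerosol, chlorophyll, lag)
              for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores
```

A strong correlation at a non-zero lag would be consistent with the hypothesized delay, but, as the summary slide notes, clouds, ocean currents and non-dust aerosols would still need to be accounted for before drawing conclusions.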

Primary source of ocean nutrients. [Figure labels: wind-blown dust (Sahara); sediments from river; ocean upwelling]

Factors modulating dust-ocean photosynthetic effect. [Figure labels: Sahara dust; SST; clouds; nutrients; chlorophyll]

Objectives: Use satellite data to determine whether atmospheric dust loading and phytoplankton photosynthetic activity are correlated. Determine the physical processes responsible for the observed relationship.

Preliminary Results

Data and Method: Data sets obtained from SeaWiFS and MODIS during 2000 – 2006 are employed. MODIS derived AOT.

The areas of study: 1-Tropical North Atlantic Ocean 2-West coast of Central Africa 3-Patagonia 4-South Atlantic Ocean 5-South Coast of Australia 6-Middle East 7-Coast of China 8-Arctic Ocean. *Figure: annual SeaWiFS chlorophyll image for 2001

Tropical North Atlantic Ocean: dust from Sahara Desert. [Figure panels: Chlorophyll; AOT]

Arabian Sea: dust from Middle East. [Figure panels: Chlorophyll; AOT]

Summary and future work: Dust impacts ocean photosynthetic activity: positive correlations in some areas, NEGATIVE correlation in other areas, especially in the Saharan basin. Hypothesis for explaining observations of negative correlation: in areas that are not nutrient limited, dust reduces photosynthetic activity. But we also need to consider the effect of clouds and ocean currents, and to isolate the effects of dust. The MODIS AOT product includes contributions from dust, DMS, biomass burning, etc.

Case for SAM: MODIS aerosol products don’t provide speciation. Why is performing this data analysis hard? –Need to classify and detect dust accurately –Need to classify and detect other aerosols (e.g., DMS) accurately –Need to segment the detected region in order to account for advection effects and correlation coefficient. What will SAM provide? –Capability to create a workflow to perform classification –Capability to create a workflow to segment detected regions. Bottom line: the scientist does not have to write all the code to perform the analysis, can compose workflows that utilize distributed data/services, and can share the workflow with others to collaborate, reuse and modify.

Knowledge Discovery: Has a broad meaning –Finding ontologies –Creating new knowledge from: previous knowledge; new sources (data, information); modeling. We’ll look at a mining approach as an example.

Ingest/pipelines: problem definition. Data is coming in faster, in greater volumes and outstripping our ability to perform adequate quality control. Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision. We often fail to capture, represent and propagate manually generated information that needs to go with the data flows. Each time we develop a new instrument, we develop a new data ingest procedure and collect different metadata and organize it differently. It is then hard to use with previous projects. The task of event determination and feature classification is onerous and we don't do it until after we get the data.

Fox VSTO et al.

Use cases: Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January, 26, :09UT taken by the ACOS Mark IV polarimeter? What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO? Find all good images on March 21, Why are the quick look images from March 21, 2008, 1900UT missing? Why does this image look bad?

Fox VSTO et al.

Fox VSTO et al.

Summary: (Data) life cycle – key actions –A –B. Mining (data, information and knowledge) – key results and work in progress –A –B. Facilitating new discoveries –A

Next week. This week’s assignments: –Reading: None –Assignment: None. Next class (week 14 – December 6): –Class presentation III: Use case iteration. Term assignment due December 6 before class. Office hours this week: by appointment or drop in –Winslow 2104 (Professor McGuinness) –Winslow 2143 (Professor Luciano). Questions?