Creating a Data Mining Environment for Geosciences

Slides:



Advertisements
Similar presentations
Text mining Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
Advertisements

MS DB Proposal Scott Canaan B. Thomas Golisano College of Computing & Information Sciences.
CPSC 695 Future of GIS Marina L. Gavrilova. The future of GIS.
Data Mining – Intro.
Ames Research Center NAS Division 1 Experience With NASA’s Grid Miner Thomas H. Hinke NASA Ames Research Center Moffett Field, California, USA.
DASHBOARDS Dashboard provides the managers with exactly the information they need in the correct format at the correct time. BI systems are the foundation.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
UAH GRIDS Center Middleware Testing Sandra Redman Information Technology and Systems Center and Information Technology Research Center National Space Science.
Data Mining Techniques
V. Chandrasekar (CSU), Mike Daniels (NCAR), Sara Graves (UAH), Branko Kerkez (Michigan), Frank Vernon (USCD) Integrating Real-time Data into the EarthCube.
Focus Study: Mining on the Grid with ADaM Sara Graves Sandra Redman Information Technology and Systems Center and Information Technology Research Center.
The Natural Resources Digital Library Needs, Partners, and Challenges Bonnie Avery, Janine Salwasser, & Janet Webster Oregon State University.
Data Management Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition.
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
EARTH SCIENCE MARKUP LANGUAGE “Define Once Use Anywhere” INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
Data Mining in Earth Sciences Rahul Ramachandran, Sara Graves and Ken Keiser Mathematical Challenges in Scientific Data Mining IPAM January 14-18, 2002.
Chapter 1 Introduction to Data Mining
University of Alabama in Huntsville NMI Testing and Experiences Sandra Redman Information Technology and Systems Center and Information Technology Research.
Grid-enabled Collaborative Research Applications Internet2 Member Meeting Spring, 2003 Sara J. Graves Director, Information Technology and Systems Center.
material assembled from the web pages at
EARTH SCIENCE MARKUP LANGUAGE Why do you need it? How can it help you? INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
Ames Research Center NAS Division 1 A Data Miner for the Information Power Grid Thomas H. Hinke NASA Ames Research Center Moffett Field, California, USA.
On-Board Mining in the Sensor Web NSF Next Generation Data Mining November 2, 2002 Dr. Rahul Ramachandran For Steve Tanner and.
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
ESIP Federation 2004 : L.B.Pham S. Berrick, L. Pham, G. Leptoukh, Z. Liu, H. Rui, S. Shen, W. Teng, T. Zhu NASA Goddard Earth Sciences (GES) Data & Information.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence Wednesday, March 29, 2000.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
GEON2 and OpenEarth Framework (OEF) Bradley Wallet School of Geology and Geophysics, University of Oklahoma
ITSC/University of Alabama in Huntsville ADaM System Architecture Rahul Ramachandran, Sara Graves and Ken Keiser Mathematical Challenges in Scientific.
ESML, Subsetting, Mining Tools Sara Graves Rahul Ramachandran Information Technology and Systems Center (ITSC) University of Alabama in Huntsville (UAH)
© 2013, published by Flat World Knowledge Chapter 10 Understanding Software: A Primer for Managers 10-1.
Distributed Data Analysis & Dissemination System (D-DADS ) Special Interest Group on Data Integration June 2000.
OSSIM Technology Overview Mark Lucas. “Awesome” Open Source Software Image Map (OSSIM)
Data Management: Data Processing Types of Data Processing at USGS There are several ways to classify Data Processing activities at USGS, and here are some.
Data Mining - Introduction Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
The KDD Process for Extracting Useful Knowledge from Volumes of Data Fayyad, Piatetsky-Shapiro, and Smyth Ian Kim SWHIG Seminar.
Dr.S.Sridhar,Ph.D., RACI(Paris),RZFM(Germany),RMR(USA),RIEEEProc.
ESML, Subsetting, Mining Tools
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
IOT – Firefighting Example
Data Mining – Intro.
SNS COLLEGE OF TECHNOLOGY
What Is Cluster Analysis?
INTRODUCTION TO GEOGRAPHICAL INFORMATION SYSTEM
Joseph JaJa, Mike Smorul, and Sangchul Song
J. Michael, M. Shing M. Miklaski, J. Babbitt Naval Postgraduate School
Introduction C.Eng 714 Spring 2010.
Data Quality: Practice, Technologies and Implications
DEFECT PREDICTION : USING MACHINE LEARNING
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Warehouse.
Data Warehousing and Data Mining
Data Mining: Concepts and Techniques
What's New in eCognition 9
Data Mining: Concepts and Techniques
Course Introduction CSC 576: Data Mining.
Data Mining: Introduction
Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition.
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Bird of Feather Session
Data Mining: Concepts and Techniques
Big DATA.
AI Discovery Template IBM Cloud Architecture Center
What's New in eCognition 9
Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition.
What's New in eCognition 9
Presentation transcript:

Creating a Data Mining Environment for Geosciences Interface 2002 Montreal April 18, 2002 Sara J. Graves Director, Information Technology and Systems Center Professor, Computer Science Department University of Alabama in Huntsville Director, Information Technology and Research Center National Space Science and Technology Center 256-824-6064 sgraves@itsc.uah.edu www.itsc.uah.edu

Characteristics of Science Data Varied kinds of data Raster images With structure and geometry Multispectral Time series and sequence data Numerical model outputs Multiple resolutions/multiple scales Variability of data formats Granularity of data Includes spatial and temporal dimensions Physical basis/domain knowledge needed before applying algorithms Typically requires domain-specific algorithms

Scientific Analysis Harnesses human analysis capabilities Highly creative Based on theory and hypothesis formulation Physical basis is normally used for algorithms Drawing insights about the underlying phenomena Rapidly widening gap between data collection capabilities and the ability to analyze data Potential of vast amounts of data to be unused

Data Mining Provides automation of the analysis process Can be used for dimensionality reduction when manual examination of data is impossible Can have limitations May not utilize domain knowledge May be difficult to prove validity of the results There may not be a physical basis Should be viewed as complementary tool and not a replacement for scientific analysis

Reasons for Mining Science Data Powerful tool for research and analysis given the volume of science data Necessity when manual examination of data is impossible Can allow scientists to refine/add more layers to the knowledge bases Can minimize scientists’ data handling to allow them to maximize research time Can reduce “reinventing the wheel” Can fully exploit reusable knowledge bases for different problems Can be integrated into a Next Generation Information System to provide additional services such as: Custom Order Processing Subsetting/Formatting/Gridding …. Event/Relationship Searching

Similarity between Data Mining and Scientific Analysis Process

Data Challenge Data Preparation for Mining/Analysis Results Search and Access Data Data Integration Data Transformation Data Reduction Data Mining Science Analysis Open your presentation with an attention-getting incident. Choose an incident your audience relates to. The incidence is the evidence that supports the action and proves the benefit. Beginning with a motivational incident prepares your audience for the action step that follows. Results Data Preparation for Mining/Analysis Data Sets

Typical Data Preparation Operations Data Cleaning Clean data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Fairly well handled Data Integration Integration of multiple data files Data Transformations Normalization and aggregation Data Reduction Obtain a reduced representation of the data set, which produces the same analytical results

User Perspective and Data Perspective of the Data Mining Process Analysis Decision Value Volume Transformation Knowledge Preprocessing Information Dataset Specific Algorithms Domain Specific Algorithms Open your presentation with an attention-getting incident. Choose an incident your audience relates to. The incidence is the evidence that supports the action and proves the benefit. Beginning with a motivational incident prepares your audience for the action step that follows. Data Calibration & Navigation Data Stores User Perspective Dataset Data Perspective

Scientific Data Mining Environment Stakeholders End Users Scientists

Scientist’s Perspective Define the experiment Reduce data volume Create reusable “Knowledge Base” Iterate over experiment to refine the knowledge base Minimize data handling/Maximize research Add more “layers” to the knowledge base Allow different levels of knowledge discovery: Shallow knowledge Hidden Deep

End User’s Perspective End users can be: Students Public Decision makers Other Scientists Access to data Access to knowledge base End products

NASA Workshop on Issues of Application of Data Mining to Scientific Data Held on October 19-21, 1999 at University of Alabama in Huntsville Domain Focus: Global Change Natural Hazard Terrestrial Ecology Key Recommendations Need to create a data mining environment for facilitation, scalability and automation of scientific analysis for large scale data streams Need to formulate critical partnerships between physical scientists, computer scientists and statisticians for an effective integration of analysis processes, scientific algorithms, statistical approaches and enabling computer architectures

Reasons for Building a Data Mining Environment Provide the capabilities and flexibility of creative scientific analysis Provide an infrastructure of mining algorithms and knowledge bases for creative analysis to reduce “reinventing the wheel” Provide capabilities to add “science algorithms” to the framework Support a spectrum of heterogeneous participants, data sources and technological approaches Provide a framework with components and suitable management of the interfaces between them Allow scientists to refine/add domain information to the mining environment Minimize scientists’ data handling to allow them to maximize research time Incorporate relevance feedback mechanism to learn methodologies for multiple domains Integrate data mining functionality into other distributed systems

Mining Environment: When,Where, Who and Why? User Workstation Data Mining Center GRID WHEN Real Time On-Ingest On-Demand Repeatedly WHO End Users Domain Experts Mining Experts WHY Event Relationship Association Corroboration Collaboration Data Mining

ADaM History Algorithm Development and Mining (ADaM) System ADaM system developed under NASA HQ research grant The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata. It contains over 120 different operations to be performed on the input data stream. Operations vary from specialized atmospheric science data-set specific algorithms to different digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms. Developed an Event/Relationship Search System

ADaM Engine Architecture Preprocessed Data Patterns/ Models Results Data Translated Data Processing Input HDF HDF-EOS GIF PIP-2 SSM/I Pathfinder SSM/I TDR SSM/I NESDIS Lvl 1B SSM/I MSFC Brightness Temp US Rain Landsat ASCII Grass Vectors (ASCII Text) Intergraph Raster Others... Preprocessing Analysis Output GIF Images HDF-EOS HDF Raster Images HDF SDS Polygons (ASCII, DXF) SSM/I MSFC Brightness Temp TIFF Images Others... Selection and Sampling Subsetting Subsampling Select by Value Coincidence Search Grid Manipulation Grid Creation Bin Aggregate Bin Select Grid Aggregate Grid Select Find Holes Image Processing Cropping Inversion Thresholding Others... Clustering K Means Isodata Maximum Pattern Recognition Bayes Classifier Min. Dist. Classifier Image Analysis Boundary Detection Cooccurrence Matrix Dilation and Erosion Histogram Operations Polygon Circumscript Spatial Filtering Texture Operations Genetic Algorithms Neural Networks Others...

Extensibility of ADaM Training Data Input Visualization/ Analysis General Purpose Algorithms Trainable Classifiers User Defined Training Data Input ADaM Mining Engine Input Modules Analysis Modules Output Modules Visualization/ Analysis Packages

Data Mining Environment Distributed Clients Web-based Workstation based Other Systems Common Client API Analysis/Vis Tools Knowledge Base Data Mining Server Mining Results Mining Engine (ADaM) Analysis Modules Input Output Event/ Relationship Search System Data Stores

Event/Relationship Search System Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period

An Environment for On-board processing: EVE Real-time on-board data mining can provide unique capabilities Anomaly detection Autonomous control and decision making Immediate response Direct satellite to Earth delivery of results The Sensor Web is expanding and more processing is available on-orbit 12/2/2018

Sensor Data Simulations Major EVE Components EVE Software Architecture Output Modules Analysis Input XML Based Processing Plans Inter Process Communcation Decision Support Processing Plan Editor etc. Sensor Model Library On-board Configuration Sensor Data Simulations IR Passive Microwave Testbed of On-board Systems Flight Linux RT Java Control Systems Ground Control Testbed System Specific Cross-Compiler (e.g. VXWorks)

Interchange Technology: Earth Science Markup Language (ESML) Facilitate effective utilization of distributed, heterogeneous data products Enable interchangeable tools and services Data Applications Interchange Technology Integration

Interoperability: Accessing Heterogeneous Data Earth science data comes in: Different formats, types and structures Different states of processing (raw, calibrated, derived, modeled or interpreted) Enormous volumes One approach: Standard data formats Difficult to implement and enforce Can’t anticipate all needs Some data can’t be modeled or is lost in translation The cost of converting legacy data A better approach: Interchange technologies Earth Science Markup Language HDF-EOS HDF netCDF ASCII GRIB Binary

Data Usability The Problem The Solution ESML FILE LIBRARY APPLICATION DATA FORMAT 1 FORMAT 2 FORMAT 3 DATA FORMAT 1 DATA FORMAT 2 DATA FORMAT 3 FORMAT CONVERTER READER 1 READER 2 APPLICATION Specialized code for every format Difficult to assimilate new data types Expensive to convert legacy data Data interoperability “Define Once, Use Anywhere”

Earth Science Markup Language Specialized markup language for Earth Science information based on XML Beyond traditional metadata to include both structural and semantic information needed to effect a practical runtime interpretation of a data set Benefits of ESML: Enables independently developed applications and services to effectively utilize distributed, heterogeneous data products Allows the end-user to integrate data sets of differing structures to aid in data fusion and analysis without having to write a special reader for each data set Is simple enough that end-users can create their own ESML for on- hand datasets (new or legacy)

Example: ESML file for an Image Hurricane Mitch (981027) Binary format (32 bits/pixel) Mitch.esml <ESML> <SyntacticMetaData> <Binary GeoInfo=“NoGeoInfo”> <Array name=“Track” occurs=“300“ DimName=“Y“> <Field name=“Channel 1” type=“int” size=“32” occurs=“512” DataType=“Data” DimName=“X”/> <Data/> </Array> </Binary> </SyntacticMetaData> </ESML> 512 300

Atmospheric Science Mining Applications Lightning Detection Rainfall Identification & Estimation Study Rainfall Accumulation Study Tropical Cyclone Detection and Wind Speed Estimation GOES Cumulus Cloud Classification Mesoscale Convective System Detection Detection of Jet Streams in Numerical Model Data

Tropical Cyclone Detection: Estimating Maximum Wind Speed Advanced Microwave Sounding Unit (AMSU-A)Data Water cover mask to eliminate land Laplacian filter to compute temperature gradients Science Algorithm to estimate wind speed Contiguous regions with wind speeds above a desired threshold identified Additional test to eliminate false positives Maximum wind speed and location produced Calibration/ Limb Correction/ Converted to Tb Hurricane Floyd Data Archive Mining Environment Results are placed on the web and made available to National Hurricane Center & Joint Typhoon Warning Center Result

Data Fusion and Mining: From Global Information to Local Knowledge Emergency Response Precision Agriculture Urban Environments Weather Prediction

Mining as a Web Service

Mining on Information Power Grid (IPG) using ADaM Data Archive X IPG Mining Agent IPG Processor Mining Daemon Control Database IPG Processor Mining Operations Repository IPG Processor IPG Processor Mining LDAP Server IPG Mining Agent IPG Processor Satellite Data Archive Y

Earth Science Example of Developing a Knowledge Network: Collaborative Research in Mesoscale Convective Systems Knowledge Base Information about MCSs detected Data Sets SSM/I (F13, F14) Visualization Eureka Interface ADaM System Database location size intensity etc. Eureka Spatial Generate end products while mining Add algorithm to detect MCSs Pose question and get answers from the Knowledge Repository (such as coincidence search, relationship testing) Anyone can access the knowledge base via the web Scientists/Researchers can ask questions such as: End Users What is the latitudinal distribution of MCSs? Which continent has more MCSs? What is the seasonal distribution of MCSs? What is the relationship between the number of MCSs and their intensity? Generate information useful to the general public ( students, researchers, policy makers etc) Images Forecast aids General Science information Answer the practical side of the problem

Challenges Develop and document common/standard interfaces for interoperability of data and services Design new data models for handling real-time/streaming input data fusion/integration Design and develop distributed standardized catalog capabilities Develop advanced resource allocation and load balancing techniques Exploit distributed infrastructure for enhanced data mining functionality (grids, web services, etc) Develop more intelligent and intuitive user interfaces Develop ontologies of scientific data, processes and data mining techniques for multiple domains Support language and system independent components Incorporate data mining into scientific curricula