Download presentation
Presentation is loading. Please wait.
Published byVirginia Fletcher Modified over 6 years ago
1
Creating a Data Mining Environment for Geosciences
Interface 2002 Montreal April 18, 2002 Sara J. Graves Director, Information Technology and Systems Center Professor, Computer Science Department University of Alabama in Huntsville Director, Information Technology and Research Center National Space Science and Technology Center
3
Characteristics of Science Data
Varied kinds of data Raster images With structure and geometry Multispectral Time series and sequence data Numerical model outputs Multiple resolutions/multiple scales Variability of data formats Granularity of data Includes spatial and temporal dimensions Physical basis/domain knowledge needed before applying algorithms Typically requires domain-specific algorithms
4
Scientific Analysis Harnesses human analysis capabilities
Highly creative Based on theory and hypothesis formulation Physical basis is normally used for algorithms Drawing insights about the underlying phenomena Rapidly widening gap between data collection capabilities and the ability to analyze data Potential of vast amounts of data to be unused
5
Data Mining Provides automation of the analysis process
Can be used for dimensionality reduction when manual examination of data is impossible Can have limitations May not utilize domain knowledge May be difficult to prove validity of the results There may not be a physical basis Should be viewed as complementary tool and not a replacement for scientific analysis
6
Reasons for Mining Science Data
Powerful tool for research and analysis given the volume of science data Necessity when manual examination of data is impossible Can allow scientists to refine/add more layers to the knowledge bases Can minimize scientists’ data handling to allow them to maximize research time Can reduce “reinventing the wheel” Can fully exploit reusable knowledge bases for different problems Can be integrated into a Next Generation Information System to provide additional services such as: Custom Order Processing Subsetting/Formatting/Gridding …. Event/Relationship Searching
7
Similarity between Data Mining and Scientific Analysis Process
8
Data Challenge Data Preparation for Mining/Analysis Results Search and
Access Data Data Integration Data Transformation Data Reduction Data Mining Science Analysis Open your presentation with an attention-getting incident. Choose an incident your audience relates to. The incidence is the evidence that supports the action and proves the benefit. Beginning with a motivational incident prepares your audience for the action step that follows. Results Data Preparation for Mining/Analysis Data Sets
9
Typical Data Preparation Operations
Data Cleaning Clean data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Fairly well handled Data Integration Integration of multiple data files Data Transformations Normalization and aggregation Data Reduction Obtain a reduced representation of the data set, which produces the same analytical results
10
User Perspective and Data Perspective of the Data Mining Process
Analysis Decision Value Volume Transformation Knowledge Preprocessing Information Dataset Specific Algorithms Domain Specific Algorithms Open your presentation with an attention-getting incident. Choose an incident your audience relates to. The incidence is the evidence that supports the action and proves the benefit. Beginning with a motivational incident prepares your audience for the action step that follows. Data Calibration & Navigation Data Stores User Perspective Dataset Data Perspective
11
Scientific Data Mining Environment Stakeholders
End Users Scientists
12
Scientist’s Perspective
Define the experiment Reduce data volume Create reusable “Knowledge Base” Iterate over experiment to refine the knowledge base Minimize data handling/Maximize research Add more “layers” to the knowledge base Allow different levels of knowledge discovery: Shallow knowledge Hidden Deep
13
End User’s Perspective
End users can be: Students Public Decision makers Other Scientists Access to data Access to knowledge base End products
14
NASA Workshop on Issues of Application of Data Mining to Scientific Data
Held on October 19-21, 1999 at University of Alabama in Huntsville Domain Focus: Global Change Natural Hazard Terrestrial Ecology Key Recommendations Need to create a data mining environment for facilitation, scalability and automation of scientific analysis for large scale data streams Need to formulate critical partnerships between physical scientists, computer scientists and statisticians for an effective integration of analysis processes, scientific algorithms, statistical approaches and enabling computer architectures
15
Reasons for Building a Data Mining Environment
Provide the capabilities and flexibility of creative scientific analysis Provide an infrastructure of mining algorithms and knowledge bases for creative analysis to reduce “reinventing the wheel” Provide capabilities to add “science algorithms” to the framework Support a spectrum of heterogeneous participants, data sources and technological approaches Provide a framework with components and suitable management of the interfaces between them Allow scientists to refine/add domain information to the mining environment Minimize scientists’ data handling to allow them to maximize research time Incorporate relevance feedback mechanism to learn methodologies for multiple domains Integrate data mining functionality into other distributed systems
16
Mining Environment: When,Where, Who and Why?
User Workstation Data Mining Center GRID WHEN Real Time On-Ingest On-Demand Repeatedly WHO End Users Domain Experts Mining Experts WHY Event Relationship Association Corroboration Collaboration Data Mining
17
ADaM History Algorithm Development and Mining (ADaM) System
ADaM system developed under NASA HQ research grant The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata. It contains over 120 different operations to be performed on the input data stream. Operations vary from specialized atmospheric science data-set specific algorithms to different digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms. Developed an Event/Relationship Search System
18
ADaM Engine Architecture
Preprocessed Data Patterns/ Models Results Data Translated Data Processing Input HDF HDF-EOS GIF PIP-2 SSM/I Pathfinder SSM/I TDR SSM/I NESDIS Lvl 1B SSM/I MSFC Brightness Temp US Rain Landsat ASCII Grass Vectors (ASCII Text) Intergraph Raster Others... Preprocessing Analysis Output GIF Images HDF-EOS HDF Raster Images HDF SDS Polygons (ASCII, DXF) SSM/I MSFC Brightness Temp TIFF Images Others... Selection and Sampling Subsetting Subsampling Select by Value Coincidence Search Grid Manipulation Grid Creation Bin Aggregate Bin Select Grid Aggregate Grid Select Find Holes Image Processing Cropping Inversion Thresholding Others... Clustering K Means Isodata Maximum Pattern Recognition Bayes Classifier Min. Dist. Classifier Image Analysis Boundary Detection Cooccurrence Matrix Dilation and Erosion Histogram Operations Polygon Circumscript Spatial Filtering Texture Operations Genetic Algorithms Neural Networks Others...
19
Extensibility of ADaM Training Data Input Visualization/ Analysis
General Purpose Algorithms Trainable Classifiers User Defined Training Data Input ADaM Mining Engine Input Modules Analysis Modules Output Modules Visualization/ Analysis Packages
20
Data Mining Environment
Distributed Clients Web-based Workstation based Other Systems Common Client API Analysis/Vis Tools Knowledge Base Data Mining Server Mining Results Mining Engine (ADaM) Analysis Modules Input Output Event/ Relationship Search System Data Stores
21
Event/Relationship Search System
Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period
22
An Environment for On-board processing: EVE
Real-time on-board data mining can provide unique capabilities Anomaly detection Autonomous control and decision making Immediate response Direct satellite to Earth delivery of results The Sensor Web is expanding and more processing is available on-orbit 12/2/2018
23
Sensor Data Simulations
Major EVE Components EVE Software Architecture Output Modules Analysis Input XML Based Processing Plans Inter Process Communcation Decision Support Processing Plan Editor etc. Sensor Model Library On-board Configuration Sensor Data Simulations IR Passive Microwave Testbed of On-board Systems Flight Linux RT Java Control Systems Ground Control Testbed System Specific Cross-Compiler (e.g. VXWorks)
24
Interchange Technology: Earth Science Markup Language (ESML)
Facilitate effective utilization of distributed, heterogeneous data products Enable interchangeable tools and services Data Applications Interchange Technology Integration
25
Interoperability: Accessing Heterogeneous Data
Earth science data comes in: Different formats, types and structures Different states of processing (raw, calibrated, derived, modeled or interpreted) Enormous volumes One approach: Standard data formats Difficult to implement and enforce Can’t anticipate all needs Some data can’t be modeled or is lost in translation The cost of converting legacy data A better approach: Interchange technologies Earth Science Markup Language HDF-EOS HDF netCDF ASCII GRIB Binary
26
Data Usability The Problem The Solution
ESML FILE LIBRARY APPLICATION DATA FORMAT 1 FORMAT 2 FORMAT 3 DATA FORMAT 1 DATA FORMAT 2 DATA FORMAT 3 FORMAT CONVERTER READER 1 READER 2 APPLICATION Specialized code for every format Difficult to assimilate new data types Expensive to convert legacy data Data interoperability “Define Once, Use Anywhere”
27
Earth Science Markup Language
Specialized markup language for Earth Science information based on XML Beyond traditional metadata to include both structural and semantic information needed to effect a practical runtime interpretation of a data set Benefits of ESML: Enables independently developed applications and services to effectively utilize distributed, heterogeneous data products Allows the end-user to integrate data sets of differing structures to aid in data fusion and analysis without having to write a special reader for each data set Is simple enough that end-users can create their own ESML for on- hand datasets (new or legacy)
28
Example: ESML file for an Image
Hurricane Mitch (981027) Binary format (32 bits/pixel) Mitch.esml <ESML> <SyntacticMetaData> <Binary GeoInfo=“NoGeoInfo”> <Array name=“Track” occurs=“300“ DimName=“Y“> <Field name=“Channel 1” type=“int” size=“32” occurs=“512” DataType=“Data” DimName=“X”/> <Data/> </Array> </Binary> </SyntacticMetaData> </ESML> 512 300
29
Atmospheric Science Mining Applications
Lightning Detection Rainfall Identification & Estimation Study Rainfall Accumulation Study Tropical Cyclone Detection and Wind Speed Estimation GOES Cumulus Cloud Classification Mesoscale Convective System Detection Detection of Jet Streams in Numerical Model Data
30
Tropical Cyclone Detection: Estimating Maximum Wind Speed
Advanced Microwave Sounding Unit (AMSU-A)Data Water cover mask to eliminate land Laplacian filter to compute temperature gradients Science Algorithm to estimate wind speed Contiguous regions with wind speeds above a desired threshold identified Additional test to eliminate false positives Maximum wind speed and location produced Calibration/ Limb Correction/ Converted to Tb Hurricane Floyd Data Archive Mining Environment Results are placed on the web and made available to National Hurricane Center & Joint Typhoon Warning Center Result
31
Data Fusion and Mining: From Global Information to Local Knowledge
Emergency Response Precision Agriculture Urban Environments Weather Prediction
32
Mining as a Web Service
33
Mining on Information Power Grid (IPG) using ADaM
Data Archive X IPG Mining Agent IPG Processor Mining Daemon Control Database IPG Processor Mining Operations Repository IPG Processor IPG Processor Mining LDAP Server IPG Mining Agent IPG Processor Satellite Data Archive Y
34
Earth Science Example of Developing a Knowledge Network: Collaborative Research in Mesoscale Convective Systems Knowledge Base Information about MCSs detected Data Sets SSM/I (F13, F14) Visualization Eureka Interface ADaM System Database location size intensity etc. Eureka Spatial Generate end products while mining Add algorithm to detect MCSs Pose question and get answers from the Knowledge Repository (such as coincidence search, relationship testing) Anyone can access the knowledge base via the web Scientists/Researchers can ask questions such as: End Users What is the latitudinal distribution of MCSs? Which continent has more MCSs? What is the seasonal distribution of MCSs? What is the relationship between the number of MCSs and their intensity? Generate information useful to the general public ( students, researchers, policy makers etc) Images Forecast aids General Science information Answer the practical side of the problem
35
Challenges Develop and document common/standard interfaces for interoperability of data and services Design new data models for handling real-time/streaming input data fusion/integration Design and develop distributed standardized catalog capabilities Develop advanced resource allocation and load balancing techniques Exploit distributed infrastructure for enhanced data mining functionality (grids, web services, etc) Develop more intelligent and intuitive user interfaces Develop ontologies of scientific data, processes and data mining techniques for multiple domains Support language and system independent components Incorporate data mining into scientific curricula
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.