Silvia Nittel
University of California, Los Angeles
Scientific Data Mining in ESP2Net
Silvia Nittel, GeoSKI, February 2000
Overview
  Motivation
  What is scientific data mining?
  Examples of scientific data mining at UCLA
  CS interests in scientific data mining
    –Tools
    –Collaboration paradigms
    –Interoperability
Motivation
  The advent of the computer has brought the ability to generate and store huge amounts of data.
    –Business data (DBs)
    –Scientific data
  What is it?
  The process of extracting useful information from this data has become more formalized, and the term "data mining" has been coined for it.
What is data mining?
  Definition: Data mining is the process of extracting previously unknown, comprehensible, valid and actionable information from large data stores (and using it to make crucial business decisions).
  There are two approaches:
    –verification driven, which aims to validate a hypothesis postulated by a user, or
    –discovery driven, which is the automatic discovery of information by the use of appropriate tools.
  The discovery driven approach depends on a more sophisticated and structured search of the data for associations, patterns, rules or functions, which the analyst then reviews for value.
Process
What is scientific data mining?
  Data mining started with "simple" information (business data) of the kind held in a DBMS; this is called OLAP (online analytical processing).
  Scientific data mining:
    –Data is more complex.
    –Data is much larger.
    –A discovery-oriented approach is often used.
  Domains: medicine, biology, physics, weather…
  Principle of the scientific method:
    –observation-hypothesis-experiment cycle
  Data mining for science:
    –"observation-hypothesis" is supported by discovery driven mining
    –"hypothesis-experiment" is supported by verification driven mining
Example: Farming Environment
  Goal:
    –Optimize crop yield while minimizing the resources supplied.
    –How: identify which factors affect the crop yield.
  One analysis looked at over 64 separate items, measured over a number of years, to extract the items that were significant.
  Initial analysis: discovery driven mining
    –Attempt to find which parameters were significant, either by themselves or in conjunction with others.
    –Use statistical methods to determine the significant parameters and their relative influence.
    –Result: derive an equation of interdependence.
  Later on: verify the equation against new datasets via verification driven mining.
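The discovery driven step on this slide can be illustrated with a minimal sketch. It screens candidate factors by their Pearson correlation with yield and ranks them by strength. This is only one simple statistical method and not necessarily the one used in the analysis described; the factor names, the data and the helper `rank_factors` are invented for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rank_factors(factors, crop_yield):
    """Return (name, r) pairs sorted by decreasing absolute correlation."""
    scores = [(name, pearson(values, crop_yield))
              for name, values in factors.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Synthetic data, invented for this sketch (not from any real farm survey).
rng = random.Random(0)
n = 200
factors = {
    "rainfall": [rng.uniform(0, 1) for _ in range(n)],
    "fertilizer": [rng.uniform(0, 1) for _ in range(n)],
    "soil_ph": [rng.uniform(0, 1) for _ in range(n)],  # has no effect below
}
# Hypothetical "true" relationship: yield depends mostly on rainfall.
crop_yield = [3.0 * r + 1.0 * f + rng.gauss(0, 0.2)
              for r, f in zip(factors["rainfall"], factors["fertilizer"])]

ranking = rank_factors(factors, crop_yield)
for name, r in ranking:
    print(f"{name:10s} r = {r:+.2f}")
```

Pairwise correlation only detects factors that matter by themselves; finding parameters that are significant "in conjunction with others" would need a multivariate method such as regression.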
Example: Global Climate Change
  Often a verification driven mining approach.
    –Climate data has been collected for many centuries.
    –It is extended into the more distant past through activities such as the analysis of ice core samples from the Antarctic.
    –At the same time, a number of different predictive models have been proposed for future climatic conditions.
  Using a predictive model:
    –Use sample data from the past.
    –Verify the predictive models by running them on historical data and comparing the results with the sample data.
    –From this, the models can be refined further and used for another round of verification driven mining.
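The verification loop on this slide can be sketched as follows: each candidate model is run over the historical period, and its output is compared with the recorded values using root-mean-square error. The two models, the observation values and the names `hindcast` and `rmse` are all invented for illustration; they are not real climate data or models from the project.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(observed))


def hindcast(model, years):
    """Run a model over past years, as if predicting them."""
    return [model(year) for year in years]


# Invented "historical record" (temperature anomaly per decade), NOT real data.
years = list(range(1900, 2000, 10))
observed = [-0.20, -0.15, -0.10, 0.00, 0.05, 0.00, 0.05, 0.10, 0.25, 0.40]


# Two hypothetical candidate models to be verified against the record.
def flat_model(year):
    return 0.0


def trend_model(year):
    return 0.005 * (year - 1940)


candidates = {"flat": flat_model, "trend": trend_model}
errors = {name: rmse(hindcast(model, years), observed)
          for name, model in candidates.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name:6s} RMSE = {err:.3f}")
print("best fit on historical data:", best)
```

The model with the lowest error would be the one refined and submitted to the next verification round.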
Scientific Data Mining at UCLA
  Project scope:
    –ESP2Net: Earth Science Partners' Private Network
    –Computer science: UCLA, HRL
    –Earth science: JPL, Scripps, U Arizona
  Scientific data mining:
    –Verification driven approach
    –Large amounts of raster satellite data
Scientific Data Mining at UCLA
  Hypothesis: coastal rainfall is correlated with remote convective events in the tropical Pacific.
    1 "Warm pool" develops in the tropical Pacific Ocean
    2 Warm, moist air rapidly rises
    3 Vigorous convection produces very high, cold clouds
    4 Storm systems push a "moisture flare" eastward
    5 Heavy rainfall over the Southwest U.S.
  Partners and datasets (connected via VPN):
    –UA: ISCCP DX, CL
    –JPL: TOVS, NVAP, MLS
    –Scripps: Precipitation
  Operators:
    –Cluster operators
    –Matching operators
    –Tracking operators
    –Statistical operators
    –Correlation operators
    –GLINT operators
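A minimal sketch of what a correlation operator for this hypothesis might compute, assuming two evenly sampled time series: it correlates a tropical convection index with coastal rainfall at increasing time lags and reports the lag with the strongest link. The function `lagged_correlation`, the synthetic series and the built-in delay of 4 steps are assumptions made for illustration; they do not reproduce the project's operators or data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lagged_correlation(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]   # drop the last `lag` driver samples
        y = response[lag:]                # drop the first `lag` responses
        result[lag] = pearson(x, y)
    return result


# Synthetic series, invented for this sketch (not ISCCP or precipitation data).
rng = random.Random(1)
convection = [rng.gauss(0, 1) for _ in range(300)]
true_lag = 4  # hypothetical delay between convection and coastal rainfall
rainfall = ([rng.gauss(0, 0.5) for _ in range(true_lag)] +
            [0.8 * c + rng.gauss(0, 0.5) for c in convection[:-true_lag]])

correlations = lagged_correlation(convection, rainfall, max_lag=10)
best_lag = max(correlations, key=correlations.get)
print("strongest correlation at lag", best_lag)
```

A clear peak at a positive lag would be consistent with the hypothesis that remote convection precedes coastal rainfall, although correlation alone does not establish the physical mechanism.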
Visualization
  Convective cloud cluster motion
    –ISCCP CL, March (UA)
  Water vapor motion in the atmosphere
    –NVAP, March (Scripps)
  Different perspective reveals new info
    –NVAP stacking and slicing (JPL)
  Cloud movie
  Water vapor movie
Challenges of Scientific Data Mining
  Challenges:
  Distributed collaboration
    –share results (passive)
    –share analysis processes (active)
  Leverage partners' expertise and efforts
  Re-use core analysis tools (operators)
  Large datasets, decadal time spans (> ½ TB data)
  Project goal: build a flexible and extensible framework for scientific investigations that
    –is distributed and internet-based,
    –provides reusable, extensible, efficient tools,
    –addresses interoperability and collaboration.
UCLA Support of Scientific Data Mining
  Re-usable tools:
    –Conquest (CONcurrent Queries in Space and Time)
  Collaboration support:
    –Scientific Markup Language (SEML): XML-based Scientific Experiment Logbook
    –Conquest (distributed queries)
    –Secure collaboration (Virtual Private Networks)
  Interoperability:
    –OpenGIS standard to represent data
    –CORBA
    –Java
Summary
  Scientific data mining is a relatively new research area (first conference in 1994, KDD).
    –Science (hypothesis)
    –Statistics (methods)
    –Computer science (visualization, animation)