UH-DMML: Ongoing Data Mining Research


Data Mining and Machine Learning Group, Computer Science Department, University of Houston, TX
April 4, 2008

Dr. Christoph F. Eick
Abraham Bagherjeiran*, Ulvi Celepcikay, Chun-Sheng Chen, Ji Yeon Choo*, Wei Ding, Paulo Martins, Christian Giusti*, Rachsuda Jiamthapthaksin, Dan Jiang*, Seungchan Lee, Rachana Parmar*, Vadeerat Rinsurongkawong, Justin Thomas*, Banafsheh Vaezian*, Jing Wang*

Current Topics Investigated

1. Development of Clustering Algorithms with Plug-in Fitness Functions
2. Discovering Regional Knowledge in Geo-Referenced Datasets
3. Shape-Aware Clustering Algorithms
4. Discovering Risk Patterns of Arsenic
5. Emergent Pattern Discovery
6. Machine Learning (Adaptive Clustering, Distance Function Learning, Using Machine Learning for Spacecraft Simulation)
7. Cougar^2: Open Source DMML Framework

[Diagram: Region Discovery Framework and Applications of Region Discovery Framework. Components: Spatial Databases, Database Integration Tool, Data Set, Domain Expert, Measure of Interestingness Acquisition Tool, Fitness Function, Family of Clustering Algorithms, Visualization Tools, Region Discovery Display, Ranked Set of Interesting Regions and their Properties.]

1. Development of Clustering Algorithms with Plug-in Fitness Functions

Clustering with Plug-in Fitness Functions

Motivation: Finding subgroups in geo-referenced datasets has many applications. However, in many applications the subgroups of interest do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation. Consequently, it is desirable to develop clustering algorithms that support plug-in fitness functions, which allow domain experts to express the desirable characteristics of the subgroups they are looking for. Only very few clustering algorithms published in the literature support plug-in fitness functions; existing clustering paradigms therefore have to be modified and extended by our research to provide such capabilities. Many other applications for clustering with plug-in fitness functions exist.
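To make the plug-in idea concrete, the following minimal sketch passes the fitness function to the clustering routine as a parameter, so the same search returns different clusterings for different fitness functions. It is an illustration only, not one of the group's algorithms: the function names, the candidate-representative search, and both example fitness functions are assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Clustering = List[List[Point]]
Fitness = Callable[[Clustering], float]


def assign_to_representatives(points: Sequence[Point],
                              reps: Sequence[Point]) -> Clustering:
    """Form clusters by assigning each point to its nearest representative."""
    clusters: Clustering = [[] for _ in reps]
    for p in points:
        nearest = min(
            range(len(reps)),
            key=lambda i: (p[0] - reps[i][0]) ** 2 + (p[1] - reps[i][1]) ** 2,
        )
        clusters[nearest].append(p)
    return [c for c in clusters if c]  # drop empty clusters


def best_clustering(points: Sequence[Point],
                    candidate_rep_sets: Sequence[Sequence[Point]],
                    fitness: Fitness) -> Clustering:
    """Return the candidate clustering that maximizes the plug-in fitness."""
    candidates = (assign_to_representatives(points, reps)
                  for reps in candidate_rep_sets)
    return max(candidates, key=fitness)


def compactness(clustering: Clustering) -> float:
    """Traditional criterion: negative sum of squared distances to centroids."""
    total = 0.0
    for c in clustering:
        cx = sum(p[0] for p in c) / len(c)
        cy = sum(p[1] for p in c) / len(c)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in c)
    return -total


def prefers_large_clusters(clustering: Clustering) -> float:
    """Hypothetical domain-specific criterion: reward few, large clusters."""
    return float(sum(len(c) ** 2 for c in clustering))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
candidates = [[(0.0, 0.0)], [(0.0, 0.0), (10.0, 10.0)]]
by_compactness = best_clustering(points, candidates, compactness)
by_size = best_clustering(points, candidates, prefers_large_clusters)
```

With these four points, the compactness fitness selects the two-cluster candidate, while the size-oriented fitness selects the single-cluster candidate; only the plugged-in function differs.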

Current Suite of Clustering Algorithms

Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
Grid-based: SCMRG, SCHG
Agglomerative: MOSAIC, SCAH
Density-based: SCDE

2. Discovering Regional Knowledge in Geo-Referenced Datasets

Mining Regional Knowledge in Spatial Datasets

Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets.

[Diagram: Framework for Mining Regional Knowledge. Components: Spatial Databases, Integrated Data Set, Domain Experts, Measures of Interestingness, Fitness Functions, Family of Clustering Algorithms, Hierarchical Grid-based & Density-based Algorithms, Regional Association Rule Mining Algorithms, Ranked Set of Interesting Regions and their Properties, Regional Knowledge. Figure: Spatial Risk Patterns of Arsenic.]

Given:
- A dataset $O$ with a schema $R$
- A distance function $d$ defined on instances of $R$
- A fitness function $q(X)$ that evaluates a clustering $X = \{c_1, \dots, c_k\}$ as follows:

$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{\beta}, \qquad \beta > 1$$

Objective: Find $c_1, \dots, c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \neq j$
2. $X = \{c_1, \dots, c_k\}$ maximizes $q(X)$
3. All clusters $c_i \in X$ are contiguous (each pair of objects belonging to $c_i$ has to be Delaunay-connected with respect to $c_i$ and to $d$)
4. $c_1 \cup \dots \cup c_k \subseteq O$

The clusters $c_1, \dots, c_k$ are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported.
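The fitness function above can be sketched in a few lines. This sketch assumes that the parameter stated as greater than 1 acts as an exponent on the cluster size, and it uses a hypothetical interestingness measure (the share of flagged objects in a cluster, minus a threshold); neither the measure nor the function names come from the slides.

```python
from typing import Callable, List, Sequence, Tuple

Cluster = Sequence[int]  # here: one 0/1 flag per object (hypothetical encoding)
Interestingness = Callable[[Cluster], float]


def reward(cluster: Cluster, interestingness: Interestingness,
           beta: float) -> float:
    """reward(c) = interestingness(c) * size(c)^beta (beta as exponent is assumed)."""
    return interestingness(cluster) * len(cluster) ** beta


def q(clustering: Sequence[Cluster], interestingness: Interestingness,
      beta: float = 2.0) -> float:
    """Fitness of a clustering: the sum of the rewards of its clusters."""
    if beta <= 1:
        raise ValueError("beta must be greater than 1")
    return sum(reward(c, interestingness, beta) for c in clustering)


def ranked_regions(clustering: Sequence[Cluster],
                   interestingness: Interestingness,
                   beta: float = 2.0,
                   min_reward: float = 0.0) -> List[Tuple[float, Cluster]]:
    """Rank clusters by reward and drop those at or below min_reward."""
    scored = [(reward(c, interestingness, beta), c) for c in clustering]
    kept = [(r, c) for r, c in scored if r > min_reward]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def flagged_share_excess(cluster: Cluster, threshold: float = 0.5) -> float:
    """Hypothetical measure: share of flagged objects above a threshold, else 0."""
    share = sum(1 for flag in cluster if flag > 0) / len(cluster)
    return max(0.0, share - threshold)


region_a = [1, 1, 1, 0]   # 75% flagged, size 4
region_b = [1, 0]         # 50% flagged, size 2: not interesting
region_c = [1, 1]         # 100% flagged, size 2
clustering = [region_a, region_b, region_c]

total_fitness = q(clustering, flagged_share_excess, beta=2.0)
ranking = ranked_regions(clustering, flagged_share_excess, beta=2.0)
```

Because the size is raised to a power greater than 1, a larger region with moderate interestingness (`region_a`) can outrank a small, pure one (`region_c`), and `region_b` is dropped from the report because its reward is zero.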

Finding Regional Co-location Patterns in Spatial Datasets

Objective: Find co-location regions using various clustering algorithms and novel fitness functions.

Applications:
1. Finding regions on Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti-co-location.
2. Finding co-location patterns involving chemical concentrations with values in the wings (tails) of their statistical distribution in Texas' ground water supply. Figure 2 shows the discovered regions and their associated chemical patterns.

Figure 1: Co-location regions involving deep and shallow ice on Mars
Figure 2: Chemical co-location patterns in Texas Water Supply

Regional Pattern Discovery via Principal Component Analysis
Oner Ulvi Celepcikay

Objective: Discover regions and regional patterns using Principal Component Analysis (PCA).

Workflow:
1. Calculate principal components and variance captured
2. Apply PCA-based fitness function and assign rewards
3. Discover regions and regional patterns (globally hidden)
4. Region discovery post-processing

Applications: Region discovery, regional pattern discovery in spatial data (e.g., finding interesting sub-regions in Texas where arsenic is highly correlated with fluoride and pH), and regional regression.

Idea: Correlation patterns among attributes tend to be hidden globally. With the help of statistical approaches and our region discovery framework, some interesting regional correlations among the attributes can be discovered.
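The idea that correlations can be hidden globally but visible regionally can be shown with a toy calculation. The sketch below is not the PCA-based method itself; it only computes plain Pearson correlations on made-up values for two hypothetical regions, in which the two attributes are positively correlated in one region and negatively correlated in the other.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up attribute values (e.g. two chemical concentrations) per region.
region_1_a = [1.0, 2.0, 3.0, 4.0]
region_1_b = [1.0, 2.0, 3.0, 4.0]   # rises with attribute a
region_2_a = [1.0, 2.0, 3.0, 4.0]
region_2_b = [4.0, 3.0, 2.0, 1.0]   # falls with attribute a

corr_region_1 = pearson(region_1_a, region_1_b)
corr_region_2 = pearson(region_2_a, region_2_b)
corr_global = pearson(region_1_a + region_2_a, region_1_b + region_2_b)
```

In this toy data each region shows a perfect correlation (positive in one, negative in the other), while the pooled data shows none, which is the kind of pattern a purely global analysis would miss.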

Regional Pattern Discovery via Principal Component Analysis: Region Discovery Post-Processing
Oner Ulvi Celepcikay

Post-processing using PCA results:
- PCA-based distance matrix
- Highest Correlated Attributes Set (HCAS) distance matrix

Post-processing using regression analysis:
- Global regression model
- Regional effects model
- t-statistics model (to test whether the difference between regions is statistically significant)

3. Shape-Aware Clustering Algorithms

Discovering Clusters of Arbitrary Shapes
Rachsuda Jiamthapthaksin, Christian Giusti, and Jiyeon Choo

Objective: Detect arbitrary shape clusters effectively and efficiently.

1st Approach: Develop cluster evaluation measures for non-spherical cluster shapes.
- Derive a shape signature for a given shape (boundary-based, region-based, or skeleton-based shape representation).
- Transform the shape signature into a fitness function and use it in a clustering algorithm.

2nd Approach: Approximate arbitrary shapes using unions of small convex polygons.

3rd Approach: Employ density estimation techniques for discovering arbitrary shape clusters.

Figure 1: Chain-like clusters in Volcano dataset.
Figure 2 (a): Binary Complex9 original dataset; two class values.
Figure 2 (b): Binary Complex9 dataset density function.

4. Discovering Risk Patterns of Arsenic

Discovering Spatial Patterns of Risk from Arsenic: A Case Study of Texas Ground Water
Wei Ding, Vadeerat Rinsurongkawong and Rachsuda Jiamthapthaksin

Objective: Analysis of arsenic contamination and its causes.

Collaboration with Dr. Bridget Scanlon and her research group at the University of Texas at Austin.

[Figures: Our approach; Experimental results]

5. Emergent Pattern Discovery

Objectives of Emergent Pattern Discovery

Emergent patterns capture how the most recent data differ from data in the past. Emergent pattern discovery finds what is new in data.

Challenges of emergent pattern discovery include:
- The development of a formal framework that characterizes different types of emergent patterns
- The development of a methodology to detect emergent patterns in spatio-temporal datasets
- The capability to find emergent patterns in regions of arbitrary shape and granularity
- The development of scalable emergent pattern discovery algorithms that are able to cope with large data sizes and large numbers of patterns

[Figure: Emergent pattern discovery for earthquake data. Panels: Time 0; Time 1; the change from time 0 to time 1.]

Approaches for Emergent Pattern Discovery
Vadeerat Rinsurongkawong and Chun-Sheng Chen

Approach 1: Direct analysis
Approach 2: Analysis when object ID is unknown. Indirect analysis through forward-backward analysis based on re-clustering.
Approach 3: Analysis when object ID is known. Analysis by computing Agreement and Containment.
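When object IDs are known, clusters found at two points in time can be compared as sets of IDs. The slide names Agreement and Containment but does not define them, so the sketch below assumes set-overlap definitions (agreement as intersection over union, containment as the share of one cluster that lies inside the other). The function names and the novelty threshold are likewise illustrative assumptions.

```python
from typing import List, Set


def agreement(a: Set[int], b: Set[int]) -> float:
    """Assumed definition: |a ∩ b| / |a ∪ b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: Set[int], b: Set[int]) -> float:
    """Assumed definition: share of cluster a that is contained in cluster b."""
    return len(a & b) / len(a) if a else 0.0


def novel_clusters(old: List[Set[int]], new: List[Set[int]],
                   threshold: float = 0.5) -> List[Set[int]]:
    """New clusters whose best agreement with any old cluster is below threshold."""
    return [
        c for c in new
        if max((agreement(c, o) for o in old), default=0.0) < threshold
    ]


old_clusters = [{1, 2, 3, 4}]
new_clusters = [{1, 2, 3, 4, 5}, {10, 11}]

a_score = agreement({1, 2, 3, 4}, {3, 4, 5, 6})
c_score = containment({1, 2, 3, 4}, {3, 4, 5, 6})
emergent = novel_clusters(old_clusters, new_clusters)
```

Here the first new cluster largely agrees with an old one and is treated as a continuation, while the second has no counterpart in the old clustering and is reported as emergent.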

6. Machine Learning

Online Learning of Spacecraft Simulation Models

- Developed an online machine learning methodology for increasing the accuracy of spacecraft simulation models
- Directly applied to the International Space Station for use in the Johnson Space Center Mission Control Center

Approach:
- Use a regional sliding-window technique, a contribution of this research, that regionally maintains the most recent data
- Build new system models incrementally from streaming sensor data using the best training approach (regression trees, model trees, artificial neural networks, etc.)
- Use a knowledge fusion approach, also a contribution of this research, to reduce predictive error spikes when making predictions in situations that are quite different from the training scenarios

Benefits:
- Increases the effectiveness of NASA mission planning, real-time mission support, and training
- Reacts to the dynamic and complex behavior of the International Space Station (ISS)
- Removes the need for the current approach of refining models manually

Results:
- Substantial error reductions of up to 76% in our experimental evaluation on the ISS Electrical Power System
- Cost reductions due to complete automation of the previous manually intensive approach
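The slides do not describe how the regional sliding window is implemented. One possible reading, sketched below under that assumption, is that the input space is partitioned into regions and each region keeps its own bounded window of recent samples, so that data from rarely visited operating conditions is not pushed out by a stream of samples from common ones. The class name, the `region_of` function, and the capacity parameter are hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Hashable, List, Tuple

Sample = Tuple[Tuple[float, ...], float]  # (sensor features, target value)


class RegionalSlidingWindow:
    """Keeps the most recent samples separately for each region of input space."""

    def __init__(self, region_of: Callable[[Tuple[float, ...]], Hashable],
                 per_region_capacity: int) -> None:
        self._region_of = region_of
        # deque(maxlen=...) silently discards the oldest sample when full.
        self._windows: Dict[Hashable, Deque[Sample]] = defaultdict(
            lambda: deque(maxlen=per_region_capacity)
        )

    def add(self, features: Tuple[float, ...], target: float) -> None:
        """Store a new sample in the window of the region it falls into."""
        self._windows[self._region_of(features)].append((features, target))

    def training_set(self) -> List[Sample]:
        """All retained samples, usable to rebuild a model incrementally."""
        return [sample for window in self._windows.values() for sample in window]


# Hypothetical partition: regions are bands of width 10 on the first feature.
window = RegionalSlidingWindow(lambda f: int(f[0] // 10), per_region_capacity=2)
for value in (1.0, 2.0, 3.0, 15.0):
    window.add((value,), target=value * 2)

retained = [features for features, _ in window.training_set()]
```

In this run the oldest sample of the crowded region is evicted, while the single sample from the second region is retained regardless of how many samples arrive elsewhere.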

Distance Function Learning Using Intelligent Weight Updating and Supervised Clustering
Chunsheng

Distance function: Measures the similarity between objects.

Objective: Construct a good distance function using AI and machine learning techniques that learn attribute weights.

The framework:
1. Generate a distance function: Apply weight updating schemes / search strategies to find a good distance function candidate.
2. Clustering: Use this distance function candidate in a clustering algorithm to cluster the dataset.
3. Evaluate the distance function: Evaluate the goodness of the distance function by evaluating the clustering result according to a predefined evaluation function.

[Diagram: Weight Updating Scheme / Search Strategy, Distance Function Q, Clustering X, Clustering Evaluation, Goodness of the Distance Function q(X); illustrations of a bad distance function Q1 and a good distance function Q2.]
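The three framework steps can be sketched as a small loop: propose attribute weights, cluster with the resulting weighted distance, and score the clustering. The sketch below is a simplification under stated assumptions: it replaces the weight updating scheme with an exhaustive search over a small grid of weights, uses nearest-representative assignment as the clustering step, and uses class purity as the evaluation function. All names and data are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def weighted_distance(x: Vector, y: Vector, weights: Vector) -> float:
    """Weighted Euclidean distance; a weight of 0 ignores that attribute."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)) ** 0.5


def cluster(objects: Sequence[Vector], representatives: Sequence[Vector],
            weights: Vector) -> List[List[int]]:
    """Step 2: assign each object (by index) to its nearest representative."""
    clusters: List[List[int]] = [[] for _ in representatives]
    for index, obj in enumerate(objects):
        nearest = min(
            range(len(representatives)),
            key=lambda r: weighted_distance(obj, representatives[r], weights),
        )
        clusters[nearest].append(index)
    return clusters


def purity(clusters: Sequence[Sequence[int]], labels: Sequence[str]) -> float:
    """Step 3: share of objects that carry the majority class of their cluster."""
    majority_total = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1]
        for c in clusters if c
    )
    return majority_total / len(labels)


def learn_weights(objects: Sequence[Vector], labels: Sequence[str],
                  representatives: Sequence[Vector],
                  candidate_values: Sequence[float]) -> Tuple[float, ...]:
    """Step 1 (simplified): try every weight combination, keep the best one."""
    candidates = product(candidate_values, repeat=len(objects[0]))
    return max(
        candidates,
        key=lambda w: purity(cluster(objects, representatives, w), labels),
    )


# Attribute 0 separates the classes; attribute 1 is large-scale noise.
objects = [(0, 0), (1, 10), (0, 9), (5, 0), (6, 10), (5, 1)]
labels = ["a", "a", "a", "b", "b", "b"]
representatives = [(0, 0), (6, 10)]

baseline = purity(cluster(objects, representatives, (1.0, 1.0)), labels)
best_weights = learn_weights(objects, labels, representatives, (0.0, 0.5, 1.0))
learned = purity(cluster(objects, representatives, best_weights), labels)
```

On this toy data the unweighted distance is dominated by the noisy attribute and mixes the classes, whereas the search finds weights that suppress it and produce pure clusters.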

7. Cougar^2: Open Source Data Mining and Machine Learning Framework

Rachana Parmar, Justin Thomas, Rachsuda Jiamthapthaksin, Oner Ulvi Celepcikay
Department of Computer Science, University of Houston, Houston TX

ABSTRACT

Cougar^2 [1] is a new framework for data mining and machine learning. Its goal is to simplify the transition of algorithms from paper to actual implementation. It provides an intuitive API for researchers. Its design is based on object-oriented design principles and patterns. Developed using a test-first development (TFD) approach, it advocates TFD for new algorithm development. The framework has a unique design that separates the learning algorithm configuration, the algorithm itself, and the results produced by the algorithm. It allows easy storage and sharing of experiment configurations and results.

FRAMEWORK ARCHITECTURE

The framework architecture follows object-oriented design patterns and principles. It has been developed using a test-first development approach, and adding new code with unit tests is easy. There are two major components of the framework: datasets and learning algorithms.

Datasets deal with how to read and write data. There are two types of datasets: NumericDataset, where all values are of type double, and NominalDataset, where all values are of type int and each integer value is mapped to a value of a nominal attribute. A high-level Dataset interface allows code to be written against the interface, so switching from one type of dataset to another is easy.

Learning algorithms work on these data and return reusable results. Using a learning algorithm requires configuring the learner, running the learner, and using the model built by the learner. These tasks are separated into three parts: Factory, which does the configuration; Learner, which performs the actual learning or data mining task and builds the model; and Model, which can be applied to a new dataset or analyzed.
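The separation into Factory, Learner, and Model can be illustrated with a small sketch. This is not the Cougar^2 API (which is written in Java); the class names, method names, and the trivial majority-class learner below are hypothetical and only mirror the three roles described above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


@dataclass(frozen=True)
class MajorityClassModel:
    """Model: the reusable result; it can be applied to new data or inspected."""
    predicted_label: str

    def apply(self, dataset: Sequence[Row]) -> List[str]:
        return [self.predicted_label for _ in dataset]


class MajorityClassLearner:
    """Learner: performs the learning task on a dataset and builds the model."""

    def __init__(self, dataset: Sequence[Row], label_index: int) -> None:
        self._dataset = dataset
        self._label_index = label_index

    def learn(self) -> MajorityClassModel:
        counts = Counter(row[self._label_index] for row in self._dataset)
        return MajorityClassModel(counts.most_common(1)[0][0])


@dataclass
class MajorityClassFactory:
    """Factory: holds the configuration and creates configured learners."""
    label_index: int = -1

    def create(self, dataset: Sequence[Row]) -> MajorityClassLearner:
        return MajorityClassLearner(dataset, self.label_index)


training = [
    ("sunny", "hot", "no"),
    ("overcast", "hot", "yes"),
    ("sunny", "cold", "yes"),
]
factory = MajorityClassFactory(label_index=2)   # configure
learner = factory.create(training)              # create the learner
model = learner.learn()                         # build the model
predictions = model.apply([("overcast", "cold"), ("sunny", "hot")])
```

Keeping the configuration in the factory and the result in an immutable model object means either one can be stored or shared independently of the code that ran the experiment, which is the benefit the abstract emphasizes.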
[Diagram: Parameter configuration is given to the Factory; the Factory creates the Learner; the Learner uses the Dataset and builds the Model; the Model applies to a Dataset.]

MOTIVATION

Typically, machine learning and data mining algorithms are written using software like Matlab, Weka, and RapidMiner (formerly YALE). Software like Matlab simplifies the process of converting an algorithm to code with little programming, but often one has to sacrifice speed and usability. At the other extreme, software like Weka and RapidMiner increases usability by providing GUIs and plug-ins, which requires researchers to develop GUIs. Cougar^2 tries to address some of the issues with these software packages.

BENEFITS OF COUGAR^2

- Reusable and efficient software
- Test First Development
- Platform independent
- Support research efforts into new algorithms
- Analyze experiments by reading and reusing learned models
- Intuitive API for researchers rather than GUI for end users
- Easy to share experiments and experiment results

A SUPERVISED LEARNING EXAMPLE

[Diagram: Dataset, Decision Tree Factory, Decision Tree Learner, Model (Decision Tree). The example tree tests Outlook (Sunny, Overcast) and Temp. (Hot, Cold) and has leaves Yes and No.]

A REGION DISCOVERY EXAMPLE

[Diagram: Dataset, Region Discovery Factory, Region Discovery Algorithm, Region Discovery Model.]

CURRENT WORK

Several algorithms have been implemented using the framework, including SPAM, CLEVER and SCDE. The algorithm MOSAIC is currently under development. A region discovery framework and various interestingness measures, such as purity, variance, and mean squared error, have been implemented using the framework.

Developed using: Java, JUnit, EasyMock
Hosted at: https://cougarsquared.dev.java.net

[1] The first version of Cougar^2 was developed by a Ph.D. student of the research group, Abraham Bagherjeiran.