Department of Computer Science 1 Data Mining / KDD Let us find something interesting! Definition := “KDD is the non-trivial process of identifying valid,

Presentation transcript:

Department of Computer Science
Data Mining / KDD
Let us find something interesting!
Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Research Focus of UH-DMML
Christoph F. Eick
Data Mining
Geographical Information Systems (GIS)
High Performance Computing
Machine Learning
Helping Scientists to Make Sense of their Data
Output: Graduated 12 PhD students (5 in ) and 79 Master's students

Research Areas and Projects
1. Data Mining and Machine Learning Group (UH-DMML). Its research focuses on:
   1. Spatial Data Mining
   2. Clustering
   3. Helping Scientists to Make Sense out of their Data
   4. Classification and Prediction
2. Current and Planned Projects
   1. Spatial Clustering Algorithms with Plug-in Fitness Functions and Other Non-Traditional Clustering Approaches
   2. Patch-based Prediction Techniques
   3. Mining Point of Interest (POI) Datasets and their Application to Urban Computing and Understanding Causes of Alcohol Addiction
   4. Data Mining with a Lot of Cores
   5. Educational Data Mining

Mining POI Datasets
Motivation:
- A lot of POI datasets (e.g. in Google Earth) are becoming available now.
- Buildings of the City of Chicago (830,000 polygons)
Challenges:
- Extract valuable knowledge from such datasets → Data Mining
- Facilitate querying and visualizing of such datasets → HPC / BigData Initiative

Summarizing the Composition of Spatial Datasets
Given: A spatial dataset which covers an area of interest
Output: A partitioning of the area of interest into uniform regions
Applications: Urban Computing( ) / Alcohol Addiction

Non-Traditional Clustering Algorithms
[Diagram labels: Clustering Algorithms With Plug-in Fitness Functions; Mining Spatio-Temporal Datasets; Parallel Computing; Prototype-based Clustering; Randomized Hill Climbing With a Lot of Cores; Agglomerative Clustering and Hotspot Discovery Algorithms; Creating Polygon Models For Spatial Clusters]

Current Suite of Spatial Clustering Algorithms
- Representative-based: SCEC, CLEVER
- Grid-based: SCMRG,…
- Agglomerative: MOSAIC
- Density-based: DCONTOUR (not really plug-in, but some fitness functions can be simulated)
[Diagram: Clustering Algorithms, divided into Density-based, Agglomerative, Representative-based and Grid-based]
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.
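The idea of a reward-based, plug-in fitness function can be illustrated with a minimal sketch. The additive form used here (each cluster contributes its reward times its size raised to a power `beta`), and the names `fitness`, `purity_reward`, `threshold` and `beta`, are assumptions made for illustration only; the slides do not specify the functional form. Any reward function with the same signature can be plugged in.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# For this sketch a cluster is simply the sequence of class labels of its members.
Cluster = Sequence[Hashable]


def purity_reward(cluster: Cluster, threshold: float = 0.5) -> float:
    """Hypothetical reward: amount by which the majority-class share exceeds threshold."""
    if not cluster:
        return 0.0
    majority_share = Counter(cluster).most_common(1)[0][1] / len(cluster)
    return max(0.0, majority_share - threshold)


def fitness(
    clustering: Sequence[Cluster],
    reward: Callable[[Cluster], float] = purity_reward,
    beta: float = 1.1,
) -> float:
    """Assumed additive fitness: sum over clusters of reward(c) * |c| ** beta."""
    return sum(reward(c) * len(c) ** beta for c in clustering)


if __name__ == "__main__":
    pure = [["a", "a"], ["b", "b"]]
    mixed = [["a", "b"], ["a", "b"]]
    # A clustering algorithm would search for the partition with the highest score.
    print(fitness(pure), fitness(mixed))
```

Because the reward is passed in as a callable, the same search procedure can be reused with a different notion of interestingness without changing the algorithm itself.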

MOSAIC: a Clustering Algorithm that Supports Plug-in Fitness Functions
Fig. 6: An illustration of MOSAIC’s approach: (a) input, (b) output
MOSAIC supports plug-in fitness functions and provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs. It approximates arbitrary-shape clusters using unions of small convex polygons.

Patch-based Prediction Techniques
a. New Algorithms for Regression Tree Induction
b. New Decision Tree Induction Algorithms
c. Multi-Target Regression
d. Spatial Prediction Techniques

Helping Scientists to Make Sense Out of their Data
Figure 1: Co-location regions involving deep and shallow ice on Mars
Figure 2: Interestingness hotspots where both income and CTR are high
Figure 3: Mining hurricane trajectories

Other Unassigned Research Topics
- Trajectory Classification and Prediction
- Collocation Mining
- Creating Parallel Versions of Existing Clustering Algorithms
- Models for the Evolution of Spatial Datasets
- Hierarchical Learning Algorithms
- …
[Figure: Ozone Hotspot Evolution at 3p, 5p and 7p]

UH-DMML Mission Statement
The Data Mining and Machine Learning Group at the University of Houston aims to develop data analysis, data mining, and machine learning techniques and to apply those techniques to challenging problems in geology, astronomy, urban computing, ecology, environmental sciences, web advertising and medicine. In general, our research group has a strong background in the areas of clustering and spatial data mining. Areas of our current research include: clustering algorithms with plug-in fitness functions, association analysis, mining related spatial data sets, patch-based prediction techniques, summarizing the composition of spatial datasets, change and progression analysis, and data mining with a lot of cores.
Website:
Research Group Publications:
Data Mining Course Website:
Machine Learning Course Website:

Reading Material
Urban Computing/Spatial Clustering: SIGKDD Urban Computing Workshop 2013 Paper
Agglomerative Clustering: R. Jiamthapthaksin, C. F. Eick, and S. Lee, GAC-GEO: A Generic Agglomerative Clustering Framework for Geo-referenced Datasets, in Knowledge and Information Systems (KAIS).
Patch-based Prediction Techniques: MLDM 2013 Paper, ACM-GIS 2010 Paper
Data Mining with a lot of Cores: ParCo 2011 Paper
GIS/Creating Polygon Models: ACM-GIS 2013 Submission
Machine Learning Course Website:
Collocation Mining: ACM-GIS 2008 Paper
Spatial Clustering and Association Analysis: W. Ding, C. F. Eick, X. Yuan, J. Wang, and J.-P. Nicot, A Framework for Regional Association Rule Mining and Scoping in Spatial Datasets, Geoinformatica (2011) 15:1-28, DOI /s , January 2011.
Supervised Clustering: TAI 2005 Paper

What Courses Should You Take to Conduct Research in this Research Group?
I. Data Mining
II. Machine Learning
III. Parallel Programming, AI, Software Design, Data Structures, Databases, Visualization, Evolutionary Computing, Image Processing, GIS courses, Geometry, Optimization.

Some UH-DMML Graduates 1
Dr. Wei Ding, Assistant Professor, Department of Computer Science, University of Massachusetts, Boston
Sharon M. Tuttle, Professor, Department of Computer Science, Humboldt State University, Arcata, California
Tae-wan Ryu, Professor, Department of Computer Science, California State University, Fullerton

Some UH-DMML Graduates 2
Ruth Miller, PhD: Washington University in St. Louis, Postdoc, Midwest Alcohol Research Center, Department of Psychiatry; Adjunct Instructor, Department of Computer Science
Chun-sheng Chen, PhD: TidalTV, Baltimore (an internet advertising company)
Rachsuda Jiamthapthaksin, PhD: Lecturer, Assumption University, Bangkok, Thailand
Justin Thomas, MS: Section Supervisor at Johns Hopkins University Applied Physics Laboratory
Mei-kang Wu, MS: Microsoft, Bellevue, Washington
Jing Wang, MS: AOL, California

Models for Progression of Hotspots and Other Spatial Objects
- Ozone Hotspot Evolution (3p, 5p, 7p)
- Building Evolution
- Progression of Glaucoma

Mining Related Datasets Using Polygon Analysis
Work on a methodology that does the following:
1. Generate polygons from spatial cluster extensions / from continuous density or interpolation functions.
2. Meta-cluster polygons / sets of polygons.
3. Extract interesting patterns / create summaries from polygonal meta clusters.
[Figures: Analysis of Glaucoma Progression; Analysis of Ozone Hotspots]

Clustering and Hotspot Discovery in Labeled Graphs
Potential problems to be investigated:
1. Clustering Proteins Based on Their Interactions
2. Generalize the Region Discovery Framework to Graph Partitioning Using Plug-in Interestingness Functions
3. …
4. …

Methodologies and Tools to Analyze and Mine Related Datasets
Subtopics:
- Disparity Analysis/Emergent Pattern Discovery (“how do two groups differ with respect to their patterns?”) [SDE10]
- Change Analysis (“what is new/different?”) [CVET09]
- Correspondence Clustering (“mining interesting relationships between two or more datasets”) [RE10]
- Meta Clustering (“cluster cluster models of multiple datasets”)
- Analyzing Relationships between Polygonal Cluster Models
Example: Analyze Changes with Respect to Regions of High Variance of Earthquake Depth.
Novelty(r’) = (r’ − (r1 ∪ … ∪ rk))
[Figure: Emerging regions based on the novelty change predicate, Time 1 vs. Time 2]
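The novelty change predicate can be sketched in a few lines if it is read as "the part of a new region r' that is not covered by any earlier region r1, …, rk". That reading (set difference against a union), the representation of a region as a set of grid cells, and the name `novelty` are assumptions made for this illustration; the slides work with polygonal cluster models rather than grid cells.

```python
from typing import Iterable, Set, Tuple

# Assumption: a region is modelled as a set of (row, col) grid cells.
Cell = Tuple[int, int]
Region = Set[Cell]


def novelty(new_region: Region, old_regions: Iterable[Region]) -> Region:
    """Return the cells of new_region not covered by any of the old regions."""
    covered: Region = set().union(*old_regions)
    return set(new_region) - covered


if __name__ == "__main__":
    r1 = {(0, 0), (0, 1)}          # region found at time 1
    r2 = {(1, 1)}                  # another region found at time 1
    r_new = {(0, 1), (1, 1), (2, 2)}  # region found at time 2
    print(novelty(r_new, [r1, r2]))
```

A non-empty result marks an emerging area: cells that belong to an interesting region at time 2 but to none at time 1.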

Mining Spatial Trajectories
- Goal: Understand and characterize motion patterns
- Themes investigated: clustering and summarization of trajectories, classification based on trajectories, likelihood assessment of trajectories, prediction of trajectories.
[Figures: Arctic Tern; Arctic Tern Migration; Hurricanes in the Gulf of Mexico]

Current UH-DMML Activities
[Diagram labels: Regional Knowledge Extraction; Spatial Clustering Algorithms With Plug-in Fitness Functions; Mining Related Datasets & Polygon Analysis; Trajectory Mining; Discrepancy Mining; Regional Association Analysis; Knowledge Scoping; Regional Regression; Parallel CLEVER; TRAJ-CLEVER; Poly-CLEVER; SCMRG; Strasbourg Building Evolution; POLY/TRAJ-SNN; Polygonal Meta Clustering; Understanding Glaucoma; Air Pollution Analysis; Cluster Correspondence Analysis; Cluster Polygon Generation; MOSAIC; Animal Motion Analysis; Trajectory Density Estimation; Classification; Sub-Trajectory Mining; Repository Clustering; Yahoo! User Modeling; Clustering; Cougar^2]

Data Mining & Machine Learning Group ACM-GIS08

Extracting Regional Knowledge from Spatial Datasets
RD-Algorithm
Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with respect to a Continuous Variable [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find “representative” regions (Sampling)
Application 6: Regional Regression [CE09]
Application 7: Multi-Objective Clustering [JEV09]
Application 8: Change Analysis in Spatial Datasets [RE09]
[Figure: Wells in Texas. Green: safe well with respect to arsenic; Red: unsafe well. Panels labeled =1.01 and =1.04]

A Framework for Extracting Regional Knowledge from Spatial Datasets
Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets.
[Diagram labels: Framework for Mining Regional Knowledge; Spatial Databases; Integrated Data Set; Domain Experts; Fitness Functions; Family of Clustering Algorithms; Regional Association Rule Mining Algorithms; Ranked Set of Interesting Regions and their Properties; Measures of interestingness; Regional Knowledge; Hierarchical Grid-based & Density-based Algorithms; Spatial Risk Patterns of Arsenic]

REG^2: a Regional Regression Framework
- Motivation: Regression functions vary spatially; they are not constant over space.
- Goal: To discover regions with strong relationships between dependent and independent variables and to extract their regional regression functions.
- Clustering algorithms with plug-in fitness functions are employed to find such regions; the employed fitness functions reward regions with a low generalization error.
- Various schemes are explored to estimate the generalization error: example weighting, regularization, penalizing model complexity, using validation sets, …

           AIC Fitness   VAL Fitness   RegVAL Fitness   WAIC Fitness
Arsenic    5.01%         11.19%        3.58%            13.18%
Boston     29.80%        35.69%        38.98%           36.60%

[Figure and table titles: Discovered Regions and Regression Functions; REG^2 Outperforms Other Models in SSE_TR; Regularization Improves Prediction Accuracy]
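The motivation for regional regression can be made concrete with a small sketch that compares one global regression line against one line per region. Everything here is an illustrative assumption: the regions are given rather than discovered by clustering, the model is a one-variable least-squares line, the error is the training SSE rather than an estimate of generalization error, and the names `fit_line`, `sse` and `regional_sse` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) observation
Line = Tuple[float, float]    # (slope, intercept)


def fit_line(points: List[Point]) -> Line:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # all x equal: fall back to a constant model
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sse(points: List[Point], model: Line) -> float:
    """Sum of squared errors of the model on the given points."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


def regional_sse(regions: Dict[str, List[Point]]) -> float:
    """Total training SSE when every region gets its own regression line."""
    return sum(sse(pts, fit_line(pts)) for pts in regions.values())


if __name__ == "__main__":
    regions = {
        "north": [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)],   # follows y = 2x
        "south": [(0.0, 5.0), (1.0, 4.0), (2.0, 3.0)],   # follows y = -x + 5
    }
    all_points = [p for pts in regions.values() for p in pts]
    print("global SSE:", sse(all_points, fit_line(all_points)))
    print("regional SSE:", regional_sse(regions))
```

In this toy data the two regions follow opposite trends, so a single global line fits poorly while the per-region lines fit exactly. A fitness function in the spirit of the slide would reward such regions, using a validation set or a complexity penalty in place of the training error used here.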

Finding Regional Co-location Patterns in Spatial Datasets
Objective: Find co-location regions using various clustering algorithms and novel fitness functions.
Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas' ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.
Figure 1: Co-location regions involving deep and shallow ice on Mars
Figure 2: Chemical co-location patterns in Texas water supply