Frameworks and Algorithms for Regional Knowledge Discovery. Christoph F. Eick, Department of Computer Science, University of Houston

Presentation transcript:

Frameworks and Algorithms for Regional Knowledge Discovery
Christoph F. Eick, Department of Computer Science, University of Houston
1. Motivation: Why is Regional Knowledge Important?
2. Region Discovery Framework
3. A Family of Clustering Algorithms for Region Discovery
4. Case Studies: Extracting Regional Knowledge
   - Regional Regression
   - Regional Association Rule Mining
   - Regional Models of User Behaviour on the Internet
   - [Co-location Mining]
5. [Analyzing Related Datasets]
6. Summary

Spatial Data Mining
Definition: Spatial data mining is the process of discovering interesting patterns from large spatial datasets; it organizes by location what is interesting.
Challenges:
- Information is not uniformly distributed
- Autocorrelation
- Space is continuous
- Complex spatial data types
- Large dataset sizes and many possible patterns
- Patterns exist at different levels of resolution
- Importance of maps as summaries
- Importance of regional knowledge

Why Is Regional Knowledge Important in Spatial Data Mining?
- It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99].
- Simpson’s Paradox: global models may be inconsistent with regional models [Simpson1951].
- Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional scale rather than a global scale.

Example: Regional Association Rules
[Figure: scopes of the 4 rules (Rule 1, Rule 2, Rule 3, Rule 4) in …]

Goal of the Presented Research
Develop and implement an integrated computational framework useful for data analysts and scientists from diverse disciplines for extracting regional knowledge in spatial datasets in a highly automated fashion.

Related Work
- Spatial co-location pattern discovery [Shekhar et al.]
- Spatial association rule mining [Han et al.]
- Localized associations in segments of the basket data [Yu et al.]
- Spatial statistics on hot spot detection [Tay and Brimicombe et al.]
- There is some work on geo-regression techniques (to be discussed later)
- …
Comment: Most work centers on extracting global knowledge from spatial datasets.

Preview: A Framework for Extracting Regional Knowledge from Spatial Datasets (UH-DMML)
Applications of the region discovery (RD) algorithm:
- Application 1: Supervised Clustering [EVJW07]
- Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
- Application 3: Find Interesting Regions with respect to a Continuous Variable [CRET08]
- Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
- Application 5: Find “representative” regions (Sampling)
- Application 6: Regional Regression [CE09]
- Application 7: Multi-Objective Clustering [JEV09]
- Application 8: Change Analysis in Related Datasets [RE09]
[Figure: wells in Texas; green: safe well with respect to arsenic, red: unsafe well; panels for β=1.01 and β=1.04]

2. Region Discovery Framework

Region Discovery Framework (2)
- We assume we have spatial or spatio-temporal datasets that have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable).
- Clustering occurs in the space of the spatial attributes; regions are found in this space.
- The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
- For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.

Region Discovery Framework (3)
The algorithms we currently investigate solve the following problem:
Given:
- A dataset O with a schema R
- A distance function d defined on instances of R
- A fitness function q(X) that evaluates clusterings $X=\{c_1,\dots,c_k\}$ as follows: $q(X)=\sum_{c \in X} reward(c)=\sum_{c \in X} i(c)\cdot size(c)^{\beta}$ with $\beta \ge 1$
Objective: Find $c_1,\dots,c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \ne j$
2. $X=\{c_1,\dots,c_k\}$ maximizes q(X)
3. All clusters $c_i \in X$ are contiguous (each pair of objects belonging to $c_i$ has to be Delaunay-connected with respect to $c_i$ and to d)
4. $c_1 \cup \dots \cup c_k \subseteq O$
5. $c_1,\dots,c_k$ are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
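The reward-based fitness function above can be illustrated with a minimal Python sketch. It is not the group's implementation: the cluster representation (lists of `(x, y, label)` tuples), the toy `purity_interestingness` measure, and its `threshold` parameter are assumptions made for illustration, and clusters are assumed to be non-empty.

```python
def reward(cluster, interestingness, beta=1.01):
    """reward(c) = i(c) * size(c)**beta, with beta >= 1."""
    if beta < 1:
        raise ValueError("beta must be >= 1")
    return interestingness(cluster) * len(cluster) ** beta


def fitness(clustering, interestingness, beta=1.01):
    """q(X) = sum of reward(c) over all clusters c in the clustering X."""
    return sum(reward(c, interestingness, beta) for c in clustering)


def purity_interestingness(cluster, threshold=0.7):
    """Toy i(c), assumed for illustration: objects are (x, y, label) tuples and
    a region is interesting only if its majority-class share exceeds threshold."""
    labels = [obj[2] for obj in cluster]
    share = max(labels.count(label) for label in set(labels)) / len(labels)
    return max(0.0, share - threshold)


def ranked(clustering, interestingness, beta=1.01):
    """Clusters sorted by decreasing reward; zero-reward clusters are dropped."""
    scored = [(reward(c, interestingness, beta), c) for c in clustering]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(r, c) for r, c in scored if r > 0]
```

Because the interestingness measure is passed in as a function, the same `fitness` can be reused with any of the plug-in measures discussed on the following slides.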

Measure of Interestingness i(c)
- The function i(c) is an interestingness measure for a region c, a quantity based on domain interest to reflect how “newsworthy” the region is.
- In our past work, we have designed a suite of measures of interestingness for:
  - Supervised Clustering [PKDD06]
  - Hot spots and cool spots [ICDM06]
  - Scope of regional patterns [SSTDM07, GE011]
  - Co-location patterns involving continuous variables [PAKDD08, ACM-GIS08]
  - High-variance regions involving a continuous variable [PAKDD09]
  - Regional Regression [ACM-GIS09]

Example 1: Finding Regional Co-location Patterns in Spatial Data
Objective: Find co-location regions using various clustering algorithms and novel fitness functions.
Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas' ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.
[Figure 1: Co-location regions involving deep and shallow ice on Mars]
[Figure 2: Chemical co-location patterns in Texas Water Supply]

Example 2: Regional Regression
Geo-regression approaches: Multiple regression functions are used that vary depending on location.
Regional Regression:
I. Discover regions with strong relationships between dependent and independent variables
II. Construct a regional regression function for each region
III. When predicting the dependent variable of an object, use the regression function associated with the location of the object
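Steps II and III can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: there is a single independent variable, the regions from step I are already known, and `region_of` is a hypothetical function that maps a location to a region identifier.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one independent variable: y = a + b * x.
    Returns (intercept, slope). Assumes the x values are not all identical."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_regional_models(objects, region_of):
    """Step II: fit one regression function per region.
    objects: iterable of (location, x, y) triples."""
    by_region = {}
    for location, x, y in objects:
        xs, ys = by_region.setdefault(region_of(location), ([], []))
        xs.append(x)
        ys.append(y)
    return {region: fit_line(xs, ys) for region, (xs, ys) in by_region.items()}


def predict(models, region_of, location, x):
    """Step III: use the regression function of the region containing location."""
    intercept, slope = models[region_of(location)]
    return intercept + slope * x
```

Two objects with the same value of the independent variable can receive different predictions when they lie in different regions, which is the point of the approach.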

Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets”
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward)
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6

Clustering with Plug-in Fitness Functions
- In the last 5 years, my research group developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.
- This work is motivated by a mismatch between evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.
- Additionally, hotspot discovery techniques that find interesting regions in polygonal datasets, such as zip-code-based datasets, were developed more recently.

3. Current Suite of Clustering Algorithms
- Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
- Grid-based: SCMRG, SCHG
- Agglomerative: MOSAIC, SCAH
- Density-based: SCDE, DCONTOUR
[Figure: clustering algorithms grouped as density-based, agglomerative, representative-based, and grid-based]

Representative-based Clustering
[Figure: objects and representatives plotted over two attributes]
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Characteristic: clusters are formed by assigning objects to the closest representative.
Popular algorithms: K-means, K-medoids, CLEVER, …
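The assignment step that characterizes representative-based clustering can be sketched as follows. This is a minimal illustration, not code from the talk; it assumes objects and representatives are coordinate tuples and uses Euclidean distance, whereas the framework allows any distance function d.

```python
import math


def assign_to_representatives(objects, representatives):
    """Form clusters by assigning each object to its closest representative.
    Ties are broken in favor of the representative listed first."""
    clusters = {rep: [] for rep in representatives}
    for obj in objects:
        closest = min(representatives, key=lambda rep: math.dist(obj, rep))
        clusters[closest].append(obj)
    return clusters
```

A search algorithm of this family then only has to decide which objects serve as representatives; the clustering itself follows from the assignment.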

CLEVER [ACM-GIS08]
- Is a representative-based clustering algorithm, similar to PAM.
- Searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination, and uses randomized hill climbing and adaptive sampling to reduce complexity.
- In general, new clusterings are generated in the neighborhood of the current solution by inserting, deleting, and replacing representatives.
- Searches for the optimal number of clusters.

Advantages of Grid-based Clustering Algorithms
- Fast:
  - No distance computations
  - Clustering is performed on summaries and not individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects)
- Easy to determine which clusters are neighboring
- Shapes are limited to unions of grid cells

Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy:
1. If a cell receives a reward that is larger than the sum of the rewards of its ancestors: return that cell.
2. If a cell and its ancestor do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).

Code SCMRG

4. Case Studies: Regional Knowledge Extraction
4.1 Regional Regression
4.2 Regional Association Rule Mining & Scoping
4.3 Association-List Based Discrepancy Mining of User Behavior
4.4 Co-location Mining (to be skipped)

4.1 Regional Regression: Motivation
- First law of geography: “Everything is related to everything else, but nearby things are more related than distant things” (Tobler)
- Frequently, coefficient estimates in spatial datasets vary spatially.
- Question: How do we capture the regional variation of regression coefficients?

Other Geo-Regression Analysis Methods
- Regression Trees
  - Data is split in a top-down approach using a greedy algorithm
  - Discovers only rectangular shapes
- Geographically Weighted Regression (GWR)
  - An instance-based, local spatial statistical technique used to analyze spatial non-stationarity
  - Generates a separate regression equation for a set of observation points, determined using a grid or kernel
  - The weight assigned to each observation is based on a distance decay function centered on the observation

Example 1: Why Do We Need Regional Knowledge?
Regression result: a positive linear regression line (arsenic increases with increasing fluoride concentration).
[Figure: scatter plot of arsenic against fluoride]

Example 1: Why Do We Need Regional Knowledge? (continued)
- A negative linear regression line in both locations (arsenic decreases with increasing fluoride concentration)
- A reflection of Simpson’s paradox
[Figure: scatter plot of arsenic against fluoride, split into Location 1 and Location 2]
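The effect described in this example can be reproduced with a small Python sketch. The numbers below are made up for illustration and are not the arsenic/fluoride measurements from the talk: each location has a negative slope, yet the pooled data has a positive one.

```python
def ols_slope(points):
    """Slope of the least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic (fluoride, arsenic) pairs, invented for this illustration.
location_1 = [(1, 5), (2, 4), (3, 3)]
location_2 = [(6, 10), (7, 9), (8, 8)]

regional_slopes = [ols_slope(location_1), ols_slope(location_2)]
global_slope = ols_slope(location_1 + location_2)
print(regional_slopes, global_slope)
```

The global slope is driven by the difference between the two locations rather than by the relationship inside either one, which is why a single global model can be misleading.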

Example 2: Houston House Price Estimate
- Dependent variable: House_Price
- Independent variables: noOfRooms, squareFootage, yearBuilt, havePool, attachedGarage, etc.

Example 2: Houston House Price Estimate (continued)
- Global regression (OLS) produces the coefficient estimates, the R^2 value, the error, etc. for a single global model.
- This model assumes all areas have the same coefficients. E.g. the attribute havePool has a coefficient of +9,000 (having a pool adds about $9,000 to a house price).
- In reality this varies, for example between a $100K house and a $500K house, or between zip codes or locations. Having a pool in a house in a luxury area (about $40K) is very different from having a pool in a house in the suburbs (about $5K).

Example 2: Houston House Price Estimate (continued)
[Figure: houses A and B, with price labels $180,000 and $350,000]
- Houses A and B have very similar characteristics
- OLS produces single parameter estimates for predictor variables like noOfRooms, squareFootage, yearBuilt, etc.

Example 2: Houston House Price Estimate (continued)
- If we use zip codes as regions, the two houses are in the same region.
- If we use a grid structure, they are in different regions, but some houses similar to B (lake view) are in the same region as A, and this will affect the coefficient estimates.
- More importantly, the houses around the U-shaped lake show a similar pattern and should be in the same region; otherwise we miss important information.

Our Approach: Capture the True Pattern Structure!
We need to discover arbitrarily shaped regions, and not rely on a priori defined artificial boundaries.
Problems to be solved:
1. Find regions whose objects have a strong relationship between the dependent variable and the independent variables
2. Extract regional regression functions
3. Develop a method to select which regression function to use for a new object to be predicted

Methodology: So, What Can We Use as Interestingness?
- The natural first candidate is adjusted R^2. R^2 is a measure of the extent to which the total variation of the dependent variable is explained by the model.
- R^2 alone is not a good measure to assess the goodness of fit: it only deals with the bias of the model and ignores the complexity of the model, which leads to overfitting.
- There are better model selection criteria to balance the tradeoff between bias and variance.

Fitness Function Candidates
- R^2-based fitness functions
- Fitness functions that consider model complexity in addition to goodness of fit, such as AIC or BIC
- Regularization approaches that penalize large coefficients
- Fitness functions that employ validation sets, which provide a better measure of the generalization error (the model’s performance on unseen examples)
- An improvement of the previous approach that additionally considers training set/test set similarity
- Combinations of the approaches mentioned above

R-sq Based Fitness Function
- Given: [formulas not preserved in the transcript]
- The interestingness is: [formula not preserved in the transcript]
- To battle the tendency towards small regions with high correlation (false correlation), we:
  - used a scaled version of the fitness function
  - employed a parameter to limit the minimum size of a region
- The Rsq-based fitness function then becomes: [formula not preserved in the transcript]

AIC Based Fitness Function (AICFitness)
We prefer Akaike’s Information Criterion (AIC) because:
- it takes model complexity (number of observations etc.) into consideration more effectively
- AIC provides a balance between bias and variance, and is estimated using the following formula: [formula not preserved in the transcript]
- variations of AIC are available, including AIC_u [McQuarrie], which is used for small datasets and is therefore a good fit for our small regions
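The formula shown on this slide is not preserved in the transcript. As a stand-in, the sketch below uses a widely used least-squares form of AIC, n·ln(SSE/n) + 2k, which holds for Gaussian errors up to an additive constant. Treat it as an assumption rather than the exact variant used in this work; it does not implement AIC_u.

```python
import math


def aic_least_squares(sse, n, k):
    """AIC = n * ln(SSE / n) + 2 * k for a least-squares fit
    (Gaussian errors, additive constants dropped).
    sse: sum of squared errors, n: number of observations,
    k: number of estimated parameters."""
    if n <= 0 or sse <= 0:
        raise ValueError("n and sse must be positive")
    return n * math.log(sse / n) + 2 * k
```

Lower values are better: a smaller error lowers the first term, while each additional parameter adds a penalty of 2, which is how the criterion trades goodness of fit against model complexity.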

AIC Based Fitness Function (AICFitness), continued
- AIC-based interestingness i_AIC(r): [formula not preserved in the transcript]
- The AICFitness function then becomes: [formula not preserved in the transcript]
- The AICFitness function repeatedly applies regression analysis during the search for the optimal set of regions, i.e. the set that overall provides the best (minimum) AIC values.

Controlling Regional Granularity
- β is used to control the number of regions to be discovered, and thus the overall model complexity.
- Finding a good value for β means striking the right balance between underfitting and overfitting for a given dataset.
- Small values for β → small number of regions; large values for β → large number of regions
Reminder, Region Discovery Framework fitness function: $q(X)=\sum_{c \in X} reward(c)=\sum_{c \in X} i(c)\cdot size(c)^{\beta}$

Experiments & Results: Generalization Error Improvement (SSE_TE)
- Discovered regions and their regional regression coefficients predict better than the global model.
- Some regions with very high error reduce the overall accuracy, but there is still a 27% improvement (future work item).
- The relationship between variables varies spatially.
Generalization Error Results, Boston Housing Data:
β   | SSE_TE (GL) | SSE_TE (REG^2) | SSE improvement | % of objects with better prediction
1.1 | 17,182      | 12,566         | 27%             | 72%
1.7 | 17,182      | 14,799         | 26%             | 65%

Experiments & Results (continued)
- Regional regression coefficients predict just slightly better.
- Some of this is due to external factors, e.g. toxic waste, power plants (analyzed previously using the PCAFitness approach, MLDM09).
- Some regions with very high error reduce the overall accuracy.
- Still, around 60% of objects are better predicted.
- Open for improvement; new fitness functions.
Generalization Error Results, Arsenic Data (values partly lost in the transcript):
β | SSE_TE (GL) | SSE_TE (REG^2) | SSE improvement | % of objects with better prediction
… | …,578       | 98,879         | 3.6%            | 57%
… | …,578       | 92,…           | …%              | 61%

4.2 A Framework for Regional Association Rule Mining and Scoping [GeoInformatica10]
Step 1: Region Discovery (arsenic hot spots)
Step 2: Regional Association Rule Mining (an association rule a is discovered)
Step 3: Regional Association Rule Scoping (scope of the rule a)

Arsenic Hot Spots and Cool Spots
Step 1: Region Discovery; Step 2: Regional Association Rule Mining; Step 3: Regional Association Rule Scoping

Example Regional Association Rules
Step 1: Region Discovery; Step 2: Regional Association Rule Mining (rule 1, rule 2, rule 3, rule 4); Step 3: Regional Association Rule Scoping

Region vs. Scope
- The scope of an association rule indicates how regional or global a local pattern is.
- The region in which an association rule originated is a subset of the scope in which the association rule holds.

Association Rule Scope Discovery Framework
Let a be an association rule and r be a region; conf(a,r) denotes the confidence of a in region r, and sup(a,r) denotes the support of a in r.
Goal: Find all regions for which an association rule a satisfies its minimum support and confidence thresholds; regions in which a’s confidence and support are significantly higher than the min-support and min-conf thresholds receive higher rewards.
Association Rule Scope Discovery Methodology: For each rule a that was discovered for region r’, we run our region discovery algorithm that defines the interestingness of a region r_i with respect to an association rule a as follows: [formula not preserved in the transcript]
Remarks:
- Typically  1 =  2 =0.9;  =2 (confidence increase is more important than support increase)
- Obviously the region r’ from which rule a originated, or some variation of it, should be “rediscovered” when determining the scope of a.
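The two quantities sup(a,r) and conf(a,r) can be computed per region as in the Python sketch below. The representation of a region as a list of item sets and the function names are assumptions for illustration; the interestingness formula itself is not preserved in the transcript, so the sketch only checks the minimum thresholds.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """sup(a, r) and conf(a, r) for a rule a: antecedent -> consequent,
    computed over the transactions (item sets) of one region r."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def rule_holds(transactions, antecedent, consequent, min_sup, min_conf):
    """True if the rule meets both minimum thresholds in this region."""
    sup, conf = support_and_confidence(transactions, antecedent, consequent)
    return sup >= min_sup and conf >= min_conf
```

Running `rule_holds` for the same rule over different candidate regions gives a crude picture of where the rule applies; the reward-based scoping described on the slide additionally favors regions that exceed the thresholds by a wide margin.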

Regional Association Rule Scoping
[Figure: rule scopes in the Ogallala Aquifer and the Gulf Coast Aquifer]

Fine Tuning Confidence and Support
- We can fine-tune the measure of interestingness for association rule scoping by changing the minimum confidence and support thresholds.

4.3 (Regional) Models for Internet User Behaviour
Problem: We are interested in finding spatial patterns with respect to a performance variable based on some context that is described using a set of variables.
Main theme: We try to find factors that influence whether a user clicks on a given ad (e.g. CTR changes based on the keywords that occur in the ad, socio-economic factors, proximity to spatial objects of a particular type, ...).
Complication: Datasets are very large, and most data are only available at zip-code level.
Our subtopic: As usual, we are interested in extracting knowledge concerning the "regional variation of clicking behavior".
Contributors: Ruth Miller, Chun-sheng Chen; Yahoo! collaborator: Abraham Bagherjeiran

Data Set: Yahoo! Contextual Ads
1. Data sources:
   - Keystone (contextual ads) dataset: January-March, 2009
   - WOEID database
     - used to identify the location of the users who see the ad
     - used to find the neighboring zip codes of a given zip code
2. Experiments are based on a subset of the Keystone dataset:
   1. Ads without geo-targeting tags
   2. Only the rank 1 ads
   3. Shown on the top 5 Yahoo! domains (Y!.finance | Y!.news | Y!.sports | Y!.groups | Y!.maps)
   4. Compute the CTR and conversion rate for each zip code
      - Regional CTR threshold: a zip code must have at least 1000 impressions and 100 clicks
3. Final dataset: 13,869 zip codes with their CTR and conversion rate
4. Goal: Find interesting associations of this dataset with co-location and census datasets
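The per-zip-code filtering step can be sketched as follows. The input layout (a dictionary from zip code to impression, click, and conversion counts) is hypothetical, and defining the conversion rate as conversions per click is an assumption, since the slide does not spell it out.

```python
def regional_ctr(stats, min_impressions=1000, min_clicks=100):
    """stats: dict zip_code -> (impressions, clicks, conversions).
    Returns zip_code -> (CTR, conversion rate) for zip codes that meet
    the regional CTR threshold. Conversion rate is taken as
    conversions / clicks (an assumption)."""
    result = {}
    for zip_code, (impressions, clicks, conversions) in stats.items():
        if impressions >= min_impressions and clicks >= min_clicks:
            result[zip_code] = (clicks / impressions, conversions / clicks)
    return result
```

The threshold removes zip codes whose rates would be estimated from too few events to be reliable.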

Data Set: Census data (US Census 2000)
- 5 Digit Zip Code
- Total Population
- Total Population who are White
- Total Population who are African American
- Total Population who are American Indian
- Total Population who are Asian
- Total Population who are Hawaiian or Pacific Islander
- Total Population who are Some other Race
- Total Population who are 2 or more Races
- Percent of Total Population who are White
- Percent of Total Population who are African American
- Percent of Total Population who are American Indian
- Percent of Total Population who are Asian
- Percent of Total Population who are Hawaiian or Pacific Islander
- Percent of Total Population who are Some other Race
- Percent of Total Population who are 2 or more Races
- Per Capita Income
- Percent of Total Population with Education up to 12th grade
- Percent of Total Population with Education up to Bachelors Degree
- Percent of Total Population with Education up to Masters Degree
- Percent of Total Population with Education up to Ph.D. or Profession Degree
- Percent of Total Population with Education higher than Masters Degree

Global Interestingness Analysis
- Comparison of US zip codes to zip codes with Whole Foods Market stores
- Zip codes with Whole Foods Market stores have a lower overall CTR but a higher number of per-person impressions and clicks.

ZIPS Hotspot Discovery Algorithm
Input: an interestingness function F, a list of n initial zip regions zlist, an interestingness threshold t
    Set HotspotList := empty
    Set NeighborList := empty
    For each region z in zlist {
        If (F(z) > t) {
            Add (neighbor zip codes of z - Hotspots) to NeighborList;
            While (size of NeighborList > 0) {
                Remove one zip code M from NeighborList;
                If (F(M+z) > t) { Merge M to z; }
                Mark M as processed and add unprocessed neighbor zip codes of M to NeighborList;
            }
            Add z to HotspotList;
        }
    }
    Return HotspotList;
- An agglomerative growing algorithm: it starts with a seed zip code and merges neighboring zip codes if the resulting region is above an interestingness threshold.
- Neighboring zip codes are obtained from a lookup table created from the WOEID database.
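The growing procedure on this slide can be sketched in Python as shown below. It is one reading of the pseudocode, not the original implementation. Two deviations are assumptions of this sketch: the frontier is expanded only after a successful merge, so that regions stay contiguous, and seeds that already belong to a hotspot are skipped. The names `seeds`, `neighbors`, and `interestingness` are hypothetical.

```python
def zips_hotspots(seeds, neighbors, interestingness, threshold):
    """Grow hotspots from seed zip codes.
    seeds: iterable of zip codes; neighbors: dict zip -> adjacent zip codes;
    interestingness: function of a frozenset of zip codes (F on the slide)."""
    hotspots = []
    used = set()  # zip codes already assigned to a hotspot
    for seed in seeds:
        if seed in used:
            continue
        region = {seed}
        if interestingness(frozenset(region)) <= threshold:
            continue
        processed = {seed}
        frontier = [z for z in neighbors.get(seed, ()) if z not in used]
        while frontier:
            candidate = frontier.pop()
            if candidate in processed:
                continue
            processed.add(candidate)
            # Merge only if the enlarged region stays above the threshold.
            if interestingness(frozenset(region | {candidate})) > threshold:
                region.add(candidate)
                frontier.extend(z for z in neighbors.get(candidate, ())
                                if z not in processed and z not in used)
        used |= region
        hotspots.append(frozenset(region))
    return hotspots
```

Because the result depends on the order of the seeds and on the order in which neighbors are tried, different runs over reordered input may produce different hotspots.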

ZIPS Output: sample regions for which the correlation between the percentage of Bachelor's degrees and CTR is below the interestingness threshold
Example: Negative Correlation Interestingness Hotspots between Bachelor's degree & CTR

LA Area Neg. Corr. Income vs. CTR
Interestingness Threshold -0.8
Zip codes of interest are outlined in yellow

Scatter Plot of LA CTR/Income Z-Scores

LA Income vs. CTR

North East DC Area
Interestingness Threshold 0.8
Zip codes of interest are outlined in yellow

Scatter Plot of NE-DC Income/CTR Z-score

NE-DC Income vs. CTR

Accomplishments Yahoo! Project
“Completed” Tasks:
a. Frameworks to Analyze Spatial Associations of a Continuous Variable with Other Factors
b. Spatial Hotspot Discovery and Regional Scoping Techniques
c. Finding (Spatial) Correlation-based Associations of CTR with other Factors (mostly based on Contextual Ad Datasets)
d. Dataset Creation (mostly for task c)
   • Census-based Datasets (each dataset is created for 5 digit zip codes and summarized into 3 digit zip code regions, by combining all zip codes that share the same first 3 digits, and into 2 digit zip code regions)
   • Co-location Datasets
   • US zip code boundary polygons (for visualization purposes)
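The summarization of 5 digit zip codes into 3 digit and 2 digit regions can be sketched as a group-by on the zip code prefix. The sketch below is an assumption about how such a roll-up might be done, not the project's code: the function names, the dictionary layout, and the toy counts are invented for illustration. It sums raw counts per prefix and recomputes percentages from the sums, since averaging per-zip percentages would ignore differences in population.

```python
from collections import defaultdict
from typing import Dict


def summarize_by_prefix(
    counts_by_zip: Dict[str, Dict[str, int]], prefix_len: int
) -> Dict[str, Dict[str, int]]:
    """Sum count attributes over all zip codes sharing the same prefix."""
    regions: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for zip_code, counts in counts_by_zip.items():
        for name, value in counts.items():
            regions[zip_code[:prefix_len]][name] += value
    return {prefix: dict(counts) for prefix, counts in regions.items()}


def percent(part: int, total: int) -> float:
    """Percentage recomputed from summed counts (0.0 for an empty region)."""
    return 100.0 * part / total if total else 0.0


# Toy counts, invented for illustration only.
demo_counts = {
    "77004": {"total": 100, "asian": 10},
    "77005": {"total": 300, "asian": 90},
    "10001": {"total": 50, "asian": 5},
}
demo_3digit = summarize_by_prefix(demo_counts, 3)
demo_2digit = summarize_by_prefix(demo_counts, 2)
```

Attributes such as per capita income are not plain counts and would need a population-weighted aggregate instead of a sum.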

Accomplishments Yahoo! Project 2
Partially Completed Tasks:
a. Visualization Tools that Display Interestingness Hotspots
b. Analyzing Relationships between CTR and Conversions
c. Finding Co-location based Associations of CTR
d. Finding Regional and Global Patterns based on Sets of Binary Variables
Proposed and Just Started Tasks:
a. Geo-feature Creation and Evaluation
b. Mining for Promising Binary Contexts for Continuous Variables
c. Mining the Look-alike Modeling Datasets
d. Generalizing CLEVER for Interestingness Hotspot Discovery

5. Methodologies and Tools to Analyze Related Datasets
Subtopics:
• Disparity Analysis/Emergent Pattern Discovery (“how do two groups differ with respect to their patterns?”) [SDE10]
• Change Analysis (“what is new/different?”) [CVET09]
• Correspondence Clustering (“mining interesting relationships between two or more datasets”) [RE10]
• Meta Clustering (“cluster cluster models of multiple datasets”)
• Analyzing Relationships between Polygonal Cluster Models
Example: Analyze Changes with Respect to Regions of High Variance of Earthquake Depth.
$\text{Novelty}(r') = r' - (r_1 \cup \dots \cup r_k)$
Figure: Emerging regions based on the novelty change predicate (Time 1, Time 2)
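The novelty predicate is a set difference: the part of a new region r' that is not covered by any of the earlier regions r1, …, rk. A minimal sketch follows, under the assumption that regions are represented as sets of grid-cell identifiers (the slides work with polygonal cluster models, so this discretization is a simplification made for the example). The `is_emerging` helper and its `min_fraction` parameter are hypothetical and only illustrate how a change predicate might be built on top of novelty.

```python
from typing import Iterable, Set


def novelty(new_region: Set[int], old_regions: Iterable[Set[int]]) -> Set[int]:
    """Cells of new_region not covered by the union of old_regions."""
    covered: Set[int] = set().union(*old_regions)
    return set(new_region) - covered


def is_emerging(
    new_region: Set[int], old_regions: Iterable[Set[int]], min_fraction: float
) -> bool:
    """Hypothetical predicate: a region counts as emerging when at least
    min_fraction of its cells are novel."""
    if not new_region:
        return False
    novel = novelty(new_region, old_regions)
    return len(novel) / len(new_region) >= min_fraction


# Toy regions at time 1 and one region at time 2 (cell ids are invented).
time1_regions = [{1, 2}, {4, 9}]
time2_region = {1, 2, 3, 4}
demo_novel_cells = novelty(time2_region, time1_regions)
```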

6. Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. Families of clustering algorithms and families of measures of interestingness are provided that form the core of the framework.
3. Evidence concerning the usefulness of the framework for regional association rule mining, correlation analysis, regional regression, and co-location mining has been presented.
4. The special challenges in designing clustering algorithms for region discovery have been identified. Current work centers on the parallel implementation of some of those algorithms.
5. The ultimate vision of this research is the development of region discovery engines that assist data analysts and scientists in finding interesting regions in spatial datasets.

Other Contributors to the Work Presented Today
Graduated PhD Students:
• Wei Ding (Regional Association Rule Mining, Grid-based Clustering)
• Rachsuda Jiamthapthaksin (Agglomerative Clustering, Multi-Run Clustering)
• Oner Ulvi Celepcikay (Regional Regression)
• Vadeerat Rinsurongkawong (Analyzing Multiple Datasets, Change Analysis)
Current PhD Students:
• Chun-sheng Chen (Density-based Clustering, Regional Knowledge Extraction)
• Ruth Miller (Dataset Creation, Models for Internet Behavior)
Graduated Master's Students:
• Rachana Parmar (CLEVER, Co-location Mining)
• Seungchan Lee (Grid-based Clustering, Agglomerative Clustering)
• Dan Jiang (Density-based Clustering, Co-location Mining)
• Jing Wang (Grid-based and Representative-based Clustering)
Software Platform and Software Design:
• Abraham Bagherjeiran (PhD student UH, now at Yahoo!)
Domain Experts:
• Tomasz Stepinski (Lunar and Planetary Institute, Houston, Texas)
• J.-P. Nicot (Bureau of Economic Geology, UT, Austin)
• Michael Twa (College of Optometry, University of Houston)

CLEVER Pseudo Code
Inputs: Dataset O, k’, neighborhood-size, p, p’,
Outputs: Clustering X, fitness q
Algorithm:
1. Create a current solution by randomly selecting k’ representatives from O.
2. Create p neighbors of the current solution randomly using the given neighborhood definition.
3. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.
4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p’ more neighbors. If re-sampling does not lead to a better solution, terminate returning the current solution; otherwise, go back to step 2 replacing the current solution by the best solution found by re-sampling.
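The four steps can be sketched as a randomized hill-climbing loop. This is a simplified illustration under stated assumptions, not the CLEVER implementation: a solution is reduced to a set of representative indices, the fitness is passed in as a black-box function, and the neighborhood definition (randomly insert, delete, or replace a representative, applied `neighborhood_size` times) is a hypothetical stand-in for "the given neighborhood definition". The sketch also assumes k’, p, and p’ are all at least 1.

```python
import random
from typing import Callable, FrozenSet, Tuple

Reps = FrozenSet[int]


def random_neighbor(
    reps: Reps, n_objects: int, neighborhood_size: int, rng: random.Random
) -> Reps:
    """Apply neighborhood_size random insert/delete/replace operations."""
    new_reps = set(reps)
    for _ in range(neighborhood_size):
        outside = [i for i in range(n_objects) if i not in new_reps]
        ops = []
        if outside:
            ops += ["insert", "replace"]
        if len(new_reps) > 1:  # never delete the last representative
            ops.append("delete")
        if not ops:
            break
        op = rng.choice(ops)
        if op in ("delete", "replace"):
            new_reps.remove(rng.choice(sorted(new_reps)))
        if op in ("insert", "replace"):
            new_reps.add(rng.choice(outside))
    return frozenset(new_reps)


def clever(
    n_objects: int,
    k_prime: int,
    neighborhood_size: int,
    p: int,
    p_prime: int,
    fitness: Callable[[Reps], float],
    rng: random.Random,
) -> Tuple[Reps, float]:
    # Step 1: random initial set of k' representatives.
    current = frozenset(rng.sample(range(n_objects), k_prime))
    current_q = fitness(current)
    while True:
        improved = False
        # Step 2 samples p neighbors; step 4 re-samples p' more on failure.
        for sample_size in (p, p_prime):
            candidates = [
                random_neighbor(current, n_objects, neighborhood_size, rng)
                for _ in range(sample_size)
            ]
            best = max(candidates, key=fitness)
            best_q = fitness(best)
            if best_q > current_q:  # Step 3: accept the improving neighbor.
                current, current_q = best, best_q
                improved = True
                break
        if not improved:
            return current, current_q


# Toy 1-D dataset and toy fitness (both invented for illustration): negative
# squared distance to the nearest representative, minus one per representative.
POINTS = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]


def demo_fitness(reps: Reps) -> float:
    sse = sum(min((x - POINTS[r]) ** 2 for r in reps) for x in POINTS)
    return -sse - len(reps)


demo_reps, demo_q = clever(len(POINTS), 2, 1, 5, 10, demo_fitness, random.Random(0))
```

Because a neighbor is accepted only when it strictly improves the fitness and the number of representative sets is finite, the loop always terminates.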

An example of the WOEID neighbors lookup table (each row holds the set of neighbor entries of one zip code; the entries themselves are not reproduced here)
The size of the table: 29,692 lines