Frameworks and Algorithms for Regional Knowledge Discovery


1 Frameworks and Algorithms for Regional Knowledge Discovery
Christoph F. Eick, Department of Computer Science, University of Houston

Outline:
1. Motivation: Why is Regional Knowledge Important?
2. Region Discovery Framework
3. A Family of Clustering Algorithms for Region Discovery
4. Case Studies: Extracting Regional Knowledge: Regional Regression; Regional Association Rule Mining; Regional Models of User Behaviour on the Internet; [Co-location Mining]; [Analyzing Related Datasets]
5. Summary

In this talk, a framework for region discovery in spatial datasets is introduced. The first part of the talk motivates the need for such a framework. Next, more details about the framework are given and its use for solving hotspot discovery problems is discussed. Finally, the generalizability of the framework is discussed.

2 Spatial Data Mining
Definition: Spatial data mining is the process of discovering interesting patterns in large spatial datasets; it organizes by location what is interesting.
Challenges:
- Information is not uniformly distributed
- Autocorrelation
- Space is continuous
- Complex spatial data types
- Large dataset sizes and many possible patterns
- Patterns exist at different levels of resolution
- Importance of maps as summaries
- Importance of regional knowledge
Another area of focus is spatial data mining, which centers on finding interesting patterns in spatial datasets. Spatial data have several unique characteristics, such as autocorrelation, the continuous nature of space, complex spatial data types, and the importance of regional knowledge. Spatial data mining techniques have to address these challenges.

3 Why Is Regional Knowledge Important in Spatial Data Mining?
It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99]. Simpson’s Paradox – global models may be inconsistent with regional models [Simpson1951]. Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional scale rather than a global scale.

4 Example: Regional Association Rules
Scopes of the 4 Rules in

5 Goal of the Presented Research
Develop and implement an integrated computational framework useful for data analysts and scientists from diverse disciplines for extracting regional knowledge in spatial datasets in a highly automated fashion.

6 Related Work
- Spatial co-location pattern discovery [Shekhar et al.]
- Spatial association rule mining [Han et al.]
- Localized associations in segments of basket data [Yu et al.]
- Spatial statistics on hot spot detection [Tay and Brimicombe et al.]
- There is some work on geo-regression techniques (to be discussed later)
Comment: Most work centers on extracting global knowledge from spatial datasets.

7 Preview: A Framework for Extracting Regional Knowledge from Spatial Datasets
- Application 1: Supervised Clustering [EVJW07]
- Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
- Application 3: Find Interesting Regions with Respect to a Continuous Variable [CRET08]
- Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
- Application 5: Find "Representative" Regions (Sampling)
- Application 6: Regional Regression [CE09]
- Application 7: Multi-Objective Clustering [JEV09]
- Application 8: Change Analysis in Related Datasets [RE09]
In contrast to other work in spatial data mining, our work centers on extracting regional or local knowledge from spatial datasets, not on finding global patterns. In particular, we are interested in assisting scientists in finding interesting regions in spatial datasets based on their particular notion of interestingness.
Figure (RD-Algorithm, b=1.01 and b=1.04): wells in Texas; green: safe well with respect to arsenic; red: unsafe well. (UH-DMML)

8 2. Region Discovery Framework

9 Region Discovery Framework (2)
Our proposed framework makes the following assumptions. We assume we have spatial or spatio-temporal datasets with the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable). Clustering occurs in the space of the spatial attributes; regions are found in this space. The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself. For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.

10 Region Discovery Framework (3)
The algorithms we currently investigate solve the following problem:
Given:
- A dataset O with a schema R
- A distance function d defined on instances of R
- A fitness function q(X) that evaluates clusterings X = {c1, …, ck} as follows: q(X) = Σ_{c∈X} reward(c) = Σ_{c∈X} i(c) · size(c)^β, with β ≥ 1
Objective: Find c1, …, ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X = {c1, …, ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous (each pair of objects belonging to ci has to be Delaunay-connected with respect to ci and to d)
4. c1 ∪ … ∪ ck ⊆ O
5. c1, …, ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
Our approach employs reward-based fitness functions of this form: clusters receive rewards based on their interestingness, and rewards increase nonlinearly with cluster size depending on the value of β, favoring clusters c with more objects. This property is important because we want to encourage region discovery algorithms to merge neighboring clusters if they have similar characteristics. The quality of a clustering is the sum of the rewards its individual clusters receive.
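As a concrete illustration, the reward-based fitness function above can be sketched in a few lines of Python; the interestingness values and the default β are made-up stand-ins for a domain-supplied plug-in measure:

```python
def reward(cluster_size, interestingness, beta=1.5):
    """Reward of a single cluster c: i(c) * size(c)**beta, with beta >= 1."""
    return interestingness * cluster_size ** beta

def q(clustering, beta=1.5):
    """Fitness of a clustering X: the sum of its clusters' rewards.
    Each cluster is given as a (size, interestingness) pair."""
    return sum(reward(size, i, beta) for size, i in clustering)

# With beta > 1, merging two equally interesting neighboring clusters
# increases the total reward, which encourages region growing:
merged = reward(20, 0.5)               # one cluster of 20 objects
separate = q([(10, 0.5), (10, 0.5)])   # two clusters of 10 objects each
assert merged > separate
```

The superlinear size term is exactly what makes the framework favor merging neighboring clusters with similar characteristics.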

11 Measure of Interestingness i(c)
The function i(c) is an interestingness measure for a region c: a quantity, based on domain interest, that reflects how "newsworthy" the region is. In our past work, we have designed a suite of measures of interestingness for:
- Supervised clustering [PKDD06]
- Hot spots and cool spots [ICDM06]
- Scope of regional patterns [SSTDM07, GE011]
- Co-location patterns involving continuous variables [PAKDD08, ACM-GIS08]
- High-variance regions involving a continuous variable [PAKDD09]
- Regional regression [ACM-GIS09]

12 Example 1: Finding Regional Co-location Patterns in Spatial Data
Objective: Find co-location regions using various clustering algorithms and novel fitness functions.
Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti-co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas' ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.
Figure 1: Co-location regions involving deep and shallow ice on Mars. Figure 2: Chemical co-location patterns in the Texas water supply.

13 Example 2: Regional Regression
Geo-regression approaches: multiple regression functions are used that vary depending on location.
Regional regression:
- Discover regions with strong relationships between dependent and independent variables
- Construct a regional regression function for each region
- When predicting the dependent variable of an object, use the regression function associated with the location of the object
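A minimal sketch of this prediction scheme; the class name and the nearest-centroid region lookup are illustrative assumptions, not the framework's actual implementation:

```python
import numpy as np

class RegionalRegression:
    """Illustrative sketch: one OLS fit per discovered region; a new object
    is predicted with the function of the nearest region (by centroid)."""

    def __init__(self):
        self.centroids, self.coefs = [], []

    def fit_region(self, locations, X, y):
        # per-region least-squares fit with an intercept column
        A = np.c_[np.ones(len(X)), X]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        self.centroids.append(np.mean(locations, axis=0))
        self.coefs.append(coef)

    def predict(self, location, x):
        # pick the regression function of the closest region
        dists = [np.linalg.norm(np.asarray(location) - c) for c in self.centroids]
        coef = self.coefs[int(np.argmin(dists))]
        return coef[0] + coef[1:] @ np.asarray(x)
```

Two regions with opposite slopes (as in the arsenic/fluoride example later in the talk) then yield opposite predictions depending on where the queried object lies.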

14 Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture "what domain experts find interesting in spatial datasets"
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward)
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5, and 6

15 Clustering with Plug-in Fitness Functions
In the last five years, my research group has developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function. This work is motivated by a mismatch between the evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for. More recently, we have additionally developed hotspot discovery techniques that find interesting regions in polygonal datasets, such as zip-code-based datasets.

16 3. Current Suite of Clustering Algorithms
Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
Grid-based: SCMRG, SCHG
Agglomerative: MOSAIC, SCAH
Density-based: SCDE, DCONTOUR
Figure: taxonomy of the clustering algorithms (representative-based, grid-based, agglomerative, density-based).

17 Representative-based Clustering
Figure: four clusters formed around representatives in an Attribute1/Attribute2 space.
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Characteristic: clusters are formed by assigning objects to the closest representative.
Popular algorithms: K-means, K-medoids, CLEVER, …
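The assignment step that defines this family of algorithms can be sketched as follows; the data points and distance function are toy examples:

```python
def assign(objects, reps, dist):
    """Form clusters by assigning each object to its closest representative."""
    clusters = [[] for _ in reps]
    for o in objects:
        j = min(range(len(reps)), key=lambda k: dist(o, reps[k]))
        clusters[j].append(o)
    return clusters

def euclid(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

pts = [(0, 0), (1, 0), (9, 9), (10, 10)]
clusters = assign(pts, [(0, 0), (10, 10)], euclid)
# objects near (0, 0) form one cluster, objects near (10, 10) the other
```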

18 CLEVER [ACM-GIS08] is a representative-based clustering algorithm, similar to PAM. It searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination, and randomized hill climbing with adaptive sampling to reduce complexity. In general, new clusterings are generated in the neighborhood of the current solution by replacing, inserting, and deleting representatives. CLEVER searches for the optimal number of clusters.

19 Advantages of Grid-based Clustering Algorithms
Fast: no distance computations; clustering is performed on summaries rather than individual objects, so the complexity is usually O(#populated-grid-cells) rather than O(#objects).
It is easy to determine which clusters are neighboring.
Limitation: shapes are limited to unions of grid cells.

20 Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell processing strategy:
1. If a cell receives a reward that is larger than the sum of the rewards of its ancestors: return that cell.
2. If a cell and its ancestors do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).
Another challenge in region discovery is that, in contrast to traditional clustering, clusters are not all equal. Therefore, it is beneficial for a clustering algorithm to spend its resources on enhancing promising clusters, instead of trying to enhance clusters that are unlikely to receive a reward. Consequently, pruning is an important issue in developing efficient region discovery algorithms. The depicted algorithm is a divisive clustering algorithm that employs multi-resolution grids and relies on the above cell processing strategy.

21 Code SCMRG
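A sketch of the cell-processing strategy above, written as a recursion over a multi-resolution cell tree; the dict-based cell representation and the toy tree are illustrative assumptions:

```python
def process(cell, ancestor_reward=None):
    """Return the cells kept as clusters, following the three rules:
    1. keep a cell whose reward exceeds its ancestors' rewards,
    2. prune a cell that, like its ancestor, receives no reward,
    3. otherwise drill down into the cell's children."""
    if ancestor_reward is not None:                     # the root has no ancestor
        if cell['reward'] > ancestor_reward:
            return [cell]                               # rule 1: keep
        if cell['reward'] == 0 and ancestor_reward == 0:
            return []                                   # rule 2: prune
    return [kept for child in cell['children']          # rule 3: drill down
                 for kept in process(child, cell['reward'])]

root = {'reward': 0.0, 'children': [
    {'reward': 5.0, 'children': []},   # promising cell: kept
    {'reward': 0.0, 'children': []},   # no reward, rewardless parent: pruned
]}
kept = process(root)
```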

22 4. Case Studies: Regional Knowledge Extraction
4.1 Regional Regression
4.2 Regional Association Rule Mining & Scoping
4.3 Association-List Based Discrepancy Mining of User Behavior
4.4 Co-location Mining (to be skipped)

23 4.1 REG^2: A Framework of Regional Regression
Motivation: Regression functions vary spatially; they are not constant over space.
Goal: Discover regions with strong relationships between dependent and independent variables, and extract their regional regression functions.
Clustering algorithms with plug-in fitness functions are employed to find such regions; the employed fitness functions reward regions with a low generalization error. Various schemes are explored to estimate the generalization error: example weighting, regularization, penalizing model complexity, and using validation sets.
Figure: discovered regions and their regression functions.
Regularization improves prediction accuracy; REG^2 outperforms other models in SSE_TR:

          AIC Fitness   VAL Fitness   RegVAL Fitness   WAIC Fitness
Arsenic   5.01%         11.19%        3.58%            13.18%
Boston    29.80%        35.69%        38.98%           36.60%

24 Motivation: Regional Knowledge & Regression
1st law of geography: "Everything is related to everything else, but nearby things are more related than distant things" (Tobler). Coefficient estimates in geo-referenced datasets vary spatially, so we need regression methods that discover regional coefficient estimates capturing the underlying structure of the data. Using human-made boundaries (zip codes etc.) is not a good idea, since spatial variation is rarely rectangular.
Before we even start talking about regional regression, or regression altogether, let us talk about why we need regional knowledge. We believe spatial or geo-referenced data contain many patterns that are visible at the regional level but not at the global level; I will give examples to support these claims in the next two slides. But first let me give an example that is not in the paper, inspired by Dr. Hanrahan, our keynote speaker from yesterday. He talked about Hurricane Katrina, and as a person from Houston, TX, where 150,000 Katrina evacuees were moved, this was one of the early things we worked on: we have student data from a large school district in Houston, around 7,000 students.

25 Motivation: Other Geo-Regression Analysis Methods
Regression trees: data is split in a top-down approach using a greedy algorithm; this discovers only rectangular shapes.
Geographically Weighted Regression (GWR): an instance-based, local spatial statistical technique used to analyze spatial non-stationarity. It generates a separate regression equation for a set of observation points, determined using a grid or kernel; the weight assigned to each observation is based on a distance-decay function centered on that observation.

26 Motivation Example 1: Why We Need Regional Knowledge?
Figure: scatter plot of arsenic vs. fluoride concentrations.
Regression result: a positive linear regression line (arsenic increases with increasing fluoride concentration).

27 Motivation Example 1: Why We Need Regional Knowledge?
Figure: the same arsenic/fluoride data split into Location 1 and Location 2.
A negative linear regression line in both locations (arsenic decreases with increasing fluoride concentration): a reflection of Simpson's paradox.

28 Motivation Example 2: Houston House Price Estimate
Dependent variable: House_Price. Independent variables: noOfRooms, squareFootage, yearBuilt, havePool, attachedGarage, etc.

29 Motivation Example 2: Houston House Price Estimate
Global regression (OLS) produces a single global model: coefficient estimates, an R² value, an error, etc. This model assumes all areas have the same coefficients. E.g., the attribute havePool has a coefficient of +9,000 (having a pool adds roughly $9,000 to the house price). In reality this varies between a $100K house and a $500K house, and between zip codes and locations: having a pool in a house in a luxury area adds far more (~$40K) than having a pool in a house in the suburbs (~$5K).

30 Motivation Example 2: Houston House Price Estimate
Figure: two houses, A and B, priced $350,000 and $180,000.
Houses A and B have very similar characteristics, yet OLS produces single parameter estimates for predictor variables like noOfRooms, squareFootage, yearBuilt, etc.

31 Motivation Example 2: Houston House Price Estimate
If we use zip codes as regions, A and B are in the same region. If we use a grid structure, they are in different regions, but some houses similar to B (lake view) are in the same region as A, and this will affect the coefficient estimates. More importantly, the houses around the U-shaped lake show a similar pattern and should be in the same region; otherwise we miss important information.

32 Motivation: Our Approach: Capture the True Pattern Structure!
We need to discover arbitrarily shaped regions, and not rely on a priori defined artificial boundaries. Problems to be solved:
1. Find regions whose objects have a strong relationship between the dependent and independent variables
2. Extract regional regression functions
3. Develop a method to select which regression function to use for a new object to be predicted

33 Methodology: The REGional REGression Framework (REG^2)
REG^2 employs a two-phased approach:
Phase I: Discover regions using a clustering algorithm that maximizes a regression-based (R-squared or AIC) fitness function, along with regional coefficient estimates.
Phase II: Apply techniques to select the correct regional regression function and improve prediction for unseen data.

34 Methodology: So, What Can We Use as Interestingness?
The natural first candidate is adjusted R². R-squared measures the extent to which the total variation of the dependent variable is explained by the model. R-squared alone is not a good measure of goodness of fit: it only deals with the bias of the model and ignores the model's complexity, which leads to overfitting. There are better model selection criteria that balance the tradeoff between bias and variance.

35 Methodology: Fitness Function Candidates
- R²-based fitness functions
- Fitness functions that consider model complexity in addition to goodness of fit, such as AIC or BIC
- Regularization approaches that penalize large coefficients
- Fitness functions that employ validation sets, which provide a better measure of the generalization error (the model's performance on unseen examples)
- An improvement of the previous approach that additionally considers training-set/test-set similarity
- Combinations of the approaches mentioned above

36 Methodology: R-squared Based Fitness Function
The interestingness of a region is based on its adjusted R² value. To battle the tendency towards small regions with high (false) correlation, we use a scaled version of the fitness function and employ a parameter to limit the minimum size of a region. The resulting R²-based interestingness is then plugged into the framework's fitness function q(X).

37 Methodology: AIC Based Fitness Function (AICFitness)
We prefer Akaike's Information Criterion (AIC) because it takes model complexity (number of observations, number of parameters, etc.) into consideration more effectively. AIC provides a balance between bias and variance; in its regression form it is commonly estimated as AIC = n · ln(SSE/n) + 2k, where n is the number of observations and k the number of model parameters. Variations of AIC are available, including AICu [McQuarrie], which is designed for small samples and is therefore a good fit for our small-size regions.

38 Methodology: AIC-based Interestingness iAIC(r)
The AIC-based interestingness iAIC(r) of a region r is derived from the AIC value of the regression model fitted to r, and the AICFitness function plugs iAIC into q(X). AICFitness repeatedly applies regression analysis during the search for the set of regions that yields the best (minimum) overall AIC values.

39 Methodology: Controlling Regional Granularity
β is used to control the number of regions to be discovered, and thus the overall model complexity. Finding a good value for β means striking the right balance between underfitting and overfitting for a given dataset: larger values of β favor a smaller number of larger regions, while smaller values favor a larger number of smaller regions.
Reminder: the region discovery framework's fitness function is q(X) = Σ_{c∈X} reward(c) = Σ_{c∈X} i(c) · size(c)^β.
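A tiny numeric illustration (the numbers are made up) of how β steers granularity under reward(c) = i(c) · size(c)^β:

```python
def reward(i, size, beta):
    return i * size ** beta

# Does one merged region of 20 objects outscore two halves of 10 objects
# with the same interestingness?
merge_wins = {beta: reward(0.5, 20, beta) > 2 * reward(0.5, 10, beta)
              for beta in (1.0, 1.5, 2.0)}
# beta = 1.0 is neutral (equal reward); for beta > 1 the merged region wins,
# so larger beta pushes the search towards fewer, larger regions
```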

40 Experiments & Results: Generalization Error Results, Boston Housing Data

β      SSE_TE (GL)   SSE_TE (REG^2)   SSE Improvement   % of objects with better prediction
1.1    17,182        12,566           27%               72%
1.7                  14,799           26%               65%

Generalization error improvement (SSE_TE): the discovered regions and their regional regression coefficients give better predictions than the global model. Some regions with very high error reduce the overall accuracy, but there is still a 27% improvement (future work item). The relationship between the variables varies spatially.

41 Experiments & Results: Generalization Error Results, Arsenic Data

β      SSE_TE (GL)   SSE_TE (REG^2)   SSE Improvement   % of objects with better prediction
1.1    102,578       98,879           3.6%              57%
1.25                 92,200           8.01%             61%

The regional regression coefficients give only slightly better predictions here. Some of this is due to external factors, e.g. toxic waste or a power plant (analyzed previously using the PCAFitness approach, MLDM09). Some regions with very high error reduce the overall accuracy; still, around 60% of objects are better predicted. Open for improvement with new fitness functions (next).

42 4.2 A Framework for Regional Association Rule Mining and Scoping [GeoInformatica10]
Step 1: Region Discovery (arsenic hot spots). Step 2: Regional Association Rule Mining (an association rule a is discovered). Step 3: Regional Association Rule Scoping (the scope of rule a).
The figure illustrates the basic procedure of our approach. An association rule a, "wells with a nitrate concentration lower than 0.085 mg/l have a dangerous arsenic concentration level", is discovered from an arsenic hot spot area in South Texas with 100% confidence. The scope of the association rule a is a much larger area that mostly overlaps with the Texas Gulf Coast. Statistical analysis shows that the rule a cannot be discovered at the Texas state level due to insufficient confidence (less than 50%). Next, we give the formal definition of our problem.

43 Arsenic Hot Spots and Cool Spots
Step 1: Region Discovery. Step 2: Regional Association Rule Mining. Step 3: Regional Association Rule Scoping.

44 Example Regional Association Rules
Figure: the scopes of rules 1-4. Step 1: Region Discovery; Step 2: Regional Association Rule Mining; Step 3: Regional Association Rule Scoping.

45 Region vs. Scope
The scope of an association rule indicates how regional or global a local pattern is. The region where an association rule originates is a subset of the scope where the association rule holds.

46 Association Rule Scope Discovery Framework
Let a be an association rule, r be a region, conf(a, r) denote the confidence of a in region r, and sup(a, r) denote the support of a in r.
Goal: Find all regions for which an association rule a satisfies its minimum support and confidence thresholds; regions in which a's confidence and support are significantly higher than the min-support and min-conf thresholds receive higher rewards.
Methodology: For each rule a that was discovered for a region r', we run our region discovery algorithm with a fitness function that defines the interestingness of a region ri with respect to a, based on conf(a, ri) and sup(a, ri).
Remarks: Typically d1 = d2 = 0.9, and the confidence term is weighted with an exponent of 2 (a confidence increase is more important than a support increase). Obviously the region r' from which rule a originated, or some variation of it, should be "rediscovered" when determining the scope of a.
The remainder of this talk centers on algorithms for supervised clustering; currently we are investigating several clustering algorithms and comparing their performance.
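Since the exact interestingness formula does not survive in this transcript, the following is only a hedged sketch of the idea it describes: a region scores zero below the thresholds, and otherwise its score grows with the excess confidence and support, with confidence weighted more heavily:

```python
def i_scope(conf, sup, min_conf=0.5, min_sup=0.1, eta=2.0):
    """Sketch (not the paper's formula): interestingness of a region w.r.t.
    a rule a, given conf(a, r) and sup(a, r); eta > 1 weights confidence more."""
    if conf < min_conf or sup < min_sup:
        return 0.0   # the rule does not hold in this region
    return (conf - min_conf) ** eta * (sup - min_sup)

# a region that clearly exceeds both thresholds outscores one that barely does
assert i_scope(0.9, 0.3) > i_scope(0.55, 0.12) > 0.0
```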

47 Regional Association Rule Scoping
Figure: the Ogallala Aquifer and the Gulf Coast Aquifer. The scope of an association rule a is the set of regions where a satisfies the minimum support and confidence thresholds. (TCEQ: Texas Commission on Environmental Quality.)

48 Fine-Tuning Confidence and Support
We can fine-tune the measure of interestingness for association rule scoping by changing the minimum confidence and support thresholds.

49 4.3 (Regional) Models for Internet User Behaviour
Problem: We are interested in finding spatial patterns with respect to a performance variable, based on some context described using a set of variables.
Main theme: We try to predict whether a user clicks on a given ad, based on the keywords that occur in the ad, socio-economic factors, and proximity to spatial objects of a particular type.
Example finding: We found that the click-through rate is significantly higher for zip codes in the proximity of airports (a global pattern).
Complication: The datasets are very large.
Our subtopic: As usual, we are interested in extracting knowledge concerning the regional variation of clicking behavior.
Contributors: Ruth Miller, Chun-sheng Chen, Abraham Bagherjeiran

50 Research Goals: Yahoo! Project
1. Develop algorithms that generate groups and summarize the characteristics of groups
2. Propose similarity measures to compare different groups
3. Compare different regional groups with respect to discrepancies of user behavior, to: extract regional knowledge from the groups; extract discrepancy knowledge that describes how the behavior of different users differs in different regions; and describe how regional behavior differs from global behavior
4. Develop regional prediction techniques, by using the knowledge obtained in step 3 to create new features, and by generalizing our regional prediction work presented in Part 4.1

51 5. Methodologies and Tools to Analyze Related Datasets
Subtopics:
- Disparity Analysis / Emergent Pattern Discovery ("how do two groups differ with respect to their patterns?") [SDE10]
- Change Analysis ("what is new/different?") [CVET09]
- Correspondence Clustering ("mining interesting relationships between two or more datasets") [RE10]
- Meta Clustering ("cluster the cluster models of multiple datasets")
- Analyzing Relationships between Polygonal Cluster Models
Example: analyzing changes with respect to regions of high variance of earthquake depth between Time 1 and Time 2; emerging regions are found with the novelty change predicate Novelty(r') = r' − (r1 ∪ … ∪ rk).

52 6. Summary
- A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
- Families of clustering algorithms and families of measures of interestingness are provided that form the core of the framework.
- Evidence concerning the usefulness of the framework for regional association rule mining, regional regression, and co-location mining has been presented.
- The special challenges in designing clustering algorithms for region discovery have been identified.
- The ultimate vision of this research is the development of region discovery engines that assist data analysts and scientists in finding interesting regions in spatial datasets.

53 Other Contributors to the Work Presented Today
Graduated PhD students:
- Wei Ding (Regional Association Rule Mining, Grid-based Clustering)
- Rachsuda Jiamthapthaksin (Agglomerative Clustering, Multi-Run Clustering)
- Oner Ulvi Celepcikay (Regional Regression)
- Vadeerat Risurongkawong (Analyzing Multiple Datasets, Change Analysis)
Current PhD students:
- Chun-sheng Chen (Density-based Clustering, Regional Knowledge Extraction)
- Ruth Miller (Dataset Creation, Models for Internet Behavior)
Graduated Master's students:
- Rachana Parmar (CLEVER, Co-location Mining)
- Seungchan Lee (Grid-based Clustering, Agglomerative Clustering)
- Dan Jiang (Density-based Clustering, Co-location Mining)
- Jing Wang (Grid-based and Representative-based Clustering)
Software platform and software design:
- Abraham Bagherjeiran (PhD student UH, now at Yahoo!)
Domain experts:
- Tomasz Stepinski (Lunar and Planetary Institute, Houston, Texas)
- J.-P. Nicot (Bureau of Economic Geology, UT, Austin)
- Michael Twa (College of Optometry, University of Houston)

54 CLEVER Pseudo Code
Inputs: Dataset O, k', neighborhood-size, p, p',
Outputs: Clustering X, fitness q
Algorithm:
1. Create a current solution by randomly selecting k' representatives from O.
2. Create p neighbors of the current solution randomly, using the given neighborhood definition.
3. If the best neighbor improves the fitness q, it becomes the current solution; go back to step 2.
4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p' more neighbors. If re-sampling does not lead to a better solution, terminate, returning the current solution; otherwise, go back to step 2, replacing the current solution by the best solution found by re-sampling.
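For concreteness, here is a simplified, runnable rendering of the pseudo-code above on a toy 1-D dataset. The plug-in fitness and the neighborhood definition (replace one representative, fixed k') are deliberately reduced; they are not the full CLEVER operators, which also insert and delete representatives:

```python
import random

def fitness(reps, data):
    # toy plug-in fitness: negative total distance to the closest representative
    return -sum(min(abs(x - r) for r in reps) for x in data)

def clever(data, k, p=10, p_resample=20, seed=0):
    rng = random.Random(seed)
    current = rng.sample(data, k)                       # step 1
    while True:
        def neighbors(n):                               # step 2: sample neighbors
            for _ in range(n):
                cand = list(current)
                cand[rng.randrange(k)] = rng.choice(data)
                yield cand
        best = max(neighbors(p), key=lambda r: fitness(r, data))
        if fitness(best, data) > fitness(current, data):
            current = best                              # step 3: climb
            continue
        best = max(neighbors(p_resample),               # step 4: re-sample
                   key=lambda r: fitness(r, data))
        if fitness(best, data) > fitness(current, data):
            current = best
        else:
            return current, fitness(current, data)      # terminate

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
reps, q = clever(data, k=2)
```

On this toy dataset the search typically ends with one representative in each of the two point groups.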

