Result of N Categorical Variable Regional Co-location Mining

Result of N Categorical Variable Regional Co-location Mining (November 11, 2008)

[Figure: result maps for the dataset, showing 4-class lattices with parameter pairs (0.3, 0.4) and (0.2, 0.3) over the class variables A, B, C, D, and discovered co-location patterns including AsP, AsFeSe, AsSeFeF, and AsF.]

Regional Co-Location Mining Framework for q Binary Variables

Example: co-location set CS = {A, B, C};
CoLoc-interestingness({A,B,C}, r) = (φ(r,A) − 1) · (φ(r,B) − 1) · (φ(r,C) − 1)

Remark: the strength φ is computed by comparing the region r with the whole dataset O.
Remarks: we have to iterate over all possible co-location sets; the interestingness of a region is the interestingness of its maximum-valued co-location set.
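A minimal sketch of the two quantities above, the probability multiplier φ(r,C) and the product-based co-location interestingness (function and variable names here are illustrative, not from the slides; objects are modeled as sets of the class labels they carry):

```python
def phi(region, dataset, C):
    """Probability multiplier of class C in a region:
    P(C within the region) divided by P(C in the whole dataset)."""
    p_region = sum(1 for o in region if C in o) / len(region)
    p_global = sum(1 for o in dataset if C in o) / len(dataset)
    return p_region / p_global

def coloc_interestingness(coloc_set, region, dataset):
    """Product of (phi(r,C) - 1) over the classes in the co-location set."""
    prod = 1.0
    for C in coloc_set:
        prod *= phi(region, dataset, C) - 1
    return prod

# Toy example: each object is the set of class labels it belongs to.
dataset = [{"A", "B"}, {"A"}, {"B"}, set(), {"A", "B"}, set()]
region = dataset[:3]  # a hypothetical region covering the first three objects
# phi(region, dataset, "A") is (2/3) / (3/6) = 4/3: A is over-represented in r.
```

A multiplier above 1 means the class is over-represented in the region, so each factor (φ − 1) is positive exactly when its class is enriched there.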

Region Interestingness

Region interestingness is assessed by computing the most prevalent pattern: it depends solely on the most interesting co-location set found in the region.

Example of a Result

All experiments: P(B) = (As↑ ∈ B or As↓ ∈ B) and |B| < 5. Experiment 1: parameter value 1.3, …

Top 5 regions of Experiment 1:

Rank | Region Size | Region Reward | Maximum-Valued Pattern in the Region | Average Product for the Maximum-Valued Pattern
   1 |          23 |      174.3191 | AsMoVF-                              | 211.0179
   2 |          40 |      104.8576 | AsMoV                                | 161.3194
   3 |          11 |       92.9385 | AsMoVSO42-                           | 170.3873
   4 |          36 |       89.4068 | AsBCl-TDS                            | 153.2687
   5 |           7 |       30.5775 | AsMoCl-TDS                           |  53.5107

Co-Location Mining Framework for q Binary Class Variables, Version 1

O – the dataset
r ⊆ O – a region
o ∈ O – an object in the dataset O
CS = {C1, …, Cq} – the set of binary class variables that form base patterns; o ∈ C ⇔ o.C = true
th – the class-multiplier interestingness threshold, default value 1
β ∈ [0, ∞) – form parameter, default value 1
C ∈ CS – a single class variable
B ⊆ CS – a co-location set
P(B) – a predicate over B that restricts the set of co-location sets considered; e.g. P(B) ⇔ |B| < 5, or P(B) ⇔ As↑ ∈ B ("only look for patterns involving high arsenic")
φ(r,C) = (|{o ∈ r | o ∈ C}| / |r|) / (|{o ∈ O | o ∈ C}| / |O|) – C's probability multiplier in r; high interestingness is associated with high multipliers
z(C,r) = φ(r,C) − th if φ(r,C) > th, else 0 – the normalized interestingness of C in r
k(B,r) = Π_{C ∈ B} z(C,r) – the normalized interestingness of co-location set B in r
i(r) = max_{B ⊆ CS, |B| > 1, P(B)} k(B,r) – region interestingness: the maximum normalized interestingness observed for the subsets B ⊆ CS allowed by P
Reward(r) = i(r) · |r|^β
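The Version 1 pipeline (φ, z, k, i, Reward) can be sketched end to end; names and the toy data are illustrative, not from the slides:

```python
from itertools import combinations

def phi(region, dataset, C):
    """C's probability multiplier in the region (region lift vs. the whole dataset)."""
    return ((sum(1 for o in region if C in o) / len(region))
            / (sum(1 for o in dataset if C in o) / len(dataset)))

def reward_v1(region, dataset, classes, th=1.0, beta=1.0, max_size=4,
              P=lambda B: True):
    """Version-1 reward: multipliers are thresholded per class (z), multiplied
    over B (k), maximized over the co-location sets with |B| > 1 that satisfy
    the predicate P (i), and scaled by |r|**beta."""
    def z(C):
        m = phi(region, dataset, C)
        return m - th if m > th else 0.0

    best = 0.0
    for size in range(2, max_size + 1):
        for B in combinations(classes, size):
            if P(set(B)):
                k = 1.0
                for C in B:
                    k *= z(C)
                best = max(best, k)
    return best * len(region) ** beta  # Reward(r) = i(r) * |r|^beta
```

Because z(C,r) is 0 whenever φ(r,C) ≤ th, a single weak class zeroes out the whole product, so Version 1 only rewards sets in which every class clears the threshold.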

Co-Location Mining Framework for q Binary Class Variables, Version 2

O – the dataset
r ⊆ O – a region
o ∈ O – an object in the dataset O
CS = {C1, …, Cq} – the set of binary class variables that form base patterns; o ∈ C ⇔ o.C = true
th – the co-location-set interestingness threshold (th ≥ 1)
β ∈ [0, ∞) – form parameter, default value 1
C ∈ CS – a single class variable
B ⊆ CS – a co-location set
P(B) – a predicate over B that restricts the set of co-location sets considered; e.g. P(B) ⇔ |B| < 5, or P(B) ⇔ As↑ ∈ B ("only look for patterns involving high arsenic")
φ(r,C) = (|{o ∈ r | o ∈ C}| / |r|) / (|{o ∈ O | o ∈ C}| / |O|) – C's probability multiplier in r; high interestingness is associated with high multipliers
k(B,r) = Π_{C ∈ B} φ(r,C) – the interestingness of co-location set B in r
i'(r) = max_{B ⊆ CS, |B| > 1, P(B)} k(B,r) – the maximum interestingness observed for the subsets B ⊆ CS allowed by P
i(r) = i'(r) − th if i'(r) > th, else 0 – the normalized region interestingness
Reward(r) = i(r) · |r|^β
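Version 2 differs only in where the threshold is applied, which the following self-contained sketch makes explicit (again with illustrative names):

```python
from itertools import combinations

def phi(region, dataset, C):
    """C's probability multiplier in the region (region lift vs. the whole dataset)."""
    return ((sum(1 for o in region if C in o) / len(region))
            / (sum(1 for o in dataset if C in o) / len(dataset)))

def reward_v2(region, dataset, classes, th=1.0, beta=1.0, max_size=4,
              P=lambda B: True):
    """Version-2 reward: the raw multiplier product k(B,r) is maximized first,
    and the threshold th is subtracted once, at the region level."""
    best = 0.0
    for size in range(2, max_size + 1):
        for B in combinations(classes, size):
            if P(set(B)):
                k = 1.0
                for C in B:
                    k *= phi(region, dataset, C)
                best = max(best, k)
    i = best - th if best > th else 0.0
    return i * len(region) ** beta  # Reward(r) = i(r) * |r|^beta
```

Since the product uses the raw multipliers, a very strong class can compensate for a weak one in Version 2, whereas Version 1's per-class thresholding cannot be compensated.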

Datasets and Program Interface

Discretize each z-score-normalized variable A as follows:
z(A) ≥ 1: A↑
z(A) ≤ −1: A↓
otherwise: A (neutral)

The transformed dataset therefore has the form: <longitude, latitude, <class-variable>+>

Limit the co-location sets we are looking for in the experiments to "↑" and "↓" class variables, to make the approach comparable to the continuous approach.
Limit co-location sets to sizes 2–4 in the experiments!
Possibly conduct experiments with larger sets using a single seed pattern, e.g. D.

Therefore the program inputs of the categorical regional co-location mining versions should include:
k' – the maximum set size considered
the seed pattern, e.g. B, if we have one; if no seed pattern is given, all sets of sizes 2, …, k' will be considered
the pattern list considered
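The z-score discretization above can be sketched as follows (a minimal version with hypothetical names; the ↑/↓ arrows are rendered as "up"/"down" labels, and the population standard deviation is assumed):

```python
def discretize(values):
    """Map a numeric attribute to categorical labels via z-scores:
    z >= 1 -> 'up', z <= -1 -> 'down', otherwise 'neutral'."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    labels = []
    for v in values:
        z = (v - mean) / std
        if z >= 1:
            labels.append("up")
        elif z <= -1:
            labels.append("down")
        else:
            labels.append("neutral")
    return labels
```

Each spatial object then carries one such label per attribute, yielding the <longitude, latitude, class-variable+> form described above; only the "up"/"down" labels enter the co-location sets.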