Lecture 7: Introduction to Spatial Data Mining

Dr. Taysir Hassan Abdel Hamid
Associate Professor, Information Systems Dept., Faculty of Computers and Information, Assiut University
April 3, 2016
INF424, GIS. Dr. Taysir Hassa A. Soliman

Examples of Spatial Patterns
- Historic examples
  - 1855 Asiatic cholera in London: a water pump was identified as the source
- Modern examples
  - Cancer clusters, to investigate environmental health hazards
  - Crime hotspots, for planning police patrol routes
  - Spread of West Nile virus

What is a Spatial Pattern ?
- What is not a pattern?
  - Random, haphazard, chance, stray, accidental, unexpected
  - Without definite direction, trend, rule, method, design, aim, purpose
  - Accidental: without design, outside the regular course of things
  - Casual: absence of pre-arrangement, relatively unimportant
- What is a pattern?
  - A rule, law, method, design, description
  - A major direction, trend, prediction

What is Spatial Data Mining?
- Defining spatial data mining
  - Search for spatial patterns
  - Non-trivial search: as “automated” as possible, to reduce human effort
  - Interesting, useful, and unexpected spatial patterns

What is Spatial Data Mining? - 2
- Non-trivial search for interesting and unexpected spatial patterns
- Non-trivial search
  - Ex. cholera: possible causes include water, food, air, insects, …; water delivery mechanisms include numerous pumps, rivers, pipes, …
- Interesting
  - Useful in a certain application domain
  - Ex. shutting off the identified water pump saved human lives
- Unexpected
  - The pattern is not common knowledge
  - May provide a new understanding of the world

What is NOT Spatial Data Mining?
- Simple querying of spatial data
  - Find the neighbors of Egypt, given the names and boundaries of all countries
  - Find the shortest path from Cairo to Alexandria in a freeway map
- Uninteresting or obvious patterns in spatial data
  - Heavy rainfall in Alexandria is correlated with heavy rainfall in Damanhour, given that the two cities are close to each other
  - Common knowledge: nearby places have similar rainfall

Why Learn about Spatial Data Mining?
- Two basic reasons for new work
  - Consideration of use in certain application domains
  - Provide fundamental new understanding
- Application domains
  - Scale up secondary spatial (statistical) analysis to very large datasets
  - Describe/explain the locations of human settlements over the last 5000 years
  - Find cancer clusters to locate hazardous environments
  - Prepare land-use maps from satellite imagery

Why Learn about Spatial Data Mining? - 2
- New understanding of geographic processes for critical questions
  - Ex. How is the health of planet Earth?
  - Ex. Characterize the effects of human activity on the environment and ecology
- Traditional approach: manually generate and test hypotheses
  - But spatial data is growing too fast to analyze manually

Why Learn about Spatial Data Mining? - 3
- The number of possible geographic hypotheses is too large to explore manually
  - Large number of geographic features and locations
  - The number of interacting subsets of features grows exponentially
- SDM (spatial data mining) may reduce the set of plausible hypotheses
  - Identify hypotheses supported by the data
  - For further exploration using traditional statistical methods

Spatial Data Mining: Actors
- Domain expert
  - Identifies SDM goals and the spatial dataset
  - Describes domain knowledge, e.g. well-known patterns and correlates
  - Validates new patterns
- Data mining analyst
  - Helps identify pattern families and the SDM techniques to be used
  - Explains the SDM outputs to the domain expert
- Joint effort
  - Feature selection
  - Selection of patterns for further exploration

The Data Mining Process
Typically, a “practical” data mining process involves close collaboration between the “domain expert” and the data mining analyst. The domain expert provides the data and the subject-matter expertise. The data mining analyst has experience in dealing with large data sets. The analyst applies conventional techniques such as “Regression”, “Association Rules”, and “Clustering” to generate a series of hypotheses, which can then be tested and verified with classical statistical tools. Thus data mining is a filter step that generates hypotheses rather than tests them.

Choice of Methods
- Two approaches to mining spatial data
  1. Pick spatial features; use classical DM methods
  2. Use novel spatial data mining techniques
- Possible approach
  - Define the problem: capture special needs
  - Explore the data using maps and other visualizations
  - Try reusing classical DM methods
  - If classical DM methods perform poorly, try new methods
  - Evaluate the chosen methods rigorously
  - Tune performance as needed

Families of SDM Patterns
- Common families of spatial patterns
  - Location prediction: Where will a phenomenon occur?
  - Spatial interaction: Which subsets of spatial phenomena interact?
  - Hot spots: Which locations are unusual?
- Note
  - Other families of spatial patterns may be defined
  - SDM is a growing field, which should accommodate new pattern families

Location Prediction
- Questions addressed
  - Where will a phenomenon occur?
  - Which spatial events are predictable?
  - How can a spatial event be predicted from other spatial events? Equations, rules, other methods, …
- Examples
  - Which areas are prone to fire, given maps of vegetation, etc.?
  - What should be recommended to a traveler in a given location?
- Exercise: List two prediction patterns.

Hot Spots
- Questions addressed
  - Is a phenomenon spatially clustered?
  - Which spatial entities or clusters are unusual?
  - Which spatial entities share common characteristics?
- Examples
  - Cancer clusters [CDC], to launch investigations
  - Crime hot spots, to plan police patrols
- Defining unusual
  - Comparison group: the neighborhood or the entire population
  - Significance: the probability of being unusual is high

Mapping Techniques to Spatial Pattern Families
- Overview
  - There are many techniques for finding a spatial pattern family
  - The choice of technique depends on feature selection, the spatial data, etc.
- Spatial pattern families vs. techniques
  - Location prediction: classification, function determination
  - Interaction: correlation, association
  - Hot spots: clustering, outlier detection
- We discuss these techniques now
  - With emphasis on spatial problems
  - Even though these techniques also apply to non-spatial datasets

Techniques for Location Prediction
- Classical methods: decision trees, Bayesian classifiers
  - Assume that learning samples are independent of each other
  - Spatial auto-correlation violates this assumption!
  - Q: What would a map look like if the properties of a pixel were independent of the properties of other pixels?
- New spatial methods
  - Markov random field Bayesian classifier

Model Evaluation
- Confusion matrix M for 2-class problems
  - 2 rows: actual nest (True), actual non-nest (False)
  - 2 columns: predicted nest (Positive), predicted non-nest (Negative)
- 4 cells listing the number of pixels in the following groups
  - A nest is correctly predicted: True Positive (TP)
  - The model predicts a nest where there was none: False Positive (FP)
  - A no-nest is correctly classified: True Negative (TN)
  - A no-nest is predicted at a nest: False Negative (FN)

Model evaluation…cont
- Outcomes of classification algorithms are typically probabilities
- Probabilities are converted to class labels by choosing a threshold level b
  - For example, probability > b is “nest” and probability < b is “no-nest”
- TPR is the True Positive Rate, TP / (TP + FN); FPR is the False Positive Rate, FP / (FP + TN)
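The thresholding step can be illustrated with a minimal Python sketch. The probabilities and the actual nest/no-nest labels below are made-up illustration data, and the function names are hypothetical. A probability exactly equal to b is treated as “no-nest” here, which is an assumption, since the slide leaves that case open.

```python
def confusion_counts(probs, actual, b):
    """Count (TP, FP, TN, FN) when probability > b is labelled "nest"."""
    tp = fp = tn = fn = 0
    for p, is_nest in zip(probs, actual):
        predicted_nest = p > b  # assumption: p == b counts as "no-nest"
        if predicted_nest and is_nest:
            tp += 1
        elif predicted_nest and not is_nest:
            fp += 1
        elif not predicted_nest and not is_nest:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (TPR, FPR); a rate is 0.0 when its denominator is empty."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Made-up classifier outputs and ground truth for six pixels.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actual = [True, True, False, True, False, False]

counts = confusion_counts(probs, actual, b=0.5)
tpr, fpr = rates(*counts)
print(counts, tpr, fpr)
```

Repeating the call for several values of b shows how raising the threshold trades true positives against false positives.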

Techniques for Association Mining
- Classical method: association rules, given item-types and transactions
  - Assumes spatial data can be decomposed into transactions
  - However, such decomposition may alter spatial patterns
- New spatial methods
  - Spatial association rules
  - Spatial co-locations

Associations, Spatial Associations, Co-location
Find patterns in the following sample dataset.
[Figure: sample dataset and answers]

Unique Properties of Spatial Patterns
- Items in traditional data are independent of each other, whereas properties of locations in a map are often “auto-correlated”.
- Traditional data deals with simple domains, e.g. numbers and symbols, whereas spatial data types are complex.
- Items in traditional data describe discrete objects, whereas spatial data is continuous.

Example: Clustering and Auto-correlation
Note clustering of nest sites and smooth variation of spatial attributes

Basics of Probability
- Conditional probability: given that an event B has occurred, the conditional probability that event A will occur is P(A|B).
- A basic rule is P(AB) = P(A|B)P(B) = P(B|A)P(A)
- Bayes' rule allows inversion of probabilities: P(A|B) = P(B|A)P(A) / P(B)

Co-location Patterns
- A co-location pattern is a group of spatial features/events that are frequently co-located in the same region.
- For example, human cases of West Nile virus often occur in regions with poor mosquito control and the presence of birds.
- Studies of co-location pattern mining often emphasize the equal participation of every spatial feature.

- The objective of co-location pattern mining is to find frequently co-located subsets of spatial features.
- For example, the co-location “{traffic jam, police, car accident}” means that a traffic jam, police, and a car accident frequently occur near one another.
- To capture the concept of “nearby”, the concept of a user-specified neighbor-set was introduced.

- Spatial co-location pattern mining is similar to association mining.
- A spatial association rule is a rule of the form “A -> B”, where A and B are sets of predicates, some of which are spatial.

Example
Integrate the agencies involved in responding to road accidents, and enable them to exchange the data each one needs to address and handle road accidents.

Example: Mining of spatial association rules Steps
1. Extract spatial objects from the spatial database.
2. For each object of the reference class, discover its specified spatial relationship with spatial objects of other classes.
3. Mine spatial association rules.

Mining of spatial association rules requires
1. A spatial database
2. A reference class (“accidents”)
3. A set of classes whose relationships with the reference class are interesting to users (“road,” “buildings,” “fire stations,” “traffic units,” and “medical services”)
4. A spatial relationship type (a “close to” relationship between the reference class and other classes)
5. A minimum support threshold and a minimum confidence threshold

Co-location Rules
- Motivation
  - Association rules need transactions (subsets of instances of item-types)
  - Spatial data is continuous
  - Decomposing spatial data into transactions may alter patterns
  - Spatial co-location mining seeks to discover a set of features whose instances are frequently located in close proximity
- Co-location rules
  - For point data in space
  - Do not need transactions; work directly with continuous space
  - Use a neighborhood definition and spatial joins

Spatial Co-location Mining (Example)
- Spatial co-locations represent a subset of features whose instances are frequently located together in spatial neighborhoods.
- The co-location mining procedure starts with the spatial datasets of interest (civil defense, medical services, the traffic agency, and the roads department) and works in conjunction with the main application.

Scenario
1. An accident and a fire happen at the same location. The traffic web service and civil defense inform the Main Web Service.
2. The Main Web Service notifies traffic and civil defense that their plans for resolving the incident conflict and, at the same time, requests the information both of them need from the medical web service.
3. The medical web service sends the needed information to the Main Web Service.

Spatial Co-location Mining (Cont…)
4. The Main Web Service synchronizes the information obtained from the medical web service and sends it to both traffic and civil defense.

Association Rules Discovery
- An association rule has three parts
  - Rule: X => Y, i.e. antecedent (X) implies consequent (Y)
  - Support = the number of times a rule shows up in a database
  - Confidence = conditional probability of Y given X
- Examples
  - Generic: chocolate and milk sell together on weekday evenings [Carrefour]
  - Spatial: (bedrock type = limestone), (soil depth < 50 feet) => (sink hole risk = high); support = 20 percent, confidence = 0.8
  - Interpretation: locations with limestone bedrock and low soil depth have a high risk of sink hole formation
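As a minimal sketch of how support and confidence can be computed, the Python below evaluates one rule over a handful of made-up transactions. The transactions, item names, and function names are hypothetical. Support is reported as a fraction of all transactions, matching the "support = 20 percent" style of the example, rather than as a raw count.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    n_both = support_count(transactions, antecedent | consequent)
    n_antecedent = support_count(transactions, antecedent)
    support = n_both / len(transactions)
    # Confidence estimates the conditional probability of Y given X.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up "transactions": one set of discretized attributes per location.
transactions = [
    {"limestone", "shallow_soil", "high_sinkhole_risk"},
    {"limestone", "shallow_soil", "high_sinkhole_risk"},
    {"limestone", "shallow_soil"},
    {"limestone", "deep_soil"},
    {"granite", "shallow_soil"},
]

support, confidence = rule_stats(
    transactions,
    antecedent={"limestone", "shallow_soil"},
    consequent={"high_sinkhole_risk"},
)
print(support, confidence)
```

Here the rule holds in 2 of 5 locations, and in 2 of the 3 locations that satisfy the antecedent.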

Co-location Rules vs. Association Rules

                                 Association rules        Co-location rules
Underlying space                 discrete sets            continuous space
Item-types                       item-types               events / Boolean spatial features
Collection                       transaction (T)          neighborhood (N)
Prevalence measure               support                  participation index
Conditional probability metric   Pr.[ A in T | B in T ]   Pr.[ A in N(L) | B at location L ]

- Participation index = min{ pr(fi, c) }, where pr(fi, c), the participation ratio of feature fi in co-location c = {f1, f2, …, fk}, is the fraction of instances of fi with features {f1, …, fi-1, fi+1, …, fk} nearby
- N(L) = neighborhood of location L
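The participation index can be sketched in Python for the simplest case, a co-location of two features. This is an illustration under stated assumptions: "nearby" is taken to mean within a fixed Euclidean radius, the point coordinates are made up, and the function names are hypothetical. Co-locations of three or more features need the instances to be mutually nearby, which this sketch does not handle.

```python
from math import dist


def participation_ratio(points_f, points_g, radius):
    """Fraction of instances of feature f with an instance of g within radius."""
    if not points_f:
        return 0.0
    near = sum(
        1 for p in points_f
        if any(dist(p, q) <= radius for q in points_g)
    )
    return near / len(points_f)


def participation_index_pair(points_a, points_b, radius):
    """Participation index of the two-feature co-location {A, B}."""
    return min(
        participation_ratio(points_a, points_b, radius),
        participation_ratio(points_b, points_a, radius),
    )


# Made-up point locations for two features.
accidents = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
traffic_jams = [(0.5, 0.0), (5.0, 5.5)]

pi = participation_index_pair(accidents, traffic_jams, radius=1.0)
print(pi)
```

Every traffic jam has an accident nearby, but only half of the accidents have a traffic jam nearby, so the minimum of the two ratios is what the index reports.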

Clustering Analysis
- What is clustering analysis?
- Clustering in data mining applications
- Handling different types of variables
- Major clustering techniques
- Outlier discovery
- Problems and challenges

What Is Clustering?
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
  - It helps users understand the natural grouping or structure in a data set.
- Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
- Clustering is unsupervised classification: there are no predefined classes.
- Used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms.

What Is Good Clustering?
- A good clustering method will produce high-quality clusters in which:
  - the intra-class (that is, intra-cluster) similarity is high
  - the inter-class similarity is low
- The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Interpretability and usability

Applications of Clustering
Clustering has wide applications in:
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults.

Major Clustering Techniques
- Clustering techniques have been studied extensively in statistics, machine learning, and data mining, with many methods proposed and studied.
- Clustering methods can be classified into:
  - Partitioning algorithms
  - Hierarchical algorithms
  - Density-based methods

Some Categories of Clustering Methods
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based algorithms: based on connectivity and density functions.

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen’67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
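A minimal Python sketch of the k-means heuristic follows. It uses the common batch iteration (assign every point to its nearest center, then move each center to the mean of its members), which is an assumption, since the slide only names the algorithm. The sample points, the seeded random initialization, and the function name are illustrative choices.

```python
import random
from math import dist


def k_means(points, k, iterations=100, seed=0):
    """Batch k-means: alternate assignment and center-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, labels


# Two made-up, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

The result depends on the initial centers in general; on data this well separated the two groups are recovered, but which group receives label 0 depends on the seed.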

Hierarchical Clustering
- Uses a distance matrix as the clustering criterion.
- This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: objects a, b, c, d, e over Step 0 to Step 4, with intermediate clusters a b, d e, c d e, and a b c d e; agglomerative clustering (AGNES) runs in one direction and divisive clustering (DIANA) in the other]
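The agglomerative direction can be sketched in Python as follows. Two choices here are assumptions made for illustration, not taken from the slide: the distance between clusters is the single-link (closest pair) distance, and the termination condition is a maximum merge distance. The sample points and the function name are hypothetical.

```python
from math import dist


def agglomerative(points, max_merge_distance):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # Step 0: every object is its own cluster

    def linkage(c1, c2):
        # Single-link distance: the closest pair across the two clusters.
        return min(dist(p, q) for p in c1 for q in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_merge_distance:  # termination condition
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up 2-D points forming two loose groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (5.0, 7.5)]
clusters = agglomerative(points, max_merge_distance=2.0)
print(clusters)
```

No value of k is supplied; the number of clusters follows from where the merge distance exceeds the threshold.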

Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD’96)
  - OPTICS: Ankerst, et al. (SIGMOD’99)
  - DENCLUE: Hinneburg & Keim (KDD’98)
  - CLIQUE: Agrawal, et al. (SIGMOD’98)

DBSCAN: A Density-Based Clustering Method
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise
  - Proposed by Ester, Kriegel, Sander, and Xu (KDD’96)
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise

Handling Complex Shaped Clusters

Density-Based Clustering: Background
- Two parameters:
  - Eps: maximum radius of the neighbourhood
  - MinPts: minimum number of points in an Eps-neighbourhood of that point
- NEps(p): {q belongs to D | dist(p,q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
  1) p belongs to NEps(q)
  2) core point condition: |NEps(q)| >= MinPts
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]

Density-Based Clustering: Background (II)
- Density-reachable: a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
- Density-connected: a point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
[Figure: a chain q, p1, p illustrating density-reachability, and points p, q, o illustrating density-connectivity]

DBSCAN: General Ideas
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]

DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p wrt. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed
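The steps above can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: neighbourhoods are found by a linear scan instead of a spatial index, points are visited in list order rather than arbitrarily, and the sample points, `NOISE` marker, and function names are hypothetical. The Eps-neighbourhood of a point includes the point itself, following the definition dist(p,q) <= Eps.

```python
from math import dist

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighbourhood of points[i]."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        # points[i] is a core point: grow a new cluster from it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels


# Two made-up dense squares and one isolated point between them.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (5.0, 5.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point has too few neighbours to be a core point and is not reachable from any core point, so it keeps the noise label.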

What Is Outlier Discovery?
- What are outliers?
  - A set of objects that are considerably dissimilar from the remainder of the data
  - Example: sports: Michael Jordan, Wayne Gretzky, ...
- Problem
  - Given: data points
  - Find the top n outlier points
- Applications:
  - Credit card fraud detection
  - Telecom fraud detection
  - Customer segmentation
  - Medical analysis

Idea of Outliers
- What is an outlier?
  - Observations inconsistent with the rest of the dataset
  - Techniques for global outliers
    - Statistical tests based on membership in a distribution: Pr.[item in population] is low
    - Non-statistical tests based on distance, nearest neighbors, convex hull, etc.
- What is a spatial outlier?
  - Observations inconsistent with their neighborhoods
  - A local instability or discontinuity
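One possible way to test for spatial outliers is sketched below in Python. The test is an assumption chosen for illustration: each location's value is compared with the mean value of its neighbors, and a location is flagged when that difference is far from the typical difference, measured in standard deviations. The station values, the neighbor lists, the threshold of 2.0, and the function name are all hypothetical.

```python
from statistics import mean, pstdev


def spatial_outliers(values, neighbors, threshold=2.0):
    """Flag locations whose value is inconsistent with their neighborhood."""
    # Difference between each value and the mean of its neighbors' values.
    diffs = {
        loc: values[loc] - mean([values[n] for n in neighbors[loc]])
        for loc in values
    }
    all_diffs = list(diffs.values())
    mu = mean(all_diffs)
    sigma = pstdev(all_diffs)
    if sigma == 0:
        return []  # every location agrees equally well with its neighbors
    return [
        loc for loc, d in diffs.items()
        if abs(d - mu) / sigma > threshold
    ]


# Made-up readings at nine stations along a road; station 4 is anomalous.
values = {0: 10, 1: 11, 2: 10, 3: 12, 4: 30, 5: 11, 6: 10, 7: 12, 8: 11}
# Each station's neighbors are the adjacent stations on the road.
neighbors = {
    i: [j for j in (i - 1, i + 1) if j in values] for i in values
}

flagged = spatial_outliers(values, neighbors)
print(flagged)
```

The comparison group is the neighborhood, not the whole population, so a value can be flagged even if it would look ordinary in a global histogram.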

Thank you