1
Spatial Outlier Detection and implementation in Weka Implemented by: Shan Huang Jisu Oh CSCI8715 Class Project, April Presented by Jisu Oh (Group 2) Slides Available at
2
Topics: Motivation Problem Statement Key Concepts Major Contributions
Validation Methodology Assumptions Conclusions Future work
3
Motivation Machine learning /Data mining
Enables a computer program to analyze large-scale data and identify important information that can be used to make predictions or decisions faster and more accurately.
4
Motivation Weka A collection of machine learning algorithms for solving real-world data mining problems. Provides data mining functions (e.g., regression, association rules, and clustering algorithms). Limitation: operates on traditional non-spatial databases.
5
Problem Statement
Input: Minneapolis/St. Paul traffic data set. Output: detected outliers as plain text (time slot, time, station, Zs(x)), overall traffic volume, and a neighbor relationship graph between stations.
6
Problem Statement(cont.)
Constraints: the algorithm comes from the paper “A Unified Approach to Detecting Spatial Outliers”; the data set should be numeric. Objective: to find sets of spatial outliers and show the results visually.
7
Key Concepts Spatial outliers
Definition – spatially referenced objects whose non-spatial attribute values are significantly different from the values in their neighborhood. Example – a new house in an old neighborhood of a growing metropolitan area. In this project, an outlier is a station that has a high volume compared to its neighboring stations at a certain time slot.
8
Key Concepts (contd.) Algorithm
Proposed in the paper “A Unified Approach to Detecting Spatial Outliers” by S. Shekhar, C. T. Lu, and P. Zhang.
$S(x) = f(x) - E_{y \in N(x)}(f(y))$: the difference between $f(x)$, the attribute value of a sensor located at $x$, and $E_{y \in N(x)}(f(y))$, the average attribute value of $x$'s neighbors.
$Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$: the spatial statistic, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$, and $\theta$ is the z-score for a user-specified confidence interval.
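The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's Weka code: the function name `spatial_outliers`, the dictionary-based inputs, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

def spatial_outliers(values, neighbors, theta=2.0):
    """Return {location: z} for locations whose z-score exceeds theta.

    values:    {location: attribute value f(x)}
    neighbors: {location: list of neighboring locations N(x)}
    """
    # S(x) = f(x) - average of f over the neighbors of x
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu_s = mean(s.values())
    sigma_s = pstdev(s.values())  # assumption: population standard deviation
    if sigma_s == 0:
        return {}  # every S(x) is identical, so nothing stands out
    z = {x: abs(s[x] - mu_s) / sigma_s for x in s}
    return {x: zx for x, zx in z.items() if zx > theta}

# Hypothetical chain of 10 stations, all reading 5 except station 5.
values = {i: 5 for i in range(1, 11)}
values[5] = 100
neighbors = {i: [j for j in (i - 1, i + 1) if 1 <= j <= 10] for i in range(1, 11)}
flagged = spatial_outliers(values, neighbors)
print(sorted(flagged))
```

In this made-up chain only station 5 is flagged; its two neighbors also get a large negative S(x), but their z-scores stay below the threshold of 2.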
9
Key Concepts (contd.) Algorithm (example)
$S(x) = f(x) - E_y = 100 - (2+8)/2 = 95$, with $\mu_s = 0.22$ and $\sigma_s = 23.8$.
$Z_{S(x)} = |S(x) - \mu_s| / \sigma_s = |95 - 0.22| / 23.8 = 3.98$. The z-score for a 95% C.I. is 2, and 3.98 > 2, so 100 is an outlier. The outlier is replaced by $E_y$: 100 -> 5.
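The arithmetic of this example can be re-checked with a short Python snippet. The variable names are illustrative, and 0.22 and 23.8 are read here as the mean and standard deviation of S(x) quoted on the slide.

```python
f_x = 100                    # value at the station being tested
neighbor_values = [2, 8]     # values at its two neighbors
mu_s, sigma_s = 0.22, 23.8   # mean and std. dev. of S(x), as quoted on the slide
theta = 2                    # z-score used on the slide for a 95% C.I.

e_y = sum(neighbor_values) / len(neighbor_values)  # neighborhood average
s_x = f_x - e_y                                    # S(x)
z = abs(s_x - mu_s) / sigma_s                      # spatial statistic
is_outlier = z > theta
# The slide replaces a detected outlier by the neighborhood average.
replacement = e_y if is_outlier else f_x

print(round(z, 2), is_outlier, replacement)  # prints: 3.98 True 5.0
```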
10
Major Contributions
Top-k outlier query processing. User interface similar to Weka's UI. Visualization of outliers as plain text (time slot, time, station, Zs(x)), overall traffic volume, and a neighbor relationship graph between stations. Keeping user-specified results.
11
Major Contributions (contd.)
Top k outliers query processing Fig.1. Top 3 outliers from dataset N.dat
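One simple way to answer a top-k query is to rank the scored records by z-score and keep the k largest. The sketch below assumes records of the form (time slot, station, z); that layout and the name `top_k_outliers` are hypothetical and not taken from the project.

```python
import heapq

def top_k_outliers(records, k):
    """Return the k records with the largest z-score, highest first.

    records: iterable of (timeslot, station, z) tuples (assumed layout).
    """
    return heapq.nlargest(k, records, key=lambda rec: rec[2])

# Made-up scores, for illustration only.
records = [(1, 24, 3.98), (2, 24, 2.5), (1, 124, 4.2), (3, 7, 2.1)]
top3 = top_k_outliers(records, 3)
print(top3)
```

`heapq.nlargest` avoids sorting the whole list, which helps when k is small relative to the number of (time slot, station) pairs.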
12
Major Contributions (contd.)
User Interface: Weka-based, adding one more button to Weka. It uses the same framework but works independently. Simple and easy to use; satisfies the user interface properties (simple, user language, reduce memory, …). Users specify the confidence interval (68%, 95%, or 99%) and the number of outliers to find. Weka does not provide enough options for detecting spatial outliers, so a dedicated interface is needed. Fig. 2 User interface of the spatial outlier detection application vs. Weka
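The three confidence levels offered in the interface map to z-score thresholds. A minimal sketch of that mapping, assuming a two-sided interval on a standard normal distribution (the helper name `z_threshold` is hypothetical):

```python
from statistics import NormalDist

def z_threshold(confidence):
    """Two-sided z-score threshold for a confidence level such as 0.95."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

for c in (0.68, 0.95, 0.99):
    print(f"{c:.0%}: theta = {z_threshold(c):.2f}")
```

The thresholds come out at roughly 1, 2, and 2.6, which agrees with the rounded value of 2 used for the 95% interval elsewhere in the slides.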
13
Major Contributions (contd.)
Visualization of outliers: its benefits, and where this information can be applied. Fig. 3 Plain text results of detected outliers
14
Major Contributions (contd.)
Detected outliers. Visualization of outliers. Fig. 4 Overall traffic volume and neighbor relationship graph between stations
15
Major Contributions (contd.)
Visualization of outliers. Fig. 4 Overall traffic volume and neighbor relationship graph between stations
16
Major Contributions (contd.)
Keeping Results: users can save and print their user-specified results, including all text results and images (traffic volume, station relationships). Why is this function needed? Users can compare and contrast individual results using the saved information. Let’s go to the DEMO!
17
Validation Methodology
Experiments with three different data sets, listing the station at which most outliers were found:
Data set 1 : N.dat : station 24
Data set 2 :16 station 24
Data set 3 :125 station 124
Show station relationship: the station number chosen as one of the outliers works as a parameter for visualizing stations. This lets users easily see the neighbor relationship between stations. In other words, users can see why that station should be one of the outliers.
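Finding the station at which most outliers occur amounts to counting detected outliers per station. A small sketch under the assumption that outliers are (time slot, station, z) records; the record layout, the function name, and the sample data are all hypothetical.

```python
from collections import Counter

def station_with_most_outliers(outliers):
    """Return (station, count) for the station flagged most often.

    outliers: non-empty iterable of (timeslot, station, z) tuples (assumed layout).
    """
    counts = Counter(station for _, station, _ in outliers)
    return counts.most_common(1)[0]

# Made-up detections, for illustration only.
detections = [(1, 24, 3.1), (2, 24, 2.9), (2, 124, 4.0)]
busiest = station_with_most_outliers(detections)
print(busiest)
```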
18
Assumptions Data format is set
The original data consists of traffic volume and occupancy. Outlier detection is based on volume. Data format: @relation N @station 150 @timeslot 288 …. Users are familiar with statistical concepts (e.g., confidence interval, C.I.)
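The header fields shown above could be read with a small parser like the following. Only the three `@` fields quoted on the slide are handled; the rest of the file format is elided in the slides and is not guessed at here. The function name `parse_header` is hypothetical.

```python
def parse_header(text):
    """Collect '@key value' pairs from whitespace-separated header tokens."""
    tokens = text.split()
    header = {}
    for i, tok in enumerate(tokens[:-1]):
        # A key is a token starting with '@' followed by a non-'@' value.
        if tok.startswith("@") and not tokens[i + 1].startswith("@"):
            header[tok[1:]] = tokens[i + 1]
    return header

header = parse_header("@relation N @station 150 @timeslot 288")
stations = int(header["station"])     # number of stations
timeslots = int(header["timeslot"])   # number of time slots
print(header, stations, timeslots)
```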
19
Conclusion: added one more package to Weka to find sets of spatial outliers. Results are shown visually in a user interface similar to Weka's, through top-k outlier query processing, visualization of outliers, and the ability to keep user-specified results.
20
Future work
Upgrade to support various file formats and data types. Experiment with different outlier detection algorithms to find a more efficient one. Add more spatial data mining options, e.g., SAR (Spatial Auto Regression) and co-location.
21
Thanks!