Access Pattern Analysis, Ideas and Alternative Approaches Pradeep Mohan Crimestat: Performance Tuning.



Outline
Overview - Crimestat
Motivation
Background
–Datasets Description
–Distance Calculation
–Types of Distance Requests
–Function Categories
Access Pattern Analysis
–K-Nearest Neighbor Analysis
–Hotspot Analysis
  K Means
  Hierarchical Clustering
–Access Patterns – Other Modules
–Journey To Crime
  Journey To Crime
  Crime Travel Demand
Problem Definition
Proposed Approach
–Voronoi Diagram
–TAZ Approximation
–K Order Nearest Neighbor
–NNH
–Network Assignment
–Spatial Indexing
Challenges
References

Overview - Crimestat
A multi-threaded Windows application for crime mapping and analysis.
Main modules of interest
–Distance Analysis (K-Nearest Neighbor Analysis)
–Hotspot Analysis (Nearest Neighbor Hierarchical Clustering and K Means)
–Journey To Crime (Bayesian Journey to Crime)
–Space Time Analysis (Knox Index)
–Crime Travel Demand (Network Assignment)
Datasets
–Multiple point datasets (Ex. Criminal Arrest Record and Crime Incidence Record, each with location and time)
–Traffic Analysis Zones and Road Network datasets (for Journey to Crime and Crime Travel Demand)
Different distance metrics are computed between different point sets (Ex. Euclidean, Spherical, Manhattan and Network).

Motivation
How are crime incidents clustered together? (Hot Spot Analysis)
What are the predicted trips of a serial offender? (Journey to Crime Estimation)
Courtesy: ESRI.

Background - Datasets Description
(Figure: centroid of a TAZ.) Courtesy: Ned Levine and Associates

Background - Datasets Description

Background - Datasets Description
(Figure: primary point set.) Courtesy: Ned Levine and Associates

Background - Datasets Description
(Figure: secondary point set.) Courtesy: Ned Levine and Associates

Background – Distance Calculation
1. A set of N x K points in Euclidean space (or on a network).
2. Distance between all pairs of points. (What is the problem?)
3. Normal computation takes O(n^2). (Why is it hard?)
Distance Matrix (figure): rows p1, p2, p3, p4, p5, ..., pn; columns q1, q2, q3, q4, q5, ..., qk. A distance cell holds one of:
1. Euclidean Distance
2. Manhattan Distance
3. Spherical Distance
4. Network Distance
Open questions:
1. Do all functions require these distances?
2. Can calculated distances be re-used?
3. How do we store them? (in a file)
4. How do we efficiently search for them?
5. Is there a single algorithm to calculate all these distances?
6. Can they be calculated on the fly?
Calculations in a single-threaded application.
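A minimal sketch of the distance matrix described on this slide, written in Python for illustration only (the slides do not show Crimestat's own code). The function names are hypothetical, points are assumed to be (x, y) tuples, and the spherical metric assumes (longitude, latitude) in degrees on a sphere of radius 6371 km. Network distance is left out because it needs a road network rather than coordinates alone.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def spherical(p, q, radius_km=6371.0):
    """Great-circle (haversine) distance; p and q are (lon, lat) in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))

def distance_matrix(P, Q, metric=euclidean):
    """One row per point of P, one column per point of Q: len(P) x len(Q) cells."""
    return [[metric(p, q) for q in Q] for p in P]
```

Every cell is computed whether or not a later function reads it, which is the quadratic cost the slide asks about.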

Background - Types of Distance Requests
–Given a distance d, find all points separated by no more than this distance (over the whole dataset).
–Given a point p, find its k-order nearest neighbors (Ex. p6 is 1st order and p2 is 2nd order).
–Given a zone Z (a polygon), find the set P of incident points that fall within it.
–Given a set of points P and another set Q, find all pairwise Euclidean distances.

Background – Function Categories
1. K-Nearest Neighbor Analysis based modules
–Nearest Neighbor Analysis (K order)
–Ripley’s K Statistic (simulation)
–Point to Point Allocation
–Point to Zone Allocation
2. Hot Spot Analysis based modules
–K Means Clustering
–Spatio Temporal Analysis of Crime
–Anselin’s Moran
–Nearest Neighbor Hierarchical Clustering
–Risk Adjusted Nearest Neighbor Hierarchical Clustering
3. Space Time Analysis
4. Crime Trip Estimation
5. Travel Demand Modeling

Access Pattern Analysis

K-Nearest Neighbor Analysis
Input:
–A set of N incident locations
–K (order of nearest neighbor)
Output:
1. 1st-order nearest neighbor for all N points
2. K-order nearest neighbor for all N points
Method:
1. For every pair of points (p_i, p_k) ∈ P, compute the distance.
2. For each point, quicksort all of its distances to get the nearest neighbor.
3. For K order, select the top K neighbors of the point.
4. Calculate statistics: mean random distance, Nearest Neighbor Index, etc.
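The method above can be sketched as a brute-force routine. This is an illustrative Python sketch, not Crimestat's code. The function names and the `area` parameter (the size of the study region) are assumptions. The Nearest Neighbor Index is taken here as the observed mean first-order distance divided by the usual expectation for a random pattern, 0.5 * sqrt(area / N).

```python
import math

def k_order_neighbors(points, k):
    """For each point, its k nearest neighbors as sorted (distance, index) pairs."""
    result = []
    for i, p in enumerate(points):
        # Step 1: distance from p to every other point (N - 1 evaluations).
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        # Step 2: sort all distances of this point.
        dists.sort()
        # Step 3: keep the top k.
        result.append(dists[:k])
    return result

def nearest_neighbor_index(points, area):
    """Observed mean 1st-order distance over the mean random distance."""
    n = len(points)
    observed = sum(nbrs[0][0] for nbrs in k_order_neighbors(points, 1)) / n
    expected = 0.5 * math.sqrt(area / n)  # assumption: complete spatial randomness
    return observed / expected
```

Sorting every row costs O(N log N) per point on top of the O(N.N) distance evaluations, so the whole dataset is touched for every order K.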

Access Patterns
(Figure: a point with its K-order distances d1, d2, d3, d4, ..., dk.)
All K-order distances are calculated. Such computations are performed on the whole dataset. A distance matrix is calculated: O(n.n).

Assigning Point to Point, Point to Zone (polygon or grid)
Point to Point Assignment
Input: A set P of N points and another set Q of M points.
Output: An assignment of each point in P to a point of Q.
Method: Proximity calculation from every point in P to every point in Q.
Point to Polygon Assignment
Input: A set P of N points and another set Q of M polygons.
Output: An assignment of each point in P to a polygon of Q.
Method: Point-in-polygon test, run M x N times.
A pre-computed distance matrix is currently used for distances.
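Both assignments can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Crimestat implementation: the function names are hypothetical, polygons are assumed to be simple and given as ordered lists of (x, y) vertices, and the point-in-polygon test uses ray casting, which the slides do not specify.

```python
import math

def assign_to_nearest(P, Q):
    """Index of the nearest point of Q for each point of P (N x M distance checks)."""
    return [min(range(len(Q)), key=lambda j: math.dist(p, Q[j])) for p in P]

def point_in_polygon(pt, polygon):
    """Ray casting: count edge crossings of a horizontal ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_to_polygon(P, polygons):
    """Index of the first polygon containing each point, or None (up to N x M tests)."""
    return [next((j for j, poly in enumerate(polygons)
                  if point_in_polygon(p, poly)), None)
            for p in P]
```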

Access Pattern
(Figure: a primary point Pi and secondary points Si, joined by distances d1, d2, d3, d4, d5, d6.)
The distance from every Pi to every Si is calculated and ordered. A distance matrix is used.

(Figure: TAZ centroids and an incident point Pi in P.) Courtesy: Ned Levine and Associates

Ripley’s K Statistic
Input: Set of N points.
Output: L(t), a measure of second-order clustering.
Method:
–Draw a circle around each point and collect all points within the radius.
–Increase the radius and repeat the above operations.
–Repeat for 100 increments of the radius, up to the maximum distance.
A random point set is generated for each simulation, so the distance matrix needs to be recalculated.
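A rough Python sketch of the circle-counting loop, assuming the common definitions K(t) = (A / N^2) x (number of ordered pairs within distance t) and L(t) = sqrt(K(t) / pi) - t. The `area` parameter and the function name are assumptions, and no edge correction is applied, so this should be read as an illustration rather than as Crimestat's routine.

```python
import math

def ripley_l(points, area, radii):
    """L(t) for each radius t, without edge correction."""
    n = len(points)
    # All ordered pairwise distances, computed once and reused for every radius.
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(n) if i != j]
    values = []
    for t in radii:
        count = sum(1 for d in dists if d <= t)  # points inside each circle
        k_t = area * count / (n * n)
        values.append(math.sqrt(k_t / math.pi) - t)
    return values
```

The 100 increments on the slide would correspond to something like `radii = [max_distance * (i + 1) / 100 for i in range(100)]`. Each simulated point set needs its own `dists` list, which is where the repeated O(N.N) cost comes from.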

(Figure: a primary point with radii O1 and O2.)
O1 and O2 are radii of different order. K-order radii are computed for each point: O(N.K) for N points.
Distance calculation is O(N.N) (for every new simulation).

Hot Spot Analysis Modules

Mode and Fuzzy Mode
Input:
–A set of N points
–A radius R (Fuzzy Mode)
Output:
–Mode: frequency of incidents at each point
–Fuzzy Mode: frequency of incidents within a radius of a point
Method:
–Mode: a frequency count of the number of incidents at each point (for N points).
–Fuzzy Mode: a frequency count of incidents within a radius around a particular point.
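The two counts can be sketched as follows in Python. The function names are hypothetical and points are assumed to be hashable (x, y) tuples. In this sketch the fuzzy count includes the point itself, which is an assumption about the intended definition.

```python
import math
from collections import Counter

def mode_counts(points):
    """Mode: number of incidents recorded at each distinct location."""
    return Counter(points)

def fuzzy_mode_counts(points, radius):
    """Fuzzy Mode: for each point, incidents within `radius` of it (itself included)."""
    return [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]
```

The mode needs no distances at all, while the fuzzy mode as written compares every point with every other point.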

Access Patterns: K Means
Computing initial seeds
–Using the secondary set of points as seeds
–Overlaying a grid; the cell with the highest count is a seed.
The grid approach is expensive: O(gridsize x k), k: number of clusters.
If a grid size doesn’t produce a cluster, another is tried (worst case).
Distances from all “n” points to the k clusters are measured until convergence: O(k x n x iterations till convergence).
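The iteration cost can be seen in a plain Lloyd-style loop. This Python sketch assumes the seeds are already chosen (for example from the secondary point set) and is not Crimestat's K Means routine. The function name and the `max_iter` cap are assumptions.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Return (centers, labels) after iterating until assignments stop changing."""
    centers = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: n x k distance evaluations per iteration.
        new_labels = [min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels
```

No distance between two data points is ever needed here, only point-to-center distances, so a pre-computed point-to-point matrix does not help this module.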

Nearest Neighbor Hierarchical Clustering (NNH)
Input:
1. A set of N points (same file)
2. A search distance, d (random or fixed)
3. No. of simulations, k (order)
4. Min number of points per cluster
Output:
1. All k order clusters
Conditions:
–Distance between pairs of points < d
–Cluster size >= minimum number of points
Method:
1. Compute all-pair Euclidean distances.
2. Prune based on the distance threshold.
Computation saving: distances have been calculated already.
Example: N = 1349, fixed distance (dt) = 5 miles, Pair Count = (so many distances evaluated)
Courtesy: Ned Levine and Associates
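A simplified Python sketch of the first-order step, assuming that two points are linked when they lie within the search distance d and that linked points end up in the same group. The grouping uses union-find, which is a stand-in chosen for brevity and not the NNH routine itself. The function name is hypothetical.

```python
import math

def first_order_clusters(points, d, min_points):
    """Groups of point indices linked by pairs within distance d."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pair distances, pruned by the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= d:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups that meet the minimum cluster size.
    return sorted(g for g in groups.values() if len(g) >= min_points)
```

Higher-order clusters could be approximated by running the same function again on the centers of the groups it returns.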

Access Patterns

Risk Adjusted Nearest Neighbor Hierarchical Clustering (NNH)
–Uses a baseline variable (Ex. census blocks).
–Interpolate to a grid whose size is based on the primary file (say size N).
–Determine absolute densities (of the secondary file) as points per grid cell.
–Proceed as in NNH.
Calculating the grid parameters is O(N.N), where N is the primary point set size.

Spatio Temporal Analysis of Crime (STAC)
(Figure: primary points.)
–The area is divided into grids.
–Circles are drawn on the grids.
–Circles are pruned based on the number of points.
–Intersecting circles are merged.

Access Patterns: Other Modules
–Knox Index
–Mantel Index
The distance matrix is re-computed for each simulation (simulated point set).

Journey To Crime

Journey to Crime
Input:
–A list of incidents committed by a serial offender
–A travel decay function
–A reference grid
–Origin and destination of offenders (case 2)
Output:
–The origin of crime (home of offender)
–Crime trip (case 2)
Observation: the cost is O(number of grid cells x number of incidents).
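The grid-by-incident cost can be illustrated with a short Python sketch. It assumes the common formulation in which every grid cell is scored by summing the travel decay function over its distances to all incidents, and the highest-scoring cell is taken as the likely origin. The function names are hypothetical, and `decay` is any callable supplied by the caller.

```python
import math

def jtc_surface(grid_cells, incidents, decay):
    """Score per grid cell: sum of decay(distance) over all incidents."""
    return [sum(decay(math.dist(cell, inc)) for inc in incidents)
            for cell in grid_cells]

def likely_origin(grid_cells, incidents, decay):
    """Grid cell with the highest score."""
    scores = jtc_surface(grid_cells, incidents, decay)
    best = max(range(len(grid_cells)), key=scores.__getitem__)
    return grid_cells[best]
```

A possible call, with a purely illustrative truncated linear decay: `likely_origin(cells, incidents, lambda d: max(0.0, 3.0 - d))`. Incident-to-incident distances are never used, only cell-to-incident distances.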

Courtesy: Ned Levine and Associates

Crime Travel Demand
–Given the origin and destination of a criminal, generate trips.
–Assign origins and destinations to zones (Ex. TAZ, Census Blocks).
–Predict trips based on various demand models.
All data points of the primary and secondary files are accessed. Distance computations are O(N.N).

Network Assignment
Given:
–A set of crime trips
–A transportation network
Find: The actual route on the transportation network.
Constraints: Routes are weighted by both distance and time.
Computation: An expensive join between the Euclidean crime trips and the network, based on the constraints.
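One way to picture the route-finding part is a shortest-path search over the network. The Python sketch below uses Dijkstra's algorithm, which the slides do not name, and assumes trip origins and destinations have already been snapped to network nodes. The adjacency format (node mapped to a list of (neighbor, distance, time) tuples) and the weights `w_dist` and `w_time` are hypothetical.

```python
import heapq

def shortest_route(network, origin, destination, w_dist=1.0, w_time=0.0):
    """Return (route, cost) where edge cost = w_dist * distance + w_time * time."""
    best = {origin: 0.0}
    prev = {}
    heap = [(0.0, origin)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == destination:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, dist, time in network.get(node, []):
            new_cost = cost + w_dist * dist + w_time * time
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if destination not in best:
        return None, float("inf")
    # Walk the predecessor links back from the destination.
    route = [destination]
    while route[-1] != origin:
        route.append(prev[route[-1]])
    return route[::-1], best[destination]
```

Changing the two weights switches between a distance-based and a time-based route on the same network.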

Courtesy: Ned Levine and Associates

Problem Definition
Given:
–A set of N primary points
–A set of M secondary points
–TAZ or Census Block
–A transportation network
–User defined parameters
–A request for a particular task based on spatial proximity
Find: Proximity measure in terms of distance.
Objective:
–Define a suitable data structure for storage of input data.
–Define a suitable hierarchical index.
–Define a suitable join (and hierarchical join index) between different spatial sets.
–Define an appropriate storage representation on disk.
Constraints: Find all requested proximity measures at a lower cost of computation.

Related Work
–Naïve approach to distance calculation
–Lazy approach to distance calculation
–R-tree based index
–Hierarchical join index
–Hierarchical Voronoi based index
(Figure: program address space holding points p1 ... pn, and a distance file holding the matrix with rows p1, p2, p3, p4, p5, ..., pn and columns q1, q2, q3, q4, q5, ..., qn.)
R-tree index structure: useful only for Euclidean space. Networks also need to be stored and accessed separately. Costly joins need to be computed.
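The slide does not define the lazy approach. One possible reading, sketched here in Python as an assumption, is to compute a distance only when it is first requested and keep it for re-use, instead of filling the whole matrix up front. The class name and the `computed` counter are hypothetical.

```python
import math

class LazyDistances:
    """Distances between P[i] and Q[j], computed on first request and cached."""

    def __init__(self, P, Q, metric=math.dist):
        self.P, self.Q, self.metric = P, Q, metric
        self._cache = {}
        self.computed = 0  # how many cells were actually evaluated

    def get(self, i, j):
        key = (i, j)
        if key not in self._cache:
            self._cache[key] = self.metric(self.P[i], self.Q[j])
            self.computed += 1
        return self._cache[key]
```

A module that only touches a few cells pays for those cells alone, while a module that reads every cell still costs as much as the naïve matrix.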

Proposed Approach – Voronoi Diagram
1. Based on spatial proximity of points.
2. O(n log n) to calculate the diagram.
3. Distances of nearest neighbors are stored during construction.
4. A hierarchical index based on the Voronoi diagram is to be constructed.
5. Voronoi joins for Euclidean space and networks.

Proposed Approach – TAZ Approximation
Point to Point and Point to Polygon Assignment
(Figure: TAZ centroids and incident points in P.)

K-Order Nearest Neighbor Calculation
(Figure: a query point with neighbors p1, p2, p3, p4, p5, p6 and a zone Z; distances d1 <= d and d2 <= d lie within the search distance d, the others are >= d. Legend: all “k” order nearest neighbors; median center of a polygon (TAZ); incident point (belongs to set P).)
–For every edge in the Voronoi diagram, the sites split by the edge are known (from the computation of the diagram).
–There are O(N) edges (for N points).
–An edge traversal gives the distance for a pair of sites, which is stored.
–Points are kept in an ordered bucket.
–For each point, quicksort all of its distances.
–Also store the neighbor connected by each distance.

Proposed Approach – NNH
–Use of an existing algorithm, AMOEBA, by Estivill-Castro and Lee [2].
–O(n log n), where n is the number of points.
–Makes use of Delaunay triangulation.

Proposed Approach - Network Assignment
–Use of a Network Voronoi Diagram, Graf and Winter [1].
–Computation of the Voronoi diagram of the Euclidean crime trips (origin and destination points).
–Computing a join between the two.
–These computations might be expensive.
–Euclidean Voronoi diagrams have been calculated already and stored.

Proposed Approach - Spatial Indexing
–Hierarchical Voronoi Indexing [3]
–Efficient paging mechanism
–Spatial proximity very useful
–Extensible even to networks

Challenges
–When to compute the Voronoi diagrams? Repeated computations might be costly.
–Need to store the computed distances, Voronoi partitions and intermediate results in a file.
–Need for a hierarchical index: efficient access to points and Voronoi seeds.

References
1. Graf, M., and Winter, S. (2003). Netzwerk-Voronoi-Diagramme (Network Voronoi Diagrams, English translation). Österreichische Zeitschrift für Vermessung und Geoinformation, 91(3).
2. Estivill-Castro, V., and Lee, I. (2000). AMOEBA: Hierarchical Clustering Based on Spatial Proximity Using Delaunay Diagram. Proceedings of the 9th International Symposium on Spatial Data Handling (SDH2000), Beijing, China.
3. Gold, C., and Angel, P. (2006). Voronoi Hierarchies. Lecture Notes in Computer Science, pp. 99–111, Springer-Verlag Berlin Heidelberg.

Thank You