1 Should SDBMS support the Join Index?: A Case study from CrimeStat Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete.

Slides:

Advertisements

Similar presentations

Google News Personalization: Scalable Online Collaborative Filtering

Advertisements

Context-based object-class recognition and retrieval by generalized correlograms by J. Amores, N. Sebe and P. Radeva Discussion led by Qi An Duke University.

Evaluating “find a path” reachability queries P. Bouros 1, T. Dalamagas 2, S.Skiadopoulos 3, T. Sellis 1,2 1 National Technical University of Athens 2.

Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.

Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.

Spatial Dependency Modeling Using Spatial Auto-Regression Mete Celik 1,3, Baris M. Kazar 4, Shashi Shekhar 1,3, Daniel Boley 1, David J. Lilja 1,2 1 CSE.

School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.

Transitioning Experiences with Army Geo Spatial Center (AGC) Pradeep Mohan 4 th Year PhD Student 1.

Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi.

--Presented By Sudheer Chelluboina. Professor: Dr.Maggie Dunham.

Cascading Spatio-Temporal Pattern Discovery P. Mohan, S.Shekhar, J. Shine, J. Rogers CSci 8715 Presented by: Atanu Roy Akash Agrawal.

Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.

Spatial Outlier Detection and implementation in Weka Implemented by: Shan Huang Jisu Oh CSCI8715 Class Project, April Presented by Jisu.

An Incremental Refining Spatial Join Algorithm for Estimating Query Results in GIS Wan D. Bae, Shayma Alkobaisi, Scott T. Leutenegger Department of Computer.

A Unified Approach to Spatial Outliers Detection Chang-Tien Lu Spatial Database Lab Department of Computer Science University of Minnesota

Panelist: Shashi Shekhar McKnight Distinguished Uninversity Professor University of Minnesota Cyber-Infrastructure (CI) Panel,

Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.

Scalable Network Distance Browsing in Spatial Database Samet, H., Sankaranarayanan, J., and Alborzi H. Proceedings of the 2008 ACM SIGMOD international.

MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering.

Team 22 Rami Alghamdi & Ritika Jhangiani.  Baltimore County Police  Hot Spot Analysis (Frequent Crime locations)  K-Means Clustering Crime Location.

Query Planning for Searching Inter- Dependent Deep-Web Databases Fan Wang 1, Gagan Agrawal 1, Ruoming Jin 2 1 Department of Computer.

Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.

Roger ZimmermannCOMPSAC 2004, September 30 Spatial Data Query Support in Peer-to-Peer Systems Roger Zimmermann, Wei-Shinn Ku, and Haojun Wang Computer.

Data Mining Techniques

Chapter 1: Introduction to Spatial Databases 1.1 Overview 1.2 Application domains 1.3 Compare a SDBMS with a GIS 1.4 Categories of Users 1.5 An example.

Join-Queries between two Spatial Datasets Indexed by a Single R*-tree Join-Queries between two Spatial Datasets Indexed by a Single R*-tree Michael Vassilakopoulos.

Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Summary of Contributions Background: MapReduce and FREERIDE Wavelet.

Time-focused density-based clustering of trajectories of moving objects Margherita D’Auria Mirco Nanni Dino Pedreschi.

A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.

An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal.

Low-Power Gated Bus Synthesis for 3D IC via Rectilinear Shortest-Path Steiner Graph Chung-Kuan Cheng, Peng Du, Andrew B. Kahng, and Shih-Hung Weng UC San.

Mapping and analysis for public safety: An Overview.

©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.

Academic Year 2014 Spring. MODULE CC3005NI: Advanced Database Systems “QUERY OPTIMIZATION” Academic Year 2014 Spring.

PMLAB Finding Similar Image Quickly Using Object Shapes Heng Tao Shen Dept. of Computer Science National University of Singapore Presented by Chin-Yi Tsai.

ICPP 2012 Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring † *The Ohio State University.

EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-structured and Structured Data Cuoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong.

Multi-Criteria Routing in Pervasive Environment with Sensors Santhanakrishnan, G., Li, Q., Beaver, J., Chrysanthis, P.K., Amer, A. and Labrinidis, A Department.

Access Pattern Analysis, Ideas and Alternative Approaches Pradeep Mohan Crimestat: Performance Tuning.

Switch off your Mobiles Phones or Change Profile to Silent Mode.

Clustering Moving Objects in Spatial Networks Jidong Chen, Caifeng Lai, Xiaofeng Meng, Renmin University of China Jianliang Xu, and Haibo Hu Hong Kong.

RecBench: Benchmarks for Evaluating Performance of Recommender System Architectures Justin Levandoski Michael D. Ekstrand Michael J. Ludwig Ahmed Eldawy.

Tao Lin Chris Chu TPL-Aware Displacement- driven Detailed Placement Refinement with Coloring Constraints ISPD ‘15.

Leonardo Guerreiro Azevedo Geraldo Zimbrão Jano Moreira de Souza Approximate Query Processing in Spatial Databases Using Raster Signatures Federal University.

Templated Search over Relational Databases Date: 2015/01/15 Author: Anastasios Zouzias, Michail Vlachos, Vagelis Hristidis Source: ACM CIKM’14 Advisor:

United States Department of Justice Criminal Division Geographic Information Systems Staff Regional Crime Analysis Geographic Information System (RCAGIS)

August 30, 2004STDBM 2004 at Toronto Extracting Mobility Statistics from Indexed Spatio-Temporal Datasets Yoshiharu Ishikawa Yuichi Tsukamoto Hiroyuki.

An Approximate Nearest Neighbor Retrieval Scheme for Computationally Intensive Distance Measures Pratyush Bhatt MS by Research(CVIT)

Presentation Template KwangSoo Yang Florida Atlantic University College of Engineering & Computer Science.

Performance and Energy Efficiency Evaluation of Big Data Systems Presented by Yingjie Shi Institute of Computing Technology, CAS

Comparison of Tarry’s Algorithm and Awerbuch’s Algorithm CS 6/73201 Advanced Operating System Presentation by: Sanjitkumar Patel.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

Copyright © 2001, SAS Institute Inc. All rights reserved. Data Mining Methods: Applications, Problems and Opportunities in the Public Sector John Stultz,

Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,

Graph Indexing From managing and mining graph data.

1 Using Network Coding for Dependent Data Broadcasting in a Mobile Environment Chung-Hua Chu, De-Nian Yang and Ming-Syan Chen IEEE GLOBECOM 2007 Reporter.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.

1 Double-Patterning Aware DSA Template Guided Cut Redistribution for Advanced 1-D Gridded Designs Zhi-Wen Lin and Yao-Wen Chang National Taiwan University.

Presented by: Mi Tian, Deepan Sanghavi, Dhaval Dholakia

CS522 Advanced database Systems

Distributed Network Traffic Feature Extraction for a Real-time IDS

Abolfazl Asudeh Azade Nazi Nan Zhang Gautam DaS

Storage and Indexes Chapter 8 & 9

On Efficient Graph Substructure Selection

External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.

Selected Topics: External Sorting, Join Algorithms, …

Jongik Kim1, Dong-Hoon Choi2, and Chen Li3

Donghui Zhang, Tian Xia Northeastern University

Presentation transcript:

1 Should SDBMS support the Join Index?: A Case study from CrimeStat Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹ ¹University of Minnesota, Twin-Cities, ²Ned Levine and Assoicates, Houston, TX, ³National Institute of Justice, Washington D.C,

2 Outline  Introduction  Motivation  Problem Statement  Related Work  Contributions  Conclusion and Future Work  Self-Join Index  Experimental Evaluation

3 Motivation Crime Analysis: Where are the burglary hotspots ? Epidemiology: Is Cancer Spatially Clustered ? Transportation: Which major highways require traffic calming measures ? Application Domains Query: Where are the Burglary hotspots ? An Example

4 W-Matrix and W-Queries K-Function Queries that perform a repeated computation of the W-Matrix : W-Queries. W-Matrix W-Queries Moran’s I Geary’s C G Statistic W N : Row Normalized W-Matrix Neighborhood Graph Hotspots

5 W-Operations N3 N1 N2 N6 N5 N7 N4 Notion of neighbors, successors and predecessors. Operations NeighborsSuccessor(s)Predecessor(s)CompositeOthers InputOperation Output get-all-neighbors() get-all-neighbors(N2) N3 N1 N2 N6 N5 N7 N4 get-all-successors() get-all-successors(N2) get-all-predecessors() get-a-successor(N2,Node-id)Delete(N2,N1,N3) get-all-predecessors-of-a- successor() get-a-predecessor-of-a- successor() get-a-successor() get-a-predecessor(N2,Node-id) get-all-predecessors(N2) get-a-predecessor() Delete() get-all-predecessors-of-a-successor(N2, Node-id) get-a-predecessor-of-a- successor(N2,Node-id,Node-id)

6 W-Query Processing Algorithms Algorithm CalcRipleyK  get-all-neighbors(N)  Frequency ← Size(get-all-successors(N)) Algorithm Hotspots_JI  Stage 1: Hotspot Identification  Identify a Seed.  get-all-neighbors(Seed)  get-all-successors(Seed)  Stage 2: Hotspot Refinement  P ← get-a-predecessor-of-a-successor(Seed,succ-id)  If P is Correlates better with the Successor than with the Seed.  Remove the Successor from successor list.  Stage 3: Update Remaining Nodes  For each, S in Hotspot  Delete(S) Input Output K 3000 – 2500 – 2000 – 1500 – 1000 – 500 – 0 – 5 – 10 – 15 – 20 – 30 – 40 – Distance (Miles) Complete Spatial Randomness

7 Problem Statement Given:  A spatial (crime) data warehouse.  A set of W- Operations. Find: A suitable spatial index type representation. Objective:  User response time is minimized. Constraints:  Dataset is updated infrequently.  Concurrency control and recovery considerations are addressed separately. Courtsey: Ned Levine and Associates Input Data Output & W Operations Courtsey: Ned Levine and Associates

8 Challenges  Scalability to Large Datasets Dataset Size = Crime Reports CrimeStat Libraries’ Response Time = 2Hrs 30 Minutes Query: Where are the Burglary hotspots ?

9 Related Work: Classification SDBMS ToolSpatial Indices SupportedSpatial Self-Join Indices CrimeStatNO Oracle spatialR Tree, Quad TreeNO SQL Server 2008Grid filesNO Post GISR TreeNO ESRI ArcSDEGrid FilesNO SDBMS Tools Current R Tree family index structures perform Repeated on-the-fly W computation. Computationally Expensive!! Our Approach: Pre-computed W ! (Self-join)

10 Contributions  Modeled W-Queries  Proposed a set of W-Operations  W-Query Processing Algorithms  Self-join Index  Representation  Algebraic Cost model: Operations  Experimental Evaluation  Experimental Setup  User Response time analysis

11 Self-Join Index: Representation Key Observations  Classical Join Index : Edge List  Which representation can localize neighbor, successor and predecessor information ? W-Matrix ↔ Self-joinNeighborhood Graph Self-Join Adjacency List Index Edge List Adjacency list LOCALIZES successor, predecessor and row normalized Information Edge List SCATTERS these.  W-Matrix : Neighborhood Graph or Self-join

12 Algebraic Cost Models Overview  Worst case retrieval costs for W -Operations. Notation: a.Let Z be the cost of accessing a single spatial instance from the Self-join Index b.|S|: Average number of successors of a particular node. c.|P|: Average number of predecessors of a particular node. d.Let CRR : Connectivity Residue Ratio (adapted from CCAM) be the probability that a node or a spatial instance is found on a particular page. e.|S R |: is the number of instances satisfying the Neighbor Relation R. f.|S D |: is the total size of the spatial dataset. g.ρ : selectivity for a neighbor query with a neighbor size, R, {|S R |/(|S D |-1)}X|S D |

13 Algebraic Cost Model: Self-Join Index Node Retrieval Cost = Z = 1 lookup cost for a Join Index get-all-neighbors(): = cost of selecting neighbors X probability that the instances that satisfy neighbor relation Cost of selecting neighbors (for one data item) = {|S R | / (|S D |-1)}.|S D |.Z probability that the instances that satisfy the neighbor relation are not in the same page = (1-CRR) Total Cost of Find() = {ρ. Z.(1-CRR) X |S D |} get-all-successors(): = number of successors X probability that the successors are not in the same page X cost of retrieving them = |S|X(1-CRR). Z get-all-predecessors(): = number of predecessors X probability that the predecessors are not in the same page X cost of retrieving them = |P|.(1-CRR). Z

14 Algebraic Cost Model: Self-Join Index get-all-predecessors-of-a-successor(): = probability that a successor is not in the same page X probability that all the predecessors of that successor are not in the same page = (|P|.Z+1).(1-CRR) get-a-predecessor-of-a-successor(): = probability that a successor is not in the same page X probability that all the predecessors of that successor are not in the same page = 2.Z. (1-CRR) Delete( ) : = cost of retrieving all the successors = Z.(1-CRR).| |

15 Experimental Evaluation: Experiment Setup Self-Join Index Generator Candidate Algorithms (CalcRipleyK, Hotspots_JI) Response time Analysis Size of the Police Precincts W Query Processing Algorithms Dataset Size SJALI  Experiment Goals: Compare candidates on response times. Metric of Comparison: Response time Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date) Hardware: Intel Xeon 3.2 Ghz, 4 GB RAM Candidates  CrimeStat Libraries  R-Tree: Tree Matching  Self-Join Index

16 Baltimore Auto-theft Dataset Crime Report Baltimore County Auto Thefts from Jan 1996 to Sept 1996: Crime Reports Courtsey: Ned Levine and Associates( )

17 Response Time Analysis: Comparison with R-Tree Response time comparison for hotspot identification. Response time comparison for K-Function computation. Questions: How does the response time of the Ripley’s K function Query vary with dataset size ? How does the response time of the Hotspot Identification Query vary with dataset size ? Fixed Parameters Hotspots Hotspot min-Size Threshold = 10 Crime Reports K Function # of max-significance levels = 100 Overall Trend: Self-join Index Vs R-Tree: Response time Reduced by a factor of 2.

18 Response Time Analysis: Comparison with CrimeStat Response time comparison for hotspot identification.Response time comparison for K-Function computation. Questions: How does the response time of the Ripley’s K function Query vary with dataset size ? How does the response time of the Hotspot Identification Query vary with dataset size ? Fixed Parameters Hotspots Hotspot min-Size Threshold = 10 Crime Reports K Function # of max-significance levels = 100 Overall Trend: Self-join Index Vs CrimeStat: Response time Reduced by a factor of 40.

19 Conclusions  W-Queries important in Spatial Statistics, e.g. Crime analysis, Public health, transportation.  W-Operations of W-Queries.  Self-join adjacency list index more scalable than R-Tree and CrimeStat. Future work  Experimental Quantification  I/O costs of W-Query Processing Algorithms.  I/O Cost Models for W-Query Processing Algorithms.  Further I/O Optimization  Extracting optimal page access sequences for processing W-Queries.  Optimizing the number of W-Query operations.  Other W-Queries  Local Moran’s I, Local Getis Ord.  Larger datasets of >=100000, will R-Tree be comparable ?

20 Acknowledgment  Members of the Spatial Database and Data Mining Research Group University of Minnesota, Twin-Cities.  This Work was supported by Grants from NSF, USDOD and NIJ. Thank You for your Questions, Comments and Patience!