1 Should SDBMS support the Join Index?: A Case study from CrimeStat
Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹
¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu
²Ned Levine and Associates, Houston, TX, Ned@nedlevine.com
³National Institute of Justice, Washington D.C., Ronald.Wilson@usdoj.gov

2 Outline
- Introduction
- Motivation
- Problem Statement
- Related Work
- Contributions
  - Self-Join Index
  - Experimental Evaluation
- Conclusion and Future Work

3 Motivation
Application Domains
- Crime Analysis: Where are the burglary hotspots?
- Epidemiology: Is cancer spatially clustered?
- Transportation: Which major highways require traffic calming measures?
An Example
- Query: Where are the burglary hotspots?

4 W-Matrix and W-Queries
W-Queries: queries that perform a repeated computation of the W-Matrix.
- K-Function
- Moran’s I
- Geary’s C
- G Statistic
- Hotspots
W-Matrix
- W_N: row-normalized W-Matrix
[Figure: Neighborhood Graph]
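As a rough illustration of a W-query, the sketch below row-normalizes a small W-matrix and evaluates global Moran's I with the standard textbook formula. The 4-node chain, the attribute values, and the function names are illustrative assumptions; none of them come from the slides or from CrimeStat.

```python
def row_normalize(w):
    """Return W_N: each row of w divided by its row sum (all-zero rows stay zero)."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def morans_i(values, w):
    """Global Moran's I = (n / S0) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Hypothetical 4-node chain N1 - N2 - N3 - N4 (assumed example data).
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
W_N = row_normalize(W)
crime_counts = [10, 12, 3, 1]
print(morans_i(crime_counts, W_N))
```

Every call to `morans_i` walks the full W-matrix, which is why queries of this kind benefit from having W available without recomputation.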

5 W-Operations
Notion of neighbors, successors and predecessors.
[Figure: neighborhood graph over nodes N1 to N7, shown as the input and as the output of each operation]

Category         Operation                                Example input
Neighbors        get-all-neighbors()                      get-all-neighbors(N2)
Successor(s)     get-all-successors()                     get-all-successors(N2)
Successor(s)     get-a-successor()                        get-a-successor(N2, Node-id)
Predecessor(s)   get-all-predecessors()                   get-all-predecessors(N2)
Predecessor(s)   get-a-predecessor()                      get-a-predecessor(N2, Node-id)
Composite        get-all-predecessors-of-a-successor()    get-all-predecessors-of-a-successor(N2, Node-id)
Composite        get-a-predecessor-of-a-successor()       get-a-predecessor-of-a-successor(N2, Node-id, Node-id)
Others           Delete()                                 Delete(N2, N1, N3)
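The operations above can be sketched on a directed neighborhood graph held as two adjacency maps. This is a minimal sketch under stated assumptions: the class name, the edge set, the reading of "neighbors" as successors plus predecessors, and the single-argument `delete` that removes a node with its incident edges are all hypothetical choices, since the slides do not define these semantics.

```python
from collections import defaultdict


class NeighborhoodGraph:
    """Directed neighborhood graph stored as successor and predecessor adjacency lists."""

    def __init__(self, edges):
        self.succ = defaultdict(set)
        self.pred = defaultdict(set)
        for src, dst in edges:
            self.succ[src].add(dst)
            self.pred[dst].add(src)

    def get_all_successors(self, node):
        return sorted(self.succ.get(node, ()))

    def get_all_predecessors(self, node):
        return sorted(self.pred.get(node, ()))

    def get_all_neighbors(self, node):
        # Assumption: a neighbor is any successor or predecessor of the node.
        return sorted(self.succ.get(node, set()) | self.pred.get(node, set()))

    def get_all_predecessors_of_a_successor(self, node, succ_id):
        if succ_id not in self.succ.get(node, ()):
            raise KeyError(f"{succ_id} is not a successor of {node}")
        return self.get_all_predecessors(succ_id)

    def delete(self, node):
        # Assumption: deleting a node also removes every edge that touches it.
        for s in self.succ.pop(node, set()):
            self.pred[s].discard(node)
        for p in self.pred.pop(node, set()):
            self.succ[p].discard(node)


# Hypothetical edges; the actual graph on the slide is not recoverable.
g = NeighborhoodGraph([("N2", "N1"), ("N2", "N3"), ("N4", "N2"), ("N5", "N3")])
print(g.get_all_successors("N2"))
print(g.get_all_predecessors_of_a_successor("N2", "N3"))
```

Keeping both maps means a predecessor lookup costs the same as a successor lookup, at the price of updating two structures on every delete.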

6 W-Query Processing Algorithms
Algorithm CalcRipleyK
- get-all-neighbors(N)
- Frequency ← Size(get-all-successors(N))
Algorithm Hotspots_JI
- Stage 1: Hotspot Identification
  - Identify a Seed.
  - get-all-neighbors(Seed)
  - get-all-successors(Seed)
- Stage 2: Hotspot Refinement
  - P ← get-a-predecessor-of-a-successor(Seed, succ-id)
  - If P correlates better with the Successor than with the Seed, remove the Successor from the successor list.
- Stage 3: Update Remaining Nodes
  - For each S in Hotspot: Delete(S)
[Figure: algorithm input and output; Ripley's K plotted against distance (miles), with the complete spatial randomness curve for reference]
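To make the CalcRipleyK steps concrete, here is a brute-force sketch of Ripley's K using one common estimator without edge correction, K(d) = (A / n²) × (number of ordered pairs within distance d). The point coordinates, the study area, and the function name are assumptions made for illustration; the slides do not give the exact estimator used.

```python
import math


def ripley_k(points, distances, area):
    """Uncorrected Ripley's K for each distance d: (area / n^2) * #{(i, j), i != j, d_ij <= d}."""
    n = len(points)
    result = {}
    for d in distances:
        count = 0
        for i, (xi, yi) in enumerate(points):
            # This inner scan plays the role of Size(get-all-successors(N))
            # for a neighborhood of radius d, recomputed on the fly.
            count += sum(
                1
                for j, (xj, yj) in enumerate(points)
                if i != j and math.hypot(xi - xj, yi - yj) <= d
            )
        result[d] = area * count / (n * n)
    return result


# Hypothetical data: four points in a study area of 16 square units.
points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(ripley_k(points, [1.0, 2.0], area=16.0))
```

The inner scan is repeated for every point and every distance. With a precomputed self-join index, that scan would become a lookup of the stored successor list, which is the saving the slides argue for.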

7 Problem Statement
Given:
- A spatial (crime) data warehouse.
- A set of W-Operations.
Find: A suitable spatial index type representation.
Objective:
- User response time is minimized.
Constraints:
- Dataset is updated infrequently.
- Concurrency control and recovery considerations are addressed separately.
[Figures: input data; output and W-Operations. Courtesy: Ned Levine and Associates]

8 Challenges
- Scalability to Large Datasets
Query: Where are the burglary hotspots?
Dataset size = 14852 crime reports
CrimeStat Libraries’ response time = 2 hrs 30 minutes

9 Related Work: Classification
SDBMS Tools

SDBMS Tool        Spatial Indices Supported   Spatial Self-Join Indices
CrimeStat         (none listed)               NO
Oracle Spatial    R-Tree, Quadtree            NO
SQL Server 2008   Grid files                  NO
PostGIS           R-Tree                      NO
ESRI ArcSDE       Grid files                  NO

Current R-Tree family index structures perform repeated on-the-fly W computation, which is computationally expensive.
Our Approach: Pre-computed W (Self-join)

10 Contributions
- Modeled W-Queries
  - Proposed a set of W-Operations
  - W-Query Processing Algorithms
- Self-join Index
  - Representation
  - Algebraic Cost Model: Operations
- Experimental Evaluation
  - Experimental Setup
  - User response time analysis

11 Self-Join Index: Representation
Key Observations
- W-Matrix: Neighborhood Graph or Self-join
- Classical Join Index: Edge List
- Which representation can localize neighbor, successor and predecessor information?
  - The adjacency list LOCALIZES successor, predecessor and row-normalized information.
  - The edge list SCATTERS these.
[Figure: W-Matrix ↔ Self-join; Neighborhood Graph; Edge List; Self-Join Adjacency List Index]
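The contrast between the two representations can be sketched as follows. The distance-threshold neighbor relation, the dictionary layout, and the function names are assumptions for illustration; the slides do not specify the physical layout of the index.

```python
import math


def self_join_edge_list(points, radius):
    """Classical join-index form: a flat list of (i, j) pairs with dist(i, j) <= radius."""
    return [
        (i, j)
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= radius
    ]


def to_adjacency_index(edges, n):
    """Group edges per node so successors, predecessors and the row-normalized weight sit together."""
    index = {i: {"succ": [], "pred": [], "w_n": 0.0} for i in range(n)}
    for i, j in edges:
        index[i]["succ"].append(j)
        index[j]["pred"].append(i)
    for entry in index.values():
        if entry["succ"]:
            # Row-normalized weight shared by every successor of this node.
            entry["w_n"] = 1.0 / len(entry["succ"])
    return index


# Hypothetical points on a line, neighbor radius 2.0.
points = [(0, 0), (1, 0), (3, 0)]
edges = self_join_edge_list(points, radius=2.0)
index = to_adjacency_index(edges, len(points))
print(edges)
print(index[1])
```

In the edge list, finding the predecessors of a node requires scanning every pair; in the adjacency form it is a single keyed lookup. A distance threshold gives a symmetric relation, so successor and predecessor lists coincide here; they would differ under a directed relation such as k-nearest neighbors.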

12 Algebraic Cost Models: Overview
- Worst-case retrieval costs for W-Operations.
Notation:
a. $Z$: cost of accessing a single spatial instance from the Self-join Index.
b. $|S|$: average number of successors of a particular node.
c. $|P|$: average number of predecessors of a particular node.
d. $CRR$: Connectivity Residue Ratio (adapted from CCAM), the probability that a node or a spatial instance is found on a particular page.
e. $|S_R|$: number of instances satisfying the neighbor relation $R$.
f. $|S_D|$: total size of the spatial dataset.
g. $\rho$: selectivity for a neighbor query with a neighbor size $R$, $\rho = \frac{|S_R|}{|S_D| - 1} \times |S_D|$.

13 Algebraic Cost Model: Self-Join Index
Node retrieval cost $= Z =$ 1 lookup cost for a Join Index.
get-all-neighbors():
  = cost of selecting neighbors × probability that the instances that satisfy the neighbor relation are not in the same page
  Cost of selecting neighbors (for one data item) $= \frac{|S_R|}{|S_D| - 1} \cdot |S_D| \cdot Z$
  Probability that the instances that satisfy the neighbor relation are not in the same page $= (1 - CRR)$
  Total cost of Find() $= \rho \cdot Z \cdot (1 - CRR) \times |S_D|$
get-all-successors():
  = number of successors × probability that the successors are not in the same page × cost of retrieving them
  $= |S| \times (1 - CRR) \cdot Z$
get-all-predecessors():
  = number of predecessors × probability that the predecessors are not in the same page × cost of retrieving them
  $= |P| \cdot (1 - CRR) \cdot Z$

14 Algebraic Cost Model: Self-Join Index
get-all-predecessors-of-a-successor():
  = probability that a successor is not in the same page × probability that all the predecessors of that successor are not in the same page
  $= (|P| \cdot Z + 1) \cdot (1 - CRR)$
get-a-predecessor-of-a-successor():
  = probability that a successor is not in the same page × probability that the predecessor of that successor is not in the same page
  $= 2 \cdot Z \cdot (1 - CRR)$
Delete():
  = cost of retrieving all the successors
  $= Z \cdot (1 - CRR) \cdot |S|$
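The cost expressions from the two cost-model slides can be evaluated side by side with a small calculator. This is a sketch, not the authors' code: the function name, parameter names, and sample values are hypothetical, and the Delete() cost assumes the multiplicand is the average successor count |S|, as the description "cost of retrieving all the successors" suggests.

```python
def self_join_index_costs(z, crr, s_r, s_d, avg_succ, avg_pred):
    """Evaluate the worst-case retrieval cost of each W-Operation.

    z: cost Z of one lookup; crr: Connectivity Residue Ratio;
    s_r: |S_R|; s_d: |S_D|; avg_succ: |S|; avg_pred: |P|.
    """
    rho = (s_r / (s_d - 1)) * s_d  # selectivity as defined in the notation
    miss = 1.0 - crr               # probability of not being on the same page
    return {
        "get_all_neighbors": rho * z * miss * s_d,
        "get_all_successors": avg_succ * miss * z,
        "get_all_predecessors": avg_pred * miss * z,
        "get_all_predecessors_of_a_successor": (avg_pred * z + 1) * miss,
        "get_a_predecessor_of_a_successor": 2 * z * miss,
        # Assumption: the multiplicand of Delete() is |S|.
        "delete": z * miss * avg_succ,
    }


# Hypothetical parameter values, chosen only to exercise the formulas.
costs = self_join_index_costs(z=1.0, crr=0.75, s_r=10, s_d=101, avg_succ=8, avg_pred=8)
for operation, cost in costs.items():
    print(operation, round(cost, 3))
```

Every cost scales with (1 − CRR), so a clustering of the index pages that raises CRR lowers the cost of all operations at once.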

15 Experimental Evaluation: Experiment Setup
Experiment Goals: Compare candidates on response times.
Metric of Comparison: Response time
Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date)
Hardware: Intel Xeon 3.2 GHz, 4 GB RAM
Candidates
- CrimeStat Libraries
- R-Tree: Tree Matching
- Self-Join Index
[Figure: experiment setup with components Self-Join Index Generator, SJALI, W-Query Processing Algorithms (candidate algorithms CalcRipleyK, Hotspots_JI) and Response Time Analysis; parameters Dataset Size and Size of the Police Precincts]

16 Baltimore Auto-theft Dataset
Baltimore County auto thefts from Jan 1996 to Sept 1996: 14852 crime reports.
[Figure: map of crime reports. Courtesy: Ned Levine and Associates (www.nedlevine.com)]

17 Response Time Analysis: Comparison with R-Tree
Questions:
- How does the response time of the Ripley’s K function query vary with dataset size?
- How does the response time of the Hotspot Identification query vary with dataset size?
Fixed Parameters
- Hotspots: hotspot min-size threshold = 10 crime reports
- K Function: number of max-significance levels = 100
[Figure: Response time comparison for hotspot identification.]
[Figure: Response time comparison for K-Function computation.]
Overall Trend: Self-join Index vs. R-Tree: response time reduced by a factor of 2.

18 Response Time Analysis: Comparison with CrimeStat
Questions:
- How does the response time of the Ripley’s K function query vary with dataset size?
- How does the response time of the Hotspot Identification query vary with dataset size?
Fixed Parameters
- Hotspots: hotspot min-size threshold = 10 crime reports
- K Function: number of max-significance levels = 100
[Figure: Response time comparison for hotspot identification.]
[Figure: Response time comparison for K-Function computation.]
Overall Trend: Self-join Index vs. CrimeStat: response time reduced by a factor of 40.

19 Conclusions
- W-Queries are important in spatial statistics, e.g. crime analysis, public health, transportation.
- W-Queries can be modeled with a set of W-Operations.
- The self-join adjacency list index is more scalable than R-Tree and CrimeStat.
Future work
- Experimental Quantification
  - I/O costs of W-Query Processing Algorithms.
  - I/O Cost Models for W-Query Processing Algorithms.
- Further I/O Optimization
  - Extracting optimal page access sequences for processing W-Queries.
  - Optimizing the number of W-Query operations.
- Other W-Queries
  - Local Moran’s I, Local Getis-Ord.
- Larger datasets of >= 100,000 reports: will R-Tree be comparable?

20 Acknowledgment
- Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities.
- This work was supported by grants from NSF, USDOD and NIJ.
Thank you for your questions, comments and patience!
