A Method for the Comparison of Criminal Cases Using Digital Documents: A New Distance Measure
T.K. Cocx, tcocx@liacs.nl, 5/23/2019
Comparing Documents
  Data mining: the search for knowledge in large amounts of data.
  Data: digital documents found at the crime scene, or written by police to describe the crime scene.
  Knowledge: which crime labs may have been set up by the same group of criminals.
  Data mining tools:
    Text mining: extraction of entities from documents.
    Distance measure on the extracted output: document similarity.
    Visualization: clustering of documents on screen.
4-step paradigm
  [Diagram] Documents → text mining → Extraction table (Entity, Type, Amount, Investigation) → transformation → Coupled investigation table (Investigation 1, Investigation 2, In Common) → distance measure → Distance Matrix (example entries 0.92, 0.27, …, 0.51) → clustering → Visualization
Process characteristics
  Documents
    Contain potentially useful information, but are usually unstructured.
    Typing mistakes.
    Police reports: polluted with terminology.
  Text mining
    The process of extracting interesting and non-trivial information and knowledge from unstructured text.
    Entities: names, locations, cars (license plates), products, URLs, email addresses, IP addresses.
    Language bound.
Process characteristics
  Extraction table
    Primary key: entity and investigation.
    The table therefore stores all occurrences of entities, and their respective types, in the investigations:
      Wim, person, inv liacs
      Joost, person, inv liacs
      Sjoerd, person, inv liacs
      Sjoerd, person, inv mi
      Joost, person, inv lumc
Process characteristics
  Transformation
    Goal: compare investigations.
    Transform the extraction table into a table keyed on investigation.
    Each investigation row holds Boolean information about the occurrence of each particular entity.
    Number of fields = number of different entities.
    Could be clustered in n dimensions, but should be downscaled.
  Coupled investigation table
    Contains lower-dimensional data about pairs of investigations.
Process characteristics
  Distance measure
    Use the overlap in entity occurrences to constitute distance.
    More overlap means closer.
    The closer two investigations are, the more similar they are, and the more likely they are related to the same group of criminals.
    The distance lies between 0 and 1.
    Difference in size: supermarket vs. police investigation.
    Relative comparison.
  Distance matrix
    Contains all distances.
Process characteristics
  Visualization
    The distances cannot necessarily be represented exactly in 2 dimensions.
    Display the investigations as correctly as possible.
    Employs an iterative push and pull technique.
  Clustering
    Investigation comparison report, easily readable by a police analyst.
On to the details…
  “So, how does this all actually work?”
Text mining
  No large-scale domain-specific text miner is available.
  A police decision dictates the use of SPSS LexiQuest.
  No filtering on police terminology.
  Based on an English engine.
  Entities are classified wrongly approximately 78% of the time (incl. 68% classified as Unknown).
Table transformation
  Use simple SQL to transform the extraction table into a high-dimensional table. This table contains the occurrences per investigation:
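A minimal sketch of such a transformation, using SQLite from Python. The table name `extraction`, its column names, and the generated column aliases `e0`, `e1`, ... are illustrative assumptions; the rows are the example rows from the extraction table slide. The first query pivots the extraction table into one Boolean column per entity, the second derives a coupled table with the number of entities two investigations have in common.

```python
import sqlite3

# Example rows from the extraction-table slide: (entity, type, investigation).
rows = [
    ("Wim", "person", "liacs"),
    ("Joost", "person", "liacs"),
    ("Sjoerd", "person", "liacs"),
    ("Sjoerd", "person", "mi"),
    ("Joost", "person", "lumc"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extraction ("
    " entity TEXT, type TEXT, investigation TEXT,"
    " PRIMARY KEY (entity, investigation))"
)
conn.executemany("INSERT INTO extraction VALUES (?, ?, ?)", rows)

# One column per distinct entity (hypothetical aliases e0, e1, ...).
entities = [r[0] for r in conn.execute(
    "SELECT DISTINCT entity FROM extraction ORDER BY entity"
)]
columns = ", ".join(f"MAX(entity = ?) AS e{i}" for i in range(len(entities)))

# High-dimensional table: one row per investigation, 1/0 per entity.
wide = conn.execute(
    f"SELECT investigation, {columns} FROM extraction"
    " GROUP BY investigation ORDER BY investigation",
    entities,
).fetchall()

# Coupled table: self-join on entity, counting shared entities per pair.
coupled = conn.execute(
    "SELECT a.investigation, b.investigation, COUNT(*) AS in_common"
    " FROM extraction a JOIN extraction b"
    "   ON a.entity = b.entity AND a.investigation < b.investigation"
    " GROUP BY a.investigation, b.investigation"
    " ORDER BY a.investigation, b.investigation"
).fetchall()

print(entities)
print(wide)
print(coupled)
```

In this sketch, pairs of investigations without any shared entity simply do not appear in the coupled result; a real pipeline would have to decide how to treat them.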
Distance measure: supermarket
  Comparison of shopping behavior.
Comparison: shopping & crime
  Crime: a lack of information does not constitute dissimilarity.
  Incorporate size while taking this into account.
New distance measure
  Use the difference between the statistically expected number of common entities and the actual number of overlapping entities.
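The idea can be illustrated with a small sketch. It assumes, as a simplification that is not taken from the slides, that the entities of each investigation are drawn uniformly at random from a universe of `universe` distinct entities, so that the expected overlap of two investigations of sizes `size_a` and `size_b` is `size_a * size_b / universe`. The function names are hypothetical.

```python
def expected_overlap(size_a: int, size_b: int, universe: int) -> float:
    """Expected number of shared entities under the random-draw assumption."""
    return size_a * size_b / universe


def overlap_surplus(size_a: int, size_b: int, in_common: int, universe: int) -> float:
    """Actual overlap minus expected overlap.

    A large positive value means the two investigations share more
    entities than chance alone would explain.
    """
    return in_common - expected_overlap(size_a, size_b, universe)


# Two investigations with 20 and 50 entities in a universe of 1000 entities.
print(expected_overlap(20, 50, 1000))
print(overlap_surplus(20, 50, 6, 1000))
```

Because the expectation scales with both sizes, a small and a large investigation are compared relative to what their sizes would predict rather than on raw overlap.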
Problem: A
  What is the total universe of entities?
    The language: infeasible.
    The total number of distinct entities in the database.
    Invert the expected-value function to calculate an average universe size:
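One way to read the inversion, sketched under the assumption (not confirmed by the slides) that the expected overlap of two investigations is `size_a * size_b / universe`: solving for the universe gives `universe = size_a * size_b / in_common` for every pair with a non-empty overlap, and these per-pair estimates can then be averaged. The function name and input layout are hypothetical.

```python
def estimate_universe(pairs):
    """Average universe size implied by the observed overlaps.

    pairs: iterable of (size_a, size_b, in_common) tuples, one per pair
    of investigations. Pairs without overlap carry no estimate and are
    skipped.
    """
    estimates = [a * b / c for a, b, c in pairs if c > 0]
    if not estimates:
        raise ValueError("no overlapping pairs; universe size cannot be estimated")
    return sum(estimates) / len(estimates)


observed = [(20, 50, 1), (10, 30, 2), (5, 5, 0)]
print(estimate_universe(observed))
```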
Employ normal distribution
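The slide does not preserve how the normal distribution is applied, so the following is only one plausible sketch and not the author's formula. It assumes the overlap follows a hypergeometric distribution (entities drawn at random from a universe of size `universe`), approximates it with a normal distribution, and maps the standardized deviation to a value between 0 and 1 through the normal CDF. The function name and the choice `1 - CDF` are assumptions.

```python
import math


def overlap_distance(size_a: int, size_b: int, in_common: int, universe: int) -> float:
    """Distance in [0, 1]: small when the overlap far exceeds expectation.

    Assumes universe > 1 and a hypergeometric model for the overlap.
    """
    p = size_a / universe
    mean = size_b * p
    variance = size_b * p * (1 - p) * (universe - size_b) / (universe - 1)
    if variance <= 0:
        return 0.5  # degenerate case: no spread, treat as "as expected"
    z = (in_common - mean) / math.sqrt(variance)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return 1.0 - cdf


print(overlap_distance(20, 50, 1, 1000))  # overlap equal to expectation
print(overlap_distance(20, 50, 6, 1000))  # much more overlap than expected
```

Under this sketch an overlap exactly at its expected value maps to the middle of the range, and larger overlaps move the distance toward 0.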
Distance function
Resulting graphs
Visualization
  Impossible to display the clustering perfectly in 2 dimensions.
  Approach the best possible fit:
    1. Place all investigations randomly in the X,Y plane.
    2. Calculate the pairwise error made in the placement.
    3. Correct pairwise through the push and pull technique.
    4. Repeat from 2 until the total error is at a (local) minimum.
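The steps above can be sketched as follows. This is a minimal illustration, not the presented implementation: the correction rule (each pair moves along its connecting line by a fraction `rate` of its placement error), the stopping tolerance, and all names are assumptions.

```python
import math
import random


def total_error(dist, pos):
    """Sum over all pairs of |placed distance - target distance|."""
    n = len(dist)
    return sum(
        abs(math.dist(pos[i], pos[j]) - dist[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def push_pull_layout(dist, rate=0.1, max_sweeps=1000, tol=1e-9, seed=0):
    """Place n items in the plane so pairwise distances approximate dist."""
    rng = random.Random(seed)
    n = len(dist)
    # Step 1: random placement in the X,Y plane.
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    previous = total_error(dist, pos)
    for _ in range(max_sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                current = math.hypot(dx, dy)
                if current == 0.0:
                    continue  # coincident points: no direction to move in
                # Step 2: pairwise error (> 0 too far apart, < 0 too close).
                error = current - dist[i][j]
                # Step 3: pull together or push apart, half the move each.
                shift = rate * error / (2 * current)
                pos[i][0] += shift * dx
                pos[i][1] += shift * dy
                pos[j][0] -= shift * dx
                pos[j][1] -= shift * dy
        # Step 4: stop once the total error no longer changes.
        now = total_error(dist, pos)
        if abs(previous - now) < tol:
            break
        previous = now
    return pos


# Three investigations whose target distances form a 0.3-0.4-0.5 triangle.
target = [
    [0.0, 0.3, 0.4],
    [0.3, 0.0, 0.5],
    [0.4, 0.5, 0.0],
]
layout = push_pull_layout(target)
print(layout)
print(total_error(target, layout))
```

For distance matrices that cannot be embedded exactly in the plane, the loop ends at a local minimum of the total error, so different random seeds may give different layouts.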
Push and Pull
Visualization Example
Results
  Universe A
  Universe Averaged
Future research
  A domain-specific or domain-trained text miner is necessary to improve results.
  Qualitative police feedback on the results.
  Incorporating this feedback in design decisions.
  Use the number of occurrences instead of Booleans.
  Select on type (omit Unknown).
  Choose between different universes.
Demonstration
Interrogation