A Method for the Comparison of Criminal Cases using digital documents

Presentation transcript:

A Method for the Comparison of Criminal Cases using digital documents
A New Distance Measure
T.K. Cocx, tcocx@liacs.nl 5/23/2019

Comparing Documents
Data mining: the search for knowledge in large amounts of data.
Data: digital documents found at the crime scene, or written by police to describe the crime scene.
Knowledge: which crime labs may have been set up by the same group of criminals.
Data mining tools:
  Text mining: extraction of entities from documents
  Distance measure on extracted output: document similarity
  Visualization: clustering of documents on screen

4-step paradigm
[Diagram: Documents -> (Text mining) -> Extraction table (Entity, Type, Amount, Investigation) -> (Transformation) -> Coupled investigation table (Investigation 1, Investigation 2, In Common) -> (Distance Measure) -> Distance Matrix -> (Visualization) -> Clustering]

Process characteristics
Documents
  Contain potentially useful information, but are usually unstructured and contain typing mistakes
  Police reports: polluted with police terminology
Text mining
  The process of extracting interesting and non-trivial information and knowledge from unstructured text
  Entities: names, locations, cars (license plates), products, URLs, e-mail addresses, IP addresses
  Language-bound

Process characteristics
Extraction table
  Primary key: entity and investigation, so the table stores all occurrences of entities and their respective types in the investigations
  Example rows (entity, type, investigation):
    Wim, person, inv liacs
    Joost, person, inv liacs
    Sjoerd, person, inv liacs
    Sjoerd, person, inv mi
    Joost, person, inv lumc

Process characteristics
Transformation
  Goal: compare investigations
  Transform the extraction table into a table with the investigation as primary key
  Each investigation holds Boolean information about the occurrence of each particular entity
  One field per distinct entity, so the number of fields equals the number of different entities
  Could be clustered in n dimensions, but should be scaled down
Coupled investigation table
  Contains lower-dimensional data about the investigations

Process characteristics
Distance measure
  Use the overlap in occurrences to constitute a distance
  More overlap -> closer
  The closer two investigations are, the more similar they are, and the more likely they are related to the same group of criminals
  Between 0 and 1
  Difference in size
    Supermarket vs. police investigation
    Relative comparison
Distance Matrix
  Contains all distances
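
The idea of an overlap-based distance between 0 and 1, collected in a distance matrix, can be illustrated with a simple stand-in. The sketch below uses the Jaccard distance, which is explicitly not the measure proposed in this talk; it only shows the shape of the computation. The function names and the entity sets are assumptions based on the extraction table example.

```python
from typing import Dict, List, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Stand-in distance: 0.0 for identical sets, 1.0 for no overlap."""
    union = a | b
    if not union:
        return 0.0  # two empty investigations are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(
    investigations: Dict[str, Set[str]]
) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted investigation names and the full distance matrix."""
    names = sorted(investigations)
    matrix = [
        [jaccard_distance(investigations[r], investigations[c]) for c in names]
        for r in names
    ]
    return names, matrix


# Entity sets taken from the extraction table example (assumed complete).
example = {
    "liacs": {"Wim", "Joost", "Sjoerd"},
    "mi": {"Sjoerd"},
    "lumc": {"Joost"},
}
names, matrix = distance_matrix(example)
```

A measure of this kind treats missing information as dissimilarity, which is the property the talk argues against for criminal investigations.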

Process characteristics
Visualization
  Distances are not necessarily defined in 2 dimensions
  Display the investigations as correctly as possible
  Employs an iterative push and pull technique
Clustering
  Investigation comparison report that is easily readable by a police analyst

On to the details…
“So, how does this all actually work?”

Text mining
  No large-scale domain-specific text miner is available
  Police decision dictates the use of SPSS LexiQuest
  No filtering on police terminology
  Based upon an English engine
  Entities are wrongly classified approximately 78% of the time (including 68% classified as Unknown)

Table transformation
Use simple SQL to transform the extraction table into a high-dimensional table. This table contains the occurrences per investigation.
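
As a rough illustration of this transformation, the sketch below loads the example rows from the extraction table slide into an in-memory SQLite database and pivots them into one Boolean column per entity. The table name `extraction`, the column names, and the choice of SQLite are assumptions made for this example; the slides do not show the SQL that was actually used.

```python
import sqlite3

# Example rows (entity, type, investigation) from the extraction table slide.
rows = [
    ("Wim", "person", "liacs"),
    ("Joost", "person", "liacs"),
    ("Sjoerd", "person", "liacs"),
    ("Sjoerd", "person", "mi"),
    ("Joost", "person", "lumc"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extraction ("
    "entity TEXT, type TEXT, investigation TEXT, "
    "PRIMARY KEY (entity, investigation))"
)
conn.executemany("INSERT INTO extraction VALUES (?, ?, ?)", rows)

# Every distinct entity becomes one dimension of the new table.
entities = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT entity FROM extraction ORDER BY entity"
    )
]

# MAX(entity = ?) is 1 if the entity occurs in the investigation, else 0.
columns = ", ".join("MAX(entity = ?)" for _ in entities)
query = (
    f"SELECT investigation, {columns} FROM extraction "
    "GROUP BY investigation ORDER BY investigation"
)

# Investigation-keyed table: investigation -> {entity: occurs?}
table = {
    row[0]: dict(zip(entities, map(bool, row[1:])))
    for row in conn.execute(query, entities)
}
```

Each row of `table` is one investigation, with as many Boolean fields as there are distinct entities.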

Distance Measure: supermarket
Comparison of shopping behavior

Comparison shopping & crime
  Crime: missing information does not constitute dissimilarity
  Incorporate the size of the investigations while taking this into account

New distance measure
Use the difference between the statistically expected number of common entities and the actual number of overlapping entities.
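
One possible reading of "statistically expected" is sketched below. It is an assumption, not taken from the slides: if two investigations with a and b distinct entities were drawn independently and at random from a universe of U entities, the expected number of shared entities is a * b / U. The function names and the universe size of 30 are hypothetical.

```python
from typing import Set


def expected_overlap(size_a: int, size_b: int, universe: int) -> float:
    """Expected number of common entities under random, independent draws."""
    if universe <= 0:
        raise ValueError("universe size must be positive")
    return size_a * size_b / universe


def overlap_surplus(a: Set[str], b: Set[str], universe: int) -> float:
    """Actual overlap minus expected overlap; larger means more related."""
    actual = len(a & b)
    return actual - expected_overlap(len(a), len(b), universe)


# Entity sets from the extraction table example; universe size is made up.
surplus = overlap_surplus({"Wim", "Joost", "Sjoerd"}, {"Sjoerd"}, universe=30)
```

A positive surplus means the two investigations share more entities than chance alone would suggest, regardless of how large or small each investigation is.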

Problem: A
What is the total universe A of entities?
  The whole language: infeasible
  Total number of distinct entities in the database
  Invert the expected value function to calculate the average universe size
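
The formula for this inversion did not survive in the transcript. As a hedged sketch: assuming the expected overlap of two investigations with a and b entities drawn at random from a universe of U entities is a * b / U, an observed overlap c > 0 gives the estimate U = a * b / c. The code below averages that estimate over all pairs that share at least one entity. The function name and the choice to skip pairs without overlap are assumptions.

```python
from itertools import combinations
from typing import Dict, Set


def average_universe(investigations: Dict[str, Set[str]]) -> float:
    """Average of a * b / c over all pairs with overlap c > 0."""
    estimates = []
    for a, b in combinations(investigations.values(), 2):
        common = len(a & b)
        if common:  # pairs without overlap give no finite estimate
            estimates.append(len(a) * len(b) / common)
    if not estimates:
        raise ValueError("no pair of investigations shares an entity")
    return sum(estimates) / len(estimates)


# Entity sets from the extraction table example.
universe = average_universe(
    {
        "liacs": {"Wim", "Joost", "Sjoerd"},
        "mi": {"Sjoerd"},
        "lumc": {"Joost"},
    }
)
```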

Employ normal distribution

Distance function

Resulting graphs

Visualization
It is impossible to display the clustering perfectly in 2 dimensions, so approach the best possible fit:
  1. Place all investigations randomly in the X,Y plane
  2. Calculate the pairwise error made in the placement
  3. Correct pairwise through the push and pull technique
  4. Repeat from 2 until the total error is at a (local) minimum
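
The steps above can be sketched as follows. This is a minimal illustration and not the implementation used in the talk: the learning rate, the stopping tolerance, the error definition (sum of absolute differences between placed and target distances), and all names are assumptions.

```python
import math
import random
from typing import List, Tuple


def total_error(pos: List[List[float]], target: List[List[float]]) -> float:
    """Sum over all pairs of |placed distance - target distance|."""
    n = len(pos)
    return sum(
        abs(math.dist(pos[i], pos[j]) - target[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def push_and_pull(
    target: List[List[float]],
    rate: float = 0.1,
    max_rounds: int = 1000,
    tol: float = 1e-9,
    seed: int = 0,
) -> Tuple[List[List[float]], float]:
    rng = random.Random(seed)
    n = len(target)
    # Step 1: random placement in the X,Y plane.
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    error = total_error(pos, target)
    for _ in range(max_rounds):
        cand = [p[:] for p in pos]
        for i in range(n):
            for j in range(i + 1, n):
                dx = cand[j][0] - cand[i][0]
                dy = cand[j][1] - cand[i][1]
                d = math.hypot(dx, dy)
                if d == 0.0:
                    continue  # coinciding points have no direction to move in
                # Steps 2 and 3: too far apart -> pull, too close -> push.
                step = rate * (d - target[i][j]) / d / 2
                cand[i][0] += step * dx
                cand[i][1] += step * dy
                cand[j][0] -= step * dx
                cand[j][1] -= step * dy
        new_error = total_error(cand, target)
        # Step 4: stop once a round no longer reduces the total error.
        if error - new_error <= tol:
            break
        pos, error = cand, new_error
    return pos, error


# Hypothetical distance matrix for three investigations.
positions, remaining = push_and_pull(
    [[0.0, 0.3, 0.9], [0.3, 0.0, 0.8], [0.9, 0.8, 0.0]]
)
```

Because the starting placement is random, different seeds can end in different local minima; running the sketch with several seeds and keeping the placement with the lowest error is one simple remedy.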

Push and Pull

Visualization Example

Results
[Figures: Universe A; Universe Averaged]

Future research
  A domain-specific or domain-trained text miner is necessary to improve results
  Qualitative police feedback on results
  Incorporating this feedback in design decisions
  Use the number of occurrences instead of Booleans
  Select on type (omit Unknown)
  Choose between different universes

Demonstration

Interrogation