
Data Mining on NIJ data Sangjik Lee

Unstructured Data Mining: two parallel pipelines feed a common mining stage.
Text → Keyword Extraction → Structured Data Base → Data Mining
Image → Feature Extraction → Structured Data Base → Data Mining

Handwritten CEDAR Letter

Document Level Features
1. Entropy
2. Gray-level threshold
3. Number of black pixels
4. Stroke width
5. Number of interior contours
6. Number of exterior contours
7. Number of vertical slope components
8. Number of horizontal slope components
9. Number of negative slope components
10. Number of positive slope components
11. Slant
12. Height
These features group into measures of pen pressure, writing movement, stroke formation, slant, and word proportion.
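Some of these features are straightforward to approximate from a binarized page image. The sketch below is purely illustrative (the function names and exact feature definitions are assumptions, not the original CEDAR code); it computes rough versions of the entropy, black-pixel, and stroke-width features:

```python
# Illustrative sketch only: rough approximations of a few document-level
# features, assuming the page is available as a NumPy array.
# `gray` is an 8-bit grayscale image; `binary` is 1 = ink, 0 = background.
import numpy as np

def entropy_of_gray(gray):
    """Shannon entropy of the gray-level histogram (feature 1)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256), density=True)
    hist = hist[hist > 0]
    return -np.sum(hist * np.log2(hist))

def num_black_pixels(binary):
    """Total number of ink pixels (feature 3)."""
    return int(binary.sum())

def mean_stroke_width(binary):
    """Crude stroke-width estimate (feature 4): average run length of
    consecutive ink pixels along each row."""
    runs = []
    for row in binary:
        run = 0
        for px in row:
            if px:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return float(np.mean(runs)) if runs else 0.0
```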

Character Level Features: the gradient direction at pixel (i, j) is computed as θ(i, j) = tan⁻¹( S_y(i, j) / S_x(i, j) ), where S_x and S_y are the horizontal and vertical gradient responses.
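A minimal sketch of this computation, assuming S_x and S_y are Sobel responses on the character image (the original feature extractor may use different operators or quantize the directions into bins):

```python
# Sketch: per-pixel gradient direction from Sobel responses.
# `img` is a 2-D float array holding the character image.
import numpy as np
from scipy import ndimage

def gradient_directions(img):
    s_x = ndimage.sobel(img, axis=1)   # horizontal derivative S_x(i, j)
    s_y = ndimage.sobel(img, axis=0)   # vertical derivative   S_y(i, j)
    theta = np.arctan2(s_y, s_x)       # gradient direction at each pixel
    return theta
```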

Character Level Features: Gradient (192), Structure (192), and Concavity (128), for 512 features in total.

Writer and Feature Data
Writer data: Gender (M/F), Age, Handedness (L/R), Education (H/C), Ethnicity (W/B/A/O/U/H), Schooling.
Feature data (normalized): dark (int), blob (int), hole (int), slant (real), width (int), skew (real), ht (int).

Instances of the Data (normalized): the 12 document-level features per writer (Entropy, dark pixel, blob, hole, hslope, nslope, pslope, vslope, slant, width, ht, ...), stored as a mix of real- and integer-valued columns.

Data Mining on sub-groups: White male, White female, Black female, Black male (sample data for each subgroup shown on the slide).

Data Mining on sub-groups (cont.). Subgroups themselves are useful information to mine.
1-constraint subgroups: {Male : Female}, {White : Black : Hispanic}, etc.
2-constraint subgroups: {Male-white : Female-white}, etc.
3-constraint subgroups: {Male-white-25~45 : Female-white-25~45}, etc.
The six writer attributes (Gender, Age, Handedness, Education, Ethnicity, Schooling) give a combinatorially large number of subgroups.

Subgroup lattice. Each category is abbreviated by one letter: Gender (G), Age (A), Handedness (H), Ethnicity (E), eDucation (D), Schooling (S). Candidate subgroups are enumerated level by level: 1 constraint (G, A, H, E, D, S), 2 constraints (GA, GH, AH, AE, AD, AS, HE, HD, HS, ED, ES, DS, GS, GD, GE), 3 constraints (GAE, GAD, GAH, GAS, GHE, GHD, GHS, GED, GES, GDS, AHE, ...), and so on up to GAHEDS. A candidate subgroup W is rejected if |W| < support.
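This level-wise enumeration with support pruning can be sketched as follows (a hypothetical implementation, not the one used in the project; writer records are assumed to be dicts keyed by the one-letter category codes):

```python
# Minimal sketch of level-wise subgroup enumeration with support pruning.
# A full implementation would also skip supersets of already-rejected subgroups.
from itertools import combinations

CATEGORIES = ["G", "A", "H", "E", "D", "S"]  # Gender, Age, Handedness, Ethnicity, eDucation, Schooling

def enumerate_subgroups(writers, min_support):
    """`writers` is a list of dicts mapping category letter -> value,
    e.g. {"G": "M", "A": "25~44", ...}. Returns supported subgroups."""
    supported = []
    for k in range(1, len(CATEGORIES) + 1):        # level k = number of constraints
        for cats in combinations(CATEGORIES, k):
            groups = {}                            # group writers by their values on the chosen categories
            for w in writers:
                key = tuple(w[c] for c in cats)
                groups.setdefault(key, []).append(w)
            for key, members in groups.items():
                if len(members) >= min_support:    # prune: |W| < support -> reject
                    supported.append((dict(zip(cats, key)), len(members)))
    return supported
```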

Database: writer data, raw feature data, and normalized feature data (displayed on the slide as a color-scaled matrix).

Feature Database (White and Black): writers broken down by age group (12~24, 25~44, 45~64, >= 65), sex (Female / Male), and ethnicity (white / black).

What to do 1: Feature Selection, the process of choosing an optimal subset of features according to a certain criterion (Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda). Since there is a limited number of writers in each sub-group, a reduced subset of features is needed:
- to improve performance (speed of learning, predictive accuracy, or simplicity of rules),
- to visualize the data for model selection,
- to reduce dimensionality and remove noise.

Feature Selection example: scatter plots of selected feature pairs (features 1 vs 2, 2 vs 3, 6 vs 10, 9 vs 10). Knowing that some features are highly correlated with others helps in removing redundant features.
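One simple way to spot such redundancy is to threshold the pairwise correlation matrix. A sketch, assuming `X` is the (writers x 12) matrix of normalized document-level features and using an arbitrary 0.9 cutoff:

```python
# Illustrative sketch: flag highly correlated feature pairs as candidates
# for removal. The 0.9 threshold is arbitrary.
import numpy as np

def redundant_pairs(X, threshold=0.9):
    corr = np.corrcoef(X, rowvar=False)   # 12 x 12 feature correlation matrix
    pairs = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                pairs.append((i, j, float(corr[i, j])))
    return pairs
```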

What to do 2: Visualization of the trend (if any) across writer sub-groups. Visualization is a useful tool for quickly obtaining an overall structural view of a sub-group's trend. Seeing is believing!

Implementation of Subgroup Analysis on NIJ Data. Data preparation: from the writer data, find a subgroup that has enough support, pair it with the corresponding feature data, and train a subgroup classifier. Task: which writer subgroup is more distinguishable than the others (if any)?

The Result of Subgroup Classification. Procedure for writer subgroup analysis:
- Find a subgroup that has enough support.
- Choose 'the other' (complement) group.
- Make data sets (4) for an artificial neural network.
- Train the ANN and get the results from two test sets.
Limits:
- 3 categories are used (gender, ethnicity, and age).
- Up to 2 constraints are considered.
- Only document-level features are used.

Subgroup Classifier: a handwritten document ("This is a test. This is a sample writing for document 1 written by an author a.") goes through feature extraction; its feature-space representation (dark, blob, hole, slant, ..., height) is fed to an artificial neural network (11-6-1), which answers the question "Which group does the writer belong to?"
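A minimal sketch of such an 11-6-1 classifier using scikit-learn's MLPClassifier (an illustration only, not the original implementation; the original procedure builds four data sets and evaluates on two test sets, whereas the split below is a single hold-out):

```python
# Sketch: train a small 11-6-1 ANN to separate a subgroup from its complement.
# X: (n_writers x 11) document-level features; y: 1 = target subgroup, 0 = complement.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def train_subgroup_classifier(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    ann = MLPClassifier(hidden_layer_sizes=(6,),   # one hidden layer of 6 units
                        activation="logistic",
                        max_iter=2000,
                        random_state=0)
    ann.fit(X_train, y_train)
    return ann, ann.score(X_test, y_test)          # test-set accuracy
```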

The Result of Subgroup Classification (results shown on the slide).

They're distinguishable, but why? We need to explain why they are distinguishable. The ANN does a good job, but it cannot clearly explain its output, and 12 features are too many to explain and visualize; only 2 (or 3) dimensions can be visualized. Question: does a reasonable two- or three-dimensional representation of the data exist that may be analyzed visually? Reference: Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda.

Feature Extraction. A common characteristic of feature extraction methods is that they all produce new features y based on the original features x. After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision tree building, can be conveniently applied. Feature extraction started as early as the 60s and 70s as the problem of finding the intrinsic dimensionality of a data set: the minimum number of independent features required to generate the instances.

Visualization Perspective. Data of high dimensionality cannot be analyzed visually, so it is often necessary to reduce its dimensionality in order to visualize the data. The most popular method of determining topological dimensionality is the Karhunen-Loève (K-L) method, also called Principal Component Analysis (PCA), which is based on the eigenvalues of a covariance matrix R computed from the data.

Visualization Perspective (cont.). The M eigenvectors corresponding to the M largest eigenvalues of R define a linear transformation from the N-dimensional space to an M-dimensional space in which the features are uncorrelated. This property follows from the fact that R is symmetric, so eigenvectors associated with distinct eigenvalues are orthogonal (and hence linearly independent). For the purpose of visualization, one may take the M features corresponding to the M largest eigenvalues of R.

Applied to the NIJ data:
1. Normalize each feature's values into the range [0, 1].
2. Obtain the correlation matrix for the 12 original features.
3. Find the eigenvalues of the correlation matrix.
4. Select the two largest eigenvalues.
5. Output the eigenvectors associated with the chosen eigenvalues; this gives a 12 x 2 transformation matrix M.
6. Transform the normalized data D_old into data D_new of extracted features as D_new = D_old M.
The resulting data is 2-dimensional, with the original class label attached to each instance.
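The six steps can be sketched in NumPy as follows (a minimal sketch with illustrative variable names; it is not the project's original code):

```python
# Sketch of steps 1-6: normalize, take the correlation matrix of the 12
# features, keep the two eigenvectors with the largest eigenvalues, project.
# `D_old` is an (n_writers x 12) array of raw document-level features.
import numpy as np

def project_to_2d(D_old):
    # 1. normalize each feature into [0, 1]
    mins, maxs = D_old.min(axis=0), D_old.max(axis=0)
    D_norm = (D_old - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    # 2. correlation matrix of the 12 original features
    R = np.corrcoef(D_norm, rowvar=False)
    # 3.-4. eigen-decomposition; pick the two largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(R)        # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1][:2]
    # 5. 12 x 2 transformation matrix M
    M = eigvecs[:, order]
    # 6. transform the normalized data: D_new = D_norm @ M
    return D_norm @ M
```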

Applied to the NIJ data

Sample Iris data (the original is 4-dimensional)