On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions
CAMCOS Report Day, December 9th, 2015, San Jose State University
Project Theme: Classification

The Kaggle Competition
Kaggle is an international platform that hosts data prediction competitions, where students and experts in data science compete. Our CAMCOS team entered two competitions:
Team 1: Digit Recognizer (ends December 31st)
Team 2: Springleaf Marketing Response (ended October 19th)

Overview of This CAMCOS
Team 1: Wilson A. Florero-Salinas, Carson Sprook, Dan Li, Abhirupa Sen. Problem: Given an image of a handwritten digit, determine which digit it is.
Team 2: Xiaoyan Chong, Minglu Ma, Yue Wang, Sha Li. Problem: Identify potential customers for direct marketing.
Project supervisor: Dr. Guangliang Chen

Presentation Outline (Team 1)
The Digit Recognition Problem
Classification in our Data Set
Data Preprocessing
Classification Algorithms
Summary

Classification in our data set
[Diagram: training points and training labels are fed into a training algorithm, which produces a model; the model is then applied to new points to produce predicted labels.]
Goal: use the training set to predict the labels of new data points.

Team 1: The MNIST data set
28x28 images of handwritten digits 0, 1, ..., 9, size-normalized and centered
60,000 images used for training, 10,000 used for testing
(MNIST is a subset of data collected by NIST, the US National Institute of Standards and Technology.)

Potential Applications
Banking: check deposits
Surveillance: license plates
Shipping: envelopes and packages

Initial Challenges and Solutions
Challenges: the data set is high dimensional (images are stored as 784x1 vectors), which is computationally expensive, and digits are written differently by different people (e.g., left-handed vs. right-handed writers).
Solutions: preprocess the data set. Reduce the dimension to increase computation speed, and apply transformations that enhance features important for classification.

Data Preprocessing Methods
In our experiments we used the following methods:
Deskewing (see the sketch below)
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA), 2D LDA
Nonparametric Discriminant Analysis (NDA)
Kernel PCA
t-Distributed Stochastic Neighbor Embedding (t-SNE), parametric t-SNE, kernel t-SNE
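
For illustration, here is a common moment-based deskewing recipe for 28x28 digit images: estimate the slant from the image's second moments and shear the image so its principal axis becomes vertical. The slides do not specify which deskewing procedure the team used, so treat this only as a sketch.

```python
import numpy as np
from scipy.ndimage import affine_transform

def deskew(img):
    """Shear a 28x28 digit so its principal axis is vertical (moment-based sketch)."""
    total = img.sum()
    if total == 0:
        return img
    rows, cols = np.indices(img.shape)
    cy = (rows * img).sum() / total                 # center of mass (row)
    cx = (cols * img).sum() / total                 # center of mass (column)
    mu11 = ((rows - cy) * (cols - cx) * img).sum() / total
    mu02 = ((rows - cy) ** 2 * img).sum() / total
    alpha = mu11 / mu02                             # slant: column shift per row
    # sample the input at (r, c + alpha*(r - cy)) for each output pixel (r, c)
    matrix = np.array([[1.0, 0.0], [alpha, 1.0]])
    offset = np.array([0.0, -alpha * cy])
    return affine_transform(img, matrix, offset=offset, order=1)

# usage (X_train rows assumed to be flattened 28x28 images):
# X_deskewed = np.array([deskew(x.reshape(28, 28)).ravel() for x in X_train])
```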

Principal Component Analysis (PCA)
Using all 784 dimensions can be computationally expensive. PCA uses variance as its dimensionality-reduction criterion: it keeps the directions of highest variance and throws away the directions with the lowest variance.
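
A minimal scikit-learn sketch of PCA-based dimension reduction; the placeholder data and the choice of 55 components (the best value reported later in the summary) are ours, not prescribed by the slides.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((1000, 784))   # placeholder standing in for flattened MNIST images
X_test = rng.random((200, 784))

pca = PCA(n_components=55)                 # keep the 55 highest-variance directions
X_train_red = pca.fit_transform(X_train)   # fit the basis on training data only
X_test_red = pca.transform(X_test)         # project test data onto the same basis
print(X_train_red.shape, pca.explained_variance_ratio_.sum())
```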

Linear Discriminant Analysis (LDA)
Reduce dimensionality while preserving as much class-discriminatory information as possible.
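
A minimal sketch of LDA as a supervised dimension reducer in scikit-learn; unlike PCA it uses the class labels, and for a 10-class problem it can produce at most 9 components. The data and names are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_train = rng.random((1000, 784))          # placeholder for flattened digit images
y_train = rng.integers(0, 10, size=1000)   # placeholder digit labels 0-9

lda = LinearDiscriminantAnalysis(n_components=9)   # at most (number of classes - 1)
X_train_lda = lda.fit_transform(X_train, y_train)  # supervised: uses the labels
print(X_train_lda.shape)
```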

Classification Methods
In our experiments we used the following methods:
Nearest neighbor methods (instance based)
Naive Bayes (NB), Maximum a Posteriori (MAP)
Logistic Regression
Support Vector Machines (linear classifiers)
Neural Networks
Random Forests
XGBoost

K nearest neighbors (kNN)
A new data point is assigned to the class that holds the majority among its k nearest training points.
[Figure: the same test point classified with k = 5 and k = 9. The test point lies closer to class 1, but the majority of its neighbors come from class 2, so kNN assigns it to class 2.]
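
A minimal scikit-learn sketch of kNN classification; the placeholder data and k = 5 mirror the illustration but are not the team's actual settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))           # e.g., PCA-reduced training images
y_train = rng.integers(0, 10, size=1000)   # digit labels
X_test = rng.random((5, 50))

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5
knn.fit(X_train, y_train)                  # kNN simply stores the training set
print(knn.predict(X_test))                 # majority vote among the 5 nearest neighbors
```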

K means
Situation 1: Data is well separated. Each class has a centroid (average), and the test point is predicted to belong to the class with the nearest centroid (here, class 3).
Situation 2: Data has non-convex clusters. The test point belongs to class 2 but lies closer to the centroid of class 1, so it is misclassified as class 1.

Solution: Local k means
For every class, local centroids are calculated from the training points of that class nearest to the test point, and the test point is assigned to the class whose local centroid is closest.
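
A minimal NumPy sketch of our reading of local k means: for each class, average the k training points of that class closest to the test point, then pick the class with the nearest local centroid. The function name and k value are illustrative.

```python
import numpy as np

def local_kmeans_predict(x, X_train, y_train, k=5):
    """Classify x by comparing it to a local centroid computed per class."""
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        dists = np.linalg.norm(Xc - x, axis=1)       # distances to class-c points
        nearest = Xc[np.argsort(dists)[:k]]          # k nearest points of class c
        centroid = nearest.mean(axis=0)              # local centroid for class c
        d = np.linalg.norm(x - centroid)
        if d < best_dist:
            best_class, best_dist = c, d
    return best_class

# tiny usage example with placeholder data
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.integers(0, 10, size=200)
print(local_kmeans_predict(rng.random(50), X_train, y_train))
```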

Support Vector Machines (SVM) classify new observations by constructing a linear decision boundary.

Support Vector Machines (SVM): the decision boundary is chosen to maximize the margin (the separation m) between the classes.

SVM with multiple classes
SVM is a binary classifier. What if there are more than two classes? Two methods: 1) One vs. Rest, 2) Pairs (one vs. one).
One vs. Rest: construct one SVM model for each class; each SVM separates one class from the rest.
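
A minimal scikit-learn sketch of one-vs-rest linear SVMs; the data and the regularization constant C are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))           # e.g., PCA-reduced digits
y_train = rng.integers(0, 10, size=1000)
X_test = rng.random((5, 50))

# LinearSVC trains one binary "class c vs. rest" model per class ("ovr")
clf = LinearSVC(multi_class="ovr", C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
scores = clf.decision_function(X_test)     # one score per class; the largest wins
print(scores.shape, clf.predict(X_test))
```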

Support Vector Machines (SVM): what if the data cannot be separated by a line? Kernel SVM maps the data to a higher-dimensional space, where separation may be easier.
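
A short sketch of an RBF (Gaussian) kernel SVM in scikit-learn. The conversion gamma = 1/(2*sigma^2) is the standard relation between an RBF bandwidth sigma and scikit-learn's gamma; the sigma value reuses the averaged value reported later in the talk, and C is purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))
y_train = rng.integers(0, 10, size=1000)

sigma = 3.9235                                # averaged sigma reported later in the talk
gamma = 1.0 / (2.0 * sigma ** 2)              # scikit-learn parameterizes the RBF by gamma
clf = SVC(kernel="rbf", gamma=gamma, C=10.0)  # C is illustrative
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```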

Combining PCA with SVM
Traditionally, PCA is applied globally to the entire data set. Our approach: apply PCA separately to each digit class. This extracts the patterns of each digit class, and we can use different parameters for each digit group.
[Diagram: the data are projected with a per-class PCA into 10 models; SVM1, ..., SVM10 each separate one digit ("0 or not 0", ..., "9 or not 9"), and a test point is assigned to the class whose SVM gives the largest positive distance from its boundary.]
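
A compact sketch of our reading of this pipeline: each digit class gets its own PCA basis, all training data are projected into that basis, a binary "this digit vs. rest" SVM is trained there, and a test point is labeled by the SVM with the largest decision value. The component counts and SVM settings are illustrative, not the team's actual parameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((1000, 784))
y_train = rng.integers(0, 10, size=1000)
X_test = rng.random((20, 784))

models = []
for c in range(10):
    pca = PCA(n_components=40).fit(X_train[y_train == c])  # basis from class c only
    Z = pca.transform(X_train)                             # project ALL training data
    svm = SVC(kernel="rbf", gamma="scale").fit(Z, (y_train == c).astype(int))
    models.append((pca, svm))

# classify each test point by the largest positive distance from the boundary
scores = np.column_stack([svm.decision_function(pca.transform(X_test))
                          for pca, svm in models])
print(scores.argmax(axis=1))
```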

Some Challenges for kernel SVM
It is not obvious what parameters to use when training multiple models. How do we obtain an approximate range of parameters for training? Within each class we compute a corresponding sigma, which gives a starting point for parameter selection.

Parameter selection for SVMs
Using kNN with k = 5, we estimate a sigma value for each class. With a different sigma on each model, the error is 1.25%; with the averaged sigma = 3.9235, the error is 1.2%.
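
The slides do not spell out how the per-class sigma is computed from kNN. One plausible reading, shown here purely as an assumption, is to take the average distance from each point of a class to its 5 nearest same-class neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def class_sigmas(X, y, k=5):
    """Assumed recipe: sigma for a class = mean distance to the k nearest
    same-class neighbors. This is a guess at the slide's procedure."""
    sigmas = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # k + 1 because each point's nearest neighbor is the point itself
        nn = NearestNeighbors(n_neighbors=k + 1).fit(Xc)
        dists, _ = nn.kneighbors(Xc)
        sigmas[c] = dists[:, 1:].mean()
    return sigmas

rng = np.random.default_rng(0)
X = rng.random((500, 50))
y = rng.integers(0, 10, size=500)
s = class_sigmas(X, y)
print(s, np.mean(list(s.values())))          # per-class sigmas and their average
```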

SVM + PCA results

Some misclassified digits (based on local PCA + SVM, deskewed data)
[Examples of misclassifications: 6 read as 0, 3 as 5, 8 as 2, 2 as 7, 7 as 3, 8 as 9, 7 as 2, 9 as 4, and 4 as 9 (twice).]

Neural Networks: Artificial Neuron
The artificial neuron is a type of binary classifier whose inputs are the components of a vector, usually the outputs of other neurons. In a biological neuron, electrical signals from other neurons accumulate on what is called the dendritic tree; if the charge reaches a critical level, the neuron sends a signal to the neurons connected to its far end. In place of electrical signals, the artificial neuron uses a weighted sum of its inputs: if the sum exceeds a certain threshold, the neuron outputs 1, and 0 otherwise. In order to optimize the weights, the output function must be smooth, so we replace the hard threshold with an S-shaped curve between 0 and 1 called the sigmoid function. The "sharper" the curve, the more closely the sigmoid approximates the strictly binary 0-1 output.
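
A minimal NumPy sketch of the artificial neuron just described; the weights, bias, and sharpness values are illustrative.

```python
import numpy as np

def sigmoid(z, sharpness=1.0):
    """Smooth 0-1 output; a larger sharpness approaches a hard 0/1 threshold."""
    return 1.0 / (1.0 + np.exp(-sharpness * z))

def neuron(x, w, b, sharpness=1.0):
    """Weighted sum of the inputs followed by the sigmoid activation."""
    return sigmoid(np.dot(w, x) + b, sharpness)

x = np.array([0.2, 0.7, 0.1])     # inputs (e.g., outputs of other neurons)
w = np.array([1.5, -2.0, 0.5])    # weights
print(neuron(x, w, b=0.3), neuron(x, w, b=0.3, sharpness=20.0))
```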

Neural Networks: Learning
More complicated classifiers can be constructed by creating a layered network of artificial neurons, where the inputs of each neuron are the outputs of all the neurons in the previous layer. A neural network consists of a layer of input neurons, a layer of output neurons, and middle layers called hidden layers; the hidden layers are in charge of recognizing different features of the data. The pixels are fed through the input layer, and the image is classified according to the output neuron with the highest value. The values of the output neurons are controlled by manipulating the weights and thresholds of the network, and training the model means optimizing the weights and thresholds so as to maximize the number of correct classifications on the training set. A scaled gradient algorithm is typically used for this.
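
A small sketch of training such a network with scikit-learn; the hidden-layer size, activation, and iteration count are placeholders, not the settings the team used.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.random((1000, 784))          # flattened 28x28 images (placeholder)
y_train = rng.integers(0, 10, size=1000)   # digit labels

# one hidden layer of 100 neurons; "logistic" = sigmoid activation
net = MLPClassifier(hidden_layer_sizes=(100,), activation="logistic",
                    max_iter=50, random_state=0)
net.fit(X_train, y_train)
# each image is assigned to the output neuron (class) with the highest value
print(net.predict(X_train[:5]))
```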

Neural Networks: Results
Classification rule for ensembles: majority voting. Training a neural network requires an initial guess for the weights and thresholds. These are chosen randomly, so results vary between different networks. We therefore train an ensemble of many models and combine their outputs, which gives a lower error rate than any individual model. We combine them by simple majority voting: each digit is classified according to what the majority of the models say it is.
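
A minimal sketch of majority voting over an ensemble of networks that differ only in their random initialization; the ensemble size and network settings are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.random((500, 784))
y_train = rng.integers(0, 10, size=500)
X_test = rng.random((10, 784))

# train several networks, varying only the random initialization
nets = [MLPClassifier(hidden_layer_sizes=(100,), max_iter=50, random_state=seed)
        .fit(X_train, y_train) for seed in range(5)]

votes = np.array([net.predict(X_test) for net in nets])   # shape (5 models, 10 points)
majority = np.array([np.bincount(votes[:, i]).argmax()    # most common label per point
                     for i in range(votes.shape[1])])
print(majority)
```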

Summary and Results
Linear methods are not sufficient, as the data are nonlinear; LDA did not work well for our data. Principal Component Analysis worked better than the other dimensionality reduction methods. The best results were obtained with between 50 and 200 principal components (55 being best). Deskewing improved results in general. The best classifier for this data set was SVM.

Results for MNIST

Questions?

Directions to Lunch

Choosing the optimum k
The optimum k should be chosen with cross validation: the data set is split into a training part and a held-out part, the algorithm is run on the held-out part with different k values, and the k that gives the fewest misclassifications is chosen.

Local PCA
For each class of digits, a basis is found by PCA, so local PCA has ten bases instead of one global basis. Each test data point is projected into each of these ten bases.
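
A short sketch of choosing k by cross validation with scikit-learn; the candidate k values and the 5-fold setting are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 50))
y = rng.integers(0, 10, size=500)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                      cv=5)                 # 5-fold cross validation
search.fit(X, y)
print(search.best_params_)                  # the k with the fewest misclassifications
```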

[Backup slide: comparison of the local variance, center, and local k-center approaches on "nice" parametric data versus "messy" non-parametric data.]