SVM-based Deep Stacking Networks

Presentation transcript:

SVM-based Deep Stacking Networks
The Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, January 27–February 1, 2019
Jingyuan Wang, Kai Feng and Junjie Wu, Beihang University, Beijing, China

Motivation: How to build a deep network?
- Neural networks: proved to be powerful in numerous tasks, for example image classification, machine translation, and trajectory prediction.
- Based on other models: alternatives to neural networks are worth trying, for example PCANet (Chan et al. 2015), gcForest (Zhi-Hua Zhou 2017), and Deep Fisher Networks (Simonyan, Vedaldi, and Zisserman 2013).
An illustration of gcForest (Zhi-Hua Zhou 2017)

Related Works
Stacking ensemble
- Introduced by Wolpert (Wolpert 1992) and used in many real-world applications to integrate strong base-learners (Jahrer et al. 2010).
- Research has focused on designing elegant meta-learners (Ting and Witten 1999; Rooney and Patterson 2007; Chen et al. 2014).
- Limitations: very few works study how to optimize multi-layer stacked base-learners as a whole; the performance of the ensemble depends on the diversity of the base-learners, which is hard to measure (Sun and Zhou 2018).
Deep Stacking Network framework
- Adopted in various applications (Deng et al. 2013; Li et al. 2015; Deng and Yu 2011).
- The structure is flexible and therefore has multiple variants (Hutchinson et al. 2013; Zhang et al. 2016).
- Base blocks can be trained in parallel (Deng et al. 2013; Deng and Yu 2011).
- Limitation: mainly based on neural networks, which may be unsatisfactory in terms of interpretability. Why not use other sorts of base learners, since a one-layer ANN is not necessarily the best shallow learner?

Support Vector Machine
Given training samples $T = \{(x_k, y_k) \mid y_k \in \{-1, 1\},\ k = 1, \dots, K\}$, maximize the minimum distance from the separating hyperplane $\omega^T x + b = 0$ to $T$:
$$\max_{\omega, b} \ \frac{2}{\|\omega\|} \quad \text{s.t.} \quad y_k(\omega^T x_k + b) \ge 1, \ k = 1, \dots, K.$$
The margin between the bounding hyperplanes $H_1$ and $H_2$ is $\frac{2}{\|\omega\|}$. The optimization of the SVM is convex.
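As a concrete point of reference for the convex max-margin problem above, the following minimal sketch (not from the paper; the toy data and the C value are arbitrary assumptions) fits one linear SVM with scikit-learn and reports the resulting margin width.

```python
# Minimal sketch: fit one linear SVM on toy binary data and inspect the margin.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(50, 2)),   # class +1
               rng.normal(-2.0, 1.0, size=(50, 2))])  # class -1
y = np.array([1] * 50 + [-1] * 50)

# Soft-margin linear SVM with hinge loss; for separable data the margin is 2 / ||w||.
svm = LinearSVC(C=1.0, loss="hinge", max_iter=10000)
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```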

SVM-DSN Block
- We use the one-vs-rest strategy to handle the multi-class problem.
- For a problem with $N$ classes, the input vector $x$ of a DSN block is connected to $N$ binary SVM classifiers (called base-SVMs).
- The block output $y$ concatenates the outputs of the $N$ base-SVMs; the block weights are $\Omega = (\omega_1^T, \omega_2^T, \dots, \omega_N^T)$.
An illustration of a DSN block: input $x$, base-SVMs SVM1, SVM2, ..., SVM $N$, concatenated output $y$.
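A minimal sketch of such a block, assuming the base-SVMs are scikit-learn LinearSVC classifiers and the block output concatenates their N decision values; the class name SVMDSNBlock and its methods are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of a DSN block made of N one-vs-rest base-SVMs.
# The block output concatenates the N decision values, mirroring
# Omega = (w_1^T, ..., w_N^T) collecting the base-SVM weights.
import numpy as np
from sklearn.svm import LinearSVC

class SVMDSNBlock:
    def __init__(self, n_classes, C=1.0):
        self.base_svms = [LinearSVC(C=C, max_iter=10000) for _ in range(n_classes)]

    def fit(self, X, labels):
        # One-vs-rest: base-SVM i gets +1 for class i and -1 for the rest.
        for i, svm in enumerate(self.base_svms):
            svm.fit(X, np.where(labels == i, 1, -1))
        return self

    def transform(self, X):
        # Block output y: concatenated decision values of the N base-SVMs.
        return np.column_stack([svm.decision_function(X) for svm in self.base_svms])
```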

Stacking Blocks
- Denote the $i$-th base-SVM in layer $l$ as $svm_{l,i}$.
- SVM blocks are stacked according to the Deep Stacking Network architecture.
- The input of layer $l+1$ is the concatenation of the raw input and the output of layer $l$.
An illustration of the DSN architecture (Deng, He and Gao 2013)
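The layer-wise wiring described above could be sketched as follows; this is a minimal illustration under my own naming (reusing the hypothetical SVMDSNBlock from the previous sketch), not the paper's code.

```python
# Sketch of DSN-style stacking: every layer sees the raw input x joined
# with the previous layer's output, as described on this slide.
import numpy as np

def stack_fit(blocks, X_raw, labels):
    """Fit blocks layer by layer; layer l+1 sees [X_raw, output of layer l]."""
    X_in = X_raw
    for block in blocks:
        block.fit(X_in, labels)
        X_in = np.hstack([X_raw, block.transform(X_in)])
    return blocks

def stack_forward(blocks, X_raw):
    """Run fitted blocks and return the top block's output."""
    X_in = X_raw
    out = None
    for block in blocks:
        out = block.transform(X_in)
        X_in = np.hstack([X_raw, out])
    return out
```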

Block Training
- The objective function for training each block.
- Efron bootstrap (Efron and Tibshirani 1994) is used to increase the diversity of the base-SVMs.
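The slide only names the bootstrap step; a minimal sketch of that idea, under the assumption that each base-SVM of a block is fit on its own Efron bootstrap resample, is shown below (the paper's actual block objective function is not reproduced).

```python
# Sketch: give each base-SVM of a block its own bootstrap resample so the
# base-SVMs within a block become diverse (Efron bootstrap idea).
import numpy as np

def bootstrap_fit(block, X, labels, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for i, svm in enumerate(block.base_svms):
        idx = rng.integers(0, n, size=n)  # sample n points with replacement
        svm.fit(X[idx], np.where(labels[idx] == i, 1, -1))
    return block
```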

Fine Tuning
- BLT: BP-like Layered Tuning.
- Virtual labels for the base-SVMs in hidden layers.

Objective Function in Fine Tuning
- SVM & SVR: the virtual labels are not guaranteed to lie in {-1, 1}.
- It is still a convex optimization problem.
- The partial derivatives are calculated recursively.
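One hedged way to picture what real-valued virtual labels imply for a hidden base-SVM is an epsilon-insensitive regression refit (the slide's "SVM & SVR" remark); this is my own illustrative reading, and the BLT derivation and the recursive derivative computation from the paper are not reproduced here. LinearSVR and its parameters are standard scikit-learn, while virtual_labels is assumed to be supplied by the BLT step.

```python
# Sketch only: when the virtual labels are real-valued rather than in
# {-1, 1}, a hidden base-SVM can be refit as a linear support vector
# regressor, which remains a convex problem.
from sklearn.svm import LinearSVR

def refit_with_virtual_labels(X_in, virtual_labels, C=1.0, eps=0.1):
    svr = LinearSVR(C=C, epsilon=eps, max_iter=10000)
    svr.fit(X_in, virtual_labels)
    return svr
```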

Model Properties
- Universal approximation: the activation function in SVM-DSN is bounded and non-constant, so Hornik's (1991) universal approximation result applies.
- Block-level parallelization: for a DSN with $L$ blocks, the training of the blocks can be deployed over $L$ processing units.
- Parallelizable training of SVMs: using Parallel Support Vector Machines (Graf et al. 2005).
An illustration of Parallel Support Vector Machines (Graf et al. 2005)

Model Properties
- Anti-saturation: consider the partial derivative of the loss with respect to a neuron's weights. For common activation functions such as sigmoid and ReLU, this derivative can be zero or near zero even when there is still much room to optimize the weights (saturation).
- In the BLT of SVM-DSN, the update is calculated recursively from the virtual labels; a base-SVM in layer $l$ is not saturated unless all of its training samples already satisfy the margin constraint.
- For example, for a sample $k$ whose activation is saturated but whose margin is still violated, BLT can update the base-SVM using that sample, while the BP algorithm cannot because the backpropagated gradient at that sample vanishes.
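To make the saturation contrast concrete, here is a small numeric illustration of my own (not from the paper): the sigmoid derivative vanishes for large activations, whereas the hinge loss keeps a subgradient of -1 on any sample that still violates the margin, which is the intuition behind the anti-saturation claim.

```python
# Saturation illustration: sigmoid gradients vanish for large |z|, while
# the hinge subgradient stays nonzero whenever y * z < 1.
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def hinge_subgrad(z, y):
    # d/dz max(0, 1 - y*z) = -y when y*z < 1, else 0
    return np.where(y * z < 1.0, -y, 0.0)

z = np.array([-8.0, -2.0, 0.0, 2.0, 8.0])
print("sigmoid'(z):        ", sigmoid_grad(z))        # ~0 at |z| = 8 (saturated)
print("hinge subgrad, y=+1:", hinge_subgrad(z, 1.0))  # -1 while the margin is violated
```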

Model Properties
- Interpretation: average confidence of the base-SVMs.
- The change of the "classifying plane" maps across layers illustrates the feature extraction process.

Experiments
Experimental setup
- Data sets: MNIST (LeCun et al. 1998) and the IMDB sentiment classification data set (Maas et al. 2011).
- Sample sizes: MNIST: 28 × 28 images, 60,000 for training and validation, 10,000 for testing; IMDB: 25,000 for training and 25,000 for testing.
- Network structures: (1) a 3-layer SVM-DSN with 200 base-SVMs per hidden layer; (2) a CNN feature extractor with SVM-DSN; (3) a 5-layer SVM-DSN (1024-1024-512-256).
- Purpose: the second model illustrates the compatibility of SVM-DSN with a CNN feature extractor; the third is used to compare with ensemble models.

Experiments
Results on MNIST
- The general-scenario stacking method of Perlich and Swirszcz (2011) is used for comparison.

Conclusions
- In this paper, we rethink how to build a deep network and present a novel model, SVM-DSN.
- It takes advantage of both SVM and DSN: the good mathematical properties of SVM and the flexible structure of DSN.
- As shown in the paper, SVM-DSN has many advantageous properties, including its optimization behavior and anti-saturation.
- The results show that SVM-DSN is competitive compared with traditional methods in various scenarios.

Thank You! Email: fengkai@buaa.edu.cn Webpage: http://www.bigscity.com/