SVM-based Deep Stacking Networks

Presentation transcript:

SVM-based Deep Stacking Networks
The Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, January 27–February 1, 2019
Jingyuan Wang, Kai Feng and Junjie Wu, Beihang University, Beijing, China

Motivation: How to build a deep network?
- Neural networks: proved to be powerful in numerous tasks, for example image classification, machine translation, and trajectory prediction.
- Based on other models: alternatives to neural networks are worth trying, for example PCANet (Chan et al. 2015), gcForest (Zhi-Hua Zhou 2017), and Deep Fisher Networks (Simonyan, Vedaldi, and Zisserman 2013).
An illustration of gcForest (Zhi-Hua Zhou 2017)

Related Works
Stacking ensemble
- Introduced by Wolpert (Wolpert 1992) and used in many real-world applications to integrate strong base-learners (Jahrer et al. 2010).
- Research has focused on designing elegant meta-learners (Ting and Witten 1999; Rooney and Patterson 2007; Chen et al. 2014).
- Limitations: very few works study how to optimize multi-layer stacked base-learners as a whole; the performance of the ensemble depends on the diversity of the base-learners, which is hard to measure (Sun and Zhou 2018).
Deep Stacking Network framework
- Adopted in various applications (Deng et al. 2013; Li et al. 2015; Deng and Yu 2011).
- The structure is flexible and therefore has multiple variants (Hutchinson et al. 2013; Zhang et al. 2016).
- Base blocks can be trained in parallel (Deng et al. 2013; Deng and Yu 2011).
- Limitation: mainly based on neural networks, which may be unsatisfactory in terms of interpretability. Why not use other sorts of base learners, since a one-layer ANN is not necessarily the best shallow learner?

Support Vector Machine
Given training samples $T = \{(x_k, y_k) \mid y_k \in \{-1, 1\},\ k = 1, \dots, K\}$, maximize the minimum distance from the separating hyperplane $\omega^T x + b = 0$ to $T$:
$$\max_{\omega, b} \ \frac{2}{\|\omega\|} \quad \text{s.t.} \quad y_k(\omega^T x_k + b) \ge 1, \ k = 1, \dots, K.$$
The margin between the bounding hyperplanes $H_1$ and $H_2$ is $\frac{2}{\|\omega\|}$. The optimization of the SVM is convex.
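As a concrete point of reference for the convex max-margin problem above, the following minimal sketch (not from the paper; the toy data and the C value are arbitrary assumptions) fits one linear SVM with scikit-learn and reports the resulting margin width.

```python
# Minimal sketch: fit one linear SVM on toy binary data and inspect the margin.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(50, 2)),   # class +1
               rng.normal(-2.0, 1.0, size=(50, 2))])  # class -1
y = np.array([1] * 50 + [-1] * 50)

# Soft-margin linear SVM with hinge loss; for separable data the margin is 2 / ||w||.
svm = LinearSVC(C=1.0, loss="hinge", max_iter=10000)
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```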

SVM-DSN Block
- We use the one-vs-rest strategy to handle the multi-class problem.
- For a problem with $N$ classes, the input vector $x$ of a DSN block is connected to $N$ binary SVM classifiers (called base-SVMs).
- The block output $y$ concatenates the outputs of the $N$ base-SVMs; the block weights are $\Omega = (\omega_1^T, \omega_2^T, \dots, \omega_N^T)$.
An illustration of a DSN block: input $x$, base-SVMs SVM1, SVM2, ..., SVM $N$, concatenated output $y$.
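A minimal sketch of such a block, assuming the base-SVMs are scikit-learn LinearSVC classifiers and the block output concatenates their N decision values; the class name SVMDSNBlock and its methods are hypothetical illustrations, not the authors' code.

```python
# Hypothetical sketch of a DSN block made of N one-vs-rest base-SVMs.
# The block output concatenates the N decision values, mirroring
# Omega = (w_1^T, ..., w_N^T) collecting the base-SVM weights.
import numpy as np
from sklearn.svm import LinearSVC

class SVMDSNBlock:
    def __init__(self, n_classes, C=1.0):
        self.base_svms = [LinearSVC(C=C, max_iter=10000) for _ in range(n_classes)]

    def fit(self, X, labels):
        # One-vs-rest: base-SVM i gets +1 for class i and -1 for the rest.
        for i, svm in enumerate(self.base_svms):
            svm.fit(X, np.where(labels == i, 1, -1))
        return self

    def transform(self, X):
        # Block output y: concatenated decision values of the N base-SVMs.
        return np.column_stack([svm.decision_function(X) for svm in self.base_svms])
```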

Stacking Blocks
- Denote the $i$-th base-SVM in layer $l$ as $svm_{l,i}$.
- SVM blocks are stacked according to the Deep Stacking Network architecture.
- The input of layer $l+1$ is the concatenation of the raw input and the output of layer $l$.
An illustration of the DSN architecture (Deng, He and Gao 2013)
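The layer-wise wiring described above could be sketched as follows; this is a minimal illustration under my own naming (reusing the hypothetical SVMDSNBlock from the previous sketch), not the paper's code.

```python
# Sketch of DSN-style stacking: every layer sees the raw input x joined
# with the previous layer's output, as described on this slide.
import numpy as np

def stack_fit(blocks, X_raw, labels):
    """Fit blocks layer by layer; layer l+1 sees [X_raw, output of layer l]."""
    X_in = X_raw
    for block in blocks:
        block.fit(X_in, labels)
        X_in = np.hstack([X_raw, block.transform(X_in)])
    return blocks

def stack_forward(blocks, X_raw):
    """Run fitted blocks and return the top block's output."""
    X_in = X_raw
    out = None
    for block in blocks:
        out = block.transform(X_in)
        X_in = np.hstack([X_raw, out])
    return out
```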

Block Training
- The objective function for training each block.
- Efron bootstrap (Efron and Tibshirani 1994) is used to increase the diversity of the base-SVMs.
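The slide only names the bootstrap step; a minimal sketch of that idea, under the assumption that each base-SVM of a block is fit on its own Efron bootstrap resample, is shown below (the paper's actual block objective function is not reproduced).

```python
# Sketch: give each base-SVM of a block its own bootstrap resample so the
# base-SVMs within a block become diverse (Efron bootstrap idea).
import numpy as np

def bootstrap_fit(block, X, labels, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for i, svm in enumerate(block.base_svms):
        idx = rng.integers(0, n, size=n)  # sample n points with replacement
        svm.fit(X[idx], np.where(labels[idx] == i, 1, -1))
    return block
```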

Fine Tuning
- BLT: BP-like Layered Tuning.
- Virtual labels for the base-SVMs in hidden layers.

Objective Function in Fine Tuning
- SVM & SVR: the virtual labels are not guaranteed to lie in {-1, 1}.
- It is still a convex optimization problem.
- The partial derivatives are calculated recursively.
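One hedged way to picture what real-valued virtual labels imply for a hidden base-SVM is an epsilon-insensitive regression refit (the slide's "SVM & SVR" remark); this is my own illustrative reading, and the BLT derivation and the recursive derivative computation from the paper are not reproduced here. LinearSVR and its parameters are standard scikit-learn, while virtual_labels is assumed to be supplied by the BLT step.

```python
# Sketch only: when the virtual labels are real-valued rather than in
# {-1, 1}, a hidden base-SVM can be refit as a linear support vector
# regressor, which remains a convex problem.
from sklearn.svm import LinearSVR

def refit_with_virtual_labels(X_in, virtual_labels, C=1.0, eps=0.1):
    svr = LinearSVR(C=C, epsilon=eps, max_iter=10000)
    svr.fit(X_in, virtual_labels)
    return svr
```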

Model Properties
- Universal approximation: the activation function in SVM-DSN is bounded and non-constant, so Hornik's (1991) universal approximation result applies.
- Block-level parallelization: for a DSN with $L$ blocks, the training of the blocks can be deployed over $L$ processing units.
- Parallelizable training of SVMs: using Parallel Support Vector Machines (Graf et al. 2005).
An illustration of Parallel Support Vector Machines (Graf et al. 2005)

Model Properties
- Anti-saturation: consider the partial derivative of the loss with respect to a neuron's weights. For common activation functions such as sigmoid and ReLU, this derivative can be zero or near zero even when there is still much room to optimize the weights (saturation).
- In the BLT of SVM-DSN, the update is calculated recursively from the virtual labels; a base-SVM in layer $l$ is not saturated unless all of its training samples already satisfy the margin constraint.
- For example, for a sample $k$ whose activation is saturated but whose margin is still violated, BLT can update the base-SVM using that sample, while the BP algorithm cannot because the backpropagated gradient at that sample vanishes.
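To make the saturation contrast concrete, here is a small numeric illustration of my own (not from the paper): the sigmoid derivative vanishes for large activations, whereas the hinge loss keeps a subgradient of -1 on any sample that still violates the margin, which is the intuition behind the anti-saturation claim.

```python
# Saturation illustration: sigmoid gradients vanish for large |z|, while
# the hinge subgradient stays nonzero whenever y * z < 1.
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def hinge_subgrad(z, y):
    # d/dz max(0, 1 - y*z) = -y when y*z < 1, else 0
    return np.where(y * z < 1.0, -y, 0.0)

z = np.array([-8.0, -2.0, 0.0, 2.0, 8.0])
print("sigmoid'(z):        ", sigmoid_grad(z))        # ~0 at |z| = 8 (saturated)
print("hinge subgrad, y=+1:", hinge_subgrad(z, 1.0))  # -1 while the margin is violated
```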

Model Properties
- Interpretation: average confidence of the base-SVMs.
- The change of the "classifying plane" maps across layers illustrates the feature extraction process.

Experiments
Experimental setup
- Data sets: MNIST (LeCun et al. 1998) and the IMDB sentiment classification data set (Maas et al. 2011).
- Sample sizes: MNIST: 28 × 28 images, 60,000 for training and validation, 10,000 for testing; IMDB: 25,000 for training and 25,000 for testing.
- Network structures: (1) a 3-layer SVM-DSN with 200 base-SVMs per hidden layer; (2) a CNN feature extractor with SVM-DSN; (3) a 5-layer SVM-DSN (1024-1024-512-256).
- Purpose: the second model illustrates the compatibility of SVM-DSN with a CNN feature extractor; the third is used to compare with ensemble models.

Experiments
Results on MNIST
- The general-scenario stacking method of Perlich and Swirszcz (2011) is used for comparison.

Conclusions
- In this paper, we rethink how to build a deep network and present a novel model, SVM-DSN.
- It takes advantage of both SVM and DSN: the good mathematical properties of SVM and the flexible structure of DSN.
- As shown in the paper, SVM-DSN has many advantageous properties, including its optimization behavior and anti-saturation.
- The results show that SVM-DSN is competitive compared with traditional methods in various scenarios.

Thank You! Email: fengkai@buaa.edu.cn Webpage: http://www.bigscity.com/