How to learn hard Boolean functions Włodzisław Duch Department of Informatics, Nicolaus Copernicus University, Toruń, Poland School of Computer Engineering, Nanyang Technological University, Singapore Google: Duch Polioptimization, 6/2006

Plan. Problem: learning systems are unable to learn most functions! Learning = adaptation of model parameters. Linear discrimination, Support Vector Machines and kernels. Neural networks: what happens in the hidden space? k-separability: how to learn any function?

GhostMiner Philosophy. GhostMiner: data mining tools from our lab + Fujitsu. Separate the process of model building and knowledge discovery (hackers) from model use (lamers) => GhostMiner Developer & GhostMiner Analyzer. There is no free lunch – provide different types of tools for knowledge discovery: decision trees, neural, neurofuzzy, similarity-based, SVM, committees. Provide tools for visualization of data. Support the process of knowledge discovery / model building and evaluation, organizing it into projects. We have built completely new tools! Surprise! Almost nothing can be learned using such tools!

Easy and difficult problems. Linear separation: a good goal if a simple topological deformation of the decision borders is sufficient. Linear separation of such data is possible in higher-dimensional spaces; this is frequently the case in pattern recognition problems. RBF/MLP networks with one hidden layer solve such problems. Difficult problems: disjoint clusters, complex logic. Continuous deformation is not sufficient; networks with localized functions need an exponentially large number of nodes. This is typical in AI problems, real perception, object recognition, text analysis, bioinformatics, logical problems... Boolean functions: for n bits there are K = 2^n binary vectors that can be represented as vertices of an n-dimensional hypercube. Each Boolean function is defined by K bits, BoolF(B_i) = 0 or 1 for i = 1..K, giving 2^K Boolean functions. Example: for n = 2 there are 4 vectors {00, 01, 10, 11} and 16 Boolean functions {0000, 0001, ..., 1111}, i.e. the decimal numbers 0 to 15.
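As a tiny, hypothetical illustration of this representation (not from the original slides): the K = 2^n hypercube vertices for n = 2 and one Boolean function written as its K output bits.

```python
# Illustrative sketch: hypercube vertices for n = 2 and XOR as K = 4 output bits.
import itertools

n = 2
vertices = list(itertools.product([0, 1], repeat=n))   # (0,0), (0,1), (1,0), (1,1)
xor = {v: v[0] ^ v[1] for v in vertices}                # BoolF(B_i) for each vertex
print(vertices, [xor[v] for v in vertices])             # -> bits 0, 1, 1, 0 (XOR)
```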

Lattice projection for n = 3, 4. For normalized data X_i ∈ [0, 1] the FDA projection is close to the lattice projection, defined by the direction W_1 = [1, 1, ..., 1] and by W_2 maximizing the separation of points with a fixed number of 1 bits. Projection on W_1 gives clusters with 0, 1, 2, ..., n bits.

Boolean functions. n = 2: 16 functions, 12 separable, 4 not separable. n = 3: 256 functions, 104 separable (41%), 152 not separable. n = 4: 64K = 65536 functions, only 1880 separable (3%). n = 5: 4G functions, but << 1% separable... bad news! Existing methods may learn some non-separable functions, but most functions cannot be learned! Example: the n-bit parity problem, subject of many papers in top journals. No off-the-shelf systems are able to solve such problems; for all parity problems SVM is below the base rate! Such problems are solved only by special neural architectures or special classifiers – if the type of function is known. Ex: parity problems are solved by exploiting their known structure (cf. the k-separability approach below).
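These separability counts can be checked by brute force. A minimal sketch (assuming NumPy and SciPy; not part of the talk) tests each labelling of the hypercube vertices for linear separability via LP feasibility of y_i (W·X_i + b) >= 1; depending on conventions (e.g. whether the two constant functions are counted) the totals may differ slightly from those quoted above.

```python
# Brute-force count of linearly separable Boolean functions of n bits.
import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """X: hypercube vertices, y: labels in {-1, +1}; LP feasibility test."""
    n = X.shape[1]
    # variables: w (n values) and b; constraints: -y_i * (w.x_i + b) <= -1
    A = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.status == 0          # feasible <=> separable

def count_separable(n):
    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    counts = 0
    for bits in itertools.product([0, 1], repeat=2 ** n):
        y = np.array(bits) * 2 - 1  # map {0, 1} labels to {-1, +1}
        counts += is_linearly_separable(X, y)
    return counts, 2 ** (2 ** n)

print(count_separable(2))           # only XOR-like labellings are non-separable
print(count_separable(3))           # a small fraction of the 256 functions
```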

Linear discrimination. In the feature space X find a direction W that separates the data: g(X) = W·X > θ; for fixed W this defines a half-space. Frequently a single hyperplane (projection on a line) is sufficient to separate the data; if not, find a better space (usually with more features). (Figure: separating hyperplane y = W·X with margin planes g(X) = +1 and g(X) = −1, margin width 1/||W||; g(X) > +1 on one side, g(X) < −1 on the other.)

LDA in a larger space. Suppose that strongly non-linear borders are needed. Use LDA, just add some new dimensions! Add squares X_i^2 and products X_i X_j as new features. Example: 2D => 5D, {X_1, X_2, X_1^2, X_2^2, X_1 X_2}. But the number of such tensor products grows exponentially. (Fig. 4.1, Hastie et al.)
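A minimal sketch of this explicit 2D => 5D expansion (illustrative names, assuming NumPy):

```python
import numpy as np

def expand_2d(X):
    """Map rows [x1, x2] -> [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(expand_2d(X))   # linear discrimination in 5D gives quadratic borders in 2D
```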

How to add new dimensions? In the space defined by the data, expand W in the input vectors: W = Σ_i α_i X^(i). This makes sense, since a component W_Z of W = W_X + W_Z that does not belong to the space spanned by the X^(i) vectors has no influence on the discrimination process, because W_Z^T X = 0. Insert W into the discriminant function: g(X) = Σ_i α_i X^(i)·X + θ. Transform X to a new space Φ(X); then g(X) = Σ_i α_i K(X^(i), X) + θ. Great! The discriminant g(X) has not changed, except that K is now defined in the Φ space. Φ itself is not needed, just the scalar product K(X, X') = Φ(X)·Φ(X'), called a “kernel”.
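The kernel form of the discriminant can be written down directly; a short sketch with a Gaussian kernel (the coefficients alpha and theta are assumed given, e.g. by SVM training):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # K(X, X') = exp(-||X - X'||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def g(x, train_X, alpha, theta, kernel=gaussian_kernel):
    # g(X) = sum_i alpha_i K(X^(i), X) + theta
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, train_X)) + theta
```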

Maximization of margin Among all discriminating hyperplanes there is one defined by support vectors that is clearly better.

SVM. SVM = LDA in the space defined by kernels + optimization that includes maximization of margins (minimization of ||W||), focusing on vectors close to the decision borders. A problem for Bayesian statistics: what data should be used for training? Local priors and conditional distributions work better, but how local should they be? SVM: discrimination based on cases close to the decision border. Kernels may be sophisticated procedures that evaluate the similarity of texts, molecules, DNA strings, etc. Any method may be improved by moving to a kernel space! Even random projection to a high-dimensional space works well.

Gaussian kernels. Gaussian kernels work quite well, giving for Gaussian mixtures errors close to the optimal Bayesian error. The solution requires only continuous deformation of the decision borders and is therefore rather easy. A 4th-degree polynomial kernel is slightly worse than a Gaussian kernel, C = 1. In the kernel space the decision borders are flat!

Neural networks: thyroid screening. Data from the Garavan Institute, Sydney, Australia; 15 binary and 6 continuous features, split into training and validation sets. Goals: determine the important clinical factors and calculate the probability of each diagnosis. (Figure: MLP with clinical findings – age, sex, TSH, T3, TT4, T4U, TBG, ... – as inputs, hidden units, and final diagnoses: normal, hyperthyroid, hypothyroid.)

Learning in neural networks. MLP/RBF: fast MSE reduction at first, very slow later. Typical MSE(t) learning curve: after 10 iterations almost all the work is done, but final convergence is achieved only after a very long process, about 1000 iterations. What is going on?

Learning trajectories. Take the weights W_i from iterations i = 1..K; PCA on the covariance matrix of the W_i captures about 95% of the variance for most data, so the error function shown in 2D displays realistic learning trajectories. Instead of local minima, large flat valleys are seen – why? Data far from the decision borders have almost no influence; the main reduction of MSE is achieved by increasing ||W||, i.e. sharpening the sigmoidal functions. (Papers by M. Kordos & W. Duch.)
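A rough sketch of such a trajectory projection (the array of weight snapshots is an assumed input, not part of the original slides):

```python
import numpy as np

def trajectory_2d(weights_history):
    """weights_history: (K, n_weights) array of weight vectors W_i over training."""
    Wc = weights_history - weights_history.mean(axis=0)
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)   # PCA via SVD
    return Wc @ Vt[:2].T    # 2D coordinates of each iteration, ready for plotting
```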

Selecting support vectors. Active learning: if a vector's contribution to the parameter change is negligible, remove it from the training set. If the difference ε(X) between target and output is sufficiently small, the pattern X will have negligible influence on the training process and may be removed from training. Conclusion: select vectors with ε(X) > ε_min for training. Two problems: possible oscillations and a strong influence of outliers. Solution: adjust ε_min dynamically to avoid oscillations; also remove vectors with ε(X) > 1 − ε_min = ε_max.

SVNT algorithm. Initialize the network parameters W, set Δ = 0.01, ε_min = 0, set SV = T.
Until no improvement is found in the last N_last iterations do:
– Optimize the network parameters for N_opt steps on the SV data.
– Run a feedforward step on T to determine the overall accuracy and errors; take SV = {X | ε(X) ∈ [ε_min, 1 − ε_min]}.
– If the accuracy increases: compare the current network with the previous best one and keep the better one as the current best; increase ε_min = ε_min + Δ and make a forward step selecting the SVs.
– If the number of support vectors |SV| increases: decrease ε_min = ε_min − Δ; decrease Δ = Δ/1.2 to avoid large changes.
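A rough Python sketch of this loop, under stated assumptions: net.train_steps and net.predict are hypothetical helpers for the underlying MLP, and ε(X) is taken as the absolute difference between target and output.

```python
import copy
import numpy as np

def svnt(net, X, y, delta=0.01, n_opt=10, n_last=5):
    eps_min, best_acc, best_net, stale = 0.0, -1.0, copy.deepcopy(net), 0
    sv = np.arange(len(X))                        # start with SV = whole training set T
    while stale < n_last:
        net.train_steps(X[sv], y[sv], n_opt)      # optimize on the current SV data
        out = net.predict(X)                      # feedforward step on all of T
        eps = np.abs(y - out)                     # assumed uncertainty measure eps(X)
        new_sv = np.where((eps >= eps_min) & (eps <= 1 - eps_min))[0]
        acc = float(np.mean((out > 0.5) == (y > 0.5)))
        if acc > best_acc:                        # keep the better of current / previous best
            best_acc, best_net, stale = acc, copy.deepcopy(net), 0
            eps_min += delta                      # tighten the SV selection window
        else:
            stale += 1
        if len(new_sv) > len(sv):                 # |SV| grew: back off and damp the step
            eps_min -= delta
            delta /= 1.2
        sv = new_sv
    return best_net
```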

SVNT XOR solution

Satellite image data. Multi-spectral values of pixels in 3x3 neighborhoods of an 82x100 section of an image taken by the Landsat Multi-Spectral Scanner; intensities 0-255, 4435 training samples and 2000 test samples. The central pixel of each neighborhood belongs to one of the classes: red soil (1072), cotton crop (479), grey soil (961), damp grey soil (415), soil with vegetation stubble (470), very damp grey soil (1038 training samples). Strong overlaps between some classes. (Table: train/test accuracy of SVNT MLP with 36 nodes, SVM with an optimized Gaussian kernel, RBF (Statlog result), MLP (Statlog result), and a C4.5 tree; numeric values not shown.)

Satellite image data – MDS outputs

Hypothyroid data. Two years of real medical screening tests for thyroid diseases: 3772 cases, with 93 primary hypothyroid and 191 compensated hypothyroid, the remaining 3488 cases healthy; 3428 test cases with a similar class distribution. 21 attributes (15 binary, 6 continuous) are given, but only two of the binary attributes (on thyroxine, and thyroid surgery) contain useful information, therefore the number of attributes has been reduced to 8. (Table: % train / % test accuracy of C-MLP2LN rules, MLP+SCG with 4 neurons, SVM with an optimized Minkovsky kernel, MLP+SCG with 4 neurons and 67 SVs, MLP+SCG with 4 neurons and 45 SVs, MLP+SCG with 12 neurons, cascade correlation, MLP+backprop, and SVM with a Gaussian kernel; numeric values not shown.)

Hypothyroid data

What do feedforward NNs really do? Vector mappings from the input space to the hidden space(s) and then to the output space; the hidden-to-output mapping is done by perceptrons. The single-hidden-layer case is analyzed below. T = {X^i}: training data, N-dimensional. H = {h_j(X^i)}: image of X in the hidden space, j = 1..N_H dimensions. Y = {y_k(h(X^i))}: image of X in the output space, k = 1..N_C dimensions. ANN goal: scatterograms of T in the hidden space should be linearly separable; the internal representations determine the network's generalization capabilities and other properties.
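A minimal sketch of these two mappings for a one-hidden-layer MLP (the weight matrices are assumed given):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_and_output(X, W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)   # image of T in the N_H-dimensional hidden space
    Y = sigmoid(H @ W2 + b2)   # image in the N_C-dimensional output space
    return H, Y                # scatterograms of H show the internal representation
```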

What happens inside? Many types of internal representations may look identical from the outside, but generalization depends on them. Classify the different types of internal representations. Take permutational invariance into account: equivalent internal representations may be obtained by re-numbering the hidden nodes. Good internal representations should form compact clusters in the internal space. Check if the representations form separable clusters. Discover poor representations and stop training. Analyze the adaptive capacity of networks...

RBF for XOR. Is an RBF solution with 2 hidden Gaussian nodes possible? Typical architecture: 2 inputs – 2 Gaussians – 2 linear outputs. Perfect separation, but not a linear separation! 50% errors. A single Gaussian output node solves the problem. The output weights provide reference hyperplanes (red and green lines), not separating hyperplanes as in the MLP case. Output codes (ECOC): 10 or 01 for green, and 00 for red.

3-bit parity. For RBF networks parity problems are difficult; an 8-node solution: 1) output activity; 2) reduced output, summing the activity of 4 nodes; 3) activity in the hidden 8D space, near the ends of the coordinate versors; 4) parallel coordinate representation. The 8-node solution has zero generalization: 50% errors in tests.

3-bit parity in 2D and 3D. The output is mixed and errors are at the base level (50%), but in the hidden space... Conclusion: separability is perhaps too much to ask for; inspection of the clusters is sufficient for perfect classification; add a second Gaussian layer to capture this activity, or just train a second RBF on this data (stacking)!

Goal of learning. Linear separation: a good goal if a simple topological deformation of the decision borders is sufficient. Linear separation of such data is possible in higher-dimensional spaces; this is frequently the case in pattern recognition problems. RBF/MLP networks with one hidden layer solve the problem. Difficult problems: disjoint clusters, complex logic. Continuous deformation is not sufficient; networks with localized functions need an exponentially large number of nodes. This is typical in AI problems, real perception, object recognition, text analysis, bioinformatics... Linear separation is too difficult, so set an easier goal. Linear separation = projection onto 2 half-lines in the kernel space: the line y = W·X, with y > 0 for class + and y < 0 for class −. Simplest extension: separation into k intervals. For parity: find the direction W with the minimum number of intervals on y = W·X.
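A minimal sketch (assuming NumPy; not from the slides) of counting class intervals along a projection, showing that n-bit parity projected on W = [1, 1, ..., 1] needs n + 1 intervals:

```python
import itertools
import numpy as np

def intervals_along(W, X, y):
    """Number of pure class intervals along the projection y = W.X."""
    labels = y[np.argsort(X @ W)]
    return 1 + int(np.sum(labels[1:] != labels[:-1]))   # 1 + number of label changes

n = 4
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
y = X.sum(axis=1).astype(int) % 2                        # n-bit parity labels
print(intervals_along(np.ones(n), X, y))                 # -> 5 intervals for n = 4
```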

k-separability. Can one learn all Boolean functions? Problems may be classified as 2-separable (linearly separable); non-separable problems may be broken into k-separable ones, k > 2. (Figure: neural architecture for k = 4 intervals – inputs X_1..X_4 feed a linear node computing y = W·X, followed by sigmoidal nodes σ(W·X + θ) with shifted thresholds, combined with ±1 weights; blue: sigmoidal neurons with thresholds, brown: linear neurons.)
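A sketch of how such a k = 4 architecture can compute its output: a single projection y = W·X passed through three sigmoids with increasing thresholds and combined with +1, −1, +1 weights is high only on two of the four intervals (the thresholds and weight signs here are assumptions for illustration, not taken from the figure):

```python
import numpy as np

def sigma(z, beta=10.0):
    # steep sigmoid approximating a threshold unit
    return 1.0 / (1.0 + np.exp(-beta * z))

def k4_output(X, W, t1, t2, t3):
    y = X @ W                                   # single linear projection y = W.X
    # ~1 for y in (t1, t2) or y > t3: a C-, C+, C-, C+ pattern of 4 intervals
    return sigma(y - t1) - sigma(y - t2) + sigma(y - t3)
```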

k-sep learning. Try to find the lowest k with a good solution, starting from k = 2. Assume k = 2 (linear separability) and try to find a good solution. If k = 2 is not sufficient, try k = 3; the two possibilities are C+, C−, C+ and C−, C+, C−; this requires only one interval for the middle class. If k < 4 is not sufficient, try k = 4; the two possibilities are C+, C−, C+, C− and C−, C+, C−, C+; this requires one closed and one open interval. The network solution is equivalent to optimization of a specific cost function. Simple backpropagation solved almost all n = 4 problems for k = 2-5, finding the lowest k with such an architecture!

A better solution? What is needed to learn Boolean functions? Cluster non-local areas in the X space, using W·X; capture local clusters after the transformation, using G(W·X − θ). SVM cannot solve this problem! The number of directions W that should be considered grows exponentially with the size of the problem n. Constructive neural network solution: 1. Train the first neuron using a G(W·X − θ) transfer function on the whole data T, capturing the largest pure cluster T_C. 2. Train the next neuron on the reduced data T_1 = T − T_C. 3. Repeat until all data are handled; this creates the transformation X => H. 4. Use a linear transformation H => Y for classification.
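A rough sketch of this constructive procedure, under stated assumptions: fit_window_node is a hypothetical helper that trains one node and returns its parameters plus a mask of the pure cluster it captures, and G is an assumed window-like transfer function.

```python
import numpy as np

def G(z, b=4.0):
    # assumed window-like transfer function, high only near z = 0
    return np.exp(-b * z ** 2)

def constructive_transform(X, y, fit_window_node, max_nodes=20):
    nodes, remaining = [], np.arange(len(X))
    while len(remaining) > 0 and len(nodes) < max_nodes:
        W, theta, captured = fit_window_node(X[remaining], y[remaining])
        nodes.append((W, theta))
        remaining = remaining[~captured]   # T_{i+1} = T_i minus the captured pure cluster
    # hidden image H: activity of all trained nodes; a linear map H -> Y follows
    H = np.column_stack([G(X @ W - theta) for W, theta in nodes])
    return H, nodes
```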

Summary. Difficult learning problems arise when non-connected clusters are assigned to the same class. No off-the-shelf classifiers are able to learn difficult Boolean functions. Visualization of the activity of the hidden neurons shows that frequently perfect but non-separable solutions are found, despite base-rate outputs. Linear separability is not the best goal of learning; other targets that allow easy handling of the final non-linearities should be defined. The simplest extension is to isolate the non-linearity in the form of k intervals. k-separability allows breaking non-separable problems into well-defined classes. For Boolean problems k-separability finds the simplest data model, with a linear projection and k parameters defining the intervals. Tests with the simplest backpropagation optimization learned difficult Boolean functions. k-separability may also be used in kernel space. The prospects for systems that will learn all Boolean functions are good!

Thank you for lending your ears... Google: Duch => Papers