
1 k-separability Włodzisław Duch Department of Informatics Nicolaus Copernicus University, Torun, Poland School of Computer Engineering, Nanyang Technological University, Singapore Google: Duch ICANN 2006, Athens

2 Plan What is the greatest challenge of CI? Learning all that can be learned! Surprise: our learning methods are not able to learn most (approximate) logical functions! SVMs, kernels and all that. Neural networks and learning dynamics. A new goal of learning: k-separability. How to learn any function?

3 GhostMiner Philosophy GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/ There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, SVM, committees. Provide tools for visualization of data. Support the process of knowledge discovery/model building and evaluation, organizing it into projects. Separate the process of model building and knowledge discovery (hackers) from model use (lamers) => GhostMiner Developer & GhostMiner Analyzer. We are building completely new tools! Surprise! Almost nothing can be learned using such tools!

4 Easy problems Linearly separable in the original feature space: linear discrimination is sufficient (always worth trying!). Simple topological deformation of decision borders is sufficient – linear separation is then possible in extended (higher dimensional) or transformed spaces. This is frequently sufficient for pattern recognition problems. RBF/MLP networks with one hidden layer also solve such problems easily, but convergence/generalization for anything more complex than XOR is problematic. SVM adds new features to “flatten” the decision border: achieving larger margins/separability in the X+Z space.
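The "flattening" idea above can be made concrete on the smallest non-separable problem. A minimal sketch, assuming a hand-picked product feature x1*x2 as the added dimension (the specific weights below are illustrative, not from the slides): XOR is not linearly separable in 2D, but becomes separable in the extended 3D space.

```python
# XOR truth table: not linearly separable in the original 2D space.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def extended(x):
    """Lift a 2D point into 3D by adding the product feature x1*x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

# In the extended space the plane x1 + x2 - 2*x1*x2 = 0.5 separates the classes.
w, b = (1, 1, -2), -0.5
for x, label in xor.items():
    z = extended(x)
    pred = 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0
    assert pred == label
print("XOR is linearly separable in the extended space")
```

This is the same mechanism a kernel exploits implicitly: one extra coordinate is enough to achieve a large-margin linear split.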

5 Difficult problems Complex logic, disjoint “sandwiched” clusters. Networks with localized functions (RBF) need an exponentially large number of nodes and do not generalize. Networks with non-local functions (MLP) may also need a large number of nodes and will have problems with convergence. Continuous deformation of decision borders is not sufficient to achieve separability (or near-separability). This is typical in logical problems and AI reasoning, and perhaps in perception, object recognition, text analysis, bioinformatics... Boolean functions: for n bits there are K = 2^n binary vectors B_i that can be represented as vertices of an n-dimensional hypercube. Each Boolean function is identified by K bits: BoolF(B_i) = 0 or 1 for i = 1..K, giving 2^K Boolean functions. Ex: for n=2 there are 4 vectors {00, 01, 10, 11} and 16 Boolean functions {0000, 0001, ..., 1111}, identified by decimal numbers 0 to 15.
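The enumeration above is small enough to check by brute force. A minimal sketch, assuming small integer weights suffice at this size (they do for n=2): enumerate all 2^(2^2) = 16 Boolean functions of 2 bits and test each for linear separability.

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))   # the 4 vertices of the 2D hypercube

def separable(truth):
    """Return True if some linear threshold w1*x1 + w2*x2 + b > 0 realizes truth."""
    for w1, w2, b in product(range(-2, 3), repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in zip(inputs, truth)):
            return True
    return False

# Each of the 16 truth tables is a 4-bit string, exactly as in the slide.
count = sum(separable(t) for t in product([0, 1], repeat=4))
print(count)  # 14: only XOR and XNOR are not linearly separable
```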

6 Lattice projection for n=3, 4 Lattice projection of a hypercube is defined by the direction W1 = [1, 1, ..., 1] and an orthogonal direction W2 that maximizes separation of the projected points with a fixed number of 1 bits. Projection on 111...111 gives clusters with 0, 1, 2, ..., n bits.
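The first projection direction is easy to verify directly. A minimal sketch: projecting the n=3 hypercube vertices on W1 = [1,1,1] sends each vertex to its bit count, producing n+1 clusters.

```python
from itertools import product

# Project every vertex of the 3D hypercube on W1 = [1,1,1]:
# the projection value is simply the number of 1 bits.
n = 3
clusters = {}
for vertex in product([0, 1], repeat=n):
    clusters.setdefault(sum(vertex), []).append(vertex)

for ones, verts in sorted(clusters.items()):
    print(ones, verts)  # cluster sizes follow binomial coefficients: 1, 3, 3, 1
```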

7 Boolean functions n=2: 16 functions, 14 separable, 2 not separable (XOR and XNOR). n=3: 256 functions, 104 separable (41%), 152 not separable. n=4: 64K = 65536, only 1882 separable (~3%). n=5: ~4G functions, but far fewer than 1% separable... bad news! Existing methods may learn some non-separable functions, but most functions cannot be learned! Example: the n-bit parity problem; all nearest neighbors are from the opposite class! Many papers on learning parity in top journals. For all parity problems SVM is below the base rate! No off-the-shelf systems are able to solve such problems. Such problems are solved only by special neural architectures or special classifiers – and only if the type of function is known in advance.
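The parity claim above is easy to confirm exhaustively. A minimal sketch: for n-bit parity, every Hamming-distance-1 neighbor of a vertex belongs to the opposite class, which is why nearest-neighbor methods and local kernels fail on it.

```python
from itertools import product

n = 4
parity = lambda v: sum(v) % 2

for v in product([0, 1], repeat=n):
    for i in range(n):
        u = list(v)
        u[i] = 1 - u[i]                        # flip one bit: nearest neighbor
        assert parity(tuple(u)) != parity(v)   # neighbor is in the other class
print("all Hamming-1 neighbors have opposite parity")
```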

8 What do feedforward NNs really do? Vector mappings from the input space to hidden space(s) and to the output space; the hidden-to-output mapping is done by perceptrons. A single hidden layer case is analyzed below. T = {X_i}: training data, N-dimensional. H = {h_j(X_i)}: the image of X in the hidden space, j = 1..N_H. Y = {y_k(h(X_i))}: the image of X in the output space, k = 1..N_C. ANN goal: scatterograms of T in the hidden space should be linearly separable; internal representations determine the network's generalization capabilities and other properties.
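The two mappings can be written out in a few lines. A minimal sketch, with hand-set (not trained) illustrative weights solving XOR, showing the hidden image H and output image Y of each input:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, W_hidden, W_out):
    """Map input x to its hidden image h, then to its output image y.
    Each weight row stores [w_1, ..., w_N, bias]."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
         for ws in W_hidden]
    y = [sum(w * hj for w, hj in zip(ws[:-1], h)) + ws[-1]
         for ws in W_out]
    return h, y

# Illustrative weights: h1 acts like OR, h2 like AND, output y = h1 - h2.
W_hidden = [[6.0, 6.0, -3.0], [6.0, 6.0, -9.0]]
W_out = [[1.0, -1.0, 0.0]]
for x, t in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    h, y = forward(x, W_hidden, W_out)
    print(x, [round(v, 3) for v in h], round(y[0], 3))
```

The images of the training set in the hidden space are exactly the scatterograms the slide refers to: here the two classes land in linearly separable regions of H.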

9 What happens inside? Many types of internal representations may look identical from the outside, but generalization depends on them. Classify different types of internal representations. Take permutational invariance into account: equivalent internal representations may be obtained by re-numbering hidden nodes. Good internal representations should form compact clusters in the internal space. Check if the representations form separable clusters. Discover poor representations and stop training. Analyze the adaptive capacity of networks...

10 RBF for XOR Is an RBF solution with 2 hidden Gaussian nodes possible? Typical architecture: 2 inputs – 2 Gaussians – 2 linear outputs. Perfect separation, but not a linear separation! 50% errors. A single Gaussian output node solves the problem. Output weights provide reference hyperplanes (red and green lines), not separating hyperplanes as in the MLP case. Output codes (ECOC): 10 or 01 for green, and 00 for red.

11 3-bit parity For RBF, parity problems are difficult; an 8-node solution: 1) output activity; 2) reduced output, summing the activity of 4 nodes; 3) hidden 8D space activity, near the ends of the coordinate versors; 4) parallel coordinate representation. The 8-node solution has zero generalization: 50% errors in tests.

12 3-bit parity in 2D and 3D Output is mixed, errors are at the base rate (50%), but in the hidden space the clusters are quite easy to separate – just not linearly. Conclusion: separability is perhaps too much to ask for... inspection of clusters is sufficient for perfect classification; add a second Gaussian layer to capture this activity; just train a second RBF on this data (stacking)!

13 Goal of learning Linear separation is a good goal only if a simple topological deformation of decision borders is sufficient; then neural networks or SVMs may solve the problem. Difficult problems: disjoint clusters, complex logic. When linear separation is difficult, set an easier goal. Linear separation: projection on 2 half-lines in the kernel or hidden activity space: line y = W·X, with y > 0 for class +. Simplest extension: separation into k intervals instead of 2; other extensions: projection on a chessboard or other patterns. For parity: find the direction W with the minimum number of intervals of y = W·X.
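Counting the intervals of a given projection takes one sort. A minimal sketch: project labeled vectors on a direction W, sort by y = W·X, and count the runs of identical labels; the number of runs is the k of that direction.

```python
from itertools import groupby, product

def k_of_projection(data, w):
    """Number of class intervals along the projection y = W.X."""
    projected = sorted((sum(wi * xi for wi, xi in zip(w, x)), label)
                       for x, label in data)
    return len(list(groupby(label for _, label in projected)))

# 3-bit parity projected on W = [1,1,1]: labels alternate 0,1,0,1
# along y = 0,1,2,3, so k = 4 intervals are needed.
parity_data = [(v, sum(v) % 2) for v in product([0, 1], repeat=3)]
print(k_of_projection(parity_data, (1, 1, 1)))  # 4
```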

14 Some remarks Biological justification: a neuron is not a good model for an elementary processor; cortical columns may learn to react to stimuli with complex logic, resonating in different ways; a second column will learn without problems that such different reactions have the same meaning. Algorithms: some results on perceptron capacity for multistep neurons are known, and projection on a line may be achieved using such neurons, but there each step is a different class with known targets. k-separability learning is not a multistep output neuron: the targets are not known, and vectors from the same class may appear in different intervals! We need to learn how to find the intervals and how to assign them to classes; new algorithms are needed to learn this!

15 k-sep learning Try to find the lowest k with a good solution, starting from k=2. Assume k=2 (linear separability) and try to find a good solution; if k=2 is not sufficient, try k=3; the two possibilities are C+, C−, C+ and C−, C+, C−; this requires only one interval for the middle class; if k<4 is not sufficient, try k=4; the two possibilities are C+, C−, C+, C− and C−, C+, C−, C+; this requires one closed and one open interval. The network solution is equivalent to optimization of a specific cost function. Simple backpropagation solved almost all n=4 problems for k=2-5, finding the lowest k with such an architecture!
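For toy problems the lowest-k search can be done exhaustively rather than by backpropagation. A minimal sketch, assuming an integer grid of candidate directions (an illustrative stand-in for the trained W of the slide): for each direction, count the class intervals along the projection and keep the lowest k found.

```python
from itertools import product

def min_k(data, directions):
    """Lowest number of class intervals over all candidate projections."""
    best = None
    for w in directions:
        proj = {}
        for x, label in data:
            y = sum(wi * xi for wi, xi in zip(w, x))
            proj.setdefault(y, set()).add(label)
        if any(len(labels) > 1 for labels in proj.values()):
            continue  # a projected cluster mixes classes: direction unusable
        order = [proj[y].pop() for y in sorted(proj)]
        k = 1 + sum(a != b for a, b in zip(order, order[1:]))
        best = k if best is None else min(best, k)
    return best

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
directions = list(product(range(-2, 3), repeat=2))
print(min_k(xor_data, directions))  # 3: XOR is 3-separable, not 2-separable
```

A result of k=2 would mean linear separability, so the search confirms that no direction linearly separates XOR.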

16 Neural k-separability Can one learn all Boolean functions? Problems may be classified as 2-separable (linear separability); non-separable problems may be broken into k-separable ones, k>2. Architecture: inputs X1..X4 feed one linear node (brown) computing y = W·X; sigmoidal neurons with thresholds (blue), σ(W·X + θ_i), mark the interval boundaries, and their outputs are combined with weights +1 and −1. Neural architecture for k=4 intervals. Will backpropagation learn it?
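The k=4 architecture can be written down with hard-threshold units standing in for steep sigmoids (an assumption for illustration). A minimal sketch: y(X) = s(W·X − 0.5) − s(W·X − 1.5) + s(W·X − 2.5) computes 3-bit parity, each threshold unit marking one interval boundary and the ±1 output weights alternating the class.

```python
from itertools import product

def step(t):
    """Hard threshold: the limit of a steep sigmoid."""
    return 1 if t > 0 else 0

def parity_net(x, w=(1, 1, 1)):
    """k=4 interval network: three threshold units on y = W.X,
    combined with alternating output weights +1, -1, +1."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return step(y - 0.5) - step(y - 1.5) + step(y - 2.5)

for v in product([0, 1], repeat=3):
    assert parity_net(v) == sum(v) % 2
print("k=4 interval network computes 3-bit parity")
```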

17 Even better solution? What is needed to learn Boolean functions? Cluster non-local areas in the X space, using W·X; capture local clusters after the transformation, using G(W·X − θ). SVM cannot solve this problem! The number of directions W that should be considered grows exponentially with the size of the problem n. Constructive neural network solution: 1. Train the first neuron using a G(W·X − θ) transfer function on the whole data T, capturing the largest pure cluster T_C. 2. Train the next neuron on the reduced data T_1 = T − T_C. 3. Repeat until all data is handled; this creates the transformation X => H. 4. Use a linear transformation H => Y for classification. Tests: ongoing, but they look promising!
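The constructive loop can be mimicked on a toy case. A minimal sketch, assuming a fixed direction W = [1,1,1] (the slide trains W per neuron; fixing it is an illustrative simplification): each new "window" unit captures the largest remaining pure cluster along the projection, which is then removed from the training set.

```python
from itertools import product

def largest_pure_interval(data, w):
    """Largest single-class subset captured by a window [lo, hi] on y = W.X."""
    proj = sorted(set(sum(wi * xi for wi, xi in zip(w, x)) for x, _ in data))
    best = None
    for i in range(len(proj)):
        for j in range(i, len(proj)):
            lo, hi = proj[i], proj[j]
            captured = [(x, l) for x, l in data
                        if lo <= sum(wi * xi for wi, xi in zip(w, x)) <= hi]
            if len({l for _, l in captured}) == 1 and \
                    (best is None or len(captured) > len(best)):
                best = captured
    return best

data = [(v, sum(v) % 2) for v in product([0, 1], repeat=3)]
neurons = 0
while data:
    cluster = largest_pure_interval(data, (1, 1, 1))  # train one window unit
    data = [d for d in data if d not in cluster]      # reduce T to T - T_C
    neurons += 1
print(neurons)  # 3
```

Note that greedy covering needs only 3 windows here, one fewer than the k=4 interval model: removing the middle cluster merges the remaining same-class points into a single pure interval.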

18 Summary Difficult learning problems arise when non-connected clusters are assigned to the same class. No off-the-shelf classifiers are able to learn difficult Boolean functions. Brains have no problem learning such mappings, e.g. different character representations of the same speech sound: a, A, あ, ア, Զ, א, அ and others... Visualization of the activity of the hidden neurons shows that frequently perfect but non-separable solutions are found despite base-rate outputs. Linear separability is not the best goal of learning. Other targets that allow for easy handling of the final non-linearity should be defined.

19 Summary cont. The simplest extension is to isolate the non-linearity in the form of k intervals. k-separability allows breaking non-separable problems into well defined classes; quite different from VC theory. For Boolean problems k-separability finds the simplest data model, with a linear projection and k parameters defining the intervals. k-separability may be used in kernel space; many algorithms should be explored. Tests with the simplest backpropagation optimization learned difficult Boolean functions. Prospects for systems that will learn all Boolean functions are good!

20 Thank you for lending your ears... Google: Duch => Papers

