Towards Science of DM
Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
Google: W. Duch
WCCI'08 Panel Discussion

What can we learn?
A good part of CI is about learning. What can we learn? Neural networks are universal approximators and evolutionary algorithms solve global optimization problems, so can everything be learned? Not quite...
Duda, Hart & Stork (Pattern Classification, Ch. 9), No Free Lunch + Ugly Duckling theorems:
- Uniformly averaged over all target functions, the expected error of all learning algorithms [predictions by economists] is the same.
- Averaged over all target functions, no learning algorithm yields generalization error that is superior to any other.
- There is no problem-independent or "best" set of features.
- "Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems."
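The first No Free Lunch statement can be checked by brute force on the smallest possible case. A minimal sketch in Python (the two learners and the train/test split are arbitrary illustrative choices): enumerate all 16 Boolean targets on 2 bits, train both learners on the same two points, and average their error on the two unseen points over all targets. Both averages come out at exactly 0.5.

```python
# Toy No Free Lunch demonstration: averaged over all 16 target functions
# on 2 bits, two very different learners have identical off-training-set
# error (0.5). Learners and split are arbitrary choices for illustration.
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
train_idx, test_idx = [0, 3], [1, 2]            # fixed, arbitrary split

def nearest_neighbor(x, Xtr, ytr):
    return ytr[np.argmin(np.abs(Xtr - x).sum(axis=1))]

def majority(x, Xtr, ytr):
    return int(ytr.sum() * 2 >= len(ytr))       # ignores x entirely

errors = {nearest_neighbor: [], majority: []}
for target in itertools.product([0, 1], repeat=4):   # all 16 target functions
    y = np.array(target)
    for learner in errors:
        preds = np.array([learner(X[i], X[train_idx], y[train_idx])
                          for i in test_idx])
        errors[learner].append(np.mean(preds != y[test_idx]))

for learner, errs in errors.items():
    print(learner.__name__, np.mean(errs))      # both print 0.5
```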

Data mining packages
DM packages: Weka, Yale, RapidMiner, Orange, Knime... more than 180 packages on the the-data-mine.com list! Hundreds of components... thousands of combinations... Our treasure box is full, although computer vision, BCI and other problems are not solved. We can data mine forever... and publish forever!
GhostMiner: data mining tools from our lab + Fujitsu.

Are we really so good?
Surprise! Almost nothing can be learned using such tools!

What have we tried: SBM
Similarity-Based Methods (SBMs) organized in a framework: p(C_i|X;M) posterior classification probabilities or y(X;M) approximators, with models M parameterized in increasingly sophisticated ways.
Why? (Dis)similarity is:
- more general than feature-based description: no need for vector spaces (structured objects);
- more general than the fuzzy approach (F-rules are reduced to P-rules);
- it includes kNN, MLPs, RBFs, separable function networks, SVMs, kernel methods and many others!
Components => Models: a systematic search selects the optimal combination of parameters and procedures, opening different types of optimization channels and trying to discover the appropriate bias for a given problem.
Start from kNN with k=1, all data and features, and Euclidean distance; end with a model that is a novel combination of procedures and parameterizations. A minimal sketch of such a search follows.
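A minimal sketch of the starting point of such a search, assuming scikit-learn as the toolbox (the dataset and parameter grid are illustrative choices, not the framework's actual search space): begin with 1-NN and Euclidean distance, then let a systematic search widen k, the distance function and the vote weighting.

```python
# SBM starting point: 1-NN, Euclidean distance, all features, then a
# systematic search over the similarity-based model space (k, metric,
# weighting). scikit-learn stand-in, not the authors' own framework.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(n_neighbors=1, metric="euclidean"),  # base model
    param_grid={
        "n_neighbors": [1, 3, 5, 9],                          # neighborhood size
        "metric": ["euclidean", "manhattan", "chebyshev"],    # (dis)similarity
        "weights": ["uniform", "distance"],                   # vote weighting
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```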

Transformation-based framework
Extend SBM by adding fine granulation of methods and of the relations between them, to enable meta-learning by search in the model space. For example, transformations (layers) frequently do:
- linear projection: unsupervised (PCA, ICA...) or supervised (FDA, LDA, linear SVM), generating useful linear components;
- non-linear preprocessing transformation, e.g. MLP;
- feature selection, based on an information filter;
- matching pursuit network for signal decomposition;
- logical rules to handle unusual situations;
- similarity evaluation (RBF).
DM requires more transformations! A minimal pipeline in this spirit is sketched below.
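A minimal sketch of such a transformation stack, again assuming scikit-learn: an unsupervised linear projection (PCA), an information-filter feature selector (mutual information) and a similarity-based final layer (RBF-kernel SVM). The components and dataset are illustrative stand-ins for the framework described above, not its implementation.

```python
# Transformations composed as layers: linear projection -> information
# filter -> similarity-based classifier. Illustrative scikit-learn stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

model = Pipeline([
    ("project", PCA(n_components=10)),                   # unsupervised linear projection
    ("filter", SelectKBest(mutual_info_classif, k=5)),   # information filter
    ("similarity", SVC(kernel="rbf")),                   # RBF similarity layer
])
print(cross_val_score(model, X, y, cv=5).mean())
```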

Taxonomy - TF

Heterogeneous everything
Homogeneous systems use one type of "building block" and hence one type of decision border, e.g. neural networks, SVMs, decision trees, kNN. Committees combine many models, but lead to complex models that are difficult to understand. Ockham's razor: simpler systems are better. Discovering the simplest class structures, the inductive bias of the data, requires Heterogeneous Adaptive Systems (HAS).
HAS examples:
- NN with different types of neuron transfer functions;
- kNN with a different distance function for each prototype;
- decision trees with different types of test criteria.
Three routes (option 3 is sketched below):
1. Start from large networks, use regularization to prune.
2. Construct the network by adding nodes selected from a candidate pool.
3. Use very flexible functions, force them to specialize.
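A toy illustration of the HAS idea of mixing transfer functions in one hidden layer: sigmoidal nodes give half-space responses while Gaussian nodes give localized ones. All shapes and weights below are made up for illustration; this is a sketch of the idea, not any particular published architecture.

```python
# A heterogeneous hidden layer: two sigmoidal nodes (half-space borders)
# side by side with two Gaussian nodes (localized clusters).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 samples, 3 features (illustrative)

W_sig = rng.normal(size=(3, 2))      # weights of 2 sigmoidal nodes
C_gau = rng.normal(size=(2, 3))      # centers of 2 Gaussian nodes

sigmoid_act = 1.0 / (1.0 + np.exp(-X @ W_sig))              # half-space responses
dist2 = ((X[:, None, :] - C_gau[None, :, :]) ** 2).sum(-1)
gaussian_act = np.exp(-dist2)                                # localized responses

hidden = np.hstack([sigmoid_act, gaussian_act])              # heterogeneous layer
print(hidden.shape)                                          # (5, 4)
```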

RBF for XOR
Is an RBF solution with 2 hidden Gaussian nodes possible? Typical architecture: 2 inputs, 2 Gaussians, 1 linear output. ML training gives 50% errors, yet there is perfect separation, just not a linear separation! The network knows the answer but cannot say it... A single Gaussian output node may solve the problem. The output weights provide reference hyperplanes (the red and green lines on the slide), not separating hyperplanes as in the case of an MLP.
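The hidden-space picture is easy to reproduce. In the sketch below the two Gaussian centers are placed by hand at the two class-0 corners, (0,0) and (1,1) (an illustrative choice, not the result of ML training); both class-1 points then collapse onto a single hidden image, so the classes form clearly separated clusters in the hidden space.

```python
# XOR seen in the hidden space of two Gaussian nodes with hand-picked
# centers at (0,0) and (1,1): the two class-1 points map to the same
# hidden image, giving clean cluster structure.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels

centers = np.array([[0, 0], [1, 1]])            # illustrative Gaussian centers
hidden = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))

for xi, hi, yi in zip(X, hidden, y):
    print(xi, "->", np.round(hi, 3), "class", yi)
# [0 0] -> [1.    0.135] class 0
# [0 1] -> [0.368 0.368] class 1
# [1 0] -> [0.368 0.368] class 1
# [1 1] -> [0.135 1.   ] class 0
```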

More meta-learning
Meta-learning: learning how to learn, replacing the experts who search for the best models by running lots of experiments. The search space of models is too large to explore exhaustively, so the system architecture must be designed to support knowledge-based search:
- abstract view, uniform I/O, uniform results management;
- directed acyclic graphs (DAGs) of boxes representing scheme placeholders and particular models, interconnected through I/O;
- a configuration level for meta-schemes, expanded at the runtime level.
An exercise in software engineering for data mining!

Intemi, Intelligent Miner
Meta-schemes: templates with placeholders.
- May be nested; the role is decided by the input/output types.
- Machine learning generators based on meta-schemes.
- The granulation level allows novel methods to be created.
- Complexity control: Length + log(time); see the sketch below.
- A unified meta-parameters description...
InteMi, the intelligent miner, coming "soon".
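A guess at how the quoted complexity measure could be applied, assuming "length" is a model-description size and "time" is training time in seconds (both assumptions, as is the candidate list): rank candidate machines so that simple, fast models are tried first.

```python
# Sketch of the complexity control quoted above: complexity = length + log(time),
# used to order candidate machines in the search queue. Units and the
# candidate list are invented for illustration.
import math

def complexity(description_length: float, train_seconds: float) -> float:
    """Rank a candidate model: shorter and faster means it is tried earlier."""
    return description_length + math.log(train_seconds)

candidates = {
    "1-NN, Euclidean":      complexity(description_length=5,   train_seconds=0.1),
    "linear SVM":           complexity(description_length=30,  train_seconds=2.0),
    "MLP, 2 hidden layers": complexity(description_length=200, train_seconds=60.0),
}
for name in sorted(candidates, key=candidates.get):
    print(f"{name}: {candidates[name]:.2f}")
```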

How much can we learn?
Linearly separable or almost separable problems are relatively simple: deform the data or add dimensions to make them separable. But how should "slightly non-separable" be defined? There is only separable, and then the vast realm of the rest.

Boolean functions
- n=2: 16 functions; 14 are linearly separable (12 if the two constant functions are not counted), 2 are not (XOR and its negation).
- n=3: 256 functions; 104 separable (41%), 152 not separable.
- n=4: 64K = 65,536 functions; only about 1880 separable (3%).
- n=5: about 4G functions, but << 1% separable... bad news!
Existing methods may learn some non-separable functions, but most functions cannot be learned! Example: the n-bit parity problem, subject of many papers in top journals, yet no off-the-shelf system is able to solve such problems; for parity problems SVM accuracy may even fall below the base rate! Such problems are solved only by special neural architectures or special classifiers, and only if the type of function is known. But parity is still trivial: it is solved, for example, by a single neuron with a periodic transfer function acting on the sum of bits, y = cos(ω Σᵢ bᵢ).
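The separability counts can be verified by brute force. The sketch below enumerates every Boolean function of n bits and tests linear separability with a small feasibility LP (scipy.optimize.linprog); the strict-margin formulation is an implementation choice. Note that it counts the two constant functions as separable, so it prints 14 for n=2 and 104 for n=3.

```python
# Brute-force verification of the separability counts: for each Boolean
# function, check whether some (w, b) satisfies y_i (w.x_i + b) >= 1.
# Constant functions count as separable (w = 0 plus a suitable bias works).
import itertools
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility LP for the margin constraints y_i (w.x_i + b) >= 1."""
    signs = np.where(y == 1, 1.0, -1.0)
    A_ub = -signs[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub,
                  b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

n = 3
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
count = sum(is_linearly_separable(X, np.array(f))
            for f in itertools.product([0, 1], repeat=2 ** n))
print(f"n={n}: {count} of {2 ** 2 ** n} functions are linearly separable")  # 104 of 256
```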

kD case
3-bit functions: X = [b1, b2, b3], running from [0,0,0] to [1,1,1]; f(b1,b2,b3) and ¬f(b1,b2,b3) are symmetric (a color change). The 8 cube vertices give 2^8 = 256 Boolean functions; the number of functions with 0 to 8 red vertices is 1, 8, 28, 56, 70, 56, 28, 8, 1. For an arbitrary direction W, the projection W·X orders the vertices and gives:
- k=1 in 2 cases: all 8 vectors in 1 cluster (all black or all white);
- k=2 in 14 cases: 8 vectors in 2 clusters (linearly separable);
- k=3 in 42 cases: clusters B R B or W R W;
- k=4 in 70 cases: clusters R W R W or W R W R;
- symmetrically, k=5 to 8 in 70, 42, 14 and 2 cases.
So along a fixed projection most logical functions form 4 or 5 clusters. Learning = finding the best projection for each function. The number of functions that are at best 1-, 2-, 3- and 4-separable is 2, 102, 126 and 26, so 230 of 256 (about 90%) of all functions may be learned using 3-separability. The sketch below checks these counts.
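A brute-force check of the k-separability counts: project the 8 vertices onto many random directions, count the class-homogeneous runs of each function's labels along the sorted projection, and take the minimum over directions. Random sampling is an implementation shortcut; with enough directions the histogram approaches 2, 102, 126, 26.

```python
# Minimal k-separability of all 256 3-bit Boolean functions, estimated by
# minimizing the number of label runs over many random projection directions.
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
V = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)  # 8 vertices

directions = rng.normal(size=(5000, 3))           # random projection directions
orders = np.argsort(V @ directions.T, axis=0).T   # vertex ordering per direction
orders = np.unique(orders, axis=0)                # keep distinct orderings only

histogram = Counter()
for labels in itertools.product([0, 1], repeat=8):       # all 256 functions
    seq = np.array(labels)[orders]                        # labels along each ordering
    runs = 1 + (np.diff(seq, axis=1) != 0).sum(axis=1)    # intervals per direction
    histogram[int(runs.min())] += 1                       # best projection wins

print(sorted(histogram.items()))  # approaches [(1, 2), (2, 102), (3, 126), (4, 26)]
```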

3-bit parity in 2D and 3D
The output is mixed and errors are at the base level (50%), but in the hidden space the classes form clear clusters. Conclusion: separability in the hidden space is perhaps too much to desire;
- inspection of the clusters is sufficient for perfect classification;
- add a second Gaussian layer to capture this activity;
- train a second RBF on the data (stacking), reducing the number of clusters.

Spying on networks
After the initial transformation, what still needs to be done? Conclusion: separability in the hidden space is perhaps too much to desire; use rules, similarity or linear separation, depending on the case.

Parity n=9
Simple gradient learning of the projection (figure: quality index during learning, not reproduced here).
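As a stand-in for the quality index on the slide (the exact index is not reproduced here), the sketch below measures how many class-homogeneous intervals a projection w produces on 9-bit parity: the diagonal w = (1,...,1) counts the 1s and needs only 10 intervals, the optimum, while a random direction needs hundreds. A gradient learner would drive such an index down.

```python
# Stand-in quality index for k-separability learning on 9-bit parity:
# the number of class-homogeneous intervals along the sorted projection.
import itertools
import numpy as np

n = 9
X = np.array(list(itertools.product([0, 1], repeat=n)))
y = X.sum(axis=1) % 2                               # parity labels

def n_intervals(w):
    """Class-homogeneous intervals along the sorted projection w.x."""
    seq = y[np.argsort(X @ w, kind="stable")]
    return 1 + int((np.diff(seq) != 0).sum())

print(n_intervals(np.ones(n)))                                   # 10 (optimal)
print(n_intervals(np.random.default_rng(0).normal(size=n)))      # ~256
```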

Biological justification
Cortical columns may learn to respond to stimuli with complex logic, resonating in different ways. A second column will learn without problems that such different reactions have the same meaning: the inputs x_i and the training targets y_j are the same => Hebbian learning, ΔW_ij ~ x_i y_j => identical weights. Effect: the same linear projection y = W·X, but inhibition turns off one perceptron when the other is active. Simplest solution: oscillators based on a combination of two neurons, σ(W·X − b) − σ(W·X − b'), give localized projections! We have used them in the MLP2LN architecture for extraction of logical rules from data.
Note: k-separability learning is not a multistep output neuron; the targets are not known, and vectors of the same class may appear in different intervals! We need to learn how to find the intervals and how to assign them to classes; new algorithms are needed for this.
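A quick numerical look at that two-neuron window (the slope and interval bounds below are illustrative choices): the difference of two sigmoids responds only when the projection falls inside an interval, which is exactly what is needed to pick out one interval of a k-separable projection.

```python
# The two-neuron "oscillator": sigma(w.x - b) - sigma(w.x - b') is a soft
# window over the projection, high only inside the interval (b, b').
import numpy as np

def sigma(t, slope=10.0):
    return 1.0 / (1.0 + np.exp(-slope * t))

def window(projection, lo, hi):
    """Soft indicator of lo < projection < hi built from two sigmoids."""
    return sigma(projection - lo) - sigma(projection - hi)

proj = np.linspace(0, 9, 10)            # e.g. bit counts of 9-bit strings
print(np.round(window(proj, 2.5, 4.5), 3))
# high (~0.99) only for projections 3 and 4, near 0 elsewhere
```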