Data Classification with the Radial Basis Function Network Based on a Novel Kernel Density Estimation Algorithm Yen-Jen Oyang Department of Computer Science and Information Engineering National Taiwan University

An Example of Data Classification

Data      Class    Data      Class    Data      Class
(15,33)   O        (18,28)   ×        (16,31)   O
(9,23)    ×        (15,35)   O        (9,32)    ×
(8,15)    ×        (17,34)   O        (11,38)   ×
(11,31)   O        (18,39)   ×        (13,34)   O
(13,37)   ×        (14,32)   O        (19,36)   ×
(18,32)   O        (25,18)   ×        (10,34)   ×
(16,38)   ×        (23,33)   ×        (15,30)   O
(12,33)   O        (21,28)   ×        (13,22)   ×

Distribution of the Data Set
[Scatter plot of the two classes of samples, "。" and "×", in the 2-D plane]

Rule Based on Observation

Rule Generated by the Proposed RBF (Radial Basis Function) Network Based Learning Algorithm
Let L_O(v) and L_X(v) denote the likelihood functions learned for the two classes. If L_O(v) ≥ L_X(v), then prediction = "O"; otherwise prediction = "X".

Class "O" samples: (15,33), (11,31), (18,32), (12,33), (15,35), (17,34), (14,32), (16,31), (13,34), (15,30)
Class "×" samples: (9,23), (8,15), (13,37), (16,38), (18,28), (18,39), (25,18), (23,33), (21,28), (9,32), (11,38), (19,36), (10,34), (13,22)

Identifying Boundary of Different Classes of Objects

Boundary Identified

The Vector Space Model In the vector space model, each object is described by a number of numerical attributes. For example, the physical profile of a person can be described by height, weight, and age.

Transformation of Categorical Attributes into Numerical Attributes Represent the attribute values of the object in a binary table form as exemplified in the following:

Assign appropriate weight to each column. Treat the weighted vector of each row as the feature vector of the corresponding object.
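As a small illustration of this transformation, the sketch below one-hot encodes a single categorical attribute and applies per-column weights; the attribute values and the weights are made-up placeholders, not taken from the slides.

```python
# A minimal sketch of the categorical-to-numerical transformation described above.
# The attribute values and column weights are made-up placeholders.
import numpy as np

objects = ["sunny", "rainy", "cloudy", "sunny"]        # one categorical attribute
categories = sorted(set(objects))                      # columns of the binary table
weights = {"cloudy": 1.0, "rainy": 1.0, "sunny": 1.0}  # per-column weights (assumed)

# Build the binary table: one row per object, one column per category.
binary_table = np.array([[1.0 if obj == c else 0.0 for c in categories]
                         for obj in objects])

# Scale each column by its weight; each weighted row is the object's feature vector.
weighted = binary_table * np.array([weights[c] for c in categories])
print(weighted)
```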

Application of Data Classification in Bioinformatics Data classification has been applied to predict the function and tertiary structure of a protein sequence.

Basics of Protein Structures A typical protein consists of hundreds to thousands of amino acids. There are 20 basic amino acids, each of which is denoted by one English character.

Three-dimensional Structure of Myoglobin Source: Lectures of BioInfo by yukijuan

Prediction of Protein Functions and Tertiary Structures Given a protein sequence, biochemists are interested in its functions and its tertiary structure.

The PDB database, which collects proteins with verified tertiary structures, contains ~19,000 proteins. The SWISSPROT database, which collects proteins with verified functions, contains ~110,000 proteins. The PIR-PSD database, which collects proteins with verified functions, contains ~280,000 proteins. The PIR-NREF database, which collects all protein sequences, contains ~1,060,000 proteins.

Problem Definition of Kernel Smoothing Given the values of a function f(·) at a set of samples S = {s_1, s_2, …, s_n}, we want to find a set of symmetric kernel functions K_i(·) and the corresponding weights w_i such that Σ_i w_i · K_i(v) ≅ f(v).

Kernel Smoothing with the Spherical Gaussian Functions Hartman et al. showed that a linear combination of spherical Gaussian functions can approximate any function with arbitrarily small error. “Layered neural networks with Gaussian hidden units as universal approximations”, Neural Computation, Vol. 2, No. 2, 1990.

With the Gaussian kernel functions, we want to find weights w_i and bandwidths σ_i such that Σ_i w_i · exp(−‖v − s_i‖² / (2σ_i²)) ≅ f(v).

Problem Definition of Kernel Density Estimation Assume that we are given a set of samples taken from a probability distribution in a d-dimensional vector space. The problem now is to find a linear combination of kernel functions that approximates the probability density function of the distribution.

The value of the probability density function at a vector v can be estimated as follows:

p̂(v) ≅ k / ( n · V_k(v) ),

where n is the total number of samples, R_k(v) is the distance between the vector v and its k-th nearest sample, and V_k(v) is the volume of a sphere with radius R_k(v) in the d-dimensional vector space.
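As a concrete illustration of this k-nearest-neighbor estimator, the following sketch computes p̂(v) at a query point; the sample data and the choice k = 5 are arbitrary assumptions.

```python
# A sketch of the k-nearest-neighbor density estimate described above:
# p_hat(v) ~= k / (n * V_k(v)), where V_k(v) is the volume of the d-ball
# whose radius is the distance from v to its k-th nearest sample.
import numpy as np
from scipy.special import gamma

def knn_density(samples, v, k=5):
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    dists = np.sort(np.linalg.norm(samples - v, axis=1))
    r_k = dists[k - 1]                          # distance to the k-th nearest sample
    volume = (np.pi ** (d / 2) / gamma(d / 2 + 1)) * r_k ** d
    return k / (n * volume)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))                # arbitrary 2-D sample set
print(knn_density(data, np.zeros(2)))           # density estimate at the origin
```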

A 1-D Example of Kernel Smoothing with the Spherical Gaussian Functions

The Existing Approaches for Kernel Smoothing with Spherical Gaussian Functions One conventional approach is to place one Gaussian function at each sample s_i. As a result, the problem becomes how to find a weight w_i for each sample such that Σ_i w_i · exp(−‖v − s_i‖² / (2σ²)) ≅ f(v).

The most widely used objective is to minimize Σ_j ( f(v_j) − f̂(v_j) )², where the v_j are test samples and S is the set of training samples. The conventional approach suffers from high time complexity, approaching O(n³), due to the need to compute the inverse of an n × n matrix.
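For contrast, the conventional approach can be sketched as follows: one Gaussian is placed at every training sample and the weights are obtained by solving an n × n linear system, which is where the roughly cubic cost comes from. The bandwidth and the toy target function are illustrative assumptions.

```python
# A sketch of the conventional approach: one Gaussian per training sample,
# weights found by solving an n x n linear system (roughly O(n^3) in general).
import numpy as np

def fit_rbf_weights(samples, values, sigma):
    samples = np.asarray(samples, dtype=float)
    diff = samples[:, None, :] - samples[None, :, :]
    G = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))   # n x n Gram matrix
    return np.linalg.solve(G, values)            # weights of the Gaussian units

def eval_rbf(samples, weights, sigma, v):
    d2 = np.sum((np.asarray(samples) - v) ** 2, axis=1)
    return float(weights @ np.exp(-d2 / (2.0 * sigma ** 2)))

# Toy 1-D target f(x) = sin(x), sampled at 20 points (illustrative only).
xs = np.linspace(0.0, 2 * np.pi, 20).reshape(-1, 1)
w = fit_rbf_weights(xs, np.sin(xs).ravel(), sigma=0.5)
print(eval_rbf(xs, w, 0.5, np.array([1.0])), np.sin(1.0))
```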

M. Orr proposed a number of approaches to reduce the number of units in the hidden layer of the RBF network. Beatson et al. proposed O(n log n) learning algorithms using polyharmonic spline functions.

An O(n) Algorithm for Kernel Smoothing In the proposed learning algorithm, we assume uniform sampling. That is, samples are located at the grid points of an evenly spaced grid in the d-dimensional vector space. Let δ denote the distance between two adjacent samples. If the assumption of uniform sampling does not hold, then some sort of interpolation can be conducted to obtain approximate function values at the grid points.

A 2-D Example of Uniform Sampling

The Basic Idea of the O(n) Kernel Smoothing Algorithm Under the assumption that the sampling density is sufficiently high, the function values at a sample s_i and at its k nearest samples, s_{i_1}, …, s_{i_k}, are virtually equal; that is, f(s_i) ≅ f(s_{i_1}) ≅ … ≅ f(s_{i_k}). In other words, f is virtually a constant function equal to f(s_i) in the proximity of s_i.

Accordingly, we can expect that a linear combination of Gaussian functions centered at these samples approximates f(v) well in the proximity of s_i.

A 1-D Example

In the 1-D example, samples are located at s_i = i·δ, where i is an integer. Under the assumption that the sampling density is sufficiently high, we have f(s_{i−1}) ≅ f(s_i) ≅ f(s_{i+1}). The issue now is to find appropriate weights w_i and a bandwidth σ such that Σ_i w_i · exp(−(v − s_i)² / (2σ²)) ≅ f(v).

If we set w_i = δ · f(s_i) / (√(2π)·σ), then, since f is virtually constant near v, we have Σ_i w_i · exp(−(v − s_i)² / (2σ²)) ≅ f(v) · Σ_i (δ / (√(2π)·σ)) · exp(−(v − s_i)² / (2σ²)).

Therefore, with σ = β·δ, we can set w_i = f(s_i) / (√(2π)·β) and obtain Σ_i w_i · exp(−(v − s_i)² / (2σ²)) ≅ f(v) for v in the sampled region.

In fact, it can be shown that with σ = β·δ, the sum Σ_i (δ / (√(2π)·σ)) · exp(−(v − s_i)² / (2σ²)) is bounded within a narrow interval around 1. Therefore, we have the following function approximator:

f̂(v) = Σ_i ( f(s_i) / (√(2π)·β) ) · exp(−(v − s_i)² / (2σ²)), with σ = β·δ.
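A minimal sketch of this construction, assuming a uniform 1-D grid: each sample contributes one Gaussian with bandwidth σ = β·δ and weight f(s_i)/(√(2π)·β), so the weights are obtained in a single O(n) pass with no linear system to solve. The target function and β = 0.7 are illustrative choices.

```python
# A sketch of the O(n) kernel smoothing construction on a uniform 1-D grid:
# each sample s_i gets a Gaussian with bandwidth sigma = beta * delta and
# weight f(s_i) / (sqrt(2*pi) * beta); no linear system is solved.
import numpy as np

def smooth_1d(sample_xs, sample_fs, beta, query_xs):
    delta = sample_xs[1] - sample_xs[0]          # grid spacing
    sigma = beta * delta
    w = sample_fs / (np.sqrt(2 * np.pi) * beta)  # one weight per sample, O(n)
    diff = query_xs[:, None] - sample_xs[None, :]
    return np.exp(-diff ** 2 / (2 * sigma ** 2)) @ w

xs = np.arange(0.0, 2 * np.pi, 0.1)              # evenly spaced samples, delta = 0.1
fs = np.sin(xs)                                  # assumed target function values
approx = smooth_1d(xs, fs, beta=0.7, query_xs=np.array([1.0, 2.0]))
print(approx, np.sin([1.0, 2.0]))
```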

Generalization of the 1-D Kernel Smoothing Function We can generalize the result by setting σ = β·δ, where β is a real number. The table on the next page shows the bounds of this deviation for various values of β.

Bounds of the Deviation for Various Values of β

An Example of the Effect of Different Settings of β

The Smoothing Effect The kernel smoothing function is actually a weighted average of the sampled function values. Therefore, selecting a larger β value implies that the smoothing effect will be more significant. Our suggestion is to set β to a moderate value; the experiments reported later all use β = 0.7.

An Example of the Smoothing Effect: the smoothing effect, and its elimination with a compensation procedure.

The General Form of a Kernel Smoothing Function in the Multi-Dimensional Vector Space Under the assumption that the sampling density is sufficiently high, the function values at a sample s_i and at its k nearest samples, s_{i_1}, …, s_{i_k}, are virtually equal; that is, f(s_i) ≅ f(s_{i_1}) ≅ … ≅ f(s_{i_k}).

As a result, we can expect that f(v) ≅ Σ_i w_i · exp(−‖v − s_i‖² / (2σ_i²)), where w_i and σ_i are the weights and bandwidths of the Gaussian functions located at s_i, respectively.

Since the influence of a Gaussian function decreases exponentially as the distance increases, we can set k to a value such that, for a vector v in the proximity of sample s_i, the Gaussian functions located at samples other than s_i and its k nearest samples contribute negligibly.

Since f(s_i) ≅ f(s_{i_1}) ≅ … ≅ f(s_{i_k}), our objective is to find weights w_j and bandwidths σ_j such that the linear combination of the Gaussian functions located at s_i and its k nearest samples is virtually equal to f(s_i) in the proximity of s_i.

Let w_j = (δ / (√(2π)·σ))^d · f(s_j). Then, for v in the proximity of s_i, we have Σ_j w_j · exp(−‖v − s_j‖² / (2σ²)) ≅ f(s_i) · Σ_j (δ / (√(2π)·σ))^d · exp(−‖v − s_j‖² / (2σ²)).

Therefore, with σ = β·δ, the sum Σ_j (δ / (√(2π)·σ))^d · exp(−‖v − s_j‖² / (2σ²)) is virtually a constant function equal to 1. Accordingly, we want to set w_j = (δ / (√(2π)·σ))^d · f(s_j) = f(s_j) / (√(2π)·β)^d.

Finally, by setting the bandwidth uniformly to σ = β·δ, we obtain the following kernel smoothing function that approximates f(v):

f̂(v) ≅ Σ_i ( f(s_i) / (√(2π)·β)^d ) · exp(−‖v − s_i‖² / (2σ²)).

Generally speaking, if we set the bandwidth uniformly to σ = β·δ for an arbitrary real β, we will obtain a kernel smoothing function of the same form, with the degree of smoothing controlled by β.

Application in Data Classification One of the applications of the RBF network is data classification. However, recent developments in data classification have focused on support vector machines (SVM), due to accuracy concerns. In this paper, we propose an RBF network based data classifier that delivers the same level of accuracy as the SVM and enjoys some additional advantages.

The Proposed RBF Network Based Classifier The proposed algorithm constructs one RBF network to approximate the probability density function of each class of objects, based on the kernel smoothing algorithm just presented.

The Proposed Kernel Density Estimation Algorithm for Data Classification Classification of a new object v is conducted based on the likelihood function of each class, which is proportional to the prior of the class multiplied by the estimated probability density of that class at v; the object is assigned to the class with the largest likelihood.

Let us adopt the following estimation of the value of the probability density function at each training sample s_i, based on the k-nearest-neighbor estimator introduced earlier: p̂(s_i) ≅ k′ / ( n_m · V_{k′}(s_i) ), where n_m is the number of class-m training samples and V_{k′}(s_i) is the volume of a sphere centered at s_i with radius equal to the distance from s_i to its k′-th nearest training sample of the same class.

In the kernel smoothing problem, we set the bandwidth of each Gaussian function uniformly to β·δ, where δ is the distance between two adjacent training samples. In the kernel density estimation problem, for each training sample s_i, we need to estimate δ_m(s_i), the average distance between two adjacent training samples of the same class in the local region around s_i.

In the d-dimensional vector space, if the average distance between samples is δ, then the number of samples in a subspace of volume V is approximately equal to V / δ^d. Accordingly, we can estimate δ by δ ≅ (V / k)^{1/d}, where k is the number of samples observed in a local subspace of volume V.

Accordingly, with the kernel smoothing function obtained earlier, we have the following approximate probability density function for class-m objects:

f̂_m(v) ≅ Σ_{s_i in class m} ( p̂(s_i) / (√(2π)·β)^d ) · exp(−‖v − s_i‖² / (2σ_i²)), with σ_i = β·δ_m(s_i).

An interesting observation is that, regardless of the value of β, the integral of f̂_m over the vector space is virtually equal to 1. If the observation holds generally, then f̂_m is essentially a legitimate probability density function.

In the discussion above, δ_m(s_i) is defined to be the distance between sample s_i and its nearest training sample of the same class. However, this definition depends on only one single sample and tends to be unreliable if the data set is noisy. We can replace it with the average distance between s_i and its k″ nearest training samples of the same class.
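Putting the pieces together, the following is a rough sketch of a classifier in this spirit, not a faithful re-implementation of the proposed algorithm: for each class, per-sample bandwidths are set to β times the average distance to the k″ nearest same-class samples, and a new object is assigned to the class with the largest prior-weighted density. The parameter values (β = 0.7, k″ = 3) and the Parzen-style normalization are assumptions.

```python
# A rough sketch of an RBF/kernel-density classifier in the spirit of the
# proposed algorithm (not a faithful re-implementation): per-sample bandwidths
# come from the average distance to the k'' nearest same-class samples.
import numpy as np
from scipy.spatial import cKDTree

class KDEClassifier:
    def __init__(self, beta=0.7, k2=3):          # beta and k'' are assumed settings
        self.beta, self.k2 = beta, k2

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            Xc = X[y == c]
            tree = cKDTree(Xc)
            # average distance to the k'' nearest same-class samples (excluding self)
            d, _ = tree.query(Xc, k=self.k2 + 1)
            delta = d[:, 1:].mean(axis=1)
            prior = len(Xc) / len(X)
            self.models_.append((Xc, self.beta * delta, prior))
        return self

    def predict(self, V):
        scores = []
        for Xc, sigma, prior in self.models_:
            d2 = ((V[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
            dens = np.mean(np.exp(-d2 / (2 * sigma ** 2)) /
                           (np.sqrt(2 * np.pi) * sigma) ** V.shape[1], axis=1)
            scores.append(prior * dens)           # prior-weighted class density
        return self.classes_[np.argmax(np.vstack(scores), axis=0)]
```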

Parameter Tuning The discussions so far are based on the assumption that the sampling density is sufficiently high, which may not hold for some real data sets.

Three parameters, namely d′, k′, and k″, are incorporated in the learning algorithm:

One may wonder how β should be set. According to our experimental results, the value of β has essentially no effect on classification accuracy, as long as β is set to a value within a reasonable range.

Time Complexity The average time complexity to construct an RBF network is O(n log n) if the k-d tree structure is employed, where n is the number of training samples. The time complexity to classify c new objects with unknown class is O(c log n) on average, since each object requires a neighborhood query in the k-d tree.
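The complexity figures above rest on k-d tree neighbor queries; the short sketch below only illustrates that ingredient with scipy's cKDTree (the data sizes are arbitrary).

```python
# Building a k-d tree over n training samples and answering neighbor queries;
# this is the data structure the complexity estimates above rely on.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
train = rng.random((10000, 3))                 # n training samples (arbitrary)
queries = rng.random((200, 3))                 # c new objects to classify

tree = cKDTree(train)                          # construction: O(n log n) on average
dist, idx = tree.query(queries, k=10)          # each query: O(log n) on average
print(dist.shape, idx.shape)                   # (200, 10) each
```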

Comparison of Classification Accuracy on the 6 Smaller Data Sets
Classification algorithms compared: the proposed algorithm, SVM, 1NN, and 3NN.
1. iris (150)      (k′ = 24, k″ = 14, d′ = 5, β = 0.7)
2. wine (178)      (k′ = 3, k″ = 16, d′ = 1, β = 0.7)
3. vowel (528)     (k′ = 15, k″ = 1, d′ = 1, β = 0.7)
4. segment (2310)  (k′ = 25, k″ = 1, d′ = 1, β = 0.7)
5. glass (214)     (k′ = 9, k″ = 3, d′ = 2, β = 0.7)
6. vehicle (846)   (k′ = 13, k″ = 8, d′ = 2, β = 0.7)

Comparison of Classification Accuracy on the 3 Larger Data Sets
Classification algorithms compared: the proposed algorithm, SVM, 1NN, and 3NN.
7. satimage (4435, 2000)    (k′ = 6, k″ = 26, d′ = 1, β = 0.7)
8. letter (15000, 5000)     (k′ = 28, k″ = 28, d′ = 2, β = 0.7)
9. shuttle (43500, 14500)   (k′ = 18, k″ = 1, d′ = 3, β = 0.7)

Data Reduction As the proposed learning algorithm is instance-based, removal of redundant training samples will lower the complexity of the RBF network. The effect of a naïve data reduction mechanism was studied. The naïve mechanism removes a training sample if all of its 10 nearest samples belong to the same class as this particular sample, as sketched below.
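The naive reduction rule is straightforward to express in code. The sketch below assumes the 10-nearest-neighbor criterion stated above and uses a k-d tree for the neighbor search.

```python
# A sketch of the naive data reduction rule: drop a training sample if all of
# its 10 nearest training samples belong to the same class as the sample itself.
import numpy as np
from scipy.spatial import cKDTree

def reduce_training_set(X, y, k=10):
    tree = cKDTree(X)
    _, idx = tree.query(X, k=k + 1)            # nearest neighbors incl. the sample itself
    neighbor_labels = y[idx[:, 1:]]            # drop self, keep the k true neighbors
    redundant = np.all(neighbor_labels == y[:, None], axis=1)
    return X[~redundant], y[~redundant]        # keep only the non-redundant samples
```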

Effect of Data Reduction

                                                      satimage   letter    shuttle
# of training samples in the original data set        4435       15000     43500
# of training samples after data reduction
% of training samples remaining                       40.92%     51.96%    1.44%
Classification accuracy after data reduction          92.15%     96.18%    99.32%
Degradation of accuracy due to data reduction         -0.15%     -0.94%    -0.62%

             # of training samples after data reduction    # of support vectors identified by LIBSVM
satimage
letter
shuttle      627                                            287

Execution Times (in seconds)
Cross-validation, classifier-construction, and test times were measured on satimage, letter, and shuttle for the proposed algorithm without data reduction, the proposed algorithm with data reduction, and the SVM.

Conclusions A novel learning algorithm for data classification with the RBF network is proposed. The proposed RBF network based data classification algorithm delivers the same level of accuracy as the SVM.

The time complexity for constructing an RBF network based on the proposed algorithm is O(n log n), which is much lower than that required by the SVM. The proposed RBF network based classifier can handle data sets with multiple classes directly. It is of interest to develop more refined data reduction mechanisms.

The PowerPoint file of this presentation can be downloaded from syslab.csie.ntu.edu.tw. An extended version of the presented paper can be downloaded from the same address.