Learning with information of features

Presentation on theme: "Learning with information of features"— Presentation transcript:

1 Learning with information of features (2009-06-05)

2 Contents: Motivation, Incorporating prior knowledge on features into learning (AISTATS’07), Regularized learning with networks of features (NIPS’08), Conclusion

3 Contents: Motivation, Incorporating prior knowledge on features into learning (AISTATS’07), Regularized learning with networks of features (NIPS’08), Conclusion

4 Motivation
Given data X ∈ R^{n×d} plus prior information about the samples:
Manifold structure information: LAPSVM
Transformation invariance: VSVM, ISSVM
Permutation invariance: π-SVM
Imbalance information: SVM for imbalanced distributions
Cluster structure information: Structure SVM

5 Motivation
Information in the sample space (the space spanned by the samples).

6 Motivation
Prior information in the feature or attribute space (the space spanned by the features).

7 Motivation
Given data plus prior information about the features, for better generalization.

8 Contents: Motivation, Incorporating prior knowledge on features into learning (AISTATS’07), Regularized learning with networks of features (NIPS’08), Conclusion

9 Incorporating prior knowledge on features into learning (AISTATS’07): Motivation, Kernel design by meta-features, A toy example, Handwritten digit recognition aided by meta-features, Towards a theory of meta-features

10 Incorporating prior knowledge on features into learning (AISTATS’07): Motivation, Kernel design by meta-features, A toy example, Handwritten digit recognition aided by meta-features, Towards a theory of meta-features

11 Incorporating prior knowledge on features into learning (AISTATS’07)
Image recognition task.
Feature: a pixel (gray level).
The coordinates (x, y) of a pixel can be treated as a feature of the feature, i.e. a meta-feature.
Features with similar meta-features (more specifically, adjacent pixels) should be assigned similar weights.
The paper proposes a framework for incorporating meta-features into learning.

12 Incorporating prior knowledge on features into learning (AISTATS’07): Motivation, Kernel design by meta-features, A toy example, Handwritten digit recognition aided by meta-features, Towards a theory of meta-features

13 Kernel design by meta-features
In the standard approach to the linear SVM, we solve a regularized objective (shown as an equation on the slide), which can be viewed as finding the maximum a posteriori hypothesis under the SVM constraints, with a Gaussian prior on w.
The covariance matrix C equals the identity matrix, i.e. all weights are assumed to be independent and to have the same variance.
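The objective itself is not reproduced in the transcript; a standard soft-margin form consistent with the slide's description, with C playing the role of the prior covariance, would be (exact constants may differ from the paper):

```latex
\[
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\, w^{\top} C^{-1} w \;+\; \lambda \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
\]
```

The quadratic regularizer is the negative log-density of a Gaussian prior w ~ N(0, C); with C = I it reduces to the usual (1/2)||w||^2 term.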

14 Kernel design by meta-features
We can use meta-features to create a better prior on w: features with similar meta-features are expected to have similar weights, i.e. the weights should be a smooth function of the meta-features.
Use a Gaussian prior on w defined by a covariance matrix C, where the covariance between a pair of weights is a decreasing function of the distance between their meta-features.
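As a concrete illustration, the sketch below builds such a C from pixel-coordinate meta-features, using an RBF as the decreasing function of meta-feature distance; the function name, the RBF choice, and the hyper-parameters are assumptions for illustration, not values from the paper.

```python
import numpy as np

def metafeature_covariance(U, length_scale=1.0, variance=1.0):
    """Build a prior covariance C over feature weights from meta-features.

    U is a (d, m) array with one row of meta-features per feature, e.g. the
    (x, y) coordinates of each pixel.  The covariance between weights i and j
    decreases with the distance between their meta-features; an RBF is used
    here as one plausible decreasing function (the paper's exact choice may
    differ).
    """
    diff = U[:, None, :] - U[None, :, :]          # (d, d, m) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)          # squared meta-feature distances
    return variance * np.exp(-sq_dist / (2.0 * length_scale ** 2))

# Example: meta-features are the pixel coordinates of an 8 x 8 image,
# giving the prior w ~ N(0, C) instead of the identity covariance.
U = np.array([(x, y) for x in range(8) for y in range(8)], dtype=float)
C = metafeature_covariance(U, length_scale=2.0)   # shape (64, 64)
```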

15 Kernel design by meta-features
The invariance is incorporated through the assumption that the weights are smooth in the meta-feature space.
Gaussian processes: x → y, smoothness of y over the input (feature) space.
This work: u → w, smoothness of the weights w over the meta-feature space.

16 Incorporating prior knowledge on features into learning (AISTATS’07): Motivation, Kernel design by meta-features, A toy example, Handwritten digit recognition aided by meta-features, Towards a theory of meta-features

17 A toy problem
MNIST dataset (2 vs. 5).

18 Incorporating prior knowledge on features into learning (AISTATS’07): Motivation, Kernel design by meta-features, A toy example, Handwritten digit recognition aided by meta-features, Towards a theory of meta-features

19 Handwritten digit recognition aided by meta-features

20 Handwritten digit recognition aided by meta-features
Define features and meta-features (the height is the same for all isosceles triangles).

21 Handwritten digit recognition aided by meta-features
3-input features: 40 × 20 × 20 = 16,000 (40 for the u_r and u_φ values, 20 × 20 for the center position).
2-input features: 8,000 (the same feature is obtained under a rotation of 180°).
Total: 16,000 + 8,000 = 24,000 features.

22 Handwritten digit recognition aided by meta-features
Define the covariance matrix.
The weights of features with different sizes, orientations, or numbers of inputs are uncorrelated.
This gives 40 + 20 identical blocks of size 400 × 400.
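A sketch of how such a block-diagonal prior could be assembled, assuming each 400 × 400 block ranges over the 20 × 20 grid of center positions and reusing an RBF over center distance as the within-block covariance (an illustrative choice, not the paper's exact kernel):

```python
import numpy as np
from scipy.sparse import block_diag

# One 400 x 400 block: covariance over the 20 x 20 grid of center positions,
# decreasing with the distance between centers (RBF chosen for illustration).
centers = np.array([(x, y) for x in range(20) for y in range(20)], dtype=float)
sq_dist = np.sum((centers[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
block = np.exp(-sq_dist / (2.0 * 3.0 ** 2))

# Features with different sizes, orientations or numbers of inputs are
# uncorrelated, so the full prior is block-diagonal: 40 + 20 identical blocks.
n_blocks = 40 + 20
C = block_diag([block] * n_blocks)    # 24000 x 24000, stored sparsely
print(C.shape)                        # (24000, 24000)
```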

23 Handwritten digit recognition aided by meta-features

24 Incorporating prior knowledge on features into learning (AISTATS’07): Motivation, Kernel design by meta-features, A toy example, Handwritten digit recognition aided by meta-features, Towards a theory of meta-features

25 Towards a theory of meta-features

26 Towards a theory of meta-features

27 Towards a theory of meta-features

28 Towards a theory of meta-features


30 Contents: Motivation, Incorporating prior knowledge on features into learning (AISTATS’07), Regularized learning with networks of features (NIPS’08), Conclusion

31 Regularized learning with networks of features (NIPS’08): Motivation, Regularized learning with networks of features, Extensions to feature network regularization, Experiments

32 Regularized learning with networks of features (NIPS’08): Motivation, Regularized learning with networks of features, Extensions to feature network regularization, Experiments

33 Motivation
In supervised learning problems, we may know which features yield similar information about the target variable.
When predicting the topic of a document, we know which words are synonyms.
In image recognition, we know which pixels are adjacent.
Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model.

34 Regularized learning with networks of features (NIPS’08): Motivation, Regularized learning with networks of features, Extensions to feature network regularization, Experiments

35 Regularized learning with networks of features
A directed network or graph of features, G:
Vertices are the features of the model.
Edges link features whose weights are believed to be similar.
P_ij is the weight of the directed edge from vertex i to vertex j.

36 Regularized learning with networks of features
Minimizing the above loss function is equivalent to finding the MAP estimate of w, where w is a priori normally distributed with mean zero and covariance matrix 2M⁻¹.
If P is sparse (only kd entries, with k << d), the additional matrix multiplication is O(d), yet the induced covariance structure over w can be dense.
The feature network regularization penalty is identical to LLE, except that the embedding is found for the feature weights rather than for the data instances.
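The loss function appears only as an image on the slide; based on the stated equivalence to LLE, the penalty can be read as a sum over features of (w_i minus the P-weighted average of its neighbors' weights) squared, which yields a quadratic form in w whose matrix M is built from (I − P). A minimal sketch under that assumption (the function name and the strength alpha are illustrative):

```python
import numpy as np
from scipy import sparse

def feature_network_penalty(w, P, alpha=1.0):
    """LLE-style feature network penalty: alpha * ||(I - P) w||^2.

    Equals alpha * sum_i (w_i - sum_j P_ij * w_j)^2, i.e. each weight is
    pulled toward the weighted average of its neighbors' weights.  P holds
    the directed edge weights (rows typically normalized to sum to 1);
    alpha is an illustrative regularization strength, not a paper value.
    """
    residual = w - P @ w                     # deviation of each weight from its neighborhood
    return alpha * float(residual @ residual)

# Tiny usage example: 4 features chained in a line graph.
P = sparse.csr_matrix(np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]))
w = np.array([1.0, 1.1, 0.9, 1.0])
print(feature_network_penalty(w, P))         # small because neighboring weights are similar
```

If P has only kd nonzero entries, the product P @ w costs O(kd), which matches the slide's note about the cost of the extra matrix multiply even though the implied prior covariance can be dense.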

37 Regularized learning with networks of features (NIPS’08): Motivation, Regularized learning with networks of features, Extensions to feature network regularization, Experiments

38 Extensions to feature network regularization
Regularizing with classes of features: in machine learning, features can often be grouped into classes such that all the weights of the features in a given class are drawn from the same underlying distribution.
Consider k disjoint classes of features whose weights are drawn i.i.d. from N(μ_i, σ²), with μ_i unknown but σ² known and shared across all classes.
The number of edges in this construction scales quadratically with the clique sizes, resulting in feature graphs that are not sparse.

39 Extensions to feature network regularization
Solution: instead of linking every pair of features within a class, each feature can be tied to a single shared class weight u_k, which can be optimized.
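A hedged sketch of what this construction amounts to, since the slide's equation is not reproduced: under the class model on the previous slide, optimizing the shared weight u_k analytically sets it to the class mean, so the penalty becomes a within-class variance term and the number of ties grows linearly with the class size.

```python
import numpy as np

def class_penalty(w, classes, sigma2=1.0):
    """Within-class penalty implied by the class model on the previous slide.

    Weights in class k are assumed i.i.d. N(u_k, sigma2) with u_k unknown.
    Minimizing over each u_k sets it to the class mean, so the penalty
    reduces to the within-class variance of the weights, scaled by
    1 / (2 * sigma2).  `classes` maps a class name to the indices of its
    features.  This is a sketch, not the paper's exact formulation.
    """
    total = 0.0
    for idx in classes.values():
        w_k = w[idx]
        u_k = w_k.mean()                      # optimal shared class weight u_k
        total += np.sum((w_k - u_k) ** 2)
    return total / (2.0 * sigma2)

# Example with two classes of features.
w = np.array([0.9, 1.1, 1.0, -2.0, -1.8])
classes = {"class_a": np.array([0, 1, 2]), "class_b": np.array([3, 4])}
print(class_penalty(w, classes))
```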

40 Extensions to feature network regularization
Incorporating feature dissimilarities.
Regularizing features with the graph Laplacian:
the network penalty penalizes each feature equally, whereas the graph (Laplacian) penalty penalizes each edge equally;
the Laplacian penalty therefore concentrates most of the regularization cost on features with many neighbors.
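A minimal sketch of the Laplacian form of the penalty, w'Lw = sum over edges of (w_i − w_j)², illustrating how a high-degree feature pays for every incident edge; the adjacency matrix and weights below are made up for the example.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import laplacian

def laplacian_penalty(w, A):
    """Graph Laplacian penalty w' L w = sum over edges (w_i - w_j)^2.

    A is a symmetric (sparse) adjacency matrix of the feature graph.  Every
    edge contributes a term, so high-degree features absorb most of the
    regularization cost, as noted on the slide.  Edge weights and scaling
    constants would follow the paper's setup.
    """
    L = laplacian(A)                          # L = D - A
    return float(w @ (L @ w))

# Star graph: feature 0 connected to features 1..4.
A = sparse.csr_matrix(np.array([
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
], dtype=float))
w = np.array([0.0, 1.0, 1.0, 1.0, 1.0])
print(laplacian_penalty(w, A))                # 4.0: the hub feature pays for every edge
```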

41 Regularized learning with networks of features (NIPS’08): Motivation, Regularized learning with networks of features, Extensions to feature network regularization, Experiments

42 Experiments
Experiments on 20 Newsgroups.
Features: the 11,376 words that occurred in at least 20 documents.
Feature similarity: each word is represented by a binary vector marking its presence/absence in the 20,000 documents, and similarities are the cosines between these binary vectors (25 nearest neighbors).
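A sketch of how such a feature graph could be built with scikit-learn; everything beyond "binary presence/absence vectors, cosine similarity, 25 neighbors, words occurring in at least 20 documents" (tokenization, the function name, and so on) is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def word_similarity_graph(docs, min_df=20, n_neighbors=25):
    """Sketch of the 20 Newsgroups feature graph described on the slide.

    Each word is represented as a binary presence/absence vector over the
    documents; cosine similarity between these vectors defines edges to the
    25 nearest neighbors.  min_df=20 mirrors the 'occurred in at least 20
    documents' criterion.
    """
    vec = CountVectorizer(min_df=min_df, binary=True)
    X = vec.fit_transform(docs)                    # documents x words, binary
    sims = cosine_similarity(X.T)                  # words x words similarities
    np.fill_diagonal(sims, -np.inf)                # exclude self-similarity
    neighbors = np.argsort(-sims, axis=1)[:, :n_neighbors]
    return vec.get_feature_names_out(), neighbors
```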

43 Experiments on 20 Newsgroups

44 Experiments
Experiments on sentiment classification (product review datasets; sentimentally charged words from the SentiWordNet dataset).
Features: words from SentiWordNet that also occurred in the product reviews at least 100 times.
1. Words with high positive and negative sentiment scores form a ‘positive word cluster’ and a ‘negative word cluster’; two virtual features are added, with a dissimilarity edge between them.

45 Sentiment Classification

46 Sentiment Classification
2. Compute the correlations of all features with the SentiWordNet features, so that each word is represented as a 200-dimensional vector of correlations with these highly charged sentiment words.
Feature similarity can then be computed from these vectors (100 nearest neighbors).
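A sketch of this second similarity scheme; the slide only says similarity is "computed from those vectors", so the cosine between correlation profiles, the dense matrices, and the function name are assumptions for illustration.

```python
import numpy as np

def correlation_neighbors(X, senti_idx, n_neighbors=100):
    """Represent each word by its correlations with the SentiWordNet words.

    X is a documents-x-words occurrence matrix (dense here for simplicity).
    `senti_idx` indexes the ~200 highly charged sentiment words.  Each word's
    profile is its vector of correlations with those words; word-word
    similarity is then the cosine between profiles, and the 100 nearest
    neighbors define the feature graph.
    """
    corr = np.corrcoef(X.T)                     # word-by-word correlations
    profiles = corr[:, senti_idx]               # each word -> ~200-dim correlation vector
    norms = np.linalg.norm(profiles, axis=1, keepdims=True) + 1e-12
    sims = (profiles / norms) @ (profiles / norms).T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :n_neighbors]
```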

47 Sentiment Classification

48 Contents: Motivation, Incorporating prior knowledge on features into learning (AISTATS’07), Regularized learning with networks of features (NIPS’08), Conclusion

49 Conclusion
Smoothness assumption on the feature weights.
The definition of the meta-features or of the feature similarity graph is restricted to specific applications.
Could feature information be derived directly from the given data?
The discriminative power of individual features.

50 Conclusion
Fisher’s discriminant ratio (F1): emphasizes the geometric characteristics of the class distributions, or more specifically the manner in which the classes are separated, which is most critical for classification accuracy.
Ratio of the separated region (F2).
Feature efficiency (F3).
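The slide does not reproduce the definitions; for a single feature and two classes, Fisher's discriminant ratio is commonly written as

```latex
\[
F_1 \;=\; \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},
\]
```

where μ_c and σ_c² are the class-conditional mean and variance of the feature; larger values indicate a feature that separates the two classes better.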

51 Conclusion
Feature weights could be penalized by a decreasing function of their individual discrimination values, so that features with better discrimination receive more attention, as they are more important for separating the data correctly.
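One illustrative way to express this idea (not a formulation taken from the slides) is to scale the per-feature quadratic penalty by a decreasing function λ(·) of that feature's discrimination value, e.g. its Fisher ratio F_1^{(i)}:

```latex
\[
\min_{w}\; L(w) \;+\; \sum_{i=1}^{d} \lambda\!\left(F_1^{(i)}\right)\, w_i^{2},
\qquad \lambda(\cdot)\ \text{decreasing},
\]
```

so that highly discriminative features are penalized less and can carry larger weights.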

52 Conclusion

53 Thank You!

