New Theoretical Frameworks for Machine Learning

New Theoretical Frameworks for Machine Learning Maria-Florina Balcan Thesis Proposal 05/15/2007

Thanks to My Committee: Avrim Blum, Manuel Blum, Tom Mitchell, Yishay Mansour, Santosh Vempala.

The Goal of the Thesis New Theoretical Frameworks for Modern Machine Learning Paradigms Connections between Machine Learning Theory and Algorithmic Game Theory

New Frameworks for Modern Learning Paradigms. Incorporating Unlabeled Data in the Learning Process; Kernel, Similarity based Learning and Clustering. Qualitative gap between theory and practice; a unified theoretical treatment is lacking.
Our Contributions:
- Semi-supervised learning
- Active learning
- A theory of learning with general similarity functions - a unified PAC framework
- Extensions to clustering - new positive theoretical results (with Avrim and Santosh)

Machine Learning Theory and Algorithmic Game Theory: Brief Overview of Our Results. Mechanism Design, ML, and Pricing Problems:
- Generic framework for reducing problems of incentive-compatible mechanism design to standard algorithmic questions [Balcan-Blum-Hartline-Mansour, FOCS 2005, JCSS 2007].
- Approximation algorithms for item pricing: revenue maximization in combinatorial auctions with single-minded consumers [Balcan-Blum, EC 2006].

The Goal of the Thesis New Theoretical Frameworks for Modern Machine Learning Paradigms Semi-Supervised and Active Learning Similarity Based Learning and Clustering Connections between Machine Learning Theory and Algorithmic Game Theory Use MLT techniques for designing and analyzing auctions in the context of Revenue Maximization

The Goal of the Thesis. New Theoretical Frameworks for Modern Machine Learning Paradigms: Incorporating Unlabeled Data in the Learning Process; Kernel, Similarity based learning and Clustering.
Semi-supervised learning (SSL):
- Connections between kernels, margins and feature selection [Balcan-Blum-Vempala, MLJ 2006]
- An Augmented PAC model for SSL [Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
Active Learning (AL):
- Generic agnostic AL procedure [Balcan-Beygelzimer-Langford, ICML 2006]
- Margin based AL of linear separators [Balcan-Broder-Zhang, COLT 2007]
Similarity based learning and clustering:
- A general theory of learning with similarity functions [Balcan-Blum, ICML 2006]
- Extensions to Clustering [Balcan-Blum-Vempala, work in progress]

Outline of this talk. New Theoretical Frameworks for Modern Machine Learning Paradigms:
- Incorporating Unlabeled Data in the Learning Process: Semi-Supervised Learning
- Similarity Based Learning and Clustering: A Theory of Learning with General Similarity Functions; A Theory of Similarity Functions for Clustering

Part I, Incorporating Unlabeled Data in the Learning Process Semi-Supervised Learning A unified PAC-style framework [Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]

Standard Supervised Learning Setting. X – instance/feature space. S = {(x, l(x))} – set of labeled examples, assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c* ∈ C; labels in {-1,1} – binary classification. We want to do optimization over S to find some hypothesis h, but we want h to have small error over D: err(h) = Pr_{x~D}[h(x) ≠ c*(x)]. Classic models for learning from labeled data: Statistical Learning Theory (Vapnik), PAC (Valiant).

Standard Supervised Learning Setting: Sample Complexity. E.g., finite hypothesis spaces, realizable case. In PAC, we can also talk about efficient algorithms.
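The bound itself appeared as an image on the original slide; for reference, a standard statement of the realizable-case result for a finite hypothesis space (a reconstruction, not copied from the deck): with probability at least 1-δ, any h ∈ C consistent with m i.i.d. labeled examples has err(h) ≤ ε provided

$$m \;\ge\; \frac{1}{\epsilon}\left(\ln|C| + \ln\frac{1}{\delta}\right),$$

and the ln|C| term can of course be replaced, e.g., by a VC-dimension-based quantity in the infinite case.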

Semi-Supervised Learning. A hot topic in recent years in Machine Learning. Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
- Transductive SVM [Joachims '98]
- Co-training [Blum & Mitchell '98], [Balcan-Blum-Yang '04]
- Graph-based methods [Blum & Chawla '01], [ZGL '03]
Scattered theoretical results…

An Augmented PAC model for SSL [BB05]. Extends PAC naturally to fit SSL. Can generically analyze:
- Under what conditions will unlabeled data help, and by how much?
- How much data should I expect to need in order to perform well?
Key Insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution. Different algorithms are based on different assumptions about how data should behave. Challenge – how to capture many of the assumptions typically used.

Example of “typical” assumption: Margins. The separator goes through low-density regions of the space (large margin). Assume we are looking for a linear separator; the belief is that there should exist one with large separation. [Figure: labeled data only – SVM vs. Transductive SVM.]

Another Example: Self-consistency. Agreement between two parts: co-training [BM98].
- Examples contain two sufficient sets of features, x = ⟨x1, x2⟩.
- The belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x).
For example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the text info and x2 is the link info (e.g., a page containing the text “My Advisor: Prof. Avrim Blum”).

Problems thinking about SSL in the PAC model. S_u = {x_i} – unlabeled examples drawn i.i.d. from D. S_l = {(x_i, y_i)} – labeled examples drawn i.i.d. from D and labeled by some target concept c*. The PAC model talks of learning a class C under a (known or unknown) distribution D; it is not clear what unlabeled data can do for you, since it doesn't give you any info about which c ∈ C is the target function. (A different model, in which the learner gets to pick the examples to be labeled, is Active Learning.) We extend the PAC model to capture these (and more) uses of unlabeled data, and give a unified framework for understanding when and why unlabeled data can help.

Proposed Model, Main Idea (1). Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution; “learn C” becomes “learn (C, χ)” (i.e. learn class C under compatibility notion χ). Express relationships that one hopes the target function and underlying distribution will possess. Idea: use unlabeled data and the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

Proposed Model, Main Idea (2). Idea: use unlabeled data and our belief to reduce size(C) down to size(highly compatible functions in C) in our sample complexity bounds. Need to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well, so we require that the degree of compatibility be something that can be estimated from a finite sample. Specifically, require χ to be an expectation over individual examples: χ(h,D) = E_{x~D}[χ(h,x)], the compatibility of h with D, where χ(h,x) ∈ [0,1]; and err_unl(h) = 1 - χ(h,D), the incompatibility of h with D (the unlabeled error rate of h).

Margins, Compatibility. Margins: the belief is that there should exist a large-margin separator. Incompatibility of h and D (the unlabeled error rate of h) – the probability mass within distance γ of h. Compatibility can be written as an expectation over individual examples, χ(h,D) = E_{x~D}[χ(h,x)], where χ(h,x) = 0 if dist(x,h) ≤ γ and χ(h,x) = 1 if dist(x,h) ≥ γ. [Figure: a highly compatible large-margin separator.]

Margins, Compatibility. If we do not want to commit to γ in advance, we can define χ(h,x) to be a smooth function of dist(x,h). Illegal notion of compatibility: the largest γ such that D has probability mass exactly zero within distance γ of h (this cannot be written as an expectation over individual examples). A sketch of how such a compatibility can be estimated from an unlabeled sample appears below.
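As a concrete illustration of how err_unl(h) can be estimated from a finite unlabeled sample for the margin notion of compatibility, here is a minimal sketch (not from the original slides); the linear hypothesis (w, b), the threshold gamma, and the toy data are all illustrative assumptions.

```python
import numpy as np

def margin_compatibility(w, b, x, gamma):
    """chi(h, x) for a linear separator h = (w, b): 1 if x lies at
    distance >= gamma from the separating hyperplane, else 0."""
    dist = abs(np.dot(w, x) + b) / np.linalg.norm(w)
    return 1.0 if dist >= gamma else 0.0

def estimated_unlabeled_error(w, b, unlabeled_sample, gamma):
    """Empirical err_unl(h) = 1 - average compatibility over the unlabeled
    sample, i.e. the fraction of points within margin gamma of h."""
    chi_vals = [margin_compatibility(w, b, x, gamma) for x in unlabeled_sample]
    return 1.0 - float(np.mean(chi_vals))

# Toy usage: a random 2-D unlabeled sample and one candidate separator.
rng = np.random.default_rng(0)
S_u = rng.normal(size=(500, 2))
print(estimated_unlabeled_error(np.array([1.0, -1.0]), 0.0, S_u, gamma=0.1))
```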

Co-Training, Compatibility. Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩. The hope is that the two parts of the example are consistent. Legal (and natural) notion of compatibility: the compatibility of ⟨h1, h2⟩ with D, which can be written as an expectation over examples (see the expression below).
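The formula itself appeared as an image on the slide; a natural reconstruction from the surrounding text (the compatibility is the probability that the two views agree) is:

$$\chi(\langle h_1,h_2\rangle, D) \;=\; \Pr_{\langle x_1,x_2\rangle \sim D}\!\big[h_1(x_1)=h_2(x_2)\big] \;=\; \mathbb{E}_{\langle x_1,x_2\rangle \sim D}\!\big[\mathbf{1}[h_1(x_1)=h_2(x_2)]\big].$$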

Types of Results in the [BB05] Model. As in the usual PAC model, we can discuss algorithmic and sample complexity issues. Sample complexity issues that we can address:
- How much unlabeled data we need: depends both on the complexity of C and on the complexity of our notion of compatibility.
- The ability of unlabeled data to reduce the number of labeled examples needed: depends on the compatibility of the target and on (various) measures of the helpfulness of the distribution.
We give both uniform convergence bounds and epsilon-cover based bounds; the epsilon-cover based bounds are very natural in our setting.

Examples of results: Sample Complexity, Uniform Convergence Bounds. Finite hypothesis spaces, doubly realizable case. ALG: pick a compatible concept that agrees with the labeled sample. Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}. We bound the number of labeled examples in terms of the helpfulness of D with respect to χ – a helpful distribution is one in which C_{D,χ}(ε) is small. [Figure: a highly compatible separator.]
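The quantitative bound was shown as an image; qualitatively (a paraphrase of the [BB05]-style statement, so treat constants as indicative only), once enough unlabeled data has been used to estimate compatibilities, the labeled-sample requirement scales with the log of the reduced class rather than of all of C:

$$m_l \;=\; O\!\left(\frac{1}{\epsilon}\left(\ln\big|C_{D,\chi}(\epsilon)\big| + \ln\frac{1}{\delta}\right)\right), \qquad C_{D,\chi}(\epsilon)=\{h \in C : err_{unl}(h) \le \epsilon\}.$$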

Sample Complexity Subtleties. Uniform convergence bounds: the amount of unlabeled data needed depends both on the complexity of C and on the complexity of χ (a distribution-dependent measure of complexity); furthermore, with enough unlabeled data, all $h \in C$ have $|err_{unl}(h)-\widehat{err}_{unl}(h)| \leq \varepsilon$. ε-cover bounds are much better than uniform convergence bounds for algorithms that behave in a specific way: first use the unlabeled data to choose a representative set of compatible hypotheses, then use the labeled sample to choose among these.

Sample Complexity Implications of Our Analysis. Ways in which unlabeled data can help: if c* is highly compatible and we have enough unlabeled data, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low). By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (e.g. the size of the smallest ε-cover). Subsequent work, e.g.: P. Bartlett, D. Rosenberg, AISTATS 2007; J. Shawe-Taylor et al., Neurocomputing 2007.

Efficient Co-training of Linear Separators. Examples ⟨x1, x2⟩ ∈ R^n × R^n. Target functions c1 and c2 are linear separators; assume c1 = c2 = c*, and that no pair crosses the target plane. For f a linear separator in R^n, err_unl(f) is the fraction of the pairs that “cross f's boundary”. Consistency problem: given a set of labeled and unlabeled examples, find a separator that is consistent with the labeled examples and compatible with the unlabeled ones. It is NP-hard – Abie Flaxman.

Efficient Co-training of Linear Separators. Assume independence given the label (both points drawn from D+ or both from D-). [Blum & Mitchell] show one can co-train (in polynomial time) if one has enough labeled data to produce a weakly-useful hypothesis to begin with. [BB05] shows we can learn (in polynomial time) with only a single labeled example. Key point: independence given the label implies that the functions with low err_unl rate are close to c*, close to ¬c*, close to the all-positive function, or close to the all-negative function. Idea: use unlabeled data to generate a polynomial number of candidate hypotheses such that at least one is weakly-useful (uses the Outlier Removal Lemma), then plug into [BM98].

Nice Tool: a “super simple algorithm” for weak learning a large-margin separator – pick h at random. If the margin is 1/poly(n), then a random h has at least a 1/poly(n) chance of being a weak predictor. (A sketch of this idea appears below.)
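A minimal sketch of the "pick h at random" idea, as an illustration only (the advantage threshold, sample sizes, and synthetic data are assumptions, not part of the original procedure): draw random unit vectors and check whether any beats random guessing on a labeled sample.

```python
import numpy as np

def random_hyperplane(n, rng):
    """Pick a uniformly random unit vector in R^n, i.e. a random
    homogeneous linear separator h(x) = sign(w . x)."""
    w = rng.normal(size=n)
    return w / np.linalg.norm(w)

def is_weak_predictor(w, X, y, advantage=0.01):
    """Check whether sign(w . x) beats random guessing by `advantage`
    on the labeled sample (X, y), labels in {-1, +1}."""
    preds = np.sign(X @ w)
    acc = np.mean(preds == y)
    # Either w or -w may be the useful direction.
    return max(acc, 1.0 - acc) >= 0.5 + advantage

# Toy usage: try poly-many random hyperplanes; under a 1/poly(n) margin
# assumption, at least one should be weakly useful with good probability.
rng = np.random.default_rng(1)
n, m = 10, 2000
target = random_hyperplane(n, rng)
X = rng.normal(size=(m, n))
y = np.sign(X @ target)
candidates = [random_hyperplane(n, rng) for _ in range(50 * n)]
print(any(is_weak_predictor(w, X, y) for w in candidates))
```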

Efficient Co-training of Linear Separators. Assume independence given the label. Draw a large unlabeled sample S = {⟨x1^i, x2^i⟩}. If we also assume a large margin:
- run the “super-simple algorithm” poly(n) times;
- feed each candidate c into the [Blum & Mitchell] procedure;
- examine all the hypotheses produced, and pick one h with small err_unl that is far from the all-positive and all-negative functions;
- use one labeled example to choose either h or ¬h.
Proof idea: w.h.p. one random c was a weakly-useful predictor, so on at least one of these steps we end up with a hypothesis h with small err(h), and hence with small err_unl(h). If we don't assume a large margin, we use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in S1 = {x1^i} have margin at least 1/poly; this is sufficient.

Modern Learning Paradigms: Our Contributions. Incorporating Unlabeled Data in the Learning Process; Kernel, Similarity based learning and Clustering.
Semi-supervised learning (SSL):
- Connections between kernels, margins and feature selection [Balcan-Blum-Vempala, MLJ 2006]
- An Augmented PAC model for SSL [Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
Active Learning (AL):
- Generic agnostic AL procedure [Balcan-Beygelzimer-Langford, ICML 2006]
- Margin based AL of linear separators [Balcan-Broder-Zhang, COLT 2007]
Similarity based learning and clustering:
- A general theory of learning with similarity functions [Balcan-Blum, ICML 2006]
- Extensions to Clustering [Balcan-Blum-Vempala, work in progress]

Part II, Similarity Functions for Learning [Balcan-Blum, ICML 2006] Extensions to Clustering (With Avrim and Santosh, work in progress)

Kernels and Similarity Functions Kernels have become a powerful tool in ML. Useful in practice for dealing with many different kinds of data. Elegant theory about what makes a given kernel good for a given learning problem. Our Work: analyze more general similarity functions. In the process we describe ways of constructing good data dependent kernels.

Kernels. A kernel K is a pairwise similarity function such that there exists an implicit mapping φ with K(x,y) = φ(x)·φ(y). The point is: many learning algorithms can be written so they only interact with the data via dot products. If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space. If the data is linearly separable by a large margin in φ-space, we don't have to pay in terms of data or computation time: if the margin is γ in φ-space, we only need roughly 1/γ² examples to learn well.
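To illustrate "interacting with data only via K(x,y)", here is a minimal kernel-perceptron sketch (a standard textbook algorithm, not specific to this thesis); the RBF kernel, sigma, epoch count, and XOR toy data are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Example kernel: K(x,y) = exp(-||x-y||^2 / (2 sigma^2))."""
    d = x - y
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def kernel_perceptron(X, y, K, epochs=50):
    """Perceptron that touches the data only through K(x_i, x_j).
    Returns dual coefficients alpha; predict with
    sign(sum_i alpha_i * y_i * K(x_i, x))."""
    m = len(X)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(m))
            if y[i] * score <= 0:       # mistake: add this example
                alpha[i] += 1.0
    return alpha

# Toy usage: XOR-like data that a kernelized (but not a linear) perceptron fits.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = kernel_perceptron(X, y, rbf_kernel)
predict = lambda x: np.sign(sum(alpha[j] * y[j] * rbf_kernel(X[j], x)
                                for j in range(len(X))))
print([predict(x) for x in X])
```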

General Similarity Functions. We provide a characterization of good similarity functions for a learning problem that:
1) Talks in terms of natural direct properties: no implicit high-dimensional spaces, no requirement of positive-semidefiniteness.
2) If K satisfies these properties for our given problem, then it has implications for learning.
3) Is broad: includes the usual notion of a “good kernel” (one that induces a large-margin separator in φ-space).

A First Attempt: a definition satisfying properties (1) and (2). Let P be a distribution over labeled examples (x, l(x)). K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Example: suppose pairs of positives have K(x,y) ≥ 0.2, pairs of negatives have K(x,y) ≥ 0.2, but for a positive and a negative, K(x,y) is uniform random in [-1,1]. Note: this might not be a legal kernel.

A First Attempt: how to use it? K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Algorithm: draw S+ of O((1/γ²) ln(1/δ²)) positive examples and S- of O((1/γ²) ln(1/δ²)) negative examples; classify x based on which gives the better average score.

A First Attempt: how to use it? (continued) Algorithm: draw S+ of O((1/γ²) ln(1/δ²)) positive examples and S- of O((1/γ²) ln(1/δ²)) negative examples; classify x based on which gives the better average score. Guarantee: with probability ≥ 1-δ, the error is ≤ ε + δ. Proof: by Hoeffding, for any given “good x”, the probability of error with respect to x (over the draw of S+, S-) is at most δ². By Markov, there is at most a δ chance that the error rate over the good x's is more than δ. So the overall error rate is ≤ ε + δ. (A code sketch of this rule appears below.)
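A minimal sketch of this classification rule (illustrative only; the similarity function K, sample sizes, and Gaussian toy data are assumptions): compare the average similarity to a sample of positives against the average similarity to a sample of negatives.

```python
import numpy as np

def similarity_classifier(K, S_plus, S_minus):
    """Classify x by comparing its average similarity to a sample of
    positives vs. a sample of negatives (the first-attempt rule)."""
    def predict(x):
        pos_score = np.mean([K(x, y) for y in S_plus])
        neg_score = np.mean([K(x, y) for y in S_minus])
        return +1 if pos_score >= neg_score else -1
    return predict

# Toy usage with the dot product as the similarity function.
rng = np.random.default_rng(2)
S_plus = rng.normal(loc=+1.0, size=(200, 2))
S_minus = rng.normal(loc=-1.0, size=(200, 2))
predict = similarity_classifier(np.dot, S_plus, S_minus)
print(predict(np.array([0.8, 1.2])), predict(np.array([-1.1, -0.7])))
```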

A First Attempt: Not Broad Enough. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy: E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. Problem: some positives can be more similar to the negatives than to a typical positive; e.g., K(x,y) = x·y can have a large-margin separator and yet not satisfy our definition.

A First Attempt: Not Broad Enough. Idea: the approach would work if we didn't pick the y's from the problematic (top-left) region. Broaden the definition to say: it is OK if there exists a non-negligible (large) region R such that most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.

Broader/Main Definition. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy: E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ.

Main Definition, How to Use It. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy: E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ. Algorithm: draw S+ = {y1, …, yd} and S- = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)). Use them to “triangulate” the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)]. Take a new set of labeled examples, project them to this space, and run your favorite algorithm for learning linear separators. The point is: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4 (the witness is w = [w(y1), …, w(yd), -w(z1), …, -w(zd)]).

Main Definition, Implications. Algorithm: draw S+ = {y1, …, yd} and S- = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)); use them to “triangulate” the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)]. Guarantee: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4. Implication: an arbitrary (ε,γ)-good similarity function K yields, via this mapping (which defines a legal kernel), an (ε+δ, γ/4)-good kernel function. (A sketch of the mapping follows.)
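A minimal sketch of the landmark ("triangulation") mapping followed by a linear learner; scikit-learn's LinearSVC, the dot-product similarity, and the synthetic data are illustrative choices, not part of the original algorithm.

```python
import numpy as np
from sklearn.svm import LinearSVC

def triangulate(K, landmarks_pos, landmarks_neg):
    """F(x) = [K(x, y_1), ..., K(x, y_d), K(x, z_1), ..., K(x, z_d)]."""
    landmarks = list(landmarks_pos) + list(landmarks_neg)
    def F(X):
        return np.array([[K(x, l) for l in landmarks] for x in X])
    return F

# Toy usage: dot-product similarity, random landmarks, then a linear SVM
# trained on the mapped data.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = np.sign(X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]))
d = 20
F = triangulate(np.dot, X[y > 0][:d], X[y < 0][:d])
clf = LinearSVC().fit(F(X), y)
print(clf.score(F(X), y))
```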

Good Kernels are Good Similarity Functions. Main Definition: K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy: E_{y~P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y)≠l(x)] + γ. Theorem: an (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition. Our proofs incurred some penalty: ε' = ε + ε_extra, γ' = γ³·ε_extra. Nati Srebro (COLT 2007) has improved the bounds.

Good Kernels are Good Similarity Functions. Theorem: an (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition, where ε' = ε + ε_extra and γ' = γ³·ε_extra. Proof sketch: suppose K is a good kernel in the usual sense. Then standard margin bounds imply that if S is a random sample of size Õ(1/(γ²ε)), then w.h.p. we can give weights w_S(y) to all examples y ∈ S so that the weighted sum of these examples defines a good LTF. But we want sample-independent weights [and bounded ones]. Boundedness is not too hard (imagine a margin-perceptron run over just the good y); we get sample-independence using an averaging argument.

Learning with Multiple Similarity Functions. Let K1, …, Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good. Algorithm: draw S+ = {y1, …, yd} and S- = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)). Use them to “triangulate” the data: F(x) = [K1(x,y1), …, Kr(x,yd), K1(x,z1), …, Kr(x,zd)]. Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least …; the sample complexity is roughly … (A sketch of the mapping appears below.)
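A minimal sketch of the multi-similarity version of the mapping (an illustration under assumed inputs, reusing the landmark idea above): concatenate the r similarity functions evaluated at all 2d landmarks, giving a 2dr-dimensional representation.

```python
import numpy as np

def multi_triangulate(similarities, landmarks):
    """F(x) = [K_1(x,l_1), ..., K_r(x,l_1), ..., K_1(x,l_m), ..., K_r(x,l_m)]
    for similarity functions K_1..K_r and m = 2d landmark points."""
    def F(X):
        return np.array([[K(x, l) for l in landmarks for K in similarities]
                         for x in X])
    return F

# Example: two similarity functions (dot product and an RBF-style one).
similarities = [np.dot, lambda x, y: np.exp(-np.sum((x - y) ** 2))]
rng = np.random.default_rng(4)
landmarks = rng.normal(size=(10, 3))
F = multi_triangulate(similarities, landmarks)
print(F(rng.normal(size=(5, 3))).shape)   # (5, 2 * 10) = (5, r * 2d)
```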

Implications. A theory that provides a formal way of understanding kernels as similarity functions. The algorithms work for similarity functions that aren't necessarily PSD. Suggests a natural approach for using similarity functions to augment a feature vector in an “anytime” way: e.g., the features for a document can be the list of words in it, plus its similarity to a few “landmark” documents. Formal justification for “Feature Generation for Text Categorization using World Knowledge” [GM '05]. (Mugizi has proposed work on this.)

Clustering via Similarity Functions (Work in Progress, with Avrim and Santosh)

What if only unlabeled examples are available? Consider the following setting: given a data set S of n objects (e.g., documents or web pages), there is some (unknown) “ground truth” clustering (e.g., by topic); each x has a true label l(x) in {1,…,t}. Goal: produce a hypothesis h of low error, up to isomorphism of label names. People have traditionally considered mixture models here. Can we say something in our setting?

What if only unlabeled examples are available? Suppose our similarity function satisfies the stronger condition that the ground truth is “stable”: for all clusters C, C' and all subsets A ⊆ C, A' ⊆ C', A and A' are not both more attracted to each other than to their own clusters (K(x,y) is the attraction between x and y). Then we can construct a tree (a hierarchical clustering) such that the correct clustering is some pruning of this tree. [Example tree: fashion (Dolce & Gabbana, Coco Chanel) vs. sports (volleyball, soccer, gymnastics).]
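A minimal sketch of the kind of bottom-up, average-attraction merging that produces such a tree; this is an illustrative implementation under the stated stability intuition, not the exact algorithm analyzed in the thesis, and the toy data and negative-distance "similarity" are assumptions.

```python
import numpy as np

def average_linkage_tree(S, K):
    """Greedily merge the two currently most attracted clusters, where the
    attraction between clusters A and B is the average pairwise similarity
    K(x, y). Returns the merge history, i.e. a hierarchical clustering tree;
    under the stability condition the target clustering should be a pruning
    of this tree."""
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best, best_score = None, -np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = np.mean([K(S[i], S[j])
                                 for i in clusters[a] for j in clusters[b]])
                if score > best_score:
                    best, best_score = (a, b), score
        a, b = best
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy usage: two well-separated blobs, similarity = negative distance.
rng = np.random.default_rng(5)
S = np.vstack([rng.normal(+2, 0.3, size=(5, 2)),
               rng.normal(-2, 0.3, size=(5, 2))])
print(average_linkage_tree(S, lambda x, y: -np.linalg.norm(x - y)))
```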

Main point. We are exploring the question: what are the minimal conditions on a similarity function that allow it to be useful for clustering? We have considered two relaxations of the clustering objective:
- List clustering – output a small number of candidate clusterings.
- Hierarchical clustering – output a tree such that the right answer is some pruning of it.
Both allow the right answer to be identified with a little bit of additional feedback.

Modern Learning Paradigms: Future Work.
Incorporating Unlabeled Data in the Learning Process – Active Learning:
- Margin based AL of linear separators: extend the analysis to a more general class of distributions, e.g. log-concave.
Kernel, Similarity based learning and Clustering:
- Learning with Similarity Functions: alternative/tighter definitions and connections.
- Clustering via Similarity Functions: can we get an efficient algorithm for the stability-of-large-subsets property? Interactive feedback.

Machine Learning Theory and Algorithmic Game Theory: Future Work. Mechanism Design, ML, and Pricing Problems:
- Revenue maximization in combinatorial auctions with general preferences.
- Extend BBHM '05 to the limited supply setting.
- Approximation algorithms for the case of pricing below cost.

Timeline. Plan to finish in a year.
- Summer 07: Revenue Maximization in General Combinatorial Auctions, limited and unlimited supply.
- Fall 07: Clustering via Similarity Functions; Active Learning under Log-Concave Distributions.
- Spring 08: Wrap-up; writing; job search!

Thank you !