Transductive Rademacher Complexity and its Applications
Ran El-Yaniv and Dmitry Pechyony
Technion – Israel Institute of Technology, Haifa, Israel
Induction vs. Transduction

Inductive learning:
- distribution of examples -> training set -> learning algorithm -> hypothesis -> labels of unlabeled examples

Transductive learning (Vapnik '74,'98):
- training set + test set -> learning algorithm -> labels of the test set

Goal: minimize the error on the test set.
Distribution-free Model [Vapnik '74,'98]

[Figure: the full sample; training points marked X, test points marked ?]

Given: a "full sample" of m+u unlabeled examples, each with its true (unknown) label.
The full sample is partitioned into a training set (m points) and a test set (u points).
The labels of the m training points are revealed.
Goal: label the u test examples.
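A minimal sketch of this protocol in Python (toy data; all names such as full_sample and revealed are illustrative, not from the talk):

    import random

    # Distribution-free transductive setting: a fixed full sample of m+u
    # points with hidden true labels is split uniformly at random into a
    # training set (whose labels are then revealed) and a test set.
    m, u = 4, 6
    full_sample = list(range(m + u))
    true_labels = {i: random.choice([-1, +1]) for i in full_sample}  # unknown to the learner

    train_set = set(random.sample(full_sample, m))     # uniformly random partition
    test_set = [i for i in full_sample if i not in train_set]
    revealed = {i: true_labels[i] for i in train_set}  # only training labels are shown

    # The learner sees full_sample, the partition, and revealed, and must
    # output a label for every point in test_set.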
Rademacher complexity

Induction:
- Hypothesis space $\mathcal{F}$: a set of functions $f: \mathcal{X} \to \mathbb{R}$.
- $x_1, \ldots, x_m$: the training points.
- $\sigma_1, \ldots, \sigma_m$: i.i.d. random variables with $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.
- Rademacher complexity: $R_m(\mathcal{F}) = \frac{2}{m} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(x_i) \right]$.

Transduction (version 1):
- Hypothesis space $\mathcal{H}$: a set of vectors $h \in \mathbb{R}^{m+u}$.
- $X_{m+u}$: the full sample, with $m$ training and $u$ test points.
- $\sigma_1, \ldots, \sigma_{m+u}$: distributed as in induction.
- Rademacher complexity: $R_{m+u}(\mathcal{H}) = \left( \frac{1}{m} + \frac{1}{u} \right) \mathbb{E}_\sigma \left[ \sup_{h \in \mathcal{H}} \sum_{i=1}^{m+u} \sigma_i h_i \right]$.
Transductive Rademacher complexity

Version 1:
- $X_{m+u}$: the full sample, with $m$ training and $u$ test points.
- $\mathcal{H} \subseteq \mathbb{R}^{m+u}$: the transductive hypothesis space.
- $\sigma_1, \ldots, \sigma_{m+u}$: i.i.d. random variables with $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.
- Rademacher complexity: $R_{m+u}(\mathcal{H}) = \left( \frac{1}{m} + \frac{1}{u} \right) \mathbb{E}_\sigma \left[ \sup_{h \in \mathcal{H}} \sigma \cdot h \right]$.

Version 2: a sparse distribution of the Rademacher variables:
$\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = p$, $\Pr(\sigma_i = 0) = 1 - 2p$, with $p = \frac{mu}{(m+u)^2}$.
We develop risk bounds with $R_{m+u}(\mathcal{H}, p)$.
Lemma 1: $R_{m+u}(\mathcal{H}, p)$ is monotone non-decreasing in $p$; in particular, the sparse version is bounded by version 1: $R_{m+u}(\mathcal{H}, p) \le R_{m+u}(\mathcal{H}, 1/2)$.
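The definition can be illustrated numerically. The following sketch Monte Carlo approximates $R_{m+u}(\mathcal{H}, p)$ for a small finite set of soft labelings; the hypothesis set H and all sizes below are made-up assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    m, u = 5, 10
    n = m + u
    p = m * u / (m + u) ** 2             # the sparse parameter of version 2
    Q = 1.0 / m + 1.0 / u

    H = rng.uniform(-1, 1, size=(8, n))  # 8 toy soft labelings of the full sample

    def draw_sigma():
        # sigma_i = +1 w.p. p, -1 w.p. p, 0 w.p. 1 - 2p
        r = rng.random(n)
        return np.where(r < p, 1.0, np.where(r < 2 * p, -1.0, 0.0))

    sups = [np.max(H @ draw_sigma()) for _ in range(10000)]
    print("estimated R_{m+u}(H, p):", Q * np.mean(sups))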
Risk bound

Notation:
- $L_u(h)$: the 0/1 error of $h$ on the $u$ test examples.
- $\hat{L}_m^\gamma(h)$: the empirical $\gamma$-margin error of $h$ on the $m$ training examples.

Theorem: For any fixed $\gamma > 0$ and $\delta \in (0,1)$, with probability at least $1-\delta$ over the random partition of the full sample into a training set of size $m$ and a test set of size $u$, for all hypotheses $h \in \mathcal{H}$ it holds that
$L_u(h) \le \hat{L}_m^\gamma(h) + \frac{R_{m+u}(\mathcal{H})}{\gamma} + c_0 Q \sqrt{\min(m,u)} + \sqrt{\frac{S Q}{2} \ln \frac{1}{\delta}}$,
where $Q = \frac{1}{m} + \frac{1}{u}$, $c_0$ is an absolute constant, and $S$ is an explicit factor close to 1.

Proof: based on and inspired by the results of [McDiarmid, '89], [Bartlett and Mendelson, '02] and [Meir and Zhang, '03].

Previous results: [Lanckriet et al., '04] handled the special case of $m = u$.
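As a sketch, the slack added to the empirical margin error in this bound can be computed as follows; the default constant c0 and the factor S = 1 reflect our hedged reconstruction of the theorem, not its exact statement:

    import math

    def risk_bound_slack(R, m, u, gamma, delta, c0=5.05, S=1.0):
        # slack = R/gamma + c0*Q*sqrt(min(m,u)) + sqrt(S*Q/2 * ln(1/delta))
        Q = 1.0 / m + 1.0 / u
        return (R / gamma
                + c0 * Q * math.sqrt(min(m, u))
                + math.sqrt(S * Q / 2 * math.log(1.0 / delta)))

    # e.g., m = 100 training and u = 900 test points:
    print(risk_bound_slack(R=0.3, m=100, u=900, gamma=0.5, delta=0.05))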
Inductive vs. Transductive hypothesis spaces

Induction: to use the risk bounds, the hypothesis space must be defined before observing the training set.
Transduction: the hypothesis space can be defined after observing the full sample $X_{m+u}$, but before observing the actual partition.
Conclusion: transduction allows choosing a data-dependent hypothesis space; for example, it can be optimized to have low Rademacher complexity. This cannot be done in induction!
Another view on transductive algorithms: Unlabeled-Labeled Decomposition (ULD)

The learner computes a matrix $U$ from the unlabeled full sample and a vector $\alpha$ from the revealed training labels; its soft output is $h = U\alpha$.

Example: $U$ = the inverse of the graph Laplacian; $\alpha_i = y_i$ if example $i$ is a training example, and $\alpha_i = 0$ otherwise.
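Here is a minimal ULD sketch in the spirit of the consistency method of [Zhou et al., '03]; the toy graph W and the parameter a are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 8, 3                                   # full sample size, # of labeled points
    W = rng.random((n, n)); W = (W + W.T) / 2     # toy symmetric similarity graph
    np.fill_diagonal(W, 0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D_inv_sqrt @ W @ D_inv_sqrt               # normalized weight matrix

    a = 0.5
    U = np.linalg.inv(np.eye(n) - a * S)          # depends only on unlabeled data

    alpha = np.zeros(n)
    alpha[:m] = [1, -1, 1]                        # revealed training labels; 0 on test points
    h = U @ alpha                                 # soft labels for the full sample
    print(np.sign(h[m:]))                         # predicted labels of the test points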
Bounding the Rademacher complexity

Hypothesis space $\mathcal{H}_{out}$: the set of all outputs $h = U\alpha$ obtained by operating the transductive algorithm on all possible partitions of the full sample.

Notation:
- $A$: the set of $\alpha$'s generated by the algorithm, with $\|\alpha\|_2 \le \mu$ for all $\alpha \in A$.
- $\lambda_1, \ldots, \lambda_r$: all singular values of $U$.

Lemma 2: $R_{m+u}(\mathcal{H}_{out}) \le \mu \sqrt{\frac{2}{mu} \sum_i \lambda_i^2}$.

Lemma 2 justifies the spectral transformations performed to improve the performance of transductive algorithms ([Chapelle et al., '02], [Joachims, '03], [Zhang and Ando, '05]): shrinking the spectrum of $U$ lowers the complexity bound.
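A direct sketch of this bound as reconstructed above; it uses the identity $\sum_i \lambda_i^2 = \|U\|_F^2$, so no SVD is needed:

    import numpy as np

    def rademacher_bound(U, mu, m, u):
        # Lemma 2: R_{m+u}(H_out) <= mu * sqrt((2/(m*u)) * sum_i lambda_i^2)
        sq_singular_sum = np.linalg.norm(U, 'fro') ** 2   # = sum of squared singular values
        return mu * np.sqrt(2.0 * sq_singular_sum / (m * u))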
Bounds for graph-based algorithms

Consistency method [Zhou, Bousquet, Lal, Weston, Scholkopf, '03]: by Lemma 2,
$R_{m+u}(\mathcal{H}_{out}) \le \mu \sqrt{\frac{2}{mu} \sum_i \lambda_i^2}$,
where $\lambda_i$ are the singular values of the algorithm's ULD matrix $U$.
Similar bounds hold for the algorithms of [Joachims, '03], [Belkin et al., '04], etc.
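Usage sketch, carrying over U, n, m from the ULD example and rademacher_bound from the previous sketch; with ±1 labels on m training points, $\|\alpha\|_2 \le \sqrt{m}$, so one may take $\mu = \sqrt{m}$:

    print(rademacher_bound(U, mu=np.sqrt(m), m=m, u=n - m))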
Topics not covered
- Bounding the Rademacher complexity when $U$ is a kernel matrix.
- For some algorithms: a data-dependent method for computing probabilistic upper and lower bounds on the Rademacher complexity.
- A risk bound for transductive mixtures.
Directions for future research
- Tighten the risk bound to allow effective model selection:
  - a bound depending on the 0/1 empirical error;
  - use of variance information to obtain better convergence rates;
  - local transductive Rademacher complexity.
- Clever data-dependent choice of hypothesis spaces with low Rademacher complexity.
Monte Carlo estimation of transductive Rademacher complexity

Recall: $R_{m+u}(\mathcal{H}) = Q \, \mathbb{E}_\sigma \left[ \sup_{h \in \mathcal{H}} \sigma \cdot h \right]$, with $Q = \frac{1}{m} + \frac{1}{u}$.
Draw $n$ i.i.d. vectors of Rademacher variables $\sigma^{(1)}, \ldots, \sigma^{(n)}$ and average the suprema.
By Hoeffding's inequality: for any $\delta > 0$, with probability at least $1-\delta$,
$R_{m+u}(\mathcal{H}) \le \frac{Q}{n} \sum_{j=1}^n \sup_{h \in \mathcal{H}} \sigma^{(j)} \cdot h + B \sqrt{\frac{2 \ln(1/\delta)}{n}}$,
where $B$ bounds the range of $Q \sup_{h \in \mathcal{H}} \sigma \cdot h$.
How to compute the supremum? For the consistency method of [Zhou et al., '03] the supremum can be computed efficiently.
A symmetric application of Hoeffding's inequality yields a probabilistic lower bound on the transductive Rademacher complexity.
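A sketch of this estimator with Hoeffding-style upper and lower confidence bounds, for a finite hypothesis set (rows of H are soft labelings); the range bound B is a crude assumption, and each one-sided bound holds with probability at least 1-delta:

    import numpy as np

    def mc_rademacher(H, m, u, n_draws=2000, delta=0.05, seed=0):
        rng = np.random.default_rng(seed)
        p = m * u / (m + u) ** 2
        Q = 1.0 / m + 1.0 / u
        sups = []
        for _ in range(n_draws):
            r = rng.random(m + u)
            sigma = np.where(r < p, 1.0, np.where(r < 2 * p, -1.0, 0.0))
            sups.append(Q * np.max(H @ sigma))
        est = float(np.mean(sups))
        B = Q * np.abs(H).sum(axis=1).max()       # values lie in [-B, B] since |sigma.h| <= ||h||_1
        slack = B * np.sqrt(2.0 * np.log(1.0 / delta) / n_draws)
        return est - slack, est, est + slack      # lower bound, estimate, upper bound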
Induction vs. Transduction: differences
- Induction: unknown underlying distribution. Transduction: no unknown distribution; each example has a unique label.
- Induction: test examples are not known, but will be sampled from the same distribution. Transduction: test examples are known.
- Induction: generate a general hypothesis; want generalization! Transduction: only classify the given examples; no generalization!
- Induction: independent training examples. Transduction: dependent training and test examples.