Relaxed Transfer of Different Classes via Spectral Partition. Xiaoxiao Shi (1), Wei Fan (2), Qiang Yang (3), Jiangtao Ren (4). (1) University of Illinois at Chicago; (2) IBM T. J. Watson Research Center; (3) Hong Kong University of Science and Technology; (4) Sun Yat-sen University. 1. Unsupervised. 2. Can use data with different classes to help. How so?
2 What is Transfer Learning? Standard Supervised Learning: training (labeled): New York Times; test (unlabeled): New York Times. Classifier: 85.5%.
3 What is Transfer Learning? In reality: training (labeled): New York Times; test (unlabeled): New York Times. Labeled data are insufficient! 47.3%. How to improve the performance?
4 What is Transfer Learning? Source domain, training (labeled): Reuters; target domain, test (unlabeled): New York Times. Transfer Classifier: 82.6%. The two are not necessarily from the same domain and do not follow the same distribution.
5 Transfer across Different Class Labels. Source domain, training (labeled): Reuters; target domain, test (unlabeled): New York Times. Transfer Classifier: 82.6%. Since they are from different domains, they may have different class labels! Labels: Markets Politics Entertainment Blogs …… Labels: World U. S. Fashion Style Travel …… How to transfer when class labels are different in number and meaning?
6 Two Main Categories of Transfer Learning. Unsupervised Transfer Learning: –No labeled data from the target domain. –Use the source domain to help learning. –Question: is it better than clustering? Supervised Transfer Learning: –A limited number of labeled examples from the target domain. –Question: is it better than not using any source data?
7 Transfer across Different Class Labels. Two sub-problems: –(1) What and how to transfer, since we cannot explicitly use P(x|y) or P(y|x) to build the similarity among tasks (the class labels ‘y’ have different meanings)? –(2) How to avoid negative transfer, since the tasks may be from very different domains? Negative transfer: when the tasks are too different, transfer learning may hurt learning accuracy.
8 The proposed solution. (1) What and how to transfer? –Transfer the eigenspace. Eigenspace: the space spanned by a set of eigenvectors. When a dataset exhibits complex cluster shapes, K-means performs very poorly in the original space due to its bias toward dense spherical clusters. In the eigenspace (the space given by the eigenvectors), the clusters are trivial to separate. -- Spectral Clustering
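To make the eigenspace idea concrete, here is a minimal sketch of a spectral embedding on two concentric rings, a shape that defeats K-means in the original space. It is an illustration only, not the authors' code: the Gaussian affinity, the bandwidth `sigma`, the ring radii, and the function name `spectral_embedding` are all assumptions.

```python
import numpy as np

def spectral_embedding(X, sigma=0.5, k=2):
    """Return the k eigenvectors of the normalized graph Laplacian
    with the smallest eigenvalues (hypothetical helper)."""
    # Pairwise squared distances and a Gaussian affinity (assumed kernel).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]

# Two concentric rings (radii 1 and 3), 40 points each.
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([1.0 * ring, 3.0 * ring])

emb = spectral_embedding(X)
# In the eigenspace, the sign of the second eigenvector splits the rings.
labels = (emb[:, 1] > 0).astype(int)
```

In this toy setting a simple threshold on one eigenvector is enough; with more clusters one would typically run K-means on the rows of `emb` instead.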
10 The proposed solution. (2) How to avoid negative transfer? –A new clustering-based KL divergence to reflect distribution differences. –If the distributions are too different (KL is large), automatically decrease the effect of the source domain. Traditional KL divergence: need to solve P(x), Q(x) for every x, which is normally difficult to obtain. To get the clustering-based KL divergence: (1) Perform clustering on the combined dataset. (2) Calculate the KL divergence from some basic statistical properties of the clusters. See Example.
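For reference, the standard KL divergence between two discrete distributions P and Q, which is presumably the "traditional" form the slide refers to, is written below. It shows why P(x) and Q(x) are needed for every x.

```latex
\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
```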
11 An Example. Cluster the combined dataset of P and Q into clusters C1 and C2. S(P’, C) is the portion of the examples in cluster C that come from P; S(Q’, C) is the same for Q. S(P’, C1) = 0.5, S(Q’, C1) = 0.5, S(P’, C2) = 5/9, S(Q’, C2) = 4/9. E(P) = 8/15, E(Q) = 7/15. P’(C1) = 3/15, Q’(C1) = 3/15, P’(C2) = 5/15, Q’(C2) = 4/15. KL = 0.0309
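The cluster statistics in this example can be recomputed from per-cluster counts, as in the sketch below. The counts (3 P and 3 Q examples in C1, 5 P and 4 Q examples in C2) are read off the slide's P’(C) and Q’(C) values. The final step is an assumption: the slide text does not show the formula behind KL = 0.0309, so the sketch uses a plain discrete KL over cluster memberships as a stand-in and should not be expected to reproduce that number.

```python
import math
from fractions import Fraction

# counts[cluster] = (number of P examples, number of Q examples)
counts = {"C1": (3, 3), "C2": (5, 4)}

n_p = sum(p for p, _ in counts.values())
n_q = sum(q for _, q in counts.values())
n = n_p + n_q

# Share of the combined dataset that comes from P and from Q.
E_P = Fraction(n_p, n)
E_Q = Fraction(n_q, n)

# Share of the combined dataset that is "from P (or Q) and in cluster C".
P_prime = {c: Fraction(p, n) for c, (p, q) in counts.items()}
Q_prime = {c: Fraction(q, n) for c, (p, q) in counts.items()}

# Share of cluster C that comes from P (or Q).
S_P = {c: Fraction(p, p + q) for c, (p, q) in counts.items()}
S_Q = {c: Fraction(q, p + q) for c, (p, q) in counts.items()}

# Stand-in (assumption, not the slide's formula): discrete KL between
# the cluster-membership distributions P'(C)/E(P) and Q'(C)/E(Q).
kl_stand_in = sum(
    float(P_prime[c] / E_P)
    * math.log(float(P_prime[c] / E_P) / float(Q_prime[c] / E_Q))
    for c in counts
)
```

Using `Fraction` keeps the intermediate values exact, so they can be compared directly against the fractions printed on the slide.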
12 Objective Function. Objective: find an eigenspace that separates the target data well. –Intuition: if the source data is similar to the target data, make good use of the source eigenspace; –otherwise, keep the original structure of the target data. Terms of the objective: Traditional Normalized Cut; Penalty Term. Prefer the source eigenspace vs. prefer the original structure: balanced by R(L; U). The more similar the distributions, the smaller R(L; U) is, and the more the function relies on the source eigenspace TL.
13 How to construct the constraints TL and TU? Principle: –To construct TL: it is directly derived from the “must-link” constraint (examples with the same label should be together). Example: 1, 2, 4 should be together (blue); 3, 5, 6 should be together (red). –To construct TU: (1) perform standard spectral clustering (e.g., Ncut) on U; (2) the examples in the same cluster should be together. Example: 1, 2, 3 should be together; 4, 5, 6 should be together.
14 How to construct the constraints TL and TU? Construct the constraint matrix M = [m1, m2, …, mr]’. For example, ML has one row per must-link pair:
1, -1, 0, 0, 0, 0 (1 and 2)
1, 0, 0, -1, 0, 0 (1 and 4)
0, 0, 1, 0, -1, 0 (3 and 5)
……
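The construction of a must-link constraint matrix can be sketched as follows. This is a minimal illustration, assuming each group is encoded by linking its first member to every other member; the helper names `pairs_from_labels` and `must_link_matrix` are hypothetical, and the same code would apply to cluster assignments on U in place of labels.

```python
import numpy as np

def pairs_from_labels(labels):
    """Link the first example of each group to every later member
    (1-based example indices, as on the slide)."""
    first_seen = {}
    pairs = []
    for idx, lab in enumerate(labels, start=1):
        if lab in first_seen:
            pairs.append((first_seen[lab], idx))
        else:
            first_seen[lab] = idx
    return pairs

def must_link_matrix(pairs, n):
    """One row per pair (i, j): +1 in column i, -1 in column j."""
    M = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        M[row, i - 1] = 1.0
        M[row, j - 1] = -1.0
    return M

# Slide example: 1, 2, 4 are blue; 3, 5, 6 are red.
labels = ["blue", "blue", "red", "blue", "red", "red"]
pairs = pairs_from_labels(labels)
M_L = must_link_matrix(pairs, n=len(labels))

# Any assignment that respects the groups satisfies M_L @ y == 0.
y = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
residual = M_L @ y
```

Each row encodes "these two examples must receive the same value", so a partition vector that keeps every group together lies in the null space of the matrix.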
15 Experiment Data sets
16 Experiment data sets
17 Text Classification
Comp1 VS Rec1: 1: comp2 VS Rec2; 2: 4 classes (Graphics, etc); 3: 3 classes (crypt, etc)
Org1 VS People1: 1: org2 VS People2; 2: 3 classes (Places, etc); 3: 3 classes (crypt, etc)
18 Image Classification
Homer VS Real Bear: 1: Superman VS Teddy; 2: 3 classes (cartman, etc); 3: 4 classes (laptop, etc)
Cartman VS Fern: 1: Superman VS Bonsai; 2: 3 classes (homer, etc); 3: 4 classes (laptop, etc)
19 Parameter Sensitivity
20 Conclusions. Problem: transfer across tasks with different class labels. Two sub-problems: (1) What and how to transfer? Transfer the eigenspace. (2) How to avoid negative transfer? Propose an effective clustering-based KL divergence; if KL is large (the distributions are too different), decrease the effect of the source domain.
21 Thanks! Datasets and code:
22 # Clusters? Condition for Lemma 1 to be valid: in each cluster, the expected values of the target and source data are about the same. Adaptively control the number of clusters to guarantee that Lemma 1 is valid! --Stop bisecting clustering when there is only target or only source data in the cluster, or when … is close to 0.
23 Optimization. Algorithm flow.