
1 Transfer Learning with Applications to Text Classification Jing Peng Computer Science Department

2 Machine learning: the study of algorithms that ① improve performance P ② on some task T ③ using experience E. A well-defined learning task:

3 Learning to recognize targets in images:

4 Learning to classify text documents:

5 Learning to build forecasting models:

6 Growth of Machine Learning. Machine learning is the preferred approach to ① speech processing ② computer vision ③ medical diagnosis ④ robot control ⑤ news article processing ⑥ … This machine learning niche is growing thanks to ① improved machine learning algorithms ② lots of available data ③ software too complex to code by hand ④ …

7 Learning. Given training data and least squares methods, learning focuses on minimizing the error, which decomposes into an approximation error (determined by the hypothesis space H) and an estimation error.
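The decomposition referred to on this slide is standard; the slide's exact formulas are not recoverable from the transcript, so the risk notation R and the predictors f-hat, f* below are my own reconstruction:

    % Excess risk of the learned hypothesis \hat{f} over the best possible risk,
    % split into what the hypothesis class H can express (approximation error)
    % and what the finite sample costs us (estimation error).
    \[
    R(\hat{f}) - R(f^{*})
      = \underbrace{\inf_{h \in H} R(h) - R(f^{*})}_{\text{approximation error (fixed by } H\text{)}}
      + \underbrace{R(\hat{f}) - \inf_{h \in H} R(h)}_{\text{estimation error (due to finite data)}}
    \]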

8 Main challenges: 1. transfer learning; 2. high dimensionality (more than 4000 features); 3. overlapping feature sets (fewer than 80% of the features are shared); 4. a solution with performance bounds. Transfer Learning with Applications to Text Classification

9 Standard Supervised Learning: the training (labeled) and test (unlabeled) documents both come from the New York Times, and the classifier reaches 85.5%.

10 In reality, labeled New York Times data is not available: the classifier is trained on labeled Reuters documents and tested on unlabeled New York Times documents, and accuracy drops to 64.1%.

11 Domain Difference -> Performance Drop

                        train     test              classifier
    ideal setting       NYT       New York Times    85.5%
    realistic setting   Reuters   New York Times    64.1%
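A minimal sketch of the comparison behind these numbers. The corpora variables (nyt_docs, nyt_labels, reuters_docs, reuters_labels) are placeholders, and the TF-IDF plus logistic regression pipeline is my own choice; the slides do not name the classifier:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def accuracy(train_docs, train_labels, test_docs, test_labels):
        # TF-IDF features + logistic regression (illustrative classifier choice)
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(train_docs, train_labels)
        return clf.score(test_docs, test_labels)

    # Ideal setting: labeled training data and test data both from the New York Times.
    nyt_train, nyt_test, y_train, y_test = train_test_split(
        nyt_docs, nyt_labels, test_size=0.3, random_state=0)
    print("in-domain accuracy:", accuracy(nyt_train, y_train, nyt_test, y_test))

    # Realistic setting: labeled data only from Reuters, test on the New York Times.
    print("cross-domain accuracy:",
          accuracy(reuters_docs, reuters_labels, nyt_test, y_test))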

12 High Dimensional Data Transfer. High dimensional data: text categorization, image classification; the number of features in our experiments is more than 4000. Challenges: high dimensionality (more features than training examples), so Euclidean distance becomes meaningless.

13 Why Dimension Reduction? (figure: maximum and minimum pairwise distances, D_MAX and D_MIN)

14-15 Curse of Dimensionality (plot; x-axis: dimensions)
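The point of the plot: as the number of dimensions grows with a fixed number of points, the gap between the farthest and the nearest neighbor shrinks relative to the nearest-neighbor distance, so Euclidean distance loses its discriminative power. A small simulation of this effect (my own illustration, not taken from the slides; uniform random data is an assumption):

    # Distance concentration: D_MAX / D_MIN approaches 1 as dimensionality grows.
    import numpy as np

    rng = np.random.default_rng(0)
    n_points = 200
    for dim in (2, 10, 100, 1000, 4000):
        X = rng.uniform(size=(n_points, dim))
        q = rng.uniform(size=dim)                  # a query point
        d = np.linalg.norm(X - q, axis=1)          # Euclidean distances to all points
        print(f"dim={dim:5d}  D_MAX/D_MIN = {d.max() / d.min():.3f}")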

16 High Dimensional Data Transfer. High dimensional data: text categorization, image classification; the number of features in our experiments is more than 4000. Challenges: high dimensionality (more features than training examples), so Euclidean distance becomes meaningless. Are the feature sets completely overlapping? No: fewer than 80% of the features are shared. Are the marginal distributions not closely related? Then transferable structures are harder to find, and a proper similarity definition is needed.

17 PAC (Probably Approximately Correct) learning requirement: the training and test distributions must be the same.

18 Transfer between high dimensional overlapping distributions. Overlapping distributions: data from the two domains may not come from the same part of the space; at best, the spaces overlap.

19-22 Transfer between high dimensional overlapping distributions. Data from the two domains may not lie in exactly the same space, but at most in an overlapping one. A toy example ('?' marks a feature that is not observed for that instance):

            x       y       z       label
        A   ?       1       0.2     +1
        B   0.09    ?       0.1     +1
        C   0.01    ?       0.3

23-26 Transfer between high dimensional overlapping distributions. Problems with overlapping distributions: the overlapping features alone may not provide sufficient predictive power, so it is hard to predict correctly.

            f1      f2      f3      label
        A   ?       1       0.2     +1
        B   0.09    ?       0.1     +1
        C   0.01    ?       0.3

27-29 Transfer between high dimensional overlapping distributions. Overlapping distributions: use the union of all features and fill in the missing values with zeros? Does it help?

            f1      f2      f3      label
        A   0       1       0.2     +1
        B   0.09    0       0.1     +1
        C   0.01    0       0.3

30-32 Transfer between high dimensional overlapping distributions. D²{A, B} = 0.0181 > D²{A, C} = 0.0101, so A is misclassified into the class of C instead of the class of B.

33 Transfer between high dimensional overlapping distributions. When one uses the union of overlapping and non-overlapping features and replaces the missing values with zeros, the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features: they become the dominant factor in the similarity measure.
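A small numeric check of this effect on the toy rows A, B, C above. The computation here uses all three zero-filled coordinates and is my own illustration; it need not reproduce the slide's figures, which may be defined over a different feature subset:

    # Zero-fill the missing coordinates of A, B, C and compare squared distances.
    import numpy as np

    A = np.array([0.00, 1.00, 0.2])   # the first coordinate was missing
    B = np.array([0.09, 0.00, 0.1])   # the second coordinate was missing
    C = np.array([0.01, 0.00, 0.3])   # the second coordinate was missing

    d2_AB = np.sum((A - B) ** 2)
    d2_AC = np.sum((A - C) ** 2)
    # The zero-filled non-overlapping coordinate contributes 1.0 to both distances,
    # swamping the contribution of the shared features.
    print(d2_AB, d2_AC)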

34 Transfer between high dimensional overlapping distributions. High dimensionality can undermine important features.

35-36 (figure) The "blues" are closer to the "greens" than to the "reds".

37 LatentMap: a two-step correction. Step 1, missing value regression: bring the marginal distributions closer. Step 2, latent-space dimensionality reduction: further bring the marginal distributions closer, ignore unimportant, noisy, and "error-imported" features, and identify transferable substructures across the two domains.

38-44 Missing Value Regression: predict the missing values (recall the previous example). 1. Project onto the overlapped feature z. 2. Map from z back to x, using the relationship found by regression. With the imputed value, D{img(A'), B} = 0.0109 < D{img(A'), C} = 0.0125, so A is correctly classified into the same class as B.
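A minimal sketch of this first step as I read it from these slides, assuming two document matrices that share a block of overlapping columns; the ridge regressor and the variable names are my own choices, not the paper's:

    # Step 1 (missing value regression): learn a map from the overlapping features
    # to the features observed only in one domain, then use it to fill those
    # features in for the other domain.
    from sklearn.linear_model import Ridge

    def fill_missing(X_src_overlap, X_src_only, X_tgt_overlap):
        """Predict the source-only features for target-domain documents.

        X_src_overlap : (n_src, d_overlap)  overlapping features, source domain
        X_src_only    : (n_src, d_extra)    features observed only in the source
        X_tgt_overlap : (n_tgt, d_overlap)  overlapping features, target domain
        """
        reg = Ridge(alpha=1.0)
        reg.fit(X_src_overlap, X_src_only)   # relationship found by regression
        return reg.predict(X_tgt_overlap)    # imputed values for the target domain

Run in the other direction, the same construction fills in the target-only features for source documents, so both domains end up represented on the union of features.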

45-49 Dimensionality Reduction (diagram): the word-vector matrix is assembled from the overlapping features together with the missing values, which have now been filled in.

50-52 Dimensionality Reduction. Project the word-vector matrix onto its most important, inherent sub-space to obtain a low-dimensional representation.
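A minimal sketch of this second step, assuming the filled word-vector matrix from both domains has been stacked into a single array X_filled; the use of a truncated SVD follows the slides, while the rank k and the function name are illustrative:

    # Step 2 (dimensionality reduction): project the filled word-vector matrix onto
    # its top-k singular directions to obtain a low-dimensional representation.
    import numpy as np

    def latent_projection(X_filled, k=100):
        # economy-size SVD of the (documents x words) matrix
        U, S, Vt = np.linalg.svd(X_filled, full_matrices=False)
        return U[:, :k] * S[:k]    # low-dimensional document representation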

53-57 Solution (high dimensionality), recalling the previous example: in the original high-dimensional space the "blues" are closer to the "greens" than to the "reds"; after the correction, the "blues" are closer to the "reds" than to the "greens".

58 Properties
It brings the marginal distributions of the two domains closer:
- the marginal distributions are brought closer in the high-dimensional space (Section 3.2)
- the gap between the two marginal distributions is further reduced in the low-dimensional space (Theorem 3.2)
It brings the conditional distributions of the two domains closer:
- nearby instances from the two domains have similar conditional distributions (Section 3.3)
It can reduce the domain-transfer risk:
- the risk of the nearest neighbor classifier can be bounded in the transfer learning setting (Theorem 3.3)

59-62 Experiment (I). Data sets:
- 20 Newsgroups: 20,000 newsgroup articles
- SRAA (simulated real auto aviation): 73,128 articles from 4 discussion groups (simulated auto racing, simulated aviation, real autos, and real aviation)
- Reuters: 21,758 Reuters news articles (1987)
Protocol: first fill in the "gap", then use a kNN classifier to do the classification (a sketch follows below). In 20 Newsgroups, in-domain and out-of-domain data come from different sub-categories of the same top-level groups (e.g. comp.sys vs. comp.graphics under comp, rec.sport vs. rec.auto under rec).
Baseline methods: naive Bayes, logistic regression, SVMs; Knn-Reg (missing values filled, but without SVD) and pLatentMap (SVD, but missing values left as 0). The last two are meant to justify the two steps in our framework.
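A sketch of how this evaluation could be run once both steps have produced a shared low-dimensional representation. Z_src, Z_tgt and the label arrays are placeholders; the slides name kNN but not its parameters:

    # Train a nearest-neighbour classifier on the (labeled) source-domain rows of
    # the latent space and evaluate it on the target-domain rows.
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(Z_src, y_src)                              # labeled source domain
    print("target accuracy:", knn.score(Z_tgt, y_tgt)) # held-out target labels, used only for scoring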

63 Learning Tasks

64 Experiment (II). Overall performance: 10 wins, 1 loss.

65 Experiment (III). knnReg (missing values filled, but without SVD): compared with knnReg, 8 wins, 3 losses. pLatentMap (SVD, but without filling missing values): compared with pLatentMap, 8 wins, 3 losses.

66 Conclusion. Problem: high dimensional overlapping domain transfer (text and image categorization). Step 1: fill in the missing values, which brings the two domains' marginal distributions closer. Step 2: SVD dimension reduction, which further brings the two marginal distributions closer (Theorem 3.2) and clusters points from the two domains, making the conditional distribution transferable (Theorem 3.3).

67

