COP5992 – DATA MINING TERM PROJECT
RANDOM SUBSPACE METHOD + CO-TRAINING
by SELIM KALAYCI
RANDOM SUBSPACE METHOD (RSM)
- Proposed by Ho, "The Random Subspace Method for Constructing Decision Forests", 1998.
- A combining technique for weak classifiers, like Bagging and Boosting.
RSM ALGORITHM
1. Repeat for b = 1, 2, ..., B:
   (a) Select an r-dimensional random subspace X̃b from the original p-dimensional feature space X.
   (b) Construct a classifier Cb(x) in X̃b.
2. Combine the classifiers Cb(x), b = 1, 2, ..., B, by simple majority voting into a final decision rule.
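As a concrete illustration, here is a minimal Python sketch of the algorithm above. It assumes scikit-learn decision trees as the base classifier and integer class labels 0..K-1; the class name RandomSubspaceEnsemble and the defaults for B and r are illustrative choices, not from Ho's paper.

```python
# Minimal RSM sketch (assumes scikit-learn and NumPy; integer labels 0..K-1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomSubspaceEnsemble:
    def __init__(self, B=100, r=5, random_state=0):
        self.B = B                       # number of base classifiers
        self.r = r                       # dimensionality of each random subspace
        self.rng = np.random.default_rng(random_state)
        self.members = []                # list of (feature_indices, fitted_tree)

    def fit(self, X, y):
        p = X.shape[1]
        for _ in range(self.B):
            # 1(a): select an r-dimensional random subspace of the p features
            idx = self.rng.choice(p, size=self.r, replace=False)
            # 1(b): construct a base classifier in that subspace
            tree = DecisionTreeClassifier().fit(X[:, idx], y)
            self.members.append((idx, tree))
        return self

    def predict(self, X):
        # 2: combine the B classifiers by simple majority voting
        votes = np.array([tree.predict(X[:, idx]) for idx, tree in self.members])
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```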
MOTIVATION FOR RSM
- Redundancy in the data feature space:
  - a completely redundant feature set, or
  - redundancy spread over many features.
- Weak classifiers that have critical training sample sizes.
RSM PERFORMANCE ISSUES
RSM performance depends on:
- the training sample size,
- the choice of base classifier,
- the choice of combining rule (simple majority vs. weighted),
- the degree of redundancy in the dataset,
- the number of features chosen.
DECISION FORESTS (by Ho)
- A combination of trees instead of a single tree.
- Assumption: the dataset has some redundant features.
- Works efficiently with any decision tree algorithm and data-splitting method.
- Ideally, look for the best individual trees with the lowest tree similarity.
UNLABELED DATA
- A small number of labeled documents.
- A large pool of unlabeled documents.
- How can we classify the unlabeled documents accurately?
EXPECTATION-MAXIMIZATION (E-M)
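In this semi-supervised setting, E-M treats the missing labels as latent variables: estimate labels for the unlabeled documents with the current model (E-step), then re-fit the model on all documents (M-step), and iterate. Below is a minimal "hard-EM" sketch, assuming a scikit-learn Gaussian naive Bayes model as an illustrative choice; the classic formulation uses expected (soft) labels rather than hard assignments.

```python
# Minimal semi-supervised hard-EM sketch (assumes scikit-learn and NumPy;
# GaussianNB is an illustrative model choice, not prescribed by the slides).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_semi_supervised(X_l, y_l, X_u, n_iter=20):
    model = GaussianNB().fit(X_l, y_l)        # initialize from labeled data only
    X_all = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        # E-step: estimate labels for the unlabeled documents
        y_u = model.predict(X_u)
        # M-step: re-fit the model on labeled + provisionally labeled data
        y_all = np.concatenate([y_l, y_u])
        model = GaussianNB().fit(X_all, y_all)
    return model
```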
CO-TRAINING
Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", 1998.
Requirements:
- two sufficiently strong feature sets,
- conditionally independent given the class.
APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm (see the sketch after this list):
- Obtain a small set L of labeled examples.
- Obtain a large set U of unlabeled examples.
- Obtain two sets F1 and F2 of features that are sufficiently redundant.
- While U is not empty do:
  - Learn classifier C1 from L based on F1.
  - Learn classifier C2 from L based on F2.
  - For each classifier Ci do:
    - Ci labels examples from U based on Fi.
    - Ci chooses the most confidently predicted examples E from U.
    - E is removed from U and added (with the labels Ci assigned) to L.
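A minimal Python sketch of this loop, assuming scikit-learn naive Bayes as the base learner; the view index arrays view1/view2, the growth size k, and the max_rounds cap are illustrative parameters, not part of the original algorithm statement.

```python
# Minimal co-training sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X_l, y_l, X_u, view1, view2, k=5, max_rounds=50):
    L_X, L_y, U = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_rounds):
        if len(U) == 0:
            break
        for view in (view1, view2):
            if len(U) == 0:
                break
            # learn a classifier from L restricted to this view's features
            clf = GaussianNB().fit(L_X[:, view], L_y)
            proba = clf.predict_proba(U[:, view])
            # choose the k most confidently predicted examples from U
            conf = proba.max(axis=1)
            pick = np.argsort(conf)[-k:]
            # move them from U to L, labeled with the classifier's predictions
            L_X = np.vstack([L_X, U[pick]])
            L_y = np.concatenate([L_y, clf.classes_[proba[pick].argmax(axis=1)]])
            U = np.delete(U, pick, axis=0)
    clf1 = GaussianNB().fit(L_X[:, view1], L_y)
    clf2 = GaussianNB().fit(L_X[:, view2], L_y)
    return clf1, clf2
```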
THINGS TO DO
- How can we measure redundancy and use it efficiently?
- Can we improve co-training?
- How can we apply RSM efficiently to:
  - supervised learning,
  - semi-supervised learning,
  - unsupervised learning?
QUESTIONS?