Text Learning
Tom M. Mitchell, Aladdin Workshop, Carnegie Mellon University, January 2003

1. CoTraining: learning from labeled and unlabeled data

Redundantly Sufficient Features (figure: a faculty page containing "Professor Faloutsos", linked to by the anchor text "my advisor"; either the words on the page or the words of the incoming hyperlinks suffice to classify the page)

CoTraining Setting
If
– x1, x2 are conditionally independent given y
– f is PAC learnable from noisy labeled data
Then
– f is PAC learnable from a weak initial classifier plus unlabeled data
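A compact formalization of this setting (standard co-training notation following Blum and Mitchell (1998); the symbols below are a reconstruction, not taken verbatim from the slide): each instance is seen through two views,

    x = (x_1, x_2) \in X_1 \times X_2, \qquad f : X \to Y
    \exists\, g_1, g_2 : \; g_1(x_1) = g_2(x_2) = f(x) \quad \text{(redundantly sufficient views)}
    x_1 \perp x_2 \mid y \quad \text{(conditional independence)}

Under these assumptions the theorem on the slide applies: a weak initial classifier over one view, together with unlabeled data, is enough to PAC-learn f.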

Co-Training Rote Learner (figure: pages and hyperlinks, e.g. "my advisor")
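The original slide is a figure (a bipartite graph linking the page view and the hyperlink view, with "my advisor" as the shared anchor text) and is not reproduced in this transcript. As an illustration of the rote-learner idea it depicts, here is a minimal Python sketch, under the assumption that the learner simply memorizes co-occurring view values and propagates labels through connected components; all names are hypothetical:

    from collections import defaultdict

    def rote_cotrain(labeled, unlabeled):
        """Rote co-training: treat each distinct view-1 value (e.g., a page) and
        view-2 value (e.g., a hyperlink anchor phrase such as "my advisor") as a
        node, connect nodes that co-occur in an example, and propagate labels
        through the resulting bipartite graph."""
        neighbors = defaultdict(set)          # node -> co-occurring nodes
        for x1, x2 in list(labeled) + list(unlabeled):
            neighbors[("v1", x1)].add(("v2", x2))
            neighbors[("v2", x2)].add(("v1", x1))

        # Seed labels from the labeled examples.
        label_of = {}
        frontier = []
        for (x1, x2), y in labeled.items():
            for node in (("v1", x1), ("v2", x2)):
                if node not in label_of:
                    label_of[node] = y
                    frontier.append(node)

        # Propagation: every node in a labeled example's connected component
        # inherits that label (traversal order does not matter here).
        while frontier:
            node = frontier.pop()
            for nb in neighbors[node]:
                if nb not in label_of:
                    label_of[nb] = label_of[node]
                    frontier.append(nb)

        # An unlabeled example is labeled if either of its views was reached.
        return {(x1, x2): label_of.get(("v1", x1), label_of.get(("v2", x2)))
                for x1, x2 in unlabeled}

For example, if a labeled faculty page is pointed to by the anchor text "my advisor", any unlabeled page pointed to by the same anchor text inherits that label.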

What if the CoTraining Assumption Is Not Perfectly Satisfied?
Idea: we want classifiers that produce a maximally consistent labeling of the data.
If learning is an optimization problem, what function should we optimize?

What Objective Function?
– Error on labeled examples
– Disagreement over unlabeled examples
– Misfit to estimated class priors

What Function Approximators?

– Same functional form as Naïve Bayes and Max Entropy
– Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4
– No word-independence assumption; use both labeled and unlabeled data
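The slide gives only the decomposition E = E1 + E2 + E3 + E4. One plausible instantiation, assuming logistic view classifiers and squared-error penalties (a reconstruction; the exact forms from the original slide are not in this transcript), is:

    g_j(x_j) = \frac{1}{1 + \exp(-w_j \cdot x_j)}, \qquad j = 1, 2
    E_1 = \sum_{(x, y) \in L} \left[ (g_1(x_1) - y)^2 + (g_2(x_2) - y)^2 \right]
    E_2 = \sum_{x \in U} \left( g_1(x_1) - g_2(x_2) \right)^2
    E_3 + E_4 = (\hat{p}_1 - p)^2 + (\hat{p}_2 - p)^2, \qquad \hat{p}_j = \frac{1}{|U|} \sum_{x \in U} g_j(x_j)

Here L and U are the labeled and unlabeled sets, p is the estimated class prior, and \hat{p}_j is view j's mean prediction over U. All four terms are differentiable in (w_1, w_2), so E can be minimized jointly by gradient descent, which is what "Gradient CoTraining" below refers to.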

Gradient CoTraining

Classifying Jobs for FlipDog: X1 = job title, X2 = job description

Gradient CoTraining: Classifying FlipDog job descriptions (SysAdmin vs. WebProgrammer)
Final accuracy:
– Labeled data alone: 86%
– CoTraining: 96%

Gradient CoTraining: Classifying upper-case sequences as person names
Conditions: 25 labeled + 5,000 unlabeled examples; 2,300 labeled + 5,000 unlabeled examples
Methods: using labeled data only; CoTraining; CoTraining without fitting class priors (E4)
(* = result sensitive to the weights of the error terms E3 and E4; reported accuracies include .89* and .85*)

CoTraining Summary
Key is getting the right objective function:
– Class priors are an important term
– Can min-cut algorithms accommodate this?
...and minimizing it:
– Gradient descent has local-minima problems
– Is graph partitioning possible?
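To make the preceding slides concrete, here is a minimal sketch of the gradient co-training loop, assuming binary labels y in {0, 1}, the logistic view classifiers, and the squared-error objective sketched earlier; the function, argument names, and hyperparameters are illustrative, not from the original talk:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_cotrain(X1_l, X2_l, y, X1_u, X2_u, prior,
                         lr=0.1, steps=1000, lam=(1.0, 1.0, 1.0)):
        """Jointly fit two logistic view classifiers g1, g2 by gradient descent on
        E = E1 (labeled error) + E2 (disagreement on unlabeled) + E3 + E4 (prior misfit)."""
        w1 = np.zeros(X1_l.shape[1])
        w2 = np.zeros(X2_l.shape[1])
        for _ in range(steps):
            g1_l, g2_l = sigmoid(X1_l @ w1), sigmoid(X2_l @ w2)
            g1_u, g2_u = sigmoid(X1_u @ w1), sigmoid(X2_u @ w2)
            s1_l, s2_l = g1_l * (1 - g1_l), g2_l * (1 - g2_l)   # sigmoid derivatives
            s1_u, s2_u = g1_u * (1 - g1_u), g2_u * (1 - g2_u)
            # E1: squared error on labeled examples, one term per view
            grad1 = 2 * X1_l.T @ ((g1_l - y) * s1_l)
            grad2 = 2 * X2_l.T @ ((g2_l - y) * s2_l)
            # E2: disagreement between the two views on unlabeled examples
            diff = g1_u - g2_u
            grad1 += lam[0] * 2 * (X1_u.T @ (diff * s1_u))
            grad2 -= lam[0] * 2 * (X2_u.T @ (diff * s2_u))
            # E3, E4: misfit between each view's mean prediction and the class prior
            grad1 += lam[1] * 2 * (g1_u.mean() - prior) * (X1_u.T @ s1_u) / len(g1_u)
            grad2 += lam[2] * 2 * (g2_u.mean() - prior) * (X2_u.T @ s2_u) / len(g2_u)
            w1 -= lr * grad1
            w2 -= lr * grad2
        return w1, w2

The lam weights on the disagreement and prior-misfit terms correspond to the sensitivity to E3 and E4 noted in the person-name results above.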

The Problem/Opportunity
We must train the classifier to be website-independent, but many sites exhibit website-specific regularities.
Question: how can a program learn website-specific regularities for millions of sites, without human-labeled data?

Learn Local Regularities for Page Classification (figure: a site map with pages such as CEd.html and Music.html)
1. Label the site using the global classifier (e.g., identifying a continuing-education course page)
2. Learn local, site-specific classifiers; an example learned rule (read as code after this list):
   CECourse(x) :- under(x, ...), linkto(x, ...), 1 < inDegree(x) < 4, globalConfidence(x) > 0.3
3. Apply the local classifiers to modify the global labels
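The rule above is a conjunction of graph tests on a candidate page x. A minimal Python rendering of such a test follows; the site object, its methods, and the hub/index page arguments are hypothetical placeholders for the elided rule arguments, not the actual implementation:

    def ce_course(x, site):
        # CECourse(x) :- under(x, ...), linkto(x, ...), 1 < inDegree(x) < 4, globalConfidence(x) > 0.3
        return (site.under(x, site.hub_page)            # x lies below some hub page in the site tree
                and site.linkto(x, site.index_page)     # some index page links to x
                and 1 < site.in_degree(x) < 4           # x has 2 or 3 incoming links within the site
                and site.global_confidence(x) > 0.3)    # the global classifier is at least mildly confident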

Results of Local Learning: Continuing-Education Course Pages
– Learning the global classifier only: precision .81, recall .80
– Learning the global classifier plus site-specific classifiers for 20 local sites: precision .82, recall .90

Learning Site-Specific Regularities: Example 2 Extracting “Course-Title” from web pages

Local/Global Learning Algorithm (a sketch of this loop follows the list)
– Train a global course-title extractor (word-based)
– For each new university site:
  – Apply the global title extractor
  – For each page containing extracted titles:
    – Learn page-specific rules for extracting titles, based on page layout structure
    – Apply the learned rules to refine the initial labeling
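A minimal sketch of this local/global loop, assuming generic extractor interfaces; the function names and the rule-learning step are placeholders, not the actual implementation from the talk:

    def process_site(pages, global_extractor, learn_layout_rules):
        """For one new site: label with the global extractor, learn site-specific
        layout rules from those labels, then use the rules to refine the labels."""
        # 1. Apply the global (word-based) title extractor to every page.
        initial = {page: global_extractor(page) for page in pages}

        # 2. On pages where titles were extracted, learn page-specific rules
        #    expressed over layout structure (a restricted hypothesis language,
        #    to cope with the small amount of self-labeled training data).
        training = {p: titles for p, titles in initial.items() if titles}
        local_rules = learn_layout_rules(training)

        # 3. Re-extract with the local rules to refine the initial labeling.
        refined = {page: local_rules(page) or initial[page] for page in pages}
        return refined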

Local/Global Learning Summary
Approach:
– Learn a global extractor/classifier using content features
– Learn a local extractor/classifier using layout features
– Design a restricted hypothesis language for the local learner, to accommodate sparse training data
Algorithm to process a new site:
– Apply the global extractor/classifier to label the site
– Train the local extractor/classifier on this data
– Apply the local extractor/classifier to refine the labels

Other Local Learning Approaches
– Rule-covering algorithms: each rule is a local model, but they require supervised labeled data for each locality
– Shrinkage-based techniques, e.g., for learning hospital-independent and hospital-specific models of medical outcomes; again, these require labeled data for each hospital
This approach is different: no labeled data is needed for new sites.

When/Why Does This Work?
– Local and global models use independent, redundantly sufficient features
– Local models are learned within a low-dimensional hypothesis language
– Related to co-training!

Other Uses?
+ Global and website-specific information extractors
+ Global and program-specific TV segment classifiers?
+ Global and environment-specific robot perception?
– Global and speaker-specific speech recognition?
– Global and hospital-specific medical diagnosis?

Summary
CoTraining:
– Classifier learning as a minimization problem
– Is a graph-partitioning algorithm possible?
Learning site-specific structure:
– Important structure involves long-distance relationships
– Strong local regularities in graph structure are highly useful