Web Taxonomy Integration through Co-Bootstrapping Dell Zhang National University of Singapore Wee Sun Lee National University of Singapore SIGIR’04

Introduction

Problem Statement
Master taxonomy (e.g., Google):
- Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
- Games > Strategy: Shogun: Total War
Source taxonomy (e.g., Yahoo!):
- Games > Online: EverQuest Addict, Warcraft III Clan
- Games > Single-Player: Warcraft III Clan
Goal: integrate the source sites into the master taxonomy:
- Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
- Games > Strategy: Shogun: Total War, Warcraft III Clan

Possible Approach
Train a classifier on the master categories' own sites:
- Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
- Games > Strategy: Shogun: Total War
Then classify the source sites (EverQuest Addict, Warcraft III Clan).
Drawback: this ignores the original Yahoo! categories.

Another Approach (1/2)
Use the Yahoo! categories themselves.
Advantage: the two taxonomies contain similar categories.
Potential problem: the taxonomies have different structures, so categories do not match exactly.

Another Approach (2/2)
Example: the same site, Crayon Shin-chan, sits under different paths in the two directories:
- Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan
- Arts > Animation > Anime > Titles > C > Crayon Shin-chan

This Paper's Approach
1. Weak Learner (as opposed to Naïve Bayes)
2. Boosting to combine weak hypotheses
3. New idea: Co-Bootstrapping to exploit source categories

Assumptions
- Multi-category data are reduced to binary data: the multi-labeled site (Totoro Fan: Cartoon > My Neighbor Totoro, Toys > My Neighbor Totoro) is converted into two single-label examples, (Totoro Fan, Cartoon > My Neighbor Totoro) and (Totoro Fan, Toys > My Neighbor Totoro). A minimal sketch of this reduction follows below.
- Hierarchies are ignored: Console > Sega and Console > Sega > Dreamcast are treated as unrelated categories.
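A minimal sketch of the reduction, assuming sites are stored as a mapping from site name to its category list (all names here are illustrative, not from the paper):

```python
# Reduce multi-category data to independent (site, category) pairs,
# treating each hierarchical path as one opaque label.
sites = {
    "Totoro Fan": ["Cartoon > My Neighbor Totoro",
                   "Toys > My Neighbor Totoro"],
}
pairs = [(site, cat) for site, cats in sites.items() for cat in cats]
# -> [("Totoro Fan", "Cartoon > My Neighbor Totoro"),
#     ("Totoro Fan", "Toys > My Neighbor Totoro")]
```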

Weak Learner: part 1 of 3 (Weak Learner, Boosting, Co-Bootstrapping)

Weak Learner
A simple type of classifier, similar in spirit to Naïve Bayes.
- Output sign: + = accept, − = reject
- The term may be a word, an n-gram, ...
After training, the weak learner outputs a weak hypothesis: a term-based classifier.

Weak Hypothesis Example
- contains "Crayon Shin-chan" → in "Comics > Crayon Shin-chan", not in "Education > Early Childhood"
- does not contain "Crayon Shin-chan" → not in "Comics > Crayon Shin-chan", in "Education > Early Childhood"

Weak Learner Inputs (1/2)
Training data are in the form $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$
- $x_i$ is a document
- $y_i$ is a category
- $(x_i, y_i)$ means document $x_i$ is in category $y_i$
$D(x, y)$ is a distribution over all (document, category) combinations; $D(x_i, y_j)$ indicates the "importance" of the pair $(x_i, y_j)$.
$w$ is the term (found automatically).

Weak Learner Algorithm
For each possible category $y$, compute four weighted counts, indexed by whether a document contains the term $w$ ($j = 1$ if it does, $j = 0$ if not) and whether it really belongs to $y$ ($b = +1$ or $-1$):
$$W_b^{j}(y) = \sum_{i\,:\,[\![w \in x_i]\!] = j,\ Y_i[y] = b} D(x_i, y), \qquad j \in \{0, 1\},\ b \in \{+1, -1\}$$
where $[\![w \in x_i]\!] = 1$ if $x_i$ contains $w$ (else $0$), and $Y_i[y] = +1$ if $x_i$ belongs to $y$ (else $-1$).
Note: a pair $(x_i, y)$ with greater $D(x_i, y)$ has more influence.

Weak Hypothesis $h(x, y)$
Given an unclassified document $x$ and category $y$:
If $x$ contains $w$:
$$h(x, y) = \frac{1}{2} \ln \frac{W_{+1}^{1}(y) + \varepsilon}{W_{-1}^{1}(y) + \varepsilon}$$
Else, if $x$ does not contain $w$:
$$h(x, y) = \frac{1}{2} \ln \frac{W_{+1}^{0}(y) + \varepsilon}{W_{-1}^{0}(y) + \varepsilon}$$
($\varepsilon$ is a small smoothing constant.)

Weak Learner Comments
- If $\mathrm{sign}[h(x, y)] = +$, then $x$ is in $y$; $|h(x, y)|$ is the confidence.
- The term $w$ is found as follows: run the weak learner for every possible $w$ and choose the run with the smallest normalization factor $Z$ as the model.
- Boosting minimizes the probability of $h(x, y)$ having the wrong sign.
A runnable sketch of this weak learner follows below.
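A minimal Python sketch of the term-based weak learner, following Schapire and Singer's confidence-rated formulation that these slides describe. The data layout (documents as term sets, `labels[i]` as the set of categories of document `i`, `D` keyed by `(i, y)`) and all names are assumptions for illustration, not the paper's code:

```python
import math
from collections import defaultdict

def train_weak_hypothesis(docs, labels, D, vocab, cats, eps=1e-8):
    """For each candidate term w, compute the four weighted counts
    W[(j, b, y)] (j: does the doc contain w; b: is the doc in y),
    and keep the term whose hypothesis minimizes Z."""
    best = None
    for w in vocab:
        W = defaultdict(float)                  # (j, b, y) -> weight
        for i, x in enumerate(docs):
            j = 1 if w in x else 0
            for y in cats:
                b = +1 if y in labels[i] else -1
                W[(j, b, y)] += D[(i, y)]
        # Z = 2 * sum_{j,y} sqrt(W[j,+1,y] * W[j,-1,y])
        Z = 2.0 * sum(math.sqrt(W[(j, +1, y)] * W[(j, -1, y)])
                      for j in (0, 1) for y in cats)
        if best is None or Z < best[0]:
            best = (Z, w, dict(W))
    _, w, W = best

    def h(x, y):
        # Smoothed half-log-ratio; sign = accept/reject, |h| = confidence.
        j = 1 if w in x else 0
        return 0.5 * math.log((W.get((j, +1, y), 0.0) + eps)
                              / (W.get((j, -1, y), 0.0) + eps))
    return w, h
```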

Boosting (AdaBoost.MH): part 2 of 3 (Weak Learner, Boosting, Co-Bootstrapping)

Boosting Idea
1. Train the weak learner on a sequence of distributions $D_t(x, y)$.
2. After each round, adjust $D_t(x, y)$ to put more weight on the most often misclassified training pairs.
3. Output the final hypothesis as a linear combination of the weak hypotheses.

Boosting Algorithm
Given: $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in Y$
Initialize $D_1(x, y) = 1/(mk)$
for $t = 1, \dots, T$ do
  Pass distribution $D_t$ to the weak learner
  Get weak hypothesis $h_t(x, y)$
  Choose $\alpha_t \in \mathbb{R}$
  Update $D_{t+1}(x, y) = D_t(x, y) \exp(-\alpha_t\, Y[y]\, h_t(x, y)) / Z_t$, where $Y[y] = +1$ if $x$ is in category $y$ (else $-1$) and $Z_t$ is a normalization factor
end for
Output the final hypothesis $H(x, y) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x, y)\big)$

Boosting Algorithm: Initialization
Given: $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$
Initialize $D_1(x, y) = 1/(mk)$, where $m$ is the number of training examples and $k$ is the total number of categories: a uniform distribution.

Boosting Algorithm: Loop
for $t = 1, \dots, T$ do
  Run the weak learner using distribution $D_t$
  Get weak hypothesis $h_t(x, y)$
  For each pair $(x, y)$ in the training data: if $h_t(x, y)$ guesses incorrectly, increase $D(x, y)$
end for
Return the final hypothesis $\mathrm{sign}\big(\sum_t \alpha_t h_t(x, y)\big)$. See the sketch below.
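A compact sketch of the AdaBoost.MH loop above, reusing `train_weak_hypothesis` from the earlier sketch. Folding $\alpha_t$ into the confidence-rated weak hypotheses (i.e., $\alpha_t = 1$) follows Schapire and Singer and is an assumption here, not necessarily the paper's exact choice:

```python
import math

def adaboost_mh(docs, labels, cats, vocab, T=10):
    """Boost term-based weak hypotheses; docs are term sets,
    labels[i] is the set of categories of document i."""
    m, k = len(docs), len(cats)
    D = {(i, y): 1.0 / (m * k) for i in range(m) for y in cats}  # uniform
    hyps = []
    for _ in range(T):
        _, h = train_weak_hypothesis(docs, labels, D, vocab, cats)
        hyps.append(h)
        # Reweight: pairs that h gets wrong (b * h < 0) gain weight.
        for i, x in enumerate(docs):
            for y in cats:
                b = +1 if y in labels[i] else -1
                D[(i, y)] *= math.exp(-b * h(x, y))
        Z = sum(D.values())
        D = {key: v / Z for key, v in D.items()}
    # Final hypothesis: sign(sum_t h_t(x, y)); return the raw sum so
    # callers can use both the sign and the confidence.
    return lambda x, y: sum(h(x, y) for h in hyps)
```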

Co-Bootstrapping: part 3 of 3 (Weak Learner, Boosting, Co-Bootstrapping)

Co-Bootstrapping Idea We want to use Yahoo! categories to increase classification accuracy

Recall the Example Problem
Source taxonomy:
- Games > Online: EverQuest Addict, Warcraft III Clan
- Games > Single-Player: Warcraft III Clan
Master taxonomy:
- Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
- Games > Strategy: Shogun: Total War

Co-Bootstrapping Algorithm (1/4)
1. Run AdaBoost on the Yahoo! sites; get classifier Y1.
2. Run AdaBoost on the Google sites; get classifier G1.
3. Run Y1 on the Google sites; get predicted Yahoo! categories for the Google sites.
4. Run G1 on the Yahoo! sites; get predicted Google categories for the Yahoo! sites.

Co-Bootstrapping Algorithm (2/4)
5. Run AdaBoost on the Yahoo! sites, now including the predicted Google categories as features; get classifier Y2.
6. Run AdaBoost on the Google sites, now including the predicted Yahoo! categories as features; get classifier G2.
7. Run Y2 on the original Google sites; get more accurate Yahoo! categories for the Google sites.
8. Run G2 on the original Yahoo! sites; get more accurate Google categories for the Yahoo! sites.

Co-Bootstrapping Algorithm (3/4)
9. Run AdaBoost on the Yahoo! sites, including the refined Google categories as features; get classifier Y3.
10. Run AdaBoost on the Google sites, including the refined Yahoo! categories as features; get classifier G3.
11. Run Y3 on the original Google sites; get even more accurate Yahoo! categories for the Google sites.
12. Run G3 on the original Yahoo! sites; get even more accurate Google categories for the Yahoo! sites.

Co-Bootstrapping Algorithm (4/4)
Repeat, repeat, and repeat... The hope is that classification becomes more accurate after each iteration. A sketch of the loop follows below.
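A minimal sketch of the co-bootstrapping loop, reusing the `adaboost_mh` sketch above. Encoding the other taxonomy's (predicted or true) categories as pseudo-terms such as `"GCAT=Games>Online"` is an illustrative choice, not necessarily the paper's exact feature representation; all names are assumptions:

```python
def co_bootstrap(g_docs, g_labels, y_docs, y_labels, g_cats, y_cats,
                 vocab, rounds=3, T=10):
    """Each round trains one boosted classifier per taxonomy,
    cross-predicts, and injects the predicted categories of the
    other taxonomy as pseudo-term features for the next round."""
    g_text = [set(d) for d in g_docs]
    y_text = [set(d) for d in y_docs]
    # Each site's own true categories, encoded as pseudo-terms.
    g_true = [{f"GCAT={c}" for c in g_labels[i]} for i in range(len(g_docs))]
    y_true = [{f"YCAT={c}" for c in y_labels[i]} for i in range(len(y_docs))]
    g_ycat = [set() for _ in g_docs]   # predicted Yahoo! cats of Google sites
    y_gcat = [set() for _ in y_docs]   # predicted Google cats of Yahoo! sites
    vocab = (set(vocab) | {f"GCAT={c}" for c in g_cats}
                        | {f"YCAT={c}" for c in y_cats})
    for _ in range(rounds):
        # Train on text plus the other taxonomy's predicted categories.
        G = adaboost_mh([t | p for t, p in zip(g_text, g_ycat)],
                        g_labels, g_cats, vocab, T)   # predicts Google cats
        Y = adaboost_mh([t | p for t, p in zip(y_text, y_gcat)],
                        y_labels, y_cats, vocab, T)   # predicts Yahoo! cats
        # Cross-predict; the scored side contributes its true categories.
        g_ycat = [{f"YCAT={c}" for c in y_cats if Y(t | g_true[i], c) > 0}
                  for i, t in enumerate(g_text)]
        y_gcat = [{f"GCAT={c}" for c in g_cats if G(t | y_true[i], c) > 0}
                  for i, t in enumerate(y_text)]
    return G, Y
```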

Enhanced Naïve Bayes (Benchmark)

Enhanced Naïve Bayes (1/2)
Given a document $x$ and its source category $S$, predict the master category $C$.
Plain NB: $\Pr[C \mid x] \propto \Pr[C] \prod_{w \in x} \Pr[w \mid C]^{n(x, w)}$
- $w$: a word
- $n(x, w)$: the number of occurrences of $w$ in $x$
ENB: $\Pr[C \mid x, S] \propto \Pr[C \mid S] \prod_{w \in x} \Pr[w \mid C]^{n(x, w)}$

Enhanced Naïve Bayes (2/2)
$\Pr[C]$ is estimated as the fraction of training documents in $C$.
$\Pr[C \mid S]$ is estimated from $|C \cap S|$: the number of documents in $S$ that the plain NB classifier puts into $C$. ENB weights this source evidence with a tuning parameter $\omega$ (see the sketch below).
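A log-space sketch of the ENB score. The exact form of $\Pr[C \mid S]$ here, $\Pr[C] \cdot (1 + |C \cap S|)^{\omega}$, follows the spirit of Agrawal and Srikant's catalog-integration scheme and is an assumption, not copied from the slides; all names are illustrative:

```python
import math
from collections import Counter

def enb_score(doc, C, S, prior, word_prob, overlap, omega=10.0):
    """doc: list of words; prior[C]: Pr[C]; word_prob[C][w]: Pr[w|C];
    overlap[(C, S)] = |C ∩ S|, the number of documents of source
    category S that plain NB assigns to master category C; omega is
    ENB's weight parameter."""
    # log Pr[C|S]: boost the prior by how often S's documents land in C.
    log_prior = math.log(prior[C]) + omega * math.log(1 + overlap[(C, S)])
    log_likelihood = sum(n * math.log(word_prob[C].get(w, 1e-9))
                         for w, n in Counter(doc).items())
    return log_prior + log_likelihood
```

Classification then picks the master category maximizing this score: `max(cats, key=lambda C: enb_score(doc, C, S, ...))`.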

Experiment

Datasets
Dataset  | Google                              | Yahoo!
Book     | /Top/Shopping/Publications/Books    | /Business and Economy/Shopping and Services/Books/Bookstores
Disease  | /Top/Health/Conditions and Diseases | /Health/Diseases and Conditions
Movie    | /Top/Arts/Movies/Genres             | /Entertainment/Movies and Film/Genres
Music    | /Top/Arts/Music/Styles              | /Entertainment/Music/Genres
News     | /Top/News/By Subject                | /News and Media

Number of Categories* per Dataset (1/2)
Dataset | Google | Yahoo!
Book    | 49     | 41
Disease | 30     | 51
Movie   | 34     | 25
Music   | 47     | 24
News    | 27     | 34
*Top-level categories only.

Number of Categories* per Dataset (2/2)
Example (Book): only top-level categories such as Horror, Science Fiction, and Non-fiction are kept; deeper subcategories such as Biography and History are merged into their top-level parent, Non-fiction.

Number of Websites
Dataset | G      | Y      | G∪Y    | G∩Y
Book    | 10,842 | 11,268 | 21,…   | …
Disease | 34,047 | 9,785  | 41,439 | 2,393
Movie   | 36,787 | 14,366 | 49,744 | 1,409
Music   | 76,420 | 24,518 | 95,971 | 4,967
News    | 31,504 | 19,419 | 49,303 | 1,620

Method (1/2)
To evaluate classifying Yahoo! Book websites into Google Book categories (G ← Y):
1. Find G ∩ Y for Book.
2. Hide the Google categories of the sites in G ∩ Y.
3. Use G ∩ Y as the "Yahoo! Book" test set.
4. Randomly take |G ∩ Y| sites from G − Y as the "Google Book" training set.

Method (2/2)
For each dataset, run G ← Y five times and Y ← G five times.
- Macro F-score: compute the F-score for each category, then average over all categories.
- Micro F-score: compute the F-score over the entire dataset.
Open questions: is recall trivially 100%? The paper doesn't say how ENB handles multi-category documents. A sketch of the two averaging schemes follows below.
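A generic sketch of the two averaging schemes (not the paper's evaluation code), assuming per-category true-positive, false-positive, and false-negative counts are already tallied:

```python
def macro_micro_f1(tp, fp, fn):
    """Macro- vs micro-averaged F1 from per-category counts
    (tp/fp/fn: dicts mapping category -> count)."""
    def f1(t, p, n):
        prec = t / (t + p) if t + p else 0.0
        rec = t / (t + n) if t + n else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Macro: F1 per category, then average over categories.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in tp) / len(tp)
    # Micro: pool all counts, then one F1 over the whole dataset.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```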

Results (1/3)
Co-Bootstrapping-AdaBoost > AdaBoost, on both macro-averaged and micro-averaged F-scores.

Results (2/3)
Co-Bootstrapping-AdaBoost iteratively improves AdaBoost (Book dataset).

Results (3/3)
Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes, on both macro-averaged and micro-averaged F-scores.

Contribution
- Co-Bootstrapping improves boosting performance.
- Unlike ENB, it does not require tuning a weight parameter ($\omega$).