1
Presenter: Lung-Hao Lee (李龍豪), January 7, 2010 @ Room 309
2
Introduction
Calculating Page Similarity
Finding Similar Pages
◦ Click Data Model (CDM)
◦ Query Constraint (QC) algorithm
Experimental Results
Discussion
Conclusion
3
Annotating data carries a large labor cost
The aggregated click data across many users over time provides valuable information
Leverage click logs to augment training data by propagating class labels to unlabeled similar documents
4
“Two pages that tend to be clicked by the same user queries tend to be topically similar”
Example: pages A and B are both clicked through the queries “How to tie a tie”, “How to tie a neck tie knots”, and “Tying a tie”. Page A is labeled “Positive” (class “How-to”); page B’s label is unknown. Should B also be labeled “Positive”?
5
A page is represented as a node in the similarity graph
All URLs are normalized, e.g. the following four URLs are treated as the same page:
(1) “http://www.acm.org”
(2) “www.acm.org”
(3) “www.acm.org/”
(4) “http://www.acm.org/”
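The slides do not spell out the normalization rules beyond this example, so here is a minimal Python sketch, assuming normalization means adding a missing scheme, lowercasing the host, and dropping the trailing slash; normalize_url is a hypothetical helper, not the paper’s actual procedure:

```python
from urllib.parse import urlparse

def normalize_url(url):
    # Hypothetical normalization: the slide only shows that these
    # four ACM variants collapse to one node.
    if "://" not in url:
        url = "http://" + url          # add a missing scheme
    parsed = urlparse(url)
    host = parsed.netloc.lower()       # case-insensitive host
    path = parsed.path.rstrip("/")     # drop the trailing slash
    return host + path

urls = ["http://www.acm.org", "www.acm.org",
        "www.acm.org/", "http://www.acm.org/"]
assert len({normalize_url(u) for u in urls}) == 1  # all map to one node
```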
6
Each URL is represented as a vector of the queries that users issued and clicked through to the page (Pantel & Lin, 2002)
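As a sketch of this representation, assuming the click log arrives as (query, url, clicks) records; the record format and build_query_vectors are illustrative assumptions, not the paper’s API:

```python
from collections import Counter, defaultdict

def build_query_vectors(click_log):
    # Aggregate clicks so each URL gets a sparse vector of the
    # queries that led users to it.
    vectors = defaultdict(Counter)
    for query, url, clicks in click_log:
        vectors[url][query] += clicks
    return vectors

log = [("how to tie a tie", "pageA", 12),
       ("tying a tie", "pageA", 5),
       ("tying a tie", "pageB", 7)]
vectors = build_query_vectors(log)
# vectors["pageA"] == Counter({"how to tie a tie": 12, "tying a tie": 5})
```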
7
Compute the similarity between two pages as the cosine similarity of their respective feature vectors:
sim(p1, p2) > sim(p1, p3)
sim(p1, p2) > sim(p2, p3)
because p1 and p2 share more common queries with each other than either shares with p3
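A minimal sketch of this similarity over the sparse query vectors above; cosine is the standard definition, and the three example vectors are made up to mirror the p1/p2/p3 inequality:

```python
import math

def cosine(v1, v2):
    # Dot product over the shared queries, normalized by magnitudes.
    shared = set(v1) & set(v2)
    dot = sum(v1[q] * v2[q] for q in shared)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

p1 = {"how to tie a tie": 12, "tying a tie": 5}
p2 = {"how to tie a tie": 3, "tying a tie": 8}
p3 = {"neck tie shop": 4}
assert cosine(p1, p2) > cosine(p1, p3)  # p1, p2 share queries; p3 shares none
```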
8
What is a “seed set”? A small set of labeled data
Two algorithms for seed set expansion
◦ Click Data Model (CDM)
◦ Query Constraint (QC) algorithm
9
Two phases
◦ Updating-score phase
◦ Filtering phase
Input
◦ S1 (positive seed set)
◦ S2 (negative seed set)
◦ G (click graph)
Output
◦ E1 (expanded positive set)
◦ E2 (expanded negative set)
Thresholds
◦ 0.1 < T1 < 0.6
◦ 0.6 < T2 < 1.2
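The slide gives only CDM’s interface, so the following is a heavily hedged Python sketch; it assumes the updating-score phase accumulates similarity-weighted votes from the seeds and that the filtering phase keeps only nodes clearing the thresholds, while the paper’s actual update and filtering rules may differ:

```python
def cdm_expand(S1, S2, G, T1=0.3, T2=0.9):
    # G maps each page to {neighbor: cosine similarity}; the default
    # thresholds fall inside the ranges reported on the slide.
    scores = {}
    # Updating-score phase: unlabeled nodes collect weighted votes.
    for node, neighbors in G.items():
        if node in S1 or node in S2:
            continue
        pos = sum(w for n, w in neighbors.items() if n in S1)
        neg = sum(w for n, w in neighbors.items() if n in S2)
        scores[node] = (pos, neg)
    # Filtering phase: keep only confident expansions.
    E1, E2 = set(), set()
    for node, (pos, neg) in scores.items():
        if pos >= T2 and neg <= T1:
            E1.add(node)   # expanded positive
        elif neg >= T2 and pos <= T1:
            E2.add(node)   # expanded negative
    return E1, E2
```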
10
An additional module that checks whether the common queries between two nodes contain certain term patterns (e.g., the class term “review” or “how to”)
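A one-function sketch of that check, assuming a simple substring match against the class term; the actual pattern matching is not specified on this slide:

```python
def passes_query_constraint(v1, v2, pattern="review"):
    # Require the queries shared by the two nodes to mention the
    # class term (e.g., "review" or "how to" per the experiments).
    shared = set(v1) & set(v2)
    return any(pattern in q.lower() for q in shared)

# Only propagate a "Review" label across an edge that passes:
# passes_query_constraint(vectors["pageA"], vectors["pageB"], "review")
```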
11
Reduce the amount of human annotation effort by leveraging the click data
Build an expansion model with labeled training data and use it to select the next round of training data
12
Click data
◦ Collected during December 2008 from the Yahoo! search engine
◦ Only the top 10 URLs are considered
◦ URLs with fewer than 10 clicks are excluded
Three classification tasks
◦ How-to
◦ Adult
◦ Review
13
Training sets
◦ 10,000 manually labeled positive and negative examples
◦ For the “Review” classifier, queries such as “digital camera reviews” or “baby swing reviews”
◦ For the “How-to” classifier, queries such as “how to clean uggs” or “best way to loose weight”
Testing sets
14
Classifier
◦ Gradient Boosting Decision Tree (GBDT)
Features
◦ Textual, link, URL, HTML, and other features
Metrics
◦ Area Under the ROC Curve (AUC) (Fawcett, 2003)
◦ F-score
◦ Accuracy
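For reference, a minimal sketch of computing these three metrics with scikit-learn on placeholder predictions (not the paper’s data):

```python
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

y_true  = [1, 0, 1, 1, 0]                    # placeholder labels
y_score = [0.9, 0.2, 0.6, 0.4, 0.3]          # classifier scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print(roc_auc_score(y_true, y_score))   # AUC (Fawcett, 2003)
print(f1_score(y_true, y_pred))         # F-score
print(accuracy_score(y_true, y_pred))   # Accuracy
```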
15
The biggest improvement from CDM is observed with a model using 5,000 labeled examples as the seed set (+1.07% in F-score, +0.81% in accuracy, and +0.25% in AUC)
16
Reduces the manual labeling effort by 50%
QC (excluding pages that do not have “review” in their query terms) is useful when the labeled data is small
17
With 1,000 and 2,000 human-labeled examples, CDM performs worse than the baseline
QC excludes pages that do not have “how to” in their query terms
18
Baseline: Type A
CDM: Type C
19
Expansion rounds for the “How-to” classifier: Seed 1 → Expand 1 → Seed 2 (human labels from Expand 1) → Expand 2
20
A random sample of 50 positive and 50 negative examples from the “How-to” classifier
The positive class has 82.3% precision, whereas the negative class has 83.6% precision
21
Is the proposed method always useful for web page classification?
How can we improve the quality of automatically labeled data from unlabeled data?
22
Present a method for improving web page classification by leveraging click data to augment training data
Augment manually labeled data by modeling the similarity between pages in a click graph
23
Thank you very much
Questions & Answers