Published by สมหมาย เก่งงาน. Modified over 5 years ago.
1
Using Link Information to Enhance Web Page Classification
Xiaoguang Qi, Brian D. Davison
2
Introduction
Web page classification is important:
  Browsing information by topic
  Query result tagging
  Finding similar documents
  Clustering query results
Applying textual classifiers to web data:
  Results are not satisfactory
3
Our Approach
Use information from neighboring pages to help judge a page’s topic
Four kinds of neighbors:
  Parents, children, siblings, co-spouses
Neighboring pages may already have been labeled:
  They may appear in web hierarchies
  Use the labels if available
Pages without existing labels: use a classifier
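The four neighbor types can be made concrete with a small sketch. It assumes the link graph is available as a list of (source, target) pairs, which the slides do not specify, and it reads "co-spouses" as other pages that link to one of the page's children. The function name `neighbor_sets` is hypothetical.

```python
from collections import defaultdict


def neighbor_sets(edges, page):
    """Return the four neighbor sets of `page` from (source, target) link pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[page])      # pages that link to `page`
    children = set(out_links[page])    # pages that `page` links to
    # Siblings: other pages linked from any parent of `page`.
    siblings = {s for p in parents for s in out_links[p]} - {page}
    # Co-spouses (assumed reading): other pages linking to any child of `page`.
    spouses = {s for c in children for s in in_links[c]} - {page}
    return {
        "parents": parents,
        "children": children,
        "siblings": siblings,
        "co-spouses": spouses,
    }
```

For example, with links p→x, p→s, x→c and q→c, the page x has parent p, child c, sibling s and co-spouse q.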
4
Other Considerations
Are the four kinds of neighbors equally important?
  Give them different weights: β = (β1, β2, β3, β4)
The use of a classifier may introduce noise
  Down-weight the results of the classifier: η (0 ≤ η ≤ 1)
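One possible way to apply β and η is sketched below. The slides give only the parameters, not the formula, so the details are assumptions: each neighbor contributes a topic probability vector, a neighbor with an existing label gets weight 1, a classifier-labeled neighbor gets weight η, each group is averaged, and the four group averages are mixed with β. The names `group_distribution` and `combine_groups` are hypothetical.

```python
def group_distribution(neighbors, eta):
    """Average the topic vectors of one neighbor group.

    neighbors: list of (topic_vector, has_existing_label) pairs.
    Assumption: existing labels weigh 1, classifier outputs weigh eta.
    Returns None when the group carries no weight at all.
    """
    if not neighbors:
        return None
    total = [0.0] * len(neighbors[0][0])
    weight_sum = 0.0
    for vector, has_label in neighbors:
        w = 1.0 if has_label else eta
        weight_sum += w
        for i, value in enumerate(vector):
            total[i] += w * value
    if weight_sum == 0.0:
        return None
    return [t / weight_sum for t in total]


def combine_groups(groups, beta, eta):
    """Mix group averages with beta.

    groups: four neighbor lists, ordered parents, children, siblings, co-spouses.
    Groups that are empty or have zero beta are skipped; the result is
    renormalized by the sum of the beta values actually used.
    """
    combined = None
    beta_sum = 0.0
    for b, neighbors in zip(beta, groups):
        dist = group_distribution(neighbors, eta)
        if dist is None or b == 0.0:
            continue
        if combined is None:
            combined = [0.0] * len(dist)
        for i, value in enumerate(dist):
            combined[i] += b * value
        beta_sum += b
    if combined is None:
        return None
    return [c / beta_sum for c in combined]
```

Under these assumptions, setting η = 0 means only neighbors with existing labels influence the result, and a β with a single nonzero entry uses one neighbor type only.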
5
Other Considerations (Cont.)
Do intra-host links count?
  They are often down-weighted or ignored in link-based ranking
  Web page classification is a different scenario
  Give them a weight: θ (θ = 0, 1)
Counting multiple paths
  Siblings may have multiple parents in common
  Weighted path version vs. unweighted path version
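The following sketch illustrates how θ and the two path-counting versions could interact for siblings. It is an assumption-heavy illustration, not the authors' stated formula: an intra-host link is given weight θ and any other link weight 1, a parent-to-sibling path is weighted by the product of its two links, the weighted path version sums over all common parents, and the unweighted version counts each sibling at most once. The names `link_weight` and `sibling_weights` are hypothetical.

```python
from urllib.parse import urlparse


def link_weight(src, dst, theta):
    """Assumed rule: intra-host links weigh theta, inter-host links weigh 1."""
    same_host = urlparse(src).netloc == urlparse(dst).netloc
    return theta if same_host else 1.0


def sibling_weights(children_of, page, theta, weighted_paths=True):
    """Weight each sibling of `page`.

    children_of maps a parent URL to the set of URLs it links to.
    weighted_paths=True sums over every common parent (one path each);
    weighted_paths=False keeps only the strongest single path.
    """
    weights = {}
    for parent, children in children_of.items():
        if page not in children:
            continue
        up = link_weight(parent, page, theta)
        for sib in children:
            if sib == page:
                continue
            path = up * link_weight(parent, sib, theta)
            if weighted_paths:
                weights[sib] = weights.get(sib, 0.0) + path
            else:
                weights[sib] = max(weights.get(sib, 0.0), path)
    return weights
```

With θ = 0, paths that pass through an intra-host link contribute nothing; with θ = 1, intra-host and inter-host links count equally.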
6
Other Considerations (Cont.)
Combining neighbors with the start page
  Weighted average: α (0 ≤ α ≤ 1)
  α * start page + (1 − α) * neighbors
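The weighted average above can be sketched as follows, assuming that both the start page and its neighbors are summarized as topic score vectors of equal length (the slides do not state the representation). The names `combine_with_start_page` and `predict_topic` are hypothetical.

```python
def combine_with_start_page(start, neighbors, alpha):
    """Return alpha * start + (1 - alpha) * neighbors, element by element."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(start) != len(neighbors):
        raise ValueError("topic vectors must have the same length")
    return [alpha * s + (1.0 - alpha) * n for s, n in zip(start, neighbors)]


def predict_topic(scores):
    """Pick the index of the highest combined score."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

With α = 0.2, a start page scored (1, 0) and neighbors scored (0, 1) combine to roughly (0.2, 0.8), so the neighbors' topic wins.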
7
Experimental Setup
12 top-level categories in the DMoz Directory
19,000 documents from each category to train the text classifier
1,000 for testing
Incoming links obtained by querying the Yahoo API
8
Parameter Tuning
9
Parameter Tuning (Cont.)
10
Parameter Tuning (Cont.)
11
Parameter Tuning (Cont.)
12
Experimental Results
Best performance is achieved with the settings:
α = 0.2, β = (0, 0, 1, 0), η = 0, θ = 1, weighted path version
13
Experimental Results (Cont.)
“DMoz copy effect”
  We are benefiting from it!
  It may affect the optimal parameter setting
Solution
  Remove the pages whose URLs contain DMoz directory names
  E.g. “Computers/Hardware”, “Business/Employment”
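The filtering step can be sketched as a simple substring check. The case-insensitive comparison and the function name `remove_dmoz_copies` are assumptions; the slides only say that pages whose URL contains a DMoz directory name are removed.

```python
def remove_dmoz_copies(urls, category_paths):
    """Drop URLs that contain any DMoz directory path such as 'Computers/Hardware'.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    lowered = [path.lower() for path in category_paths]
    return [
        url for url in urls
        if not any(path in url.lower() for path in lowered)
    ]
```

For instance, a URL ending in `/dir/Computers/Hardware/index.html` would be dropped, while a URL with no directory name in it would be kept.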
14
Conclusion
Improved the accuracy of web page classification
Explored the effects of a number of parameters
15
Future Work
Is the parameter tuning independent of the dataset?
Is our dataset representative of the web?
Other classifiers
What is the effect of the number of categories and their granularity?