Using Link Information to Enhance Web Page Classification
Xiaoguang Qi, Brian D. Davison
Introduction
- Web page classification is important
  - Browsing information by topic
  - Tagging query results
  - Finding similar documents
  - Clustering query results
- Applying textual classifiers to web data
  - Results are not satisfactory
Our Approach
- Use information from neighboring pages to help judge a page's topic
- Four kinds of neighbors
  - Parents, children, siblings, co-spouses
- Neighboring pages may already have been labeled
  - Appearing in web hierarchies
  - Use the labels if available
- Pages without existing labels: use a classifier
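The four neighbor kinds can be read off a link graph. The sketch below is illustrative only: the function name `find_neighbors` and the edge-list input are assumptions, and it takes siblings to be pages sharing a parent with the start page and co-spouses to be pages sharing a child with it.

```python
from collections import defaultdict


def find_neighbors(links, page):
    """Return the four neighbor sets of `page`.

    `links` is an iterable of (source, target) hyperlink pairs.
    The name and input format are hypothetical, not from the slides.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[page])    # pages that link to the start page
    children = set(out_links[page])  # pages the start page links to

    siblings = set()                 # other children of the start page's parents
    for parent in parents:
        siblings |= out_links[parent]
    siblings.discard(page)

    spouses = set()                  # other parents of the start page's children
    for child in children:
        spouses |= in_links[child]
    spouses.discard(page)

    return {
        "parents": parents,
        "children": children,
        "siblings": siblings,
        "spouses": spouses,
    }


example_links = [("a", "x"), ("a", "y"), ("x", "c"), ("z", "c")]
example_neighbors = find_neighbors(example_links, "x")
```

For the small example graph, page `x` has one neighbor of each kind.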
Other Considerations
- Are the four kinds of neighbors equally important?
  - Give them different weights: β = (β1, β2, β3, β4)
- Using a classifier may introduce noise
  - Down-weight the classifier's results by η, 0 ≤ η ≤ 1
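One way the two weights could interact is sketched below. Several things here are assumptions rather than details from the slides: topic distributions are plain lists, human-labeled neighbors get weight 1 while classifier-labeled neighbors get weight η, and β1 to β4 map to parents, children, siblings, and co-spouses in that order.

```python
# Assumed mapping of beta1..beta4 onto neighbor kinds (not stated in the slides).
NEIGHBOR_TYPES = ("parents", "children", "siblings", "spouses")


def type_distribution(pages, num_topics, eta):
    """Average the topic distributions of one kind of neighbor.

    `pages` is a list of (distribution, has_existing_label) pairs.
    Distributions produced by the classifier are down-weighted by eta.
    """
    total = [0.0] * num_topics
    weight_sum = 0.0
    for dist, has_existing_label in pages:
        weight = 1.0 if has_existing_label else eta
        weight_sum += weight
        for k in range(num_topics):
            total[k] += weight * dist[k]
    if weight_sum == 0:
        return total  # no usable neighbors of this kind
    return [t / weight_sum for t in total]


def combine_types(type_dists, beta):
    """Mix the per-kind distributions with weights beta and renormalize."""
    num_topics = len(next(iter(type_dists.values())))
    combined = [0.0] * num_topics
    for b, name in zip(beta, NEIGHBOR_TYPES):
        for k in range(num_topics):
            combined[k] += b * type_dists[name][k]
    norm = sum(combined)
    return [c / norm for c in combined] if norm > 0 else combined
```

With η = 0 the classifier's guesses are ignored entirely and only existing labels contribute; with η = 1 both sources count equally.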
Other Considerations (Cont.)
- Do intra-host links count?
  - They are often down-weighted or ignored in link-based ranking
  - Web page classification is a different scenario
  - Give them a weight: θ (θ = 0, 1)
- Counting multiple paths
  - Siblings may have multiple parents in common
  - Weighted path version vs. unweighted path version
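The two ideas on this slide can be illustrated separately. In this sketch, `link_weight` assumes that "intra-host" means source and target URLs share a hostname, and `sibling_weights` assumes that the weighted path version counts a sibling once per shared parent while the unweighted version counts it once. Both function names are hypothetical, and how the two weights are combined is not specified in the slides.

```python
from urllib.parse import urlsplit


def link_weight(src_url, dst_url, theta):
    """Weight theta for an intra-host link, 1.0 for an inter-host link."""
    same_host = urlsplit(src_url).hostname == urlsplit(dst_url).hostname
    return theta if same_host else 1.0


def sibling_weights(page, in_links, out_links, weighted_paths=True):
    """Map each sibling of `page` to its weight.

    `in_links` and `out_links` map a page to the set of pages linking to it
    and the set of pages it links to, respectively.
    """
    counts = {}
    for parent in in_links.get(page, ()):
        for sibling in out_links.get(parent, ()):
            if sibling != page:
                counts[sibling] = counts.get(sibling, 0) + 1
    if weighted_paths:
        return counts                   # one count per shared parent
    return {s: 1 for s in counts}       # each sibling counted once
```

A sibling that shares two parents with the start page gets weight 2 in the weighted path version and weight 1 in the unweighted one.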
Other Considerations (Cont.)
- Combining neighbors with the start page
  - Weighted average with α (0 ≤ α ≤ 1)
  - α * start page + (1 - α) * neighbors
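The weighted average can be written directly in code. This is a minimal sketch assuming both the start page and its neighbors are summarized as topic distributions of equal length; the names `combine_with_start_page` and `predict_topic` are illustrative.

```python
def combine_with_start_page(start_dist, neighbor_dist, alpha):
    """Return alpha * start page + (1 - alpha) * neighbors, per topic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * s + (1.0 - alpha) * n
            for s, n in zip(start_dist, neighbor_dist)]


def predict_topic(dist):
    """Index of the highest-scoring topic."""
    return max(range(len(dist)), key=lambda k: dist[k])


# The text classifier favors topic 0, the neighbors favor topic 1.
scores = combine_with_start_page([0.9, 0.1], [0.2, 0.8], alpha=0.2)
```

With a small α such as 0.2, the neighbors dominate: the combined scores in the example favor topic 1 even though the start page's own text points to topic 0.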
Experimental Setup
- 12 top-level categories in the DMoz Directory
- 19,000 documents from each category to train the text classifier
- 1,000 for testing
- Incoming links obtained by querying the Yahoo API
Parameter Tuning
Experimental Results
- Best performance is achieved with the settings α = 0.2, β = (0, 0, 1, 0), η = 0, θ = 1, weighted path version
Experimental Results (Cont.)
- "DMoz copy effect"
  - We are benefiting from it!
  - It may affect the optimal parameter setting
- Solution
  - Remove the pages whose URLs contain DMoz directory names
  - E.g. "Computers/Hardware", "Business/Employment"
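The filtering step could look roughly like the following. The list here holds only the two directory names given as examples on the slide; a real run would need the full set of DMoz directory paths. The function names and the plain case-sensitive substring match are assumptions.

```python
# Only the two examples from the slide; the complete list is not given there.
DMOZ_DIRECTORY_NAMES = ("Computers/Hardware", "Business/Employment")


def looks_like_dmoz_copy(url, directory_names=DMOZ_DIRECTORY_NAMES):
    """True if the URL contains any DMoz directory name as a substring."""
    return any(name in url for name in directory_names)


def remove_dmoz_copies(urls, directory_names=DMOZ_DIRECTORY_NAMES):
    """Keep only the URLs that do not look like copies of DMoz pages."""
    return [u for u in urls if not looks_like_dmoz_copy(u, directory_names)]


candidate_urls = [
    "http://example.com/dir/Computers/Hardware/",
    "http://example.org/news/index.html",
]
kept_urls = remove_dmoz_copies(candidate_urls)
```

In the example, the first URL is dropped because it contains "Computers/Hardware" and the second is kept.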
Conclusion
- Improved the accuracy of web page classification
- Explored the effects of a number of parameters
Future Work
- Is the parameter tuning independent of the dataset?
- Is our dataset representative of the web?
- Other classifiers
- What is the effect of the number of categories and their granularity?