Using Link Information to Enhance Web Page Classification


1 Using Link Information to Enhance Web Page Classification
Xiaoguang Qi, Brian D. Davison

2 Introduction
Web page classification is important
  Browsing information through topics
  Query result tagging
  Finding similar documents
  Clustering query results
Applying textual classifiers to web data
  Results are not satisfactory

3 Our Approach
Use information from neighboring pages to help judge a page’s topic
Four kinds of neighbors
  Parents, children, siblings, co-spouses
Neighboring pages may already have been labeled
  Appearing in web hierarchies
  Use the labels if available
Pages without existing labels: use a classifier
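The four neighbor kinds can be illustrated with a small sketch. The slide does not define them, so this assumes the usual link-graph reading: parents link to the page, children are linked from it, siblings share a parent with it, and co-spouses share a child with it. The function name `neighbor_sets` and the edge-list input are hypothetical.

```python
from collections import defaultdict


def neighbor_sets(edges, page):
    """Return the four neighbor sets of `page` in a directed link graph.

    `edges` is an iterable of (source, target) pairs: source links to target.
    The definitions below are an assumed reading of the slide, not the
    authors' code.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[page])    # pages that link to `page`
    children = set(out_links[page])  # pages that `page` links to

    siblings = set()                 # other children of the parents
    for parent in parents:
        siblings |= out_links[parent]
    siblings.discard(page)

    co_spouses = set()               # other parents of the children
    for child in children:
        co_spouses |= in_links[child]
    co_spouses.discard(page)

    return {
        "parents": parents,
        "children": children,
        "siblings": siblings,
        "co_spouses": co_spouses,
    }
```

For example, with edges a→p, a→s, p→c and q→c, page p has parent a, child c, sibling s and co-spouse q.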

4 Other Considerations
Are the four kinds of neighbors equally important?
  Give them different weights: β = (β1, β2, β3, β4)
The use of a classifier may introduce noise
  Down-weight the results of the classifier: η, 0≤η≤1
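One possible way to apply β and η is sketched below. It assumes each neighbor contributes a topic vector, that pages with existing labels get weight 1 while classifier outputs get weight η, and that each group is averaged before the β weights are applied. The slides do not spell out the exact aggregation, so the function names and the normalization are illustrative assumptions.

```python
def group_distribution(pages, eta, num_topics):
    """Weighted average of the topic vectors in one neighbor group.

    `pages` is a list of (topic_vector, has_label) pairs. Labeled pages get
    weight 1.0 and classifier-predicted pages get weight `eta` (assumption).
    """
    totals = [0.0] * num_topics
    weight_sum = 0.0
    for topic_vector, has_label in pages:
        weight = 1.0 if has_label else eta
        weight_sum += weight
        for i, value in enumerate(topic_vector):
            totals[i] += weight * value
    if weight_sum == 0.0:
        return totals  # empty group, or only classifier outputs with eta = 0
    return [t / weight_sum for t in totals]


def combine_groups(groups, beta, eta, num_topics):
    """Combine the four groups (parents, children, siblings, co-spouses)
    using the weights beta = (beta1, beta2, beta3, beta4)."""
    combined = [0.0] * num_topics
    for weight, pages in zip(beta, groups):
        distribution = group_distribution(pages, eta, num_topics)
        for i, value in enumerate(distribution):
            combined[i] += weight * value
    return combined
```

With β = (0, 0, 1, 0) only the sibling group contributes, and with η = 0 only siblings that carry existing labels count.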

5 Other Considerations (Cont.)
Do intra-host links count?
  They are often down-weighted or ignored in link-based ranking
  Web page classification is a different scenario
  Give them a weight: θ (θ = 0, 1)
Counting the multiple paths
  Siblings may have multiple parents in common
  Weighted path version vs. unweighted path version
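The interaction between θ and path counting can be sketched for siblings. This is a hypothetical reading of the slide: a link whose two ends share a host gets weight θ and any other link gets weight 1, the weighted path version sums over all shared parents, and the unweighted version counts each sibling at most once. None of these details are confirmed by the slides.

```python
from urllib.parse import urlsplit


def link_weight(src_url, dst_url, theta):
    """Weight of one link: theta if both ends share a host, else 1.0
    (assumed scheme)."""
    same_host = urlsplit(src_url).hostname == urlsplit(dst_url).hostname
    return theta if same_host else 1.0


def sibling_weights(page, in_links, out_links, theta, weighted_paths=True):
    """Weight of each sibling of `page`.

    `in_links` and `out_links` map a URL to the set of URLs linking to it
    or linked from it. A path page <- parent -> sibling is scored as the
    product of its two link weights.
    """
    weights = {}
    for parent in in_links.get(page, ()):
        up = link_weight(parent, page, theta)
        for sibling in out_links.get(parent, ()):
            if sibling == page:
                continue
            path = up * link_weight(parent, sibling, theta)
            if weighted_paths:
                # Weighted version: every shared parent adds to the total.
                weights[sibling] = weights.get(sibling, 0.0) + path
            else:
                # Unweighted version: keep only the best single path.
                weights[sibling] = max(weights.get(sibling, 0.0), path)
    return weights
```

A sibling reached through two different parents scores twice as high as one reached through a single parent in the weighted version, while both score the same in the unweighted version.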

6 Other Considerations (Cont.)
Combining neighbors with the start page
  Weighted average: α (0≤α≤1)
  α * start page + (1 - α) * neighbors
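The weighted average can be written out directly. This sketch assumes the start page and the neighbors are each summarized as a topic distribution of equal length; the function names are illustrative.

```python
def combine_with_start_page(start_dist, neighbor_dist, alpha):
    """Return alpha * start page + (1 - alpha) * neighbors, per topic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(start_dist) != len(neighbor_dist):
        raise ValueError("distributions must have the same length")
    return [
        alpha * s + (1.0 - alpha) * n
        for s, n in zip(start_dist, neighbor_dist)
    ]


def predicted_topic(distribution):
    """Index of the highest-scoring topic."""
    return max(range(len(distribution)), key=lambda i: distribution[i])
```

With α = 0.2, a start page scored (0.6, 0.4) and neighbors scored (0.1, 0.9) combine to roughly (0.2, 0.8), so the neighbors decide the predicted topic.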

7 Experimental Setup
12 top-level categories in the DMoz Directory
19,000 documents from each category to train the text classifier
1,000 for testing
Get incoming links by querying the Yahoo API

8 Parameter Tuning

9 Parameter Tuning (Cont.)

10 Parameter Tuning (Cont.)

11 Parameter Tuning (Cont.)

12 Experimental Results
Best performance is achieved at the settings:
  α = 0.2, β = (0, 0, 1, 0), η = 0, θ = 1, weighted path version

13 Experimental Results (Cont.)
“DMoz copy effect”
  We are benefiting from it!
  It may affect the optimal parameter setting
Solution
  Remove the pages whose URLs contain DMoz directory names
  E.g. “Computers/Hardware”, “Business/Employment”
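The filtering step can be sketched as a substring check on URLs. The case-insensitive matching and the function names are assumptions, and the directory names are the two examples given on the slide rather than a full list.

```python
def is_dmoz_copy(url, directory_names):
    """True if the URL contains any of the given DMoz directory names.

    Matching is a case-insensitive substring test (an assumption).
    """
    lowered = url.lower()
    return any(name.lower() in lowered for name in directory_names)


def remove_dmoz_copies(urls, directory_names):
    """Keep only the URLs that do not look like copies of DMoz pages."""
    return [url for url in urls if not is_dmoz_copy(url, directory_names)]


# Example directory names taken from the slide.
EXAMPLE_DIRECTORY_NAMES = ["Computers/Hardware", "Business/Employment"]
```

A hypothetical URL such as `http://example.com/dir/Computers/Hardware/` would be dropped, while `http://example.com/hardware-reviews` would be kept.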

14 Conclusion
Improved the accuracy of web page classification
Explored the effects of a number of parameters

15 Future Work
Is the parameter tuning independent of the dataset?
Is our dataset representative of the web?
Other classifiers
What is the effect of the number of categories and the granularity of the categories?

