Hypertext Categorization Rayid Ghani IR Seminar - 10/3/00.


1 Hypertext Categorization Rayid Ghani IR Seminar - 10/3/00

2

3 “Standard” Approach
- Apply traditional text learning algorithms
- In many cases, the goal is not to classify hypertext but to test the algorithms
- Is it actually the right approach?

4 Results?
- Mixed results
  - Positive results in most cases, BUT the goal was to test the algorithms
  - Negative in a few, e.g. Chakrabarti, BUT the goal was to motivate their own algorithm

5 How is hypertext different?
- Link information
- Diverse authorship
- Short text: topic not obvious from the text
- Structure / position within the web graph
- Author-supplied features (meta-tags)
- Bold, italics, headings, etc.

6 How to use those extra features?

7 Specific approaches to classify hypertext
- Chakrabarti et al. SIGMOD 98
- Oh et al. SIGIR 00
- Slattery & Mitchell ICML 00
- Goal is not classification but retrieval
  - Bharat & Henzinger SIGIR 98
  - Croft & Turtle 93

8 Chakrabarti et al. SIGMOD 98
- Use the page and linkage information
- Add words from the “neighbors” and treat them as belonging to the page itself
  - Decrease in performance (not surprising)
  - Link information is very noisy
- Use topic information from neighbors instead

9 Data Sets
- IBM Patent Database
  - 12 classes (630 train, 300 test for each class)
- Yahoo
  - 13 classes, 20000 docs (for experiments involving hypertext, only 900 documents were used) (?)

10 Experiments
- Using text from neighbors
  - Local+Neighbor_Text:
  - Local+Neighbor_Text_Tagged:
- Assume neighbors are pre-classified
  - Text: 36%
  - Link: 34%
  - Prefix: 22.1% (words in class hierarchy used)
  - Text+Prefix: 21%
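
The two neighbor-text variants named on this slide can be sketched roughly as follows. This is a minimal illustration, not the representation used in the paper: the function name and the `NBR@` prefix are hypothetical, and documents are assumed to be whitespace-tokenized strings.

```python
from collections import Counter

def local_plus_neighbor_text(doc, neighbors, tagged=False):
    """Bag of words for a page plus the text of its neighbors.

    tagged=False: neighbor words are merged with the page's own words
                  (the Local+Neighbor_Text idea).
    tagged=True:  neighbor words get a prefix so the learner can tell
                  them apart (the Local+Neighbor_Text_Tagged idea).
    The "NBR@" prefix is a made-up marker for this sketch.
    """
    bag = Counter(doc.split())
    for neighbor in neighbors:
        for token in neighbor.split():
            bag[("NBR@" + token) if tagged else token] += 1
    return bag

page = "patent antenna design"
nbrs = ["antenna array", "yahoo sports"]
merged = local_plus_neighbor_text(page, nbrs)
tagged = local_plus_neighbor_text(page, nbrs, tagged=True)
```

In the merged form, unrelated neighbor words such as "sports" become indistinguishable from the page's own text, which is one way to picture the noise the previous slide mentions.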

11 Oh et al. SIGIR 2000
- The relationship between the class and the neighbors of a web page in the training set is not consistent/useful (?)
- Instead, use the class and neighbor info of the page being classified (use regularities in the test set)

12 Classification
- Classify test instance d by:

13 Algorithm
- For each test document d, generate a set A of “trustable” neighbors
- For all terms t_i in d, adjust the term weight using the term weights from A
- For each doc a in A, assign a max confidence value if its class is known; otherwise assign a class probabilistically and give it a partial confidence weight
- Classify d using the equation given earlier.

14 Experiments
- Reuters used to assess the algorithm on datasets without hyperlinks, varying only the size of the training set and the number of features (?)
  - Results not directly comparable, but numbers similar to reported results
- Articles from an encyclopedia: 76 classes, 20836 documents

15 Results
- Terms+Classes > Only Classes > Only Terms > No use of inlinks

16 Other issues
- Link discrimination
- Knowledge of neighbor classes
- Use of links in training set
- Inclusion of new terms from neighbors

17 Comparison
                                 Chakrabarti   Oh et al.   Improvement
Links in training set            Y             N           5%
Link discrimination              N             Y           6.7%
Knowledge of neighbor class      Y             Y           6.6% 1.9%
Iteration                        Y             N           1.5%
Using new terms from neighbors   Y             N           31.4%

18 Slattery & Mitchell ICML 00
- Given a problem setting in which the test set contains structural regularities, how can we find and use them?

19 Hubs and Authorities
Kleinberg (1998)
“.. a good hub is a page that points to many good authorities; a good authority is a page pointed to by many good hubs.”
[Figure: Hubs, Authorities]

20 Hubs and Authorities
Kleinberg (1998)
“Hubs and authorities exhibit what could be called a mutually reinforcing relationship”
Iterative relaxation:
[Figure: Hubs, Authorities]
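
The mutually reinforcing relationship can be sketched as a short iteration over a link list. This is a minimal illustration of the hubs-and-authorities idea, assuming a fixed number of iterations and L2 normalization after each pass; the function names and the toy graph are made up for the example.

```python
import math

def normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}

def hits(links, iterations=50):
    """Hub and authority scores for a directed graph.

    links: iterable of (source, target) pairs.
    """
    links = list(links)
    nodes = {n for edge in links for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # A good hub points to good authorities.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

On this toy graph, `a1` ends up with a higher authority score than `a2` because two hubs point to it, and `h1` is the stronger hub because it points to both authorities.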

21 The Plan
- Take an existing learning algorithm
- Extend it to exploit structural regularities in the test set
  - Using Hubs and Authorities as inspiration

22 FOIL
Quinlan & Cameron-Jones (1993)
Learns relational rules like:
  target_page(A) :- has_research(A), link(A,B), has_publications(B).
For each test example
- Pick the matching rule with the best training set performance p.
- Predict positive with confidence p
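
The prediction step can be sketched as below. This only illustrates how a learned rule set might be applied, not how FOIL learns rules; the data layout (word sets and outgoing-link lists per page), the function names, and the choice to return 0.0 when no rule matches are all assumptions made for the example.

```python
def make_example_rule(words, links):
    """Encode: target_page(A) :- has_research(A), link(A,B), has_publications(B).

    words: dict page -> set of words on the page (assumed layout)
    links: dict page -> list of pages it links to (assumed layout)
    """
    def rule(a):
        return "research" in words.get(a, set()) and any(
            "publications" in words.get(b, set()) for b in links.get(a, ())
        )
    return rule

def foil_confidence(page, rules):
    """rules: list of (rule_fn, p), where p is training set performance.

    Returns the p of the best matching rule, or 0.0 if none matches
    (the 0.0 default is an assumption, not something the slide states).
    """
    matching = [p for rule_fn, p in rules if rule_fn(page)]
    return max(matching, default=0.0)

words = {"p1": {"research"}, "p2": {"publications"}, "p3": {"research"}}
links = {"p1": ["p2"], "p3": []}
rules = [(make_example_rule(words, links), 0.75), (lambda a: a == "p1", 0.4)]
```

Here `p1` matches both rules and takes the higher confidence 0.75, while `p3` mentions research but links to no publications page, so no rule covers it.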

23 FOIL-Hubs Representation
Add two rules to a learned rule set
- target_page(A) :- link(B,A), target_hub(B).
- target_hub(A) :- link(A,B), target_page(B).
Talk about confidence rather than truth
- target_page(page15) = 0.75
Evaluate by summing instantiations

24 FOIL-Hubs Algorithm
1. Apply learned FOIL rules: learned(A)
2. Iterate
   1. Evaluate target_hub(A)
   2. Evaluate target_page(A)
   3. Set target_page(A) =
3. Report target_page(A)

25 FOIL-Hubs Algorithm
[Diagram labels: Learned FOIL rules, foil(A), target_hub(A), target_page(A)]
1. Apply learned FOIL rules to test set
2. Initialise target_page(A) confidence from foil(A)
3. Evaluate target_hub(A)
4. Evaluate target_page(A)
5. target_page(A) = target_page(A) · s + foil(A)
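
The loop on this slide can be sketched as follows. Several points are assumptions rather than details from the talk: the update in step 5 is read as a product with a scaling factor s, s is chosen here so that the largest hub-derived score becomes 1, and the iteration count is fixed. The function name and toy data are hypothetical.

```python
def foil_hubs(foil, links, iterations=10):
    """Propagate FOIL confidences through hub pages in the test set.

    foil:  dict page -> confidence from the learned FOIL rules
    links: list of (source, target) pairs
    """
    pages = set(foil) | {p for edge in links for p in edge}
    foil = {p: foil.get(p, 0.0) for p in pages}
    target_page = dict(foil)  # step 2: initialise from foil(A)
    for _ in range(iterations):
        # target_hub(A) :- link(A,B), target_page(B), summed over instantiations
        target_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            target_hub[src] += target_page[dst]
        # target_page(A) :- link(B,A), target_hub(B), summed over instantiations
        from_hubs = {p: 0.0 for p in pages}
        for src, dst in links:
            from_hubs[dst] += target_hub[src]
        # Assumed scaling: s maps the largest hub-derived score to 1.
        top = max(from_hubs.values(), default=0.0)
        s = 1.0 / top if top > 0 else 0.0
        target_page = {p: from_hubs[p] * s + foil[p] for p in pages}
    return target_page

links = [("index", "s1"), ("index", "s2"), ("index", "s3")]
scores = foil_hubs({"s1": 0.8, "s2": 0.7, "s3": 0.0, "x": 0.0}, links)
```

In this toy case `s3` is covered by no FOIL rule, yet it gains confidence because it shares a hub (`index`) with two pages FOIL is confident about; the unlinked page `x` stays at zero.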

26 Data Set
- 4127 pages from Computer Science departments of four universities:
  - Cornell University
  - University of Texas at Austin
  - University of Washington
  - University of Wisconsin
Hand labeled into:
  Student   558 Web pages
  Course    243 Web pages
  Faculty   153 Web pages

27 Experiment
Three binary classification tasks
1. Student Home Page
2. Course Home Page
3. Faculty Home Page
Leave-two-universities-out cross-validation
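
One reading of the cross-validation scheme is that each fold holds out a pair of universities for testing and trains on the other two. The sketch below enumerates the folds under that assumption; the function name and the short university labels are made up for the example.

```python
from itertools import combinations

UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]

def leave_two_out_folds(universities):
    """Yield (train, test) splits, holding out each pair of universities once."""
    for held_out in combinations(universities, 2):
        train = [u for u in universities if u not in held_out]
        yield train, list(held_out)

folds = list(leave_two_out_folds(UNIVERSITIES))
```

With four universities this gives six folds, and no university's pages appear in both the training and test side of the same fold.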

28 Student Home Page

29 Course Home Page

30 More Detailed Results
Partition the test data into
- Examples covered by some learned FOIL rule
- Examples covered by no learned FOIL rule

31 Student – FOIL covered

32 Student – FOIL uncovered

33 Course – FOIL covered

34 Course – FOIL uncovered

35 Recap
- We’ve searched for regularities of the form
    student_page(A) :- link(Web->KB members page,A)
  in the test set.
- We consider this an instance of a regularity schema
    student_page(A) :- link(,A)

36 Conclusions
- Test set regularities can be used to improve classification performance
- FOIL-Hubs used such regularities to outperform FOIL on three Web page classification problems
- We can potentially search for other regularity schemas using FOIL

37 Other work
- Using the structure of HTML to improve retrieval. Michal Cutler, Yungming Shih, Weiyi Meng. USENIX 1997
  - Use tf-idf, giving different weights to text in different HTML tags
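
The tag-weighting idea can be sketched with Python's standard `html.parser`. This covers only a weighted term-frequency count (the idf part is omitted), and the weight table, class name, and the rule of taking the largest weight among enclosing tags are all assumptions for illustration, not the scheme from the cited paper.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; text outside these tags counts as 1.0.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "b": 2.0}

class WeightedTermCounter(HTMLParser):
    """Count terms, weighting each by the heaviest enclosing tag."""

    def __init__(self, weights):
        super().__init__()
        self.weights = weights
        self.open_tags = []
        self.tf = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recent matching tag and anything left open inside it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        weight = max((self.weights.get(t, 1.0) for t in self.open_tags), default=1.0)
        for token in data.lower().split():
            self.tf[token] += weight

counter = WeightedTermCounter(TAG_WEIGHTS)
counter.feed("<html><title>hypertext</title><body><b>links</b> hypertext text</body></html>")
tf = counter.tf
```

In this example "hypertext" scores 5.0 (4.0 from the title plus 1.0 from the body), "links" scores 2.0 from the bold tag, and "text" scores 1.0.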

