1
Hypertext Categorization using Hyperlink Patterns and Meta Data
Rayid Ghani, Séan Slattery, Yiming Yang
Carnegie Mellon University
2
How is hypertext different?
– Link information (possibly useful but noisy)
– Diverse authorship
– Short text: the topic is often not obvious from the text alone
– Structure / position within the web graph
– Author-supplied features (meta-tags): bold, italics, headings, etc.
– External sources of information (meta-data)
3
Goal
– Present several hypotheses about regularities in hypertext classification tasks
– Describe methods to exploit these regularities
– Evaluate the different methods and regularities on real-world hypertext datasets
4
Regularities in Hypertext
– No Regularity
– “Encyclopedia” Regularity
– “Co-Referencing” Regularity
– Partial “Co-Referencing” Regularity
– Pre-Classified Regularity
– Meta-Data Regularity
5
No Regularity
– The documents are linked at random, or at least independently of the document class.

“Encyclopedia” Regularity
– The majority of linked documents share the same class as the document. Encyclopedia articles generally reference other articles which are topically similar.

“Co-Referencing” Regularity
– Documents of the same class tend to link to documents not of that class, but which are topically similar to each other. For example, university student index pages tend not to link to other student index pages, but do link mostly to the home pages of individual students.
6
Partial “Co-Referencing” Regularity
– The “co-referencing” regularity, but with more than a few “noisy” links. Many students may point to pages about their hobbies, but also link to a wide variety of other pages that are less distinctive of student home pages.

Pre-Classified Regularity
– One page, or some small set of pages, contains lists of hyperlinks to pages that are mostly members of the same class. Example: any page from the Yahoo topic hierarchy.

Meta-Data Regularity
– Meta-data available from external sources on the web can be exploited in the form of additional features. Examples: movie reviews for movie classification, or online discussion boards for other topic classification tasks (such as stock market prediction or competitive analysis).
7
Ignore Links
– Use standard text classifiers on the text of the document itself. Also serves as the baseline.

Use All the Text From Neighbors
– Augment the text of each document with the text of its neighbors, adding more topic-related words to the document.

Use All the Text From Neighbors Separately
– Add the words of linked documents, but treat them as if they come from a separate vocabulary. A simple way to do this is to prefix the words in the linked documents with a tag, such as linkedword: (a sketch follows).
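A minimal sketch of the prefixing trick, assuming documents are already tokenized into word lists; the function name and example data here are hypothetical, and only the linkedword: tag comes from the slide:

```python
# Sketch (not the paper's code) of the "separate vocabulary" trick:
# neighbor words are prefixed with linkedword: so the classifier
# treats them as distinct features.

def neighbor_features(doc_words, neighbor_docs, separate_vocab=True):
    """Return doc_words extended with the words of linked documents.

    If separate_vocab is True, each neighbor word is prefixed with
    "linkedword:" so it lands in a separate vocabulary.
    """
    features = list(doc_words)
    for neighbor in neighbor_docs:
        for word in neighbor:
            features.append("linkedword:" + word if separate_vocab else word)
    return features

page = ["faculty", "research"]                       # short page text
links = [["machine", "learning"], ["publications"]]  # text of neighbors
print(neighbor_features(page, links))
# ['faculty', 'research', 'linkedword:machine',
#  'linkedword:learning', 'linkedword:publications']
```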
8
Look for Linked Document Subsets
– Search for the topically similar linked pages. At the top level, this is a clustering problem: find similar documents among all the documents linked to by documents in the same class.

Use the Identity of the Linked Documents
– Search for these pages by representing each page with only the names of the pages it links to (a sketch follows).

Use External Features / Meta-Data
– Collect features that relate two or more of the entities/documents being classified, using information extraction techniques. These extracted features can then be used in similar fashion: by using the identity of the related documents and by using the text of related documents in various ways.
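A sketch of the identity-based representation, assuming scikit-learn's DictVectorizer is available; the page names below are made-up examples:

```python
# Represent each page only by the identities of the pages it links to:
# one binary feature per linked-page name. The page names are made up.
from sklearn.feature_extraction import DictVectorizer

outlinks = {
    "student1.html": ["hobby.html", "dept.html"],
    "student2.html": ["dept.html", "cv.html"],
}

# One dict per source page, with a 1 for every page it links to.
rows = [{target: 1 for target in targets} for targets in outlinks.values()]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

print(vec.get_feature_names_out())  # the linked-page identities
print(X)                            # one binary row per source page
```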
10
Learning Algorithms Used
– Naïve Bayes (NB): probabilistic, builds a generative model
– k Nearest Neighbor (kNN): example-based
– First Order Inductive Learner (FOIL): relational learner
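The slides do not include implementations; below is a generic scikit-learn sketch of the first two learners on toy data (FOIL, a relational rule learner, has no direct scikit-learn counterpart):

```python
# Generic scikit-learn sketch of NB and kNN on toy text; not the
# paper's implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

docs = ["annual report revenue profit",
        "course syllabus homework exam",
        "quarterly earnings shareholders",
        "lecture notes midterm grades"]
labels = ["company", "university", "company", "university"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB().fit(X, labels)                       # generative model
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)  # example-based

X_test = vec.transform(["shareholders meeting report"])
print(nb.predict(X_test), knn.predict(X_test))  # both should say 'company'
```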
11
Datasets
– A collection of up to 50 web pages from each of 4285 companies (as used in Ghani et al. 2000)
– Two types of classification (labels obtained from www.hoovers.com):
  – Coarse-grained classification: 28 classes
  – Fine-grained classification: 255 classes
– Classification is at the level of companies, so the task is to classify each company by collapsing all of the web pages in its corporate website (a sketch of one such collapsing step follows).
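The slide does not specify the exact aggregation used; the following is one plausible collapsing step, assuming each company's pages are available as raw text (the function and data are hypothetical):

```python
# One plausible collapsing step: concatenate each company's pages into
# a single document so classification happens at the company level.
# The exact aggregation used in the paper is not given on the slide.

def collapse_company_pages(pages_by_company):
    """pages_by_company: {company: [page_text, ...]}, up to 50 pages each.
    Returns one combined document per company."""
    return {company: " ".join(pages)
            for company, pages in pages_by_company.items()}

pages = {
    "acme":   ["about us steel products", "beams and girders catalog"],
    "globex": ["software consulting", "cloud services pricing"],
}
print(collapse_company_pages(pages))
```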
12
Accuracy for the 28-Class Task (results chart)
13
Accuracy for the 255-Class Task (results chart)
14
Accuracy vs. Feature Size (results chart)
15
Conclusions
– Hyperlinks can be extremely noisy and harmful for classification.
– Meta-data about websites can be useful, and techniques for automatically finding meta-data should be explored.
– Naïve Bayes and kNN are suitable since they scale well with the noise and the feature-set size, while FOIL has the power to discover relational regularities that cannot be explicitly identified by the other learners.