1
Hypertext Categorization using Hyperlink Patterns and Meta Data
Rayid Ghani, Séan Slattery, Yiming Yang
Carnegie Mellon University
2
How is hypertext different?
– Link information (possibly useful but noisy)
– Diverse authorship
– Short text: the topic is often not obvious from the text alone
– Structure / position within the web graph
– Author-supplied features (meta-tags): bold, italics, headings, etc.
– External sources of information (meta-data)
3
Goal
– Present several hypotheses about regularities in hypertext classification tasks
– Describe methods to exploit these regularities
– Evaluate the different methods and regularities on real-world hypertext datasets
4
Regularities in Hypertext
– No Regularity
– “Encyclopedia” Regularity
– “Co-Referencing” Regularity
– Partial “Co-Referencing” Regularity
– Pre-Classified Regularity
– Meta-Data Regularity
5
No Regularity
– The documents are linked at random, or at least independently of the document class.

“Encyclopedia” Regularity
– The majority of linked documents share the same class as the document. Encyclopedia articles generally reference other articles which are topically similar.

“Co-Referencing” Regularity
– Documents of the same class tend to link to documents not of that class, but which are topically similar to each other. For example, university student index pages tend not to link to other student index pages, but do link mostly to the home pages of individual students.
6
Partial “Co-Referencing” Regularity
– The “co-referencing” regularity, but with more than a few “noisy” links. Many students may point to pages about their hobbies, but also link to a wide variety of other pages that are less distinctive of student home pages.

Pre-Classified Regularity
– One page, or some small set of pages, contains lists of hyperlinks to pages that are mostly members of the same class. Example: any page from the Yahoo topic hierarchy.

Meta-Data Regularity
– Meta-data available from external sources on the web can be exploited in the form of additional features. Examples: movie reviews for movie classification, or online discussion boards for other topic classification tasks (such as stock market prediction or competitive analysis).
7
Ignore Links
– Use standard text classifiers on the text of the document itself. Also serves as the baseline.

Use All the Text From Neighbors
– Augment the text of each document with the text of its neighbors, adding more topic-related words to the document.

Use All the Text From Neighbors Separately
– Add the words of linked documents, but treat them as if they come from a separate vocabulary. A simple way to do this is to prefix the words in the linked documents with a tag, such as linkedword: (a sketch follows).
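A minimal sketch of the prefixing trick, assuming documents are already tokenized into word lists; the function name and example data here are hypothetical, and only the linkedword: tag comes from the slide:

```python
# Sketch (not the paper's code) of the "separate vocabulary" trick:
# neighbor words are prefixed with linkedword: so the classifier
# treats them as distinct features.

def neighbor_features(doc_words, neighbor_docs, separate_vocab=True):
    """Return doc_words extended with the words of linked documents.

    If separate_vocab is True, each neighbor word is prefixed with
    "linkedword:" so it lands in a separate vocabulary.
    """
    features = list(doc_words)
    for neighbor in neighbor_docs:
        for word in neighbor:
            features.append("linkedword:" + word if separate_vocab else word)
    return features

page = ["faculty", "research"]                       # short page text
links = [["machine", "learning"], ["publications"]]  # text of neighbors
print(neighbor_features(page, links))
# ['faculty', 'research', 'linkedword:machine',
#  'linkedword:learning', 'linkedword:publications']
```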
8
Look for Linked Document Subsets
– Search for the topically similar linked pages. At the top level, this is a clustering problem: find similar documents among all the documents linked to by documents in the same class.

Use the Identity of the Linked Documents
– Search for these pages by representing each page with only the names of the pages it links to (a sketch follows).

Use External Features / Meta-Data
– Collect features that relate two or more of the entities/documents being classified, using information extraction techniques. These extracted features can then be used in similar fashion: by using the identity of the related documents and by using the text of related documents in various ways.
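A sketch of the identity-based representation, assuming scikit-learn's DictVectorizer is available; the page names below are made-up examples:

```python
# Represent each page only by the identities of the pages it links to:
# one binary feature per linked-page name. The page names are made up.
from sklearn.feature_extraction import DictVectorizer

outlinks = {
    "student1.html": ["hobby.html", "dept.html"],
    "student2.html": ["dept.html", "cv.html"],
}

# One dict per source page, with a 1 for every page it links to.
rows = [{target: 1 for target in targets} for targets in outlinks.values()]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

print(vec.get_feature_names_out())  # the linked-page identities
print(X)                            # one binary row per source page
```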
10
Learning Algorithms Used
– Naïve Bayes (NB): probabilistic, builds a generative model
– k Nearest Neighbor (kNN): example-based
– First Order Inductive Learner (FOIL): relational learner
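The slides do not include implementations; below is a generic scikit-learn sketch of the first two learners on toy data (FOIL, a relational rule learner, has no direct scikit-learn counterpart):

```python
# Generic scikit-learn sketch of NB and kNN on toy text; not the
# paper's implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

docs = ["annual report revenue profit",
        "course syllabus homework exam",
        "quarterly earnings shareholders",
        "lecture notes midterm grades"]
labels = ["company", "university", "company", "university"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB().fit(X, labels)                       # generative model
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)  # example-based

X_test = vec.transform(["shareholders meeting report"])
print(nb.predict(X_test), knn.predict(X_test))  # both should say 'company'
```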
11
Datasets
– A collection of up to 50 web pages from each of 4285 companies (as used in Ghani et al. 2000)
– Two types of classification (labels obtained from www.hoovers.com):
  – Coarse-grained classification: 28 classes
  – Fine-grained classification: 255 classes
– Classification is at the level of companies, so the task is to classify each company by collapsing all of the web pages in its corporate website (a sketch of one such collapsing step follows).
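The slide does not specify the exact aggregation used; the following is one plausible collapsing step, assuming each company's pages are available as raw text (the function and data are hypothetical):

```python
# One plausible collapsing step: concatenate each company's pages into
# a single document so classification happens at the company level.
# The exact aggregation used in the paper is not given on the slide.

def collapse_company_pages(pages_by_company):
    """pages_by_company: {company: [page_text, ...]}, up to 50 pages each.
    Returns one combined document per company."""
    return {company: " ".join(pages)
            for company, pages in pages_by_company.items()}

pages = {
    "acme":   ["about us steel products", "beams and girders catalog"],
    "globex": ["software consulting", "cloud services pricing"],
}
print(collapse_company_pages(pages))
```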
12
Accuracy for the 28-Class Task (results chart)
13
Accuracy for the 255-Class Task (results chart)
14
Accuracy vs. Feature Size (results chart)
15
Conclusions
– Hyperlinks can be extremely noisy and harmful for classification.
– Meta-data about websites can be useful, and techniques for automatically finding meta-data should be explored.
– Naïve Bayes and kNN are suitable since they scale well with the noise and the feature-set size, while FOIL has the power to discover relational regularities that cannot be explicitly identified by the other learners.