1
Classifying University Web Pages According to Academic Field
Richard Wang, Tim Isganitis
01/26/2006
11-709 Read the Web: Project Proposal
2
Goal
Learn how to classify web pages according to the academic field they relate to.
– We (loosely) define academic fields to correspond to academic departments. For example: Computer Science, Biological Science, Public Policy.
– We predefine the department names, but an alternative (harder) method is to recognize the names of departments and cluster them according to a broader notion of “field.”
3
Redundant Features
Domain Name
– www.cs.cmu.edu (Computer Science)
– www.bio.indiana.edu (Biology)
– We assume that most pages under these domains have to do with the given field.
Text of Hyperlink
– Computer Science Department
Words on a web page
– Incorporate word features
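As a rough illustration of the three feature sources listed above, the sketch below splits one page into a domain view, an anchor-text view, and a word view. The function name `feature_views`, the tokenizer, and the sample inputs are assumptions made for this example, not part of the proposal.

```python
import re
from urllib.parse import urlparse

def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def feature_views(url, anchor_text, page_text):
    """Return the three (assumed) redundant views of a single page."""
    # Accept bare hosts such as "www.cs.cmu.edu" as well as full URLs.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    return {
        "domain": host.split("."),      # e.g. ["www", "cs", "cmu", "edu"]
        "anchor": tokens(anchor_text),  # text of a hyperlink pointing at the page
        "words": tokens(page_text),     # words on the page itself
    }

views = feature_views(
    "http://www.cs.cmu.edu/",
    "Computer Science Department",
    "Courses in algorithms and machine learning",
)
```

Keeping the views separate is one way to let each of them feed its own classifier.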
4
Domain Name Classifier
Use a dictionary to associate strings that appear in a domain name with types of field.
– Probably position dependent: Look for strings to fill www...edu
– For example: 51% of web pages under www.cs.abc.edu are classified as “Computer Science”
Assume all web pages under “www.cs..edu” would be related to the field of Computer Science
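A minimal sketch of such a dictionary lookup follows. It assumes the position-dependent slot is the second label of a four-label host (www, department token, school, edu). The dictionary contents and the name `classify_domain` are hypothetical, since the slides do not list them.

```python
from urllib.parse import urlparse

# Hypothetical token-to-field dictionary (not given in the slides).
FIELD_BY_TOKEN = {
    "cs": "Computer Science",
    "bio": "Biology",
    "ri": "Robotics",
}

def classify_domain(url):
    """Return the field suggested by the department slot of the host, or None."""
    # Accept bare hosts such as "www.cs.cmu.edu" as well as full URLs.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    labels = host.split(".")
    # Position-dependent match: only the label right after "www" is looked up.
    if len(labels) == 4 and labels[0] == "www" and labels[-1] == "edu":
        return FIELD_BY_TOKEN.get(labels[1])
    return None
```

With this sketch, `classify_domain("www.cs.abc.edu")` yields "Computer Science", while a host whose token is missing from the dictionary, or sits in a different position, yields `None`.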
5
Academic Page Classifier
Train a classifier on academic web pages
– Labels of web pages are derived from the domain name using the Domain Name Classifier
– Initially try using simple features (e.g. bag-of-words) to train the classifier
– We will try to use Minorthird
– For example: the Domain Name Classifier indicates that www.ri.abc.edu is very likely to be related to Robotics. Then incorporate all web pages under www.ri.abc.edu as training examples for the academic field Robotics
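To make the bag-of-words idea concrete, here is a small sketch using a multinomial naive Bayes model with add-one smoothing. This is a stand-in chosen for brevity: it is not Minorthird, and the slides do not say which learner would be used. The training pages are invented toy data, with labels of the kind the Domain Name Classifier would supply.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Bag-of-words tokenizer: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # field -> word -> count
        self.doc_counts = Counter()              # field -> number of pages
        self.vocab = set()

    def add_page(self, field, text):
        """Add one training page whose label came from its domain name."""
        words = tokenize(text)
        self.word_counts[field].update(words)
        self.doc_counts[field] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Return the field with the highest smoothed log-probability."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_field, best_score = None, -math.inf
        for field, n_docs in self.doc_counts.items():
            counts = self.word_counts[field]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_field, best_score = field, score
        return best_field

# Toy pages; in the proposal these would be all pages under a labeled domain.
model = NaiveBayes()
model.add_page("Robotics", "robot arm motion planning and perception")
model.add_page("Robotics", "mobile robot navigation sensors")
model.add_page("Biology", "cell biology genome protein lab")
model.add_page("Biology", "protein folding and gene expression")
```

Because the labels come from domain names rather than hand annotation, some training pages will be mislabeled, so any real learner used here would need to tolerate label noise.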
6
Learning Loop
Given a URL token like “cs” or “bio” we can search for other domains of the form: www.cs..edu
– The Domain Name Classifier labels all pages in these domains as Computer Science pages
Given a URL such as www.cs.cmu.edu we can search for other domains of the form: www.<dept>.cmu.edu
– The text-based classifier labels the abbreviation based on the content of the pages in this domain.
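One pass of this loop could be sketched as follows, under several assumptions: hosts have the four-label shape www, department token, school, edu; the candidate hosts are already available as a list rather than found by search; and `classify_text` is a hypothetical callable standing in for the text-based classifier. None of these names come from the slides.

```python
import re

# Assumed host shape: www.<token>.<school>.edu
HOST_RE = re.compile(r"^www\.([a-z0-9-]+)\.([a-z0-9-]+)\.edu$")

def split_host(host):
    """Return (token, school) for a matching host, else None."""
    m = HOST_RE.match(host.lower())
    return (m.group(1), m.group(2)) if m else None

def learning_round(hosts, token_fields, classify_text):
    """One pass of the loop: returns (labeled hosts, updated token dictionary)."""
    parsed = {h: split_host(h) for h in hosts}
    # Step 1: known tokens label every host that carries them.
    labeled = {h: token_fields[p[0]] for h, p in parsed.items()
               if p and p[0] in token_fields}
    # Step 2: schools reached in step 1 expose new tokens, which the
    # text-based classifier is asked to label from page content.
    schools = {parsed[h][1] for h in labeled}
    learned = dict(token_fields)
    for h, p in parsed.items():
        if p and p[1] in schools and p[0] not in learned:
            field = classify_text(h)
            if field is not None:
                learned[p[0]] = field
    return labeled, learned

# Toy run with a stubbed text classifier.
hosts = ["www.cs.cmu.edu", "www.cs.abc.edu", "www.ri.cmu.edu", "www.bio.xyz.edu"]
stub = lambda host: {"www.ri.cmu.edu": "Robotics"}.get(host)
labeled, learned = learning_round(hosts, {"cs": "Computer Science"}, stub)
```

Feeding `learned` back in as `token_fields` for another round would then label hosts carrying the newly learned "ri" token at other schools.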