Classifying University Web Pages According to Academic Field
Richard Wang, Tim Isganitis
01/26/
Read the Web: Project Proposal
Goal
Learn how to classify web pages according to the academic field they relate to.
– We (loosely) define academic fields to correspond to academic departments. For example: Computer Science, Biological Science, Public Policy.
– We predefine the department names, but an alternative (harder) method is to recognize the names of departments and cluster them according to a broader notion of “field.”
Redundant Features
Domain Name
– (Computer Science)
– (Biology)
– We assume that most pages under these domains relate to the given field.
Text of Hyperlink
– Computer Science Department
Words on a web page
– Incorporate word features
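The three feature sources above can be sketched as separate "views" of the same page. This is only an illustration: the function names, the tokenizer, and the `cs.example.edu` URL are assumptions, not part of the proposal.

```python
import re
from urllib.parse import urlparse

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def feature_views(url, anchor_text, page_text):
    """Return the three redundant feature views for one page."""
    host = urlparse(url).hostname or ""
    return {
        "domain": host.split("."),       # labels of the domain name
        "anchor": tokenize(anchor_text), # text of a hyperlink pointing to the page
        "words": tokenize(page_text),    # words on the page itself
    }

# Hypothetical page; example.edu is a placeholder domain.
views = feature_views(
    "http://cs.example.edu/people/index.html",
    "Computer Science Department",
    "Faculty and students in the department study algorithms.",
)
```

Keeping the views separate lets each one be used by a different classifier, so one view can supply labels for training on another.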
Domain Name Classifier
Use a dictionary to associate strings that appear in a domain name with academic fields.
– Probably position dependent: Look for strings to fill
– For example: if 51% of web pages under a domain are classified as “Computer Science,” assume all web pages under that domain are related to the field of Computer Science
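A minimal sketch of such a dictionary lookup follows. The dictionary entries, the `example.edu` hostnames, and the position rule (check only the label immediately left of the institution's domain) are assumptions made for illustration; the proposal does not specify them. The `majority_field` helper is one possible reading of the 51% example.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical dictionary; the proposal does not list its entries.
FIELD_TOKENS = {
    "cs": "Computer Science",
    "bio": "Biological Science",
    "biology": "Biological Science",
}

def classify_domain(url, n_suffix_labels=2):
    """Return the field for a URL, or None if no dictionary token matches.

    Position dependent: only the label immediately left of the
    institution's domain is checked. The institution's domain is
    assumed to be the last n_suffix_labels labels (e.g. example.edu).
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) <= n_suffix_labels:
        return None  # no department-level label present
    dept_label = labels[-n_suffix_labels - 1]
    return FIELD_TOKENS.get(dept_label)

def majority_field(page_labels, threshold=0.5):
    """Return the field given to more than `threshold` of the pages, else None."""
    if not page_labels:
        return None
    field, count = Counter(page_labels).most_common(1)[0]
    return field if count / len(page_labels) > threshold else None
```

For instance, `classify_domain("http://www.cs.example.edu/courses")` looks up the label `cs`, while `classify_domain("http://www.example.edu/")` finds only `www` in that position and returns `None`.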
Academic Page Classifier
Train a classifier on academic web pages
– Labels of web pages are derived from the domain name using the Domain Name Classifier
– Initially try using simple features (e.g., bag-of-words) to train the classifier
– We will try to use Minorthird
– For example: if the Domain Name Classifier indicates that a domain is very likely to be related to Robotics, then incorporate all web pages under that domain as training examples for the academic field Robotics
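To make the bag-of-words idea concrete, here is a small multinomial Naive Bayes classifier written with the Python standard library. It is a stand-in for illustration only and does not use or imitate the Minorthird API; the choice of Naive Bayes, the class name, and the toy training texts are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def bag_of_words(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of pages
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (page_text, label) pairs."""
        for text, label in examples:
            bag = bag_of_words(text)
            self.word_counts[label].update(bag)
            self.doc_counts[label] += 1
            self.vocab.update(bag)

    def predict(self, text):
        """Return the most probable label, or None if untrained."""
        bag = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word, count in bag.items():
                p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy pages whose labels would come from the Domain Name Classifier.
training = [
    ("robot motion planning and manipulation", "Robotics"),
    ("mobile robot sensors and control", "Robotics"),
    ("protein folding and cell biology", "Biological Science"),
    ("gene expression in the cell", "Biological Science"),
]
model = NaiveBayes()
model.train(training)
```

In practice the training pairs would be every page under a confidently labeled domain, so the labels are noisy; a smoothed probabilistic model is one reasonable way to tolerate that noise.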
Learning Loop
Given a URL token like “cs” or “bio,” we can search for other domains of the form:
– The Domain Name Classifier labels all pages in these domains as Computer Science pages
Given a URL such as we can search for other domains of the form:
– The text-based classifier labels the abbreviation based on the content of the pages in this domain.
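One pass of this loop might look like the sketch below. It assumes the search step has already grouped page texts by their department token, and it uses a keyword-based stand-in for the trained text classifier. The token `robo`, the page texts, and the majority threshold are hypothetical.

```python
from collections import Counter

def learning_step(token_to_field, domains, classify_text, threshold=0.5):
    """Run one pass of the loop.

    token_to_field: known URL tokens mapped to fields (the dictionary).
    domains: URL token -> list of page texts found under matching domains.
    classify_text: function mapping page text to a field or None.
    Returns (labeled_pages, new_tokens).
    """
    labeled_pages = []  # (page_text, field) pairs for training the text classifier
    new_tokens = {}     # tokens the text classifier was able to label
    for token, pages in domains.items():
        if token in token_to_field:
            # Known token: the domain name labels every page.
            field = token_to_field[token]
            labeled_pages.extend((page, field) for page in pages)
        elif pages:
            # Unknown token: let the page content vote on the field.
            votes = Counter(classify_text(page) for page in pages)
            field, count = votes.most_common(1)[0]
            if field is not None and count / len(pages) > threshold:
                new_tokens[token] = field
    return labeled_pages, new_tokens

def keyword_classifier(text):
    """Stand-in for the trained text-based classifier."""
    text = text.lower()
    if "robot" in text:
        return "Robotics"
    if "algorithm" in text:
        return "Computer Science"
    return None

known = {"cs": "Computer Science"}
domains = {
    "cs": ["Algorithms course page", "Faculty directory"],
    "robo": ["Robot navigation lab", "Robot arm project", "Contact us"],
}
labeled, learned = learning_step(known, domains, keyword_classifier)
```

Merging `learned` into `known` and retraining the text classifier on `labeled` would then start the next iteration.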