Web Page Classification by Academic Fields
Richard Wang
February 15, 2006
Introduction
Objective
- Train a classifier that classifies web pages by academic field using a semi-supervised method
- Identify interests/affiliations of people
- Filter web pages for field-specific applications (e.g. an N.E.R. trained on C.S. web pages)
Assumptions
- Academic fields correspond to academic departments
- All web pages under an academic departmental website are related to the academic field that the department corresponds to
Academic Fields
We predefine six academic fields (also showing an example departmental URL for each):
- Biological Sciences (e.g. web.mit.edu/biology/www)
- Computer Science (e.g. ...)
- Economics (e.g. ...)
- History (e.g. ...)
- Law (e.g. ...)
- Music (e.g. ...)
System Architecture
[Diagram: pipeline components, in the order they appear on the slide]
- Academic Field Queries
- Google
- Candidate Dept. URLs (Field?, URLs)
- Simple URL Classifier
- True Dept. URLs (Field, URLs)
- Web Crawler
- True Dept. Pages (Field, Pages)
- Candidate Dept. Pages (Field?, URLs, Pages)
- Web Page Classifier
- If Match
- External Module (Optional)
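The iterative, semi-supervised part of this architecture can be illustrated with a small self-training loop. This is only a sketch under stated assumptions: it reads the "If Match" box as "accept a candidate page when the page classifier's prediction agrees with the field guessed for it", which the slides do not spell out. The names `train`, `predict`, and `self_train` are hypothetical, and the word-count model is a toy stand-in for the real page classifier.

```python
from collections import Counter


def train(labeled_pages):
    """Toy stand-in for the page classifier: per-field word counts."""
    model = {}
    for field, text in labeled_pages:
        model.setdefault(field, Counter()).update(text.lower().split())
    return model


def predict(model, text):
    """Return the field whose training words overlap most with the text."""
    words = text.lower().split()
    return max(model, key=lambda field: sum(model[field][w] for w in words))


def self_train(seed_pages, candidate_pages, iterations=2):
    """Grow the training set with candidates whose prediction matches their guess.

    seed_pages: list of (field, text) with trusted labels.
    candidate_pages: list of (guessed_field, text) with unverified labels.
    Returns the final model and the candidates that were never accepted.
    """
    labeled = list(seed_pages)
    remaining = list(candidate_pages)
    for _ in range(iterations):
        model = train(labeled)
        accepted, rejected = [], []
        for guess, text in remaining:
            # Assumed "If Match" rule: keep the page only on agreement.
            if predict(model, text) == guess:
                accepted.append((guess, text))
            else:
                rejected.append((guess, text))
        if not accepted:
            break  # nothing new was learned, so further rounds change nothing
        labeled.extend(accepted)
        remaining = rejected
    return train(labeled), remaining
```

A page rejected in one round can be accepted in a later round, once newly accepted pages have added vocabulary to the model; that is the sense in which unlabeled data helps here.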
Candidate Dept. URLs
Manually devised Google queries for extracting candidate departmental URLs:
- allintitle: "Biological Sciences" OR Biology School OR Department OR Institute site:edu
- allintitle: "Computer Science" -Mathematics School OR Department OR Institute site:edu
- allintitle: Economics School OR Department OR Institute site:edu
- allintitle: History -Art School OR Department OR Institute site:edu
- allintitle: Law School OR Department OR Institute site:edu
- allintitle: Music School OR Department OR Institute site:edu
The extracted URLs are then sent to:
- A simple URL classifier
- The web crawler for crawling
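The six queries share one template: field names joined by OR, optional excluded terms, then a fixed suffix. A minimal sketch of that template follows; the helper name `build_query` and its parameters are hypothetical, since the slides say the queries were devised manually.

```python
# Fixed suffix shared by all six queries on the slide.
SUFFIX = "School OR Department OR Institute site:edu"


def build_query(include_terms, exclude_terms=()):
    """Assemble an allintitle: query string for one academic field.

    include_terms: field names; multi-word names are quoted.
    exclude_terms: words to exclude with a leading minus sign.
    """
    names = " OR ".join(f'"{t}"' if " " in t else t for t in include_terms)
    parts = ["allintitle:", names]
    parts.extend(f"-{t}" for t in exclude_terms)
    parts.append(SUFFIX)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["Biological Sciences", "Biology"]))
    print(build_query(["Computer Science"], ["Mathematics"]))
```

Keeping the template in one place makes it easier to add further fields later without hand-editing each query.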
Simple URL Classifier
- Learns URL tokens from candidate dept. URLs by keeping count of their term frequencies
- Determines the academic field of a URL by searching for those top URL tokens

Academic Field        Top Common Tokens in URL
Biological Sciences   biology (64%), bio (10%), biol (5%)
Computer Science      cs (69%), csc (3%), compsci (3%), cse (3%)
Economics             econ (44%), economics (38%), economic (4%)
History               history (80%), hist (4%)
Law                   law (71%)
Music                 music (86%), mus (2%)
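A token-frequency URL classifier along these lines could look like the sketch below. It is an illustration, not the author's implementation: the tokenization rule, the `STOP_TOKENS` list, the `top_k` cutoff, and the "most overlapping tokens wins" decision rule are all assumptions, and the hostnames in the usage note are invented.

```python
import re
from collections import Counter

# Assumption: generic URL tokens carry no field information and are ignored.
STOP_TOKENS = {"http", "https", "www", "edu", "html", "htm", "index"}


def url_tokens(url):
    """Split a URL on non-alphanumeric characters into a set of tokens."""
    parts = re.split(r"[^a-z0-9]+", url.lower())
    return {p for p in parts if p and p not in STOP_TOKENS}


def learn_top_tokens(candidates, top_k=3):
    """Count, per field, how many candidate URLs contain each token.

    candidates: iterable of (field, url) pairs from the search step.
    Returns {field: [most frequent tokens]}.
    """
    counts = {}
    for field, url in candidates:
        counts.setdefault(field, Counter()).update(url_tokens(url))
    return {
        field: [token for token, _ in counter.most_common(top_k)]
        for field, counter in counts.items()
    }


def classify_url(url, top_tokens):
    """Return the field whose top tokens overlap most with the URL, or None."""
    tokens = url_tokens(url)
    best = max(top_tokens, key=lambda field: len(tokens & set(top_tokens[field])))
    return best if tokens & set(top_tokens[best]) else None
```

For example, after learning from `("Computer Science", "www.cs.alpha.edu")` and similar pairs with `top_k=1`, a URL such as `http://www.cs.omega.edu/people` would be assigned to Computer Science because it contains the token `cs`, while a URL with no learned token is left unclassified.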
Web Page Classifier
Since learning is iterative, we need a fast non-binary classifier:
- KNN is fast during training but extremely slow during testing
- A One vs. All learner that uses a simple inner learner can be very fast during both training and testing
We decided to use One vs. All with Naïve Bayes as the inner learner and a simple set of features: bag-of-words
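To make the One vs. All idea concrete, here is a minimal standard-library sketch: one binary Naïve Bayes model per field (that field versus all others) over bag-of-words counts, with the prediction taken as the field whose model gives the highest log-odds. The class names, whitespace tokenization, multinomial model, and add-one smoothing are assumptions; the slides do not state which Naïve Bayes variant or toolkit was used.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one field versus the rest (add-one smoothing)."""

    def fit(self, bags, is_positive):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        for bag, label in zip(bags, is_positive):
            self.word_counts[label].update(bag)
            self.doc_counts[label] += 1
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_odds(self, bag):
        """log P(field | page) - log P(not field | page), up to a shared constant."""
        total_docs = sum(self.doc_counts.values())
        score = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            score += sign * math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word, n in bag.items():
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][word]
                    score += sign * n * math.log((count + 1) / denom)
        return score


class OneVsAll:
    """Train one binary model per field; predict the field with the top score."""

    def fit(self, texts, labels):
        bags = [bag_of_words(t) for t in texts]
        self.models = {
            field: BinaryNaiveBayes().fit(bags, [lab == field for lab in labels])
            for field in sorted(set(labels))
        }
        return self

    def scores(self, text):
        bag = bag_of_words(text)
        return {field: model.log_odds(bag) for field, model in self.models.items()}

    def predict(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)
```

Training is a single counting pass per field and prediction is a sum over the words of one page, which is why this combination suits a loop that retrains many times.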
Experimental Setting
Initial training set (seed)
- One entire website for each academic field
- Manually verified that those websites are indeed departmental websites
- Web pages totaling 18 MB
Test set
- Same setting as the initial training set but with different websites
- A total of 1824 web pages (2 MB)
Experimental Results
Confusion Matrix
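For readers who want to reproduce this kind of evaluation, a confusion matrix can be tallied from true and predicted fields as sketched below. The function names are hypothetical and the labels in the usage note are toy values, not the results reported in the talk.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true field, predicted field) pairs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Fraction of pages on the diagonal (true field == predicted field)."""
    total = sum(matrix.values())
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / total if total else 0.0


def format_matrix(matrix, fields):
    """Render rows as true fields and columns as predicted fields."""
    width = max(len(f) for f in fields) + 2
    header = " " * width + "".join(f.ljust(width) for f in fields)
    rows = [
        true.ljust(width) + "".join(str(matrix[(true, pred)]).ljust(width) for pred in fields)
        for true in fields
    ]
    return "\n".join([header] + rows)
```

With toy inputs such as true labels `["Law", "Law", "Music", "Music"]` and predictions `["Law", "Music", "Music", "Music"]`, the off-diagonal cell (Law, Music) holds the single misclassified page.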
Classifier Analysis (1): Biological Sciences, Computer Science
Classifier Analysis (2): Economics, History
Classifier Analysis (3): Law, Music
Conclusion & Future Work
- Classification performance can be improved by using unlabeled data
- Try more iterations in the experiments
- Try to learn/classify more academic fields
- Try other multi-class classifiers
Thank You Questions?