Web Page Classification by Academic Fields Richard Wang February 15, 2006.



Introduction
Objective
  - Train a classifier that classifies web pages by academic field using a semi-supervised method
      - Identify interests/affiliations of people
      - Filter web pages for field-specific applications (e.g. an N.E.R. trained on C.S. web pages)
Assumptions
  - Academic fields correspond to academic departments
  - All web pages under an academic departmental website are related to the academic field that the department corresponds to

Academic Fields
We pre-define six academic fields (with an example departmental URL where one is given):
  - Biological Sciences (e.g. web.mit.edu/biology/www)
  - Computer Science
  - Economics
  - History
  - Law
  - Music

System Architecture
(Diagram; component labels as they appear on the slide)
  - Academic Field Queries
  - Google
  - Candidate Dept. URLs (Field?, URLs)
  - Simple URL Classifier
  - True Dept. URLs (Field, URLs)
  - Web Crawler
  - True Dept. Pages (Field, Pages)
  - Candidate Dept. Pages (Field?, URLs, Pages)
  - Web Page Classifier
  - If Match
  - External Module (Optional)

Candidate Dept. URLs
Manually devised Google queries for extracting candidate departmental URLs:
    allintitle: "Biological Sciences" OR Biology School OR Department OR Institute site:edu
    allintitle: "Computer Science" -Mathematics School OR Department OR Institute site:edu
    allintitle: Economics School OR Department OR Institute site:edu
    allintitle: History -Art School OR Department OR Institute site:edu
    allintitle: Law School OR Department OR Institute site:edu
    allintitle: Music School OR Department OR Institute site:edu
The extracted URLs are then sent to:
  - A simple URL classifier
  - The web crawler for crawling
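The six queries share one template, so they can be generated rather than typed by hand. A minimal sketch, assuming the per-field title terms are stored as plain strings; the names FIELD_TERMS, SUFFIX, build_query and QUERIES are illustrative and not from the slides, and no search API is called here:

```python
# (field, title terms) pairs; the term strings are copied from the queries on the slide.
FIELD_TERMS = [
    ("Biological Sciences", '"Biological Sciences" OR Biology'),
    ("Computer Science", '"Computer Science" -Mathematics'),
    ("Economics", "Economics"),
    ("History", "History -Art"),
    ("Law", "Law"),
    ("Music", "Music"),
]

# Part shared by every query: institution keywords plus the .edu restriction.
SUFFIX = "School OR Department OR Institute site:edu"


def build_query(title_terms):
    """Assemble one allintitle: query string from a field's title terms."""
    return f"allintitle: {title_terms} {SUFFIX}"


# Map each academic field to its query string.
QUERIES = {field: build_query(terms) for field, terms in FIELD_TERMS}

if __name__ == "__main__":
    for field, query in QUERIES.items():
        print(f"{field}: {query}")
```

Keeping the field label next to each query preserves the tentative "Field?" tag that the candidate URLs carry into the later stages.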

Simple URL Classifier
  - Learns URL tokens from candidate dept. URLs by keeping count of their term frequencies
  - Determines the academic field of a URL by searching for those top URL tokens
Academic Field       | Top Common Tokens in URL
Biological Sciences  | biology (64%), bio (10%), biol (5%)
Computer Science     | cs (69%), csc (3%), compsci (3%), cse (3%)
Economics            | econ (44%), economics (38%), economic (4%)
History              | history (80%), hist (4%)
Law                  | law (71%)
Music                | music (86%), mus (2%)
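The token lookup can be sketched in a few lines. This is an illustration under assumptions, not the original implementation: the tokenization rule (split on non-alphanumeric characters), the per-URL counting in token_shares, and the choice to return None when zero or several fields match are all hypothetical. Only the token sets come from the table.

```python
import re
from collections import Counter

# Top tokens per field, copied from the table on the slide.
TOP_TOKENS = {
    "Biological Sciences": {"biology", "bio", "biol"},
    "Computer Science": {"cs", "csc", "compsci", "cse"},
    "Economics": {"econ", "economics", "economic"},
    "History": {"history", "hist"},
    "Law": {"law"},
    "Music": {"music", "mus"},
}


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def token_shares(urls):
    """Fraction of URLs in which each token occurs (assumed counting scheme)."""
    counts = Counter()
    for url in urls:
        counts.update(set(url_tokens(url)))  # count each token once per URL
    return {tok: n / len(urls) for tok, n in counts.items()}


def classify_url(url):
    """Return the single field whose top tokens occur in the URL, else None."""
    tokens = set(url_tokens(url))
    matches = [field for field, tops in TOP_TOKENS.items() if tokens & tops]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    print(classify_url("web.mit.edu/biology/www"))
```

Matching whole tokens rather than substrings keeps short tokens such as cs or law from firing inside unrelated words.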


Web Page Classifier
Since learning is iterative, we need a fast non-binary classifier:
  - KNN is fast during training but extremely slow during testing
  - A One vs. All learner that uses a simple inner learner can be very fast during both training and testing
We decided to use One vs. All with Naïve Bayes as the inner learner and a simple set of features: bag-of-words
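A minimal sketch of One vs. All with a Naïve Bayes inner learner over bag-of-words counts. The slides do not give the Naïve Bayes variant, the smoothing, or the tokenizer, so the multinomial model, add-one smoothing, whitespace tokenization, and all class and function names below are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (assumed tokenizer)."""
    return Counter(text.lower().split())


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one 'field vs. rest' decision, add-one smoothing."""

    def fit(self, bags, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for bag, label in zip(bags, labels):
            self.word_counts[label].update(bag)
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_odds(self, bag):
        """Log-score of the positive class minus log-score of the negative class."""
        total_docs = sum(self.doc_counts.values())
        score = {}
        for label in (True, False):
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            s = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word, n in bag.items():
                if word in self.vocab:  # words never seen in training are ignored
                    s += n * math.log((self.word_counts[label][word] + 1) / denom)
            score[label] = s
        return score[True] - score[False]


class OneVsAll:
    """Train one binary model per field; predict the field with the highest log-odds."""

    def fit(self, texts, fields):
        bags = [bag_of_words(t) for t in texts]
        self.models = {
            f: BinaryNaiveBayes().fit(bags, [g == f for g in fields])
            for f in set(fields)
        }
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        return max(self.models, key=lambda f: self.models[f].log_odds(bag))


if __name__ == "__main__":
    clf = OneVsAll().fit(
        ["genome protein cell", "algorithm compiler software", "market price inflation"],
        ["Biological Sciences", "Computer Science", "Economics"],
    )
    print(clf.predict("compiler algorithm"))
```

Training is a single pass of word counting per field and prediction is one sum per field, which is why this combination stays fast when the classifier is retrained on every iteration.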


Experimental Setting
Initial training set (seed)
  - One entire website for each academic field
  - Manually verified that those websites are indeed departmental websites
  - A total of web pages (18MB)
Test set
  - Same setting as the initial training set but with different websites
  - A total of 1824 web pages (2MB)

Experimental Results

Confusion Matrix

Classifier Analysis (1): Biological Sciences, Computer Science

Classifier Analysis (2): Economics, History

Classifier Analysis (3): Law, Music

Conclusion & Future Work
  - Classification performance can be improved by using unlabeled data
  - Try more iterations in the experiments
  - Try to learn/classify more academic fields
  - Try other multi-class classifiers

Thank You Questions?