Review of the web page classification approaches and applications
Luu-Ngoc Do, Quang-Nhat Vo



Contents
▫ Introduction
▫ Applications
▫ Features
▫ Algorithms
▫ Experiments

Introduction
▫ The World Wide Web contains a very large number of web pages.
▫ Web information retrieval tasks: crawling, searching, extracting knowledge bases (KBs), …

Introduction
▫ Subject classification: considers the subject or topic of a web page. Examples: “business”, “sports”, …
▫ Functional classification: considers the role a web page plays. Examples: course page, researcher homepage, …

Applications
▫ Improving the quality of search results
▫ Building focused crawlers
▫ Extracting KBs

Improving Search Results
Goal: resolve query ambiguity.
▫ Ask the user to specify the desired category before searching (Chekuri et al. [1997]).
▫ Present a categorized view of the results to the user (Kaki [2005]).

Building Focused Crawler
▫ When only domain-specific queries are expected, performing a full crawl is usually inefficient.
▫ Only documents relevant to a predefined set of topics are of interest (Chakrabarti et al. [1999]).
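
The idea of crawling only on-topic pages can be sketched as a best-first crawl in which a page classifier decides which pages are kept and expanded. This is a minimal illustration, not the crawler of Chakrabarti et al.: `fetch_links` and `relevance` are hypothetical callables, the threshold is arbitrary, and a small in-memory link table stands in for real HTTP fetches.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, threshold=0.5, max_pages=100):
    """Best-first crawl that keeps and expands only pages judged relevant.

    seeds       -- iterable of start URLs
    fetch_links -- hypothetical callable: url -> list of outgoing URLs
    relevance   -- hypothetical callable: url -> score in [0, 1], for example
                   the output of a topic classifier applied to the page
    """
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-relevance(url), url) for url in set(seeds)]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    collected = []

    while frontier and len(collected) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if -neg_score < threshold:
            continue  # off-topic page: do not keep it or follow its links
        collected.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return collected


# Toy in-memory "web" (made-up data) standing in for real pages.
TOY_LINKS = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
TOY_SCORES = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7, "e": 0.95}

crawled = focused_crawl(["a"], lambda u: TOY_LINKS[u], lambda u: TOY_SCORES[u])
print(crawled)  # ['a', 'b', 'd']
```

Page "e" is on-topic but is never reached, because its only parent "c" is judged off-topic and is not expanded. That is the trade-off a focused crawler accepts in exchange for skipping most of the web.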

Extracting KBs
▫ Store complex structured and unstructured information from the World Wide Web in a form that computers can understand.
▫ First step: recognize class instances by classifying the content of web pages (Craven et al. [1998]).

Feature Selection
▫ Textual content, HTML tags, hyperlinks, anchor text
▫ On-page features
▫ Neighbors features

On-page Features
Textual content
▫ Bag-of-words
▫ N-gram representation: n consecutive words (Mladenic [1998]). Example: the bigram “new york” in addition to the single words “new” and “york”.
HTML tags: Ardo [2005]
URL: Kan and Thi [2005], Sujatha [2013]. Advantage: reduced processing time.
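
A small sketch of the n-gram representation, assuming whitespace tokenization and lowercasing (both are simplifications chosen here; `word_ngrams` is an illustrative name, not one taken from the cited work):

```python
def word_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n, in order of n."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


print(word_ngrams("New York"))  # ['new', 'york', 'new york']
```

With `max_n=1` this reduces to the plain bag-of-words vocabulary; larger values keep phrases such as "new york" together as a single feature.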

Neighbors Features (1)
▫ Weak assumption: the neighbors of pages that belong to the same category share some common characteristics.
▫ Strong assumption: a page is much more likely to be surrounded by pages of the same category.

Neighbors Features (2)

Neighbors Features (3)
▫ Sibling pages are more useful than parents and children (Chakrabarti et al. [1998], Qi and Davison [2006]).
▫ The content of neighbors needs to be sufficiently similar to that of the target page (Oh et al. [2000]).
▫ Use only a portion of the content of parent and child pages: the title, the anchor text, and the text surrounding the anchor text on the parent pages.
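
One way to use only a portion of a parent page is to take just the anchor text of its links to the target page and append it to the target's own text. The sketch below does this with Python's standard `html.parser`. The class and function names are illustrative, and links are matched by exact `href` string with no URL normalization, which is a simplifying assumption.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the anchor text of links in a page that point to target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.inside_target_link = False
        self.anchor_texts = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self.inside_target_link = True
            self._buffer = []

    def handle_data(self, data):
        if self.inside_target_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.inside_target_link:
            # Join the pieces and collapse runs of whitespace.
            self.anchor_texts.append(" ".join("".join(self._buffer).split()))
            self.inside_target_link = False


def neighbor_augmented_text(target_text, parent_html_pages, target_url):
    """Target page text followed by the anchor text of each parent's link to it."""
    parts = [target_text]
    for html in parent_html_pages:
        collector = AnchorTextCollector(target_url)
        collector.feed(html)
        parts.extend(collector.anchor_texts)
    return " ".join(parts)


# Made-up parent page with one link to the target and one unrelated link.
parent = ('<p>See the <a href="/cs101.html">intro programming course</a> '
          'and <a href="/x.html">other</a>.</p>')
augmented = neighbor_augmented_text("Syllabus and homework", [parent], "/cs101.html")
print(augmented)  # Syllabus and homework intro programming course
```

The augmented string can then be passed to any text feature extractor, so that the parent's description of the page contributes to its features without pulling in the rest of the parent's content.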

Algorithms
▫ k-NN
▫ Co-training
▫ Naïve Bayes

k-NN
▫ Kwon and Lee [2000]
▫ Bag-of-words
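
For orientation, a plain k-NN classifier over bag-of-words vectors might look as follows. This is a generic sketch and not the specific method of Kwon and Lee: cosine similarity, unweighted majority voting, and the toy training pages are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to its word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Label `text` by majority vote among its k most similar training pages."""
    query = bag_of_words(text)
    ranked = sorted(training,
                    key=lambda example: cosine(query, bag_of_words(example[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training pages as (text, label) pairs.
training = [
    ("homework syllabus lecture exam", "course"),
    ("lecture notes exam schedule", "course"),
    ("syllabus grading lecture", "course"),
    ("research publications professor", "faculty"),
    ("professor office publications teaching", "faculty"),
]

print(knn_classify("lecture exam homework", training))   # course
print(knn_classify("professor publications", training))  # faculty
```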

Co-training
▫ Blum and Mitchell [1998]
▫ Uses both labeled and unlabeled data.
▫ Two classifiers, trained on different sets of features, are used to classify the unlabeled instances.
▫ The predictions of each classifier are used to train the other.
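
The loop can be sketched as follows. This is a simplified illustration rather than the exact procedure of Blum and Mitchell. The two views are taken to be the page text and the anchor text of links pointing to the page. `WordCountClassifier` is a toy stand-in for a real learner. In each round every classifier moves only its single most confident unlabeled example into the shared labeled pool. All of these choices, and the data, are assumptions made for the example.

```python
from collections import Counter, defaultdict


class WordCountClassifier:
    """Toy stand-in classifier (an assumption, not from the slides):
    a class scores higher the more often it has seen the text's words."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        return self

    def predict_with_confidence(self, text):
        words = text.lower().split()
        scores = {label: sum(seen[w] for w in words)
                  for label, seen in self.counts.items()}
        total = sum(scores.values())
        if total == 0:
            return None, 0.0  # no known word: abstain
        best = max(scores, key=scores.get)
        return best, scores[best] / total


def train_pair(labeled):
    """Train one classifier per view on the current labeled pool."""
    labels = [label for _, label in labeled]
    clf1 = WordCountClassifier().fit([views[0] for views, _ in labeled], labels)
    clf2 = WordCountClassifier().fit([views[1] for views, _ in labeled], labels)
    return clf1, clf2


def co_train(labeled, unlabeled, rounds=5, min_confidence=0.6):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        newly_labeled = {}  # index in `unlabeled` -> predicted label
        for view_index, clf in enumerate(train_pair(labeled)):
            best = None
            for i, views in enumerate(unlabeled):
                if i in newly_labeled:
                    continue
                label, conf = clf.predict_with_confidence(views[view_index])
                if label is not None and conf >= min_confidence:
                    if best is None or conf > best[2]:
                        best = (i, label, conf)
            if best is not None:
                newly_labeled[best[0]] = best[1]
        if not newly_labeled:
            break
        # Both classifiers are retrained on the enlarged pool in the next round.
        labeled += [(unlabeled[i], label) for i, label in newly_labeled.items()]
        unlabeled = [v for i, v in enumerate(unlabeled) if i not in newly_labeled]
    clf1, clf2 = train_pair(labeled)
    return clf1, clf2, labeled


# Made-up data: each example is (page text, anchor text of an incoming link).
labeled_pages = [
    (("homework lecture syllabus", "course page"), "course"),
    (("research publications students", "professor homepage"), "faculty"),
]
unlabeled_pages = [
    ("midterm quiz grading", "cs101 course"),  # only the anchor view recognizes it
    ("publications awards", "dr smith"),       # only the page view recognizes it
]

page_clf, anchor_clf, final_labeled = co_train(labeled_pages, unlabeled_pages)
print(page_clf.predict_with_confidence("midterm quiz"))  # ('course', 1.0)
print(anchor_clf.predict_with_confidence("dr smith"))    # ('faculty', 1.0)
```

In this toy run the page-text classifier initially knows none of the words in "midterm quiz grading". It learns them only because the anchor-text classifier labeled that page as a course page, which is the cross-teaching effect the slide describes.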

Web Page Classification using Naive Bayes
▫ Bernoulli model: a document is represented by a feature vector with binary elements, taking the value 1 if the corresponding word is present in the document and 0 if it is not.
▫ E.g.: consider a six-word vocabulary and the short document “the blue dog ate a blue biscuit”. The Bernoulli feature vector is b = (1, 0, 1, 0, 1, 0)^T.
▫ Consider a web page D whose class is given by C. We classify D as the class with the highest posterior probability P(C | D).
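
The feature vector in the example can be reproduced with a few lines of code. The vocabulary list itself is not given in the slide text, so the six words below are a hypothetical choice, picked only because they yield the stated vector for the stated document.

```python
def bernoulli_vector(document, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary (an assumption; the slide's own list is not available).
vocabulary = ["blue", "red", "dog", "cat", "biscuit", "apple"]

b = bernoulli_vector("the blue dog ate a blue biscuit", vocabulary)
print(b)  # [1, 0, 1, 0, 1, 0]
```

Note that "blue" occurs twice in the document but still contributes a single 1: the Bernoulli model records presence only, not frequency.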

Web Page Classification using Naive Bayes
The document likelihood P(D_i | C), where:
▫ b_i: the Bernoulli feature vector of document D_i.
▫ P(w_t | C): the probability of word w_t occurring in a document of class C.
▫ n_k(w_t): the number of documents of class C = k in which w_t is observed.
▫ N_k: the total number of documents of class C = k.
The prior term: P(C = k).
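
The formulas themselves are not reproduced in the slide text. Under the standard Bernoulli naive Bayes model, which matches the symbols defined above, they would read: likelihood P(D_i | C) = product over t of [ b_it P(w_t | C) + (1 - b_it)(1 - P(w_t | C)) ], word probability P(w_t | C = k) = n_k(w_t) / N_k, and prior P(C = k) = N_k / N, where N is the total number of training documents. The sketch below implements that model as an assumption about what the slide showed. It adds add-one smoothing, which is not mentioned in the slides, so that no probability is exactly 0 or 1; the function names and toy documents are illustrative.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(documents, labels, vocabulary):
    """Estimate priors P(C=k) and word probabilities P(w_t | C=k)."""
    vocab_set = set(vocabulary)
    class_counts = defaultdict(int)                          # N_k
    word_doc_counts = defaultdict(lambda: defaultdict(int))  # n_k(w_t)
    for doc, label in zip(documents, labels):
        class_counts[label] += 1
        for word in set(doc.lower().split()) & vocab_set:
            word_doc_counts[label][word] += 1

    n_docs = len(documents)                                  # N
    priors = {k: n_k / n_docs for k, n_k in class_counts.items()}
    # Assumption (not on the slide): add-one smoothing,
    # (n_k(w_t) + 1) / (N_k + 2), so that log() is always defined.
    cond = {k: {w: (word_doc_counts[k][w] + 1) / (class_counts[k] + 2)
                for w in vocabulary}
            for k in class_counts}
    return priors, cond


def classify(document, priors, cond, vocabulary):
    """Return the class with the highest log posterior (up to a constant)."""
    present = set(document.lower().split())
    best_label, best_score = None, -math.inf
    for k, prior in priors.items():
        score = math.log(prior)
        for w in vocabulary:
            p = cond[k][w]
            # b_t = 1 contributes P(w_t | C); b_t = 0 contributes 1 - P(w_t | C)
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = k, score
    return best_label


# Made-up training documents.
docs = ["homework lecture exam", "lecture syllabus",
        "research publications", "publications professor research"]
labels = ["course", "course", "faculty", "faculty"]
vocab = ["lecture", "exam", "research", "publications"]

priors, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify("exam lecture notes", priors, cond, vocab))  # course
print(classify("research grant", priors, cond, vocab))      # faculty
```

Scores are summed in log space to avoid underflow when the vocabulary is large. Absent words also contribute to the score, which is what distinguishes the Bernoulli model from a multinomial one.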

Experimental Results
Dataset: WebKB
▫ Contains 8145 web pages.
▫ Seven categories: student, faculty, staff, course, project, department, and other.
▫ Data were collected from the departments of four universities (Cornell, Texas, Washington, Wisconsin), plus some pages from other universities.
Experimental setup:
▫ Select the four most populous categories: student, faculty, course, and project.
▫ Training data: Cornell, Washington, Texas, and miscellaneous pages collected from other universities.
▫ Testing data: Wisconsin.

Experimental Results
Experimental result:
[Table: classes (Faculty, Course, Student, Project) by # of training pages, # of testing pages, and accuracy]

THANK YOU