Using Web Structure for Classifying and Describing Web Pages

Presentation transcript:

Using Web Structure for Classifying and Describing Web Pages
Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake
International World Wide Web Conference, 2002
Presented by Zaihan Yang, CSE Web Mining

Introduction
Aim:
- Classification of web pages
- Description of web pages (to name clusters of web pages)
Using web structure: extracting patterns from hyperlinks on the web.
A hyperlink provides:
- The destination page
- Associated anchortext describing the link

Typical Text-Based Classification
Uses the words (or phrases) of the target document itself, keeping only the most significant features.
Often not effective: e.g., the home page of General Motors (www.gm.com) does not state that it is a car company.
Document representations compared in this work:
- Full text
- Anchortext
- Extended anchortext
- A combination

Virtual Document
A virtual document is a collection of anchortexts or extended anchortexts from links pointing to the target document.
- Anchortext: the words occurring inside a link.
- Extended anchortext: the set of rendered words occurring up to 25 words before and after an associated link (as well as the anchortext itself).
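The extended-anchortext window described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the representation of the citing page as a word list are assumptions.

```python
def extended_anchortext(words, start, end, window=25):
    """Build an extended anchortext: the anchortext itself plus up to
    `window` rendered words before and after the link.

    `words` is the rendered word list of a citing page; the link's
    anchortext occupies words[start:end]."""
    lo = max(0, start - window)            # clamp at page start
    hi = min(len(words), end + window)     # clamp at page end
    return words[lo:hi]
```

A virtual document for a target page would then be the concatenation of such windows over all inbound links.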

Main Method
- Main procedure: a full-text classifier and a virtual-document classifier
- Two improvement methods
- Naming a cluster
The main procedure: datasets → features → expected-entropy-loss ranking → train SVM.

Datasets
- Positive: a set of web pages downloaded from various Yahoo! categories.
- Negative: random documents from outside Yahoo!
- Also: the WebKB dataset.
Features: all words and two- or three-word phrases.
E.g., for "My favorite game is scrabble," possible features include: my, my favorite, my favorite game, favorite, favorite game, etc.
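The word-and-phrase feature set above amounts to extracting all n-grams up to length three. A minimal sketch (the function name and lowercasing are assumptions; the paper's exact tokenization is not specified here):

```python
def phrase_features(text, max_len=3):
    """Extract all word n-grams of length 1..max_len as features,
    mirroring the slide's 'all words and two or three word phrases'."""
    words = text.lower().split()
    feats = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            feats.add(" ".join(words[i:i + n]))
    return feats
```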

Dimensionality Reduction
Goal: remove useless features. Two-step process:
1. Remove all features f that do not occur in a specified percentage of documents, i.e. those with (|Af|/|A| < T+) and (|Bf|/|B| < T-), where:
- A: the set of positive examples; B: the set of negative examples.
- Af: documents in A that contain feature f; Bf: documents in B that contain feature f.
- T+: threshold for positive features; T-: threshold for negative features.
2. Rank the remaining features based on expected entropy loss.
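Step 1 above can be sketched directly from the definitions: a feature is dropped only when it falls below the threshold in both the positive and the negative sets. The function name and the representation of documents as feature sets are assumptions for illustration.

```python
from collections import Counter

def threshold_filter(pos_docs, neg_docs, t_pos, t_neg):
    """Step 1 of dimensionality reduction: keep feature f unless
    |Af|/|A| < t_pos AND |Bf|/|B| < t_neg.

    pos_docs / neg_docs are lists of per-document feature sets."""
    pos_counts = Counter(f for d in pos_docs for f in set(d))
    neg_counts = Counter(f for d in neg_docs for f in set(d))
    kept = set()
    for f in set(pos_counts) | set(neg_counts):
        if (pos_counts[f] / len(pos_docs) >= t_pos or
                neg_counts[f] / len(neg_docs) >= t_neg):
            kept.add(f)
    return kept
```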

Expected Entropy Loss
Let C be the event that a document belongs to the positive class, and f the event that it contains the feature.
The prior entropy of the class distribution:
  e = -Pr(C) lg Pr(C) - Pr(¬C) lg Pr(¬C)
The posterior entropy of the class when the feature is present:
  e_f = -Pr(C|f) lg Pr(C|f) - Pr(¬C|f) lg Pr(¬C|f)
The posterior entropy of the class when the feature is absent:
  e_¬f = -Pr(C|¬f) lg Pr(C|¬f) - Pr(¬C|¬f) lg Pr(¬C|¬f)
The expected entropy loss:
  e - (e_f · Pr(f) + e_¬f · Pr(¬f))
A higher loss means the feature is more discriminative.
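The expected entropy loss above can be computed from simple document counts. A minimal sketch, assuming probabilities are estimated as raw frequencies (function names are illustrative):

```python
import math

def entropy(p):
    """Binary entropy in bits; 0·lg 0 is treated as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_entropy_loss(n_pos_f, n_pos, n_neg_f, n_neg):
    """Expected entropy loss for a feature f.

    n_pos_f: positive docs containing f; n_pos: all positive docs;
    n_neg_f, n_neg: likewise for negatives."""
    n = n_pos + n_neg
    n_f = n_pos_f + n_neg_f
    prior = entropy(n_pos / n)                           # e
    p_f = n_f / n                                        # Pr(f)
    e_f = entropy(n_pos_f / n_f) if n_f else 0.0         # e_f
    e_nf = entropy((n_pos - n_pos_f) / (n - n_f)) if n - n_f else 0.0
    return prior - (p_f * e_f + (1 - p_f) * e_nf)
```

A perfectly discriminating feature (present in every positive document and no negative one, with balanced classes) yields a loss of 1 bit; a feature distributed identically across both classes yields 0.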

Train SVM
Given a set of data points {(x1, y1), …, (xN, yN)}, where xi is an input and yi is a target output (+1 or -1).
Separating hyperplane: w·φ(xi) + b = 0, with
  w·φ(xi) + b ≥ 1 if yi = 1
  w·φ(xi) + b ≤ -1 if yi = -1
found by minimizing ||w||²/2 subject to yi(w·φ(xi) + b) ≥ 1.
Output for a new input x: f(x) = w·φ(x) + b; the predicted class is the sign of f(x).
A kernel function K(xi, xj) = φ(xi)·φ(xj) allows training without computing φ explicitly.
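Once trained, the classifier's raw output is just the (signed) distance-like score w·φ(x) + b. A minimal linear-kernel sketch of that output (the training procedure itself is omitted; the function name is an assumption):

```python
def svm_output(w, b, x):
    """Raw linear-SVM output w·x + b. The predicted class is its sign;
    |output| serves as a confidence score, used later for
    uncertainty sampling and the combination method."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```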

Improvement 1: Uncertainty Sampling
The output of an SVM classifier is a real number in (-∞, +∞). An output in the interval (-1, 1) is less certain than one in (-∞, -1] or [1, +∞); the interval (-1, 1) is called the "uncertain region."
Uncertainty sampling: a human judges the documents that fall in the uncertain region.
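Selecting the documents to route to a human judge is then a one-line filter on the raw SVM outputs (a sketch; the function name is illustrative):

```python
def uncertain_region(outputs):
    """Return indices of documents whose raw SVM output lies in the
    uncertain region (-1, 1); these are passed to a human judge."""
    return [i for i, o in enumerate(outputs) if -1 < o < 1]
```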

Improvement 2: Combination
Combine results from the extended-anchortext (extended-AT) classifier with the less accurate full-text classifier:
1. Classify the web page with the extended-AT classifier.
2. If the result is positive, or confidently negative, keep it.
3. If the result is negative but uncertain, also run the full-text classifier: label the page positive only if the full-text output is positive and |output| > |output_AT|; otherwise label it negative.
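The decision procedure above can be sketched as follows. This is a reconstruction of the slide's (garbled) flowchart, not verbatim from the paper; the -1 threshold for "confidently negative" reuses the uncertain-region boundary and is an assumption, as are the names.

```python
def combined_label(out_at, full_text_classify):
    """Combination rule (reconstructed): trust the extended-anchortext
    output unless it is negative but uncertain, in which case accept a
    positive full-text verdict only if it is more confident.

    out_at: raw extended-AT SVM output.
    full_text_classify: zero-argument callable returning the raw
    full-text SVM output (only invoked when needed)."""
    if out_at >= 0:
        return "positive"
    if out_at <= -1:                      # confidently negative
        return "negative"
    out_ft = full_text_classify()         # negative but uncertain
    if out_ft > 0 and abs(out_ft) > abs(out_at):
        return "positive"
    return "negative"
```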

Naming the Cluster
Use the top-ranked features extracted from the extended-anchortext virtual documents to name a cluster.
Beliefs:
- The words near the anchortext are descriptions of the target document.
- The features ranked highest by expected entropy loss are those that occur in many positive examples and few negative ones.

Results: Classification
- Anchortext alone is comparable to the full text for classification purposes.
- Classification accuracy improves significantly when using the extended anchortext instead of the document full text.
- The combination method is highly effective for improving positive-class accuracy, but reduces negative-class accuracy.
- Uncertainty sampling required examining only 8% of the documents on average, while providing an average positive-class accuracy improvement of almost 10 percentage points.

Results: Clustering
- For describing a category, the full text appears comparable to the extended anchortext.
- The anchortext alone appears to do a poor job of describing the category.

Future Work
- Include other features of the inbound web pages besides extended anchortext.
- Examine the effects of the number of inbound links.
- Examine the nature of the category by expanding this to thousands of categories.
- Study the effects of the positive set size.