Web Page Classification: Features and Algorithms
Paper by: Xiaoguang Qi and Brian D. Davison
Presentation by: Jason Bender



Outline
• Introduction to Classification
• Background
  ○ Classification Types
  ○ Classification Methods
• Applications
• Features
• Algorithms
• Evolution of Websites

What is web page classification?
• The process of assigning a web page to one or more predefined category labels (e.g., news, sports, business)
• Classification is generally posed as a supervised learning problem
  ○ A set of labeled data is used to train a classifier, which is then applied to label future examples
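
The train-then-label loop described above can be sketched in a few lines. This is only an illustration, assuming a multinomial naive Bayes model with add-one smoothing; the slides do not name a specific classifier, and the toy pages, labels, and function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Count words per label from (text, label) pairs (hypothetical toy data)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_pages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest naive Bayes log-score for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_pages = [
    ("team wins the match", "sports"),
    ("striker scores late goal", "sports"),
    ("stock market falls", "business"),
    ("company reports profit", "business"),
]
model = train(training_pages)
print(predict(model, "late goal wins match"))
print(predict(model, "stock market profit"))
```

The labeled pages play the role of the training set; `predict` is the trained classifier applied to unseen pages.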

Background - Classification Types
• The supervised learning problem is broken into subproblems:
  ○ Subject Classification
  ○ Functional Classification
  ○ Sentiment Classification
  ○ Other types of Classification

Subject Classification
• Concerned with the subject or topic of the web page
  ○ Judging whether a page is about arts, business, sports, etc.
Functional Classification
• Concerned with the role that the page plays
  ○ Deciding whether a page is a personal homepage, course page, admissions page, etc.

Sentiment Classification
• Focuses on the opinion that is presented in a web page
Other types of Classification
• Such as genre classification and search engine spam classification

Background - Classification Methods
• Binary vs. Multiclass
• Single-Label vs. Multi-Label
• Soft vs. Hard
• Flat vs. Hierarchical

Binary vs. Multiclass Classification

Single-Label vs. Multi-Label Classification

Soft vs. Hard Classification

Flat vs. Hierarchical Classification
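
The slides name these distinctions without defining them. As usually understood, a soft classifier reports a score for every category, a hard classifier commits to one category, and a multi-label classifier may assign several. The sketch below illustrates this with a made-up score table; the function names and the 0.3 threshold are assumptions for the example.

```python
def hard_label(scores):
    """Hard, single-label decision: keep only the top-scoring category."""
    return max(scores, key=scores.get)

def multi_labels(scores, threshold=0.3):
    """Multi-label decision: keep every category whose score clears the threshold."""
    return sorted(label for label, p in scores.items() if p >= threshold)

# Soft output: one (hypothetical) probability per category for a single page.
scores = {"news": 0.5, "sports": 0.4, "business": 0.1}

print(hard_label(scores))    # single hard label
print(multi_labels(scores))  # possibly several labels
```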

Applications
• Why is classification important and how can we use it efficiently?

Constructing, maintaining, or expanding web directories
• Web directories provide an efficient way to browse for information within a predefined set of categories
• Example: Open Directory Project (ODP)
• Currently constructed by human effort
  ○ 78,940 ODP editors

Improving the quality of search results
• A major problem with search results is search ambiguity

Helping question answering systems
• Classification systems can help improve the quality of answers
• Example: Wolfram Alpha
Other applications
• Contextual advertising

Features
• What features can we extract from a web page to help classify it?

Features - Introduction
• Because of features such as the hyperlink …, webpage classification is vastly different from other forms of classification such as plaintext classification.
• Features are organized into two groups:
  ○ On-page features: located directly on the page
  ○ Neighbor features: found on related pages

On-Page Features
• Textual Contents & Tags
  ○ Bag-of-words
    - N-gram feature: rather than analyzing individual words, group them into sequences of n consecutive words
    - Ex: "New York" vs. "new ….. ….. York"
    - Yahoo! has used a 5-gram feature
  ○ HTML tags: title, heading, metadata, main text
  ○ URL
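
A minimal sketch of n-gram extraction, assuming whitespace tokenization and lowercasing (real systems typically tokenize more carefully). The function name and sample text are illustrative only.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "flights to New York from Boston"
unigrams = word_ngrams(doc, 1)  # plain bag-of-words terms
bigrams = word_ngrams(doc, 2)   # 2-grams keep "new york" together

print(unigrams)
print(bigrams)
```

With unigrams, "new" and "york" are unrelated terms; the 2-gram feature preserves the phrase as a single feature.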

On-Page Features
• Visual Analysis
  ○ Each page has two representations
    - Text via HTML
    - Visual via the browser
  ○ Each page can be represented as a visual adjacency multigraph

Features of Neighbors
• What happens when a page’s features are missing or unrecognizable?

Features of Neighbors
• Assumptions
  ○ If page1 is in the neighborhood of many “sports” pages, then the probability that page1 is also a “sports” page increases.
  ○ Linked pages are more likely to have terms in common

Features of Neighbors
• Neighbor Selection
  ○ Focus on pages within 2 steps of the target
  ○ 6 types: parent, child, sibling, spouse, grandparent, and grandchild
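
The six neighbor types can be derived from a link graph by following at most two links. The sketch below assumes the usual reading of the terms: parents link to the target, children are linked from it, siblings share a parent with it, and spouses share a child with it. The edge list and page names are hypothetical.

```python
from collections import defaultdict

def neighbors(links, target):
    """Group pages within two link steps of target into the six neighbor types."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages that link to it
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)

    def step(pages, table):
        """Follow one more link (in the direction given by table) from each page."""
        result = set()
        for page in pages:
            result |= table.get(page, set())
        return result

    parents = set(in_links.get(target, set()))
    children = set(out_links.get(target, set()))
    groups = {
        "parent": parents,
        "child": children,
        "sibling": step(parents, out_links),      # other children of my parents
        "spouse": step(children, in_links),       # other parents of my children
        "grandparent": step(parents, in_links),
        "grandchild": step(children, out_links),
    }
    # The target itself is never its own neighbor.
    return {kind: pages - {target} for kind, pages in groups.items()}

links = [("hub", "t"), ("hub", "s"), ("root", "hub"),
         ("t", "c"), ("other", "c"), ("c", "g")]
result = neighbors(links, "t")
for kind, pages in result.items():
    print(kind, sorted(pages))
```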

Features of Neighbors
• Labels
• Anchor Text
• Surrounding Anchor Text
• By using the anchor text, surrounding text, and page title of a parent page in combination with text from the target page, classification can be improved.

Features of Neighbors
• Implicit Links
  ○ Connections between pages that appear in the results of the same query and are both clicked by users
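
A rough sketch of how implicit links might be mined from a click log. It assumes the log is a list of (query, clicked URL) pairs, which is a simplification; the log contents and function name are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Count, for each pair of pages, how many queries led to clicks on both."""
    clicked = defaultdict(set)  # query -> set of clicked URLs
    for query, url in click_log:
        clicked[query].add(url)

    counts = defaultdict(int)
    for urls in clicked.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(urls), 2):
            counts[(a, b)] += 1
    return dict(counts)

log = [
    ("jaguar", "a.com"), ("jaguar", "b.com"),
    ("car", "b.com"), ("car", "a.com"),
    ("zoo", "c.com"),
]
pairs = implicit_links(log)
print(pairs)
```

The count per pair can serve as a link weight: pages co-clicked under many queries are more strongly connected than pages co-clicked once.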

Algorithms
• What are the algorithmic approaches to webpage classification?
  ○ Dimension reduction
  ○ Relational learning
  ○ Hierarchical classification
  ○ Information combination

Dimension Reduction
• Boost classification by emphasizing the features that are most useful for classification
  ○ Feature Weighting
    - Reduces the dimensions of the feature space
    - Reduces computational complexity
    - Classification can be more accurate as a result of the reduced space

Dimension Reduction
• Methods
  ○ Use first fragment
  ○ K-nearest neighbor algorithm
    - Weighted features
    - Weighted HTML tags
    - Metrics: expected mutual information, mutual information
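
As an illustration of one of the metrics listed above, the sketch below scores a term by its expected mutual information with the class label, I(T; C) = sum over t and c of P(t, c) * log2(P(t, c) / (P(t) * P(c))), where t is whether the term is present. It assumes each page is reduced to a set of words; the toy pages and labels are made up.

```python
import math
from collections import Counter

def expected_mutual_information(docs, labels, term):
    """Expected mutual information (in bits) between term presence and the label."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts = Counter(term in doc for doc in docs)
    label_counts = Counter(labels)

    score = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_term = term_counts[present] / n
        p_label = label_counts[label] / n
        score += p_joint * math.log2(p_joint / (p_term * p_label))
    return score

docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
labels = ["sports", "sports", "business", "business"]

goal_score = expected_mutual_information(docs, labels, "goal")
team_score = expected_mutual_information(docs, labels, "team")
print(goal_score, team_score)
```

A term that separates the classes perfectly scores high, while a term spread evenly across classes scores zero; keeping only high-scoring terms is one way to shrink the feature space.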

Relational Learning
• Relaxation Labeling

Hierarchical Classification
• Based on “divide and conquer”
  ○ The classification problem is split into a hierarchical set of subproblems.
• Error Minimization
  ○ When a lower-level category is uncertain whether a page belongs to it, shift the assignment one level up.
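
One way to picture the divide-and-conquer descent with the "shift one level up" rule is the sketch below. It assumes per-category confidence scores are already available (in practice they would come from classifiers trained at each node); the category tree, scores, and 0.6 threshold are hypothetical.

```python
def classify_top_down(page_scores, tree, root="root", threshold=0.6):
    """Descend the category tree; stop at the parent when no child is confident."""
    node = root
    while tree.get(node):  # continue while the current node has children
        best = max(tree[node], key=lambda child: page_scores.get(child, 0.0))
        if page_scores.get(best, 0.0) < threshold:
            break  # uncertain at this level: keep the assignment one level up
        node = best
    return node

tree = {"root": ["sports", "business"], "sports": ["soccer", "tennis"]}

unsure = {"sports": 0.9, "business": 0.1, "soccer": 0.5, "tennis": 0.4}
confident = {"sports": 0.9, "business": 0.1, "soccer": 0.8, "tennis": 0.2}

print(classify_top_down(unsure, tree))
print(classify_top_down(confident, tree))
```

In the first case the page is clearly about sports but neither subcategory is confident, so it stays at "sports" rather than risking a wrong leaf.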

Information Combination
• Combine several methods into one
  ○ Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
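
A minimal sketch of combining classifiers, assuming each classifier was trained on a different information source and that their outputs are merged by a weighted vote. Voting is only one possible combination scheme; the source names, labels, and weights are invented for the example.

```python
from collections import Counter

def combine_votes(predictions, weights=None):
    """Return the label with the largest total weight across classifiers."""
    weights = weights or {}
    tally = Counter()
    for source, label in predictions.items():
        tally[label] += weights.get(source, 1.0)  # unweighted sources count as 1
    return tally.most_common(1)[0][0]

# Each key is a (hypothetical) classifier trained on one information source.
predictions = {"content": "sports", "anchor_text": "sports", "url": "news"}

print(combine_votes(predictions))
print(combine_votes(predictions, weights={"url": 3.0}))
```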

Conclusion
• Webpage classification is a type of supervised learning problem aiming to categorize a webpage into a predefined set of categories.
• In the future, efforts will most likely be focused on effectively combining content and link information to build a more accurate classifier.

Evolution of Websites
• Apple in 1998

Evolution of Websites
• Apple in 2008

Evolution of Websites
• Nike in 2000

Evolution of Websites
• Nike in 2008

Evolution of Websites
• Yahoo in 1996

Evolution of Websites
• Yahoo in 2008

Evolution of Websites
• Microsoft in 1998

Evolution of Websites
• Microsoft in 2008

Evolution of Websites
• MTV in 1998

Evolution of Websites
• MTV in 2008

Sources
• Web Page Classification: Features and Algorithms, by Xiaoguang Qi and Brian D. Davison
• Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification, by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic
• The Evolution of Websites popular-websites.aspx