Using Link Information to Enhance Web Page Classification

Using Link Information to Enhance Web Page Classification Xiaoguang Qi, Brian D. Davison

Introduction
- Web page classification is important
  - Browsing information through topics
  - Query result tagging
  - Finding similar documents
  - Clustering query results
- Applying textual classifiers on web data
  - Not satisfying

Our Approach
- Use information from neighboring pages to help judge a page’s topic
- Four kinds of neighbors: parents, children, siblings, co-spouses
- Neighboring pages may have been labeled
  - Appearing in web hierarchies
  - Use the labels if available
  - Pages without existing labels: use a classifier
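A minimal sketch of how the four neighbor sets could be collected from a link graph. It assumes the usual definitions, which the slides do not spell out: parents link to the page, children are linked from it, siblings share a parent with it, and co-spouses share a child with it. The function name `neighbor_sets` and the edge-list input are illustrative choices, not taken from the presentation.

```python
from collections import defaultdict


def neighbor_sets(edges, page):
    """Return the parents, children, siblings and co-spouses of `page`.

    `edges` is an iterable of (source, target) hyperlinks.
    Definitions are assumed: siblings share a parent, co-spouses share a child.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = set(in_links[page]) - {page}
    children = set(out_links[page]) - {page}

    # Siblings: every other page linked from one of our parents.
    siblings = set()
    for parent in parents:
        siblings |= out_links[parent]
    siblings.discard(page)

    # Co-spouses: every other page linking to one of our children.
    spouses = set()
    for child in children:
        spouses |= in_links[child]
    spouses.discard(page)

    return {
        "parents": parents,
        "children": children,
        "siblings": siblings,
        "spouses": spouses,
    }


# Hypothetical toy graph: a -> p, a -> s, p -> c, q -> c
toy_edges = [("a", "p"), ("a", "s"), ("p", "c"), ("q", "c")]
toy_result = neighbor_sets(toy_edges, "p")
```

In the toy graph, `s` is a sibling of `p` (both are linked from `a`) and `q` is a co-spouse (both link to `c`).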

Other Considerations
- Are the four kinds of neighbors equally important?
  - Give them different weights: β = (β1, β2, β3, β4)
- The use of a classifier may introduce noise
  - Down-weight the results of the classifier: η, 0 ≤ η ≤ 1
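One way the two weights could interact is sketched below. The details are assumptions rather than the authors' stated formula: a neighbor with an existing label counts with weight 1, a neighbor labeled by the classifier counts with weight η, each kind of neighbor is averaged separately, and the per-kind averages are then mixed using β. The function name and the dictionary layout are hypothetical.

```python
def neighbor_distribution(neighbors, beta, eta):
    """Combine neighbor topic vectors into one distribution.

    neighbors: {kind: [(topic_vector, has_existing_label), ...]}
    beta:      {kind: weight for that kind of neighbor}
    eta:       weight in [0, 1] for vectors produced by the classifier
    Returns a list summing to 1, or None if no neighbor contributes.
    """
    combined = None
    for kind, pages in neighbors.items():
        total_weight = 0.0
        kind_sum = None
        for vector, has_label in pages:
            # Assumed rule: existing labels count fully, classifier output by eta.
            w = 1.0 if has_label else eta
            if kind_sum is None:
                kind_sum = [0.0] * len(vector)
            kind_sum = [s + w * v for s, v in zip(kind_sum, vector)]
            total_weight += w
        if not total_weight:
            continue  # no usable neighbors of this kind
        if combined is None:
            combined = [0.0] * len(kind_sum)
        b = beta.get(kind, 0.0)
        combined = [c + b * s / total_weight for c, s in zip(combined, kind_sum)]
    if combined is None or sum(combined) == 0:
        return None
    norm = sum(combined)
    return [c / norm for c in combined]


# Two categories; one sibling has an existing label, one was classified.
example = {"siblings": [([1.0, 0.0], True), ([0.0, 1.0], False)]}
only_labels = neighbor_distribution(example, {"siblings": 1.0}, eta=0.0)
equal_trust = neighbor_distribution(example, {"siblings": 1.0}, eta=1.0)
```

Under this reading, η = 0 ignores classifier output entirely and η = 1 trusts it as much as an existing label.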

Other Considerations (Cont.)
- Do intra-host links count?
  - They are often down-weighted or ignored in link-based ranking
  - Web page classification is a different scenario
  - Give them a weight: θ (θ = 0, 1)
- Counting multiple paths
  - Siblings may have multiple parents in common
  - Weighted path version vs. unweighted path version
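The sketch below illustrates both ideas for siblings. It rests on assumptions the slides do not confirm: an intra-host link gets weight θ and any other link gets weight 1, a two-hop path is weighted by the product of its link weights, the weighted path version sums over all common parents, and the unweighted version counts each sibling once. The helper names and the adjacency dictionaries are hypothetical.

```python
from urllib.parse import urlsplit


def link_weight(src, dst, theta):
    """Weight theta for a link within one host, 1.0 otherwise (assumed rule)."""
    same_host = urlsplit(src).hostname == urlsplit(dst).hostname
    return theta if same_host else 1.0


def sibling_weights(page, parents_of, children_of, theta, weighted_paths=True):
    """Map each sibling URL to its weight relative to `page`.

    parents_of / children_of: {url: iterable of urls}
    weighted_paths=True  sums over every common parent (one path each);
    weighted_paths=False counts a sibling once, however many parents it shares.
    """
    weights = {}
    for parent in parents_of.get(page, ()):
        up = link_weight(parent, page, theta)
        for sibling in children_of.get(parent, ()):
            if sibling == page:
                continue
            path = up * link_weight(parent, sibling, theta)
            if weighted_paths:
                weights[sibling] = weights.get(sibling, 0.0) + path
            else:
                weights[sibling] = max(weights.get(sibling, 0.0), path)
    return weights


# Hypothetical URLs: the page and its sibling share two parents on other hosts.
page = "http://a.example/p"
sib = "http://b.example/s"
parents_of = {page: ["http://x.example/1", "http://y.example/1"]}
children_of = {
    "http://x.example/1": [page, sib],
    "http://y.example/1": [page, sib],
}
summed = sibling_weights(page, parents_of, children_of, theta=1.0)
counted_once = sibling_weights(page, parents_of, children_of, theta=1.0,
                               weighted_paths=False)
```

With two common parents, the sibling's weight is 2.0 in the weighted path version and 1.0 in the unweighted one.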

Other Considerations (Cont.)
- Combining neighbors with the start page
  - Weighted average with α (0 ≤ α ≤ 1): α × start page + (1 − α) × neighbors
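The weighted average can be written out directly. In this sketch both inputs are assumed to be topic distributions over the same ordered list of categories; the function name is illustrative.

```python
def combine_with_start(start, neighbors, alpha):
    """Return alpha * start + (1 - alpha) * neighbors, element by element."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(start) != len(neighbors):
        raise ValueError("both vectors must cover the same categories")
    return [alpha * s + (1.0 - alpha) * n for s, n in zip(start, neighbors)]


# The page's own text favors category 0, its neighbors favor category 1.
scores = combine_with_start([1.0, 0.0], [0.0, 1.0], alpha=0.2)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

A small α lets the neighbors dominate: here the combined scores favor category 1 even though the page's own text points to category 0.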

Experimental Setup
- 12 top-level categories in the DMoz Directory
- 19,000 documents from each category to train the text classifier
- 1,000 for testing
- Get incoming links by querying the Yahoo API

Parameter Tuning

Experimental Results
- Best performance is achieved at the settings α = 0.2, β = (0, 0, 1, 0), η = 0, θ = 1, weighted path version

Experimental Results (Cont.)
- “DMoz copy effect”
  - We are benefiting from it!
  - It may affect the optimal parameter setting
- Solution
  - Remove the pages whose URL contains directory names of DMoz
  - E.g. “Computers/Hardware”, “Business/Employment”
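A possible form of this filter is sketched below. The matching rule is an assumption (a case-insensitive substring test on the percent-decoded URL), and the function name and example URLs are hypothetical; only the two directory names come from the slide.

```python
from urllib.parse import unquote


def looks_like_dmoz_copy(url, directory_names):
    """True if the URL contains any of the given DMoz directory names."""
    decoded = unquote(url).lower()
    return any(name.lower() in decoded for name in directory_names)


dmoz_dirs = ["Computers/Hardware", "Business/Employment"]
candidate_urls = [
    "http://mirror.example/dir/Computers/Hardware/index.html",
    "http://shop.example/products/laptops.html",
]
# Keep only neighbors that do not look like copies of the directory.
kept = [u for u in candidate_urls if not looks_like_dmoz_copy(u, dmoz_dirs)]
```

A substring test is deliberately crude: it may also drop unrelated pages whose paths happen to match a directory name.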

Conclusion
- Improved the accuracy of web page classification
- Explored the effects of a number of parameters

Future Work
- Is the parameter tuning independent of the dataset?
- Is our dataset representative of the web?
- Other classifiers
- What is the effect of the number of categories and the granularity of the categories?