1 SIGIR 2004 Web-page Classification through Summarization Dou Shen Zheng Chen * Qiang Yang Presentation : Yao-Min Huang Date : 09/15/2004.

Presentation transcript:

1 SIGIR 2004 Web-page Classification through Summarization Dou Shen Zheng Chen * Qiang Yang Presentation : Yao-Min Huang Date : 09/15/2004

2 Outline Motivation Related work Architecture Overview Summarizer Methods (1~4) Experiments Conclusion Future Work

3 Motivation. Goal: help Web users find the information they want. Browse: navigate through hierarchical collections. Search: submit a query to a search engine. Much work has been done on Web-page classification, e.g. using hyperlinks. Summarization is a good way to filter the noise from a Web page.

4 Related Work. Overview of summarization. Goal of summarization: summary generation methods seek to identify the document contents that convey the most “important” information within the document. Types of summarization: indicative vs. informative; extraction vs. abstraction; generic vs. query-oriented; unsupervised vs. supervised; single-document vs. multi-document.

5 Related Work (Cont.): Summarization in IR. Methods: unsupervised methods (cluster and select); supervised methods. Applications: generic summaries for indexing in information retrieval (Tetsuya Sakai, SIGIR 2001); term selection in relevance feedback for IR (A. M. Lam-Adesina, SIGIR 2001).

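6 Architecture Overview. [Diagram] Summarization: the train set and the testing set are turned into train summaries and testing summaries by one of the summarizers (Luhn, LSA, Supervised, Page-layout analysis, Human, or their Ensemble). Classification (10-fold cross validation): a classifier (NB/SVM) is trained on the train summaries and applied to the testing summaries to produce the result.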

7 Summarizer 1: Adapted Luhn’s Method (IBM Journal, 1958). Assumption: the more “significant words” there are in a sentence and the closer together they are, the more meaningful the sentence is. Approach: the sentences with the highest significance factor are selected to form the summary. Example: - - - [ # - - # - # # # ] - - - #, significance factor = 5*5/8 = 3.125. #: a significant word, i.e. a word whose frequency is between the high-frequency cutoff and the low-frequency cutoff. Limit L = 2: the distance within which two # are considered significantly related.
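
The significance factor in the example can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name `significance_factor` is hypothetical, and the limit is read as the largest number of non-significant words allowed between two significant words of the same bracketed cluster, which is the reading that reproduces 5*5/8 on the slide's example.

```python
def significance_factor(flags, limit=2):
    """Return the best cluster score: (significant words)^2 / (cluster length).

    flags: one boolean per word, True if the word is significant.
    limit: ASSUMPTION: the largest number of non-significant words allowed
           between two significant words that belong to the same cluster.
    """
    best = 0.0
    start = last = None  # first and most recent significant word of the open cluster
    count = 0            # significant words in the open cluster
    for i, significant in enumerate(flags):
        if not significant:
            continue
        if last is not None and i - last - 1 > limit:
            # Gap too long: score the open cluster and start a new one here.
            best = max(best, count * count / (last - start + 1))
            start, count = None, 0
        if start is None:
            start = i
        count += 1
        last = i
    if count:
        best = max(best, count * count / (last - start + 1))
    return best


# The slide's example: "-" is a non-significant word, "#" a significant one.
example = [c == "#" for c in "---#--#-###---#"]
print(significance_factor(example, limit=2))
```

The trailing # in the example is three non-significant words away from the bracket, so with L = 2 it forms its own one-word cluster and does not change the maximum.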

8 Summarizer 1: Adapted Luhn’s Method (Cont.). [Diagram] Original method: a Web page -> significant word pool -> sentences in this page -> summary. Adapted method: build one significant word pool per category (Cat 1, Cat 2, …, Cat m). For train pages: the sentences in a training page from category m are scored against the significant word pool of category m to form the summary. For testing pages: the sentences from a testing page are scored against the significant word pools of the categories and the scores are averaged to form the summary.

9 Summarizer 2: Latent Semantic Analysis (SIGIR 2001). A fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. Overview: given an $m \times n$ term-by-sentence matrix $A$, compute $A = U \Sigma V^T$. $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r, 0, \dots, 0)$ holds the sorted singular values, with $r = \mathrm{rank}(A)$. $U$ ($m \times n$): its column vectors are called left singular vectors (salient patterns among the terms). $V$ ($n \times n$): its column vectors are called right singular vectors (salient patterns among the sentences).

10 Summarizer 2: Latent Semantic Analysis (Cont.). $A = U \times \Sigma \times V^T$. Select the sentence which has the largest index value with the right singular vector (column vector of $V$).
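
The selection step can be sketched with NumPy's `numpy.linalg.svd`. This is a minimal illustration, not the presenters' implementation: the helper name `lsa_select` and the toy matrix are hypothetical, and two choices are assumptions of the sketch, namely comparing absolute index values (the sign of a singular vector is arbitrary) and skipping a sentence that was already chosen.

```python
import numpy as np


def lsa_select(A, k):
    """Pick up to k sentence indices from a term-by-sentence matrix A."""
    # Rows of Vt are the right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    chosen = []
    for row in Vt[:k]:
        # ASSUMPTION: rank by absolute value because singular-vector signs are arbitrary.
        for idx in np.argsort(-np.abs(row)):
            if int(idx) not in chosen:  # do not pick the same sentence twice
                chosen.append(int(idx))
                break
    return chosen


# Toy matrix: rows are terms, columns are sentences.
# Sentences 0 and 1 share the same two terms; sentence 2 uses a different term.
A = [[2, 1, 0],
     [2, 1, 0],
     [0, 0, 1]]
print(lsa_select(A, k=2))
```

On this toy matrix the first right singular vector is dominated by sentence 0 and the second by sentence 2, so one sentence is taken from each of the two topics.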

11 Summarizer 3: Summarization by Page Layout Analysis (WWW10,2001) In HTML content, a BO (Basic Object) is a non-breakable element within two tags or an embedded Object.

12 Summarizer 3: Summarization by Page Layout Analysis. Analyze the structure of Web pages. Compute the similarity graph between objects: nodes = objects; weight of edge = similarity. Get the core object. Extract the content body (CB). [Figure: example page layout with Header, Search Box, Main Body, Navigation List and Copyright regions.]

13 Summarizer 3: Summarization by Page Layout Analysis (Cont.). Algorithm to detect the Content Body (CB): (1) Consider each selected object as a single document and build the TF*IDF index for the object. (2) Calculate the similarity between any two objects using cosine similarity, and add a link between them if their similarity is greater than a threshold. (3) Define the core object as the object having the most edges. (4) Extract the CB as the combination of all objects that have edges linked to the core object. Summary: all sentences that are included in the content body make up the summary of the Web page.
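
The four steps of the CB detection algorithm can be sketched in plain Python. Everything specific here is an assumption made for illustration: whitespace tokenization, the weighting `tf * log(N / df)`, the threshold value 0.1, the function names, and the five toy "objects".

```python
import math
from collections import Counter


def tfidf_vectors(objects):
    """Step 1: treat each object as a document and build TF*IDF vectors."""
    docs = [Counter(text.lower().split()) for text in objects]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_body(objects, threshold=0.1):
    """Return (core object index, sorted indices of the objects in the CB)."""
    vecs = tfidf_vectors(objects)
    n = len(objects)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Step 2: link two objects whose similarity exceeds the threshold.
            if cosine(vecs[i], vecs[j]) > threshold:
                edges[i].add(j)
                edges[j].add(i)
    # Step 3: the core object has the most edges (ties go to the earliest object).
    core = max(range(n), key=lambda i: len(edges[i]))
    # Step 4: the CB is the core object plus every object linked to it.
    return core, sorted(edges[core] | {core})


objects = [
    "web page classification summarization",
    "classification of web page text",
    "summarization improves classification",
    "copyright 2004 all rights reserved",
    "home search login",
]
print(content_body(objects))
```

In the toy input the copyright and navigation objects share no terms with the others, so they get no edges and are left out of the content body.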

14 Summarizer 3: Summarization by Page Layout Analysis (Cont.). [Figure: pairwise similarity matrix for Obj1 to Obj4 and the resulting Content Body.]

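15 Naïve Bayes Classifier (lecture of ML&DM, Berlin 2004). Assume a target function $f: X \to V$, where each instance $x$ is described by attributes $\langle a_1, a_2, \dots, a_n \rangle$. The most probable value of $f(x)$ is $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \dots, a_n) = \arg\max_{v_j \in V} P(a_1, \dots, a_n \mid v_j) P(v_j)$. Naïve Bayes assumption: $P(a_1, \dots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$. Naïve Bayes classifier: $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$, used to predict the target value/classification.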

16 Naïve Bayes: Example. Given a data set Z with 3-dimensional Boolean examples, train a naïve Bayes classifier to predict the classification D. What is the predicted probability?
Attribute A   Attribute B   Attribute C   Classification D
F             T             F             T
F             F             T             T
T             F             F             T
T             F             F             F
F             T             T             F
F             F             T             F

17 Naïve Bayes: Example
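
A small Python sketch of how such a prediction could be computed from the six examples. The worked solution from the slides is not in the transcript, so the query used below (A = T, B = F, C = T) is a hypothetical one chosen for illustration. The sketch also assumes that each four-letter group in the flattened table is one row (A, B, C, D) and that probabilities are plain relative frequencies without smoothing.

```python
from fractions import Fraction

# Rows (A, B, C, D), read from the table under the assumption stated above.
ROWS = [
    ("F", "T", "F", "T"),
    ("F", "F", "T", "T"),
    ("T", "F", "F", "T"),
    ("T", "F", "F", "F"),
    ("F", "T", "T", "F"),
    ("F", "F", "T", "F"),
]


def posterior_true(query, rows=ROWS):
    """Return P(D = T | query) under the naive Bayes assumption (no smoothing)."""
    scores = {}
    for label in ("T", "F"):
        members = [r for r in rows if r[3] == label]
        score = Fraction(len(members), len(rows))      # prior P(D = label)
        for i, value in enumerate(query):
            matches = sum(1 for r in members if r[i] == value)
            score *= Fraction(matches, len(members))   # P(attribute i = value | D = label)
        scores[label] = score
    # Normalize the two class scores into a probability.
    return scores["T"] / (scores["T"] + scores["F"])


# Hypothetical query, not taken from the slides.
print(posterior_true(("T", "F", "T")))
```

Exact fractions are used so the normalization step stays readable; with unseen attribute values a real classifier would usually add smoothing, which this sketch omits.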

18 Summarizer 4: Supervised Summarization (SIGIR 1995). Features, given sentence $S_i$:
$F_{i1}$: the position of sentence $S_i$ in a certain paragraph;
$F_{i2}$: the length of sentence $S_i$;
$F_{i3}$: $\sum_w TF_w \cdot SF_w$ (term frequency, sentence frequency);
$F_{i4}$: the similarity (cosine) between $S_i$ and the title;
$F_{i5}$: the similarity between $S_i$ and all text in the page;
$F_{i6}$: the similarity between $S_i$ and the meta-data in the page;
$F_{i7}$: the number of occurrences in $S_i$ of words from a special word set (italic, bold or underlined words);
$F_{i8}$: the average font size of the words in $S_i$.

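19 Summarizer 4: Supervised Summarization (Cont.). Classifier: $P(s \in S \mid f_1, \dots, f_8) = \frac{\prod_{i=1}^{8} P(f_i \mid s \in S) \, P(s \in S)}{\prod_{i=1}^{8} P(f_i)}$. Each sentence is then assigned a score by the above equation, where $P(s \in S)$ stands for the compression rate, $P(f_i)$ is the probability of each feature $i$, and $P(f_i \mid s \in S)$ is the conditional probability of each feature $i$.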
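
A minimal Python sketch of a naive-Bayes style sentence score of the kind this slide describes: the compression rate times the product of the conditional feature probabilities, divided by the product of the feature probabilities. That form is an assumption (the equation itself is not in the transcript), and the function name and the probability values in the example are made up for illustration.

```python
import math


def sentence_score(cond_probs, feature_probs, compression_rate):
    """Score = P(s in S) * prod P(f_i | s in S) / prod P(f_i).

    cond_probs:       P(f_i | s in S) for the feature values of this sentence
    feature_probs:    P(f_i) for the same feature values
    compression_rate: plays the role of the prior P(s in S)
    """
    return compression_rate * math.prod(cond_probs) / math.prod(feature_probs)


# Illustrative numbers only: two features, 20% compression rate.
print(sentence_score([0.5, 0.4], [0.25, 0.2], 0.2))
```

Sentences would then be ranked by this score and the top ones kept until the target compression rate is reached.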

20 Ensemble Summarizers. The final score for each sentence is calculated by summing the individual scores obtained from each method used, each weighted by $w_i$. Schema1: the weight of each summarization method is assigned in proportion to the performance of that method (its micro-F1 value). Schema2-5: the value of $w_i$ ($i$ = 1, 2, 3, 4) is increased to 2 in Schema2 to Schema5 respectively, with the other weights kept at 1.
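
The weighted combination can be sketched in a few lines of Python. The method keys, the per-sentence scores and the helper names are hypothetical; the sketch also assumes that the scores of the four summarizers are already on comparable scales, which the slides do not discuss.

```python
def ensemble_scores(scores, weights):
    """Weighted sum of per-sentence scores across summarization methods."""
    n = len(next(iter(scores.values())))
    return [sum(weights[m] * scores[m][i] for m in scores) for i in range(n)]


def top_sentences(final, rate):
    """Keep the best-scoring fraction `rate` of sentences, in document order."""
    keep = max(1, round(rate * len(final)))
    ranked = sorted(range(len(final)), key=lambda i: -final[i])
    return sorted(ranked[:keep])


# Made-up scores for a page with four sentences.
scores = {
    "luhn":       [1.0, 0.0, 2.0, 0.5],
    "lsa":        [0.0, 1.0, 1.0, 0.0],
    "cb":         [1.0, 1.0, 0.0, 0.0],
    "supervised": [0.0, 0.0, 1.0, 0.5],
}
equal = {"luhn": 1, "lsa": 1, "cb": 1, "supervised": 1}
luhn_doubled = dict(equal, luhn=2)  # one weight raised to 2, the others kept at 1

print(ensemble_scores(scores, equal))
print(ensemble_scores(scores, luhn_doubled))
print(top_sentences(ensemble_scores(scores, equal), rate=0.5))
```

Raising one weight to 2 while the others stay at 1 mirrors the Schema2-5 setting; Schema1 would instead use weights proportional to each method's micro-F1.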

21 Experiment Setup. Dataset: 2 million Web pages from the LookSmart Web directory; 500 thousand pages with manually created descriptions; a random sample of 30% (153,019 pages), distributed among 64 categories (only the top two levels of categories on the LookSmart website). Classifier: NB (Naïve Bayesian) and SVM (Support Vector Machine). Evaluation: 10-fold cross validation; precision, recall, F1; micro (gives equal weight to every document) vs. macro (gives equal weight to every category).
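
The difference between micro and macro averaging can be made concrete with a short Python sketch. The per-category counts below are invented for illustration, and macro-F1 is taken here as the plain mean of the per-category F1 values, which is one common definition.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def micro_macro_f1(counts):
    """counts: {category: (tp, fp, fn)} -> (micro-F1, macro-F1)."""
    # Micro: pool the counts over all categories, so every document weighs the same.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    # Macro: average the per-category scores, so every category weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Invented counts: one large, well-classified category and one small, poorly classified one.
counts = {"big": (8, 2, 2), "small": (1, 1, 3)}
print(micro_macro_f1(counts))
```

With these counts the micro score is pulled toward the large category, while the macro score is dragged down by the small one.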

22 Experiment 1: Feasibility Study. Baseline: the text remaining after removing the HTML tags. The human-authored summary is taken as the ideal summary for the page. Conclusion: a good summary clearly improves classification performance. [Chart: improvement of 14.8% with NB and 13.2% with SVM.]

23 Experiment 2: Evaluation on Automatic Summarizers. The unsupervised methods achieve similar improvements. The unsupervised methods are better than the supervised method. No automatic method is as good as the human summary. [Tables for NB and SVM: rows Baseline, Human, Summ1 to Summ4; columns microP, microR, micro-F1.] Summ1 = Luhn; Summ2 = Content Body; Summ3 = LSA; Summ4 = Supervised.

24 Experiment 2: Evaluation on Automatic Summarizers (Cont.). The ensemble of summarizers achieves an improvement similar to that of the human summary. [Chart: NB 14.8% and 12.9%; SVM 13.2% and 11.5%.]

25 Experiment 3: Parameter Tuning -- Compression Rate. Compression rate is the most important parameter under consideration. Most of the automatic methods achieve their best result when the compression rate is 20% or 30%.
Performance of CB with different thresholds with NB:
CB           65.0±…   …±…      …±…      …±0.3
Performance at different compression rates with NB:
             10%      20%      30%      50%
Luhn         66.1±…   …±…      …±…      …±0.3
LSA          66.3±…   …±…      …±…      …±0.3
Supervised   66.1±…   …±…      …±…      …±0.3
Hybrid       66.9±…   …±…      …±…      …±0.3

26 Experiment 3: Parameter Tuning -- Weight Schema. Schema1: the weight of each summarization method is in proportion to the performance of that method. Schema2-5: the value of $w_i$ ($i$ = 1, 2, 3, 4) is increased to 2 in Schema2 to Schema5 respectively, with the other weights kept at 1.
             microP   microR   micro-F1
Origin       80.2±…   …±…      …±0.3
Schema1      81.0±…   …±…      …±0.3
Schema2      81.3±…   …±…      …±0.4
Schema3      79.5±…   …±…      …±0.4
Schema4      81.1±…   …±…      …±0.3
Schema5      79.7±…   …±…      …±0.4

27 Analysis. Why does summarization help? Summarization can extract the main topic of a Web page while removing noise. A: 100 Web pages that are correctly labeled by all our summarization-based approaches but wrongly labeled by the baseline system. B: 500 pages randomly sampled from the testing pages. [Table: # of pages, total size (k) and average size per page (k) for A and B.] Conclusion: summarization is especially helpful for large Web pages.

28 Conclusion. Summarization techniques can be helpful for classification. A new summarizer based on Web-page structure analysis. A modification to Luhn’s method. New features for supervised Web-page summarization.

29 Future Work. Improve the summarization performance: take hypertext/anchor text into consideration; use the hyperlink structure; use query logs. Apply summarization to other applications: clustering.

30 Thanks