Web-page Classification through Summarization D. Shen, *Z. Chen, **Q Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu and *W.Y. Ma TsingHua University, *Microsoft.

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
Entity-Centric Topic-Oriented Opinion Summarization in Twitter Date : 2013/09/03 Author : Xinfan Meng, Furu Wei, Xiaohua, Liu, Ming Zhou, Sujian Li and.
Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts 04 10, 2014 Hyun Geun Soo Bo Pang and Lillian Lee (2004)
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Web-Page Summarization Using Clickthrough Data Advisor.
Distributional Clustering of Words for Text Classification Authors: L.Douglas Baker Andrew Kachites McCallum Presenter: Yihong Ding.
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Ensemble Learning: An Introduction
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization.
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION TRIVIKRAM BHAT UNIVERSITY OF TEXAS AT ARLINGTON DATA MINING CSE6362 BASED ON PAPER.
Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Query session guided multi- document summarization THESIS PRESENTATION BY TAL BAUMEL ADVISOR: PROF. MICHAEL ELHADAD.
SIGIR’09 Boston 1 Entropy-biased Models for Query Representation on the Click Graph Hongbo Deng, Irwin King and Michael R. Lyu Department of Computer Science.
WebPage Summarization Using Clickthrough Data JianTao Sun & Yuchang Lu, TsingHua University, China Dou Shen & Qiang Yang, HK University of Science & Technology.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Adaptive News Access Daniel Billsus Presented by Chirayu Wongchokprasitti.
2007. Software Engineering Laboratory, School of Computer Science S E Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Similar Document Search and Recommendation Vidhya Govindaraju, Krishnan Ramanathan HP Labs, Bangalore, India JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
Adding Semantics to Clustering Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang Microsoft Research Asia, Beijing, P.R.China Department of Computer.
Automatic Image Annotation by Using Concept-Sensitive Salient Objects for Image Content Representation Jianping Fan, Yuli Gao, Hangzai Luo, Guangyou Xu.
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
LexPageRank: Prestige in Multi- Document Text Summarization Gunes Erkan and Dragomir R. Radev Department of EECS, School of Information University of Michigan.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
1 SIGIR 2004 Web-page Classification through Summarization Dou Shen Zheng Chen * Qiang Yang Presentation : Yao-Min Huang Date : 09/15/2004.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
1 Web-Page Summarization Using Clickthrough Data* JianTao Sun, Yuchang Lu Dept. of Computer Science TsingHua University Beijing , China Dou Shen,
1 Sentence Extraction-based Presentation Summarization Techniques and Evaluation Metrics Makoto Hirohata, Yousuke Shinnaka, Koji Iwano and Sadaoki Furui.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Algorithmic Detection of Semantic Similarity WWW 2005.
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: Group:
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
Finding document topics for improving topic segmentation Source: ACL2007 Authors: Olivier Ferret (18 route du Panorama, BP6) Reporter:Yong-Xiang Chen.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Text Categorization by Boosting Automatically Extracted Concepts Lijuan Cai and Tommas Hofmann Department of Computer Science, Brown University SIGIR 2003.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan, Dragomir R. Radev (EMNLP 2004)
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Identifying Spam Web Pages Based on Content Similarity Sole Pera CS 653 – Term paper project.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
iSRD Spam Review Detection with Imbalanced Data Distributions
Presentation transcript:

Web-page Classification through Summarization D. Shen, *Z. Chen, **Q Yang, *H.J. Zeng, *B.Y. Zhang, Y.H. Lu and *W.Y. Ma TsingHua University, *Microsoft Research Asia, **Hong Kong University (SIGIR 2004)

2/29 Abstract Web-page classification is much more difficult than pure-text classification due to a large variety of noisy information embedded in Web pages In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy

3/29 Abstract Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement as compared to pure-text-based classification algorithm We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about 12% improvement over pure-text based methods

4/29 Introduction With the rapid growth of WWW, there is an increasing need to provide automated assistance to Web users for Web page classification and categorization Yahoo directory LookSmart directory However, it is difficult to meet without automated Web-page classification techniques due to the labor- intensive nature of human editing

5/29 Introduction Web pages have their own underlying embedded structure in the HTML language and typically contain noisy content such as advertisement banner and navigation bar If a pure-text classification is applied to these pages, it will incur much bias for the classification algorithm, making it possible to loss focus on the main topic and important content Thus, a critical issue is to design an intelligent preprocessing technique to extract the main topic of a Web page

6/29 Introduction In this paper, we show that using Web-page summarization techniques for preprocessing in Web-page classification is a viable and effective technique We further show that instead of using off-the- shelf summarization technique that is designed for pure-text summarization, it is possible to design specialized summarization methods catering to Web-page structures

7/29 Related Work Recent works on Web-page summarization Ocelot (Berger al et., 2000) is a system for summarizing Web pages using probabilistic models to generate the “gist” of a Web page Buyukkokten et al. (2001) introduces five methods for summarizing parts of Web pages on handheld devices Delort (2003) exploits the effect of context in Web page summarization, which consists of the information extracted from the content of all the documents linking to a page

8/29 Related Work Our work is related to that for removing noise from a Web page Yi et al. (2003) propose an algorithm by introducing a tree structure, called Style Tree, to capture the common presentation styles and the actual contents of the pages Chen et al. (2000) extract title and meta-data included in HTML tags to represent the semantic meaning of the Web pages

9/29 Web-Page Summarization Adapted Luhn’s Summarization Method Every sentence is assigned with a significance factor, and the sentence with the highest significance factor are selected to form the summary The significance factor of a sentence can be computed as follows. A “significant words pool” is built using word frequency Set a limit L for the distance at which any two significant words could be considered as being significantly related Find out a portion in the sentence that is bracketed by significant words not more than L non-significant words apart Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion

10/29 Web-Page Summarization In order to customize the Luhn’s algorithm for web-pages, we make a modification The category information of each page is already known in the training data, thus significant-words selection could be processed with each category We build significant words pool for each category by selecting the words with high frequency after removing the stop words For a testing Web page, we do not have the category information We calculate the significant factor for each sentence according to different significant words pools over all categories separately The significant factor for the target sentence will be averaged over all categories and referred to as S luhn

11/29 Web-Page Summarization Latent semantic Analysis (LSA) LSA is applicable in summarization because of two reasons: LSA is capable of capturing and modeling interrelationship among terms by semantically clustering terms and sentences LSA can capture the salient and recurring word combination pattern in a document which describe a certain topic or concept LSA is based on singular value decomposition A :m*n :n*n V :n*n

12/29 Web-Page Summarization Latent semantic Analysis (LSA) Concepts can be represented by one of the singular vectors where the magnitude of the corresponding singular value indicates the importance degree of this pattern within the document Any sentence containing this word combination pattern will be projected along this singular value The sentence that best represents this pattern will have the largest index value with this vector We denote this index value as S lsa and select the sentences with the highest S lsa to form the summary

13/29 Web-Page Summarization Content Body Identification by Page Layout Analysis The structured character of Web pages makes Web-page summarization different from pure-text summarization In order to utilize the structure information of Web pages, we employ a simplified version of Function-Based Object Model (Chen et al., 2001) Basic Object (BO) Composite Object (CO) After detecting all BOs and COs in a Web page, we could identify the category of each object according to some heuristic rules: Information Object Navigation Object Interaction Object, Decoration Object Special Function Object

14/29 Web-Page Summarization Content Body Identification by Page Layout Analysis In order to make use of these objects, we define the Content Body (CB) of a Web page which consists of the main object related to the topic of that page The algorithm for detecting CB is as follows: Consider each selected object as a single document and build the TF*IDF index for the object Calculate the similarity between any two objects using COSINE similarity computation and add a link between them if their similarity is greater than a threshold (=0.1) In the graph, a core object is defined as the object having the most edges Extract the CB In the sentence is included in CB, S cb =1, otherwise, S cb =0 All sentences with S cb =1 give rise to the summary

15/29 Web-Page Summarization Supervised Summarization There are eight features f i1 measures the position of a sentence S i in a certain paragraph f i2 measures the length of a sentence S i, which is the number of words in S i f i3 =ΣTF w *SF w f i4 is the similarity between S i and the title f i5 is the consine similarity between S i and all text in the pages f i6 is the consine similarity between S i and meta-data in the page f i7 is the number of occurrences of word from S i in a special word set f i8 is the average font size of words in S i

16/29 Web-Page Summarization Supervised Summarization We apply Naïve Bayesian classifier to train a summarizer

17/29 Web-Page Summarization An Ensemble of Summarizer By combining the four methods presented in the previous sections, we obtain a hybrid Web-page The sentences with the highest S will be chosen into the summary

18/29 Experiments Data Set We use about 2 millions Web pages crawled from the LookSmart Web directory Due to the limitation of network bandwidth, we only downloaded about 500 thousand descriptions of Web pages that are manually created by human editors We randomly sampled 30% of the pages with descriptions for our experiment purpose The Extracted subset includes 153,019 pages, which are distributed among 64 categories

19/29 Experiments Data Set In order to reduce the uncertainty of data split, a 10- fold cross validation procedure is applied

20/29 Experiments Classifiers Naïve Bayesian Classifier (NB) Support Vector Machine (SVM)

21/29 Experiments Experiment Measure We employ the standard measures to evaluate the performance of Web classification: Precision, recall and F1-measure To evaluate the average performance across multiple categories, there are two conventional methods: micro- average and macro-average Micro-average gives equal weight to every document Macro-average gives equal weight to every category, regardless of its frequency Only micro-average will be used to evaluate the performance of classifiers

22/29 Experiments Experimental Results and Analysis Baseline (Full text) Each web page is represented as a bag-of-words, in which the weight of each word is assigned with their term frequency Human’s summary (Description, Title, meta-data) We extract the description of each Web page from the LookSmart Website and consider it as the “ideal” summary

23/29 Experiments Experimental Results and Analysis Unsupervised summarization Content Body, Adapted Luhn’s algorithm (compression rate:20%) and LSA (compression rate:30%) Supervised summarization algorithm We define one sentence as positive if its similarity with the description is greater than a threshold (=0.3), and others as negative Hybrid summarization algorithm Use all Unsupervised and supervised summarization algorithms

24/29 Experiments Experimental Results and Analysis

25/29 Experiments Experimental Results and Analysis Performance on Different Compression

26/29 Experiments Experimental Results and Analysis Effect of different weighting schemata Schema1: we assigned the weight of each summarization method in proportion to the performance of each method (micro-F1) Schema2: We increased the value of Wi (i=1,2,3,4) to 2 in schema2-5 respectively and kept others as one

27/29 Experiments Case studies We randomly selected 100 Web pages that are correctly labeled by all our summarization based approaches but wrongly labeled by the baseline (denoted as set A) and 500 pages randomly from the testing pages (denoted as set B) The average size of pages in A which is 31.2k in much larger than that of in B which is 10.9k The summarization techniques can help us to extract this useful information from large pages

28/29 Experiments Case studies

29/29 Conclusions and Future Work Several web-page summarization algorithms are proposed for extracting most relevant features from Web pages for improving the accuracy of Web classification Experimental results show that automatic summary can achieve a similar improvement (about 12.9%) as the ideal-case accuracy achieved by using the summary created by human editors We will investigate methods for multi-document summarization of hyperlinked Web pages to boost the accuracy of Web classification