WEB CONTENT SUMMARIZATION: A Look at Algorithms, Methodologies, and Live Systems
Timothy Washington

Why is web content summarization important?  The quantity of information available on the Internet has grown enormously, and this trend will continue for years to come.  At the same time, the amount of time web users are willing to spend looking through this information continues to decrease.  Additionally, the spread of handheld web devices (e.g. smartphones and iPads) creates the need to reduce text for display on such small screens.

Two Main Phases of Web Content Summarization  Content Selection - weighing the importance of information throughout web documents and determining what is crucial to the web user’s general understanding of the material, as well as what information is redundant or unnecessary.  Content Generation - structuring the selected content with proper syntactic structure and semantic representation.

Algorithms and Methodologies used in Content Selection  TF*IDF (term frequency inverse document frequency)  Lexical Chains  Machine Learning (e.g. Noisy Channel Models and N-grams)  Clickthrough Data Evaluation

TF*IDF TF*IDF is a method used for summarizing multiple web documents. It selects sentences for a summary based on the statistical scores of the words they contain. The score combines two quantities. Term frequency (tf) counts the number of occurrences of a given word in the web document currently being evaluated. Inverse document frequency (idf) checks how many of the web pages being summarized contain the word, divides that count into the total number of web pages in the group, and takes the log of the quotient, i.e. idf(w) = log(N / df(w)) and score(w) = tf(w) × idf(w). Once this is computed for every word, the sentences containing the highest-scoring words are selected for the summary until some threshold is reached.
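
A minimal sketch of this scoring in Python, assuming simple regex tokenization and sentence splitting (the function names are illustrative, not from any cited system):

import math
import re
from collections import Counter

def tfidf_scores(documents):
    """Compute a TF*IDF score for every word in every document."""
    n_docs = len(documents)
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    # document frequency: how many documents contain each word
    df = Counter()
    for words in tokenized:
        df.update(set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

def summarize(documents, doc_index, max_sentences=3):
    """Pick the highest-scoring sentences from one document of the group."""
    word_scores = tfidf_scores(documents)[doc_index]
    sentences = re.split(r"(?<=[.!?])\s+", documents[doc_index])
    ranked = sorted(
        sentences,
        key=lambda s: sum(word_scores.get(w, 0.0)
                          for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    return ranked[:max_sentences]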

Lexical Chains  Lexical chaining is used in the summarization of single and multiple web documents. In lexical chaining words that have similar meaning are linked together through a data structure.  WordNet, an online database that groups words into sets of synonyms is a primary tool used in determining which words should be “chained”.  The strength of a lexical chain is based on the number of terms contained within it.  Usually, the first sentence from each strong chain is selected for the summary until some threshold is reached. An arctic cold wave, the worst in 10 years, hit parts of Europe, leaving many residents dead. Hardest hit by the sub-zero temperatures were Poland, Bulgaria, and Romania. Rare snowfall occurred in southern Italy. In Poland, three weeks of chilling temperatures killed at least 85 people in November, 29 more citizens than in all of the previous winter …

Noisy Channel Model  In the noisy channel model, a summarization system is trained on a set of documents and handwritten summaries, recording features of sentences selected (as well as features of those not selected) for the summary.  These features may include sentence length, average TF*IDF per word, sentence position in a paragraph, etc. They are represented in vectors as binary numbers, decimals, and strings.

Noisy Channel Model (continued) When web documents are run through the summarizer, a feature vector is created for each sentence. Afterwards a machine learning rule such as Bayes' rule (shown below) is applied to each sentence. A threshold can be used to place the N sentences with the highest probabilities in the summary.

P(s ∈ S | F1,…,FN) = P(F1,…,FN | s ∈ S) × P(s ∈ S) / P(F1,…,FN)

s – the given sentence
S – the summary
F1,…,FN – the sentence feature vector
P(s ∈ S | F1,…,FN) – the probability that s should be chosen for the summary
P(s ∈ S) – the prior probability (not based on any features) of a sentence being chosen for a summary
P(F1,…,FN) – the probability of a sentence vector in the document set matching that of the given sentence
P(F1,…,FN | s ∈ S) – the probability of a sentence vector in the summary set matching that of the given sentence
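
A toy sketch of this scoring under a naive independence assumption between binary features; the function names and the add-one smoothing are our own choices, not from any particular system.

from collections import defaultdict

def train(sentences):
    """sentences: list of (feature_dict, in_summary) pairs with binary features."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for features, in_summary in sentences:
        totals[in_summary] += 1
        for name, value in features.items():
            if value:
                counts[in_summary][name] += 1
    return counts, totals

def score(features, counts, totals):
    """Approximate P(s in S | F1..FN), assuming the features are independent."""
    def likelihood(label):
        p = totals[label] / (totals[True] + totals[False])   # prior P(s in S)
        for name, value in features.items():
            # add-one smoothing so unseen features do not zero out the product
            p_f = (counts[label][name] + 1) / (totals[label] + 2)
            p *= p_f if value else (1 - p_f)
        return p
    pos, neg = likelihood(True), likelihood(False)
    return pos / (pos + neg)    # the denominator P(F1..FN) via normalization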

N-Grams  N-grams are sequences of N words in a given sentence that are evaluated based on the probability that they will occur in a given order.  The most popular forms of N-grams are unigrams (single words), bigrams (a set of two words), and tri-grams (a set of three words).  In the case of summarization, N-gram probability is computed based on the statistics collected from the document-summary training set.  For example, the probability for sets of bigrams in a candidate sentence appearing within sentences selected for the summary could be evaluated.  Once again this statistical probability can be compared to a threshold value used to determine whether or not a sentence is fit for summarization.  The shortcoming of such an implementation is that the subject matter of the document- summary pairs must be closely related to that of the web content.

Clickthrough Data Evaluation  Clickthrough data is created by recording data from search engines to represent user activity.  Each record is a triple in which a user (u) enters a query (q) and, from the results, selects a page (p). Once this data has been collected over a given period of time, millions of these records are available for evaluation.  Sentences can then be selected based on their number of significant words. In this case the significance of each word is determined by the number of times users selected the given web page from the search engine after entering a query containing that word (i.e. select all q from the data set whose selected page is the current p).  One obvious issue with using clickthrough data is that some web pages to be summarized may never have been accessed through a search engine. In these cases it is useful to employ a backup method (such as TF*IDF) for selecting useful material for the summary.
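
A small sketch of turning (user, query, page) records into per-page word significance counts; the min_clicks cutoff is an illustrative parameter, not a value from the paper.

from collections import Counter

def word_significance(click_log):
    """click_log: iterable of (user, query, clicked_page) triples.
    Returns, for each page, how often each query word led a user to that page."""
    significance = {}
    for user, query, page in click_log:
        per_page = significance.setdefault(page, Counter())
        per_page.update(query.lower().split())
    return significance

def significant_word_count(sentence, page, significance, min_clicks=5):
    """Number of words in the sentence that are 'significant' for this page."""
    per_page = significance.get(page, Counter())
    return sum(1 for w in sentence.lower().split() if per_page[w] >= min_clicks)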

News In Essence  Online news article summarization system developed at the University of Michigan, Ann Arbor.  Unique in that it uses a Trolling Algorithm to build the group of documents to summarize on its own rather than taking a preset group of documents.

News In Essence (continued)

Input: SeedUrl, SitesToSearch, ExitConditions
Output: Cluster

Cluster <- SeedUrl
WeightedKeyWords <- get_common_keywords(SeedUrl, SeedUrl)
LinkedUrls <- get_links(SeedUrl)
// primary search
while UrlToTest <- next(LinkedUrls) && PrimaryExitCondition != true
    if follows_useful_rules(UrlToTest)
        LinkedUrls <- LinkedUrls + get_links(UrlToTest)
    if follows_article_rules(UrlToTest) && (similarity(SeedUrl, UrlToTest) > threshold)
        Cluster <- Cluster + UrlToTest
        WeightedKeyWords <- WeightedKeyWords + get_common_keywords(SeedUrl, UrlToTest)
SecSearchKeyWords <- max_n(WeightedKeyWords)
// secondary search
while SearchSite <- next(SitesToSearch) && SecondaryExitCondition != true
    SearchPage <- generate_search(SearchSite, SecSearchKeyWords)
    LinkedUrls <- get_links(SearchPage)
    while UrlToTest <- next(LinkedUrls) && SecondaryExitCondition != true
        if follows_useful_rules(UrlToTest)
            LinkedUrls <- LinkedUrls + get_links(UrlToTest)
        if follows_article_rules(UrlToTest) && (similarity(SeedUrl, UrlToTest) > threshold)
            Cluster <- Cluster + UrlToTest
return Cluster

 Starts by looking through a single web page, known as the SeedUrl, and searches this page for links to other articles that share the same topic.
 After adding the linked pages to the document set, it uses TF*IDF to compute the most important words across the web documents.
 In the next phase it uses these words to query several news search engines, finding additional articles and adding them to the document set.
 It also follows links from these search results to see whether there are further pages that can be used for summarization.
 Uses a series of rules to determine whether linked pages should be excluded from the set. One such rule: if a URL ends with ".jpg", this is evidence of a page that contains no textual content, making it unfit for the document set.

Web Page Summarization using Clickthrough Data  System developed at Tsinghua University.  Determines the importance of a word by combining a weighted term frequency within the web pages with a term frequency within the query data.  The two frequencies are combined using a weighting variable between 0 and 1.  The closer this variable is to 0, the more a word's importance is based on term frequency within the web pages; the closer it is to 1, the more the ranking is based on the clickthrough data.
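
The exact weighting formula is not reproduced in this transcript, but the description suggests a linear interpolation along the following lines (the parameter name alpha and the interpolation form are our reading, not the paper's notation):

def word_importance(word, tf_page, tf_query, alpha=0.5):
    """Interpolate page term frequency with query-log term frequency.
    alpha = 0 -> importance driven by the page text only;
    alpha = 1 -> importance driven by the clickthrough query data only."""
    return (1 - alpha) * tf_page.get(word, 0) + alpha * tf_query.get(word, 0)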

Web Page Summarization using Clickthrough Data (continued)
After the system determines the important terms it applies Luhn's algorithm (shown below) to each sentence.
1) Set a limit L for the distance at which any two significant words can be considered related.
2) Find a portion of the sentence that is bracketed by significant words no more than L non-significant words apart.
3) Count the number of significant words contained in this portion and divide the square of this number by the total number of words in the portion. The result is the significance factor of the sentence.

Sample: A default could devastate the economy, leading to a crashing dollar and skyrocketing interest rates, among other things.
L = 4
Left bracket word = economy
Right bracket word = interest
Significant words in portion = 3
Total words in portion = 7
Significance factor of sentence = 3² / 7 ≈ 1.29
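
A sketch of the significance-factor computation described above; with the slide's counts (3 significant words bracketing a 7-word portion) the formula gives 3²/7 ≈ 1.29.

def luhn_significance(sentence_words, significant_words, limit=4):
    """Find the portion bracketed by significant words no more than `limit`
    non-significant words apart, then return
    (significant words in portion)**2 / (total words in portion)."""
    positions = [i for i, w in enumerate(sentence_words) if w in significant_words]
    if not positions:
        return 0.0
    best = 0.0
    start = 0
    for i in range(1, len(positions) + 1):
        # close the current portion when the gap to the next significant word is too large
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > limit:
            portion_len = positions[i - 1] - positions[start] + 1
            count = i - start
            best = max(best, count * count / portion_len)
            start = i
    return best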

Web Page Summarization using Clickthrough Data (continued)

NewsBlaster

NewsBlaster (continued)  Weights sentences in web pages based on various features, some of which include the following:
Location: If a sentence appears late in a document its weight is reduced, as such sentences are more likely to contain information that is redundant or less important.
Publication Date: The more recent the article a sentence belongs to, the greater the weight it is given. This helps ensure that web users receive the most up-to-date information.
Length: Sentences are weighted negatively if they are longer than 30 words or shorter than 15 words. Long sentences are thought to contain redundant information, whereas short sentences are thought to be limited to supporting information.
Pronoun: A negative weight is given to sentences that start with pronouns, as these sentences usually contain supporting rather than primary information.
Sentences with the highest weights are selected (a toy scoring sketch follows below). Afterwards they are fused together using a separate algorithm.
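
A toy version of such feature-based weighting; the individual penalty values are illustrative, and only the listed features and the 15/30-word length band come from the description above.

def sentence_weight(sentence, position_ratio, days_old):
    """Score one sentence. position_ratio is 0 at the start of the document and 1
    at the end; days_old is the age of the article the sentence comes from."""
    words = sentence.split()
    weight = 0.0
    weight -= position_ratio            # later in the document -> lower weight
    weight -= 0.1 * days_old            # older article -> lower weight
    if len(words) > 30 or len(words) < 15:
        weight -= 1.0                   # penalize very long or very short sentences
    if words and words[0].lower() in {"he", "she", "it", "they", "this", "that"}:
        weight -= 1.0                   # sentences starting with a pronoun
    return weight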

Tests Run on NewsBlaster Three document summary sets from NewsBlaster were tested based on the following criteria:  Summary Length – the length of the summary compared to the average length of the original documents.  Precision – the portion of information in the summary that belongs in the summary (in this case, based on the number of sentences in each summary relevant to the title of the summary).  Sentence structure – since this is an abstractive system (one that constructs and generates new sentences), each sentence was tested for proper grammatical structure by running it through the Stanford parser.

Results for Summary Length

Result Set for Precision

Conclusion  Many advances have been made in this field with the development of algorithms an methodologies such as TF-IDF, clickthrough data evaluation, and machine learning.  It is also clear that further research is needed in this area; mainly in improving precision beyond all else as many summarization systems have issues in this area.