1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, 939-7118 Office Hours: Tues 4-5; Wed 1-2 TA: Yves Petinot 728 CEPSR, 939-7116.

1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, Office Hours: Tues 4-5; Wed 1-2 TA: Yves Petinot 728 CEPSR, Office Hours: Thurs 12-1, 8-9

2 Today  Why NLP for the web?  What we will cover in the class  Class structure  Requirements and assignments for class  Introduction to summarization

3 The World Wide Web  Surface Web  As of March 2009, the indexable web contains at least billion web pages  On July 25, 2008, Google software engineers Jesse Alpert and Nissan Hajaj announced that Google Search had discovered one trillion unique URLs.  As of May 2009, over million websites operated.  Deep Web  550 billion web pages (2001) both surface and deep  At least billion in the deep web (2005)

4 Languages on the web (2002)  English 56.4%  German 7.7%  French 5.6%  Japanese 4.9%

5 Language Usage of the Web

6 Locally maintained corpora  Newsblaster  Drawn from between news sites  Accumulated since 2001  2 billion words  DARPA GALE corpus  Collected by the Linguistic Data Consortium  3 different languages (English, Arabic, Chinese)  Formal and informal genres (news vs. blogs; broadcast news vs. talk shows)  367 million words, 2/3 in English  4500 hours of speech  Linguistic Data Consortium (LDC) releases  Penn Treebank, TDT, Propbank, ICSI meeting corpus  Corpora gathered for project on online communication  LiveJournal, online forums, blogs

7 What tasks need natural language?  Search  Asking questions, finding specific answers (google)  Browsing (http://emm.newsbrief.eu/NewsBrief/clusteredition/en/latest.html)  Analysis of documents  Sentiment (  Who talks to who?  Translation (google)

8 Existing Commercial Websites  Google News  Ask.com  Yahoo categories  Systran translation

9 Exploiting the Web  Confirming a response to a question  Building a data set  Building a language model

10 Class Overview  Userid: nlpforweb  Password: nlp321

11 Guest: Livia Polanyi Microsoft: bing.com

12 Summarization

13 What is Summarization?  Data as input (database, software trace, expert system), text summary as output  Text as input (one or more articles), paragraph summary as output  Multimedia in input or output  Summaries must convey maximal information in minimal space

14 Summarization is not the same as Language Generation  Karl Malone scored 39 points Friday night as the Utah Jazz defeated the Boston Celtics  Karl Malone tied a season high with 39 points Friday night….  … the Utah Jazz handed the Boston Celtics their sixth straight home defeat Streak, Jacques Robin, 1993

15 Summarization Tasks  Linguistic summarization: How to pack in as much information as possible in as short an amount of space as possible?  Streak: Jacques Robin  Jan 28 th class: single document summarization  Conceptual summarization: What information should be included in the summary?

16 Streak  Data as input  Linguistic summarization  Basketball reports

17 Input Data -- STREAK

18 Revision rule: nominalization  [Diagram: "beat" (Jazz, Celtics) is rewritten as "hand" (Jazz, defeat, Celtics)]  Allows the addition of noun modifiers like a streak (6th straight defeat)

19 Summary Function (Style)  Indicative  Indicates the topic, style without providing details on content.  Help a searcher decide whether to read a particular document  Informative  A surrogate for the document  Could be read in place of the document  Conveying what the source text says about something  Critical  Reviews the merits of a source document  Aggregative  Multiple sources are set out in relation, contrast to one another

20 Indicative Summarization – Min Yen Kan, Centrifuser

21 Centrifuser Output (Min Yen Kan, SIGIR 2001 – WTS / DUC, 13 Sep 2001)  Centrifuser’s output comes in three parts:  Navigation  Informative extract, based on similarities  Indicative generated text, based on differences  Centrifuser can currently produce this output for documents with the same domain and genre

22 1. Document Topic Tree  Hierarchical view of the document  Layout (Hu, et al 99)  Lexical chains (Hearst 94, Choi 00)  Done offline per document  Example tree nodes:  High Blood Pressure (Level: 1; Style: Prose; Contents: 3 Headers, …)  AHA Recommendation (Level: 2; Order: 1; Style: Prose; Contents: 1 Table, …)  Related AHA publications (Level: 2; Order: 3; Style: Bulleted; Contents: …)  See also in this guide (Level: 2; Order: 3; Style: Prose; Contents: 5 items, …)

23 Other Dimensions to Summarization  Single vs. Multi-document  Purpose  Briefing  Generic  Focused  Media/genre  News: newswire, broadcast  /meetings

24 Summons -1995, Radev&McKeown  Multi-document  Briefing  Newswire  Content Selection

25 Summons, Dragomir Radev, 1995

26 Briefings  Transitional  Automatically summarize series of articles  Input = templates from information extraction  Merge information of interest to the user from multiple sources  Show how perception changes over time  Highlight agreement and contradictions  Conceptual summarization: planning operators  Refinement (number of victims)  Addition (Later template contains perpetrator)

27 How is summarization done?  4 input articles parsed by information extraction system  4 sets of templates produced as output  Content planner uses planning operators to identify similarities and trends  Refinement (Later template reports new # victims)  New template constructed and passed to sentence generator
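The Refinement and Addition operators named on the Briefings and "How is summarization done?" slides can be pictured with a small sketch. This is only an illustration under stated assumptions: the template fields (`source`, `event`, `victims`, `perpetrator`) and the function `compare_templates` are hypothetical, not the actual Summons template format or planner, and any changed value is treated as a refinement even though a real planner would also distinguish contradictions.

```python
def compare_templates(earlier, later):
    """Relate two extraction templates with two simplified planning operators.

    Returns a list of (operator, field, old_value, new_value) tuples.
    Field names are hypothetical; the slides do not give the template schema.
    """
    findings = []
    for field, new_value in later.items():
        if field == "source":
            continue  # the source identifies the report, it is not a fact to compare
        old_value = earlier.get(field)
        if old_value is None and new_value is not None:
            # Addition: the later template fills a slot the earlier one left empty.
            findings.append(("addition", field, None, new_value))
        elif old_value is not None and new_value is not None and old_value != new_value:
            # Refinement: the later template reports an updated value.
            findings.append(("refinement", field, old_value, new_value))
    return findings


first = {"source": "wire A", "event": "bombing", "victims": 3, "perpetrator": None}
second = {"source": "wire B", "event": "bombing", "victims": 5, "perpetrator": "unknown group"}
findings = compare_templates(first, second)
```

Each finding could then be handed to a sentence generator, for example to produce wording such as "a later report raised the number of victims".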

28 Sample Template

29 How does this work as a summary?  Sparck Jones:  “With fact extraction, the reverse is the case ‘what you know is what you get.’” (p. 1)  “The essential character of this approach is that it allows only one view of what is important in a source, through glasses of a particular aperture or colour, regardless of whether this is a view showing what the original author would regard as significant.” (p. 4)

30 Foundations of Summarization – Luhn; Edmundson  Text as input  Single document  Content selection  Methods  Sentence selection  Criteria

31 Sentence extraction  Sparck Jones:  ‘what you see is what you get’, some of what is on view in the source text is transferred to constitute the summary

32 Luhn 58  Summarization as sentence extraction  Example  Term frequency determines sentence importance  TF*IDF (term frequency * inverse document frequency)  Stop word filtering (remove “a”, “in” “and” etc.)  Similar words count as one  Cluster of frequent words indicates a good sentence
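A minimal sketch of frequency-based sentence extraction in the spirit of the Luhn slide. It is a simplification, not Luhn's exact procedure: the stop list is a tiny illustrative one, "similar words count as one" is approximated by stripping a trailing "s", the cluster measure is taken over the whole sentence rather than over a bracketed span, and IDF weighting is left out. The names `STOP_WORDS`, `tokenize` and `luhn_scores` are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and strip a plural-style 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def luhn_scores(sentences, min_count=2):
    """Score each sentence by its density of frequent content words."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    frequent = {w for w, c in counts.items() if c >= min_count}
    scores = []
    for s in sentences:
        tokens = tokenize(s)
        hits = sum(1 for w in tokens if w in frequent)
        # Squared hits over length rewards dense clusters of frequent words.
        scores.append(hits * hits / len(tokens) if tokens else 0.0)
    return scores


sentences = [
    "Term frequency determines sentence importance.",
    "A cluster of frequent terms marks a good sentence.",
    "Lunch is at noon.",
]
scores = luhn_scores(sentences)
best = sentences[scores.index(max(scores))]
```

Here "term" and "sentence" are the only frequent content words, so the short first sentence, where they are densest, scores highest and the unrelated third sentence scores zero.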

33 Edmundson 69 Sentence extraction using 4 weighted features:  Cue words  Title and heading words  Sentence location  Frequent key words
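The four weighted features can be combined as a simple weighted sum per sentence. The sketch below assumes that form; the weights, the cue-word list, and the helper names (`WEIGHTS`, `CUE_WORDS`, `sentence_features`, `edmundson_score`) are hypothetical values chosen for illustration, not the ones from the 1969 paper.

```python
# Hypothetical weights and cue words, chosen only for illustration.
WEIGHTS = {"cue": 2.0, "title": 1.0, "location": 1.0, "key": 0.5}
CUE_WORDS = {"significant", "conclude", "results"}


def sentence_features(sentence, index, n_sentences, title_words, key_words):
    """Compute the four feature values for one sentence."""
    tokens = set(sentence.lower().replace(".", "").split())
    return {
        "cue": len(tokens & CUE_WORDS),
        "title": len(tokens & title_words),
        # Location: reward the first and last sentence of the document.
        "location": 1 if index in (0, n_sentences - 1) else 0,
        "key": len(tokens & key_words),
    }


def edmundson_score(features, weights=WEIGHTS):
    """Weighted linear combination of the four features."""
    return sum(weights[name] * features[name] for name in weights)


document = [
    "Automatic summarization selects text.",
    "The weather was mild.",
    "We conclude that sentence extraction works.",
]
title_words = {"automatic", "summarization"}
key_words = {"sentence", "extraction"}

scores = [
    edmundson_score(
        sentence_features(s, i, len(document), title_words, key_words)
    )
    for i, s in enumerate(document)
]
```

Changing the entries of `WEIGHTS` is where hand-tuning, or later the trained weights of the Kupiec approach, would enter.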

34 Sentence extraction variants  Lexical Chains  Barzilay and Elhadad  Silber and McCoy  Discourse coherence  Baldwin  Topic signatures  Lin and Hovy

35 Summarization as a Noisy Channel Model  Summary/text pairs  Machine learning model  Identify which features help most

36 Julian Kupiec SIGIR 95 Paper Abstract  To summarize is to reduce in complexity, and hence in length while retaining some of the essential qualities of the original.  This paper focusses on document extracts, a particular kind of computed document summary.  Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.  The trends in our results are in agreement with those of Edmundson who used a subjectively weighted combination of features as opposed to training the feature weights with a corpus.  We have developed a trainable summarization program that is grounded in a sound statistical framework.

37 Statistical Classification Framework  A training set of documents with hand-selected abstracts  Engineering Information Co provides technical article abstracts  188 document/summary pairs  21 journal articles  Bayesian classifier estimates probability of a given sentence appearing in abstract  Direct matches (79%)  Direct Joins (3%)  Incomplete matches (4%)  Incomplete joins (5%)  New extracts generated by ranking document sentences according to this probability

38 Features  Sentence length cutoff  Fixed phrase feature (26 indicator phrases)  Paragraph feature  First 10 paragraphs and last 5  Is sentence paragraph-initial, paragraph-final, paragraph medial  Thematic word feature  Most frequent content words in document  Upper case Word Feature  Proper names are important
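The Bayesian classifier can be sketched as a naive Bayes score over binary sentence features, P(s in S | F1..Fk) proportional to P(s in S) times the product over j of P(Fj | s in S) / P(Fj). The code below is a toy version under stated assumptions: only two binary features are used (paragraph-initial, contains a fixed phrase), the six training examples are invented, add-one smoothing is added to avoid zero probabilities, and the function names `train` and `score` are hypothetical.

```python
def train(examples):
    """Estimate the prior and per-feature probabilities from labelled sentences.

    examples: list of (features, in_summary) pairs, where features is a tuple
    of 0/1 values and in_summary is True if the sentence matched the abstract.
    """
    n = len(examples)
    k = len(examples[0][0])
    positives = [f for f, in_summary in examples if in_summary]
    prior = len(positives) / n
    # P(F_j = 1 | s in S) and P(F_j = 1), both with add-one smoothing.
    p_given_s = [
        (sum(f[j] for f in positives) + 1) / (len(positives) + 2) for j in range(k)
    ]
    p_overall = [(sum(f[j] for f, _ in examples) + 1) / (n + 2) for j in range(k)]
    return prior, p_given_s, p_overall


def score(features, model):
    """Naive Bayes score used to rank sentences for extraction."""
    prior, p_given_s, p_overall = model
    result = prior
    for j, value in enumerate(features):
        numerator = p_given_s[j] if value else 1 - p_given_s[j]
        denominator = p_overall[j] if value else 1 - p_overall[j]
        result *= numerator / denominator
    return result


# Invented training data: (paragraph_initial, has_fixed_phrase), in_summary
examples = [
    ((1, 1), True),
    ((1, 0), True),
    ((0, 1), False),
    ((0, 0), False),
    ((0, 0), False),
    ((1, 0), False),
]
model = train(examples)
ranked = sorted([(1, 1), (1, 0), (0, 1), (0, 0)], key=lambda f: score(f, model), reverse=True)
```

Because of the independence assumption the score is best read as a ranking value rather than a calibrated probability; new extracts are formed by taking the top-ranked sentences.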

39 Evaluation  Precision and recall  Strict match has 83% upper bound  Trained summarizer: 35% correct  Limit to the fraction of matchable sentences  Trained summarizer: 42% correct  Best feature combination  Paragraph, fixed phrase, sentence length  Thematic and Uppercase Word give slight decrease in performance
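Precision and recall for an extract can be computed by comparing the set of selected sentences with the set of reference (gold) sentences. A minimal sketch, assuming sentences are identified by their index in the document; the function name and the example indices are made up.

```python
def precision_recall(selected, reference):
    """Precision and recall of selected sentence ids against reference ids."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# The summarizer picked sentences 0, 2 and 5; the gold extract is 0, 1, 2 and 7.
precision, recall = precision_recall([0, 2, 5], [0, 1, 2, 7])
```

When the extract and the reference have the same number of sentences, precision and recall coincide, which is why a single "percent correct" figure can be reported.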

40 What do most recent summarizers do?  Statistically based sentence extraction, multi-document summarization  Study of human summaries (Nenkova et al 06) shows frequency is important  High frequency content words from input likely to appear in human models  95% of the 5 content words with high probability appeared in at least one human summary  Content words used by all human summarizers have high frequency  Content words used by one human summarizer have low frequency
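One way to turn the frequency finding into an extractor is to score each sentence by the average input probability of its content words and pick sentences greedily. The sketch below assumes that scheme; the step that squares the probability of already-covered words (to discourage redundancy) follows the SumBasic idea and is an assumption here, not something stated on the slide. Sentences are given as non-empty lists of content words, and the function names are made up.

```python
from collections import Counter


def word_probabilities(sentences):
    """p(w) = count of w in the input / total number of content words."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def greedy_summary(sentences, max_sentences=2):
    """Pick sentence indices greedily by average word probability."""
    probs = word_probabilities(sentences)
    remaining = list(range(len(sentences)))
    chosen = []
    while remaining and len(chosen) < max_sentences:
        best = max(
            remaining,
            key=lambda i: sum(probs[w] for w in sentences[i]) / len(sentences[i]),
        )
        chosen.append(best)
        remaining.remove(best)
        # Discount words already covered so later picks add new content.
        for w in set(sentences[best]):
            probs[w] **= 2
    return chosen


sentences = [
    ["quake", "hits", "city"],
    ["quake", "damages", "city", "bridge"],
    ["mayor", "visits", "school"],
]
chosen = greedy_summary(sentences)
```

In this toy input the second sentence overlaps heavily with the first, so after the discount the third sentence is preferred as the second pick.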

41 How is frequency computed?  Word probability in input documents (Nenkova et al 06)  TF*IDF considers input words but takes words in background corpus into consideration  Log-likelihood ratios (Conroy et al 06, 01)  Uses a background corpus  Allows for definition of topic signatures  Leads to best results for greedy sentence by sentence multi-document summarization of news
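The log-likelihood ratio compares how often a word occurs in the input with how often it occurs in a background corpus, and words whose ratio exceeds a cutoff form the topic signature. The sketch below uses the standard binomial form of the statistic; the cutoff of 10.83 (chi-square, one degree of freedom, p = 0.001) is a common choice assumed here, and the counts, corpus sizes and function names are invented for illustration.

```python
import math

CUTOFF = 10.83  # assumed cutoff: chi-square critical value, 1 d.o.f., p = 0.001


def _log_likelihood(p, k, n):
    """Binomial log-likelihood of k occurrences in n tokens (0 * log 0 = 0)."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1 - p)
    return result


def llr(k1, n1, k2, n2):
    """-2 log lambda for a word seen k1 times in n1 input tokens
    and k2 times in n2 background tokens."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2 * (
        _log_likelihood(p1, k1, n1)
        + _log_likelihood(p2, k2, n2)
        - _log_likelihood(p, k1, n1)
        - _log_likelihood(p, k2, n2)
    )


def topic_signature(input_counts, input_size, background_counts, background_size):
    """Words significantly more frequent in the input than in the background."""
    signature = set()
    for word, k1 in input_counts.items():
        k2 = background_counts.get(word, 0)
        more_frequent = k1 / input_size > k2 / background_size
        if more_frequent and llr(k1, input_size, k2, background_size) > CUTOFF:
            signature.add(word)
    return signature


input_counts = {"earthquake": 20, "the": 60}
background_counts = {"earthquake": 20, "the": 6000}
signature = topic_signature(input_counts, 1000, background_counts, 100000)
```

A function word such as "the" occurs at the same rate in both corpora and gets a ratio of zero, while "earthquake" is far more frequent in the input and enters the signature.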

42 New summarization tasks  Query focused summarization  Update summarization  Medical journal summarization  Weblog summarization  Meeting summarization  summarization

43 Karen Sparck Jones Automatic Summarizing: Factors and Directions

44 Sparck Jones claims  Need more power than text extraction and more flexibility than fact extraction (p. 4)  In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose and output factors, that bear on summarising and its evaluation. (p. 1)  It is important to recognize the role of context factors because the idea of a general-purpose summary is manifestly an ignis fatuus. (p. 5)  Similarly, the notion of a basic summary, i.e., one reflective of the source, makes hidden fact assumptions, for example that the subject knowledge of the output’s readers will be on a par with that of the readers for whom the source was intended. (p. 5)  I believe that the right direction to follow should start with intermediate source processing, as exemplified by sentence parsing to logical form, with local anaphor resolutions

45 Questions (from Sparck Jones)  Would sentence extraction work better with a short or long document? What genre of document?  Should it be more important to abstract rather than extract with single document or with multiple document summarization?  Is it necessary to preserve properties of the source? (e.g., style)  Does subject matter of the source influence summary style (e.g., chemical abstracts vs. sports reports)?  Should we take the reader into account and how?  Is the state of the art sufficiently mature to allow summarization from intermediate representations and still allow robust processing of domain independent material?

46 For the next two classes  Consider the papers we read in light of Sparck Jones’ remarks on the influence of context:  Input  Source form, subject type, unit  Purpose  Situation, audience, use  Output  Material, format, style