Automatic Timeline Generation
Jessica Jenkins, Josh Taylor
CS 276b


Corpus
- Subset of TDT-3 text news articles
  - 875 articles
  - 49 topics (collections of related articles)
  - 6-30 articles per topic
  - Sources: ABC, APW, CNN, NBC, NYT, VOA
- Focus on sentence-level event detection within a topic

Labeling / Data Processing
- Manual annotation for evaluation
  - 10 topics, 120 articles
  - Decide on a set of events for each topic
  - Annotate each sentence in each article with a set of relevant topic events
- Sentence boundary detection
  - MXTerminator: maximum entropy classifier (Reynar and Ratnaparkhi)

Data Processing
- Sentences tokenized with case-folding, punctuation stripping, Porter stemming, and English stop-word removal
- Part-of-speech tagging
  - Stanford Log-linear PoS Tagger (Toutanova and Manning)
- Noun phrase chunking
  - BaseNP chunking (Ramshaw and Marcus)
  - Among terms outside noun phrases, only verbs were retained
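The tokenization steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the stop-word list is a tiny made-up sample, the stemmer is a pluggable placeholder (a real Porter stemmer would be passed in as `stem`), and filtering stop words before stemming is an assumption about the order of operations.

```python
import string

# Tiny illustrative stop-word list (an assumption; the slides do not name the list used).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "was", "is", "by"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def identity_stem(token):
    """Placeholder stemmer; pass a real Porter stemmer function instead if available."""
    return token


def tokenize(sentence, stem=identity_stem, stop_words=STOP_WORDS):
    """Case-fold, strip punctuation, drop stop words, then stem the remaining tokens."""
    text = sentence.lower()                    # case-folding
    text = text.translate(_PUNCT_TO_SPACE)     # punctuation stripping
    return [stem(tok) for tok in text.split() if tok not in stop_words]


print(tokenize("Delivery of the jets was stopped in 1990."))
```

Replacing punctuation with spaces splits tokens such as "F-16" into two pieces; whether the original pipeline did the same is not stated.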

Evaluation
- Precision and recall analogs
- Defined relative to the set of annotated events
- NU-precision: number of selected sentences labeled with a novel event (relative to higher-ranked sentences), divided by the number of selected sentences
- NU-recall: number of events covered by the selected sentences, divided by the number of events in the topic
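One way to read the two metrics is sketched below. The function name and input shapes are hypothetical, and two details are assumptions: a sentence counts as novel if it carries at least one event not seen in any higher-ranked sentence, and recall counts distinct events.

```python
def nu_precision_recall(ranked_event_sets, topic_events):
    """Compute (NU-precision, NU-recall) for one topic.

    ranked_event_sets: event-label sets of the selected sentences, best-ranked first.
    topic_events: the set of annotated events for the topic.
    """
    topic_events = set(topic_events)
    seen = set()           # events already covered by higher-ranked sentences
    novel_sentences = 0
    for events in ranked_event_sets:
        relevant = set(events) & topic_events
        if relevant - seen:            # at least one event not seen so far
            novel_sentences += 1
        seen |= relevant
    precision = novel_sentences / len(ranked_event_sets) if ranked_event_sets else 0.0
    recall = len(seen) / len(topic_events) if topic_events else 0.0
    return precision, recall


selected = [{"talks"}, {"talks"}, {"sale", "embargo"}, set()]
print(nu_precision_recall(selected, {"talks", "sale", "embargo", "refund"}))
```

In this toy case two of the four selected sentences introduce a new event and three of the four topic events are covered.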

Event Selection Methods
- Language model-based sentence scoring (Allan, Gupta, Khandelwal)
  - “Useful” score: measure of how likely a sentence is to be on-topic
  - “Novel” score: indication that sentence content is unlike previous sentences
  - Sentences ranked using a weighted blend of their Useful and Novel scores
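The ranking step might look like the sketch below, which takes the Useful and Novel scores as given rather than computing them from a language model. The linear blend, the `useful_weight` parameter, and the `ScoredSentence` fields are all assumptions for illustration; the slides say only that the blend is weighted. The top-k sentences are then put in chronological order to form the timeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ScoredSentence:
    text: str
    published: date   # hypothetical field: date of the source article
    useful: float     # on-topic score, assumed precomputed
    novel: float      # novelty score, assumed precomputed


def blended_score(sentence, useful_weight=0.5):
    """Linear blend of the two scores (the exact blend used is not given in the slides)."""
    return useful_weight * sentence.useful + (1.0 - useful_weight) * sentence.novel


def build_timeline(sentences, k=5, useful_weight=0.5):
    """Keep the k best-scoring sentences, then order them chronologically."""
    top = sorted(sentences, key=lambda s: blended_score(s, useful_weight), reverse=True)[:k]
    return sorted(top, key=lambda s: s.published)


candidates = [
    ScoredSentence("a", date(1998, 12, 1), useful=0.9, novel=0.1),
    ScoredSentence("b", date(1998, 12, 3), useful=0.8, novel=0.8),
    ScoredSentence("c", date(1998, 12, 2), useful=0.2, novel=0.9),
]
print([s.text for s in build_timeline(candidates, k=2)])
```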

Sample Timeline
Top 5 sentences by score, ordered chronologically:
- As Mr. Sharif left the White House, he said the talks focused on the India-Pakistan dispute over the Kashmir region and U.S. concerns about nuclear arms tests conducted by India and Pakistan.
- In 1989, Pakistan paid the U.S. close to $700 million for 28 F-16 fighter planes.
- But Pakistan was never refunded for the F-16s.
- In return, Pakistan will withdraw its claim to the F-16 aircraft.
- Delivery of the jets was stopped in 1990 because of the U.S. arms embargo against Pakistan because of its nuclear program.

Event Selection Methods
- Event discovery through clustering
  - Similar to text summarization
  - Each cluster represents an event
  - Hierarchical agglomerative clustering with sparse vectors
  - K-means (k=5,10,20) with and without dimensionality reduction

Clustering Results
- Tried group-average HAC without dimensionality reduction
- Representative sentence chosen for maximal cosine similarity with cluster centroid
- Underwhelming results with respect to evaluation metrics
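A small sketch of this setup, under stated assumptions: sentences are sparse term-weight dictionaries, "group average" is read as average linkage over cross-cluster pairs (one common definition; the slides do not specify which variant was used), and the function names are hypothetical. The quadratic search for the best merge is only suitable for small inputs.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_average_hac(vectors, num_clusters):
    """Merge the pair of clusters with the highest average cross-pair cosine similarity."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > max(num_clusters, 1):
        _, x, y = max(
            (avg_sim(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def representative(vectors, members):
    """Index of the member sentence most similar to the cluster centroid."""
    total = Counter()
    for i in members:
        for term, weight in vectors[i].items():
            total[term] += weight
    centroid = {term: weight / len(members) for term, weight in total.items()}
    return max(members, key=lambda i: cosine(vectors[i], centroid))


sentences = [
    {"f16": 1, "sale": 1},
    {"f16": 1, "sale": 1, "cancel": 1},
    {"nuclear": 1, "test": 1},
    {"nuclear": 1, "treaty": 1},
]
for cluster in group_average_hac(sentences, num_clusters=2):
    print(cluster, "representative:", representative(sentences, cluster))
```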

Sample Timeline 2
Top 5 “useful” cluster representatives, taken from 15 clusters, ordered chronologically:
- Mr. Clinton also said he and Mr. Sharif would try to resolve a dispute over a canceled sale of U.S. fighter aircraft.
- Mr. Clinton wants India and Pakistan, which each conducted nuclear tests last May, to sign the Comprehensive Test Ban Treaty as they have indicated they would do.
- All of you know my concern to do everything we can to end the nuclear competition in South Asia which I believe is a threat to Pakistan and India and to the stability of the world.
- The delivery was blocked under a 1990 law barring direct U.S. military sales to Pakistan because of its development of nuclear weapons.
- He met with Pakistan's Prime Minister Omar Sharif.

Clustering in Progress
- Using SVD dimensionality reduction
- Noun phrase features with part-of-speech filtering
- Representative sentences chosen from each cluster based on Useful score
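The noun-phrase features with part-of-speech filtering could be extracted along these lines. This is a sketch under assumptions: input tokens are hypothetical `(word, pos_tag, chunk_tag)` triples, POS tags follow the Penn Treebank convention (verb tags start with "VB"), chunk tags use an IOB2-style scheme ("B-NP", "I-NP", "O"), and each noun phrase becomes a single lower-cased feature. Outside noun phrases only verbs are kept, as described on the data processing slide.

```python
def extract_features(tagged_tokens):
    """Return noun-phrase features plus verbs that fall outside any noun phrase."""
    features = []
    current_np = []   # words of the noun phrase currently being collected

    def flush():
        if current_np:
            features.append(" ".join(current_np))
            current_np.clear()

    for word, pos, chunk in tagged_tokens:
        if chunk == "B-NP":            # a new noun phrase starts here
            flush()
            current_np.append(word.lower())
        elif chunk == "I-NP":          # continuation of the current noun phrase
            current_np.append(word.lower())
        else:                          # outside any noun phrase
            flush()
            if pos.startswith("VB"):   # keep verbs only
                features.append(word.lower())
    flush()
    return features


tagged = [
    ("Pakistan", "NNP", "B-NP"),
    ("paid", "VBD", "O"),
    ("the", "DT", "B-NP"),
    ("U.S.", "NNP", "I-NP"),
    ("close", "RB", "O"),
    ("to", "TO", "O"),
]
print(extract_features(tagged))
```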

Problem Areas
- Small annotated corpus
- Few annotated events per topic give a coarse evaluation
- Weak correspondence between clusters and labeled events

Unexplored Possibilities
- Detection and exploitation of temporal features in sentences
- Clustering with different sentence similarity measures
  - WordNet-derived semantic distance