New Event Detection at UMass Amherst Giridhar Kumaran and James Allan.

Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.

DISCOVERING EVENT EVOLUTION GRAPHS FROM NEWSWIRES Christopher C. Yang and Xiaodong Shi Event Evolution and Event Evolution Graph: We define event evolution.
A probabilistic model for retrospective news event detection
Chapter 5: Introduction to Information Retrieval
Improved TF-IDF Ranker
Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
Automatic Timeline Generation Jessica Jenkins Josh Taylor CS 276b.
Self Organization of a Massive Document Collection
Systems Engineering and Engineering Management The Chinese University of Hong Kong Parameter Free Bursty Events Detection in Text Streams Gabriel Pui Cheong.
UMass Amherst at TDT 2003 James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor.
Introduction to Automatic Classification Shih-Wen (George) Ke 7 th Dec 2005.
Gimme’ The Context: Context-driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
Automatic Classification of Semantic Relations between Facts and Opinions Koji Murakami, Eric Nichols, Junta Mizuno, Yotaro Watanabe, Hayato Goto, Megumi.
1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS.
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Over the last years, the amount of malicious code (Viruses, worms, Trojans, etc.) sent through the internet is highly increasing. Due to this significant.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty Gabrilovich et al. WWW2004.
Information Retrieval and Web Search Text properties (Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan.
Vectors Readings: Chapter 3. Vectors Vectors are the objects which are characterized by two parameters: magnitude (length) direction These vectors are.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
Translingual Topic Tracking with PRISE Gina-Anne Levow and Douglas W. Oard University of Maryland February 28, 2000.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Automatic Detection of Tags for Political Blogs Khairun-nisa Hassanali Vasileios Hatzivassiloglou The University.
Patterns of Event Causality Suggest More Effective Corrective Actions Abstract: The Occurrence Reporting and Processing System (ORPS) has used a consistent.
Ceyhun Karbeyaz Çağrı Toraman Anıl Türel Ahmet Yeniçağ Text Categorization For Turkish News.
Friends and Locations Recommendation with the use of LBSN By EKUNDAYO OLUFEMI ADEOLA
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
UMass at TDT 2000 James Allan and Victor Lavrenko (with David Frey and Vikas Khandelwal) Center for Intelligent Information Retrieval Department of Computer.
Overview of the TDT-2003 Evaluation and Results Jonathan Fiscus NIST Gaithersburg, Maryland November 17-18, 2002.
1 Automatic Classification of Bookmarked Web Pages Chris Staff Second Talk February 2007.
Beyond Sliding Windows: Object Localization by Efficient Subwindow Search The best paper prize at CVPR 2008.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina Santamaria, Julio Gonzalo, Javier Artiles nlp.uned.es UNED, c/Juan del Rosal,
1 Opinion Retrieval from Blogs Wei Zhang, Clement Yu, and Weiyi Meng (2007 CIKM)
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 2007.SIGIR.8 New Event Detection Based on Indexing-tree.
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Inference Protocols for Coreference Resolution Kai-Wei Chang, Rajhans Samdani, Alla Rozovskaya, Nick Rizzolo, Mark Sammons, and Dan Roth This research.
1 A Biterm Topic Model for Short Texts Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng Institute of Computing Technology, Chinese Academy of Sciences.
Calculating cosine for two vectors 1 Given two vectors and : 1 2 x2x2 x1x1 y1y1 y2y2 By using formula [2], we can write: Since and, and using [1]: By using.
Linear Classifiers Rubine & CA-Linear Ruben Balcazar.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Topics Detection and Tracking Presented by CHU Huei-Ming 2004/03/17.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
Information Extraction Entity Extraction: Statistical Methods Sunita Sarawagi.
LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan, Dragomir R. Radev (EMNLP 2004)
2005/09/13 A Probabilistic Model for Retrospective News Event Detection Zhiwei Li, Bin Wang*, Mingjing Li, Wei-Ying Ma University of Science and Technology.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
TDT 2004 Unsupervised and Supervised Tracking Hema Raghavan UMASS-Amherst at TDT 2004.
Hierarchical Topic Detection UMass - TDT 2004 Ao Feng James Allan Center for Intelligent Information Retrieval University of Massachusetts Amherst.
Link Distribution on Wikipedia [0422]KwangHee Park.
Paul van Mulbregt Sheera Knecht Jon Yamron Dragon Systems Detection at Dragon Systems.
IR 6 Scoring, term weighting and the vector space model.
Multi-document Summarization Sandeep Sripada Venu Gopal Kasturi Gautam Kumar Parai.
Korean version of GloVe Applying GloVe & word2vec model to Korean corpus speaker : 양희정 date :
Semantic Processing with Context Analysis
MMS Software Deliverables: Year 1
Exploiting Topic Pragmatics for New Event Detection in TDT-2004
Retrieval Utilities Relevance feedback Clustering
Text Mining Application Programming Chapter 9 Text Categorization
Connecting the Dots Between News Article
Presentation transcript:

New Event Detection at UMass Amherst Giridhar Kumaran and James Allan

CIIR, UMass Amherst
Slide 2 – Preprocessing
 Lemur Toolkit for tokenization, stopping, and k-stemming (the Krovetz stemmer)
 BBN IdentiFinder™ for extracting named entities
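Where the Lemur Toolkit and IdentiFinder are unavailable, the same preprocessing flow can be sketched with placeholders. The stopword list and the trailing-"s" stemmer below are deliberately simplified stand-ins, not the toolkits' actual behavior:

```python
import re

# Tiny illustrative stopword list; Lemur ships a much larger one.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}

def preprocess(text):
    """Tokenize, remove stopwords, and crudely stem a news story."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    # Placeholder stemmer: strip a trailing 's' (k-stemming is far smarter).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]

print(preprocess("The markets of Russia fell for a third day"))
```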

Slide 3 – Systems fielded
 Submitted four systems
 Didn’t include last year’s system
  Classification according to LDC categories and term pruning
  Didn’t work on an exclusively newswire (NW) story corpus

Slide 4 – Primary system: UMass1
 Utility of named entities acknowledged
 Failure analysis indicates
  A large number of old stories have low confidence scores (false alarms)
  Conflict with new-story scores
 Reasons
  Stories on multiple topics
  Diffuse topics
  Varying document lengths

Slide 5 – Primary system: UMass1
 Focus: identify old stories better – this affects cost
 Clue: most old stories get low confidence scores, as topics are linked by
  only named entities (a large number of them)
  only non-named entities (a few)

Slide 6 – Primary system: UMass1
 Approach
  Look at the set of closest matching stories
  If there is a consistently high named-entity or non-named-entity match, modify the confidence score

Slide 7 – Primary system: UMass1
 Procedure
  Double the original confidence score if it is less than a threshold
  Gradually reduce the score back towards the original score if the set of closest stories matches neither on named entities nor on non-named entities
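The two-step procedure above can be sketched as follows. The dominance test (one similarity more than twice the other) and the linear pull-back are illustrative choices, not the formula the fielded system used:

```python
def adjust_confidence(score, neighbors, threshold=0.1):
    """Sketch of the UMass1 adjustment.

    `score` is the story's original confidence; `neighbors` is a list of
    (ne_sim, non_ne_sim) pairs for the closest matching prior stories.
    """
    if score >= threshold or not neighbors:
        return score
    boosted = 2 * score  # Step 1: double a below-threshold score.
    # Step 2: fraction of neighbors matching on named entities only
    # or on non-named entities only (hypothetical dominance test).
    consistent = sum(1 for ne, non_ne in neighbors
                     if ne > 2 * non_ne or non_ne > 2 * ne)
    frac = consistent / len(neighbors)
    # Pull back toward the original score when that evidence is weak.
    return score + (boosted - score) * frac
```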

Slides 8–12 – UMass1: examples from TDT3
 Russian Financial Crisis – old story: a table of AllSim, NESim, and noNESim scores for a series of APW stories, built up over successive slides with the decision threshold marked (Threshold = 0.1); the numeric values did not survive in this transcript.

Slide 13 – UMass1: examples from TDT3
 Thai Airbus Crash – new story: the same AllSim / NESim / noNESim comparison over APW stories; the numeric values did not survive in this transcript.

Slides 14–15 – UMass1 on TDT3 (results figures; not preserved in this transcript)

Slide 16 – UMass2
 Basic vector-space-model system
 Compare with all preceding stories
 Return the highest cosine match
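As a sketch, the UMass2 decision rule amounts to "one minus the highest cosine match against all prior stories". Raw term frequencies are used below for brevity; the actual system uses weighted term vectors:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def new_event_score(story_tokens, prior_stories):
    """Return 1 - max cosine to any prior story: high means likely new."""
    vec = Counter(story_tokens)
    best = max((cosine(vec, Counter(p)) for p in prior_stories), default=0.0)
    return 1.0 - best

print(new_event_score(["quake", "tokyo"], [["quake", "tokyo"], ["election"]]))
```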

Slide 17 – UMass3
 Same model as UMass2
 TDT5 is a very large collection, so this is the practical system
 Compare only with the stories that have the highest coordination match (up to a fixed maximum; the exact limit did not survive in this transcript)
  Faster
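The restriction can be sketched as a cheap pre-filter: rank prior stories by coordination level (number of shared unique terms) and run the full cosine comparison only on the top k survivors. The cutoff value and tie-breaking below are illustrative assumptions:

```python
def top_candidates(story_terms, prior_stories, k=100):
    """Indices of the k prior stories sharing the most unique terms
    with the incoming story; stories sharing nothing are dropped."""
    s = set(story_terms)
    scored = [(len(s & set(p)), i) for i, p in enumerate(prior_stories)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # Most overlap first.
    return [i for overlap, i in scored[:k] if overlap > 0]
```

Only the surviving candidates are then scored with the expensive cosine comparison, which is what makes the system faster on TDT5-scale collections.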

Slide 18 – UMass4
 Similar to UMass1; the rationale is the same
 Consider the top five matches
 Use a different formula for modifying the confidence score

Slide 19 – Performance summary (topic-weighted minimum cost on TDT5 and TDT4; the numeric values did not survive in this transcript)
 UMass1 – modify confidence score based on evidence
 UMass2 – basic vector space model
 UMass3 – UMass2 + restriction on the number of documents compared with
 UMass4 – UMass1 with a different formula

Slide 20 – Summary
 The basic vector space model did the best
 Restricting the number of stories to be compared with
  improved system speed
  didn’t improve performance
 The primary system did extremely well on training data, but failed on TDT5