UMass at TDT 2000 James Allan and Victor Lavrenko (with David Frey and Vikas Khandelwal) Center for Intelligent Information Retrieval Department of Computer.

Presentation transcript:

UMass at TDT 2000 James Allan and Victor Lavrenko (with David Frey and Vikas Khandelwal) Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts, Amherst

Work on Story Link Detection
Active work on SLD
–Not ready in time for official submission
Story “smoothing” using query expansion
Score normalization based on language pair

What is LCA?
Local Context Analysis
–Query expansion technique from IR
–More stable than other “pseudo RF” (pseudo relevance feedback) approaches
–Applicable to more than document retrieval
Basic idea
–Retrieve a set of passages similar to the query
–Mine those passages for words near query terms (ad-hoc weighting designed to do that)
–Add words to the query and re-run

LCA for story smoothing
Convert story to a weighted vector
–Inquery weights (incl. Okapi tf component)
Select top 100 most highly weighted terms
Find top 20 stories most similar (cosine)
Weight all terms in top 20 stories (LCA)
Select top 100 LCA expansion terms
Add to story (decaying weights from 1.0)
Story now represented by terms
Compare smoothed story vectors
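The smoothing steps above can be sketched roughly as follows. This is a minimal illustration, not the UMass system: `weight_story` uses a generic tf-idf stand-in rather than the Inquery/Okapi weights, the pooling inside `smooth_story` is a similarity-weighted sum rather than the real LCA term weighting, and the linear decay schedule is an assumption (the slide only says the weights decay from 1.0). All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def top_terms(vec, k):
    """Keep the k highest-weighted terms of a sparse term->weight vector."""
    return dict(sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))[:k])


def cosine(a, b):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def weight_story(tokens, doc_freq, n_docs):
    """Stand-in tf-idf weighting (assumption: NOT the Inquery/Okapi formula)."""
    tf = Counter(tokens)
    return {
        t: (c / (c + 1.0)) * math.log((n_docs + 1.0) / (doc_freq.get(t, 0) + 0.5))
        for t, c in tf.items()
    }


def smooth_story(story_vec, corpus_vecs, k_terms=100, n_similar=20, m_expand=100):
    """Expand a story vector with terms mined from its most similar stories."""
    # Steps 1-2: keep only the most highly weighted terms of the story.
    query = top_terms(story_vec, k_terms)
    # Step 3: find the stories most similar to it by cosine.
    scored = sorted(
        ((cosine(query, v), i) for i, v in enumerate(corpus_vecs)), reverse=True
    )[:n_similar]
    # Step 4: weight every term in those stories (stand-in for LCA weighting).
    pooled = Counter()
    for sim, i in scored:
        for term, w in corpus_vecs[i].items():
            pooled[term] += sim * w
    # Step 5: select the top expansion terms.
    expansion = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))[:m_expand]
    # Step 6: add them to the story with weights decaying from 1.0
    # (assumed linear decay by rank).
    smoothed = dict(query)
    for rank, (term, _) in enumerate(expansion):
        smoothed[term] = smoothed.get(term, 0.0) + (1.0 - rank / m_expand)
    return smoothed
```

Two stories would then be compared with `cosine(smooth_story(a, past), smooth_story(b, past))`, where `past` holds only stories seen so far; passing the whole corpus corresponds to the "cheating" condition.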

Smoothing SLD with LCA
Run on training data (English)
Green line is no smoothing
Blue is smoothing with past stories
Pink is smoothing with whole corpus (cheating)

Work on Story Link Detection
Story “smoothing” using query expansion
Score normalization based on language pair

Score normalization
Noticed that SYSTRAN documents were throwing scores off substantially
–Multilingual SLD was much worse than ENG only
Look at distribution of scores in same-topic and different-topic pairs

Score distributions, same topic (panels: EE, MM, ME)

Score distributions, diff topic (panels: EE, MM, ME)

Clearly need to normalize
SYSTRAN stories use different vocabulary
–Stories are much more likely to be alike
–And much less likely to be like true English
Develop normalization based on whether within or cross-language
Convert scores into probabilities
–Use distribution plots for each case
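One plausible way to turn raw scores into probabilities from per-condition distributions is sketched below. The slides do not give the actual conversion, so this is an assumption: a histogram estimate of P(same topic | score, language pair) with add-one style smoothing. The class name `PairNormalizer`, the bin count, and the score range are all hypothetical.

```python
from collections import defaultdict


class PairNormalizer:
    """Histogram-based score-to-probability mapping, one histogram per language pair."""

    def __init__(self, n_bins=10, lo=0.0, hi=1.0, prior=1.0):
        self.n_bins, self.lo, self.hi = n_bins, lo, hi
        # counts[pair][bin] = [different-topic count, same-topic count];
        # `prior` is a smoothing pseudo-count so empty bins give 0.5.
        self.counts = defaultdict(lambda: [[prior, prior] for _ in range(n_bins)])

    def _bin(self, score):
        """Map a raw score to a histogram bin, clamping out-of-range scores."""
        frac = (score - self.lo) / (self.hi - self.lo)
        return min(self.n_bins - 1, max(0, int(frac * self.n_bins)))

    def fit(self, examples):
        """examples: iterable of (pair, raw_score, is_same_topic) from training data."""
        for pair, score, same in examples:
            self.counts[pair][self._bin(score)][1 if same else 0] += 1
        return self

    def prob_same(self, pair, score):
        """Estimated probability that a story pair with this score is on the same topic."""
        diff, same = self.counts[pair][self._bin(score)]
        return same / (same + diff)
```

With this kind of mapping, the same raw score can yield different probabilities for an "EE" pair and an "ME" pair, which is the effect the normalization is after. Fitting and evaluating on the same data would correspond to the "cheating" condition.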

Combined distribution (before normalization): same topic vs. diff. topic

After normalization (on same data, “cheating”): same topic vs. diff. topic
Probabilities!

DET plots from normalization
Huge change in distributions
Less pronounced change in DET plot
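For reference, the points behind a DET plot can be computed as in the sketch below: a miss rate and a false-alarm rate at each decision threshold. The function name `det_points` is hypothetical, and the sketch omits the normal-deviate axis scaling that DET plots conventionally use. Because these rates depend only on how the scores are ordered, a rescaling changes the curve only where it reorders pairs, which is one possible reason a large shift in the distributions shows up less strongly in the DET plot.

```python
def det_points(scores, labels):
    """Return (threshold, p_miss, p_false_alarm) for each distinct score.

    labels[i] is True when pair i is a same-topic (target) pair.
    A pair is declared "linked" when its score is >= threshold.
    """
    n_target = sum(1 for y in labels if y)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target pair")
    points = []
    for thr in sorted(set(scores)):
        # Misses: target pairs scored below the threshold.
        misses = sum(1 for s, y in zip(scores, labels) if y and s < thr)
        # False alarms: non-target pairs scored at or above the threshold.
        false_alarms = sum(1 for s, y in zip(scores, labels) if not y and s >= thr)
        points.append((thr, misses / n_target, false_alarms / n_nontarget))
    return points
```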

Conclusions
Story smoothing with LCA works
–Need to “smooth” with all stories before later
–Need to use different matching for smoothing and then story-story comparison
Score normalization has potential
–Other sites have found similar effects
–Experiments on source-type (audio, newswire) within language pairs have been inconclusive
Not much training data for doing conversion