CMU Y2 Rosetta GnG Distillation


CMU Y2 Rosetta GnG Distillation
Jonathan Elsas, Jaime Carbonell

Rosetta GnG System Evolution
(Timeline: Y1 system, Y2 system, rank learning, with the Y1, Y2, and Y3+ evaluations as milestones.)
Y1: CMU did the bulk of the work, building an essentially Indri-backed passage retrieval system with simple duplicate detection. IBM handled post-filtering, snippet composition, redundancy, etc.
Y2: IBM took over main development of the system, still using Indri primarily for document retrieval. We addressed specific challenges identified in the Y1 system: how to use previously identified relevant documents and passages to tune the importance of different aspects of the templated query.

Distillation Challenges
Retrieval challenge identified in the first year: how to weight the different aspects of the information need to optimize ranked retrieval performance.
Multiple aspects to the information need:
- Query arguments, locations, related words
- Static expansion terms/phrases
- Bigrams, trigrams, term windows
- Named-entity wildcards & constraints
The occurrence of each of these in a document* is a "feature" indicating the relevance of that document* to the information need.
Question: how do we best choose the weights for each feature?
* Or sentence, paragraph, "nugget", etc.
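To make the weighting question concrete, one way to write the retrieval score is as a weighted linear combination of these features; this formulation is an assumption consistent with feature-based retrieval models (e.g., Indri-style weighted combinations), not a formula shown on the slides.

```latex
% Assumed linear feature-combination model: each feature f_i(q,d) (unigram
% match, bigram match, entity-constrained match, ...) contributes with
% weight lambda_i; choosing the lambda_i is exactly the weighting problem.
\[
  \mathrm{score}(q, d) = \sum_{i=1}^{k} \lambda_i \, f_i(q, d)
\]
```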

Query Feature Construction
Example query: DESCRIBE THE ACTIONS OF [Mahmoud Abbas] DURING…
Location: Middle East
Equivalent terms: Mahmoud Abbas; Abu Mazen; President of the Palestinian National Authority
Query features built from this template:
- Unigram features
- Bigram & term-window features
- Entity-type constrained features
- Entity co-reference features: aliases, nominal references (roles, descriptions), pronominal references
- Static template-based expansion (unigrams, bigrams, term windows)
- Potentially many more: structural features, PRF, and SRL annotations
This just scratches the surface. Other features include (1) dynamic query expansion within the corpus or using external corpora, (2) document-structure features (headline, body, slug), (3) SRL-based features, (4) predictive-annotation features, and (5) features derived from translation/ASR artifacts. Even with as few as 4 or 5 features, an exhaustive search of the hypothesis space is impractical.
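As an illustration of how such a templated query could be expanded into retrieval features, here is a minimal Python sketch (not from the slides); the #combine, #weight, #1, and #uwN operators are standard Indri query-language syntax, while the helper functions, placeholder weights, and expansion terms are hypothetical.

```python
# Minimal sketch: turning a templated query into Indri-style feature clauses.
# Helper names and the feature inventory are hypothetical; only the Indri
# operators (#combine, #weight, #1, #uwN) are standard query-language syntax.

def unigram_features(terms):
    """One feature clause per query term."""
    return list(terms)

def bigram_features(terms, window=8):
    """Exact bigrams (#1) plus unordered term windows (#uwN)."""
    clauses = []
    for a, b in zip(terms, terms[1:]):
        clauses.append(f"#1({a} {b})")           # ordered, adjacent terms
        clauses.append(f"#uw{window}({a} {b})")  # unordered, within `window` terms
    return clauses

def build_query(equivalent_terms, expansion_terms):
    """Combine feature groups under #weight; the per-group weights (1.0 here)
    are placeholders for what the rank learner should set."""
    groups = []
    for phrase in equivalent_terms:              # e.g. "Mahmoud Abbas", "Abu Mazen"
        terms = phrase.lower().split()
        clauses = unigram_features(terms) + bigram_features(terms)
        groups.append("#combine(" + " ".join(clauses) + ")")
    groups.append("#combine(" + " ".join(expansion_terms) + ")")
    return "#weight(" + " ".join(f"1.0 {g}" for g in groups) + ")"

print(build_query(["Mahmoud Abbas", "Abu Mazen"],
                  ["actions", "statement", "meeting"]))  # hypothetical expansion terms
```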

Learning Approach to Setting Feature Weights
Goal: use existing relevance judgments to learn an optimal weight setting.
This has recently become a hot research area in IR, known as "learning to rank".

Pair-wise Preference Learning
Learning a document scoring function, treated as a classification problem on pairs of documents: each pair is either correctly or incorrectly ordered. The resulting scoring function is used as the learned document ranker.
Not just documents: the same approach applies to passages, nuggets, etc.
Why pair-wise preference instead of list-wise learning or classifying relevant/non-relevant?
(1) It allows the application of existing classification techniques.
(2) From an operational perspective, it can be easier and more intuitive to collect preference data than to force users to place documents on a graded relevance scale.
(3) It works better than classifying relevant/non-relevant.
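A minimal sketch of the pairwise formulation, assuming a linear scoring function over feature vectors; the feature values and weights below are invented for illustration.

```python
import numpy as np

# Pairwise preference learning sketched as binary classification on document
# pairs: for a preferred document d_rel and a non-preferred d_non, the pair is
# "correctly ordered" when w . f(q, d_rel) > w . f(q, d_non), i.e. when the
# difference vector f(q, d_rel) - f(q, d_non) is classified as positive.

def pair_margin(w, feats_rel, feats_non):
    """Margin of a preference pair under score(d) = w . f(q, d)."""
    return np.dot(w, feats_rel - feats_non)

def correctly_ordered(w, feats_rel, feats_non):
    return pair_margin(w, feats_rel, feats_non) > 0

# Hypothetical 3-feature example (e.g. unigram, bigram, entity-match scores).
w = np.array([0.5, 0.3, 0.2])
f_rel = np.array([2.0, 1.0, 1.0])
f_non = np.array([1.0, 1.0, 0.0])
print(correctly_ordered(w, f_rel, f_non))  # True: the relevant doc outranks the other
```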

Committee Perceptron Algorithm
- Online algorithm (instance-at-a-time): fast training, low memory requirements.
- Ensemble method: selectively keeps the N best hypotheses encountered during training, an "N heads are better than 1" approach with significant advantages over previous perceptron variants.
- Many ways to combine the output of the hypotheses: voting, score averaging, hybrid approaches. This is the focus of current research.
- Our approach shows performance improvements over existing rank-learning algorithms with a significant reduction in training time (45 times faster).

Committee Perceptron Training
(Diagram: training data supplies triples (q, dR, dN), a query with a relevant and a non-relevant document; the current hypothesis is compared against the committee on each triple.)
For each training example:
- If the current hypothesis is better than the worst committee member, replace the worst hypothesis in the committee; otherwise, discard the current hypothesis.
- Update the current hypothesis to better classify this training example.
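A minimal Python sketch of this training loop (assumed details, not specified on the slides: a hypothesis's quality is measured by how many consecutive pairs it ranked correctly before its last mistake, and the committee's outputs are combined by score averaging):

```python
import heapq
import numpy as np

def committee_perceptron(pairs, n_features, committee_size=5, epochs=1):
    """pairs: list of (feats_rel, feats_non) feature vectors for
    (query, relevant doc, non-relevant doc) training triples."""
    committee = []            # min-heap of (quality, id, weight_vector)
    w = np.zeros(n_features)  # current hypothesis
    quality, uid = 0, 0
    for _ in range(epochs):
        for feats_rel, feats_non in pairs:
            diff = feats_rel - feats_non
            if np.dot(w, diff) > 0:
                quality += 1                     # pair ranked correctly
                continue
            # Mistake: consider promoting the current hypothesis, then update it.
            if len(committee) < committee_size:
                heapq.heappush(committee, (quality, uid, w.copy()))
            elif quality > committee[0][0]:      # better than the worst member
                heapq.heapreplace(committee, (quality, uid, w.copy()))
            uid += 1
            w = w + diff                         # perceptron update on the pair
            quality = 0
    return [member for _, _, member in committee]

def committee_score(committee, feats):
    """Score-averaging combination of the committee's hypotheses."""
    return float(np.mean([np.dot(w, feats) for w in committee]))
```

At ranking time, documents (or passages, nuggets, etc.) would be ordered by committee_score; voting over pairwise decisions is an alternative combination mentioned on the previous slide.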

Committee Perceptron Performance
Comparable or better performance than two state-of-the-art batch learning algorithms.
Added bonus: more than 45 times faster training time than RankSVM.

Committee Perceptron Learning Curves
The committee/ensemble approach reaches a better solution faster than existing perceptron variants.

Next Steps (in progress)
- Integrate the current work with the GALE GnG system. Document ranking is the obvious first step; passage ranking poses additional challenges. Both will be addressed this year.
- Implement a feature-based query generation framework for the Rosetta GnG system.
- Extend and improve the performance of our rank-learning algorithm.

Future Work
Investigate the application of preference learning in the Utility system, adapting to real-time user preference feedback.