Presentation transcript:

M. Yağmur Şahin, Çağlar Terzi, Arif Usta

Introduction
What similarity calculation should be used?
- For each type of query?
- For each type of document?
- For each type of desired performance?
Is there a “silver bullet” for measurement?
To find the answer:
- Q-expressions: 8-position strings describing similarity measures
- Tests by extending the mg database system
- Experiments in the TREC environment

Similarity Measure
Effectiveness is judged by recall and precision, in the setting of the TREC conference
A range of sources is drawn on:
- van Rijsbergen [1979]
- Salton and McGill [1983]
- Salton [1989]
- Frakes and Baeza-Yates [1992]
The study extends the earlier work of Salton and Buckley [1988]
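Recall and precision are the basic effectiveness notions used throughout. A minimal set-based sketch for a single query (the function and variable names are illustrative, not from the paper):

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# 4 of the 10 retrieved documents are among the 8 relevant ones.
print(precision(range(1, 11), {2, 4, 6, 8, 11, 12, 13, 14}))  # 0.4
print(recall(range(1, 11), {2, 4, 6, 8, 11, 12, 13, 14}))     # 0.5
```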

Combining Functions
A combining function brings together (see the sketch below):
- the importance of each term in the document
- the importance of that term in the query
- the length or weight of the document
- the length of the query
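As an illustration of how these four ingredients interact, here is a minimal cosine-style combining function; it is a sketch under assumed names, not the paper's exact formulation:

```python
import math

def combine(query_weights, doc_weights):
    """Cosine-style combination: sum the products of query-term and
    document-term weights over shared terms, then normalise by the
    Euclidean lengths of the query and the document."""
    shared = query_weights.keys() & doc_weights.keys()
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_len = math.sqrt(sum(w * w for w in query_weights.values()))
    d_len = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0
```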

Term Weight
Based on inverse document frequency (IDF): rare terms receive higher weights
Salton and Buckley [1988] give three different term-weighting rules
There are document-term and query-term weights; only one of them, both, or neither can be used
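One common IDF formulation is w_t = ln(1 + N / f_t), with N the number of documents and f_t the number of documents containing term t; the slide does not fix the formula, so this variant is an assumption:

```python
import math

def idf(num_docs, doc_freq):
    """One common IDF variant, w_t = ln(1 + N / f_t): the fewer
    documents a term occurs in, the higher its weight."""
    return math.log(1 + num_docs / doc_freq)

print(idf(100_000, 10))      # rare term: ~9.2
print(idf(100_000, 50_000))  # common term: ~1.1
```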

Relative Term Frequency
TF and TF-IDF weighting: w_{d,t} = r_{d,t} * w_t, where r_{d,t} is the relative frequency of term t in document d and w_t is the term weight
Salton and Buckley [1988] describe three different RTF formulations
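The slide does not reproduce the three formulations, so the sketch below uses three commonly cited RTF variants (raw, augmented, and logarithmic term frequency) purely as illustration, plugged into w_{d,t} = r_{d,t} * w_t:

```python
import math

def rtf_raw(f_dt, max_f):
    """Raw within-document frequency."""
    return f_dt

def rtf_augmented(f_dt, max_f):
    """Frequency damped relative to the most frequent term in d."""
    return 0.5 + 0.5 * f_dt / max_f

def rtf_log(f_dt, max_f):
    """Logarithmically damped frequency."""
    return 1 + math.log(f_dt) if f_dt > 0 else 0.0

def w_dt(f_dt, max_f, w_t, rtf=rtf_augmented):
    """Document-term weight: w_{d,t} = r_{d,t} * w_t."""
    return rtf(f_dt, max_f) * w_t
```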

Q-Expression
Each candidate similarity measure is encoded as an 8-position string, e.g. BB-ACB-BAA, where each position selects one variant of a component

Experiments
The aim is to find the best combination
Exhaustive enumeration over the pattern [AB][BDI]-[AB][CEF][BDIK]-[AB][ACE]A
- 720 possibilities
- 5-10 minutes of CPU time per mechanism
- 2-4 seconds per query per collection
- Total: about 4 weeks
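The enumeration is a Cartesian product over the letter alternatives at each position. A sketch using the pattern printed on the slide (note that this pattern alone yields 432 strings, so the 720 quoted presumably reflects the paper's full pattern):

```python
from itertools import product

# Letter alternatives at each of the 8 positions, taken from the
# slide's pattern [AB][BDI]-[AB][CEF][BDIK]-[AB][ACE]A.
POSITIONS = ["AB", "BDI", "AB", "CEF", "BDIK", "AB", "ACE", "A"]

def q_expressions():
    """Yield every concrete q-expression, formatted as XX-XXX-XXX."""
    for combo in product(*POSITIONS):
        s = "".join(combo)
        yield f"{s[:2]}-{s[2:5]}-{s[5:]}"

mechanisms = list(q_expressions())
print(len(mechanisms))   # 432 for this pattern
print(mechanisms[0])     # AB-ACB-AAA
```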

Experiments
6 experimental domains: 3 query sets x 2 collections
- Query sets: title, narrative, full
- Collections: AP2+WSJ2 (newspaper articles) and FR2+ZIFF2 (non-newspaper articles)
3 effectiveness measures (sketched below):
- average 11-point recall-precision over the query set
- average precision-at-20 over the query set
- average reciprocal rank of the first relevant document retrieved
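Minimal per-query sketches of the three measures, assuming a ranked list of document ids and a set of relevant ids (all names are illustrative); each value is then averaged over the query set, as on the slide:

```python
def precision_at_k(ranking, relevant, k=20):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document; 0 if none is retrieved."""
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def eleven_point_average(ranking, relevant):
    """Interpolated precision averaged at recall 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, prec.)
    return sum(max((p for r, p in points if r >= level / 10), default=0.0)
               for level in range(11)) / 11
```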

Experiments

Conclusion
No particular measure really stood out; no measure worked consistently well across all of the queries in a query set
No component or weighting scheme was shown to be consistently valuable across all of the experimental domains
Better performance can be obtained by choosing a similarity measure to suit each query on an individual basis, but doing so is implausible!