
Data Fusion
Eyüp Serdar AYAZ, İlker Nadi BOZKURT, Hayrettin GÜRKÖK

Outline
What is data fusion?
Why use data fusion?
Previous work
Components of data fusion
– System selection
– Bias concept
– Data fusion methods
Experiments
Conclusion

Data Fusion
Merging the retrieval results of multiple systems. A data fusion algorithm accepts two or more ranked lists and merges them into a single ranked list, with the aim of providing better effectiveness than any of the individual systems being fused.

Why use data fusion?
Combining evidence from different systems leads to performance improvements
– Use data fusion to achieve better performance than any of the individual systems involved in the process
Example: metasearch systems

Why use data fusion?
The same idea is also used for different query representations
– Fuse the results of different query representations of the same request to obtain better results
Measuring the relative performance of IR systems such as web search engines is essential
– Use data fusion to find pseudo-relevant documents and use them for automatic ranking of retrieval systems

Previous work
Borda Count method in IR
– Models for Metasearch, Aslam & Montague, '01
Random selection, Soboroff et al., '01
Condorcet method in IR
– Condorcet Fusion in Information Retrieval, Aslam & Montague, '02
Reference Count method for automatic ranking, Wu & Crestani, '02

Previous work
Logistic regression and SVM models
– Learning a Ranking from Pairwise Preferences, Carterette & Petkova, '06
Fusion in automatic ranking of IR systems
– Automatic Ranking of Information Retrieval Systems Using Data Fusion, Nuray & Can, '06

Components of data fusion
1. DB/search engine selector: select the systems to fuse
2. Query dispatcher: submit queries to the selected search engines
3. Document selector: select the documents to fuse
4. Result merger: merge the selected document results
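A minimal sketch of how these four components could be wired together; the function and type names here (data_fusion_pipeline, select_systems, select_documents, merge, RankedList) are illustrative assumptions, not anything defined in the presentation:

```python
from typing import Callable, List

RankedList = List[str]                      # an ordered list of document ids
SearchEngine = Callable[[str], RankedList]  # query in, ranked list out

def data_fusion_pipeline(engines, query, select_systems, select_documents, merge):
    # 1. DB/search engine selector: pick which systems to fuse.
    selected = select_systems(engines)                     # dict: name -> SearchEngine
    # 2. Query dispatcher: submit the query to each selected engine.
    results = {name: run(query) for name, run in selected.items()}
    # 3. Document selector: keep only the documents to fuse (e.g., the top k per system).
    results = select_documents(results)                    # dict: name -> RankedList
    # 4. Result merger: merge the selected results into a single ranked list.
    return merge(results)
```

Each component can then be swapped independently, for example a different merger for each of the data fusion methods discussed later.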

Ranking retrieval systems (figure)

System selection methods
1. Best: a certain percentage of the top-performing systems are used
2. Normal: all systems to be ranked are used
3. Bias: a certain percentage of the systems that behave differently from the norm (the majority of all systems) are used
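A small sketch of how such a selector might look; the strategy names and the fraction parameter are assumptions standing in for the "certain percentage" mentioned above:

```python
def select_systems(systems, score, strategy="best", fraction=0.5):
    # systems: list of system ids.
    # score: dict mapping a system id to its effectiveness (for "best") or its bias (for "bias").
    if strategy == "normal":
        return list(systems)
    k = max(1, int(len(systems) * fraction))   # "certain percentage" of the systems
    return sorted(systems, key=score.get, reverse=True)[:k]
```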

More on the bias concept
A system is defined to be biased if its query responses differ from the norm, i.e., from the majority of the documents returned by all systems.
Biased systems improve data fusion:
– Ordinary systems are eliminated from the fusion
– Better discrimination among documents and systems

Calculating the bias of a system
Similarity value: s(w, v) = ( Σ_i w_i·v_i ) / ( sqrt(Σ_i w_i²) · sqrt(Σ_i v_i²) )
Bias of a system: Bias(w) = 1 − s(w, v)
v: the norm vector (in the example below, the sum of the systems' frequency vectors)
w: the vector of the retrieval system

Example of calculating bias
2 systems: A and B; 7 documents: a, b, c, d, e, f, g; the i-th row of a system's result table is its result for the i-th query.
Frequency vectors of the systems: X_A = (3, 3, 3, 2, 1, 0, 0), X_B = (0, 2, 3, 0, 2, 3, 2)
Norm vector: X = X_A + X_B = (3, 5, 6, 2, 3, 3, 2)
s(X_A, X) = 49 / (32 · 96)^(1/2) ≈ 0.884, so Bias(A) = 1 − 0.884 ≈ 0.116
s(X_B, X) = 47 / (30 · 96)^(1/2) ≈ 0.876, so Bias(B) = 1 − 0.876 ≈ 0.124
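A short sketch of this computation, assuming the cosine-style similarity and Bias = 1 − s reconstructed above; run on the example vectors it reproduces Bias(A) ≈ 0.116 and Bias(B) ≈ 0.124:

```python
from math import sqrt

def similarity(w, v):
    # Cosine-style similarity between a system's frequency vector w and the norm vector v.
    dot = sum(wi * vi for wi, vi in zip(w, v))
    return dot / (sqrt(sum(wi * wi for wi in w)) * sqrt(sum(vi * vi for vi in v)))

def bias(w, v):
    # Bias of a system = 1 - similarity to the norm.
    return 1 - similarity(w, v)

# Example from the slide: frequency vectors over documents a..g.
X_A = [3, 3, 3, 2, 1, 0, 0]
X_B = [0, 2, 3, 0, 2, 3, 2]
X = [a + b for a, b in zip(X_A, X_B)]   # norm vector

print(round(bias(X_A, X), 3))  # ~0.116
print(round(bias(X_B, X), 3))  # ~0.124
```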

Bias calculation with order
Order is important because users usually look only at the higher-ranked documents.
Increment the frequency count of a document by m/i instead of 1, where m is the number of rank positions and i is the position of the document.
Same setup: 2 systems A and B, 7 documents a, b, c, d, e, f, g; the i-th row is the result for the i-th query.
With m = 4: X_A = (10, 8, 4, 2, 1, 0, 0); X_B = (0, 8, 22/3, 0, 2, 8/3, 7/3)
Bias(A) = 0.0087; Bias(B) = …
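Continuing the same sketch, the rank-weighted frequency vectors could be built as below and then passed to the same bias computation; since the per-query ranked lists are not reproduced in the transcript, the input format here is an assumption:

```python
def weighted_frequency(ranked_lists, documents, m=4):
    # For each query's ranked list, add m/i to a document's count,
    # where i is the document's 1-based position in that list.
    counts = {d: 0.0 for d in documents}
    for ranking in ranked_lists:
        for i, doc in enumerate(ranking, start=1):
            counts[doc] += m / i
    return [counts[d] for d in documents]
```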

Data fusion methods
1. Similarity value models
– CombMIN, CombMAX, CombMED
– CombSUM, CombANZ, CombMNZ
2. Rank-based models
– Rank position (reciprocal rank) method
– Borda count method
– Condorcet method
– Logistic regression model

Similarity value methods
CombMIN – choose the minimum of the similarity values
CombMAX – choose the maximum of the similarity values
CombMED – take the median of the similarity values
CombSUM – take the sum of the similarity values
CombANZ – CombSUM / number of non-zero similarity values
CombMNZ – CombSUM × number of non-zero similarity values
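A compact sketch of these six combination rules, assuming each system reports a similarity score per document and that a document a system does not return contributes a score of 0 (this assumption matters for CombMIN and CombMED):

```python
from statistics import median

def combine(scores_per_system, method="CombSUM"):
    # scores_per_system: list of dicts {doc_id: similarity score}, one dict per system.
    docs = set().union(*scores_per_system)
    fused = {}
    for d in docs:
        scores = [s.get(d, 0.0) for s in scores_per_system]
        nonzero = sum(1 for x in scores if x > 0)
        total = sum(scores)
        fused[d] = {
            "CombMIN": min(scores),
            "CombMAX": max(scores),
            "CombMED": median(scores),
            "CombSUM": total,
            "CombANZ": total / nonzero if nonzero else 0.0,
            "CombMNZ": total * nonzero,
        }[method]
    # Higher combined score = more relevant.
    return sorted(docs, key=lambda d: fused[d], reverse=True)
```

For example, with two hypothetical score lists, combine([{'d1': 0.8, 'd2': 0.3}, {'d2': 0.6}], 'CombMNZ') ranks d2 above d1, because d2 receives non-zero scores from both systems.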

Rank position method
Merge documents using only their rank positions.
Rank score of document i: r(i) = 1 / ( Σ_j 1 / rank_j(i) ), where j ranges over the systems and rank_j(i) is the position of document i in system j's result list.
If a system j has not ranked document i at all, it is simply skipped in the sum.

Rank position example
4 systems: A, B, C, D; documents: a, b, c, d, e, f, g
Query results: A = {a, b, c, d}, B = {a, d, b, e}, C = {c, a, f, e}, D = {b, g, e, f}
r(a) = 1 / (1 + 1 + 1/2) = 0.4
r(b) = 1 / (1/2 + 1/3 + 1) ≈ 0.55
Final ranking (a lower score means a more relevant document): (most relevant) a > b > c > e > d > f > g (least relevant)
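A sketch of the rank position method as described above (documents a system does not rank are skipped, and a lower fused score means a more relevant document); on the four example result lists it produces the ranking given above:

```python
def rank_position_fusion(ranked_lists):
    # ranked_lists: one ranked list of document ids per system.
    inverse_rank_sum = {}
    for ranking in ranked_lists:
        for position, doc in enumerate(ranking, start=1):
            inverse_rank_sum[doc] = inverse_rank_sum.get(doc, 0.0) + 1.0 / position
    # r(d) = 1 / sum_j (1 / rank of d in system j); smaller r = more relevant.
    scores = {doc: 1.0 / s for doc, s in inverse_rank_sum.items()}
    return sorted(scores, key=scores.get)

# The example above: systems A, B, C, D.
print(rank_position_fusion([list("abcd"), list("adbe"), list("cafe"), list("bgef")]))
# ['a', 'b', 'c', 'e', 'd', 'f', 'g']
```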

Borda Count method
Based on democratic election strategies.
The highest-ranked document in a system gets n Borda points, and each subsequent document gets one point less, where n is the number of distinct documents retrieved by all of the systems.

Borda Count example
3 systems: A, B, C
Query results: A = {a, c, b, d}, B = {b, c, a, e}, C = {c, a, b, e}
– 5 distinct documents are retrieved (a, b, c, d, e), so n = 5
BC(a) = BC_A(a) + BC_B(a) + BC_C(a) = 5 + 3 + 4 = 12
BC(b) = BC_A(b) + BC_B(b) + BC_C(b) = 3 + 5 + 3 = 11
Final ranking of documents: (most relevant) c > a > b > e > d (least relevant)
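A sketch of Borda count fusion as described on the previous slide; it reproduces the example ranking c > a > b > e > d. As a simplifying assumption, a document a system does not rank simply receives no points from that system:

```python
def borda_fusion(ranked_lists):
    # n = total number of distinct documents retrieved by all systems.
    n = len(set().union(*ranked_lists))
    points = {}
    for ranking in ranked_lists:
        for position, doc in enumerate(ranking):
            # The highest-ranked document gets n points, each subsequent one point less.
            points[doc] = points.get(doc, 0) + (n - position)
    return sorted(points, key=points.get, reverse=True)

print(borda_fusion([list("acbd"), list("bcae"), list("cabe")]))
# ['c', 'a', 'b', 'e', 'd']  (scores 13, 12, 11, 4, 2)
```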

Condorcet method
Also based on democratic election strategies.
A majoritarian method: the winner is the document that beats each of the other documents in pairwise comparisons.

Condorcet example
3 candidate documents: a, b, c; 5 systems: A, B, C, D, E
A: a > b > c    B: a > c > b    C: a > b = c    D: b > a    E: c > a

Pairwise comparison (wins, losses, ties of the row document against the column document):
      a          b          c
a     -          4, 1, 0    4, 1, 0
b     1, 4, 0    -          2, 2, 1
c     1, 4, 0    2, 2, 1    -

Pairwise winners:
      Win   Lose   Tie
a     2     0      0
b     0     1      1
c     0     1      1

Final ranking of documents: a > b = c
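A sketch of Condorcet fusion under the conventions visible in the example: a document ranked by a system beats any document that system leaves unranked, documents in the same tier produce no vote for that pair, and the final order is by pairwise wins and then by fewest losses. The tier-list input format is an assumption:

```python
from itertools import combinations

def condorcet_fusion(system_rankings, documents):
    # system_rankings: one ranking per system, given as tiers of equally preferred docs,
    # e.g. "a > b = c" becomes [["a"], ["b", "c"]]; unlisted documents are unranked.
    wins = {d: 0 for d in documents}
    losses = {d: 0 for d in documents}
    for x, y in combinations(documents, 2):
        x_votes = y_votes = 0
        for tiers in system_rankings:
            pos = {doc: i for i, tier in enumerate(tiers) for doc in tier}
            if x in pos and (y not in pos or pos[x] < pos[y]):
                x_votes += 1
            elif y in pos and (x not in pos or pos[y] < pos[x]):
                y_votes += 1
            # both unranked, or in the same tier: no vote for this pair
        if x_votes > y_votes:
            wins[x] += 1
            losses[y] += 1
        elif y_votes > x_votes:
            wins[y] += 1
            losses[x] += 1
        # equal votes: the pair is a tie, no win or loss recorded
    return sorted(documents, key=lambda d: (-wins[d], losses[d]))

rankings = [[["a"], ["b"], ["c"]],     # A: a > b > c
            [["a"], ["c"], ["b"]],     # B: a > c > b
            [["a"], ["b", "c"]],       # C: a > b = c
            [["b"], ["a"]],            # D: b > a
            [["c"], ["a"]]]            # E: c > a
print(condorcet_fusion(rankings, ["a", "b", "c"]))  # ['a', 'b', 'c'], with b and c tied
```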

Experiments
The Turkish Text Retrieval System will be used
– All Milliyet articles from 2001 to 2005
– Ranked results from 80 different systems (8 matching methods × 10 stemming functions)
– 72 queries for each system
Four approaches are taken in the experiments

Experiments
First approach
– Test whether the mean average precision of the merged system is significantly greater than that of all the individual systems
Second approach
– Find the data fusion method that gives the highest mean average precision

Experiments
Third approach
– Find the best stemming method in terms of mean average precision
Fourth approach
– See the effect of the system selection methods

Conclusion
Data fusion is an active research area.
We will apply several data fusion techniques to the now famous Milliyet database and compare their relative merits.
We will also use TREC data for testing, if possible.
We will hopefully find some novel approaches in addition to the existing methods.

References
Automatic Ranking of Retrieval Systems Using Data Fusion. Nuray, R. & Can, F. IPM, 2006.
Fusion of Effective Retrieval Strategies in the Same Information Retrieval System. Beitzel et al. JASIST, 2004.
Learning a Ranking from Pairwise Preferences. Carterette et al. SIGIR, 2006.

Thanks for your patience. Questions?