Manmatha MetaSearch R. Manmatha, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst.
Introduction
MetaSearch / Distributed Retrieval
– Well-defined problem: language models are a good way to solve these problems.
– Grand challenge: massively distributed multi-lingual retrieval.
MetaSearch
Combine results from different search engines.
– Single database.
– Or highly overlapped databases.
  » Example: the Web.
– Multiple databases or multi-lingual databases.
Challenges
– Incompatible scores, even if the same search engine is used for different databases.
  » Collection differences and engine differences.
– Document scores depend on the query. Combination on a per-query basis makes training difficult.
Current solutions involve learning how to map scores between different systems.
– An alternative approach involves aggregating ranks.
Current Solutions for MetaSearch – Single Database Case
Solutions
– Reasonable solutions involve mapping scores by simple normalization, by equalizing score distributions, or by training.
– Rank-based methods, e.g. Borda counts, Markov chains.
– Mapped scores are usually combined using linear weighting.
– Performance improvement of about 5 to 10%.
– Search engines need to be similar in performance.
  » This may explain why simple normalization schemes work.
Other Approaches
– A Markov chain approach has been tried. However, results on standard datasets are not available for comparison.
– It shouldn’t be difficult to try more standard LM approaches.
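The "simple normalization" plus "linear weighting" combination mentioned above can be sketched as follows. This is a minimal illustration, not the method of any particular system: min-max normalization is one assumed choice of score mapping, and the names `min_max_normalize`, `combine_linear`, the engine names, and the scores and weights are all hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Map one engine's raw scores to [0, 1]. Assumes a non-empty dict."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    # If every score is identical there is nothing to spread out; use 0.0.
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}


def combine_linear(runs, weights):
    """Weighted sum of normalized scores; unretrieved documents contribute 0."""
    fused = defaultdict(float)
    for engine, scores in runs.items():
        for doc, s in min_max_normalize(scores).items():
            fused[doc] += weights[engine] * s
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical raw scores on incompatible scales.
runs = {
    "engine_a": {"d1": 12.0, "d2": 9.0, "d3": 3.0},
    "engine_b": {"d2": 0.9, "d3": 0.5, "d4": 0.1},
}
weights = {"engine_a": 0.5, "engine_b": 0.5}
ranking = combine_linear(runs, weights)
print([doc for doc, _ in ranking])
```

Here `d2` ends up first because both engines retrieve it with a high normalized score, while `d1` is retrieved by only one engine. Equal weights are a reasonable default only when the engines perform similarly, which matches the observation on the slide.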
Challenges – MetaSearch for Single Databases
Can one effectively combine search engines that differ a lot in performance?
– Improve performance even when using poorly performing engines? How?
– Or use a resource-selection-like approach to eliminate poorly performing engines on a per-query basis.
Techniques from other fields.
– Techniques from economics and the social sciences for voter aggregation may be useful (Borda count, Condorcet, ...).
LM approaches
– Will possibly improve performance by characterizing the scores at a finer granularity than, say, score distributions.
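As an illustration of the voter-aggregation idea, a Borda count over ranked lists might look like the sketch below. It treats each search engine as a voter and each document as a candidate. The function name `borda_fuse` and the example rankings are hypothetical, and sharing the leftover points equally among documents an engine did not return is one assumed convention among several.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Fuse ranked lists (best first, no duplicates) with a Borda count."""
    candidates = {doc for ranking in rankings for doc in ranking}
    n = len(candidates)
    points = defaultdict(float)
    for ranking in rankings:
        # The top document gets n points, the next n - 1, and so on.
        for pos, doc in enumerate(ranking):
            points[doc] += n - pos
        # Documents this engine did not return split the leftover points.
        unranked = candidates - set(ranking)
        if unranked:
            leftover = sum(n - pos for pos in range(len(ranking), n))
            for doc in unranked:
                points[doc] += leftover / len(unranked)
    # Most points first; ties are broken by document id.
    return sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical ranked lists from three engines.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d4"],
    ["d2", "d1", "d4"],
]
fused = borda_fuse(rankings)
print(fused)
```

Because only ranks are used, no score mapping between engines is needed, which is the appeal of rank aggregation when scores are incompatible.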
Multiple Databases
Two main factors determine variation in document scores
– Search engine scoring functions.
– Collection variations, which essentially change the IDF.
Effective score normalization requires
– Disregarding databases which are unlikely to have the answer.
  » Resource selection.
– Normalizing out collection variations on a per-query basis.
– Existing normalizing functions are mostly ad hoc.
Language Models.
– Resource descriptions already provide language models for collections.
– Could use these to factor out collection variations.
– Tricky to do this for different search engines.
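The point that collection variations change the IDF can be made concrete with a toy example. The sketch below is illustrative only: the smoothed formula log((N + 1) / (df + 1)) is one common IDF variant chosen as an assumption, the two tiny collections are invented, and `merged_idf` shows just one possible way to factor out collection variation, namely pooling document-frequency statistics of the kind a resource description could supply.

```python
import math


def idf(term, collection):
    """Smoothed IDF of a term within a single collection of term sets."""
    n_docs = len(collection)
    df = sum(1 for doc in collection if term in doc)
    return math.log((n_docs + 1) / (df + 1))


def merged_idf(term, collections):
    """IDF from pooled statistics, as if all collections were one."""
    n_docs = sum(len(c) for c in collections)
    df = sum(1 for c in collections for doc in c if term in doc)
    return math.log((n_docs + 1) / (df + 1))


# Hypothetical collections: "heart" is common in one and rare in the other.
medical = [{"heart", "surgery"}, {"heart", "valve"},
           {"heart", "drug"}, {"lung", "drug"}]
news = [{"election", "vote"}, {"heart", "health"},
        {"market", "stocks"}, {"football", "score"}]

idf_medical = idf("heart", medical)
idf_news = idf("heart", news)
idf_pooled = merged_idf("heart", [medical, news])
print(round(idf_medical, 3), round(idf_news, 3), round(idf_pooled, 3))
```

The same query term receives a much lower weight in the medical collection than in the news collection, so scores from the two are not directly comparable even under an identical scoring function. Rescoring with pooled statistics removes that particular source of variation, though it does nothing about differences between scoring functions.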
Multi-lingual Databases
Normalizing scores across multiple databases.
– Difficult problem.
Possibility:
– Create language models for each database.
– Use simple translation models to map across databases.
– Use this to normalize scores.
– Difficult.
Distributed Web Search
Distribute web search over multiple sites/servers.
– Localized / regional.
– Domain dependent.
– Possibly no central coordination.
– Server selection / database selection with or without explicit queries.
Research Issues
– Partial representations of the world.
– Trust, reliability.
Peer to peer.
Challenges
Formal methods for resource descriptions, ranking, combination
– Example: language modeling.
– Beyond collections as big documents.
Multi-lingual retrieval
– Combining the outputs of systems searching databases in many languages.
Peer-to-peer systems
– Beyond broadcasting simple keyword searches.
– Non-centralized.
– Networking considerations, e.g. availability, latency, transfer time.
Distributed Web Search Data, Web Data.