Parallel and Distributed Searching
Lecture Objectives
  Review Boolean searching
  Indicate how searches may be carried out in parallel
  Give an overview of distributed searching
    –Collection partitioning
    –Query processing
    –Collection/results fusion
Boolean Queries
  Queries with terms connected by AND, OR and NOT
    –(Internet AND retrieval) AND (NOT english)
    –“world wide web” OR internet
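As an illustration of the two example queries, here is a minimal sketch that evaluates them with set operations over a toy inverted index. The index contents, the document ids and the `postings` helper are assumptions made up for this example, not part of the lecture.

```python
# Hypothetical toy inverted index: term -> set of ids of documents containing it.
index = {
    "internet": {1, 2, 3, 5},
    "retrieval": {2, 3, 4},
    "english": {3},
    "world wide web": {4, 6},
}
all_docs = {1, 2, 3, 4, 5, 6}  # assumed universe of document ids


def postings(term):
    """Return the documents containing the term (empty set if unknown)."""
    return index.get(term, set())


# (Internet AND retrieval) AND (NOT english)
# AND is set intersection; NOT is the complement against all documents.
q1 = (postings("internet") & postings("retrieval")) & (all_docs - postings("english"))

# "world wide web" OR internet
# OR is set union.
q2 = postings("world wide web") | postings("internet")
```

Each operator maps onto a single set operation, which is one reason Boolean retrieval is easy to implement.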
Advantages
  Easy to implement
  Allow very precise query specifications
  Facilitate parallel execution
Disadvantages
  People are bad at Boolean algebra
  Difficult to interpret in a way that gives an effective relevance ranking
  Difficult to include sensible query weighting
Parallel Searching
  Useful for improving performance in very large or heavily used search engines
  Break the query down into several subqueries
  Execute each at the same time
  Combine the results
  Share subqueries between different searches
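The steps above can be sketched as follows. This is only an illustration under stated assumptions: the toy `INDEX`, the `lookup` and `parallel_and` names, and the choice of one subquery per term are hypothetical, and a thread pool stands in for whatever parallel hardware a real engine would use. The cache on `lookup` is one simple way to share a subquery between different searches.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical toy inverted index: term -> ids of documents containing it.
INDEX = {
    "internet": frozenset({1, 2, 3, 5}),
    "retrieval": frozenset({2, 3, 4}),
    "english": frozenset({3}),
}


@lru_cache(maxsize=None)
def lookup(term):
    """One subquery: fetch the postings for a single term.

    Results are cached, so a later search using the same term reuses them.
    """
    return INDEX.get(term, frozenset())


def parallel_and(terms):
    """Run one lookup per term concurrently, then intersect the partial results."""
    if not terms:
        return set()
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lookup, terms))
    return set(frozenset.intersection(*partial))
```

For example, `parallel_and(["internet", "retrieval"])` intersects the two posting sets after both lookups have completed.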
Distributed Searching
  More about metasearching, and turning plain searching into metasearching
Distribution Methods
  Multiple copies of the collection: mirror sites
  Why not split the documents between servers according to their topics?
Collection Partitioning
  Manual/semi-automatic topic partitioning
    –medical vs engineering
    –books vs CDs
  One central index
  One index per server
Distributed Query Processing
  Select the collections to search
  Distribute the query to the selected collections
  Evaluate the query at the selected servers in parallel
  Combine the results into a final result
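A minimal sketch of these four steps, with in-memory dictionaries standing in for remote servers. Everything named here is an assumption for illustration: the `SERVERS` data, the selection rule (keep any server holding a document with at least one query term) and the score (number of distinct query terms matched) are hypothetical choices, not the lecture's method.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical servers: collection name -> {document id: text}.
SERVERS = {
    "medical": {"m1": "titanium alloys in artificial hips",
                "m2": "heart surgery outcomes"},
    "engineering": {"e1": "steel alloys wear testing",
                    "e2": "bridge design loads"},
    "music": {"c1": "jazz recordings on cd"},
}


def select_collections(query_terms):
    """Step 1: keep servers holding at least one document with a query term."""
    return [name for name, docs in SERVERS.items()
            if any(t in text.split() for text in docs.values() for t in query_terms)]


def evaluate(name, query_terms):
    """Step 3: score each document by how many distinct query terms it contains."""
    hits = []
    for doc_id, text in SERVERS[name].items():
        words = set(text.split())
        score = sum(1 for t in set(query_terms) if t in words)
        if score:
            hits.append((score, name, doc_id))
    return hits


def distributed_search(query_terms):
    selected = select_collections(query_terms)
    # Steps 2 and 3: send the query to every selected server and evaluate in parallel.
    with ThreadPoolExecutor() as pool:
        per_server = list(pool.map(lambda n: evaluate(n, query_terms), selected))
    # Step 4: combine into one list, best score first (ties broken by name and id).
    merged = [hit for hits in per_server for hit in hits]
    return sorted(merged, key=lambda h: (-h[0], h[1], h[2]))
```

With the query `["titanium", "alloys", "wear"]`, the music server is never contacted, while the medical and engineering servers are searched concurrently.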
Source Selection
  Obtain global term distribution data
    –on the web?
  Analyse a central index of collection relevance
  Missing gems
Missing Gems Example
  Query
    –wear characteristics of high titanium steel alloys
    –the relevant material actually occurs in the medical collection, describing use in artificial hips
Results Fusion
  Want to present a single result collected from several sources
  Also known as collection fusion, because it makes several collections appear as one
Results Fusion
  How do you put together the results from several web sites/search engines into a single combined result?
  Collection at a time
  Round robin
  Relevance ranked
Collection at a Time
  Use a measure such as tf * idf across each collection to rank the searched collections by relevance
  Display the results from the best collection first
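One possible reading of this strategy, as a sketch. The slide does not say how a collection's overall relevance is derived from its documents, so the mean document score used here is an assumption, as are the function name and the input layout.

```python
def collection_at_a_time(results):
    """Fuse results by showing whole collections in order of their relevance.

    results: collection name -> list of (doc_id, score), each list already
    ranked by that collection's own engine.
    Assumption: a collection's relevance is the mean score of its results.
    """
    def collection_score(name):
        scores = [score for _, score in results[name]]
        return sum(scores) / len(scores) if scores else 0.0

    ordered = sorted(results, key=collection_score, reverse=True)
    # All documents from the best collection come first, then the next, and so on.
    return [doc_id for name in ordered for doc_id, _ in results[name]]
```

Documents keep their within-collection order; only the order of the collections is decided by the fusion step.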
tf * idf
  tf: term frequency
    –terms that are frequently mentioned in an individual document improve recall
  idf: inverse document frequency
    –inversely proportional to the number of documents that mention a term
    –prefers discriminating terms
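The slide gives no exact formula, so the sketch below uses one common variant as an assumption: raw term count for tf and log(N / df) for idf, where N is the number of documents and df the number of documents containing the term. The function name and the token-list representation are also illustrative.

```python
import math


def tf_idf(term, doc, docs):
    """Weight of a term in one document under an assumed tf * log(N / df) variant.

    doc:  list of tokens for the document being scored.
    docs: list of all documents (each a list of tokens), including doc.
    """
    tf = doc.count(term)                       # occurrences within this document
    df = sum(1 for d in docs if term in d)     # documents that mention the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)
```

A term that appears in every document gets idf = log(1) = 0, which is how the measure favours discriminating terms.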
Round Robin
  Take the first document from collection 1
  Then the first document from collection 2, and so on for each collection
  Then the second document from collection 1, and so on
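A short sketch of this interleaving. The function name is hypothetical, and the handling of collections that run out of documents (they are simply skipped from then on) is an assumption the slide does not spell out.

```python
from itertools import zip_longest


def round_robin(*ranked_lists):
    """Interleave ranked result lists: first of each, then second of each, ..."""
    missing = object()  # sentinel marking an exhausted list
    merged = []
    for tier in zip_longest(*ranked_lists, fillvalue=missing):
        # tier holds the k-th document of every collection that still has one.
        merged.extend(doc for doc in tier if doc is not missing)
    return merged
```

Round robin ignores scores entirely, so it needs no statistics shared between collections.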
Relevance-Based Methods
  Calculate relevance for the documents returned by each selected source
  Try to calculate some global statistics
  Use some special measures
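The slide leaves the measures open. As one illustrative possibility, and not the lecture's prescribed method, the sketch below rescales each source's scores to the range 0 to 1 (min-max normalisation) so that scores from different engines become roughly comparable, then merges everything into one ranked list. All names are hypothetical.

```python
def normalise(hits):
    """Rescale one source's (doc_id, score) pairs to the range [0, 1]."""
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally relevant.
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in hits]


def relevance_merge(sources):
    """Merge per-source results into a single list ranked by normalised score.

    sources: source name -> list of (doc_id, raw_score); empty lists are skipped.
    """
    merged = [hit for hits in sources.values() if hits for hit in normalise(hits)]
    # Highest normalised score first; ties broken by document id.
    return sorted(merged, key=lambda h: (-h[1], h[0]))
```

Normalising per source avoids one engine dominating just because it reports larger raw numbers, but it still assumes each engine's best hit is equally good, which is a known weakness of this simple approach.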
Other Alternatives
  Random
  First come, first shown
  etc.
Conclusions
  Parallel searching is one way to speed up searching
  Distributing information can help ease/speed searching, but has some dangers
  There are some solutions to the results fusion problem