1 Metasearch
Thanks to Eric Glover, NEC Research Institute

2 Outline
The general search tool
Three C’s of Metasearch and other important issues
Metasearch engine architecture
Current metasearch projects
Advanced metasearch

3 A generic search tool
Query entered into an interface
Applied to a database of content
Results are scored/ordered (an ordering policy)
Displayed through the interface
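
To make the pipeline concrete, here is a minimal Python sketch of the generic search tool above: query in through an interface, applied to a database, results scored by an ordering policy, then displayed. The term-overlap scoring and the toy database are invented stand-ins, not any particular engine's policy.

```python
def search(query, database):
    """Score every document against the query and return them best-first."""
    terms = set(query.lower().split())

    def score(doc):
        # Ordering policy: simple term overlap (an illustrative assumption).
        return len(terms & set(doc.lower().split()))

    ranked = sorted(database, key=score, reverse=True)
    return [doc for doc in ranked if score(doc) > 0]

database = ["metasearch engine survey", "cooking with garlic",
            "search engine coverage study"]
for result in search("search engine", database):
    print(result)  # displayed through the interface
```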

4 Why do metasearch?
[Diagram: three separate search tools, each with its own interface, database, and ordering policy]

5 Definition
The word meta comes from the Greek meaning “situated beyond, transcending.” A metasearch tool is a tool which “transcends” the offerings of a single service by allowing you to access many different Web search tools from one site. More specifically, a metasearch tool permits you to launch queries to multiple search services via a common interface. (From Carleton College Library)
Note: Metasearch is not necessarily limited to the Web

6 The three C’s of metasearch
Coverage - How much of the total is accessible from a single interface
Consistency - One interface, resistant to single-search-service failures
Convenience - One-stop shop
[Diagram: one interface in front of Service1, Service2, and Service3]

7 Coverage
Coverage refers to the total accessible content -- specifically, what percentage of the total is reachable
According to the journal Nature in July 1999 (Lawrence and Giles 99):
  Web search engines collectively covered 42% of the web (estimated at 800 million total indexable pages)
  The most covered by any single engine was only 16%
Some search services have proprietary content, accessible only through their interface, e.g., an auction site
Search services have different rates of “refresh”
Special-purpose search engines are more “up to date” on a special topic

8 Consistency and Convenience
Consistency
  One interface: the user can access multiple DIFFERENT search services using the same interface
  Will work even if one search service goes down, or is inconsistent
Convenience
  One-stop shop: the user need not know about all possible sources
  Metasearch system owners will automatically add new sources as needed
  One interface improves convenience (as well as consistency): the user does not have to manually change their query for each source

9 Metasearch issues
What do you search
  Source selection
How do you search
  Query translation -- syntax/semantics
  Query submission
  Use of specialized knowledge or actions
How to process results
  Fusion of results
  Actions to improve quality
How to evaluate
  Did the individual query succeed
  Were the correct decisions made
  Does this architecture work
Decisions after search
  Search again, or do something differently

10 Metasearch issues
Performance of underlying services
  Time, reliability, consistency
Result quality of a single search service
  How “relevant” are results on average
  Duplicate detection
Freshness -- how often are results updated
  Update rate, dead links, changed content
Probability of finding relevant results
  For the given query, for each potential source
  For the given need category (GLOSS, and similar models)
How to evaluate
  Is it better than some other service (especially important with source selection)
  Feedback for improving the fusion policy
  Learning for personalization

12 Some metasearch engines
Ask Jeeves -- Natural language; combines local and outside content with a very simple interface
Sherlock -- Integrates web and local searching
MetaCrawler -- Early web metasearch engine, some formalizations
SavvySearch -- Research on various methods of source selection
ProFusion -- Attempted to predict the subject of a query, and considered the expected performance of search engines for both relevance and speed
Inquirus -- Content-based metasearch engine
Inquirus 2 -- Preference-based metasearch engine

14 Architecture
[Diagram: the User Interface passes the Query to a Dispatcher, which sends it to Service1, Service2, and Service3; a Result Retriever collects the Results and a Fusion Policy merges them for display]
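
A minimal sketch of how the components in this diagram could be wired together; the service stubs and the round-robin fusion placeholder are assumptions for illustration, not any real system's behavior.

```python
class Dispatcher:
    def __init__(self, services):
        self.services = services

    def dispatch(self, query):
        # A real dispatcher would translate the query per service first.
        return {name: service(query) for name, service in self.services.items()}

def fusion_policy(results_by_service):
    """Placeholder fusion: interleave services round-robin."""
    merged, queues = [], [list(r) for r in results_by_service.values()]
    while any(queues):
        for queue in queues:
            if queue:
                merged.append(queue.pop(0))
    return merged

# Fake services standing in for Service1/Service2.
services = {
    "Service1": lambda q: [f"S1 hit for '{q}'"],
    "Service2": lambda q: [f"S2 hit for '{q}'", "S2 second hit"],
}
results = Dispatcher(services).dispatch("metasearch")
print(fusion_policy(results))
```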

15 Architecture -- Dispatcher
Query translation
  Each search service has a different interface (for queries)
  Preserve the semantics, while converting the syntax
  Could result in loss of expressiveness
Query submission
  Sends the query to the service
Source selection
  Choose the sources to be queried
Some systems use wrapper code, or agents, as their dispatch mechanism
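
A sketch of query translation for a simple AND-of-terms query: the semantics are preserved while the syntax changes per service. The three syntax rules are illustrative, not the actual grammars of any named engine.

```python
def translate(terms, syntax):
    if syntax == "plus_prefix":   # e.g., +apple +pie (required terms)
        return " ".join("+" + t for t in terms)
    if syntax == "boolean":       # e.g., apple AND pie
        return " AND ".join(terms)
    if syntax == "plain":         # engine that only loosely ANDs terms;
        return " ".join(terms)    # semantics may weaken (expressiveness loss)
    raise ValueError(f"unknown syntax: {syntax}")

print(translate(["apple", "pie"], "plus_prefix"))  # +apple +pie
print(translate(["apple", "pie"], "boolean"))      # apple AND pie
```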

16 Result processor
Accept results from the search service
Parse the results and extract the relevant information -- e.g., title, URL, etc.
Can request more results (feedback to the dispatcher)
Advanced processors could get more information about the document, e.g., by using special-purpose tools
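
A sketch of a result processor that parses one engine's result page into title/URL records. The HTML pattern is a made-up example; in practice each engine needed its own wrapper.

```python
import re

RESULT_PATTERN = re.compile(r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')

def parse_results(html):
    """Extract the relevant fields -- here just title and URL."""
    return [{"title": m.group("title"), "url": m.group("url")}
            for m in RESULT_PATTERN.finditer(html)]

page = ('<a href="http://example.com/a">Result A</a>'
        '<a href="http://example.com/b">Result B</a>')
print(parse_results(page))
```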

17 Result Fusion
How to merge results in a meaningful manner?
A = [a1, a2, a3], B = [b1, b2, b3], C = [c1, c2, c3] -> ?

18 Result Fusion
Not comparing apples to apples
  Incomplete information: we only have a summary, not the full document
  Each search service has its own ranking policy -- which is better, AltaVista #2 or Google #3?
  Summaries and titles are not consistent between search services
Questions/issues
  How much do you trust regular search engine ranks? Could AltaVista #3 be better than AltaVista #2?
  Is one search engine always better than another?
  How do you integrate results between search engines?
  What about results returned by more than one search service?

19 Fusion policy - a typical approach
First, determine the “score” on a fixed range for each result from each search engine
  In the early days, most search engines returned their scores
  The score could be a function of the rank, or occasionally based on the keywords in the title/summary/URL
Second, assign a weight to each search engine
  Could be based on the predicted subject, stated preferences, or special knowledge about a particular search engine
Example (sketched in code below), with W1 = 0.9 and W2 = 1.0:
  Service1: A1=1.000, A2=1.000, A3=0.95, A4=0.5
  Service2: B1=0.95, B2=0.95, B3=0.89, B4=0.8
  The final ordering would be: [B1, B2, A1, A2, B3, A3, B4, A4]
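
A minimal sketch of this weighted fusion policy using the slide's example values, assuming W2 = 1.0 (the value that reproduces the final ordering shown): multiply each result's score by its engine's weight, then sort.

```python
def fuse(result_lists, weights):
    """result_lists: {engine: [(result_id, score), ...]},
    weights: {engine: weight}. Returns ids sorted by weighted score."""
    merged = []
    for engine, results in result_lists.items():
        for result_id, score in results:
            merged.append((weights[engine] * score, result_id))
    # Sort by weighted score, highest first.
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return [result_id for _, result_id in merged]

results = {
    "Service1": [("A1", 1.000), ("A2", 1.000), ("A3", 0.95), ("A4", 0.5)],
    "Service2": [("B1", 0.95), ("B2", 0.95), ("B3", 0.89), ("B4", 0.8)],
}
weights = {"Service1": 0.9, "Service2": 1.0}
print(fuse(results, weights))
# ['B1', 'B2', 'A1', 'A2', 'B3', 'A3', 'B4', 'A4']
```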

20 Source Selection
Choosing which services to query
GLOSS -- Choose the databases most likely to contain “relevant” materials
SavvySearch (old) -- Choose the sources most likely to have the most valuable results, based on past responses
SavvySearch (new) -- Choose the sources most appropriate for the user’s category
ProFusion -- The user chooses: (1) the fastest sources, (2) the sources most likely to have “relevant” results based on the predicted subject, or (3) explicit user selection
Watson -- Choose both general-purpose sources and the most “appropriate” special-purpose sources

21 MetaCrawler
Used user result clicks -- implicit vs. explicit feedback
  Not all pages clicked are relevant
  Assumed pages not clicked were not relevant
Parameters examined
  Comprehensiveness of individual search engines -- considered the Unique Document Percentage (UDP)
    UDP is related to coverage
    As expected, low overlap (considering the first ten documents only)
  Relative contribution of each search engine -- the Viewed Document Share (VDS)
    As expected, all services used contributed to the VDS: the maximum of the eight search services was 23%, the minimum 4%, with four of them contributing 15% or more
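
A sketch of how the two MetaCrawler metrics could be computed; the input formats are assumptions for illustration, not MetaCrawler's actual data structures.

```python
def unique_document_percentage(results_by_engine):
    """UDP: fraction of an engine's results returned by no other engine."""
    udp = {}
    for engine, urls in results_by_engine.items():
        others = set()
        for other, other_urls in results_by_engine.items():
            if other != engine:
                others.update(other_urls)
        unique = [u for u in urls if u not in others]
        udp[engine] = len(unique) / len(urls) if urls else 0.0
    return udp

def viewed_document_share(clicks):
    """VDS: each engine's share of all clicked (viewed) documents.
    clicks: list of (engine, url) pairs from the click log."""
    counts = {}
    for engine, _url in clicks:
        counts[engine] = counts.get(engine, 0) + 1
    total = sum(counts.values())
    return {engine: n / total for engine, n in counts.items()}
```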

22 ProFusion
Focus was primarily on source selection
ProFusion considered:
  Performance (time)
  The ability of sources to locate relevant results, via query subject prediction
Design
  A set of 13 categories and a dictionary of 4000 terms used to predict the subject
  For each category, each search engine (of the six) is “tested” and scored based on the number of relevant results retrieved
  The individual search engine “scores” are used to produce a rank ordering by subject, and to fuse results
Parameters examined
  Human judgements of some queries
  Every search engine was different
  ProFusion demonstrated improvements when “auto-pick” was used
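
A minimal sketch of the subject-prediction idea, assuming ProFusion's general approach of mapping query terms to categories via a term dictionary. The categories, dictionary entries, and per-engine scores below are invented illustrations, not ProFusion's real data.

```python
TERM_TO_CATEGORY = {"recipe": "Food", "symptom": "Health", "java": "Computing"}
ENGINE_SCORE_BY_CATEGORY = {
    "Food":      {"EngineA": 0.8, "EngineB": 0.3},
    "Health":    {"EngineA": 0.4, "EngineB": 0.9},
    "Computing": {"EngineA": 0.6, "EngineB": 0.7},
}

def predict_category(query):
    """Vote for the category with the most matching dictionary terms."""
    votes = {}
    for term in query.lower().split():
        category = TERM_TO_CATEGORY.get(term)
        if category:
            votes[category] = votes.get(category, 0) + 1
    return max(votes, key=votes.get) if votes else None

def pick_engines(query, top_n=1):
    """Rank engines by their score for the predicted category ("auto-pick")."""
    category = predict_category(query)
    if category is None:
        return []
    scores = ENGINE_SCORE_BY_CATEGORY[category]
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(pick_engines("java applet tutorial"))  # ['EngineB']
```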

23 SavvySearch (early work)
Similar to ProFusion: choose sources based on the query
Assign “scores” to each query term based on previous query results
  Formed a t x n matrix (terms by search engines) called a meta-index
  The score is based on performance for each term: two “events” - No Results and Visits
  Scores are adjusted for the number of query terms
Response time is also stored for each search engine
Search engines are chosen based on the query terms and past performance
  System load determines the total number queried
Evaluated via a pilot study
  Compared various variations of the sources chosen and their rank
  As predicted, using the meta-index method was better than random
  Also examined improvements over time in the no-result count
In the new version, the user chooses a “category” and the appropriate services are used
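
A sketch of a SavvySearch-style meta-index, under the assumptions that a Visit raises a term/engine cell, a No Results event lowers it, and a query's engine score averages the term scores over the query length. The update magnitudes and engine selection are illustrative, not SavvySearch's actual formulas.

```python
from collections import defaultdict

meta_index = defaultdict(float)  # (term, engine) -> score

def record_event(term, engine, event):
    """Update the meta-index on a Visit or No Results event."""
    if event == "visit":
        meta_index[(term, engine)] += 1.0
    elif event == "no_results":
        meta_index[(term, engine)] -= 1.0

def engine_score(query_terms, engine):
    """Average per-term scores, adjusting for the number of query terms."""
    return sum(meta_index[(t, engine)] for t in query_terms) / len(query_terms)

def choose_engines(query, engines, how_many):
    """Pick the engines with the best past performance for these terms.
    In SavvySearch, how many to query depended on system load."""
    terms = query.lower().split()
    ranked = sorted(engines, key=lambda e: engine_score(terms, e), reverse=True)
    return ranked[:how_many]
```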

24 Advanced metasearch
Ordering policy versus a fusion policy -- Inquirus, and some personal search agents
Using complete information -- download the document before scoring
  All documents are scored consistently, regardless of source
  Allows for improved relevance judgements
  Can hurt performance
Query modification -- Inquirus 2, Watson, others (see the sketch after this list)
  User queries are modified to improve the ability to find “category-specific” results
  Query modification lets general-purpose search engines serve special needs
Need-based scoring -- Inquirus 2, learning systems
  Document scores are based on more than the query
  Can use past user history, or other factors such as the subject
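
A minimal sketch of query modification for category-specific needs. The category names and the terms appended to the query are invented examples, not the actual modifications used by Inquirus 2 or Watson.

```python
CATEGORY_TERMS = {
    "research papers": ["abstract", "references"],
    "product reviews": ["review", "rating"],
    "homepages":       ["homepage", "welcome"],
}

def modify_query(query, category):
    """Append category-specific terms so a general-purpose engine is more
    likely to return results of the desired kind."""
    extra = CATEGORY_TERMS.get(category, [])
    return query + " " + " ".join(extra) if extra else query

print(modify_query("metasearch engines", "research papers"))
# metasearch engines abstract references
```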

25 Design/Research areas
Source selection
  With and without complete information
  Learning interfaces and predicting the contents of sources
Intelligent fusion
  Without complete information
  Considering the user’s preferences
Resource efficiency
  Knowing how to “plan” a search
User interfaces
  How to minimize the loss of expressiveness
  How to preserve the capabilities of select services
  How to hide slow performance

26 Business issues for metasearch
How does one use others’ resources and not get blocked? Skimming!

