Agreeing to Disagree: Search Engines and Their Public Interfaces
Frank McCown and Michael L. Nelson
Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
JCDL 2007, Vancouver, BC, June 22, 2007
Agenda
- Screen-scraping the web user interface (WUI)
- Search engine APIs
- Comparing search results
- Five-month experiment
- Significant findings and conclusions
Google’s Terms of Service No screen-scraping!
The Ideal: The Black Box. Web Search Engine; the WUI and the API return the same result.
Why are my results different?
The Reality: The Black Box. Web Search Engine; the WUI returns result_W and the API returns result_A.
What Do Researchers Think? “…due to legal limitations on automatic queries, we used the Google, MSN, and Yahoo! web search APIs, which are, reportedly, served from older and smaller indices than the indices used to serve human users.” - Bar-Yossef and Gurevich (2006), Random sampling from a search engine’s index
What Do Researchers Think? “…the Google API... could have been used to automate the initial collection of the results pages, which should have given the same outcome [as using the WUI].” - Thelwall (2004), Can the Web give useful information about commercial uses of scientific research?
Research Questions
- Are the APIs pulling from an older and smaller index?
- How different are the results between the interfaces?
- Which APIs are most synchronized with their WUIs?
- Are there synchronization differences based on query term types?
- Can we model the decay of search results over time?
Comparing Search Results
WUI: A B C D
API: E D A F
Comparing Search Results
Two top k lists: (WUI results) and (API results)
- Overlap (P): compare set membership
- Kendall tau (K)¹: penalize changes in position
- Measure (M)²: penalize position changes at the top more heavily
Scores closer to 1 mean more similar.
¹ Fagin, et al. (2003), Comparing top k lists
² Bar-Ilan, et al. (2006), Methods for comparing rankings of search engine results
Comparing Search Results
First pair: WUI = A B C D, API = E D A F: P = 0.50, K = 0.44, M = 0.14
Second pair (items A B C D E F): P = 0.50, K = 0.56, M = 0.66
From the first pair to the second: P shows no improvement, K a mild improvement, M a significant improvement
Experiment Design
Daily queries to WUIs and APIs (5 months in 2006):
- General search terms (50 popular terms and 50 computer science terms)
- URL backlinks
- Total URLs indexed for a website
- URL indexing and caching
Comparing WUI to WUI Day n vs. n-1 (K distance)
Comparing WUI to API
Identical WUI and API Results
Decay of Results Over Time X = half life
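The slide does not state which decay model was fitted. As a minimal sketch, assume exponential decay, S(t) = exp(-λt), where S(t) is the fraction of the day-0 results still present on day t; the half-life is then ln 2 / λ. The function name and the data below are hypothetical and serve only to show the calculation.

```python
import math

def half_life(days, fraction_remaining):
    """Estimate the half-life under an assumed exponential decay model.

    Fits ln S(t) = -lam * t by least squares (line through the origin)
    and returns ln(2) / lam. Every fraction must be greater than 0.
    """
    num = sum(t * math.log(s) for t, s in zip(days, fraction_remaining))
    den = sum(t * t for t in days)
    lam = -num / den
    return math.log(2) / lam

# Synthetic data generated with a 60-day half-life (not from the study).
days = [10, 20, 40, 80]
remaining = [0.5 ** (t / 60) for t in days]
print(half_life(days, remaining))
```

With real observations the points will not lie exactly on the curve, and the fit then gives the half-life of the closest exponential in the least-squares sense on the log scale.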
Total Results Loose Disagreements
Total Backlinks Loose Disagreements
Total Indexed Pages Loose Disagreements
Indexed and Cached Status ■ = disagreement
Synchronized Interfaces
Conclusions
- How different are the top 10 results? Google = 20%, Yahoo = 14%, MSN = 8%
- Google’s and Yahoo’s interface synchronization is affected by the query type (popular vs. CS)
- MSN has the overall most synchronized interfaces
- For popular terms, how long will it take for half of the results to change? Google and Yahoo = over a year; MSN = 2-3 months
- Our study has confirmed the intuitive notion that websites that are crawler-friendly are more likely to be better preserved by the WI.
Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/ Thank You Is Dad finished yet?