1 Agreeing to Disagree: Search Engines and Their Public Interfaces
Frank McCown and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 22, 2007

2 Agenda Screen-scraping the web user interface (WUI) Search engine APIs
Comparing search results Five month experiment Significant findings and conclusions

3 [image slide]

4 [image slide]

5 [image slide]
6 Google’s Terms of Service
No screen-scraping!

7 [image slide]
8 The Black Box (The Ideal)
Web Search Engine → WUI result = API result

9 Why are my results different?

10 The Black Box (The Reality)
Web Search Engine → WUI returns resultW, API returns resultA

11 What Do Researchers Think?
“…due to legal limitations on automatic queries, we used the Google, MSN, and Yahoo! web search APIs, which are, reportedly, served from older and smaller indices than the indices used to serve human users.” - Bar-Yossef and Gurevich (2006), Random sampling from a search engine’s index

12 What Do Researchers Think?
“…the Google API... could have been used to automate the initial collection of the results pages, which should have given the same outcome [as using the WUI].” - Thelwall (2004), Can the Web give useful information about commercial uses of scientific research?

13 Research Questions
Are the APIs pulling from an older and smaller index?
How different are the results between the interfaces?
Which APIs are most synchronized with their WUIs?
Are there synchronization differences based on query term types?
Can we model the decay of search results over time?

14 Comparing Search Results
WUI: A B C D E
API: D A F

15 Comparing Search Results
Two top-k lists: (WUI results) vs. (API results)
Overlap (P) – compare set membership
Kendall tau (K)¹ – penalize changes in position
Measure (M)² – penalize position changes at the top more heavily
(more similar → 1)
¹ Fagin et al. (2003), Comparing top k lists
² Bar-Ilan et al. (2006), Methods for comparing rankings of search engine results

16 Comparing Search Results
WUI: A B C D E  API: A B C D E F — P = K = M = 0.66
WUI: A B C D E  API: D A F — P = K = M = 0.14
(No improvement → Mild improvement → Significant improvement)
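The Overlap and Kendall tau comparisons above can be sketched in Python. This is a minimal reading of Fagin et al.'s K^(0) top-k distance, not the authors' actual analysis code; the function names and the choice to normalize by k² (so disjoint lists score 1) are my assumptions:

```python
from itertools import combinations

def overlap(l1, l2):
    """Overlap (P): fraction of results the two top-k lists share (set membership only)."""
    k = max(len(l1), len(l2))
    return len(set(l1) & set(l2)) / k

def kendall_topk(l1, l2):
    """A sketch of Fagin et al.'s K^(0) distance for top-k lists, normalized so
    0 = identical rankings and 1 = completely disjoint lists."""
    pos1 = {doc: i for i, doc in enumerate(l1)}
    pos2 = {doc: i for i, doc in enumerate(l2)}
    penalty = 0
    for i, j in combinations(set(l1) | set(l2), 2):
        a, b = i in pos1, j in pos1
        c, d = i in pos2, j in pos2
        if a and b and c and d:
            # both lists rank both documents: penalize an order disagreement
            if (pos1[i] - pos1[j]) * (pos2[i] - pos2[j]) < 0:
                penalty += 1
        elif a and b:
            # both in l1, at most one in l2; the one l2 kept is implicitly ahead
            if c or d:
                present, absent = (i, j) if c else (j, i)
                if pos1[present] > pos1[absent]:
                    penalty += 1
            # if l2 dropped both, K^(0) optimistically assigns no penalty
        elif c and d:
            # mirror case: both in l2, at most one in l1
            if a or b:
                present, absent = (i, j) if a else (j, i)
                if pos2[present] > pos2[absent]:
                    penalty += 1
        else:
            # each document appears in only one list: definite disagreement
            penalty += 1
    k = max(len(l1), len(l2))
    return penalty / (k * k)
```

For example, two identical top-k lists give P = 1 and K = 0, while two disjoint lists give P = 0 and K = 1; swapping two adjacent results produces a small K but leaves P unchanged, which is why the talk reports both.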

17 Experiment Design
Daily queries to WUIs and APIs (5 months in 2006):
General search terms (50 popular terms and 50 computer science terms)
URL backlinks
Total URLs indexed for a website
URL indexing and caching
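A daily query battery like the one described could be assembled as below. The link: and site: operators were the standard search-engine syntax of that era; the info: operator for index/cache status and all function names are illustrative assumptions, not the authors' code:

```python
def build_daily_queries(terms, urls, sites):
    """Assemble one day's query set to send to each WUI and API (illustrative sketch)."""
    queries = list(terms)                        # general search terms
    queries += [f"link:{u}" for u in urls]       # URL backlinks
    queries += [f"site:{s}" for s in sites]      # total URLs indexed for a website
    queries += [f"info:{u}" for u in urls]       # URL indexing/caching status
    return queries

daily = build_daily_queries(["pagerank"], ["http://example.com/"], ["example.com"])
```

Running the same battery against both the WUI and the API of each engine, every day, yields the paired result lists the P/K/M measures compare.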

18 Comparing WUI to WUI: day n vs. day n-1 (K distance)

19 Comparing WUI to API

20 Identical WUI and API Results

21 Decay of Results Over Time
X = half life
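A half-life of search results implies a simple exponential-decay model; this sketch assumes that model form (my assumption from the "half life" label, not the paper's fitted equation):

```python
import math

def remaining_fraction(t_days, half_life_days):
    """Fraction of the original top-k results still present after t days,
    assuming exponential decay with the given half-life."""
    return 2.0 ** (-t_days / half_life_days)

def estimate_half_life(t_days, observed_fraction):
    """Invert the model: estimate the half-life from one observed measurement."""
    return -t_days * math.log(2) / math.log(observed_fraction)
```

Under this model, an engine whose results have a 75-day half-life (roughly the 2-3 months later reported for MSN) would retain only about a quarter of its original top results after 150 days.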

22 Total Results Loose Disagreements

23 Total Backlinks Loose Disagreements

24 Total Indexed Pages Loose Disagreements

25 Indexed and Cached Status ■ = disagreement

26 Synchronized Interfaces

27 Conclusions
How different are the top 10 results? Google = 20%, Yahoo = 14%, MSN = 8%
Google's and Yahoo's interface synchronization is affected by the query type (popular vs. CS)
MSN has the most synchronized interfaces overall
For popular terms, how long will it take for half of the results to change? Google and Yahoo = over a year; MSN = 2-3 months
Our study has confirmed the intuitive notion that websites that are crawler-friendly are more likely to be better preserved by the Web Infrastructure (WI).

28 Thank You
Frank McCown
fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/
("Is Dad finished yet?")

