Agreeing to Disagree: Search Engines and Their Public Interfaces


Agreeing to Disagree: Search Engines and Their Public Interfaces Frank McCown and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 22, 2007

Agenda
Screen-scraping the web user interface (WUI)
Search engine APIs
Comparing search results
Five-month experiment
Significant findings and conclusions

Google’s Terms of Service: No screen-scraping!

The Black Box: The Ideal
[Diagram: a web search engine queried through both the WUI and the API, each returning the same result]

Why are my results different?

The Black Box: The Reality
[Diagram: a web search engine queried through both the WUI and the API; the WUI returns result_W and the API returns result_A]

What Do Researchers Think? “…due to legal limitations on automatic queries, we used the Google, MSN, and Yahoo! web search APIs, which are, reportedly, served from older and smaller indices than the indices used to serve human users.” - Bar-Yossef and Gurevich (2006), Random sampling from a search engine’s index

What Do Researchers Think? “…the Google API... could have been used to automate the initial collection of the results pages, which should have given the same outcome [as using the WUI].” - Thelwall (2004), Can the Web give useful information about commercial uses of scientific research?

Research Questions
Are the APIs pulling from an older and smaller index?
How different are the results between the interfaces?
Which APIs are most synchronized with their WUIs?
Are there synchronization differences based on query term types?
Can we model the decay of search results over time?

Comparing Search Results
WUI: A B C D
API: E D A F

Comparing Search Results
Two top k lists: the WUI results and the API results
Overlap (P): compare set membership
Kendall tau (K) [1]: penalize changes in position
Measure (M) [2]: penalize position changes at the top more heavily
For all three, values closer to 1 mean more similar lists
[1] Fagin, et al. (2003), Comparing top k lists
[2] Bar-Ilan, et al. (2006), Methods for comparing rankings of search engine results

Comparing Search Results
WUI: A B C D, API: E D A F gives P = 0.50, K = 0.44, M = 0.14
A second WUI/API pair (drawn from A B C D E F) gives P = 0.50, K = 0.56, M = 0.66
Change between the two pairs: P shows no improvement, K a mild improvement, M a significant improvement
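The three measures can be sketched in Python as below. This is a minimal illustration, not the authors' code. The function names are hypothetical, and the normalizations are assumptions the slides do not state: K is taken as Fagin's distance with penalty parameter p = 0 divided by k squared, and M is divided by its value for two completely disjoint lists. Both are then subtracted from 1 so that 1 means identical lists. Under these assumptions, the lists A B C D and E D A F round to the reported P = 0.50, K = 0.44, M = 0.14.

```python
from itertools import combinations


def overlap(wui, api):
    """P: fraction of the top-k results shared by both lists."""
    return len(set(wui) & set(api)) / len(wui)


def kendall_tau(wui, api, p=0.0):
    """K: 1 minus the normalized Kendall tau distance for top-k lists.

    Assumes both lists have length k and contain no duplicates.
    """
    k = len(wui)
    r1 = {u: i for i, u in enumerate(wui)}
    r2 = {u: i for i, u in enumerate(api)}
    penalty = 0.0
    for a, b in combinations(set(wui) | set(api), 2):
        in1 = (a in r1, b in r1)
        in2 = (a in r2, b in r2)
        if all(in1) and all(in2):
            # Both items ranked in both lists: penalize opposite order.
            if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0:
                penalty += 1
        elif all(in1) and any(in2):
            # Both in the WUI list, one in the API list: the item present
            # in the API list is implicitly ranked ahead there.
            present, absent = (a, b) if in2[0] else (b, a)
            if r1[present] > r1[absent]:
                penalty += 1
        elif all(in2) and any(in1):
            present, absent = (a, b) if in1[0] else (b, a)
            if r2[present] > r2[absent]:
                penalty += 1
        elif all(in1) or all(in2):
            # Both in one list, neither in the other: order unknown.
            penalty += p
        else:
            # Each item appears in a different list only.
            penalty += 1
    return 1 - penalty / k ** 2


def m_measure(wui, api):
    """M: 1 minus the normalized sum of reciprocal-rank differences."""
    k = len(wui)
    r1 = {u: i + 1 for i, u in enumerate(wui)}
    r2 = {u: i + 1 for i, u in enumerate(api)}

    def weight(ranks, u):
        # Items missing from a list are treated as if ranked k + 1.
        return 1 / ranks.get(u, k + 1)

    total = sum(abs(weight(r1, u) - weight(r2, u))
                for u in set(wui) | set(api))
    worst = 2 * sum(1 / i - 1 / (k + 1) for i in range(1, k + 1))
    return 1 - total / worst


wui = ["A", "B", "C", "D"]
api = ["E", "D", "A", "F"]
scores = (overlap(wui, api), kendall_tau(wui, api), m_measure(wui, api))
print("P = %.2f, K = %.2f, M = %.2f" % scores)
```

Because M weights each result by the reciprocal of its rank, a mismatch at position 1 costs far more than one at position 4, which is why M moves more than K when the top results change.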

Experiment Design
Daily queries to WUIs and APIs (5 months in 2006):
  General search terms (50 popular terms and 50 computer science terms)
  URL backlinks
  Total URLs indexed for a website
  URL indexing and caching

Comparing WUI to WUI Day n vs. n-1 (K distance)

Comparing WUI to API

Identical WUI and API Results

Decay of Results Over Time X = half life
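The half life on this slide can be related to an observed rate of change with a short sketch. It assumes exponential decay, which the slides do not state, so the model and the function names below are illustrative assumptions only: if a fraction f of the original results is still present after t days, the implied half life is t · ln 2 / ln(1/f).

```python
import math


def half_life(t_days, fraction_remaining):
    """Half life implied by an ASSUMED exponential decay of results.

    fraction_remaining is the share of the original results still
    present after t_days; it must lie strictly between 0 and 1.
    """
    if not 0 < fraction_remaining < 1:
        raise ValueError("fraction_remaining must be between 0 and 1")
    return t_days * math.log(2) / -math.log(fraction_remaining)


def fraction_after(t_days, half_life_days):
    """Share of the original results expected to remain after t_days."""
    return 2 ** (-t_days / half_life_days)


# Hypothetical numbers, not taken from the study.
print(half_life(60, 0.25))
print(fraction_after(75, 75))
```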

Total Results Loose Disagreements

Total Backlinks Loose Disagreements

Total Indexed Pages Loose Disagreements

Indexed and Cached Status ■ = disagreement

Synchronized Interfaces

Conclusions
How different are the top 10 results?
  Google = 20%
  Yahoo = 14%
  MSN = 8%
Google’s and Yahoo’s interface synchronization is affected by the query type (popular vs. CS)
MSN has the overall most synchronized interfaces
For popular terms, how long will it take for half of the results to change?
  Google and Yahoo = over a year
  MSN = 2-3 months
Our study has confirmed the intuitive notion that websites that are crawler-friendly are more likely to be better preserved by the WI.

Thank You
Is Dad finished yet?
Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/