Strategies for collecting prices on the Internet. Olav ten Bosch, June 20th 2013
Content: Why internet as a data source (IAD)? Internet robots, how do they work? Examples. Conclusion.
Why IAD? Administrative sources: tax, social security services, municipalities/provinces, supermarkets; and surveys.
Why IAD? Internet sources. Faster, better, more efficient. Administrative sources: tax, social security services, municipalities/provinces, supermarkets … Surveys. New indicators. Less!!!
Google Trends (1): search on “fever” from the Netherlands, 2004 - today (31 May 2013)
Google Trends (2): search on “fever” from the Netherlands, last 90 days (31 May 2013)
Original content. No added value? Content enrichment.
Robots / crawlers / bots / spiders / scrapers: how do they work? (1) [Diagram: You, Browser, Internet, Website. Labels: requests, graphical markup, commands, code, figures, style, data, etc.]
5 March 2013 - Internet Robots at CBS
Robots / crawlers / bots / spiders / scrapers: how do they work? (3) [Diagram: Robot / spider / crawler (not you), Internet, Website. Labels: navigation, requests, graphical markup, commands, code, figures, style, data, etc. Output: data]
Robots / crawlers / bots / spiders / scrapers: how do they work? (4) [Diagram: Robot / spider / crawler (not you), Internet, Website. Labels: navigation, requests, graphical markup, commands, code, figures, style, data, etc. Output: data. Monitor actively]
Robots / crawlers / bots / spiders / scrapers: how do they work? (5)
Many sites have the same structure / pattern:
- Search (e.g. region / category / price)
- List of results, 1 or more pages (previous / next)
- Short description for each item
- Click to go to detail view of item
Sites do have differences:
- Dynamics: “births” and “deaths” of items
- Comparability of items / articles / objects, categories (brands, colors, sizes)
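The common pattern (result list, previous / next pages, short description per item) and the “births” and “deaths” of items can be sketched in a few lines of Python. This is a minimal illustration, not the robot software used at CBS. The page contents, the URLs, the `class="item"` and `class="next"` markers and the names `ResultPageParser`, `crawl` and `births_and_deaths` are all assumptions made for the example. The pages are canned strings so the sketch runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": URL -> HTML. A real robot would fetch these
# pages over HTTP; canned pages keep the sketch runnable offline.
PAGES = {
    "/search?page=1": (
        '<ul>'
        '<li><a class="item" href="/item/a1">Flat, 2 rooms</a></li>'
        '<li><a class="item" href="/item/b2">House, 5 rooms</a></li>'
        '</ul>'
        '<a class="next" href="/search?page=2">next</a>'
    ),
    "/search?page=2": (
        '<ul><li><a class="item" href="/item/c3">Studio</a></li></ul>'
        '<a class="prev" href="/search?page=1">previous</a>'
    ),
}


class ResultPageParser(HTMLParser):
    """Collects item links with their short description, plus the 'next' link."""

    def __init__(self):
        super().__init__()
        self.items = {}        # detail URL -> short description
        self.next_url = None   # URL of the next result page, if any
        self._current = None   # detail URL of the item link being read

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("class") == "item":
            self._current = attrs.get("href")
            self.items[self._current] = ""
        elif attrs.get("class") == "next":
            self.next_url = attrs.get("href")

    def handle_data(self, data):
        # Text inside an item link is its short description.
        if self._current is not None:
            self.items[self._current] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None


def crawl(start_url, fetch):
    """Follow 'next' links from start_url and return all items found."""
    found, url, seen = {}, start_url, set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against pagination loops
        parser = ResultPageParser()
        parser.feed(fetch(url))
        parser.close()
        found.update(parser.items)
        url = parser.next_url
    return found


def births_and_deaths(previous, current):
    """Items that appeared ('births') and disappeared ('deaths') between two crawls."""
    births = sorted(set(current) - set(previous))
    deaths = sorted(set(previous) - set(current))
    return births, deaths


today = crawl("/search?page=1", PAGES.__getitem__)
yesterday = {"/item/a1": "Flat, 2 rooms", "/item/z9": "Bungalow"}  # made-up earlier crawl
births, deaths = births_and_deaths(yesterday, today)
print(today)
print("births:", births, "deaths:", deaths)
```

Passing `fetch` as a parameter keeps the navigation logic separate from how pages are retrieved, so the same loop could be pointed at a real HTTP client. Site-specific details (how items and the next link are marked up) stay inside the parser.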
Housing market (1)
Housing market (2) Difference in update speed between 2 housing sites, calculated from robot data. [Chart axis: difference in days between objects appearing on site 1 versus site 2]
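A difference in update speed of this kind could be computed along the following lines. This is a hedged sketch, not the calculation behind the chart: it assumes the robot records the first day each object was seen on each site, and that objects can be matched across sites by a shared key (here a made-up address). All dates and the name `appearance_lags` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical robot output: object key (e.g. a normalised address) -> the
# first day the robot saw the object on that site. All values are made up.
first_seen_site1 = {
    "Main St 1": date(2012, 3, 1),
    "Main St 7": date(2012, 3, 4),
    "Park Ln 3": date(2012, 3, 10),
}
first_seen_site2 = {
    "Main St 1": date(2012, 3, 3),
    "Main St 7": date(2012, 3, 4),
    "Quay Rd 9": date(2012, 3, 8),
}


def appearance_lags(site1, site2):
    """Days between first appearance on site 1 and on site 2, per shared object.

    Positive: the object appeared on site 1 first; negative: on site 2 first.
    Objects seen on only one site are ignored.
    """
    shared = site1.keys() & site2.keys()
    return {key: (site2[key] - site1[key]).days for key in sorted(shared)}


lags = appearance_lags(first_seen_site1, first_seen_site2)
histogram = Counter(lags.values())  # lag in days -> number of objects
print(lags)
print(sorted(histogram.items()))
```

The histogram of lags is the kind of summary that can be plotted per day of difference. In practice the matching step is the hard part, since the same house may be described differently on each site.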
Airline tickets (2010)
Airline tickets (2010)
Airline tickets (2010)
Airline tickets (2010) [Chart annotations: “?”; many differences; both robots see high prices; robot 2 initialization phase]
Airline tickets (2010)
Clothing. Site 1: 15 months, daily, very volatile. Site 2: 8 months, 30 000 items per day, more stable.
Clothing: from volatile data to statistics
Pilot for EGR: Wikipedia as a secondary data source? Wikipedia: company info for 41 000 businesses
Cinema tickets: little information on many sites
Conclusion
- IAD is useful to reduce response burden and for innovation
- Many objects on few sites => generic robot software
- Few objects on many sites => tool for semi-automated price collection
- Legislation: we operate as transparently as possible
Challenges:
- The internet changes continuously!!!
- Which content is original, which is stable?
- From volatile data sources to stable statistics
- We need advanced statistical methods, processes and IT