Tagging with Queries: How and Why?
Ioannis Antonellis, antonell@cs.stanford.edu
Hector Garcia-Molina, hector@cs.stanford.edu
Jawed Karim, jawed@cs.stanford.edu
Content on the Web
[Figure: text sources that describe a web page: back link text, search queries, page text, forward link text. Example labels: Cnn, Obama, Critics, news]
Stanford Infolab
How?
Basic observation: the HTTP referrer field contains the search query
How?
Basic observation: the HTTP referrer field contains the search query
1) Extract queries from the web access log
Web Access Log
a997c1950718d75c03f22ca8715e50b3 [28/Feb/2007:23:45:47 -0800] /group/svsa/cgi-bin/www/officers.php http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts
a64344ffd6638d0f6fb2a0284f98b28b [28/Feb/2007:23:45:49 -0800] /group/King/ "http://www.google.com.au/search?hl=en&q=Martin+Luther+King&meta="
413fa663474b2288c1661882e7e62aea [28/Feb/2007:23:46:02 -0800] /group/pandegroup/folding/results.html "http://www.google.com/search?sourceid=navclient-menuext&ie=UTF-8&q=RESULTS"
3d2edd4dfa7778da92875ee67a319433 [28/Feb/2007:23:46:03 -0800] /group/vpge/sgsi/entrepreneurship/ "http://www.google.com/search?hl=en&q=summer+institute+of+entrepreneurship"
ac49793239a6c490023e460fd4863a48 [28/Feb/2007:23:46:06 -0800] / "http://www.google.com/search?sourceid=navclient&hl=ko&ie=UTF-8&rlz=1T4SUNA_ko___KR209&q=stanford"
1c9893680
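Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the line layout (hashed client id, bracketed timestamp, requested path, optionally quoted referrer) is inferred from the sample entries above, the restriction to Google referrers that carry the query in the `q` parameter is an assumption, and the names `LOG_LINE` and `extract_query` are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout, inferred from the sample log entries:
# <hashed client id> [<timestamp>] <requested path> <referrer URL, optionally quoted>
LOG_LINE = re.compile(r'^(\S+) \[([^\]]+)\] (\S+) "?([^"\s]+)"?$')


def extract_query(line):
    """Return (requested_path, search_query) for a search referral, else None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    _client, _timestamp, path, referrer = match.groups()
    parts = urlsplit(referrer)
    # Assumption: only Google-style referrers that carry the query in "q".
    if "google." not in parts.netloc:
        return None
    values = parse_qs(parts.query).get("q")
    if not values:
        return None
    # parse_qs already decodes "+" and percent-escapes.
    return path, values[0]


# Two entries copied from the sample log, plus one made-up non-search referral.
sample_lines = [
    'a64344ffd6638d0f6fb2a0284f98b28b [28/Feb/2007:23:45:49 -0800] /group/King/ '
    '"http://www.google.com.au/search?hl=en&q=Martin+Luther+King&meta="',
    'ac49793239a6c490023e460fd4863a48 [28/Feb/2007:23:46:06 -0800] / '
    '"http://www.google.com/search?sourceid=navclient&hl=ko&ie=UTF-8'
    '&rlz=1T4SUNA_ko___KR209&q=stanford"',
    '0000 [28/Feb/2007:23:46:07 -0800] /index.html "http://www.stanford.edu/"',
]

for entry in sample_lines:
    print(extract_query(entry))
```

How a query is then turned into one or more tags for the requested URL is not specified in the slides, so the sketch stops at the (path, query) pair.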
How?
Basic observation: the HTTP referrer field contains the search query
1) Extract queries from the web access log
2) Embed JavaScript code in web pages that captures search queries
Embeddable code
How?
Basic observation: the HTTP referrer field contains the search query
1) Extract queries from the web access log
2) Embed JavaScript code in web pages and capture search queries
Convince the server administrator/page owner
Query tags
Information value of Query Tags
Datasets:
  Stanford Query Logs: 360,000 URLs, 900,000 query tags
  Delicious@Stanford: 3,000 URLs, 5,500 tags
  WebBase
Experiments - Summary
  URL coverage
  Query vs Delicious Tags
  Query/Delicious Tags vs Page Text
URL coverage
Query logs provide tags for ~110 times more URLs than Delicious
13% of Delicious URLs (380 URLs) are tagged only by Delicious
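A coverage comparison of this kind reduces to set operations over the URLs each source has tagged. The sketch below is illustrative only: the function name `coverage` and the toy URLs and tags are made up and are not the Stanford datasets.

```python
def coverage(query_tags, delicious_tags):
    """Compare which URLs are tagged by each source.

    Both arguments map a URL to the set of tags that source gives it.
    """
    query_urls = set(query_tags)
    delicious_urls = set(delicious_tags)
    delicious_only = delicious_urls - query_urls
    return {
        "query_urls": len(query_urls),
        "delicious_urls": len(delicious_urls),
        "common_urls": len(query_urls & delicious_urls),
        "delicious_only_urls": len(delicious_only),
        # Share of Delicious URLs that query logs do not cover at all.
        "delicious_only_share": (
            len(delicious_only) / len(delicious_urls) if delicious_urls else 0.0
        ),
    }


# Toy data (hypothetical URLs and tags, for illustration only).
toy_query_tags = {
    "/group/King/": {"martin luther king"},
    "/": {"stanford"},
    "/group/pandegroup/folding/results.html": {"results"},
}
toy_delicious_tags = {
    "/": {"university", "stanford"},
    "/some/other/page": {"research"},
}

print(coverage(toy_query_tags, toy_delicious_tags))
```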
Query Tags
Query logs provide 42 query tags per URL on average
Delicious Tags
Delicious provides 3 tags per URL on average
Tags for common URLs
Query logs provide 250 query tags per URL on average for common URLs
Delicious provides 5 tags per URL on average for common URLs
Query Tags vs Page Text
For every URL, 1 out of 3 query tags is not present in the page text
Delicious Tags vs Page Text
For every URL, 1 out of 2 Delicious tags is not present in the page text
Tags for common URLs
For common URLs, 1 out of 2 query/Delicious tags is not present in the page text
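The tag vs page text comparisons above can be approximated as follows. The slides do not say how "present in the page text" was decided, so this sketch assumes a simple rule: a tag counts as present when every one of its words occurs among the lowercased words of the page. The function names and the toy page are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_fraction(tags, page_text):
    """Fraction of tags with at least one word absent from the page text."""
    tags = list(tags)
    if not tags:
        return 0.0
    page_words = tokenize(page_text)
    missing = [tag for tag in tags if not tokenize(tag) <= page_words]
    return len(missing) / len(tags)


def mean_missing_fraction(tags_by_url, text_by_url):
    """Average the per-URL missing fraction over URLs that have tags."""
    fractions = [
        missing_fraction(tags, text_by_url.get(url, ""))
        for url, tags in tags_by_url.items()
        if tags
    ]
    return sum(fractions) / len(fractions) if fractions else 0.0


# Toy example (made-up page text and tags, for illustration only).
toy_tags = {"/": ["stanford", "infolab", "tagging"]}
toy_text = {"/": "Stanford Infolab home page"}

print(mean_missing_fraction(toy_tags, toy_text))
```

Averaging per URL, rather than pooling all tags, matches the "for every URL" phrasing on the slides, but that reading is itself an assumption.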
Conclusions
Query tags:
  Can be extracted in a distributed fashion
  Are a promising new source of information
  Can provide many new tags for a large fraction of the Web
Thank You!
(DEMO) http://tags.stanford.edu