Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Department of Computer Science, Stanford University
SIGIR 2008
Presentation by Sangkeun Lee
Center for E-Business Technology, Seoul National University, Seoul, Korea
Copyright 2009 by CEBT

Introduction
– Social tags: keyword annotations
– Tags are poorly understood
– Given a set of objects, and a set of tags applied to those objects by users, can we predict whether a given tag could/should be applied to a particular object?
– We call this 'social tag prediction'

Semantic Tech & Context - 2
What Can We Use Social Tag Prediction For?
– Increasing recall of single-tag queries/feeds
– Inter-user agreement: sharing objects despite vocabulary differences
– Tag disambiguation: what is 'apple', the company or the fruit?
– Bootstrapping: the way users use tags is determined by previous usage, so predicted tags can pre-seed system suggestions
Preliminaries
– A social tagging system consists of users u ∈ U, tags t ∈ T, and objects o ∈ O.
– A post is a set of tags placed on an object by a user; it is made up of one or more (t_i, u_j, o_k) triples.
– R_p: tags that describe objects. A set of (t, o) pairs, where each pair means that tag t positively describes object o.
– R_n: tags that do not describe objects. A set of (t, o) pairs, where each pair means that tag t negatively describes object o.
– R_a: tags that users actually applied. A set of (t, u, o) triples, where each triple means that user u annotated object o with tag t.
Preliminaries (cont'd)
– Examples:
  R_p = {(t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)}
  R_n = {(t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza), ...}
  R_a = {(t_pizzeria, u_sally, o_pizza)}
– Projection & selection: if we want to know all users who have tagged o_pizza, we write π_u(σ_{o_pizza}(R_a)), and the result is (u_sally).
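The projection/selection notation above can be sketched directly with Python sets; a minimal illustration, where the triples are toy data in the spirit of the slide's pizzeria example (the extra "bob" triple is made up to show a multi-user result):

```python
# R_a modeled as a set of (tag, user, object) triples, as in the slides.
R_a = {
    ("pizzeria", "sally", "pizza"),
    ("pizza", "bob", "pizza"),
    ("bagels", "sally", "bagels"),
}

def select_object(relation, obj):
    """sigma_{o}(R): keep only the triples about the given object."""
    return {triple for triple in relation if triple[2] == obj}

def project_users(relation):
    """pi_u(R): keep only the user component of each triple."""
    return {user for (_tag, user, _obj) in relation}

# pi_u(sigma_{o_pizza}(R_a)): all users who tagged the pizza object
users = project_users(select_object(R_a, "pizza"))
print(users)  # {'sally', 'bob'} (in some order)
```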
Dataset: The Stanford Tag Crawl Dataset
– URLs gathered from the del.icio.us recent feed
– Pages linked from each URL
– Inlinks to each URL
– 3,630,250 posts
– 2,549,282 unique URLs
– 301,499 active unique usernames, and about 2 TB of crawled data
– T100: the top 100 tags in the dataset by frequency
Tradeoff
– We only know R_a, so we must construct a dataset approximating R_p and R_n for experiments
– Heymann et al. [8] suggest that if (t_i, o_k) ∈ π_(t,o)(R_a), then (t_i, o_k) ∈ R_p
– However, (t_i, o_k) ∉ π_(t,o)(R_a) with (t_i, o_k) ∈ R_p occurs sufficiently often that measures of precision, recall, and accuracy can be heavily skewed
– The authors propose their own method for constructing the dataset
Tag Prediction
– Using page information: page text, anchor text, surrounding hosts
– Using tags: tag prediction based on other tags
Using Page Information
– Prediction as a binary classification task for each tag t_i ∈ T100
– Evaluate prediction accuracy using page information on the top 100 tags: 2,145,593 of 9,414,275 triples (22.7%)
– Train/test splits:
  Full/Full: randomly select 11/16 of the positive examples and 11/16 of the negative examples as the training set; the remaining 5/16 of each become the test set
  200/200: randomly select 200 positive and 200 negative examples for the training set, and the same for the test set
#1: Using Page Information
– Page text: all text present at the URL
– Anchor text: all text within fifteen words of inlinks to the URL
– Surrounding hosts: the sites linked to and from the URL, as well as the site of the URL itself
– The Penn Treebank tokenizer turns each text into a bag of words (each token and the number of times it occurred)
– A support vector machine is used for classification
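The pipeline on this slide (bag-of-words features plus a linear SVM, one binary classifier per tag) can be sketched with scikit-learn as a stand-in for the paper's setup. The toy documents, labels, and the tag "mac" below are illustrative assumptions, not data from the paper:

```python
# Minimal sketch: per-tag binary classification from bag-of-words
# counts with a linear SVM, assuming scikit-learn is available.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = [
    "osx software tool for the mac desktop",    # tag 'mac' applies
    "mac osx tips and software reviews",        # tag 'mac' applies
    "recipes for bagels and pizza downtown",    # tag 'mac' does not
    "pizza shop reviews and restaurant guide",  # tag 'mac' does not
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()        # token -> occurrence count
X = vectorizer.fit_transform(docs)

clf = LinearSVC()                     # one linear SVM per tag
clf.fit(X, labels)

test_page = vectorizer.transform(["new osx software tool"])
prediction = clf.predict(test_page)
```

In the paper's setting there would be one such classifier for each of the 100 tags in T100, trained on far larger splits.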
Evaluation
– PRBEP (precision-recall break-even point): a good single-number measure of how precision trades off against recall
– For Full/Full, the PRBEP for page text was about 60%, and for surrounding hosts about 51%
– This is pretty good: we can get about 2/3 of the URLs labeled with a particular tag, with about 1/3 erroneous URLs in the resulting set
– If we instead set recall to 10%, precision is about 90%
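PRBEP is the point on the precision-recall curve where precision equals recall; one way to sketch it is to rank examples by classifier score and track where the two measures cross. The scores and labels below are made-up illustrative values:

```python
# Sketch of the precision-recall break-even point (PRBEP).
def prbep(labels, scores):
    """Precision (= recall) at the break-even point of the ranked list."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best, tp = 0.0, 0
    for k, (_score, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / k
        recall = tp / total_pos
        if precision >= recall:          # still left of the crossing point
            best = max(best, min(precision, recall))
    return best

labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(prbep(labels, scores))  # 0.75: at rank 4, precision = recall = 3/4
```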
#2: Using Tags
– Using association rules: {software, tool, osx} -> add 'mac'
– Market-basket data mining:
  Support: the number of baskets containing both X and Y
  Confidence: how likely is Y given X? P(Y|X)
  Interest: how much more common is X & Y than expected by chance? P(Y|X) - P(Y)
– Here, baskets are URLs and items are tags
– Rules mined with support > 500 and length 2, support > 1000 and length 3, and support > 2000 of any length
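The three rule metrics can be computed directly over URL-as-basket data; a small sketch with toy baskets (the tag sets below are illustrative, not from the crawl):

```python
# Support, confidence, and interest for a tag rule X -> Y,
# where each basket is the set of tags applied to one URL.
baskets = [
    {"software", "tool", "osx", "mac"},
    {"software", "osx", "mac"},
    {"software", "tool"},
    {"recipes", "pizza"},
    {"osx", "mac"},
]

def support(X, Y):
    """Number of baskets containing all of X and Y."""
    items = set(X) | set(Y)
    return sum(1 for b in baskets if items <= b)

def confidence(X, Y):
    """P(Y | X): fraction of X-baskets that also contain Y."""
    n_x = sum(1 for b in baskets if set(X) <= b)
    return support(X, Y) / n_x

def interest(X, Y):
    """P(Y | X) - P(Y): how far the rule beats Y's base rate."""
    p_y = sum(1 for b in baskets if set(Y) <= b) / len(baskets)
    return confidence(X, Y) - p_y

print(support({"software", "osx"}, {"mac"}))     # 2
print(confidence({"software", "osx"}, {"mac"}))  # 1.0
print(interest({"software", "osx"}, {"mac"}))    # 1.0 - 3/5 = 0.4
```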
Found Association Rules: Top 30
Found Association Rules: Random Sample
Tag Application Simulation
– Rules: generate association rules from a training set of 50,000 URLs
– Tags: sample n bookmarks from each of 10,000 test-set URLs
– Stop applying association rules once confidence falls below a particular minimum c
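The simulation loop can be sketched as follows: starting from a few observed tags, repeatedly fire any rule whose confidence is at least the threshold c, adding the predicted tag, until no rule fires. The rules and confidences below are toy assumptions, not mined from the dataset:

```python
# Sketch of the rule-application loop with a minimum confidence c.
# Each rule is (antecedent tag set, predicted tag, confidence).
rules = [
    (frozenset({"software", "osx"}), "mac", 0.95),
    (frozenset({"mac", "tool"}), "apple", 0.80),
    (frozenset({"pizza"}), "food", 0.60),
]

def apply_rules(observed, c):
    """Expand the tag set until no rule with confidence >= c fires."""
    tags = set(observed)
    changed = True
    while changed:
        changed = False
        for antecedent, prediction, conf in rules:
            if conf >= c and antecedent <= tags and prediction not in tags:
                tags.add(prediction)
                changed = True
    return tags

print(apply_rules({"software", "osx", "tool"}, c=0.75))
# adds 'mac' (conf 0.95), then 'apple' (conf 0.80); 'food' never fires
```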
Experimental Results
– For n = 1, 2, 3, 5 and c = 0.5, 0.75, 0.9
– Estimated precision: the average confidence of the rules applied
How Useful Are Predicted Tags?
– Increasing recall for single-tag queries
– Treat each tag in the top 100 tags as a query
– Recall increases; precision decreases but remains high
Discussion
– This paper presents large-scale experiments on real data
– Shows what social tag prediction can be used for
– Prediction methods: using page information and using tags; both show quite reasonable results
– Well organized and written, with a scientific experiment design
– Not a new idea: an application of SVMs and market-basket data mining
– What can we crawl to run experiments like the authors'? Naver Knowin? Blogs & tags? Any interesting ideas?