Automatic Web Tagging and Person Tagging Using Language Models
Qiaozhu Mei †, Yi Zhang ‡
Presented by Jessica Gronski ‡
† University of Illinois at Urbana-Champaign
‡ University of California at Santa Cruz

Tagging a Web Document
The dual problem of search/retrieval [Mei et al. 2007]:
  – Retrieval: short description (query) → relevant documents
  – Tagging: document → short description (tag)
Tags summarize the content of documents
Tags help users access the document in the future
[Figure: a text document and a query/tag, connected by retrieval and tagging]

Social Bookmarking of Web Documents
[Figure: web documents and their social bookmarks (tags)]

Existing Work on Social Bookmarking
Social bookmarking systems
  – Del.icio.us, Digg, Citeulike, etc.
Enhancing social bookmarking systems
  – Anti-spam [Koutrika et al. 2007]
  – Searching and ranking tags [Hotho et al. 2006]
Utilizing social bookmarks
  – Visualization [Dubinko et al. 2006]
  – Summarization [Boydell et al. 2007]
  – Using tags to help web search [Heymann et al. 2008]; [Zhou et al. 2008]

Research Questions
Can we automatically generate tags for web documents?
  – Meaningful, compact, relevant
Can we generate tags for other web objects, such as web users?

Applications of Automatic Tagging
Summarizing documents and other web objects
Suggesting social bookmarks
Refining queries for web search
  – Finding good queries for a document
Suggesting good keywords for online advertising

Rest of the Talk
A probabilistic approach to tag generation
  – Candidate tag selection
  – Web document representation
  – Tag ranking
Experiments
  – Web document tagging
  – Web user tagging
Summary

Our Method
[Figure: overview of the tagging pipeline]
  – Web documents → multinomial word distribution representation (data, statistics, tutorial, analysis, software, model, frequent, probabilistic, algorithm, …)
  – User-generated corpus (e.g., Del.icio.us, Wikipedia) → candidate tag pool (ipod nano, data mining, presidential campaign, index structure, statistics tutorial, computer science, …)
  – Ranking candidate tags: data mining 0.26, statistics tutorial 0.19, computer science 0.17, index structure 0.01, …, ipod nano, presidential campaign 0.0, …

Candidate Tag Selection
Goal: meaningful, compact, user-oriented tags
From social bookmarking data
  – E.g., Del.icio.us
  – Single tags: tags that other people used
  – “Phrases”: statistically significant bigrams
From other user-generated web content
  – E.g., Wikipedia
  – Titles of Wikipedia entries

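The slide does not say which significance test selects the bigrams. As a rough sketch only, the following assumes Student's t-test for collocations over adjacent tags within each bookmark. The function name `significant_bigrams`, the threshold `min_t`, and the toy input are hypothetical and not taken from the talk.

```python
import math
from collections import Counter


def significant_bigrams(tag_sequences, min_t=2.0):
    """Score adjacent tag pairs with a t-test and keep those with t >= min_t.

    Assumption: the null hypothesis is that the two tags occur independently,
    and the sample variance is approximated by the observed bigram rate.
    """
    unigrams = Counter()
    bigrams = Counter()
    for seq in tag_sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))

    n = sum(unigrams.values())  # total number of tag tokens
    scored = []
    for (w1, w2), c12 in bigrams.items():
        observed = c12 / n
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        t = (observed - expected) / math.sqrt(observed / n)
        if t >= min_t:
            scored.append(((w1, w2), t))
    # Highest-scoring bigrams first
    return sorted(scored, key=lambda item: -item[1])


# Toy usage: one frequent pair and one accidental pair
toy = [["data", "mining"]] * 30 + [["web"]] * 30 + [["design"]] * 30 + [["web", "data"]]
candidates = significant_bigrams(toy)
```

In this toy input the frequent pair passes the threshold while the accidental pair is dropped. A real system would also need a minimum count and a cap on pool size, such as the top 15,000 bigrams used in the experiments.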
Representation of Web Documents
Multinomial distribution of words (unigram language model)
  – Commonly used in retrieval and text mining
Can be estimated from the content of the document, or from social bookmarks (our approach)
  – What other people used to tag that document
Example distribution: text 0.16, mining 0.08, data 0.07, probabilistic 0.04, independence 0.03, model 0.03, …
Baseline: use the top words in that distribution to tag a document

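A minimal sketch of this representation, assuming a plain maximum-likelihood estimate over all tag assignments a document received (the slide does not specify any smoothing). The helper names `tag_language_model` and `top_words` are hypothetical.

```python
from collections import Counter


def tag_language_model(tag_assignments):
    """Estimate p(w|d) as the relative frequency of each tag applied to d."""
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def top_words(model, k=3):
    """Baseline tagger: the k most probable words (ties broken alphabetically)."""
    return sorted(model, key=lambda w: (-model[w], w))[:k]


# Toy usage: every tag that any user assigned to one URL
assignments = ["text"] * 4 + ["mining"] * 2 + ["data"] * 2
model = tag_language_model(assignments)
baseline_tags = top_words(model, k=2)
```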
Tag Ranking: A Probabilistic Approach
Web document d → a language model
Candidate tag t → a language model estimated from its co-occurring tags in the social bookmark collection
Score and rank t by the KL divergence between these two language models

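The ranking step can be sketched as follows, assuming the score is the negative KL divergence from the document model to the tag model. Additive smoothing is assumed here so that the divergence stays finite; the slide does not say how zero probabilities are handled. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def smoothed_model(words, vocab, alpha=0.01):
    """Unigram model over vocab with additive smoothing (an assumption)."""
    counts = Counter(words)
    denom = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


def neg_kl(p_doc, p_tag):
    """Return -KL(p_doc || p_tag); values closer to 0 mean a better match."""
    return -sum(p * math.log(p / p_tag[w]) for w, p in p_doc.items() if p > 0)


def rank_tags(doc_tags, cooccurring, alpha=0.01):
    """Rank candidate tags by -KL between the document and tag models.

    doc_tags: tags that users assigned to the document.
    cooccurring: candidate tag -> list of tags it co-occurs with.
    """
    vocab = set(doc_tags)
    for words in cooccurring.values():
        vocab.update(words)
    p_doc = smoothed_model(doc_tags, vocab, alpha)
    scores = {
        tag: neg_kl(p_doc, smoothed_model(words, vocab, alpha))
        for tag, words in cooccurring.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Toy usage with two candidate tags
ranking = rank_tags(
    ["data", "mining", "text", "data"],
    {"data mining": ["data", "mining", "text"], "ipod nano": ["ipod", "apple", "music"]},
)
```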
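Rewriting and Efficient Computation
[Equation: the ranking score rewritten in terms of PMI(w,t|C), with C a reference collection (e.g., del.icio.us)]
Terms annotated on the slide:
  – Bias of using C to represent candidate tag t
  – Bias of using C to represent document d
Notes on PMI(w,t|C):
  1. Can be pre-computed from the corpus
  2. Only values with PMI(w,t|C) > 0 need to be stored
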
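The formula itself is not preserved in this transcript. One rewriting that is consistent with the annotations, assuming the score is the negative KL divergence between the document model and the tag model, splits the log ratio into three parts through the collection C. The split is an algebraic identity, but the notation and the mapping of each term to a slide annotation are assumptions.

```latex
\[
\begin{aligned}
\mathrm{Score}(t, d) &= -D(d \,\|\, t)
  = -\sum_{w} p(w \mid d)\,\log\frac{p(w \mid d)}{p(w \mid t)} \\
 &= \sum_{w} p(w \mid d)\,\log\frac{p(w \mid t, C)}{p(w \mid C)}
   \;-\; \sum_{w} p(w \mid d)\,\log\frac{p(w \mid d)}{p(w \mid C)}
   \;-\; \sum_{w} p(w \mid d)\,\log\frac{p(w \mid t, C)}{p(w \mid t)} \\
 &= \sum_{w} p(w \mid d)\,\mathrm{PMI}(w, t \mid C) \;-\; D(d \,\|\, C) \;+\; \mathrm{Bias}(t, C),
\end{aligned}
\]
\[
\mathrm{PMI}(w, t \mid C) = \log\frac{p(w, t \mid C)}{p(w \mid C)\,p(t \mid C)},
\qquad
\mathrm{Bias}(t, C) = -\sum_{w} p(w \mid d)\,\log\frac{p(w \mid t, C)}{p(w \mid t)}.
\]
```

Under this reading, D(d‖C) does not depend on the candidate tag t, so it cannot change the ranking for a given document. The PMI terms depend only on the collection, which is why they can be computed once and stored.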
Tagging Web Users
Goal: summarize the interests and bias of a user
Web user → a pseudo-document
Estimate a language model from all the tags that the user has used
The rest is similar to web document tagging

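As a small sketch of the pseudo-document idea, the code below pools every tag a user has assigned across all bookmarks and normalizes the counts. The record layout `(user, url, tags)` and the function name are assumptions for illustration, not the format of the Del.icio.us data.

```python
from collections import Counter, defaultdict


def user_language_models(records):
    """Build p(w|user) from bookmark records of the form (user, url, tags)."""
    counts = defaultdict(Counter)
    for user, _url, tags in records:
        # The user's pseudo-document is the concatenation of all their tags
        counts[user].update(tags)

    models = {}
    for user, tag_counts in counts.items():
        total = sum(tag_counts.values())
        models[user] = {w: c / total for w, c in tag_counts.items()}
    return models


# Toy usage with hypothetical bookmark records
records = [
    ("u1", "http://example.org/a", ["photography", "art"]),
    ("u1", "http://example.org/b", ["photography", "flickr"]),
    ("u2", "http://example.org/c", ["programming"]),
]
models = user_language_models(records)
```

Each resulting model can then be scored against candidate tags in the same way as a document model.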
Experiments
Dataset:
  – Two weeks of tagging records from Del.icio.us
  – Candidate tags: the top 15,000 significant 2-grams from Del.icio.us; titles of all Wikipedia entries (5,836,166 entries, of which around 48,000 appeared in Del.icio.us)

Time Span             Bookmarks   Distinct Tags   Distinct Users
02/13/07 ~ 02/26/07   579,652     111,381         20,138

Tagging Web Documents
Columns: URL | LM p(w|d) | Tag = word | Tag = bigram | Tag = Wikipedia title

URL 1 (158 bookmarks)
  LM p(w|d): color design webdesign tools adobe graphics flash
  Tag = word: color colour palette colorscheme colours picker cor
  Tag = bigram: adobe color, color design, color colour, color colors, colour design, inspiration palette, webdesign color
  Tag = Wikipedia title: color colour palette web color colours cor rgb

URL 2: watch?v=6gmP4nk0EOE (157 bookmarks)
  LM p(w|d): web2.0 video youtube web internet xml community
  Tag = word: youtube revver vodcast primer comunidad participation ethnograpy
  Tag = bigram: xml youtube, web2.0 youtube, video web2.0, web2.0 xml, online presentation, social video, youtube video
  Tag = Wikipedia title: internet video youtube revver research video vodcast primer p2p TV

Observations
  LM p(w|d): too general, sometimes not relevant
  Tag = word: relevant, precise, but sometimes not meaningful
  Tag = bigram: meaningful, relevant, but overfits the data; not real phrases
  Tag = Wikipedia title: meaningful, relevant, real, but only partially covers good tags

Tagging Web Documents (Cont.)
Columns: URL | LM p(w|d) | Tag = word | Tag = bigram | Tag = Wikipedia title

URL 3 (386 bookmarks)
  LM p(w|d): yahoo rss web2.0 mashup feeds programming pipes
  Tag = word: pipes feeds yahoo mashup rss syndication mashups
  Tag = bigram: feeds mashups, mashup pipes, web2.0 yahoo, rss web2.0, mashup rss, api feeds, pipes programming
  Tag = Wikipedia title: pipes yahoo mashups rss syndication mashups blog feeds

URL 4 (349 bookmarks)
  LM p(w|d): ajax javascript web2.0 webdesign programming code webdev
  Tag = word: ajax dhtml javascript moo.fx dragdrop phototype autosuggest
  Tag = bigram: ajax code, code javascript, javascript ajax, javascript web2.0, css ajax, javascript programming
  Tag = Wikipedia title: ajax dhtml javascript moo.fx javascript library javascript framework

Observations
  LM p(w|d): too general, sometimes not relevant
  Tag = word: relevant, precise, but sometimes not meaningful
  Tag = bigram: meaningful, relevant, but overfits the data; not real phrases
  Tag = Wikipedia title: meaningful, relevant, real

Tagging Web Users
Columns: User | LM p(w|d) | Tag = bigram | Tag = Wikipedia title

User 1
  LM p(w|d): photography art portraits tools web design geek
  Tag = bigram: art photography, photography portraits, digital flickr, photoblog photography, art photo, flickr photography, weblog wordpress
  Tag = Wikipedia title: art photography photoblog portraits photography landscapes flickr art contest

User 2
  LM p(w|d): humor programming photography blog webdesign security funny
  Tag = bigram: geek hack, humor programming, hack hacking, networking programming, geek html, geek hacking, reference security
  Tag = Wikipedia title: network programming tweak hacking security geek humor sysadmin digitalcamera

Observations
  LM p(w|d): partially covers the user's interests
  Tag = bigram: overfits the data; not real phrases
  Tag = Wikipedia title: meaningful, relevant, real

Tagging Web Users (Cont.)
Columns: User | LM p(w|d) | Tag = bigram | Tag = Wikipedia title

User 3
  LM p(w|d): games arg tools programming sudoku cryptography software
  Tag = bigram: arg games, games puzzles, games internet, arg code, games sudoku, code generator, community games
  Tag = Wikipedia title: arg games research games puzzles storytelling code generator community games

User 4
  LM p(w|d): web reference css development rubyonrails tools design
  Tag = bigram: rubyonrails web, css development, brower development, development editor, development forum, development firefox, javascript tools
  Tag = Wikipedia title: javascript css webdev xhtml dhtml css3 dom

Observation: missed many good tags

Discussions
Using top tags: too general, sometimes not relevant
Ranking tags by labeling language models:
  – Candidate = social bookmarking words
      Pros: relevant, compact
      Cons: ambiguous, not so meaningful
  – Candidate = social bookmarking bigrams
      Pros: more meaningful, relevant
      Cons: overfitting the data, sometimes not real phrases
  – Candidate = Wikipedia titles
      Pros: meaningful, relevant, real phrases
      Cons: biased, misses potentially good tags (Bias(t, C))

Summary
Automatic tagging of web documents and web users
A probabilistic approach based on labeling language models
Effective when the candidate tags are of high quality
Future work:
  – A robust way of generating candidate tags
  – Large-scale evaluation

Thanks!