Automatic Web Tagging and Person Tagging Using Language Models - Qiaozhu Mei †, Yi Zhang ‡ Presented by Jessica Gronski ‡ † University of Illinois at Urbana-Champaign.

Presentation transcript:

Automatic Web Tagging and Person Tagging Using Language Models
Qiaozhu Mei †, Yi Zhang ‡
Presented by Jessica Gronski ‡
† University of Illinois at Urbana-Champaign
‡ University of California at Santa Cruz

Tagging a Web Document
The dual problem of search/retrieval [Mei et al. 2007]:
– Retrieval: short description (query) → relevant documents
– Tagging: document → short description (tag)
  To summarize the content of documents
  To access the document in the future
[Figure: a text document and a query/tag, linked by "retrieval" and "tagging"]

Social Bookmarking of Web Documents
[Figure: web documents and their social bookmarks (tags)]

Existing Work on Social Bookmarking
Social bookmarking systems
– Del.icio.us, Digg, Citeulike, etc.
Enhance social bookmarking systems
– Anti-spam [Koutrika et al. 2007]
– Search & ranking tags [Hotho et al. 2006]
Utilize social bookmarks
– Visualization [Dubinko et al. 2006]
– Summarization [Boydell et al. 2007]
– Use tags to help web search [Heymann et al. 2008]; [Zhou et al. 2008]

Research Questions
Can we automatically generate tags for web documents?
– Meaningful, compact, relevant
Can we generate tags for other web objects, such as web users?

Applications of Automatic Tagging
Summarize documents / web objects
Suggest social bookmarks
Refine queries for web search
– Finding good queries for a document
Suggest good keywords for online advertising

Rest of the Talk
A probabilistic approach to tag generation
– Candidate tag selection
– Web document representation
– Tag ranking
Experiments
– Web document tagging
– Web user tagging
Summary

Our Method
[Figure: pipeline of the method]
– Web documents → multinomial word distribution representation (data, statistics, tutorial, analysis, software, model, frequent, probabilistic, algorithm, …)
– User-generated corpus (e.g., Del.icio.us, Wikipedia) → candidate tag pool (ipod nano, data mining, presidential campaign, index structure, statistics tutorial, computer science, …)
– Ranking candidate tags: data mining 0.26, statistics tutorial 0.19, computer science 0.17, index structure 0.01, …, ipod nano, presidential campaign 0.0, …

Candidate Tag Selection
Meaningful, compact, user-oriented
From social bookmarking data
– E.g., Del.icio.us
– Single tags → tags that other people used
– "Phrases" → statistically significant bigrams
From other user-generated web content
– E.g., Wikipedia
– Titles of entries in Wikipedia
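The slide does not say which significance test was used to pick bigrams. As a minimal sketch only, the following assumes a collocation-style t-score over adjacent tags within each bookmark; the function name `significant_bigrams`, the threshold `min_t`, and the choice of test are all hypothetical.

```python
from collections import Counter
from math import sqrt

def significant_bigrams(bookmarks, min_t=2.0):
    """Return adjacent tag pairs whose t-score is at least min_t.

    bookmarks: list of tag lists, one list per bookmark, in the order typed.
    The t-score compares the observed pair count with the count expected
    if the two tags occurred independently (an assumed test, not the
    one from the talk).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tags in bookmarks:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))  # adjacent pairs only
    n = sum(unigrams.values())
    scored = {}
    for (a, b), observed in bigrams.items():
        expected = unigrams[a] * unigrams[b] / n
        t = (observed - expected) / sqrt(observed)
        if t >= min_t:
            scored[(a, b)] = t
    # Highest-scoring candidates first
    return sorted(scored, key=scored.get, reverse=True)
```

The surviving pairs would then join the candidate tag pool alongside single tags and Wikipedia titles.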

Representation of Web Documents
Multinomial distribution of words (unigram language models)
– Commonly used in retrieval and text mining
Can be estimated from the content of the document, or from social bookmarks (our approach)
– What other people used to tag that document
Example distribution: text 0.16, mining 0.08, data 0.07, probabilistic 0.04, independence 0.03, model 0.03, …
Baseline: use the top words in that distribution to tag a document
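A minimal sketch of this representation, assuming a plain maximum-likelihood estimate over the tags that users attached to one document (the slide does not state whether smoothing was applied). The names `tag_language_model` and `top_words` are hypothetical.

```python
from collections import Counter

def tag_language_model(bookmarks):
    """Estimate p(w|d) from the bookmarks of one document.

    bookmarks: list of tag lists, one list per user who bookmarked d.
    Returns a dict mapping each tag to its relative frequency.
    """
    counts = Counter(tag for tags in bookmarks for tag in tags)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def top_words(model, k=3):
    """Baseline tagger: the k most probable words, ties broken alphabetically."""
    ranked = sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]
```

For example, `top_words(tag_language_model(bookmarks), k=7)` would produce the kind of list shown in the "LM p(w|d)" column of the result tables.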

Tag Ranking: A Probabilistic Approach
Web document d → a language model
A candidate tag t → a language model estimated from its co-occurring tags
Score and rank t by the KL-divergence of these two language models
[Figure: both language models are estimated from the social bookmark collection]
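The ranking step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the score is taken to be the negative KL-divergence from the document model to the tag model, the tag model is built from tags that co-occur with t in the same bookmark, and additive smoothing with a hypothetical parameter `mu` avoids zero probabilities. All function names are hypothetical.

```python
from collections import Counter
from math import log

def cooccurrence_model(tag, bookmarks, vocab, mu=0.1):
    """Smoothed p(w|t) over vocab, from tags sharing a bookmark with `tag`."""
    counts = Counter()
    for tags in bookmarks:
        if tag in tags:
            counts.update(w for w in tags if w != tag)
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}

def neg_kl(p_doc, p_tag):
    """-KL(p_doc || p_tag); every word of p_doc must appear in p_tag."""
    return -sum(p * log(p / p_tag[w]) for w, p in p_doc.items() if p > 0)

def rank_tags(p_doc, candidates, bookmarks, vocab):
    """Return (tag, score) pairs, best first (score closest to zero)."""
    scores = {
        t: neg_kl(p_doc, cooccurrence_model(t, bookmarks, vocab))
        for t in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A candidate whose co-occurring tags resemble the document's own tag distribution gets a score near zero and rises to the top of the list.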

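Rewriting and Efficient Computation
[Equation not preserved in the transcript; the slide annotations follow]
– Bias of using C to represent candidate tag t
– Bias of using C to represent document d
– C: the social bookmark collection (e.g., del.icio.us)
– PMI(w,t|C): 1. can be pre-computed from the corpus; 2. only store those with PMI(w,t|C) > 0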
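Since the rewritten formula itself is missing from the transcript, the following is only a sketch of the two efficiency notes: a PMI table is computed once from the collection, and only positive entries are kept. It assumes PMI is estimated from bookmark-level co-occurrence and that a candidate is scored by the expected PMI under p(w|d), leaving out the two bias terms. The names `positive_pmi_table` and `expected_pmi` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import log

def positive_pmi_table(bookmarks):
    """Precompute PMI(w, t | C) and keep only entries greater than zero.

    bookmarks: list of tag lists forming the collection C.
    Returns table[t][w] = PMI(w, t | C), stored only when positive.
    """
    n = len(bookmarks)
    single = Counter()
    pair = Counter()
    for tags in bookmarks:
        uniq = set(tags)
        single.update(uniq)
        pair.update(permutations(uniq, 2))  # ordered (w, t) pairs
    table = defaultdict(dict)
    for (w, t), c in pair.items():
        # log of p(w,t) / (p(w) p(t)) with bookmark-level relative frequencies
        pmi = log(c * n / (single[w] * single[t]))
        if pmi > 0:
            table[t][w] = pmi
    return table

def expected_pmi(p_doc, tag, table):
    """Sum over w of p(w|d) * PMI(w, tag | C); missing entries count as 0."""
    row = table.get(tag, {})
    return sum(p * row.get(w, 0.0) for w, p in p_doc.items())
```

Treating unstored (non-positive) entries as zero is an approximation made here for illustration; it keeps the table sparse at the cost of ignoring negative associations.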

Tagging Web Users
Summarize the interests and bias of a user
Web user → a pseudo document
Estimate a language model from all tags that the user has used
The rest is similar to web document tagging
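The pseudo-document idea can be sketched in a few lines. This assumes tagging records arrive as (user, url, tags) tuples, which is a hypothetical layout, and uses an unsmoothed relative-frequency estimate; `user_language_model` is a hypothetical name.

```python
from collections import Counter

def user_language_model(records, user):
    """Pool every tag a user has applied and normalize to p(w|user).

    records: iterable of (user, url, tags) tuples.
    Returns an empty dict for a user with no tagging records.
    """
    counts = Counter(
        tag
        for u, _url, tags in records
        if u == user
        for tag in tags
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```

The resulting distribution can be passed to the same ranking step used for documents, since both are just unigram language models over tags.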

Experiments
Dataset:
– Two-week tagging records from Del.icio.us
– Candidate tags: top 15,000 significant 2-grams from Del.icio.us; titles of all Wikipedia entries (5,836,166 entries, around 48,000 of which appeared in Del.icio.us)

Time Span            Bookmarks  Distinct Tags  Distinct Users
02/13/07 ~ 02/26/07  579,652    111,381        20,138

Tagging Web Documents
URL: (158 bookmarks)
– LM p(w|d): color design webdesign tools adobe graphics flash
– Tag = word: color colour palette colorscheme colours picker cor
– Tag = bigram: adobe color, color design, color colour, color colors, colour design, inspiration palette, webdesign color
– Tag = Wikipedia title: color color colour palette web color colours cor rgb
URL: watch?v=6gmP4nk0EOE (157 bookmarks)
– LM p(w|d): web2.0 video youtube web internet xml community
– Tag = word: youtube revver vodcast primer comunidad participation ethnograpy
– Tag = bigram: xml youtube, web2.0 youtube, video web2.0, web2.0 xml, online presentation, social video, youtube video
– Tag = Wikipedia title: internet video youtube revver research video vodcast primer p2p TV
Observations:
– LM p(w|d): too general, sometimes not relevant
– Tag = word: relevant, precise, but sometimes not meaningful
– Tag = bigram: meaningful, relevant, but overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real, but only partially covers good tags

Tagging Web Documents (Cont.)
URL: (386 bookmarks)
– LM p(w|d): yahoo rss web2.0 mashup feeds programming pipes
– Tag = word: pipes feeds yahoo mashup rss syndication mashups
– Tag = bigram: feeds mashups, mashup pipes, web2.0 yahoo, rss web2.0, mashup rss, api feeds, pipes programming
– Tag = Wikipedia title: pipes yahoo mashups rss syndication mashups blog feeds
URL: (349 bookmarks)
– LM p(w|d): ajax javascript web2.0 webdesign programming code webdev
– Tag = word: ajax dhtml javascript moo.fx dragdrop phototype autosuggest
– Tag = bigram: ajax code, code javascript, javascript ajax, javascript web2.0, css ajax, javascript programming
– Tag = Wikipedia title: ajax dhtml javascript moo.fx javascript library javascript - framework
Observations:
– LM p(w|d): too general, sometimes not relevant
– Tag = word: relevant, precise, but sometimes not meaningful
– Tag = bigram: meaningful, relevant, but overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real

Tagging Web Users
User 1
– LM p(w|d): photography art portraits tools web design geek
– Tag = bigram: art photography, photography portraits, digital flickr, photoblog photography, art photo, flickr photography, weblog wordpress
– Tag = Wikipedia title: art photography photoblog portraits photography landscapes flickr art contest
User 2
– LM p(w|d): humor programming photography blog webdesign security funny
– Tag = bigram: geek hack, humor programming, hack hacking, networking programming, geek html, geek hacking, reference security
– Tag = Wikipedia title: network programming tweak hacking security geek humor sysadmin digitalcamera
Slide annotations: partially covers the interest; meaningful, relevant, real; overfits the data, not real phrases

Tagging Web Users (Cont.)
User 3
– LM p(w|d): games arg tools programming sudoku cryptography software
– Tag = bigram: arg games, games puzzles, games internet, arg code, games sudoku, code generator, community games
– Tag = Wikipedia title: arg games research games puzzles storytelling code generator community games
User 4
– LM p(w|d): web reference css development rubyonrails tools design
– Tag = bigram: rubyonrails web, css development, brower development, development editor, development forum, development firefox, javascript tools
– Tag = Wikipedia title: javascript css webdev xhtml dhtml css3 dom
Slide annotation: missed many good tags

Discussions
Using top tags: too general, sometimes not relevant
Ranking tags by labeling language models:
– Candidate = social bookmarking words
  Pros: relevant, compact
  Cons: ambiguous, not so meaningful
– Candidate = social bookmarking bigrams
  Pros: more meaningful, relevant
  Cons: overfitting the data, sometimes not real phrases
– Candidate = Wikipedia titles
  Pros: meaningful, relevant, real phrases
  Cons: biased, missed potentially good tags (Bias(t, C))

Summary
Automatic tagging of web documents and web users
A probabilistic approach based on labeling language models
Effective when the candidate tags are of high quality
Future work:
– A robust way of generating candidate tags
– Large-scale evaluation

Thanks!