Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Stanford University
SIGIR 2008

Outline
Introduction
Preliminaries
Dataset
Tag Prediction Using Page Information
Tag Prediction Using Tags
Conclusions

Introduction
Social tagging allows users to contribute metadata to large and dynamic corpora
The social tag prediction problem
– Given a set of objects and a set of tags, can we predict whether a given tag could/should be applied to a particular object?

Benefits of Predicting Social Tags
At a fundamental level, we gain insights into the “information content” of tags
– If tags are easy to predict from other content, they add little value
At a practical level, a tag predictor can enhance a social tagging site in a variety of forms
– Increase recall of single tag queries/feeds
– Inter-user agreement
– Tag disambiguation
– Bootstrapping
– System suggestion

Preliminaries
A post is a set of triples (t_i, u_j, o_k) indicating that user u_j annotated object o_k with a set of tags
Supposing that each tag either does or does not describe an object, there are three relations:
– R_p = a set of (t, o) pairs where each pair means that tag t positively describes object o
– R_n = a set of (t, o) pairs where each pair means that tag t negatively describes (does not describe) object o
– R_a = a set of (t, u, o) triples where each triple means that user u annotated object o with tag t
T_100 = the 100 most frequent tags

Operators and Examples
Two standard relational algebra operators
– σ_c selects tuples from a relation where a particular condition c holds (WHERE in SQL)
– π_p projects a relation onto a smaller number of attributes (SELECT in SQL)
Example: for a web page o_bagels about a downtown bagel shop and a web page o_pizza about a pizzeria,
– R_p = {(t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)}
– R_n = {(t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza), …}
– π_t(σ_{o_bagels}(R_p)) = tags which positively describe o_bagels = {t_bagels, t_shop, t_downtown}
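
The two operators can be sketched in a few lines of Python. This is only an illustration under assumptions: relations are modelled as sets of tuples, and the helper names `select` and `project` are hypothetical, not taken from the talk.

```python
# Illustrative sketch: relations are plain Python sets of tuples.
# R_p holds (tag, object) pairs from the bagel/pizza example.
R_p = {
    ("bagels", "o_bagels"),
    ("shop", "o_bagels"),
    ("downtown", "o_bagels"),
    ("pizza", "o_pizza"),
    ("pizzeria", "o_pizza"),
}


def select(relation, condition):
    """sigma_c: keep only the tuples for which condition(row) is true."""
    return {row for row in relation if condition(row)}


def project(relation, *positions):
    """pi_p: keep only the attributes at the given positions.

    The result is a set, so duplicate rows collapse automatically.
    """
    return {tuple(row[i] for i in positions) for row in relation}


# pi_t(sigma_{o_bagels}(R_p)): the tags that positively describe o_bagels.
tags_for_bagels = project(select(R_p, lambda row: row[1] == "o_bagels"), 0)
```

Here `tags_for_bagels` holds one-element tuples for the tags bagels, shop, and downtown, matching the example on the slide.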

Dataset
The base: Stanford Tag Crawl Dataset
– Gathered from del.icio.us
– Consists of 2,549,282 unique URLs with their posts
– Anchor text and link information for each URL
Experimental dataset construction
– Aiming to approximate R_p and R_n
– Assume that if (t_i, o_k) ∈ π_(t,o)(R_a) then (t_i, o_k) ∈ R_p
  – The reverse is not true
– Filter the dataset by postcount(o_k) = |π_u(σ_{o_k}(R_a))|
  – Assume that as postcount(o_k) increases, R_p is better approximated by R_a
  – Filtering threshold = 100
– 62,000 URLs in the filtered set
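
The postcount filter can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and it assumes the threshold means "at least 100 distinct users", which the slide does not state explicitly.

```python
from collections import defaultdict


def postcount(R_a):
    """Map each object o to |pi_u(sigma_o(R_a))|: its number of distinct users."""
    users_by_object = defaultdict(set)
    for _tag, user, obj in R_a:
        users_by_object[obj].add(user)
    return {obj: len(users) for obj, users in users_by_object.items()}


def approximate_R_p(R_a, threshold=100):
    """Project R_a onto (tag, object) pairs, keeping only well-posted objects.

    Assumption: objects with postcount >= threshold are kept.
    """
    counts = postcount(R_a)
    return {(tag, obj) for tag, _user, obj in R_a if counts[obj] >= threshold}


# Toy annotations (tag, user, object); a threshold of 2 stands in for 100.
toy_R_a = {
    ("bagels", "u1", "o_bagels"),
    ("shop", "u2", "o_bagels"),
    ("pizza", "u1", "o_pizza"),
}
toy_R_p = approximate_R_p(toy_R_a, threshold=2)
```

In the toy data only o_bagels has two distinct users, so only its (tag, object) pairs survive the filter.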

Probability of Adding “New” Tags
Figure: Average new tags (in T_100) versus number of posts

Comparison between Popular Tags
Table: The top/bottom tags in T_100 to be added after the 100th bookmark. The top 15 tags are relatively ambiguous and personal.

Features for SVM
Page text features
– Bag of words
Anchor text
– Bag of words
– Text within 15 words of inlinks to the URL
– Use only URLs with at least 100 inlinks as examples
Surrounding hosts
– Hosts/domains of backlinks
– Hosts/domains of the URL
– Hosts/domains of forward links
For each feature type, the top 1000 features selected by mutual information are used
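
Selecting features by mutual information can be sketched as below. This is a hedged illustration only: it assumes binary presence/absence features and a binary tag label, and the function names are hypothetical; the slides do not say how the authors computed the scores.

```python
import math
from collections import Counter


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between feature presence and the label.

    docs is a list of sets of features; labels is a parallel list of booleans.
    """
    n = len(docs)
    joint = Counter((feature in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (a, _), c in joint.items() if a == x) / n
        p_y = sum(c for (_, b), c in joint.items() if b == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def top_features(docs, labels, k=1000):
    """Return the k features with the highest mutual information.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary, key=lambda f: (-mutual_information(docs, labels, f), f)
    )
    return ranked[:k]


# Toy pages as bags of words; the label says whether the tag applies.
toy_docs = [{"bagel", "shop"}, {"bagel", "fresh"}, {"pizza", "shop"}, {"pizza", "fresh"}]
toy_labels = [True, True, False, False]
best = top_features(toy_docs, toy_labels, k=2)
```

In the toy data, "bagel" and "pizza" determine the label exactly (1 bit each), while "shop" and "fresh" carry no information about it.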

Experiment Setup
Binary tag classification by SVM for T_100
– SVMlight and SVMperf with a linear kernel
Data splits
– Full/Full: 11/16 of the positive/negative examples for training and the rest for testing
  – Evaluated by precision-recall break-even point (PRBEP) instead of accuracy
– 200/200: randomly select 200 positive and 200 negative examples for training and the same for testing
  – Evaluated by accuracy
  – Provided as an imperfect indication of how predictable a tag is due to its “information content” rather than the distribution of examples in the system
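
As a reminder of what PRBEP measures: if a classifier ranks the test examples by score and the top R are predicted positive, where R is the number of truly positive examples, then precision and recall share the same numerator and denominator and are therefore equal. The sketch below computes that value; the function name is hypothetical and this is not a description of how SVMperf computes it internally.

```python
def prbep(scores, labels):
    """Precision-recall break-even point for one tag.

    scores: classifier outputs, higher means more likely positive.
    labels: parallel list of 0/1 (or bool) ground-truth values.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PRBEP is undefined without positive examples")
    # Rank examples by descending score and cut at the number of positives.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _score, label in ranked[:n_pos])
    # At this cutoff: precision = true_pos / n_pos = recall.
    return true_pos / n_pos


example = prbep([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

With two positives among four examples and one of them in the top two, both precision and recall are 0.5 at the cutoff.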

Order of Predictability
Predictability = PRBEP (Full/Full) + (Full/Full) + Accuracy (200/200)
Figure: Tags in T_100 in increasing order of predictability from left to right.

Discussions
What precision can we get at the PRBEP?
– 60% for page text, 58% for anchor text, and 51% for surrounding hosts
– Much better than chance, given that a majority of tags in T_100 occur on fewer than 15% of documents
What precision can we get with low recall?
– 90% for all features and 92.5% for page text in (Full/Full)
Which page information is best for predicting tags?
– Page text > anchor text > surrounding hosts

What makes a tag predictable? (1/2)
Entropy measure:

What makes a tag predictable? (2/2)
Figure: Tag popularity is positively correlated with PRBEP in the Full/Full split

Tag Prediction Using Tags
Between about 30 and 50 percent of URLs posted to del.icio.us have only 1 or 2 bookmarks
– Recall for single tag queries will be low
The question: given a small number of tags, how much can we expand this set of tags in a high-precision manner?
– Similar to market-basket data mining
  – A large set of items and a large set of baskets, each of which contains a small set of items
  – The goal is to find correlations between sets of items
– The baskets are URLs and the items are tags

Association Rules
For a rule X → Y:
Support: the number of baskets containing both X and Y
Confidence: P(Y|X) (How likely is Y given X?)
Interest: P(Y|X) - P(Y) (How much more common is X and Y together than expected by chance?)
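
The three measures can be computed directly from their definitions. The sketch below is illustrative: baskets are modelled as Python sets of tags (one per URL), and the function name and toy data are hypothetical.

```python
def rule_measures(baskets, X, Y):
    """Return (support, confidence, interest) for the rule X -> Y.

    baskets: list of sets of items (here: the tags attached to each URL).
    X, Y: iterables of items forming the rule's left and right sides.
    """
    if not baskets:
        raise ValueError("need at least one basket")
    X, Y = set(X), set(Y)
    n = len(baskets)
    n_x = sum(1 for b in baskets if X <= b)            # baskets containing X
    n_y = sum(1 for b in baskets if Y <= b)            # baskets containing Y
    support = sum(1 for b in baskets if (X | Y) <= b)  # baskets with X and Y
    confidence = support / n_x if n_x else 0.0         # P(Y | X)
    interest = confidence - n_y / n                    # P(Y | X) - P(Y)
    return support, confidence, interest


# Toy baskets: each set is the tag set of one URL.
toy_baskets = [
    {"linux", "opensource"},
    {"linux", "opensource", "software"},
    {"linux"},
    {"design"},
]
support, confidence, interest = rule_measures(toy_baskets, {"linux"}, {"opensource"})
```

For the toy rule linux → opensource, the support is 2, the confidence is 2/3, and the interest is 2/3 - 1/2 = 1/6.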

Found Association Rules
Observed relations: type-of, various forms, translations, etc.

Found Association Rules
Random sampling of the top 8000 rules of length 3 or less

Simulation of Tag Expansion
About 50,000 URLs for training and 10,000 URLs for testing

Conclusions
Our tag prediction results suggest three insights:
– Many tags on the web do not contribute substantial additional information beyond page text, anchor text, and surrounding hosts.
– The predictability of a tag is negatively correlated with its entropy, when our classifiers are given balanced training data. When considering tags in their natural distributions, data sparsity issues tend to dominate.
– Association rules can increase recall on single tag queries. We found association rules linking languages, super/subconcepts, and other relationships.