Social Tag Prediction Paul Heymann, Daniel Ramage, and Hector Garcia-Molina Department of Computer Science Stanford University SIGIR 2008 Presentation.

Presentation transcript:

Social Tag Prediction Paul Heymann, Daniel Ramage, and Hector Garcia-Molina Department of Computer Science Stanford University SIGIR 2008 Presentation by Sangkeun Lee Center for E-Business Technology Seoul National University Seoul, Korea

Copyright © 2009 by CEBT
Introduction
• Social tags? Keyword annotations
• Tags are poorly understood
• Given a set of objects and a set of tags applied to those objects by users, can we predict whether a given tag could or should be applied to a particular object?
• We call this "Social Tag Prediction"
Semantic Tech & Context

What Can We Use Social Tag Prediction For?
• Increase recall of single-tag queries/feeds
• Inter-user agreement
  Sharing objects despite vocabulary differences
• Tag disambiguation
  What's "Apple"? The company or the fruit?
• Bootstrapping
  The way users use tags is determined by previous usage
  Pre-seed
• System suggestion

Preliminaries
• A social tagging system consists of users u ∈ U, tags t ∈ T, and objects o ∈ O.
• A post
  A set of tags applied to an object by a user
  Is made up of one or more (t_i, u_j, o_k) triples
• R_p: the tags that describe an object
  A set of (t, o) pairs where each pair means that tag t positively describes object o.
• R_n: the tags that do not describe an object
  A set of (t, o) pairs where each pair means that tag t negatively describes (does not describe) object o.
• R_a: the tags that users have actually applied to an object
  A set of (t, u, o) triples where each triple means that user u annotated object o with tag t.

Preliminaries (cont'd)
• Examples
  R_p = (t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)
  R_n = (t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza) ...
  R_a = (t_pizzeria, u_sally, o_pizza)
• Projection & selection
  If we want to know all users who have tagged o_pizza, we write π_u(σ_{o_pizza}(R_a)), and the result is (u_sally).
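The projection and selection notation can be mimicked with plain Python sets. This is only a minimal sketch: the function names and the two extra triples are hypothetical and added for illustration, and the tuple order (tag, user, object) simply follows the (t, u, o) order used on the slide.

```python
# R_a as a set of (tag, user, object) triples.
# Only the first triple comes from the slide; the others are illustrative assumptions.
R_a = {
    ("pizzeria", "sally", "pizza"),
    ("pizza", "bob", "pizza"),
    ("bagels", "bob", "bagels"),
}

def select_object(relation, obj):
    """sigma_o: keep only the triples whose object equals obj."""
    return {(t, u, o) for (t, u, o) in relation if o == obj}

def project_users(relation):
    """pi_u: keep only the user column, dropping duplicates."""
    return {u for (_, u, _) in relation}

def project_tag_object(relation):
    """pi_(t,o): keep only the (tag, object) columns, dropping duplicates."""
    return {(t, o) for (t, _, o) in relation}

# All users who have tagged the pizza object: pi_u(sigma_{o_pizza}(R_a)).
users_for_pizza = project_users(select_object(R_a, "pizza"))
print(sorted(users_for_pizza))
```

With only the slide's single triple in R_a, the same call would return just the user sally.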

Dataset
• The Stanford Tag Crawl Dataset
  URLs gathered from the del.icio.us recent feed
  – Pages linked from that URL
  – Inlinks to the URL
  – 3,630,250 posts
  – 2,549,282 unique URLs
  – 301,499 active unique usernames
  About 2 TB of crawled data
• T100
  Top 100 tags in the dataset by frequency

Tradeoff
• We only know R_a
  We must construct a dataset approximating R_p and R_n for experiments
• Heymann et al. [8] suggest that if (t_i, o_k) ∈ π_(t,o)(R_a) then (t_i, o_k) ∈ R_p.
  However, if (t_i, o_k) ∉ π_(t,o)(R_a) while (t_i, o_k) ∈ R_p occurs sufficiently often, measures of precision, recall, and accuracy can be heavily skewed.
• Author's method
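A tiny worked example may help show why the skew arises. It is a sketch under invented data: the three sets below are hypothetical, and the only point is that a correct prediction is counted as an error whenever the pair is in R_p but no user happened to apply the tag.

```python
# Hypothetical ground truth R_p: (tag, object) pairs that really hold.
true_positives = {("pizza", "o1"), ("food", "o1"), ("pizza", "o2")}

# Hypothetical observed pairs, i.e. pi_(t,o)(R_a): nobody applied "food" to o1.
observed_positives = {("pizza", "o1"), ("pizza", "o2")}

# Hypothetical predictions from some tag predictor.
predicted = {("pizza", "o1"), ("food", "o1")}

def precision(predictions, positives):
    """Fraction of predictions that appear in the given positive set."""
    return len(predictions & positives) / len(predictions)

measured_precision = precision(predicted, observed_positives)  # judged against R_a only
actual_precision = precision(predicted, true_positives)        # judged against R_p

print(measured_precision, actual_precision)
```

Here both predictions are correct with respect to R_p, yet evaluation against the observed annotations reports only half of them as correct.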

Tag Prediction
• Using page information
  Page text
  Anchor text
  Surrounding hosts
• Using tags
  Tag prediction based on other tags

Using Page Information
• Prediction as a binary classification task for each tag t_i ∈ T100
• Evaluate prediction accuracy using page information on the Top 100 tags
• 2,145,593 of 9,414,275 triples (22.7%)
• Train/test split
  Full/Full
  – Randomly select 11/16 of the positive examples and 11/16 of the negative examples as the training set
  – The remaining 5/16 of each become the test set
  200/200
  – Randomly select 200 positive and 200 negative examples for the training set
  – Same for the test set
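The Full/Full split can be sketched as follows. The function name, the fixed seed, and the integer stand-ins for examples are assumptions made for illustration; the slides only state the 11/16 and 5/16 proportions.

```python
import random

def full_full_split(positives, negatives, seed=0):
    """Split positives and negatives separately: 11/16 for training, 5/16 for testing."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable

    def split(examples):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        cut = len(shuffled) * 11 // 16
        return shuffled[:cut], shuffled[cut:]

    pos_train, pos_test = split(positives)
    neg_train, neg_test = split(negatives)
    return pos_train + neg_train, pos_test + neg_test

# Hypothetical example ids: 160 positives and 320 negatives for one tag.
train, test = full_full_split(range(160), range(1000, 1320))
print(len(train), len(test))
```

Splitting the two classes separately keeps the positive-to-negative ratio the same in the training and test sets.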

#1: Using Page Information
• Page text
  All text present at the URL
• Anchor text
  All text within fifteen words of inlinks to the URL
• Surrounding hosts
  The sites linked to and from the URL, as well as the site of the URL itself
• Penn Treebank tokenizer
  Turns each text into a bag of words
  – Each token and the number of times it occurs
• Support vector machine for classification
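The bag-of-words step can be illustrated with the standard library alone. Note the assumptions: the regular-expression tokenizer below is a simple stand-in, not the Penn Treebank tokenizer named on the slide, and the SVM training step is left out.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to token -> occurrence count.

    The regex tokenizer is a simplification used for illustration; the
    slides name the Penn Treebank tokenizer, which splits text differently.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Hypothetical page text.
bag = bag_of_words("Fresh bagels downtown: the best bagels shop")
print(bag.most_common(3))
```

Each such count vector would then serve as the feature vector for one URL in the per-tag binary classifier.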

Evaluation
• PRBEP (precision-recall break-even point)
  A good single-number measure of how we can trade off precision for recall
• For Full/Full
  PRBEP
  – for page text was about 60%
  – for surrounding hosts was about 51%
  This is pretty good
  – We can get about 2/3 of the URLs labeled with a particular tag, with about 1/3 erroneous URLs in our resulting set
  Precision with recall set to 10%
  – About 90%
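One common way to compute a break-even point is to rank the examples by classifier score and cut the ranking at the number of true positives, since precision and recall are equal at that cutoff. The sketch below assumes this definition; the slides do not say how the authors computed PRBEP, and the scores and labels are invented.

```python
def prbep(scores, labels):
    """Precision-recall break-even point.

    Cuts the score ranking at k = number of positive examples, where
    precision (hits / k) and recall (hits / positives) coincide.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PRBEP is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_pos])
    return hits / n_pos

# Hypothetical classifier scores and true labels (1 = tag applies).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 0]
value = prbep(scores, labels)
print(round(value, 3))
```

In this toy ranking two of the top three URLs are true positives, so both precision and recall at the cutoff are 2/3.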

#2: Using Tags
• Using association rules
  software, tool, osx -> add "mac"
• Market-basket data mining
  Support
  – The number of baskets containing both X and Y
  Confidence
  – How likely is Y given X?
  – P(Y|X)
  Interest
  – How much more common is X & Y than expected by chance?
  – P(Y|X) - P(Y)
• Here
  Baskets: URLs
  Items: tags
  Support > 500 for rules of length 2
  Support > 1000 for rules of length 3
  Support > 2000 for rules of any length
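The three rule measures follow directly from basket counts. The sketch below assumes baskets are represented as Python sets of tags, one per URL; the function name and the five toy baskets are hypothetical.

```python
def rule_stats(baskets, lhs, rhs):
    """Return (support, confidence, interest) for the rule lhs -> rhs.

    support    = number of baskets containing all of lhs and rhs
    confidence = P(rhs | lhs)
    interest   = P(rhs | lhs) - P(rhs)
    """
    if not baskets:
        raise ValueError("need at least one basket")
    lhs = frozenset(lhs)
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_both = sum(1 for b in baskets if lhs <= b and rhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    confidence = n_both / n_lhs if n_lhs else 0.0
    interest = confidence - n_rhs / len(baskets)
    return n_both, confidence, interest

# Hypothetical baskets: each set holds the tags applied to one URL.
baskets = [
    {"software", "tool", "osx", "mac"},
    {"software", "osx", "mac"},
    {"software", "tool"},
    {"mac", "apple"},
    {"news"},
]
support, confidence, interest = rule_stats(baskets, {"software", "osx"}, "mac")
print(support, confidence, round(interest, 2))
```

A positive interest means the right-hand tag is more common among URLs matching the left-hand side than among URLs overall.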

Found Association Rules: Top 30

Found Association Rules: Random Sample

Tag Application Simulation
• Rules
  Generate association rules from 50,000 URLs as the training set
• Tags
  Sample n bookmarks from 10,000 URLs as the test set
• Stop applying association rules once a particular minimum confidence c is reached
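The rule-application step might look like the sketch below. Several details are assumptions not stated on the slides: rules are tried in decreasing order of confidence, a rule fires only when its whole left-hand side is among the sampled tags (predicted tags do not trigger further rules), and the rule list itself is invented.

```python
def apply_rules(known_tags, rules, min_confidence):
    """Predict extra tags for one URL from its known (sampled) tags.

    rules is a list of (lhs_tags, rhs_tag, confidence). Rules are tried in
    decreasing confidence and application stops below min_confidence.
    Returns {predicted_tag: confidence of the rule that added it}.
    """
    known = set(known_tags)
    predicted = {}
    for lhs, rhs, conf in sorted(rules, key=lambda rule: rule[2], reverse=True):
        if conf < min_confidence:
            break  # all remaining rules are even less confident
        if set(lhs) <= known and rhs not in known and rhs not in predicted:
            predicted[rhs] = conf
    return predicted

# Hypothetical rules, as if mined from a training set of URLs.
rules = [
    ({"software", "osx"}, "mac", 0.92),
    ({"software"}, "tools", 0.60),
    ({"osx"}, "apple", 0.80),
]
predicted = apply_rules({"software", "osx"}, rules, min_confidence=0.75)
print(predicted)
```

Raising the minimum confidence c yields fewer but more reliable predicted tags.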

Experimental Results
• For n = 1, 2, 3, 5 and c = 0.5, 0.75, 0.9
• Estimated precision: average of the applied confidence values
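The estimated precision is simply the mean confidence of the rules that were actually applied. A minimal sketch, with an invented list of confidence values and a hypothetical function name:

```python
from statistics import mean

def estimated_precision(applied_confidences):
    """Average confidence of the applied rules; None if no rule was applied."""
    if not applied_confidences:
        return None
    return mean(applied_confidences)

# Hypothetical confidences of the rules applied across the test URLs.
estimate = estimated_precision([0.92, 0.80, 0.95])
print(round(estimate, 2))
```

This is an estimate, not a measurement: it matches the true precision only if each rule's confidence on the training set carries over to the test set.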

How Useful Are Predicted Tags?
• Increasing recall for single-tag queries
• Use each tag in the Top 100 tags as a query
• Recall increases!
• Precision decreases but remains high!
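The effect on a single-tag query can be shown with a toy example. All three URL sets below are hypothetical; the sketch only illustrates how adding predicted tags can raise recall while lowering precision.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical URLs for one query tag.
relevant = {"u1", "u2", "u3", "u4"}   # URLs the tag truly describes
tagged = {"u1", "u2"}                 # URLs users actually tagged
predicted = {"u3", "u5"}              # URLs that received the tag by prediction

before = precision_recall(tagged, relevant)
after = precision_recall(tagged | predicted, relevant)
print(before, after)
```

In this toy case recall rises from 0.5 to 0.75 while precision falls from 1.0 to 0.75, which mirrors the direction of the result on the slide.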

Discussion
• This paper presents large-scale experiments on real data.
  Shows us what we can use social tag prediction for
  Social tag prediction methods
  – Using page information
  – Using tags
  – Both show quite reasonable results
• Well organized and written
• Scientific experiment design
• Not a new idea
  Application of SVM and market-basket data mining
• What could we crawl to run experiments like the authors'?
  Naver Knowin? Blogs & tags?
  Any interesting ideas?