Learning with Social Media. Thesis Defense, May 19th, 2011 @ CCP. Good morning professors, I am Tom; thanks for coming to my thesis defense. My thesis title is Learning with Social Media. Tom Chao Zhou. Thesis Committee: Prof. Yu Xu Jeffrey (Chair), Prof. Zhang Sheng Yu (Committee Member), Prof. Yang Qiang (External Examiner). Supervisors: Prof. Irwin King, Prof. Michael R. Lyu.
Agenda: Introduction; Background; Item Recommendation with Tagging Ensemble; User Recommendation via Interest Modeling; Item Suggestion with Semantic Analysis; Item Modeling via Data-Driven Approach; Conclusion and Future Work. This is today's agenda (US: [əˈdʒendə]). I will start with the Introduction and then introduce background on related techniques and applications. After that, I will present the four works in my thesis. Finally, I will present the conclusion and future work.
Agenda: Introduction. Let's start with the introduction.
Social Media. What is Social Media? Create, share, exchange; virtual communities. Some data: 45 million reviews in the travel forum TripAdvisor [Source]; 218 million questions solved in Baidu Knows [Source]; Twitter processed one billion tweets in Dec 2009, averaging almost 40 million tweets per day [Source]; time spent on social media in the US was 88 billion minutes in July 2011 and 121 billion minutes in July 2012 [Source]. What is social media? Social media systems are Web 2.0 platforms in which people can create, share, and exchange information; social media users usually form virtual communities and networks. Why should we care about social media? Let's look at some data. From these figures, we know that social media systems have a significant impact on people's daily lives.
Examples of Social Media: Rating System. America's largest online retailer; the largest C2C website in China, with over 2 billion products; the biggest movie site on the planet, with over 1,424,139 movies and TV episodes. There are many types of social media systems, and I will briefly go through some typical examples. The first type is the rating system: users rate items, and in return an active user receives suggestions for items he or she may be interested in, based on the rating information of other users or items.
Examples of Social Media: Social Tagging System. The largest social bookmarking website; the best online photo management and sharing application in the world. The second type is the social tagging system. Social tagging systems such as Delicious and Flickr have emerged as an effective way for users to annotate and share objects on the Web. Tags posted by users on bookmarks, papers, and photos express users' understandings and interests.
Examples of Social Media: Online Forum. The third type is the online forum. An online forum is a Web application that hosts highly interactive and semantically related discussions on domain-specific topics such as travel, sports, and programming.
Examples of Social Media: Community-based Question Answering. 10 questions and answers are posted per second; 218 million questions have been solved; a popular website with many experts and high-quality answers. The fourth type is the community-based question answering service. In recent years these services have emerged as an effective way of knowledge sharing and information seeking: by allowing users to ask complex natural language questions and to answer other users' questions, users' information needs are met by explicit, self-contained answers instead of lists of Web pages.
Challenges in Social Media. Astronomical growth of data in social media: huge, diverse, and dynamic; drowning in information, information overload. With the astronomical (US: [ˌæstrəˈnomɪk(ə)l]) growth of data in social media, social media systems have become huge, diverse, and dynamic. As a result, users are drowning in information and facing information overload.
Objective of Thesis. Here comes the objective of this thesis: to establish automatic and scalable models that help social media users meet their information needs more effectively.
Objective of Thesis. On one side, we model users' interests with respect to their behavior and recommend items or users they may be interested in (Chapters 3 and 4). On the other side, we understand items' characteristics and group items that are semantically related to better address users' information needs (Chapters 5 and 6).
Structure of Thesis. Now I will introduce the structure of the thesis. We propose models for learning with social media. In social media there are two key roles: user and item. The user, acting as the participant of social media, generates and shares content and interacts with other users. The item, acting as the consumption goods, satisfies users' information needs. To help a user satisfy his or her information needs effectively, the first work focuses on predicting users' preferences on items; details are introduced in Chapter 3. The second work focuses on finding users with interests similar to an active user's; Chapter 4 presents the models. Besides user-oriented approaches, we also conduct item-oriented research: we propose models to find items semantically related to a given item, with details in Chapter 5. Finally, we propose a data-driven approach to study items' characteristics; Chapter 6 presents our approach.
Agenda: Background. Now let's look at the background knowledge.
Recommender Systems: Collaborative Filtering, memory-based algorithms. Memory-based algorithms: user-based and item-based. Similarity methods: Pearson correlation coefficient (PCC) and vector space similarity (VSS). Disadvantage of memory-based approaches: recommendation performance deteriorates when the rating data is sparse. The thesis is related to recommender systems. Typically, recommender systems are based on collaborative filtering, which aggregates the opinions of a great many users to make predictions and recommendations for an active user. Memory-based and model-based approaches are the two typical families in collaborative filtering. Among memory-based algorithms (['ælgə'rɪðəm]), there are user-based and item-based algorithms. User-based algorithms first look for users whose rating styles are similar to the active user's, and then employ the ratings from those similar users to predict ratings for the active user. Item-based algorithms share the same idea, except that they find similar items for each item. PCC and VSS are two similarity measures employed in memory-based approaches. The disadvantage of memory-based approaches is that recommendation performance deteriorates ([dɪ'tɪrɪəreɪt]) when the rating data is sparse.
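To make the memory-based recipe concrete, here is a minimal user-based CF sketch in Python (my illustration, not the thesis code; the toy ratings and helper names are hypothetical):

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation between two users, over co-rated items only."""
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    x = np.array([a[i] for i in common], dtype=float)
    y = np.array([b[i] for i in common], dtype=float)
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def predict(ratings, user, item, k=2):
    """User-based CF: weight neighbors' deviations from their means by PCC."""
    mean_u = np.mean(list(ratings[user].values()))
    sims = [(pcc(ratings[user], ratings[v]), v)
            for v in ratings if v != user and item in ratings[v]]
    sims = sorted(sims, reverse=True)[:k]
    num = sum(s * (ratings[v][item] - np.mean(list(ratings[v].values())))
              for s, v in sims)
    den = sum(abs(s) for s, v in sims)
    return mean_u + num / den if den > 0 else mean_u

ratings = {
    "Alex": {"Godfather": 4, "ForrestGump": 5, "Inception": 2},
    "Bob":  {"Godfather": 4, "ForrestGump": 5, "Inception": 2, "Titanic": 5},
    "Tom":  {"Godfather": 2, "ForrestGump": 1, "Titanic": 2},
}
print(round(predict(ratings, "Alex", "Titanic"), 2))  # 4.0
```

Note how the prediction degrades when few co-rated items exist: with a sparse matrix, `pcc` returns 0 for most pairs, which is exactly the sparsity problem the slide mentions.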
Recommender Systems: Collaborative Filtering, model-based algorithms. There are also many examples of model-based algorithms, such as clustering methods and matrix factorization methods. The disadvantage of traditional model-based approaches is that many of them use only the user-item rating matrix and ignore other user behavior; they also suffer from the data sparsity problem.
Machine Learning. Based on whether training data is available: Yes? Supervised learning (naive Bayes, support vector machines). Some? Semi-supervised learning (co-training, graph-based approaches). No? Unsupervised learning (clustering, Latent Dirichlet Allocation). Machine learning techniques are also used in this thesis; based on the availability of training data, machine learning models can be categorized into these three types.
Information Retrieval. Information retrieval models seek an optimal ranking function. Vector Space Model: weighting (TF-IDF). Probabilistic Model and Language Model: binary independence model, query likelihood model. Translation Model: originated from machine translation; addresses the lexical gap problem. Information retrieval models are also investigated in this thesis. The target of an information retrieval model is to seek an optimal ranking function. The vector space model is a famous IR model in which documents and queries are represented as vectors in a common space. Thanks to their complete probabilistic framework, the probabilistic model and the language model are commonly used IR models. Recently, the translation model has become popular because it can solve the lexical gap problem to some extent.
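As a small illustration of the vector space model with TF-IDF weighting (my sketch; the toy documents and the log(N/df) idf variant are assumptions):

```python
import math
from collections import Counter

docs = ["orange beach alabama travel",
        "beach umbrella rentals",
        "flight cell phone interference"]
query = "orange beach"

def tfidf_vectors(texts):
    """TF-IDF vectors: tf = raw term count, idf = log(N / df)."""
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: c * idf[w] for w, c in Counter(toks).items()}
            for toks in tokenized], idf

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs, idf = tfidf_vectors(docs)
qv = {w: c * idf.get(w, 0.0) for w, c in Counter(query.split()).items()}
ranked = sorted(range(len(docs)), key=lambda i: cosine(qv, vecs[i]), reverse=True)
print(ranked)  # document indices ranked by similarity to the query
```

The third document shares no query terms, so its score is exactly zero: this is the lexical gap that translation models are meant to bridge.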
Techniques Employed. Chapter 3: recommender systems. Chapter 4: recommender systems and information retrieval. Chapter 5: information retrieval and machine learning. Chapter 6: machine learning. In summary, in Chapter 3, techniques from recommender systems are investigated and developed. In Chapter 4, I apply recommender-system models with additional knowledge of information retrieval models. In Chapter 5, I utilize both information retrieval models and machine learning techniques. Finally, in Chapter 6, I develop a data-driven approach with machine learning.
Agenda: Item Recommendation with Tagging Ensemble. Let's talk about the first work.
Structure of Thesis (recap). The first work predicts users' preferences on items; the details are introduced in Chapter 3.
A Toy Example. Three users (Alex, Bob, Tom) and three movies (The Godfather, Inception, Forrest Gump). Rating scale is 1 (strong dislike), 2 (dislike), 3 (it's OK), 4 (like), 5 (strong like). Let's look at a toy example to see what problem we are solving. Suppose the three users have rated some movies: Alex rated The Godfather 4 points and Forrest Gump 5 points, Bob rated one movie 2 points, and Tom has no ratings yet. Our target is to fill in the missing values (the ?s) in the rating matrix.
Challenge. The rating matrix is very sparse: the density of ratings in commercial recommender systems is reported to be less than 1%. In addition, the performance of collaborative filtering approaches deteriorates when the rating matrix becomes sparse.
Problem. Task: predicting the missing values. Fact: ratings reflect users' preferences. Challenge: the rating matrix is very sparse, so using rating information alone is not enough. Thought: does there exist contextual information that can also reflect users' judgments? If so, how can we utilize that kind of contextual information to improve prediction quality? The user-item rating matrix:

      i1  i2  i3  i4  i5
  u1   3   5   2   ?   ?
  u2   ?   4   ?   4   ?
  u3   3   4   1   ?   ?
  u4   ?   ?   ?   3   5
  u5   ?   5   ?   4   ?

We formally define the problem: given the user-item rating matrix, our task is to predict the missing values.
Motivation. Social tagging is the practice of collaboratively creating and managing tags to annotate and categorize content. From previous research work, we find that tags can represent users' judgments and interests about Web content quite accurately.
Motivation. Rating reflects a user's preference on items; tagging reflects a user's interest. To improve recommendation quality and tackle the data sparsity problem, we propose to fuse tagging and rating information together.
Intuition of Matrix Factorization. M (m x n) is approximated by U^T V, where U is l x m and V is l x n with l much smaller than min(m, n). We use a matrix factorization framework. The intuition is to approximate the original matrix by the product of two low-rank matrices. Each row of U and V corresponds to a latent semantic dimension, e.g., action or comedy if M is a user-movie rating matrix. We use low-rank rather than full-rank factorization because M is only partially observed; a full-rank factorization would overfit.
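A minimal sketch of this idea, fitting the two low-rank matrices by gradient descent on the observed entries only (my illustration; the learning rate, regularization, and epoch count are arbitrary choices):

```python
import numpy as np

def factorize(M, mask, l=2, lr=0.01, reg=0.1, epochs=2000, seed=0):
    """Approximate M (m x n) by U.T @ V using only observed entries (mask==1)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((l, m))
    V = 0.1 * rng.standard_normal((l, n))
    for _ in range(epochs):
        E = mask * (M - U.T @ V)        # error on observed entries only
        U += lr * (V @ E.T - reg * U)   # gradient step on the squared loss
        V += lr * (U @ E - reg * V)
    return U, V

M = np.array([[3., 5., 2., 0., 0.],
              [0., 4., 0., 4., 0.],
              [3., 4., 1., 0., 0.]])
mask = (M > 0).astype(float)            # 0 marks an unobserved rating here
U, V = factorize(M, mask)
print(np.round(U.T @ V, 1))             # predictions fill the unobserved cells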
User-Item Rating Matrix Factorization. Given the sparse user-item rating matrix R (rows: users; columns: items), U is the user latent feature matrix, V is the item latent feature matrix, and U_i^T V_j is the predicted rating of user i on item j. We first write the conditional distribution over the observed ratings. Then zero-mean spherical (['sfɪərɪkəl]) Gaussian priors are placed on the user latent feature matrix and the item latent feature matrix. Finally, we derive the posterior distribution of U and V based only on the observed ratings.
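Maximizing that log-posterior is equivalent to minimizing a regularized squared-error objective; in the usual PMF notation (a reconstruction from the description above, not copied from the slide):

```latex
\min_{U,V}\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\bigl(R_{ij}-U_i^{\top}V_j\bigr)^2
+\frac{\lambda_U}{2}\lVert U\rVert_F^2+\frac{\lambda_V}{2}\lVert V\rVert_F^2
```

Here I_ij indicates whether user i rated item j, and the lambda terms are ratios of the rating noise variance to the prior variances.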
User-Tag Tagging Matrix Factorization. Given the sparse user-tag tagging matrix C, U is the user latent feature matrix, T is the tag latent feature matrix, and U_i^T T_k is the predicted value of the model. The idea is that users' usage of tags reflects their preferences on those tags and indicates their interest in the concepts the tags represent. For example, if a user Jack uses the tag action 20 times, the tag animation 20 times, and the tag romantic once, it means he likes action and animation movies and does not like romantic movies. We first write the conditional distribution over the observed tagging data, and then derive the posterior distributions of U and T. (If asked: the reason we assume a Gaussian prior on U and T is that we think the frequencies of tags follow a Gaussian distribution, similar to a user's rating style; experimental results also confirm the effectiveness of this assumption.)
Item-Tag Tagging Matrix Factorization. The sparse item-tag tagging matrix D records how often each item is annotated with each tag. The idea is that tags are users' judgments on items, and the frequency with which an item is annotated with a tag indicates how strongly that tag represents the item. For example, the movie Titanic is annotated 20 times with romance, 20 times with bittersweet, and once with action, meaning romance and bittersweet capture this movie better. We derive the posterior distributions of V and T.
TagRec Framework. U: user latent feature matrix; V: item latent feature matrix; T: tag latent feature matrix; R: user-item rating matrix; C: user-tag tagging matrix; D: item-tag tagging matrix. In summary, this is the unified matrix factorization framework, referred to as TagRec. In the graphical model figure, shaded circles are observed data and un-shaded circles are latent variables.
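Putting the three factorizations together, the TagRec log-posterior plausibly reduces to a joint objective of the following shape (my reconstruction; theta_C and theta_D are the trade-off weights discussed in the parameter-sensitivity FAQ, and the I matrices indicate observed entries):

```latex
\min_{U,V,T}\;
\frac{1}{2}\sum_{i,j} I^{R}_{ij}\bigl(R_{ij}-U_i^{\top}V_j\bigr)^2
+\frac{\theta_C}{2}\sum_{i,k} I^{C}_{ik}\bigl(C_{ik}-U_i^{\top}T_k\bigr)^2
+\frac{\theta_D}{2}\sum_{j,k} I^{D}_{jk}\bigl(D_{jk}-V_j^{\top}T_k\bigr)^2
+\frac{\lambda}{2}\Bigl(\lVert U\rVert_F^2+\lVert V\rVert_F^2+\lVert T\rVert_F^2\Bigr)
```

Because U is shared between R and C, and V between R and D, the tagging matrices constrain the latent features even for users and items with very few ratings, which is how the framework fights sparsity.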
Experimental Analysis. MovieLens 10M/100K data set, provided by the GroupLens research group from the online movie recommender service MovieLens (http://movielens.umn.edu). Statistics: 10,000,054 ratings; 95,580 tags; 10,681 movies; 71,567 users. Let's look at the experimental analysis: there are about 10 million ratings, 95 thousand tags, 10 thousand movies, and 71 thousand users.
Experimental Analysis. MAE comparison with other approaches (a smaller MAE means better performance). Baselines: UMEAN, the mean of the user's ratings; IMEAN, the mean of the item's ratings; SVD, a well-known method from the Netflix competition; PMF, Salakhutdinov and Mnih, NIPS'08. "Training data" means the percentage of data used in training; "dimensionality" means the number of latent dimensions used in the approach. We find that SVD, PMF, and TagRec perform better than the baseline methods, and our method TagRec performs the best among all approaches.
Experimental Analysis. RMSE comparison with other approaches (a smaller RMSE means better performance). The RMSE results show a trend similar to MAE.
Contribution of Chapter 3. First, we propose a factor analysis approach, referred to as TagRec, that utilizes both users' rating information and tagging information based on probabilistic matrix factorization. Second, we overcome the data sparsity and non-flexibility problems confronted by traditional collaborative filtering algorithms.
Agenda: User Recommendation via Interest Modeling. Now I will talk about user recommendation via interest modeling.
Structure of Thesis (recap). This is the second work in the thesis, presented in Chapter 4. The target is to recommend users with similar interests to an active user.
Problem and Motivation: Social Tagging System. In social tagging systems, users can add tags to resources, for example URLs.
Problem and Motivation. Previous research finds that tagging expresses users' judgments on resources and also reflects users' personal interests.
Problem and Motivation. The target of this work is to provide an automatic interest-based user recommendation service, so that a user can easily find other users with similar interests.
Challenge How to model users’ interests? May 19th, 2011@CCP Challenge How to model users’ interests? How to perform interest-based user recommendation? Bob What is the challenge of this work? The 1st challenge is how to model users’ interests? The 2nd challenge is how to perform interest-based user recommendation? Alex Tom Grey
UserRec: User Interest Modeling. In social tagging systems, each tagging activity forms a (user, tag, resource) triplet (['trɪplət]). For example, for the URL http://www.nba.com, user 1 posts the tags basketball and nba, while user 2 posts sports, basketball, and nba. We propose a user recommendation framework to tackle the problem, referred to as UserRec, and make two observations about tagging activities: (1) frequently used tags can be utilized to characterize and capture a user's interests; (2) if two tags are used by one user to annotate one URL at the same time, it is very likely that these two tags are semantically related.
UserRec: User Interest Modeling. In user interest modeling, we first generate a weighted tag-graph for each user, then employ a community discovery algorithm on each tag-graph.
UserRec: User Interest Modeling. This is a weighted tag-graph generated by our approach. A line connecting two tags means the two tags were used together by this user to annotate a resource.
UserRec: User Interest Modeling. Generating the weighted tag-graph for each user, using these example bookmarks: http://espn.go.com and http://msn.foxsports.com tagged with basketball, nba, sports; http://www.ticketmaster.com tagged with sports, music; http://freemusicarchive.org and http://www.wwoz.org tagged with music, jazz, blues. Basketball and nba are used together two times, so their frequency count in the tag-graph is 2; sports and music are used together once, so that count is 1.
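A minimal sketch of this construction (my illustration; networkx is one convenient choice, and the bookmark-to-tag assignment follows the example above):

```python
import itertools
import networkx as nx

bookmarks = {
    "http://espn.go.com": ["basketball", "nba", "sports"],
    "http://msn.foxsports.com": ["basketball", "nba", "sports"],
    "http://www.ticketmaster.com": ["sports", "music"],
    "http://freemusicarchive.org": ["music", "jazz", "blues"],
    "http://www.wwoz.org": ["music", "jazz", "blues"],
}

G = nx.Graph()
for url, tags in bookmarks.items():
    # every pair of tags co-used on one resource gets (or strengthens) an edge
    for a, b in itertools.combinations(sorted(set(tags)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G["basketball"]["nba"]["weight"])  # 2: co-used on two bookmarks
print(G["sports"]["music"]["weight"])    # 1
```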
UserRec: User Interest Modeling. After generating the tag-graph for each user, we employ a community discovery algorithm on each tag-graph. The algorithm optimizes modularity: if the fraction of within-community edges is no different from what we would expect for a randomized network, modularity is zero; nonzero values represent deviations from randomness. Given a tag-graph like the one above, we obtain two communities (see the sketch below).
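Continuing the sketch, a modularity-based community discovery step; networkx's greedy modularity maximization stands in here for the algorithm used in the thesis:

```python
from networkx.algorithms.community import greedy_modularity_communities

# each community of co-used tags is interpreted as one interest topic
communities = greedy_modularity_communities(G, weight="weight")
for topic in communities:
    print(sorted(topic))
# expected grouping on the toy graph: a sports-like topic
# {basketball, nba, sports} and a music-like topic {blues, jazz, music}
```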
Interest-based User Recommendation. We represent the topics of a user with a random variable: each discovered community is considered one topic, and a topic consists of several tags. The importance of a topic is measured by the sum of the number of times each tag in the topic is used. We employ maximum likelihood estimation to calculate the probability of each topic for a user, and a Kullback-Leibler divergence (KL-divergence) based method to calculate the similarity between two users based on their topics' probability distributions.
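A sketch of this last step (my illustration; the additive smoothing and the symmetrization of KL are my assumptions, since plain KL-divergence is asymmetric and undefined on unseen topics):

```python
import math

def topic_distribution(topic_counts, all_topics, eps=1e-6):
    """MLE with tiny additive smoothing so KL stays finite on unseen topics."""
    total = sum(topic_counts.get(t, 0) + eps for t in all_topics)
    return {t: (topic_counts.get(t, 0) + eps) / total for t in all_topics}

def kl(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def user_similarity(counts_a, counts_b):
    topics = set(counts_a) | set(counts_b)
    p = topic_distribution(counts_a, topics)
    q = topic_distribution(counts_b, topics)
    sym = 0.5 * (kl(p, q) + kl(q, p))   # symmetrized divergence
    return 1.0 / (1.0 + sym)            # larger means more similar

alex = {"sports": 40, "music": 5}
bob = {"sports": 35, "music": 10}
print(user_similarity(alex, bob))
```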
Experimental Analysis. Data set: Delicious. We conduct our experiments on a Delicious data set; the table shows its statistics.
Experimental Analysis. Baselines. Memory-based collaborative filtering methods: Pearson correlation coefficient (PCC), and a PCC-based similarity calculation with significance weighting (PCCW). Model-based collaborative filtering methods: probabilistic matrix factorization and singular value decomposition; after deriving the latent feature matrices, we still need to apply memory-based approaches on them, giving SVD-PCC, SVD-PCCW, PMF-PCC, and PMF-PCCW. We compare the proposed method with both families.
Experimental Analysis. Comparison with approaches that are based on URLs, and with approaches that are based on tags (a larger value means better performance for each metric). From the two red rectangles, we can see that with the same method, utilizing tags achieves better performance than using URLs, which confirms that tags capture users' interests well. From the blue rectangle, we can see that the proposed UserRec outperforms the other approaches.
Contribution of Chapter 4. We propose the User Recommendation (UserRec) framework for user interest modeling and interest-based user recommendation, providing users with an automatic and effective way to discover other users with common interests in social tagging systems.
Agenda: Item Suggestion with Semantic Analysis. Now let's look at the work on item suggestion with semantic analysis.
Structure of Thesis (recap). This work is introduced in Chapter 5.
Problem and Motivation. Social media systems with Q&A functionalities have accumulated large archives of questions and answers; examples include online forums and community-based Q&A services.
Problem and Motivation. Query Q1: How is Orange Beach in Alabama? Question search returns semantically equivalent questions, e.g., Q2: Any ideas about Orange Beach in Alabama? Question suggestion returns semantically related questions, e.g., Q3: Is the water pretty clear this time of year on Orange Beach? Q4: Do they have chair and umbrella rentals on Orange Beach? Usually these systems provide question search, which finds semantically equivalent questions. In this work, our motivation is to also suggest semantically related questions: Q3 and Q4 all talk about the topic "travel in Orange Beach," but from different angles.
Results of Our Model. Given the queried question "Why can people only use the air phones when flying on commercial airlines, i.e. no cell phones etc.?", our model returns: (1) "Why are you supposed to keep cell phone off during flight in commercial airlines?" (semantically equivalent); (2) "Why don't cell phones from the ground at or near airports cause interference in the communications of aircraft?" (semantically related); (3) "Cell phones and pagers really dangerous to avionics?" (semantically related). I present these results first to motivate the application. Because the first result is semantically equivalent to the queried question, its best answer should satisfy the user's information need accurately; the second and third results are also semantically related. All three results belong to the topic "interference of aircraft."
Problem and Motivation: Benefits. (1) Helping users explore their information needs thoroughly from different aspects; for the "travel" example: beach, water, chair, umbrella. (2) Increasing page views by enticing users' clicks on suggested questions. (3) Providing forums with a relevance feedback mechanism by mining users' click-through logs on suggested questions.
Challenge. Traditional bag-of-words approaches suffer from the shortcoming that they cannot bridge the lexical chasm between semantically related questions.
Document Representation. Two types of methods are typically used to represent the content of text documents. The bag-of-words representation assumes words are independent; it is a fine-grained representation whose advantage is finding lexically similar questions. A topic model assigns a set of latent topic distributions to each word, capturing important relationships between words; compared with bag-of-words it is a coarse-grained representation, and its advantage is finding semantically related questions.
TopicTRLM in Online Forum. TopicTRLM stands for Topic-enhanced Translation-based Language Model. To adopt both the bag-of-words and topic-model representations, TopicTRLM fuses latent topic information with lexical information to measure the semantic relatedness between two questions systematically. The system framework contains an offline module and an online module. In the offline module, we first employ a question detector to find all questions in each thread; on one hand, we build a parallel corpus from the detected questions and learn word translation probabilities from it, and on the other hand, we train topic models on the detected questions. In the online module, when a query comes, we fuse signals from both the topic model and lexical analysis to suggest semantically related questions.
TopicTRLM in Online Forum. Notation: q is a query, D is a candidate question, w is a word in the query, and gamma is a parameter balancing the weights of the bag-of-words and topic-model representations. The left part of the formula is the translation-based language model (TRLM) score, based on the bag-of-words representation; the right part is the LDA score, based on the topic-model representation. We employ Jelinek-Mercer smoothing to fuse the TRLM score with the LDA score. A larger gamma means we prefer more lexically related questions for the query, while a smaller gamma emphasizes the similarity of the two questions' latent topic distributions. Setting gamma = 0 makes TopicTRLM employ only latent topic analysis, and gamma = 1 only lexical analysis; thus TopicTRLM is a generalization of both lexical analysis and latent topic analysis for the question suggestion task.
TopicTRLM in Online Forum. TRLM details: C is the question corpus, and the Dirichlet smoothing parameter balances collection smoothing against the empirical data. TRLM fuses scores from both the language model and the translation model, with beta balancing the two parts; T(w|t) is the translation probability from a source word t to a target word w, and learning these word-to-word translation probabilities is an essential part of TRLM. On the topic side, LDA (K topics, z denoting a topic) is a well-known topic model, and we employ it to measure the semantic relatedness between a word and a candidate question.
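Assembling the pieces described above, the scoring function plausibly takes the following form (my reconstruction in standard notation, not copied from the thesis; P_ml denotes maximum-likelihood estimates):

```latex
P(q \mid D) = \prod_{w \in q}\Bigl[\gamma\,P_{\mathrm{trlm}}(w \mid D) + (1-\gamma)\,P_{\mathrm{lda}}(w \mid D)\Bigr]

P_{\mathrm{trlm}}(w \mid D) = \frac{|D|}{|D|+\lambda}\Bigl[\beta \sum_{t \in D} T(w \mid t)\,P_{ml}(t \mid D) + (1-\beta)\,P_{ml}(w \mid D)\Bigr] + \frac{\lambda}{|D|+\lambda}\,P_{ml}(w \mid C)

P_{\mathrm{lda}}(w \mid D) = \sum_{z=1}^{K} P(w \mid z)\,P(z \mid D)
```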
TopicTRLM in Online Forum. Estimating T(w|t): IBM model 1 is employed to learn the translation probabilities, which requires a monolingual parallel corpus. Looking into forum data, we find that questions are usually the focus of forum discussions, and questions posted by the thread starter (TS) during a discussion are very likely to explore different aspects of one topic. We therefore leverage questions in forum threads to build the parallel corpus: first, we extract the questions posted by the thread starter in each thread into a question pool Q; we then construct question-question pairs by enumerating all possible combinations in Q, and finally aggregate the q-q pairs from all threads (a short sketch follows).
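A minimal sketch of the corpus construction (my illustration; the thread representation and field layout are hypothetical):

```python
from itertools import combinations

def build_parallel_corpus(threads):
    """threads: list of threads, each a list of (author, question_text)
    pairs in posting order. Returns question-question pairs for IBM model 1."""
    pairs = []
    for posts in threads:
        starter = posts[0][0]                      # thread starter = first poster
        pool = [q for author, q in posts if author == starter]
        pairs.extend(combinations(pool, 2))        # enumerate combinations in Q
    return pairs

threads = [[("ts", "How is Orange Beach in Alabama?"),
            ("ts", "Do they have chair and umbrella rentals on Orange Beach?"),
            ("other", "It is lovely in May.")]]
print(build_parallel_corpus(threads))
```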
TopicTRLM-A in Community-based Q&A. In community-based Q&A services, the best answer for each resolved question is always readily available, and the best answer can also explain the semantic meaning of the question. We therefore propose TopicTRLM-A to incorporate answer information.
TopicTRLM-A in Community-based Q&A. Different from the TopicTRLM model, in TopicTRLM-A we propose a new method to construct the parallel corpus, and we combine question analysis with answer analysis in the lexical part.
Experiments in Online Forum. For evaluation, we crawled data from TripAdvisor, a popular online travel forum, and divided it into three parts: TST_LABEL, labeled data for 268 questions; TST_UNLABEL, 10,000 threads with at least 2 questions posted by thread starters; TRAIN_SET, 1,976,522 questions from 971,859 threads. TRAIN_SET serves three purposes: as the parallel corpus for learning T(w|t), as LDA training data, and as the question repository.
Experiments in Online Forum. Performance comparison (a larger metric value means better performance): LDA alone performs the worst, being too coarse-grained; TRLM > TR > QL; and TopicTRLM outperforms the other approaches.
Experiments in Community-based Q&A. Data set: Yahoo! Answers, "travel" category and "computers & internet" category.
Experiments in Community-based Q&A. Performance of different models on the category "computers & internet" (a larger metric value means better performance).
Contribution of Chapter 5. We propose question suggestion, which targets suggesting questions that are semantically related to a queried question; we propose TopicTRLM, which fuses both lexical and latent semantic knowledge in online forums; and we propose TopicTRLM-A to incorporate answer information in community-based Q&A.
Agenda: Item Modeling via Data-Driven Approach. Here is the last work in the thesis.
Structure of Thesis (recap). In Chapter 6 we try to understand an item's characteristics.
Challenge of Question Analysis. Our work focuses on question analysis. One challenge is that questions are ill-phrased, vague, and complex, so light-weight features are needed; the other challenge is the lack of labeled data.
Problem and Motivation. "Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available." -- Alon Halevy, Peter Norvig, and Fernando Pereira. Motivated by this argument, we look at the problem from a new perspective.
Problem and Motivation. In community-based Q&A services, people comment, rate, vote, and so on; these are all social signals. We believe these rich social signals carry community wisdom, and we wonder whether we can mine useful knowledge from it.
Problem and Motivation. Can we utilize social signals to collect training data for question analysis with NO manual labeling? As a test case, we work on Question Subjectivity Identification (QSI): identifying whether a question is subjective or objective. A subjective question expects one or more subjective answers, e.g., "What was your favorite novel that you read?"; an objective question expects an authoritative answer, common knowledge, or a universal truth, e.g., "What makes the color blue?"
Social Signal: Like. People like an answer if they find it useful. For subjective questions, answers are opinions and tastes differ, so the best answer receives a similar number of likes to the other answers. For objective questions, people like the answer that explains the universal truth in the most detail, so the best answer receives more likes than the other answers.
Social Signal: Vote. Users can vote for the best answer. For subjective questions, people may vote for different answers to support different opinions, so the percentage of votes on the best answer is low. For objective questions, it is easy to identify the answer containing the most fact, so the percentage of votes on the best answer is high.
Social Signal: Source. The source signal appears when a user references authoritative resources; it is only available for objective questions that have factual answers. Poll and Survey: in the poll and survey category, the user intent is mainly to seek opinions, so such questions are very likely to be subjective.
Social Signal: Answer Number. The number of posted answers varies per question. For subjective questions, people may post their opinions even if they notice there are other answers; for objective questions, people may not post answers to questions that have already received answers, since the expected answer is usually fixed. So a large answer number indicates subjectivity; HOWEVER, a small answer number may be due to many reasons, such as objectivity or small page views. (A sketch of these heuristics follows.)
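To make the four signals concrete, here is a hedged sketch of turning them into automatically labeled training data (all thresholds and field names are my assumptions, not the thesis's settings):

```python
def label_by_social_signals(q):
    """q: dict of per-question social signals; returns 'subjective',
    'objective', or None (question not usable as training data)."""
    if q.get("category") in ("Polls", "Surveys"):
        return "subjective"             # opinion-seeking intent
    if q.get("best_answer_has_source"):
        return "objective"              # references to authoritative resources
    likes = q.get("likes_per_answer", [])
    if likes:
        best, rest = likes[0], likes[1:]
        if rest and best > 2 * max(rest):
            return "objective"          # best answer dominates the likes
    if q.get("best_answer_vote_share", 0) > 0.8:
        return "objective"              # votes concentrate on one answer
    if q.get("num_answers", 0) >= 10:
        return "subjective"             # many users still add opinions
    return None                         # a small answer number is ambiguous

q = {"num_answers": 12, "likes_per_answer": [5, 4, 4, 3]}
print(label_by_social_signals(q))       # 'subjective'
```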
Features. Word; word n-gram; question length; request word; subjectivity clue; punctuation density; grammatical modifier; entity. We also propose several light-weight features; details can be found in the thesis.
Experiments. Data set: Yahoo! Answers, 4,375,429 questions with associated social signals. The ground truth is adapted from a previous work (Li, Liu, and Agichtein, 2008).
Experiments. From this table, we can see that CoCQA, a semi-supervised approach, performs better than the supervised approach because it utilizes some unlabeled data; however, it can only utilize a small amount (3,000 questions). The table also confirms the effectiveness of collecting training data using well-designed social signals. Moreover, these social signals can be found in almost all CQA sites.
Experiments. This table compares word and word n-gram features under the same approach: word n-grams achieve better performance than words. In addition, social signals achieve on average a 12.27% relative gain because they provide the largest amount of training data. Supervised learning relies on manual labeling and suffers from sparsity; CoCQA uses several thousand unlabeled examples and tackles sparsity to some extent; training data collected via social signals is large in amount and better solves the data sparsity problem.
Experiments. This table shows that adding any heuristic feature to word n-grams improves precision, and combining heuristic features with word n-grams achieves an 11.23% relative gain over n-grams alone.
Contribution of Chapter 6. We propose an approach to collect training data automatically by utilizing social signals in community-based Q&A sites, without involving any manual labeling, and we propose several light-weight features for question subjectivity identification.
Agenda: Conclusion and Future Work. Let me conclude my thesis presentation.
Conclusion Modeling users’ interests with respect to their behavior, and recommending items or users they may be interested in TagRec UserRec Understanding items’ characteristics, and grouping items that are semantically related for better addressing users’ information needs Question Suggestion Question Subjectivity Identification
Future Work. TagRec: mine explicit relations to infer implicit relations. UserRec: develop a framework to handle the tag ambiguity problem. Question Suggestion: diversify the suggested questions. Question Subjectivity Identification: more sophisticated features, e.g., semantic analysis.
Publications: Conferences (7). Tom Chao Zhou, Xiance Si, Edward Y. Chang, Irwin King and Michael R. Lyu. A Data-Driven Approach to Question Subjectivity Identification in Community Question Answering. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI-12), pp 164-170, Toronto, Ontario, Canada, July 22-26, 2012. Tom Chao Zhou, Michael R. Lyu and Irwin King. A Classification-based Approach to Question Routing in Community Question Answering. In Proceedings of the 21st International Conference Companion on World Wide Web, pp 783-790, Lyon, France, April 16-20, 2012. Tom Chao Zhou, Chin-Yew Lin, Irwin King, Michael R. Lyu, Young-In Song and Yunbo Cao. Learning to Suggest Questions in Online Forums. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-11), pp 1298-1303, San Francisco, California, USA, August 7-11, 2011. Zibin Zheng, Tom Chao Zhou, Michael R. Lyu, and Irwin King. FTCloud: A Ranking-based Framework for Fault Tolerant Cloud Applications. In Proceedings of the 21st IEEE International Symposium on Software Reliability Engineering (ISSRE 2010), pp 398-407, San Jose, CA, USA, November 1-4, 2010.
Publications: Conferences (7), continued. Tom Chao Zhou, Hao Ma, Michael R. Lyu, Irwin King. UserRec: A User Recommendation Framework in Social Tagging Systems. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI-10), pp 1486-1491, Atlanta, Georgia, USA, July 11-15, 2010. Tom Chao Zhou, Irwin King. Automobile, Car and BMW: Horizontal and Hierarchical Approach in Social Tagging Systems. In Proceedings of the 2nd Workshop on Social Web Search and Mining (SWSM 2009), in conjunction with CIKM 2009, pp 25-32, Hong Kong, November 2-6, 2009. Tom Chao Zhou, Hao Ma, Irwin King, Michael R. Lyu. TagRec: Leveraging Tagging Wisdom for Recommendation. In Proceedings of the 15th IEEE International Conference on Computational Science and Engineering (CSE-09), pp 194-199, Vancouver, Canada, August 29-31, 2009.
Publications: Journals (2), Under Review (1). Zibin Zheng, Tom Chao Zhou, Michael R. Lyu, and Irwin King. Component Ranking for Fault-Tolerant Cloud Applications, IEEE Transactions on Services Computing (TSC), 2011. Hao Ma, Tom Chao Zhou, Michael R. Lyu and Irwin King. Improving Recommender Systems by Incorporating Social Contextual Information, ACM Transactions on Information Systems (TOIS), Volume 29, Issue 2, 2011. Under review: Tom Chao Zhou, Michael R. Lyu and Irwin King. Learning to Suggest Questions in Social Media. Submitted to Journal of the American Society for Information Science and Technology (JASIST).
Thanks! Q & A
FAQ FAQ: Chapter 3 FAQ: Chapter 4 FAQ: Chapter 5 FAQ: Chapter 6
FAQ: Chapter 3. An example of a recommender system; MAE and RMSE equations; parameter sensitivity; tag or social network; intuition of maximizing the log of the posterior distribution in Eq. 3.10 of the thesis. Back to FAQ
An Example of A Recommender System Have some personal preferences. Get some recommendations. Back to FAQ
MAE and RMSE Mean absolute error (MAE) Root mean squared error (RMSE) Back to FAQ
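For reference, the standard definitions over the N test ratings are:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{(i,j)\in \mathcal{T}} \bigl|R_{ij}-\hat{R}_{ij}\bigr|
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(i,j)\in \mathcal{T}} \bigl(R_{ij}-\hat{R}_{ij}\bigr)^2}
```

where the set of N test ratings is denoted by a calligraphic T and the hatted R is the predicted rating.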
Parameter Sensitivity. The parameter theta_C controls the impact of the user-tag tagging matrix, and the parameter theta_D controls the impact of the item-tag tagging matrix. Setting both to 0 means we only consider users' rating information; letting both go to infinity means we only utilize users' tagging information. The relatively wide range of optimal parameter settings indicates the feasibility of parameter tuning. Back to FAQ
Tag or Social Network? What is the difference between incorporating tag information and social network information? Answer: both tagging and social networking can be considered user behavior beyond rating, and they explain users' preferences from different angles. The proposed TagRec framework can not only incorporate tag information but also utilize social network information in a similar framework. Back to FAQ
Intuition of maximizing the log of the posterior distribution in Eq. 3.10 of the thesis. Maximizing the log of the posterior distribution is equivalent to maximizing the posterior distribution directly, because the logarithm is a continuous, strictly increasing function over the range of the likelihood. The reason I maximize the posterior distribution is that after Bayesian inference, I need the conditional distributions to obtain the posterior, e.g., p(R|U,V), where R is the observed ratings and U, V are parameters. To estimate U and V, I use maximum likelihood estimation over the parameter space, so I need to maximize the conditional distribution p(R|U,V). This is why I maximize the log function in my approach. Back to FAQ
FAQ: Chapter 4. What is modularity? Comparison on Precision@N; comparison on top-K accuracy; comparison on top-K recall; distribution of the number of users in the network; distribution of the number of fans of a user; relationship between # fans and # bookmarks; why we use the graph mining algorithm instead of simpler algorithms, e.g., frequent itemset mining. Back to FAQ
What is Modularity? The modularity of a network is widely recognized as a good measure of the strength of its community structure. In Newman's formulation, Q = (1/2m) * sum_ij [A_ij - (k_i * k_j)/(2m)] * delta(c_i, c_j), where A is the adjacency matrix, k_i is the degree of node i, m is the total number of edges, and delta(c_i, c_j) = 1 if node i and node j belong to the same community (0 otherwise). Back to FAQ
Comparison on Precision@N Back to FAQ
Comparison on Top-K Accuracy Back to FAQ
Comparison on Top-K Recall Back to FAQ
Distribution of Number of Users in Network Back to FAQ
Distribution of Number of Fans of A User Back to FAQ
Relationship Between # Fans and # Bookmarks. The role of users with an extremely large number of bookmarks is very similar to the role of Web portals on the Internet; they can be called hubs. Back to FAQ
Why do we use the graph mining algorithm instead of simpler algorithms, e.g., frequent itemset mining? We use a community discovery algorithm on each tag-graph, which accurately captures users' interests in different topics. The algorithm is efficient, with complexity O(n log^2 n). Frequent itemset mining is suitable for mining small itemsets, e.g., 1 to 3 items per set, whereas each topic here can contain many tags. Back to FAQ
FAQ: Chapter 5 Experiments on word translation Dirichlet smoothing Build monolingual parallel corpus in community-based Q&A An example from Yahoo! Answers Formulations of TopicTRLM-A Data Analysis in online forums Performance on Yahoo! Answers “travel” Back to FAQ
Experiments on Word Translation. IBM model 1 mainly captures semantic relationships of words drawn from semantically related questions; LDA captures co-occurrence relations of words within a question. The first row of the table shows the source words; the top 10 words most semantically related to each source word are presented according to IBM translation model 1 and LDA (all words lowercased and stemmed). When a user asks a question about "shore", "snorkel" is related because snorkelling is a popular activity at the shore, and "condo" is also related because the user needs to rent a condo to stay in. LDA describes co-occurrence relations because it considers the words within a question; for example, people ask questions like "Is there any grocery store at Orange Beach?", and LDA captures this kind of word relation between "grocery" and "shore" in a sentence. Back to FAQ
Dirichlet Smoothing. Bayesian smoothing using Dirichlet priors: a language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution. Choosing the parameters of the Dirichlet to be (mu * p(w_1|C), ..., mu * p(w_n|C)), the smoothed model is given by p_mu(w|d) = (c(w; d) + mu * p(w|C)) / (|d| + mu), where c(w; d) is the count of word w in document d, |d| is the document length, and C is the collection. Back to FAQ
Build Monolingual Parallel Corpus in Community-based Q&A. We aggregate each question's title and detail as a monolingual parallel corpus. Back to FAQ
An Example from Yahoo! Answers. This is a snapshot from Yahoo! Answers, a famous community-based Q&A site; we can see that the best answer is available for a resolved question. Back to FAQ
TopicTRLM-A in Community-based Q&A. The scoring function combines a lexical score with a latent semantic score. Back to FAQ
TopicTRLM-A in Community-based Q&A. Components of the formula: Dirichlet smoothing; question LM score; answer ensemble; question translation model score. Back to FAQ
Data Analysis in Online Forums. Post-level analysis: 1,412,141 threads in total; 566,256 threads have replied posts from the thread starter (TS); the average number of replied posts from the TS is 1.9. We first conduct a post-level analysis of thread starters' activities: on average, a thread starter replied to 1.9 posts in the thread he or she initiated, which indicates that forum discussions are quite interactive. We also plotted the distribution of replied posts from thread starters; it follows a power law, and this is the first time the power-law distribution of thread starters' activities has been reported. Back to FAQ
Performance on Yahoo! Answers “travel” Performance of different models on category “travel” (a larger metric value means a better performance) Back to FAQ
FAQ: Chapter 6. Examples of subjective and objective questions; benefits of performing question subjectivity identification; how to define subjective and objective questions. Back to FAQ
Examples of Subjective, Objective Questions. Subjective: "What was your favorite novel that you read?"; "What are the ways to calm myself when flying?". Objective: "When and how did Tom Thompson die? He is one of the Group of Seven."; "What makes the color blue?". In this work, we study question subjectivity identification: we need to identify whether a question is subjective or objective. A subjective question means the asker expects to collect opinions from other people; an objective question means the asker expects to receive a single complete answer. Back to FAQ
Benefits of Performing QSI More accurately identify similar questions Better rank or filter the answers Crucial component of inferring user intent Subjective question --> Route to users Objective question --> Trigger AFQA Back to FAQ
How to define subjective and objective questions. Ground-truth data was created using Amazon's Mechanical Turk service: each question was judged by 5 qualified Mechanical Turk workers, and subjectivity was decided by majority voting. Linguistics people are good at manual labeling; computer science people should focus on how to use existing data, such as social signals and answers, to identify subjective and objective questions, rather than on manual labeling. Back to FAQ