LOGO Finding High-Quality Content in Social Media Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis and Gilad Mishne (WSDM 2008) Advisor.

Slides:

Advertisements

Similar presentations

1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.

Advertisements

Towards Twitter Context Summarization with User Influence Models Yi Chang et al. WSDM 2013 Hyewon Lim 21 June 2013.

DOMAIN DEPENDENT QUERY REFORMULATION FOR WEB SEARCH Date : 2013/06/17 Author : Van Dang, Giridhar Kumaran, Adam Troy Source : CIKM’12 Advisor : Dr. Jia-Ling.

Linear Model Incorporating Feature Ranking for Chinese Documents Readability Gang Sun, Zhiwei Jiang, Qing Gu and Daoxu Chen State Key Laboratory for Novel.

COLLABORATIVE FILTERING Mustafa Cavdar Neslihan Bulut.

Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.

1 Learning User Interaction Models for Predicting Web Search Result Preferences Eugene Agichtein Eric Brill Susan Dumais Robert Ragno Microsoft Research.

Information Retrieval in Practice

Finding High-Quality Content in Social Media chenwq 2011/11/26.

Expertise Networks in Online Communities: Structure and Algorithms Jun Zhang Mark S. Ackerman Lada Adamic University of Michigan WWW 2007, May 8–12, 2007,

Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.

DEPARTMENT OF COMPUTER SCIENCE SOFTWARE ENGINEERING, GRAPHICS, AND VISUALIZATION RESEARCH GROUP 15th International Conference on Information Visualisation.

Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.

Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.

Overview of Search Engines

Quality-aware Collaborative Question Answering: Methods and Evaluation Maggy Anastasia Suryanto, Ee-Peng Lim Singapore Management University Aixin Sun.

Using Friendship Ties and Family Circles for Link Prediction Elena Zheleva, Lise Getoor, Jennifer Golbeck, Ugur Kuter (SNAKDD 2008)

Attention and Event Detection Identifying, attributing and describing spatial bursts Early online identification of attention items in social media Louis.

MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.

Web Usage Mining with Semantic Analysis Date: 2013/12/18 Author: Laura Hollink, Peter Mika, Roi Blanco Source: WWW’13 Advisor: Jia-Ling Koh Speaker: Pei-Hao.

Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.

Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.

Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.

Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.

A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.

Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.

A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.

UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.

Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.

CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.

1 Discovering Authorities in Question Answer Communities by Using Link Analysis Pawel Jurczyk, Eugene Agichtein (CIKM 2007)

 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.

Querying Structured Text in an XML Database By Xuemei Luo.

A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:

Understanding and Predicting Personal Navigation Date : 2012/4/16 Source : WSDM 11 Speaker : Chiu, I- Chih Advisor : Dr. Koh Jia-ling 1.

Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,

A Language Independent Method for Question Classification COLING 2004.

Date: 2014/02/25 Author: Aliaksei Severyn, Massimo Nicosia, Aleessandro Moschitti Source: CIKM’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Building.

Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock,

Date : 2012/10/25 Author : Yosi Mass, Yehoshua Sagiv Source : WSDM’12 Speaker : Er-Gang Liu Advisor : Dr. Jia-ling Koh 1.

Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.

Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova ， Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.

LOGO Summarizing Conversations with Clue Words Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou (WWW ’07) Advisor ： Dr. Koh Jia-Ling Speaker ： Tu.

IR, IE and QA over Social Media Social media (blogs, community QA, news aggregators)  Complementary to “traditional” news sources (Rathergate)  Grow.

Finding high-Quality contents in Social media BY : APARNA TODWAL GUIDED BY : PROF. M. WANJARI.

Algorithmic Detection of Semantic Similarity WWW 2005.

Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.

LOGO Identifying Opinion Leaders in the Blogosphere Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng CIKM 2007 Advisor ： Dr. Koh Jia-Ling Speaker ： Tu.

Harvesting Social Knowledge from Folksonomies Harris Wu, Mohammad Zubair, Kurt Maly, Harvesting social knowledge from folksonomies, Proceedings of the.

Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University SIGIR 2009.

A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,

Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.

Post-Ranking query suggestion by diversifying search Chao Wang.

Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.

1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:

Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media J. Bian, Y. Liu, E. Agichtein, and H. Zha ACM WWW, 2008.

CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,

11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.

TO Each His Own: Personalized Content Selection Based on Text Comprehensibility Date: 2013/01/24 Author: Chenhao Tan, Evgeniy Gabrilovich, Bo Pang Source:

Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:

Date: 2013/9/25 Author: Mikhail Ageev, Dmitry Lagun, Eugene Agichtein Source: SIGIR’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Improving Search Result.

Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:

LOGO Comments-Oriented Blog Summarization by Sentence Extraction Meishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07) Advisor ： Dr. Koh Jia-Ling Speaker ：

CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.

CiteData: A New Multi-Faceted Dataset for Evaluating Personalized Search Performance CIKM’10 Advisor : Jia-Ling, Koh Speaker : Po-Hsien, Shih.

Ontology Engineering and Feature Construction for Predicting Friendship Links in the Live Journal Social Network Author:Vikas Bahirwani 、 Doina Caragea.

Information Retrieval in Practice

Using Friendship Ties and Family Circles for Link Prediction

Presentation transcript:

LOGO Finding High-Quality Content in Social Media Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis and Gilad Mishne (WSDM 2008) Advisor ： Dr. Koh Jia-Ling Speaker ： Tu Yi-Lang Date ：

2 Outline  Introduction  Content Quality Analysis In Social Media  Modeling Content Quality In Community Question/Answering  Experimental Setting  Experimental Results  Conclusion

3 Introduction  From the early 2000s, user-generated content has become increasingly popular on the web, more and more users participate in content creation, rather than just consumption.  Popular user generated content (or social media) domains include blogs and web forums, social bookmarking sites, photo and video sharing communities.

4 Introduction  The main challenge posed by content in social media sites is the fact that the distribution of quality has high variance.  As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions—social media sites— becomes increasingly important.

5 Introduction  The question/answering portals, in which users answer questions posed by other users, provide an alternative channel for obtaining information on the web.  Not just browsing results of search engines, users present detailed information needs and get direct responses authored by humans.  A user can vote for answers of other users, mark interesting questions, and even report abusive behavior.

6 Content Quality Analysis In Social Media  Evaluation of content quality is an essential module for performing more advanced information retrieval tasks on the question/answering system.  Exploiting features of social media that are intuitively correlated with quality, and then train a classifier to appropriately select and weight the features for each specific type of item, task, and quality definition.  Intrinsic content quality  User relationships  Usage statistics

7 Content Quality Analysis In Social Media  Intrinsic content quality  The intrinsic quality metrics focus on the text.  As a baseline, here using textual features only ： with all word n-grams up to length 5 that appear in the collection more than 3 times used as features.  Punctuation and typos ： capitalization rules may be ignored; excessive punctuation, or spacing may be irregular.  A particular form of low visual quality is misspellings and typos; additional features in the set quantify the number of out-of-vocabulary words and spelling mistakes.

8 Content Quality Analysis In Social Media  Intrinsic content quality (cont.)  Syntactic and Semantic Complexity ： these include simple proxies for complexity such as the average number of syllables per word or the entropy of word lengths.  Grammaticality ： using several linguistically-oriented features, like part-of-speech (POS) tags. “when | how | why to (verb)” (as in “how to identify... ”) →lower quality questions “when | how | why (verb) (personal pronoun) (verb)” (as in “how do I remove... ”)

9 Content Quality Analysis In Social Media  User relationships  A significant amount of quality information can be inferred from the relationships between users and items.  For example, it could apply to QA system, the main challenge is that the dataset, viewed as a graph ： nodes of multiple types. (e.g., questions, answers, users) edges represent a set of interaction among the nodes having different semantics. (e.g., “answers”, “gives best answer”, “votes for”, “gives a star to”)

10 Content Quality Analysis In Social Media  User relationships (cont.)  The resulting user-user graph is extremely rich and heterogeneous, and is unlike traditional graphs studied in the web link analysis setting.  However, still believing that traditional link analysis algorithm may provide useful evidence for quality classification, tuned for the particular domain.  Hence, for each type of link it performed a separate computation of each link-analysis algorithm, including hubs and authorities scores, the PageRank scores.

11 Content Quality Analysis In Social Media  Usage statistics  Usage statistics such as the number of clicks on the item and dwell time have been shown useful in the context of identifying high quality web search results, and are complementary to link-analysis based methods.  For example, all items within a popular category such as celebrity topics may receive orders of magnitude more clicks than, for instance, science topics. →it should be normalized by the item category

12 Content Quality Analysis In Social Media  Overall classification framework  Here casting the problem of quality ranking as a binary classification problem, in which a system must learn automatically to separate high quality content from the rest.  Experimented with several classification algorithms, the best performance among the techniques we tested was obtained with stochastic gradient boosted trees.

13 Modeling Content Quality In Community Question/Answering  Application-Specific User Relationships  The main forms of interaction among the users ： asking a question answering a question selecting best answer voting on an answer

14 Modeling Content Quality In Community Question/Answering  Application-Specific User Relationships (cont.)  Using multi-relational features to describe multiple classes of objects and multiple types of relationships between these objects.  Answer features ： the types of the data related to a particular answer form a tree, in which the type “Answer” is the root.

15 Modeling Content Quality In Community Question/Answering  Two subtrees starting from the answer being evaluated ： the types of features on the question subtree are ： the types of features on the user subtree are ：

16 Modeling Content Quality In Community Question/Answering  Application-Specific User Relationships (cont.)  Question features ： “Question” is the root.

17 Modeling Content Quality In Community Question/Answering  Application-Specific User Relationships (cont.)  Implicit user-user relations ：

18 Modeling Content Quality In Community Question/Answering  Application-Specific User Relationships (cont.)  Propagation-based metrics ： Pagerank score HITS hub score HITS authority score

19 Modeling Content Quality In Community Question/Answering  Content features for QA  Intuitively, a copy of a Wall Street Journal article about economy may have good quality, but would not (usually) be a good answer to a question about celebrity fashion.  To represent this, include the KL-divergence between the language models of the two texts, their non- stopword overlap, the ratio between their lengths, and other similar features.

20 Modeling Content Quality In Community Question/Answering  Usage features for QA  As a base set of content usage features we use the number of item views (clicks).

21 Experimental Setting  Dataset  The dataset consists of 6,665 questions and 8,366 question-answer pairs.  All of the above questions were labeled for quality by human editors, who were independent from the team that conducted this research.  Editors graded questions and answers for well- formedness, readability, utility, and interestingness; for answers, an additional correctness element was taken into account.

22 Experimental Setting  Dataset statistics  In the sample of users, most users participate as both “askers” and “answerers”.  In the evaluation dataset there is a positive correlation between question quality and answer quality.

23 Experimental Results  Question Quality

24 The 10 Most Significant Feature (Question)

25 Experimental Results  Answer Quality  The 10 most significant feature

26 The 10 Most Significant Feature (Answer)

27 Experimental Results

28 Conclusion  In this paper, it investigates methods for exploiting such community feedback to automatically identify high quality content.  Developing a comprehensive graph-based model of contributor relationships and combined it with content- and usage-based features.

29 Conclusion  This paper introduces a general classification framework for combining the evidence from different sources of information.  For the community question/answering domain, it shows that the system is able to separate high- quality items from the rest with an accuracy close to that of humans.