Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.

Presentation transcript:

Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University Qiang Yang Hong Kong University of Science and Technology WWW2007

Introduction
Unique characteristics of blogs
–Mainly maintained by individuals, so the content is generally personal
–The link structures between blogs generally form localized communities
Ongoing research on blogs
–Content-based analysis
–Evolution of blog communities
–Different kinds of tools to help users retrieve, organize and analyze blogs

Introduction – Genres in Blog Content
Affective
–Online diaries in which people share their daily lives publicly and express their feelings, thoughts or emotions
Informative
–Topic-oriented; the topic can be related to a hobby or to the author's profession or business

Introduction – the Problem and the Approach
The problem
–Separating informative articles from affective articles in blogs
The approach
–Treating the problem as binary classification
–Challenges
  –The definitions of informative articles and affective articles
  –The training corpus for both categories
  –The machine learning algorithm

Introduction – Studies in the Weblog Space
Emotion and topic classification of blog articles
–Improving the effectiveness of emotion classification by filtering out informative articles
Blog search
–An intent-driven blog search engine is proposed that re-ranks the search results according to their informativeness scores
Automatic detection of high-quality blogs
–Measuring the quality of a blog by the percentage of informative articles it contains

Definition of Informative and Affective Articles
A survey was conducted among users who regularly take part in blog activities
Contents of informative articles include:
–News similar to the news on traditional news websites
–Technical descriptions, e.g. programming techniques
–Commonsense knowledge
–Objective comments on events in the world
Contents of affective articles include:
–Diaries about personal affairs
–Descriptions of the author's own feelings or emotions

Algorithms
Classification algorithms
–Naïve Bayes Classifier (NB)
–Support Vector Machine (SVM)
–Rocchio Classifier
Feature selection algorithms
–Information Gain (IG)
–χ² statistic (CHI)

Classification Algorithm – Naïve Bayes Classifier
Laplace smoothing is applied to overcome the zero-frequency problem
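The slide's formula is not preserved in the transcript. As a minimal sketch of the general technique, the following shows a multinomial Naïve Bayes classifier with Laplace (add-one) smoothing. The function names, the toy documents and the labels are illustrative assumptions, not the authors' implementation or data.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Train multinomial Naive Bayes on tokenized docs with parallel labels."""
    vocab = {t for tokens in docs for t in tokens}
    class_sizes = Counter(labels)
    term_counts = {c: Counter() for c in class_sizes}
    for tokens, c in zip(docs, labels):
        term_counts[c].update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_cond = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())
        # Laplace smoothing: add 1 to every term count so that a term
        # unseen in class c still gets a non-zero probability.
        log_cond[c] = {
            t: math.log((counts[t] + 1) / (total + len(vocab))) for t in vocab
        }
    return log_prior, log_cond, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unknown terms are skipped."""
    log_prior, log_cond, vocab = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy data (hypothetical), only to exercise the code.
docs = [
    ["feel", "happy", "today"],
    ["diary", "feel", "sad"],
    ["news", "report", "market"],
    ["programming", "report", "code"],
]
labels = ["affective", "affective", "informative", "informative"]
model = train_nb(docs, labels)
print(predict_nb(model, ["feel", "market"]))
```

Without smoothing, "market" would have zero probability under the affective class and would force the whole product to zero. With add-one smoothing the article can still be scored for both classes.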

Classification Algorithm – Rocchio Classifier
Category-profile-based classifier
Each category c_j has a profile built from its documents, where |c_j| is the number of documents in category c_j and each document is a vector of terms weighted by TF-IDF
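The profile formula itself is missing from the transcript. The sketch below assumes the simplest common variant: the profile of a category is the centroid (average) of its TF-IDF document vectors, and a new article goes to the category whose profile has the highest cosine similarity. The IDF variant, the use of cosine similarity and all names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) and the IDF table for tokenized docs."""
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 keeps ubiquitous terms non-zero
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in docs]
    return vectors, idf


def category_profiles(vectors, labels):
    """Average the document vectors of each category: sum, then divide by |c_j|."""
    sizes = Counter(labels)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, c in zip(vectors, labels):
        for t, w in vec.items():
            sums[c][t] += w
    return {c: {t: w / sizes[c] for t, w in terms.items()} for c, terms in sums.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(profiles, idf, tokens):
    """Assign the category whose profile is most similar to the article."""
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(profiles, key=lambda c: cosine(vec, profiles[c]))


# Toy data (hypothetical), only to exercise the code.
docs = [
    ["feel", "happy", "today"],
    ["diary", "feel", "sad"],
    ["news", "report", "market"],
    ["programming", "report", "code"],
]
labels = ["affective", "affective", "informative", "informative"]
vectors, idf = tfidf_vectors(docs)
profiles = category_profiles(vectors, labels)
print(classify(profiles, idf, ["market", "news"]))
```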

Feature Selection Algorithms
Information Gain (IG)
χ² statistic (CHI)
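The slide's formulas were lost in the transcript. As a hedged illustration, the sketch below computes both scores for one term and one category from a 2×2 table of document counts, using the standard textbook definitions. The argument names a, b, c, d are assumptions introduced here, and the slide may have used a different notation.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for a category.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k)


def information_gain(a, b, c, d):
    """Reduction in category entropy from knowing whether the term is present."""
    n = a + b + c + d
    with_term, without_term = a + b, c + d
    return (
        entropy([a + c, b + d])
        - (with_term / n) * entropy([a, b])
        - (without_term / n) * entropy([c, d])
    )


# A term that occurs in every in-category doc and in no other doc is
# maximally discriminative; a term spread evenly carries no information.
print(chi_square(2, 0, 0, 2), information_gain(2, 0, 0, 2))
print(chi_square(1, 1, 1, 1), information_gain(1, 1, 1, 1))
```

In practice, terms are ranked by one of these scores and only the top-scoring ones are kept as features.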

Experiment Data
5,000 articles crawled from MSN Space
3,547 of them are labeled as affective and 1,109 as informative; the rest are filtered out because of encoding problems
2,200 articles from the Sohu.com Directory as informative articles
–News, commonsense knowledge or objective comments about 22 different topics
Table 1. Statistics of Data Set

Experiment – Comparing Classification Algorithms
Table 2. Performance of three classification algorithms

Comparing Feature Selection Algorithms
Table 3. Performance on different feature sets

Representative Features
Table 4. Top 20 representative features of each category

Study on Emotion and Topic Classification
Assume that informative articles do not express personal emotions
–Extracting affective articles can help to build a corpus of purely emotional articles
Figure 1. Two-step approach for topic and emotion classification

Experiment on Emotion Classification
Data
–Training: 2,494 blog articles manually labeled with one of two emotion tendencies, positive or negative
–Testing: 1,303 articles from 75 blogs in MSN Space
Table 5. Data set used for emotion classification

Experiment Result on Emotion Classification
Two approaches are compared: the information-affectiveness classification is applied before the binary emotion classifier (I-Approach) or it is not (II-Approach)
Table 6. Comparison results for two emotion classification approaches

Study on Intent-driven Weblog Search Engine
Blog search currently works in the same way as Web search
Intent-driven search (re-rank)
S_mixed = λ · S_if + (1 − |λ|) · S_origin
where S_if is a confidence value between -1 (strong affective intent) and 1 (strong informative intent), and S_origin is the original relevance score
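A minimal sketch of how the mixing formula could be used to re-rank results. The slide does not state the range of λ or of S_origin. The code assumes λ lies in [-1, 1], with negative values favoring affective articles and positive values favoring informative ones, and that S_origin is on a comparable scale. The function names and sample scores are hypothetical.

```python
def mixed_score(s_if, s_origin, lam):
    """S_mixed = lam * S_if + (1 - |lam|) * S_origin."""
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [-1, 1]")
    return lam * s_if + (1.0 - abs(lam)) * s_origin


def rerank(results, lam):
    """Sort (doc_id, s_if, s_origin) tuples by descending mixed score."""
    return sorted(
        results,
        key=lambda r: mixed_score(r[1], r[2], lam),
        reverse=True,
    )


# Hypothetical results: one strongly affective diary entry with high
# relevance, one informative tutorial with lower relevance.
results = [("diary", -0.8, 0.9), ("tutorial", 0.7, 0.6)]

print([doc for doc, _, _ in rerank(results, 0.0)])   # plain relevance order
print([doc for doc, _, _ in rerank(results, 0.5)])   # informative intent
print([doc for doc, _, _ in rerank(results, -0.5)])  # affective intent
```

With λ = 0 the original ranking is unchanged. As |λ| grows, the genre score increasingly outweighs the original relevance score.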

Analysis of the Distribution of the Two Genres of Articles
Figure 2. Distribution of informative articles and affective articles on 99,059 blog articles

Detecting High-quality Blogs
Figure 3. Distribution of blogs with different levels of quality on 6,319 blogs
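Earlier in the deck, blog quality is described as the percentage of informative articles in a blog. A minimal sketch of that measure follows. The label strings, the handling of an empty blog and the function name are assumptions, and the thresholds behind the quality levels in Figure 3 are not given in the transcript.

```python
def blog_quality(article_labels):
    """Fraction of a blog's articles that were classified as informative."""
    if not article_labels:
        return 0.0  # assumption: an empty blog gets the lowest score
    informative = sum(1 for label in article_labels if label == "informative")
    return informative / len(article_labels)


# Hypothetical classifier output for the articles of one blog.
labels_for_blog = ["informative", "affective", "informative", "informative"]
print(blog_quality(labels_for_blog))
```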

Conclusion and Future Work
The task of separating informative and affective articles is addressed and treated as a binary classification task.
Applications of this information-affectiveness classification are studied, including emotion classification, intent-driven blog search and high-quality blog detection.
Future work: 1) building a much larger data set using semi-supervised learning techniques 2) applying the existing approach to data in other languages