Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1.

Slides:



Advertisements
Similar presentations
Sentiment Analysis on Twitter Data
Advertisements

Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Roberto Perdisci, Igino Corona, David Dagon, Wenke Lee ACSAC.
Measuring Reliability in Wikipedia Wen-Yuan Zhu
Linear Model Incorporating Feature Ranking for Chinese Documents Readability Gang Sun, Zhiwei Jiang, Qing Gu and Daoxu Chen State Key Laboratory for Novel.
Sarcasm Detection on Twitter A Behavioral Modeling Approach
Learning Within-Sentence Semantic Coherence Elena Eneva Rose Hoberman Lucian Lita Carnegie Mellon University.
Service Discrimination and Audit File Reduction for Effective Intrusion Detection by Fernando Godínez (ITESM) In collaboration with Dieter Hutter (DFKI)
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
The College of Saint Rose CIS 460 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st.
To trust or not, is hardly the question! Sai Moturu.
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
Language Identification in Web Pages Bruno Martins, Mário J. Silva Faculdade de Ciências da Universidade Lisboa ACM SAC 2005 DOCUMENT ENGENEERING TRACK.
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
Handwritten Character Recognition using Hidden Markov Models Quantifying the marginal benefit of exploiting correlations between adjacent characters and.
Selective Sampling on Probabilistic Labels Peng Peng, Raymond Chi-Wing Wong CSE, HKUST 1.
Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews K. Dave et al, WWW 2003, citations Presented by Sarah.
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh.
Automatically Identifying Localizable Queries Center for E-Business Technology Seoul National University Seoul, Korea Nam, Kwang-hyun Intelligent Database.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Sentiment Analysis of Social Media Content using N-Gram Graphs Authors: Fotis Aisopos, George Papadakis, Theordora Varvarigou Presenter: Konstantinos Tserpes.
Building Face Dataset Shijin Kong. Building Face Dataset Ramanan et al, ICCV 2007, Leveraging Archival Video for Building Face DatasetsLeveraging Archival.
2007. Software Engineering Laboratory, School of Computer Science S E Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying.
1 Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Speaker: Jun-Yi Zheng 2010/03/29.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong
Date: 2013/8/27 Author: Shinya Tanaka, Adam Jatowt, Makoto P. Kato, Katsumi Tanaka Source: WSDM’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Estimating.
Yun-Nung (Vivian) Chen, Yu Huang, Sheng-Yi Kong, Lin-Shan Lee National Taiwan University, Taiwan.
14/12/2009ICON Dipankar Das and Sivaji Bandyopadhyay Department of Computer Science & Engineering Jadavpur University, Kolkata , India ICON.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Working With Wikis: Taking Student Organizations Online Frederic Murray, MLIS Instructional Services Librarian Al Harris Library
How Useful are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, Jose San.
Understanding User’s Query Intent with Wikipedia G 여 승 후.
Prototype-Driven Learning for Sequence Models Aria Haghighi and Dan Klein University of California Berkeley Slides prepared by Andrew Carlson for the Semi-
Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors K Santosh Aditya Joshi Manish Gupta Vasudeva Varma
1 Using The Past To Score The Present: Extending Term Weighting Models with Revision History Analysis CIKM’10 Advisor : Jia Ling, Koh Speaker : SHENG HONG,
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
Chapter 23: Probabilistic Language Models April 13, 2004.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
National Taiwan University, Taiwan
Social Tag Prediction Paul Heymann, Daniel Ramage, and Hector Garcia- Molina Stanford University SIGIR 2008.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
CoCQA : Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation Baoli Li, Yandong Liu, and Eugene Agichtein.
Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15
Multiple Instance Learning for Sparse Positive Bags Razvan C. Bunescu Machine Learning Group Department of Computer Sciences University of Texas at Austin.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Comparative Experiments on Sentiment Classification for Online Product Reviews Hang Cui, Vibhu Mittal, and Mayur Datar AAAI 2006.
Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Using Linguistic Analysis and Classification Techniques to Identify Ingroup and Outgroup Messages in the Enron Corpus.
From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Open Wikipedia at any page in the language you want to contribute (English or Swedish). Press Create account (Skapa konto). Enter username and password.
Introduction to Wikis! More info:
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Language Model for Machine Translation Jang, HaYoung.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
Language Identification and Part-of-Speech Tagging
A Simple Approach for Author Profiling in MapReduce
Speaker : chia hua Authors : Long Qin, Ming Sun, Alexander Rudnicky
Authorship Attribution Using Probabilistic Context-Free Grammars
Source: Procedia Computer Science(2015)70:
Wikitology Wikipedia as an Ontology
Presentation transcript:

Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1

Outline Introduction Related Work Our contribution Evaluation Conclusion 2

Outline Introduction Related Work Our contribution Evaluation Conclusion 3

Wikipedia’s Core Policies Can be edited by anyone Neutral-point- of-view Verifiability 4

Quality Control in Wikipedia Wikipedia editors and administrators Clean-up Tags 5

Wikipedia Articles with a promotional tone Wikipedia Article on Steve Angello – …Since then, he has exploded onto the house music scene… – …Steve Angello encompasses enough fame as a stand alone producer. Add astounding remixes for….his unassailable musical sights have truly made for an intense discography… 6

Wikipedia Articles with a promotional tone 7 Identified manually and tagged with an Cleanup message by Wikipedia editors

Outline Introduction Related Work Our contribution Evaluation Conclusion 8

Quality Flaw Prediction in Wikipedia (Anderka et al., 2012) Classifiers for ten most frequent quality flaws in Wikipedia One of the ten flaws is “Advert” => Written like an advertisement Majority of promotional Wikipedia articles “Advert” 9

Features Used in Classification (Anderka et al., 2012) Content-based features Structure features Network featuresEdit history features 10

Outline Introduction Related Work Our Approach – Motivation – Dataset Collection – Features – Classification Evaluation Conclusion 11

Style of Writing Our hypothesis - Promotional Articles could contain a distinct style of writing. Style of writing could be captured using – PCFG models – n-gram models 12

Our Approach Training PCFG models, character trigram models and word trigram models for the sets of promotional and non-promotional Wikipedia articles Compute features based on these models 13

Dataset Collection Positive Examples: 13,000 articles from English Wikipedia’s category, “Category:All articles with a promotional tone” (April 2013) Negative Examples: Randomly selected untagged articles (April 2013) 14

Training and Testing 70% of the data is used to train language models for each category of articles 30% of the data is used to train and test the classifiers for detecting promotional articles 15

Training N-gram Models For each categories of articles, we train – Word Trigram language models and – Character Trigram language models We also train a unigram word (BOW) model as a baseline for evaluation 16

N-gram Model Features Difference in the probabilities assigned to an article by the positive and the negative class character trigram language models Difference in the probabilities assigned to an article by the positive and the negative class word trigram language models 17

Training PCFG models (Raghavan et al., 2010; Harpalani et al., 2011) Promotional Articles Non- Promotional Articles Promotional Category Treebank Non- promotional Category Treebank Promotional PCFG model Non- Promotional Articles PCFG model 18 PCFG PARSING PCFG MODEL TRAINING

PCFG Model Features Calculate probabilities assigned to all sentences of an article by each of the two PCFG models Compute Mean, Maximum, Minimum and Standard Deviation of all probabilities, per PCFG model. Compute the difference in the values for these statistics assigned by the positive and negative class PCFG models 19

Classification LogitBoost with Decision Stumps (Friedman et al., 2000) 10-fold cross-validation 20

Outline Introduction Related Work Our contribution Evaluation Conclusion 21

Evaluation FeaturesPrecisionRecallF1AUC Bag-of-words Baseline PCFG Char. Trigram Word Trigram PCFG + Char. Trigram + Word Trigram Content and Meta Features (Anderka et al.) All Features

Evaluation FeaturesPrecisionRecallF1AUC Bag-of-words Baseline PCFG Char. Trigram Word Trigram PCFG + Char. Trigram + Word Trigram Content and Meta Features (Anderka et al.) All Features

Evaluation FeaturesPrecisionRecallF1AUC Bag-of-words PCFG Char. Trigram Word Trigram PCFG + Char. Trigram + Word Trigram Content and Meta Features (Anderka et al.) All Features

Evaluation FeaturesPrecisionRecallF1AUC Bag-of-words PCFG Char. Trigram Word Trigram PCFG + Char. Trigram + Word Trigram Content and Meta Features (Anderka et al.) All Features

Evaluation FeaturesPrecisionRecallF1AUC Bag-of-words PCFG Char. Trigram Word Trigram PCFG + Char. Trigram + Word Trigram Content and Meta Features (Anderka et al.) All Features

Top 10 Features (Based on Information Gain) 1.LM char trigram 2.LM word trigram 3.PCFG min 4.PCFG max 5.PCFG mean 6.PCFG std. deviation 7.Number of Characters 8.Number of Words 9.Number of Categories 10.Number of Sentences 27

Average Sentiment Score Average sentiment of all words in an article using SentiWordNet (Baccianella et al., 2010) Intuitively seems like a discriminative feature 18th most informative feature Reinforces our hypothesis that surface level features are insufficient 28

Conclusion Features based on n-gram language models and PCFG models work very well in detecting promotional articles in Wikipedia. Main advantages – – Depend on the article’s content only and not on external meta-data – Perform with high accuracy 29

Questions? 30

Content-based Features Number of characters, words, sentences Avg. Word Length Avg., min., max. Sentence Lengths, Ratio of max. to min. sentence lengths Ratio of long sentences (>48 words) to Short Sentences (<33 words) % of Sentences in the passive voice Relative Frequencies of POS tags % of sentences beginning with selected POS tags % of special phrases (e.g. editorializing terms like ‘without a doubt’, ‘of course’ ) % of easy words, difficult words, long words and stop words Overall Sentiment Score based on SentiWordNet * * Baccianella et al.,

Structure Features Number of Sections Number of Images Number of Categories Number of Wikipedia Templates used Number of References, Number of References per sentence and Number of references per section 32

Wikipedia Network Features Number of Internal Wikilinks (to other Wikipedia pages) Number of External Links (to other websites) Number of Backlinks (i.e. Number of wikilinks from other Wikipedia articles to an article) Number of Language Links (i.e. Number of links to the same article in other languages) 33

Edit History Features Age of the article Days since last revision of the article Number of edits to the article Number of unique editors Number of edits made by registered users and by anonymous IP addresses Number of edits per editor Percentage of edits by top 5% of the top contributors to the article 34