Predicting the controversy of a Wikipedia article


Predicting the controversy of a Wikipedia article Taras Gritsenko

Overview
Introduction
Background work
Theoretical analysis
Experimental design/analysis
Conclusion

Introduction
Problem: Is it possible to observe controversy in Wikipedia articles using existing, publicly available data?
What is a controversial topic? A topic that incites discussion and gives rise to public disagreement, with no consensus.
Many types of topics are controversial:
Politics (e.g. the most recent presidential campaign)
Legislation (e.g. gun control, medical marijuana, abortion)
Science and technology (has science gone too far?)

Introduction
With respect to Wikipedia, what characteristics can we isolate and analyze to determine properties of controversy?
Over 5,275,000 articles (somewhat difficult to quantify, seems small).
It goes without saying that we can’t know for certain whether something is objectively controversial, but we can invent some construct and be reasonably certain.
It is not exactly clear what this construct or threshold is.
Controversiality: 0 or 1, true or false.

Background Work
Identify aspects of Wikipedia articles that may indicate controversy.
Apply algorithms and… creative techniques.
Fidel Castro’s article on Wikipedia: is it controversial?
Figure 1. The Wikipedia article corresponding to Cuban politician Fidel Castro (H-Index = 6, 299 comments).

Background Work
I spent a good portion of time thinking about reasonable ways to approach this problem.
Problem: Where do I get a lot of metadata relating to Wikipedia articles?
Naïve approach: download a database dump that the Wikimedia Foundation distributes, downloading at 600kb/s (10.3GB file).
Ironically, I spent a week downloading this and it didn’t even contain talk page metadata.
All the download links for the Wikipedia talk page dump were broken.

Background Work
Use the seemingly useless random article feature for random sampling.
We can potentially visit all of the articles on Wikipedia.
It’s easy to scrape statically structured webpages.
But what does all of this have to do with… controversy?
Figure 2. A screenshot of the homepage of Wikipedia, featuring the random article feature.

Background Work
Wikipedia article ratings dataset from July 2011 to July 2012.
Each record holds a rating value from 1 to 5 and a key (1 to 4) identifying the component being rated.

Article rating distribution
Ratings are potentially indicators of controversy.
The distribution of ratings can reveal whether a concept or topic is controversial, subject to what is being assessed.
“Like to dislike” ratio.
Theory: as the ratio between positive and negative ratings approaches 1:1, we expect the topic to be more controversial, since there is no “convergence” or skewing in the dataset.
Since our rating data can take any value from 1 to 5, we can divide scores into “likes” and “dislikes” depending on some threshold.

Article rating distribution
How do we measure the degree to which data is skewed?
A rating vector in 4-space needs to be converted into a single scalar rating.
Simple: take the average of all 4 components of a rating, yielding some value between 1 and 5.
Given some threshold α, the rating is placed in one of two buckets.
Figure 3. A “controversial” distribution with an r_a of 0.55.

Article rating distribution
e.g. α = 3 (median of 1-5)
$$r_a = \frac{\#\ \text{of positive ratings}}{\#\ \text{of negative ratings}}$$
If $r_a > 1$, replace it with its reciprocal: $r_a := \frac{1}{r_a}$.
The order of numerator and denominator can be interchanged.
Straightforward approach, minimal effort.
Alternative? Second frequency moment.
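The ratio above can be sketched in Go as follows. This is a minimal illustration, not the author's program: the function name `controversyRatio` is hypothetical, sending averages equal to α to the positive bucket is an assumption (the slides do not say where ties go), and so is reporting "undefined" when either bucket is empty.

```go
package main

// controversyRatio averages each 4-component rating, splits the averages
// into a positive bucket (avg >= alpha) and a negative bucket (avg < alpha),
// and returns positives/negatives folded into the range (0, 1].
// Assumptions: ties go to the positive bucket; ok is false when either
// bucket is empty, because the ratio is undefined in that case.
func controversyRatio(ratings [][4]float64, alpha float64) (ra float64, ok bool) {
	var pos, neg int
	for _, r := range ratings {
		avg := (r[0] + r[1] + r[2] + r[3]) / 4
		if avg >= alpha {
			pos++
		} else {
			neg++
		}
	}
	if pos == 0 || neg == 0 {
		return 0, false
	}
	ra = float64(pos) / float64(neg)
	if ra > 1 {
		ra = 1 / ra // fold so that 1 always means an even split
	}
	return ra, true
}
```

With α = 3, two ratings averaging 5 and 4 and one averaging 1 give 2 positives and 1 negative, so the raw ratio 2 is folded to 0.5.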

Second frequency moment
Article rating distribution as a function of frequency moments.
Divide the continuous range of all possible ratings into n segments (classes) of width ε, into which the scores fall.
Ideally, $\varepsilon = \frac{\text{highest rating} - \text{lowest rating}}{n}$.
Within the context of scores from 1-5, ε = (5-1)/4 = 1.

Second frequency moment
$$\text{second moment} = \sum_{i=1}^{n} n_i^2$$
where $n_i$ is the number of elements in the i-th class, or bin.
The second moment is divided by n², where n here denotes the total number of scores (not the number of classes), to normalize it to a score bounded between 0 and 1.
Reason: in practice, articles have different amounts of rating data.
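A minimal Go sketch of the normalized second frequency moment described above. The function name `normalizedSecondMoment` and its parameters are hypothetical, and clamping the top score into the last class is an assumption about how the boundary is handled.

```go
package main

import "math"

// normalizedSecondMoment bins scores from the range [lo, hi] into n classes
// of equal width eps = (hi-lo)/n, sums the squared class counts, and divides
// by the square of the total number of scores.
// Assumption: a score equal to hi is clamped into the last class.
func normalizedSecondMoment(scores []float64, lo, hi float64, n int) float64 {
	if len(scores) == 0 || n <= 0 || hi <= lo {
		return 0
	}
	eps := (hi - lo) / float64(n)
	counts := make([]int, n)
	for _, s := range scores {
		i := int(math.Floor((s - lo) / eps))
		if i < 0 {
			i = 0
		}
		if i >= n {
			i = n - 1
		}
		counts[i]++
	}
	sum := 0
	for _, c := range counts {
		sum += c * c
	}
	total := len(scores)
	return float64(sum) / float64(total*total)
}
```

A value of 1 means every score fell into a single class; lower values mean the scores are spread across more classes.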

Discussion depth
The depth of a discussion can be an indicator of controversy.
Maximum depth, number of comments on a talk page, …
The more controversial an article, the higher the “average” depth of a reply chain (discussion) between users.
Use the notion of an H-Index to compute the most reasonable estimate of where a comment is going to be at a given depth.

H-Index
Used in measuring the productivity and citation impact of a scholar.
A scholar with an index of h has published h papers, each of which has been cited at least h times.
Also known as the Hirsch index or Hirsch number.
The H-Index of a discussion (comment-reply) tree is
$$\text{H-Index} = \max_i \, \min\big(f(i),\, i\big)$$
where f(i) is the number of replies at depth i.
Figure 4. A discussion tree with a max-depth of 7 and H-Index of 3.
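The formula can be sketched in Go as below. The function name `hIndex` is hypothetical, and the sketch assumes the input map uses depths counted from 1 as keys and the number of replies at that depth as values.

```go
package main

// hIndex returns the maximum over all depths i of min(f(i), i), where
// depths[i] holds f(i), the number of replies at depth i.
// Assumption: depths are counted from 1.
func hIndex(depths map[int]int) int {
	h := 0
	for depth, count := range depths {
		m := count
		if depth < m {
			m = depth
		}
		if m > h {
			h = m
		}
	}
	return h
}
```

For a tree with 5 replies at depth 1, 4 at depth 2, 3 at depth 3 and 1 at depth 4, the per-depth minima are 1, 2, 3 and 1, so the H-Index is 3.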

Experimental design: Article distribution
2.85GB of .tsv (tab-separated) data, roughly 1,200,000 article ratings, parsed in a Go program I wrote:
timestamp       page_id   page_title    page_namespace  rev_id     user_id  rating_key  rating_value
20110722000002  29543332  RC_Timişoara  0               419784624  0        1           5
20110722000002  29543332  RC_Timişoara  0               419784624  0        2           5
20110722000002  29543332  RC_Timişoara  0               419784624  0        3           5
20110722000002  29543332  RC_Timişoara  0               419784624  0        4           5
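As an illustration of how one row of this file might be read, here is a minimal Go sketch. It is not the author's parser: the `Rating` type and `parseRating` function are hypothetical names, and the sketch assumes every row has exactly the eight tab-separated fields shown in the header.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Rating mirrors the eight columns of the ratings file (hypothetical type).
type Rating struct {
	Timestamp string
	PageID    int
	PageTitle string
	Namespace int
	RevID     int
	UserID    int
	Key       int // which of the 4 components is rated
	Value     int // rating from 1 to 5
}

// parseRating splits one tab-separated line into a Rating.
func parseRating(line string) (Rating, error) {
	f := strings.Split(strings.TrimRight(line, "\r\n"), "\t")
	if len(f) != 8 {
		return Rating{}, fmt.Errorf("expected 8 fields, got %d", len(f))
	}
	var nums [8]int
	// Every column except timestamp (0) and page_title (2) is an integer.
	for _, i := range []int{1, 3, 4, 5, 6, 7} {
		n, err := strconv.Atoi(f[i])
		if err != nil {
			return Rating{}, fmt.Errorf("field %d (%q): %w", i, f[i], err)
		}
		nums[i] = n
	}
	return Rating{
		Timestamp: f[0], PageID: nums[1], PageTitle: f[2], Namespace: nums[3],
		RevID: nums[4], UserID: nums[5], Key: nums[6], Value: nums[7],
	}, nil
}
```

Keeping the timestamp as a string avoids committing to a time zone or layout that the slides do not specify.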

Experimental design: Article distribution
Top 50 handpicked results:
Article                                           Second Moment
Germanic Wars                                     0.04
Khader_Adnan                                      0.05
Felix_Z._Longoria,_Jr.                            0.09
Timeline_of_the_2011–2012_Egyptian_revolution ..  0.179
Deadgirl_(2008_film)                              0.18
Abdul_Hakeem,_Pakistan                            0.180
Non-lethal_weapon                                 0.185
Gaza                                              0.186
Yuri_Sidorenko                                    0.190
Kidnapping_of_children_by_Nazi_Germany            0.210

Experimental design: Article distribution
In general there was a relationship between controversy and article rating distribution, but it wasn’t nearly as obvious as with the H-Index approach.
There were very few articles with a frequency < 0.
I filtered out articles with few or no ratings (< 10 ratings), since they were unreliable.
The average number of ratings was 12.
Conclusion: article rating data simply isn’t particularly reliable, since users rating an article are often critical of the article itself rather than reflecting their personal feelings toward the topic.

Experimental design: H-Index of articles
How do we traverse a webpage, specifically a Wiki discussion?
Implement an algorithm to traverse the DOM tree.
Start of a comment: <dl> or <dd>; end: </dl> or </dd>.

Experimental design: H-Index of articles
Simple tree-traversal algorithm:
Keep a map (integer -> integer) of all depths, with key i corresponding to the number of comments at depth i.
Given the set of all tokens on the webpage, traverse each token.
If the token is an opening tag of a comment (<dd> or <dl>, choose one), increment depths[currentDepth] by 1 and currentDepth by 1.
If the token is a closing tag of a comment, decrement currentDepth by 1.
When no more comments are to be parsed, the result is a map containing all of the levels of depth and the number of comments at each depth.
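The steps above can be sketched in Go using only the standard library. This is a simplified stand-in rather than the author's crawler: it treats `<dd>` as the comment tag, finds tags with a regular expression instead of a full HTML tokenizer, and the names `ddTag` and `countDepths` are hypothetical. Following the order of operations in the steps, the outermost level is recorded under key 0. Ignoring a stray closing tag at depth 0 is an added safeguard, not part of the described algorithm.

```go
package main

import "regexp"

// ddTag matches opening and closing <dd> tags; the first capture group is
// "/" for a closing tag and empty for an opening tag.
var ddTag = regexp.MustCompile(`(?i)<(/?)dd\b[^>]*>`)

// countDepths walks the <dd> tags of a page in document order and returns
// a map from depth to the number of comments opened at that depth.
func countDepths(page string) map[int]int {
	depths := make(map[int]int)
	currentDepth := 0
	for _, m := range ddTag.FindAllStringSubmatch(page, -1) {
		if m[1] == "" {
			// Opening tag: count a comment at this depth, then descend.
			depths[currentDepth]++
			currentDepth++
		} else if currentDepth > 0 {
			// Closing tag: climb back up (stray closers are ignored).
			currentDepth--
		}
	}
	return depths
}
```

A regular expression is enough here because only tag boundaries matter, not the text between them; a page with malformed or unclosed tags would need a real HTML parser.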

Experimental design: H-Index of articles
Result: 1,000,000 articles traversed in 3.5-4 hours (80-90 threads).
Only 5% of the traversed articles had talk pages (meaning that in all of English Wikipedia only about 260,000 articles have talk pages).
Most articles have an H-Index of 2.

Experimental design: H-Index of articles
Top 15 results:
Article H-Index Max-Depth
Muawiyah I/Archive 1 21 29
Nagorno-Karabakh/Archive 5 16 19
Time Cube/Archive 1
Jehovah's Witnesses/Archive 25 24
John Vincent Atanasoff/Archive 13 23
Freemasonry/Archive 13
Ebionites/Archive 3 15 22
Soviet invasion of Manchuria/Archive 3
Political correctness/Archive 17 27
Societal attitudes toward homosexuality /Archive 2 14
Two envelopes problem/Archive 1 13
Gaza War/Archive 47
Political positions of John McCain/Archive 2
List of states with limited recognition/Archive 8 17
Race and intelligence/Archive 74 12

Experimental design: H-Index of articles
Sample output sorted by number of comments (columns: H-Index, max depth, article, number of comments):
21 29 Muawiyah I/Archive 1 1670
13 19 Neuro-linguistic programming/Archive 20 1563
9 11 Chiropractic/Archive 8 1476
10 16 Chiropractic/Archive 10 1269
8 15 Monty Hall problem/Arguments/Archive 5 1163
8 12 Comparison of the health care systems in Canada and the United States/Archive 1 1002
10 15 Pseudoscience/Archive 10 991
14 25 Moment of inertia 974
11 14 Transcendental Meditation/Archive 22 962
13 20 Commonwealth realm/Archive 11 956
10 17 Historicity of Jesus/Archive 34 926
9 15 Stephen Barrett/Archive 4 887
13 Intelligent design/Archive 69 864
13 19 Nagorno-Karabakh/Archive 7 851
9 13 Monty Hall problem/Archive 30 847
11 15 Acupuncture/Archive 9 845
8

Experimental design: H-Index of articles
Most of the articles with controversial discussions are archives.
Theory: many Wikipedia articles that generate controversy, particularly in their talk pages, were more controversial in the past, since they get reverted and edited quickly (discussions don’t stay relevant indefinitely).
Looking at the number of comments produces similar results to looking at the H-Index (the more popular an article is, the more controversial it may be).
Generally, fewer than 10 of the top 100 articles on the list were not controversial.

Conclusion
Generally speaking, there is no perfect method for predicting controversy.
Even the methods you’d think would be 100 percent accurate aren’t necessarily so.
Some methods produce more interesting results than others.
I avoided relative controversy (the theory is in my paper), but finding an implementation for it in the future would be nice.
Future work: relative controversy, improving upon the statistical analysis.