WikiTrust: Turning Wikipedia Quantity into Quality
B. Thomas Adler, Luca de Alfaro, and Ian Pye

Wikipedia: 3,000,000+ Articles, 1,000,000,000+ Revisions
Our Goal: Crowd-sourcing community consensus

Vandalism
  Prevents Wikipedia from being taken fully seriously
  Harder to use Wikipedia in schools
  Harder to make static selections

Vandalism Detection
  Given a new revision, classify it as Vandalism or Regular.
  Zero-delay: Use only those features that are available at the time the revision is created (no lookahead).
  Historical: Use the full set of WikiTrust features, including how the revision is treated by subsequent authors (lookahead).

Revision Selection
  Given an article, select the "best" revision to show to a user.
  Wikipedia 1.0 Project: Aims to extract a static snapshot of Wikipedia.
  Use in schools, developing countries, the OLPC Project.

Core Concepts
  Wikipedia Article: many Revisions, 1 Author per Revision.
  Author has Reputation, Revision has Trust.
  Binary Classifier: Either A or B.

Zero Day Features
  Author is anonymous (turns out we don't care)
  Time interval after the previous edit (useful, but only as a predicate: time > 12 seconds)
  Time of day of edit (not used)

Zero Day Features
  Difference from previous revisions (not really)
  Comment length (nope)

Zero Day Features (we care about these)
  Previous Text Trust Histogram
  Current Text Trust Histogram
  Histogram Difference
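
As a rough illustration of the three histogram features, the sketch below buckets per-word trust values and subtracts the previous revision's histogram from the current one. The function names, the 0 to 10 trust scale, and the choice of 10 buckets are assumptions made for this example; the slides do not specify them.

```python
from typing import List, Sequence


def trust_histogram(word_trust: Sequence[float],
                    bins: int = 10,
                    max_trust: float = 10.0) -> List[float]:
    """Fraction of words in each of `bins` equal-width trust buckets.

    The 0..max_trust scale and the bucket count are assumptions.
    """
    counts = [0] * bins
    for t in word_trust:
        # Clamp so that out-of-range values land in the first or last bucket.
        idx = int(t * bins / max_trust)
        idx = max(0, min(idx, bins - 1))
        counts[idx] += 1
    total = len(word_trust)
    return [c / total if total else 0.0 for c in counts]


def histogram_difference(prev: Sequence[float],
                         curr: Sequence[float]) -> List[float]:
    """Per-bucket change from the previous revision to the current one."""
    return [c - p for p, c in zip(prev, curr)]


# Hypothetical example: two high-trust words are replaced by low-trust text.
prev_hist = trust_histogram([9.0, 9.5, 8.0, 9.0])
curr_hist = trust_histogram([9.0, 9.5, 0.5, 0.5])
diff = histogram_difference(prev_hist, curr_hist)
print(diff)
```

A shift of mass from the high-trust buckets to the low-trust buckets is the kind of signal a classifier could pick up from the difference vector.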

Text Trust
  New text starts with a trust value proportional to the author's reputation.
  Text can gain trust when revised.
  Cut-and-paste and deletions result in local trust loss.
  We remember deleted text and its trust.

A Sequence of Differences
  How did we arrive at the current version of an article?
  For revisions $v_1, v_2, v_3, \ldots$ of a wiki page, word trust is computed from the difference between $v_i$ and $v_{i-1}$.
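
A minimal sketch of diffing two consecutive revisions at the word level, so that each word of the current revision is labeled as kept or new. It uses Python's standard `difflib.SequenceMatcher` as a stand-in; WikiTrust's own differencing (which also has to track moved and deleted text) is not described on the slides, and `word_diff` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_diff(prev_text: str, curr_text: str) -> List[Tuple[str, str]]:
    """Label each word of curr_text as "kept" or "new" relative to prev_text."""
    prev_words = prev_text.split()
    curr_words = curr_text.split()
    matcher = SequenceMatcher(a=prev_words, b=curr_words, autojunk=False)
    labels: List[Tuple[str, str]] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            labels.extend((w, "kept") for w in curr_words[j1:j2])
        elif tag in ("replace", "insert"):
            labels.extend((w, "new") for w in curr_words[j1:j2])
        # "delete" spans contribute no words to the current revision.
    return labels


print(word_diff("the cat sat", "the black cat sat"))
```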

Text Trust: The Algorithm Illustrated
  1) Trust of new text
  2) New block borders have the same trust as new text
  3) The revision effect increases the trust of existing text
  4) Note: this is not a new border
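
The first three steps can be sketched roughly as follows. This is a simplified illustration, not the WikiTrust implementation: the constants `new_text_scale` and `revision_gain`, the function name, and the rule that a "border" is any kept word directly adjacent to new text are all assumptions made for the example.

```python
from typing import List, Optional


def update_trust(prev_trust: List[Optional[float]],
                 reputation: float,
                 new_text_scale: float = 0.5,
                 revision_gain: float = 0.25) -> List[float]:
    """Assign trust to every word of a new revision.

    prev_trust[i] is the word's trust in the previous revision,
    or None if the word was inserted by this revision.
    """
    new_trust = new_text_scale * reputation
    result: List[float] = []
    for i, t in enumerate(prev_trust):
        if t is None:
            # Step 1: new text gets trust proportional to author reputation.
            result.append(new_trust)
            continue
        neighbours = (i - 1, i + 1)
        at_border = any(
            0 <= j < len(prev_trust) and prev_trust[j] is None
            for j in neighbours
        )
        if at_border:
            # Step 2: block borders get the same trust as new text.
            result.append(new_trust)
        elif t < reputation:
            # Step 3: the revision effect raises trust toward the reputation.
            result.append(t + revision_gain * (reputation - t))
        else:
            result.append(t)
    return result


# Hypothetical revision: two inserted words in the middle of trusted text.
print(update_trust([9.0, None, None, 9.0, 2.0, 9.0], reputation=8.0))
```

In this toy run the words next to the insertion drop to the new-text trust, the low-trust word away from the edit gains some trust, and text already trusted above the author's reputation is left unchanged.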

Historical Features
  Next revision comment length (length > 110 chars)
  Next revision comment has the word "revert" in it (too noisy)

Historical Features
  Author Reputation (How do other users judge this user's edits?)

Historical Features
  Minimum Revision Quality
  Average Revision Quality
  Maximum Dissent

Historical Features
  Total Weight of Judges (not at all)

ROC AUC Scoring
  > 0.9 = Excellent
  0.8 to 0.9 = Good
  < 0.8 = Poor
  0.5 = Expected result from flipping a coin
  Interpretation: the probability that a binary classifier scores a randomly chosen positive example above a randomly chosen negative one
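
To make the score concrete, here is a small sketch that computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name and the toy scores are made up for illustration; label 1 stands for vandalism and 0 for a regular revision.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of correctly ordered positive/negative pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0.9, 0.4, 0.7, 0.2], [1, 1, 0, 0]))  # 0.75
```

The quadratic loop is fine for a toy example; a rank-based computation would be preferable for large evaluation sets.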

Results (PAN 2010) ROC of 0.937

Results (PAN 2010) ROC of X ROC of ?

Other Directions
  Wikipedia 1.0
  Vandalism API
  Newsgroup Reputation
  IP Address Reputation

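Revision Quality
  The fraction of change that is in the same direction as the future.
  $v_i$ is a past revision, $v_j$ the revision being judged, and $v_k$ a future revision.
  "work done" = $d(v_i, v_j)$
  "progress" = $d(v_i, v_k) - d(v_j, v_k)$
  $\mathrm{Qual}(v_j) = \dfrac{d(v_i, v_k) - d(v_j, v_k)}{d(v_i, v_j)}$, with $-1 \le \mathrm{Qual} \le 1$
  Qual = 1: $v_j$ is a totally good edit
  Qual = -1: $v_j$ is reverted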
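
A minimal sketch of the quality measure, assuming quality is "progress" divided by "work done", that is (d(v_i, v_k) - d(v_j, v_k)) / d(v_i, v_j). The distance d is taken here to be a plain word-level Levenshtein distance, which is an assumption: the slides do not say which edit distance WikiTrust uses. Returning 0.0 when v_j makes no change is also a choice made for this example.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def revision_quality(v_i: str, v_j: str, v_k: str) -> float:
    """Quality of revision v_j, judged from past v_i and future v_k."""
    wi, wj, wk = v_i.split(), v_j.split(), v_k.split()
    work = edit_distance(wi, wj)
    if work == 0:
        return 0.0  # assumption: an empty edit is treated as neutral
    progress = edit_distance(wi, wk) - edit_distance(wj, wk)
    return progress / work


past = "the cat sat"
edit = "the black cat sat"

# The future keeps the edit and builds on it: quality 1.0.
print(revision_quality(past, edit, "the black cat sat down"))
# The future reverts to the past revision: quality -1.0.
print(revision_quality(past, edit, past))
```

The bound -1 ≤ Qual ≤ 1 follows from the triangle inequality, which Levenshtein distance satisfies.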