WikiTrust: Turning Wikipedia Quantity into Quality B. Thomas Adler, Luca de Alfaro, and Ian Pye
Wikipedia: 3,000,000+ Articles, 1,000,000,000+ Revisions. Our Goal: Crowd-sourcing community consensus
Vandalism: Prevents Wikipedia from being taken fully seriously. Harder to use Wikipedia in schools. Harder to make static selections.
Vandalism Detection: Given a new revision, classify it as Vandalism or Regular. Zero-delay: Use only those features which are available at the time the revision is created (no lookahead). Historical: Use the full set of WikiTrust features, including how the revision is treated by subsequent authors (lookahead).
Revision Selection: Given an article, select the “best” revision to show to a user. Wikipedia 1.0 Project: Aims to extract a static snapshot of Wikipedia. Use in Schools, Developing Countries, OLPC Project.
Core Concepts: A Wikipedia Article has many Revisions. Each Revision has 1 Author. Author has Reputation, Revision has Trust. Binary Classifier: Either A or B.
Zero Day Features Author is Anonymous (Turns out we don’t care) Time interval after the previous edit (Useful, but only as a predicate time > 12 seconds) Time of day of edit (Not used)
Zero Day Features Difference from previous revisions (Not really) Comment Length (Nope)
Zero Day Features (we care about these) Previous Text Trust Histogram Current Text Trust Histogram Histogram Difference
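The three histogram features can be sketched as follows. This is only an illustration: the bin count, the 0 to 10 trust range, and the function names `trust_histogram` and `histogram_difference` are assumptions made for the example, not details taken from the slides.

```python
from typing import List


def trust_histogram(word_trust: List[float], bins: int = 10,
                    max_trust: float = 10.0) -> List[int]:
    """Count how many words fall into each trust bucket.

    Assumption: trust lies in [0, max_trust]; out-of-range values are clamped.
    """
    hist = [0] * bins
    for t in word_trust:
        t = min(max(t, 0.0), max_trust)
        # The top edge (t == max_trust) is folded into the last bucket.
        idx = min(int(t / max_trust * bins), bins - 1)
        hist[idx] += 1
    return hist


def histogram_difference(prev: List[int], curr: List[int]) -> List[int]:
    """Per-bucket change in word counts from the previous to the current revision."""
    if len(prev) != len(curr):
        raise ValueError("histograms must have the same number of buckets")
    return [c - p for p, c in zip(prev, curr)]
```

A revision that replaces high-trust text with new low-trust text would show up as negative counts in the upper buckets of the difference and positive counts in the lower ones.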
Text Trust New text starts with a trust value proportional to the author's reputation. Text can gain trust when revised. Cut-and-paste, deletions result in local trust loss. We remember deleted text and its trust.
A Sequence of Differences: For revisions v_1, v_2, v_3, ... of a wiki, word trust is computed from the difference between v_i and v_{i-1}. How did we arrive at the current version of an article?
Text Trust: The Algorithm Illustrated 1) Trust of new text 2) New block borders have the same trust as new text 3) The revision effect increases the trust of existing text 4) Note: this is not a new border
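The steps above can be sketched as a per-word update. This is a simplified illustration rather than the WikiTrust implementation: the scaling constants `new_scale` and `revise_gain`, the boolean flags, and the rule that existing text moves a fraction of the way toward the author's reputation are all assumptions made for the example.

```python
from typing import List


def update_trust(old_trust: List[float], is_new: List[bool],
                 is_border: List[bool], author_rep: float,
                 new_scale: float = 0.5,
                 revise_gain: float = 0.25) -> List[float]:
    """Return the trust of each word after one revision.

    old_trust  -- trust carried over from the previous revision
                  (ignored for words flagged as new)
    is_new     -- True where the word was inserted by this revision
    is_border  -- True where the word sits on the edge of a moved or cut block
    author_rep -- reputation of the revision's author
    """
    # 1) New text starts with trust proportional to the author's reputation.
    new_text_trust = new_scale * author_rep
    result = []
    for t, new, border in zip(old_trust, is_new, is_border):
        if new:
            result.append(new_text_trust)
        elif border:
            # 2) New block borders get the same trust as new text.
            result.append(new_text_trust)
        elif author_rep > t:
            # 3) Revision effect: surviving text gains trust (assumed rule).
            result.append(t + revise_gain * (author_rep - t))
        else:
            # A lower-reputation author leaves existing trust unchanged.
            result.append(t)
    return result
```

For example, with `author_rep=8.0` an untouched word of trust 4.0 rises to 5.0, while a border word and a new word both end up at 4.0.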
Historical Features Next revision comment length (length > 110 chars) Next revision comment has the word revert in it (too noisy)
Historical Features Author Reputation (How do other users judge this user’s edits?)
Historical Features Minimum Revision Quality Average Revision Quality Maximum Dissent
Historical Features Total Weight of Judges (not at all)
ROC AUC Scoring: > 0.9 = Excellent; 0.8 to 0.9 = Good; < 0.8 = Poor; 0.5 = Expected result from flipping a coin. Probability that a binary classifier scores a randomly chosen positive example above a randomly chosen negative one.
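To make the score concrete, ROC AUC can be computed directly as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counting as half. The function name `roc_auc` and the convention that label 1 means vandalism are assumptions for this sketch; a production evaluation would more likely use a library routine.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """ROC AUC as the share of positive/negative pairs ranked correctly.

    scores -- classifier output, higher means "more likely vandalism"
    labels -- 1 for vandalism, 0 for regular (assumed convention)
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie is as good as a coin flip
    return wins / (len(pos) * len(neg))
```

The quadratic pair loop is fine for illustration; on a full evaluation set a sort-based computation would be the usual choice.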
Results (PAN 2010) ROC of 0.937
Results (PAN 2010) ROC of X ROC of ?
Other Directions Wikipedia 1.0 Vandalism API Newsgroup Reputation IP Address Reputation
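Revision Quality: The fraction of change that is in the same direction as the future. For a past revision v_i, the judged revision v_j, and a future revision v_k: “work done” = d(v_i, v_j); “progress” = d(v_i, v_k) - d(v_j, v_k); Qual = progress / work done, with -1 ≤ Qual ≤ 1. Qual = 1: v_j is a totally good edit (the future keeps it, so d(v_j, v_k) = 0). Qual = -1: v_j is reverted (the future returns to v_i, so d(v_i, v_k) = 0).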
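A small sketch of the quality computation, assuming Qual = (d(v_i, v_k) - d(v_j, v_k)) / d(v_i, v_j), which is the form that reproduces the two endpoints stated on the slide. The distances are passed in as plain numbers; how d is measured (for example a word-level edit distance) is left open here, and the function name `revision_quality` is hypothetical.

```python
def revision_quality(d_ij: float, d_ik: float, d_jk: float) -> float:
    """Quality of revision v_j judged from past v_i and future v_k.

    d_ij -- "work done": distance from v_i to v_j
    d_ik -- distance from v_i to v_k
    d_jk -- distance from v_j to v_k
    """
    if d_ij <= 0:
        raise ValueError("v_j must differ from v_i to be judged")
    progress = d_ik - d_jk
    quality = progress / d_ij
    # Clamp so that an imperfect distance measure cannot leave [-1, 1].
    return max(-1.0, min(1.0, quality))
```

With a kept edit (`d_ik == d_ij`, `d_jk == 0`) the result is 1.0; with a revert (`d_ik == 0`, `d_jk == d_ij`) it is -1.0.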