Stylistic Author Attribution Methods for Identifying Parody Trump Tweets
Isaac Pena
Department of Computer Science, Yale University, New Haven, CT
LILY Lab

Results

Table 1. Percentages of parody tweets correctly identified vs. incorrectly identified, in a feature space including word and POS bigrams.

Account            Trump   Douthat  Chait   Yglesias  Huckabee  Ioffe
@DungeonsDonald    45.7    19.5     11.6    11.8      9.93      1.49
@realDonalDrumpf   30.3    24.2     12.2    9.39      17.2      6.66
@realDonaldTrFan   27.4    17.9     7.95    21.2      21.3      4.25
@realNFLTrump      34.9    19.4     11.1    15.1      17.5      1.98
@Writeintrump      29.7, 27.2, 10.5, 9.24, 3.93 (only five of the six values survive in this transcript, so their column assignment is uncertain)

Two primary problems in the data for this experiment were clear throughout the process of gathering and compiling the results. The first is that certain Twitter accounts, namely @realDonaldTrump, are used by multiple authors with distinct stylistic voices. As late as April 2016, Trump maintained a separate mobile device from that of his staffers, and the tweets coming from his account in his distinctive style could be identified as his by the fact that they came from an Android device. His staffers' tweets were more promotional in nature and stylistically quite different from Trump's own. Since he assumed the presidency, however, both Trump and his staffers frequently tweet from iPhones, and there is no longer any way to distinguish them.

The second is the presence of 'bad' parodies in the data: parody accounts that mock Trump in another voice. @realDonalDrumpf, for example, merely lambasts Trump for his most recent gaffe and makes no attempt to adopt 'Trump voice', and some of the other accounts in the test set do the same to varying degrees.

On no feature space did the classifiers overcome the difficulties that these aspects of the data posed. The best feature spaces found had maximum correctness scores lower than 50 percent; the classifiers' failure to correctly identify Trump parody as Trump parody leaves us with only the null hypothesis.
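Each complete row of Table 1 sums to roughly 100, which suggests that the percentages are, for each parody account, the share of its tweets that the classifier attributed to each candidate author. A minimal sketch of that computation follows. The function name attribution_breakdown, the account handles and the predictions are hypothetical and are not taken from the poster.

```python
from collections import Counter, defaultdict


def attribution_breakdown(accounts, predictions):
    """Return, per source account, the percentage of its tweets that
    were attributed to each candidate author."""
    counts = defaultdict(Counter)
    for account, predicted_author in zip(accounts, predictions):
        counts[account][predicted_author] += 1

    breakdown = {}
    for account, counter in counts.items():
        total = sum(counter.values())
        breakdown[account] = {
            author: 100.0 * n / total for author, n in counter.items()
        }
    return breakdown


# Hypothetical toy data: the account each test tweet came from, and the
# author the classifier guessed for it.
accounts = ["@parodyA"] * 4 + ["@parodyB"] * 2
predictions = ["Trump", "Trump", "Douthat", "Chait", "Trump", "Huckabee"]

result = attribution_breakdown(accounts, predictions)
for account, shares in result.items():
    print(account, shares)
```

For a parody account, the share attributed to "Trump" plays the role of the "correctly identified" column, and the remaining shares are the breakdown of incorrect guesses.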
Introduction

An author with an immediately recognizable style generates immediately recognizable pastiches or, if the intent of the pastiches is sarcastic, immediately recognizable parodies. Authorship attribution is a natural language processing task that can use both the style and the content of an author's (attributed) body of work to determine whether an anonymous document was actually written by that author. An attempt at holistic author attribution needs to take non-content features into account when the content of the training corpus is disjoint or nearly disjoint from the content of the anonymous documents we are trying to attribute. We apply similar methods to parody, in which an author's style is emulated but the content (settings, characters) is changed.

In this experiment the corpus is the large and continually expanding group of similar texts published by Donald Trump. We also drew from five other pundits who have active Twitter accounts, and chose a handful of Trump parody accounts with various political perspectives, two of which largely tweet about non-political or tangentially political subjects in 'Trump voice'.

[Figure 1 panel labels: gamma=0.05, gamma=0.5, gamma=5; C=100]

Materials and Methods

We built a pipeline resembling one that might be used in a traditional support vector machine setup for author attribution. The pipeline cross-validates by withholding a validation set from the machine learning algorithm, then extracts features from the training set and classifies the items in the test set accordingly. The feature vectors are then reduced to two dimensions with t-SNE and plotted; before t-SNE, the dense feature vector matrix is either expanded or decomposed using scikit-learn's TruncatedSVD. Lastly, the pipeline displays the number of tweets in the test set, organized by actual author, that the classifier guesses correctly alongside the breakdown of incorrect guesses.
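The steps described above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the poster's actual code: the toy corpus and author labels are invented, the RBF kernel is assumed, gamma=0.5 and C=1000 are borrowed from the figure panel labels purely as example values, and the 5 SVD components and the perplexity of 5 are arbitrary. Only word bigrams are extracted here; POS bigrams would additionally need a tagger.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for tweets by two authors.
texts = [
    "the weather is great today", "the weather is great again",
    "it is great to be here", "great to be here today",
    "the show is great today", "it is a great show",
    "the weather is a great show", "great to be at the show",
    "we should consider the budget carefully",
    "the budget requires careful analysis",
    "we should consider careful analysis",
    "analysis of the budget is required",
    "we should analyze the budget",
    "careful analysis is required here",
    "the report requires careful analysis",
    "we should consider the report",
]
authors = ["author_a"] * 8 + ["author_b"] * 8

# Withhold part of the data; the classifier never sees it during fitting.
train_texts, test_texts, train_authors, test_authors = train_test_split(
    texts, authors, test_size=0.25, stratify=authors, random_state=0
)

# Word-bigram counts fed to an SVM (kernel and parameters are assumptions).
model = Pipeline([
    ("bigrams", CountVectorizer(ngram_range=(2, 2))),
    ("svm", SVC(kernel="rbf", gamma=0.5, C=1000)),
])
model.fit(train_texts, train_authors)
predicted = model.predict(test_texts)

# Decompose the sparse bigram matrix, then project it to 2-D with t-SNE.
features = model.named_steps["bigrams"].transform(texts)
reduced = TruncatedSVD(n_components=5, random_state=0).fit_transform(features)
embedding = TSNE(
    n_components=2, perplexity=5, init="random", random_state=0
).fit_transform(reduced)

print(list(zip(test_authors, predicted)))
print(embedding.shape)
```

Plotting is left out; each row of embedding is one text's 2-D coordinate and could be scattered with a color per author.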
The t-SNE plot and the per-author breakdown make it easier to visualize which authors are dimensionally "close" to other authors in the particular feature space being examined, and help to illustrate some of the deeper problems that emerge from the data itself.

Conclusion

The primary reason this experiment failed to produce a meaningful answer to the question was the data. Any improvement on the experiment must begin there, and will require finding a corpus by an author who writes in a single voice and has spun off as many parodies as Trump has. Additionally, our feature spaces contrasted style-sensitive approaches to parody identification with style-insensitive approaches. Future experimenters should consider looking further afield for features to extract, and may want to consider what idiosyncratic phrastic devices an author uses before considering what means could extract such features from the text.

Figure 1. Part-of-speech trigram t-SNE visualizations of the test set under different parameters. [Panel labels: C=1000, C=5000]

Figure 2. t-SNE visualizations of classifications using various feature spaces. From top to bottom: dependency relations, distance from dependency to head, word and POS. Left: bigrams. Right: trigrams.

Acknowledgement

I would like to thank Prof. Dragomir Radev for all his help in making this project succeed.