The Correlation between the Topic and Emotion of Tweets through Machine Learning
Vincent Fiore, Ange Assoumou, Debarshi Dutta, Kenneth Almodovar
http://www.csis.pace.edu/~ctappert/it691-17spring/projpresentationsmid.htm
http://www.csis.pace.edu/~ctappert/it691-17spring/projpresentationsfin.htm
http://www.csis.pace.edu/~ctappert/it691-17spring/

Our Project
Classifying tweets based on topic and emotion.
Topics: Religion, Politics, Family
Emotions: Happy, Sad, Angry

Purpose
Searching for correlations between a tweet's topic and its emotion. We use machine learning to classify tweets along both dimensions, then plot the resulting labels against each other.
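A minimal sketch of that final plotting step, assuming every tweet has already received one topic label and one emotion label; the labeled_tweets data below is a hypothetical stand-in:

```python
# Cross-tabulate topic vs. emotion labels to look for correlations.
from collections import Counter

# Hypothetical (topic, emotion) labels, one pair per classified tweet.
labeled_tweets = [
    ("politics", "angry"),
    ("politics", "sad"),
    ("religion", "happy"),
    ("family", "happy"),
]

# Count how often each (topic, emotion) pair occurs.
pairs = Counter(labeled_tweets)
for (topic, emotion), count in sorted(pairs.items()):
    print(f"{topic:>8} x {emotion:<6}: {count}")
```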

Background
Many papers have been written on similar topics; we used these to get an idea of where to start our research. We performed a literature review in each of these areas to develop a better background, gathering:
- Ideas on preprocessing tweets
- Approaches using word lists
- Edge cases to keep in mind and avoid

Methodology
- Manually categorize tweets.
- Create large word lists, used to build the feature set that trains the ML algorithm. These features are the raw data fed into the algorithm: they turn tweets into numbers (see the sketch below).
- Confirm that the training has worked.
- Classify tweets beyond the original data set.
- Plot these new classifications to search for correlations.
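A minimal sketch of the word-list feature step, with NLTK's Naive Bayes classifier standing in for whatever learner the project actually used; the word lists and training tweets here are tiny hypothetical stand-ins:

```python
# Turn tweets into numeric features via word lists, then train a
# classifier (NLTK Naive Bayes as a stand-in for the real learner).
import nltk

# Hypothetical, tiny word lists; the real ones held hundreds of words each.
WORD_LISTS = {
    "politics": {"vote", "senate", "election"},
    "religion": {"pray", "faith", "church"},
    "family":   {"mom", "kids", "home"},
}

def features(tweet):
    """Count how many words from each list appear in the tweet."""
    words = set(tweet.lower().split())
    return {topic: len(words & wl) for topic, wl in WORD_LISTS.items()}

# Hypothetical manually classified training tweets.
train = [("go vote in the election", "politics"),
         ("pray for my church group", "religion"),
         ("home with mom and the kids", "family")]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(t), label) for t, label in train])

print(classifier.classify(features("remember to vote today")))
```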

Manual Classification and Word Lists
Manual classification was an incredibly difficult process due to the nature of the topics. Emotion is easy to label, but topics can be vague; politics, for example, has changed wildly since this data set was created in 2009.
Word lists are also challenging: the number of words on a list must be balanced against how often those words occur. In one case, one emotion's list was much larger than the others, so almost every tweet was preliminarily marked with that emotion. This had to be fixed (one possible fix is sketched below).
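One plausible way to correct that imbalance is to normalize each list's hit count by the list's size, so a longer list does not automatically win; this is an illustration of the idea, not necessarily the fix the project applied:

```python
# Normalize word-list hits by list length so larger lists don't
# dominate the preliminary labeling (hypothetical word lists).
WORD_LISTS = {
    "happy": {"glad", "joy", "smile"},
    "sad":   {"cry", "tears"},
    # Imagine "angry" holding far more words than the other lists.
    "angry": {"mad", "rage", "fury", "hate", "fume", "irate"},
}

def normalized_scores(tweet):
    """Score each emotion as (matching words) / (list size)."""
    words = set(tweet.lower().split())
    return {emotion: len(words & wl) / len(wl)
            for emotion, wl in WORD_LISTS.items()}

scores = normalized_scores("so mad and full of rage today")
print(max(scores, key=scores.get), scores)
```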

Implementation
Using the manual classification, we created six different word lists, one per category:
- Politics
- Religion
- Family
- Anger
- Happy
- Sadness
Each of these contains hundreds of words.

Implementation (continued)
[Figure 1.2: an example of the word lists.]
Our biggest list contained more than 600 words.

Implementation (continued)
[Figure 1.3: the rate of words per tweet.]

Results
Each classifier, on its own, scored roughly 90% accuracy on the test data set. Measured against our pre-classified tweets, the algorithm was roughly three times more accurate than random guessing (33%, i.e., randomly choosing one of the three categories).
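A minimal sketch of that evaluation step, comparing predictions against the manually pre-classified labels; the prediction function and held-out tweets below are hypothetical:

```python
# Measure accuracy against manually pre-classified tweets and
# compare it to the 1/3 random baseline (hypothetical data).
def predict_topic(tweet):
    # Stand-in for the trained classifier from the sketch above.
    return "politics" if "vote" in tweet else "religion"

test_set = [("go vote today", "politics"),
            ("pray with me", "religion"),
            ("kids at home", "family")]

correct = sum(predict_topic(t) == label for t, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy: {accuracy:.0%} (random baseline: {1/3:.0%})")
```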

Results
[Figures: classification of Donald Trump's tweets and of Pope Francis's tweets.]

Results
There were clear trends for individual users. For the most part:
- The President's tweets were political and either sad or angry.
- Religious leaders' tweets were religious and happy; the Dalai Lama, for example, was overwhelmingly positive.

Results
There is a need for further study: the classifier worked best with users whose tweets fell into a clear category.
[Figure, left: Nancy Pelosi's tweets as classified. They are usually political, but show a large number of religious tweets.]

Results
Religious tweets are over-represented. Tweets that did not clearly fit the political or family categories tended to be categorized as religious, which means certain users' tweets are over-classified as religious. Due to the nature of machine learning, tracking down this issue proved too problematic.
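One way to make that kind of bias visible is a simple confusion tally between true and predicted topics, sketched below; the label pairs here are hypothetical:

```python
# Tally a confusion matrix to surface over-classification of one
# label (hypothetical (true, predicted) topic pairs).
from collections import Counter

pairs = [("politics", "politics"), ("family", "religion"),
         ("religion", "religion"), ("family", "religion"),
         ("politics", "religion")]

confusion = Counter(pairs)
for (true, pred), n in sorted(confusion.items()):
    print(f"true={true:<8} predicted={pred:<8} count={n}")

# How often did the classifier say "religion" when the truth was not?
false_religion = sum(n for (t, p), n in confusion.items()
                     if p == "religion" and t != "religion")
print("false 'religion' predictions:", false_religion)
```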

Results
Tweets from Dr. Phil (figure, left) illustrate the issue of over-classification of religion. However, the emotion behind each tweet is still classified correctly.

Conclusion
It proved difficult to extrapolate the results to a universal data set. Most users tend to tweet in their own specific style, and for the most part that style does not transcend all users and topics. Certain pairings, like religion and sadness, do show a correlation, but overall, emotion and topic do not show an obvious link. The link exists only for individual users and their own style of tweeting: politicians speak about politics and religious leaders about religion, and emotion is even more specific in its relation to each individual.

Further study
Further study needs to focus on gathering much larger samples of tweets; this will help tease out any real trends between the two subjects on Twitter at large. The classifier also needs further development time. On its own, each classifier scored roughly 90% accuracy, but this dropped once the classifiers were combined: assuming independent errors, the error rates multiply, so two 90%-accurate classifiers label both topic and emotion correctly only about 0.9 × 0.9 = 81% of the time, producing errors like the religion issue we encountered.