Detecting and Classifying Duplicate Tweets
Thomas Wack
Agenda
- Introduction
- Related Works
- Methods
- Challenges
- Initial Results
- Future Plans
- Q&A
Introduction
The goals of the project are two-fold:
- Detect duplicate tweets from a data set
- Classify the duplicate tweets based on how similar they are
Goal 1
The duplicate detection portion is broken into two phases:
- Phase 1 works on a static data set: the algorithm sorts through pre-existing sets of tweets and attempts to detect the duplicate tweets contained within them
- Phase 2 works on a dynamic data set: the algorithm takes pre-existing sets of tweets and periodically updates them with new tweets before proceeding to detect any new duplicates that were added
Goal 2
The classification scheme used in this project is based on the five classes from the Groundhog Day paper (illustrated in the sketch below):
- Exact Copy
- Nearly Exact Copy
- Strong Near Duplicate
- Weak Near Duplicate
- Weak Overlap
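As a rough illustration, here is a minimal sketch of mapping a pairwise similarity score to these five classes. The single [0, 1] score and the threshold values are illustrative assumptions, not the classifiers used in the paper or in this project.

```python
# Hypothetical sketch: map a pairwise similarity score in [0, 1] to one of
# the five Groundhog Day duplicate classes. The thresholds below are
# illustrative placeholders, not the paper's or the project's actual values.

def classify_pair(similarity: float) -> str:
    """Return a duplicate class label for a tweet pair's similarity score."""
    if similarity == 1.0:
        return "Exact Copy"
    elif similarity >= 0.95:
        return "Nearly Exact Copy"
    elif similarity >= 0.8:
        return "Strong Near Duplicate"
    elif similarity >= 0.6:
        return "Weak Near Duplicate"
    else:
        return "Weak Overlap"
```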
Related Works
Groundhog Day: Near-Duplicate Detection on Twitter by Tao et al.
- Uses three different categories of classifiers to detect duplicate data
- Uses five different classes to cluster duplicate tweets that are detected
Method
Data collection:
- Tweets about the Boston Marathon bombing, from Apollo
The script:
- Load the data into the program
- Take two tweets and run a comparison check on them
- Depending on their level of duplication, add them to the correct cluster
Method (diagram)
Twitter data (stored locally) -> Python script loads Tweet A and Tweet B -> compare the tweet pair -> assign to a cluster
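A minimal sketch of the pipeline above, assuming tweets are plain text strings in a local JSON file; the file name, the character-level similarity measure, and the 0.8 threshold are illustrative placeholders rather than the project's actual code.

```python
import json
from itertools import combinations
from difflib import SequenceMatcher

# Hypothetical sketch of the pipeline: load locally stored tweets, compare
# every pair, and group sufficiently similar tweets into clusters. The file
# name, similarity measure, and clustering rule are placeholders.

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two tweet texts."""
    return SequenceMatcher(None, a, b).ratio()

with open("boston_marathon_tweets.json") as f:
    tweets = json.load(f)  # assumed: a list of tweet text strings

clusters: list[list[str]] = []
for tweet_a, tweet_b in combinations(tweets, 2):
    if similarity(tweet_a, tweet_b) >= 0.8:  # placeholder duplicate threshold
        # Add the pair to an existing cluster if either tweet is already there.
        for cluster in clusters:
            if tweet_a in cluster or tweet_b in cluster:
                cluster.extend(t for t in (tweet_a, tweet_b) if t not in cluster)
                break
        else:
            clusters.append([tweet_a, tweet_b])
```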
Challenges
Weak Near Duplicates:
- Detecting similar core messages can be fairly easy
- Detecting differing personal views can be quite difficult
Weak Overlap:
- While detecting similar core messages CAN be fairly easy, it is made more difficult when the words making up the message are almost all different (illustrated in the sketch below)
Inefficiency:
- Comparing every pair of tweets grows quadratically with the size of the data set
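To make the Weak Overlap challenge concrete, here is a small sketch (not the project's method) using token-level Jaccard similarity: two tweets with the same core message but almost no shared words score near zero.

```python
# Hypothetical illustration of the Weak Overlap challenge: token-level
# Jaccard similarity scores near zero for tweets that share a core message
# but almost none of the same words.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two tweets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

score = jaccard(
    "Explosions reported at the Boston Marathon finish line",
    "Two blasts went off near the end of today's race in Boston",
)
print(score)  # ~0.11: a low score despite the similar core message
```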
Initial Results
Exact Copies:
- Classifies these perfectly
- Typos could lead to a wrong classification (illustrated below)
Nearly Exact Copies:
- Classifies these very well
Strong Near Duplicates:
- Classifies these very well
- Problems arise from trying to distinguish between a tweet just adding more detail and it expressing a differing personal view
Weak Near Duplicates:
- Not finished
Weak Overlap:
- Not finished
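To illustrate the typo issue for Exact Copies: under strict string equality, a simplified stand-in for whatever check the project uses, a single dropped character demotes an otherwise identical pair.

```python
# Illustration of the Exact Copy typo issue: strict string equality
# (a simplified stand-in for the project's check) fails on a one-character typo.
original = "Prayers for everyone at the Boston Marathon"
typo_copy = "Prayers for everyone at the Boston Marathn"
print(original == typo_copy)  # False: no longer counted as an "Exact Copy"
```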
Future Plans
Wrap up Phase 1:
- Tighten up Weak Near Duplicate and Weak Overlap classification
Phase 2!
- Gather Twitter data and put it in Amazon Web Services S3
- Run the initial detection and classification
- Rerun these steps at a set, periodic interval (every half hour?), as sketched below
- Observe how the clusters change based on the new information that is coming in
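A minimal sketch of the Phase 2 loop, assuming the tweet set is stored as a JSON list in S3; the bucket name, object key, half-hour interval, and the detect_and_classify() stub are hypothetical placeholders.

```python
import json
import time

import boto3

# Hypothetical sketch of the Phase 2 loop described above. The bucket name,
# object key, 30-minute interval, and detect_and_classify() stub are all
# placeholders, not the project's actual configuration.

s3 = boto3.client("s3")

def load_tweets(bucket: str, key: str) -> list[str]:
    """Fetch the current tweet set, stored as a JSON list of strings in S3."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())

def detect_and_classify(tweets: list[str]) -> list[list[str]]:
    """Stand-in for the Phase 1 detection and classification code."""
    return []  # plug the Phase 1 pairwise comparison in here

while True:
    tweets = load_tweets("tweet-duplicates-demo", "tweets.json")
    clusters = detect_and_classify(tweets)
    print(f"{len(tweets)} tweets -> {len(clusters)} clusters")
    time.sleep(30 * 60)  # rerun detection every half hour
```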
Future Method (diagram)
Twitter -> Twitter data (stored in AWS) -> Python script loads Tweet A and Tweet B -> compare the tweet pair -> assign to a cluster
Q&A
Any questions?