Detecting and Classifying Duplicate Tweets

Detecting and Classifying Duplicate Tweets Thomas Wack

Agenda
Introduction, Related Works, Methods, Challenges, Initial Results, Future Plans, Q&A

Introduction
The goals of the project are two-fold:
1. Detect duplicate tweets in a data set.
2. Classify the duplicate tweets based on how similar they are.

Goal 1
The duplicate detection portion is broken into two phases.
The first phase is based on a static data set. This algorithm will sort through pre-existing sets of tweets and attempt to detect the duplicate tweets contained within them.
The second phase is based on a dynamic data set. This algorithm will take pre-existing sets of tweets and periodically update them with new tweets before proceeding to detect any new duplicates that were added.
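The second phase can be sketched as an incremental update: each incoming tweet is compared only against tweets already seen, so pairs inside the pre-existing set are not rechecked. This is a minimal sketch under assumptions. The names `detect_new_duplicates` and `same_text`, and the whitespace- and case-insensitive comparison, are hypothetical stand-ins for whatever comparison the project actually uses.

```python
def same_text(a, b):
    """Placeholder comparison (an assumption): equal after lowercasing and collapsing whitespace."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def detect_new_duplicates(existing, incoming, is_duplicate=same_text):
    """Add incoming tweets to the data set and report only the newly found duplicate pairs.

    Returns (new_pairs, updated_tweets). Pairs made up solely of tweets from
    `existing` are never compared again.
    """
    seen = list(existing)
    new_pairs = []
    for tweet in incoming:
        for old in seen:
            if is_duplicate(old, tweet):
                new_pairs.append((old, tweet))
        seen.append(tweet)  # later incoming tweets are also compared against this one
    return new_pairs, seen


pairs, tweets = detect_new_duplicates(
    existing=["breaking news", "hello world"],
    incoming=["Breaking  News", "something else", "Something else"],
)
```

The update can be repeated by passing the returned `tweets` back in as `existing` on the next run.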

Goal 2
The classification used for the project is based on the five classes from the Groundhog Day paper:
Exact Copy
Nearly Exact Copy
Strong Near Duplicate
Weak Near Duplicate
Weak Overlap

Related Works
Groundhog Day: Near-Duplicate Detection on Twitter, by Tao et al.
Uses three different categories of classifiers to detect duplicate data.
Uses five different classes to cluster the duplicate tweets that are detected.

Method
Data Collection: Tweets about the Boston Marathon Bombing from Apollo.
The Script:
1. Load the data into the program.
2. Take two tweets and run a comparison check on them.
3. Depending on their level of duplication, add them to the correct cluster.

Method (diagram)
Twitter Data (stored locally) -> Python Script: Load -> Tweet A, Tweet B -> Compare Tweet Pair -> Cluster
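The load, compare, and cluster steps of the script could look roughly like the sketch below. It is not the presenter's code. The word-overlap (Jaccard) score and every threshold in `classify_pair` are assumptions chosen only to make the example concrete; the slides do not say how the five classes are decided, and classes that depend on meaning or personal views cannot really be captured by word overlap alone.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased word tokens of a tweet."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Share of distinct words the two tweets have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def classify_pair(a, b):
    """Return a class label for the pair, or None if the tweets look unrelated.

    All thresholds are hypothetical placeholders, not values from the slides.
    """
    if a == b:
        return "Exact Copy"
    score = jaccard(a, b)
    if score >= 0.9:
        return "Nearly Exact Copy"
    if score >= 0.6:
        return "Strong Near Duplicate"
    if score >= 0.4:
        return "Weak Near Duplicate"
    if score >= 0.2:
        return "Weak Overlap"
    return None


def cluster_pairs(tweets):
    """Compare every pair of tweets and group the pairs by class label."""
    clusters = defaultdict(list)
    for a, b in combinations(tweets, 2):
        label = classify_pair(a, b)
        if label is not None:
            clusters[label].append((a, b))
    return dict(clusters)


sample = [
    "Explosion at the Boston Marathon",
    "Explosion at the Boston Marathon",
    "explosion at the boston marathon!",
    "Lovely weather today",
]
clusters = cluster_pairs(sample)
```

Because `cluster_pairs` visits every pair, the number of comparisons grows quadratically with the number of tweets.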

Challenges
Weak Near Duplicates: Detecting similar core messages can be fairly easy. Detecting different personal views can be quite difficult.
Weak Overlap: While detecting similar core messages CAN be fairly easy, it is made more difficult when the words making up the message are almost all different.
Inefficiency
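The Weak Overlap difficulty can be illustrated with a small sketch. The two example tweets are made up for illustration, and `shared_words` and `overlap_ratio` are hypothetical helpers, not part of the project. The point is only that two tweets can carry the same core message while sharing almost no words.

```python
import re


def words(text):
    """Lowercased word tokens of a tweet."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_words(a, b):
    """Words that appear in both tweets."""
    return words(a) & words(b)


def overlap_ratio(a, b):
    """Shared words divided by all distinct words across both tweets."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


# Invented example pair: same event, almost entirely different wording.
tweet_a = "Two bombs went off near the finish line in Boston"
tweet_b = "Explosions reported at the marathon end"

common = shared_words(tweet_a, tweet_b)
ratio = overlap_ratio(tweet_a, tweet_b)
```

Here the only shared word is a stop word, so any purely lexical comparison would treat the pair as unrelated even though a human reader would not.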

Initial Results
Exact Copies: Classifies these perfectly.
Nearly Exact Copies, Strong Near Duplicate: Classifies these very well.
Weak Near Duplicate, Weak Overlap: Not finished.
Notes:
Exact Copies: Typos could lead to wrong classification.
Near Exact Copies:
Strong Near Duplicates: Problems arise from trying to distinguish between just adding more detail and there being differing personal views.

Future Plans
Wrap up Phase 1: Tighten up Weak Near Duplicate and Weak Overlap.
Phase 2!
1. Gather Twitter data and put it in Amazon Web Services S3.
2. Run initial detection and classification.
3. Rerun these steps at a set, periodic time (every half hour?).
4. Observe how the clusters change based on the new information that is coming in.
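The periodic rerun in Phase 2 might be organised along the lines of the sketch below. Everything here is an assumption: `fetch_new` stands in for whatever pulls fresh tweets (the S3 access is deliberately left out), `classify_all` stands in for the detection and classification step, and the 1800-second default simply mirrors the "every half hour?" idea on the slide.

```python
import time


def run_periodically(fetch_new, classify_all, cycles, interval_seconds=1800, sleep=time.sleep):
    """Rerun detection on a growing data set and record cluster sizes per cycle.

    fetch_new()       -> list of newly gathered tweets (hypothetical data source)
    classify_all(all) -> dict mapping class label to its cluster members
    Returns a list with one {label: cluster size} snapshot per cycle.
    """
    tweets = []
    history = []
    for cycle in range(cycles):
        tweets.extend(fetch_new())
        clusters = classify_all(tweets)
        history.append({label: len(members) for label, members in clusters.items()})
        if cycle < cycles - 1:
            sleep(interval_seconds)  # wait before gathering the next batch
    return history


# Stub data source and classifier so the sketch runs without Twitter or AWS.
batches = iter([["a", "a"], ["a", "b"]])
waits = []
history = run_periodically(
    fetch_new=lambda: next(batches),
    classify_all=lambda all_tweets: {
        "Exact Copy": [t for t in all_tweets if all_tweets.count(t) > 1]
    },
    cycles=2,
    sleep=waits.append,  # record the wait instead of actually sleeping
)
```

Comparing consecutive snapshots in `history` is one simple way to observe how the clusters change as new tweets come in.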

Future Method (diagram)
Twitter -> Twitter Data (stored in AWS) -> Python Script: Load -> Tweet A, Tweet B -> Compare Tweet Pair -> Cluster

Q&A Any Questions?