Team: Priya Iyer, Vaidy Venkat, Sonali Sharma. Mentor: Andy Schlaikjer. Twist: User Timeline Tweets Classifier.


• Auto-classify tweets on the user's timeline into four predefined categories: Sports, Finance, Entertainment, Technology
• Input: user timeline tweets
• Output: list of auto-classified tweets

• Twitter allows users to create custom Friend Lists based on user handles.

• Our application is a twist on this functionality: we auto-classify the tweets on the user's timeline based only on the occurrence of terms in each tweet.

• Step 1: Data collection
• Step 2: Text mining
• Step 3: Creation of the training file for the library
• Step 4: Evaluation of several classifiers
• Step 5: Selecting the best classifier
• Step 6: Validating the classification
• Step 7: Tuning the parameters
• Step 8: Repeat until the classification is correct

• Remove special characters
• Tokenize
• Remove redundant letters in words
• Spell check
• Stemming
• Language identification
• Remove stop words
• Generate bigrams and change to lower case

Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
→ (stop words) SF Giants! amaazzzing feelin’!!!! \/ :D
→ (special chars) SF Giants amaazzzing feelin
→ (spell check) SF Giants amazing feeling
→ (stemming) SF Giants amazing feel me
→ (stop words) SF Giants amazing feel
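Part of the pre-processing chain can be sketched with the standard library alone. This is a minimal illustration, not the team's code: the stop-word list, the one-letter token filter, and the function names `preprocess` and `bigrams` are assumptions, and the spell-check, stemming, and language-identification steps are left out because they need external resources.

```python
import re

# Hypothetical, tiny stop-word list; a real list would be much larger.
STOP_WORDS = {"go", "such", "an", "a", "the", "me"}


def preprocess(tweet):
    """Return cleaned, lower-cased tokens for one tweet."""
    # Keep ASCII letters and whitespace only (drops punctuation and emoticons).
    text = re.sub(r"[^A-Za-z\s]", " ", tweet)
    tokens = text.lower().split()
    # Collapse runs of three or more identical letters down to one letter.
    tokens = [re.sub(r"(.)\1{2,}", r"\1", t) for t in tokens]
    # Drop stop words and one-letter leftovers such as the "m" from "\m/".
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))


tokens = preprocess(r"Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D")
print(tokens)           # ['sf', 'giants', 'amaazing', 'feelin']
print(bigrams(tokens))
```

The output still contains "amaazing" and "feelin": regular expressions only remove the obvious noise, which is why the slides follow up with a spell check and a stemmer.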

• Logistic Regression Classifier
• Reasons:
  ▪ Most popular linear classification technique for text classification
  ▪ Ability to handle multiple categories with ease
  ▪ Gave the best cross-validation accuracy and precision-recall score
• Library: LIBLINEAR for Python
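Cross-validation accuracy is obtained by repeatedly holding out one slice of the labelled tweets for testing and training on the rest. The slides do not say how many folds were used or how the data was split, so the sketch below assumes five contiguous folds; `k_fold_indices` is a hypothetical helper, not a LIBLINEAR function.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(10, k=5):
    print(len(train), test)
```

Each candidate classifier would be trained on `train` and scored on `test` for every fold, and the scores averaged before comparing classifiers. In practice the tweets should be shuffled first so that each fold contains all four categories.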

Input: SF Giants amazing feel
Indexing: SF – 1, Giants – 2, amazing – 3, feel – 4
Occurrences: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Boolean training input for the SVM: 1 1:1 2:1 3:1 4:1
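The training line above follows the sparse `label index:value` text format read by LIBSVM and LIBLINEAR, in which feature indices must appear in ascending order. A minimal sketch of the indexing step might look as follows; the helper names `build_vocabulary` and `to_sparse_line` are assumptions made for illustration.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            vocab.setdefault(term, len(vocab) + 1)
    return vocab


def to_sparse_line(label, tokens, vocab):
    """Format one tweet as 'label index:1 ...' using boolean features."""
    # A set gives presence/absence; sorting keeps the indices ascending.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


tweet = ["SF", "Giants", "amazing", "feel"]
vocab = build_vocabulary([tweet])
print(to_sparse_line(1, tweet, vocab))  # 1 1:1 2:1 3:1 4:1
```

Because the features are boolean, a term that occurs twice in a tweet still produces a single `index:1` entry.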

Andy, Marti & The Twitter Team

• Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
• Tweets were not purely “Sports” or “Business” related
• Personal messages were prominent
• Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
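One simple way to weight a tweet against a topic term list is the fraction of its tokens that appear in the list. The slides do not give the actual weighting scheme, so everything below is an assumption: the term list, the 0.25 threshold, and the helper names `topic_weight` and `filter_on_topic`.

```python
# Hypothetical term list; the real corpus of sports terms is not in the slides.
SPORTS_TERMS = {"giants", "game", "score", "team", "coach"}


def topic_weight(tokens, topic_terms):
    """Fraction of a tweet's tokens that appear in the topic term list."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in topic_terms)
    return hits / len(tokens)


def filter_on_topic(tweets, topic_terms, threshold=0.25):
    """Keep tweets whose topic weight reaches the (assumed) threshold."""
    return [tw for tw in tweets
            if topic_weight(tw.split(), topic_terms) >= threshold]


print(topic_weight(["SF", "Giants", "amazing", "feel"], SPORTS_TERMS))  # 0.25
print(filter_on_topic(["Giants win the game", "happy birthday mom"],
                      SPORTS_TERMS))
```

A filter of this kind would discard the personal messages that dominated the collected lists while keeping tweets that mention at least some topic vocabulary.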

• Noise in the data:
  ▪ Tweets are in inconsistent format
  ▪ Lots of meaningless words
  ▪ Misspellings
  ▪ More of individual expression
  ▪ For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo
  Solution: Regular expressions and NLP toolkit
• Different words, same root: playing, plays, playful → play
  Solution: Stemming
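The idea behind stemming can be shown with a naive suffix stripper. This is deliberately not the Porter algorithm named later in the slides, which applies a much larger, ordered set of rules; the suffix list and the minimum stem length of three letters are assumptions chosen to cover the example above.

```python
# Illustrative suffixes only; this is not the Porter rule set.
SUFFIXES = ("ing", "ful", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["Playing", "plays", "playful"]])
# ['play', 'play', 'play']
```

After stemming, the three surface forms map to one feature index, which shrinks the vocabulary and lets the classifier share evidence across them.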

• Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
• Comma-separated values of the categories for each tweet
• Accuracy here is 94%. Precision: 0.89, Recall: 0.89
• Experiment with different kernels for better accuracy
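Accuracy, precision, and recall can be computed directly from the true and predicted category codes. The slides do not say whether the reported precision and recall are per category or averaged, so this sketch computes them for one category at a time; the labels in the usage example are made up and are not the project's results.

```python
def accuracy(y_true, y_pred):
    """Share of tweets whose predicted category matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single category code."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # tweets assigned to label
    actual = sum(t == label for t in y_true)     # tweets truly in label
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Made-up labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 4]
print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, label=2))
```

Averaging the per-category values over the four categories gives a single macro-averaged precision and recall.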

• Category-based tweets from
• Coding done in Python
• Database: sqlite3
• ML tool: LIBSVM
• Stemming: Porter's stemming
• NLP toolkit