Published by Paris Wharff. Modified over 9 years ago.
1
Twist: User Timeline Tweets Classifier
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer
2
Auto-classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto-classified tweets
3
Twitter allows users to create custom Friend Lists based on the user handles.
4
Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.
5
Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until the classification is correct
6
▪ Remove special characters
▪ Tokenize
▪ Remove redundant letters in words
▪ Spell check
▪ Stemming
▪ Language identification
▪ Remove stop words
▪ Generate bigrams and change to lower case
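The cleaning steps listed above can be sketched in plain Python. This is a minimal sketch, not the team's code: the stop-word list, the names `squeeze_repeats` and `preprocess`, the rule of keeping at most two repeated letters, and the dropping of one-letter leftovers are all assumptions. Spell check, stemming and language identification are left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"go", "such", "an", "a", "the", "me", "is", "and"}

def squeeze_repeats(word, max_run=2):
    """Collapse runs of one letter longer than max_run, e.g. 'zzz' -> 'zz'."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)

def preprocess(tweet):
    """Return (unigrams, bigrams) for one tweet."""
    text = re.sub(r"[^A-Za-z\s]", " ", tweet)       # remove special characters
    tokens = text.lower().split()                   # lower-case and tokenize
    tokens = [squeeze_repeats(t) for t in tokens]   # remove redundant letters
    # Drop stop words and one-letter leftovers such as the 'm' from \m/.
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens, bigrams
```

A spell checker would still be needed to turn a token such as "amaazzing" into "amazing"; squeezing repeated letters only narrows the gap.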
7
Original tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
After stop-word removal: SF Giants! amaazzzing feelin’!!!! \/ :D
After special-character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After stop-word removal: SF Giants amazing feel
8
Logistic Regression Classifier
Reasons:
▪ One of the most popular linear techniques for text classification
▪ Handles multiple categories with ease
▪ Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python
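To make the idea of a multi-category logistic regression concrete, here is a minimal one-vs-rest sketch in plain Python over boolean term features. It does not call LIBLINEAR and is not the team's code; the function names, the learning rate and the number of epochs are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_one_vs_rest(samples, labels, n_features, epochs=200, lr=0.5):
    """samples: list of sets of active feature indices (0-based, boolean).
    Returns {label: (weights, bias)}, one binary logistic model per label."""
    models = {}
    for target in sorted(set(labels)):
        w, b = [0.0] * n_features, 0.0
        for _ in range(epochs):
            for feats, y in zip(samples, labels):
                p = sigmoid(b + sum(w[j] for j in feats))
                g = p - (1.0 if y == target else 0.0)  # gradient of the log loss
                for j in feats:
                    w[j] -= lr * g
                b -= lr * g
        models[target] = (w, b)
    return models

def predict(models, feats):
    """Pick the label whose binary model gives the highest probability."""
    def prob(label):
        w, b = models[label]
        return sigmoid(b + sum(w[j] for j in feats))
    return max(models, key=prob)
```

One binary model per category keeps the sketch short; a library such as LIBLINEAR adds regularization and much faster solvers on top of the same linear model.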
9
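Indexing
Cleaned tweet: SF Giants amazing feel
Term index: SF – 1, Giants – 2, amazing – 3, feel – 4
Boolean features, index (value): SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Training input for the SVM (class label, then index:value pairs): 1 1:1 2:1 3:1 4:1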
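The indexing step above can be sketched as follows: each distinct term gets a 1-based index, and a tweet becomes one line of the form `label index:1 ...`, the sparse text format read by LIBLINEAR and LIBSVM. The helper names `build_index` and `to_sparse_line` are hypothetical.

```python
def build_index(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            index.setdefault(tok, len(index) + 1)
    return index

def to_sparse_line(label, tokens, index):
    """Format one tweet as '<label> <index>:1 ...' with boolean features.

    Indices are written in ascending order; terms missing from the index
    (for example, unseen at training time) are skipped.
    """
    present = sorted({index[t] for t in tokens if t in index})
    return " ".join([str(label)] + ["%d:1" % i for i in present])
```

For example, with label 1 for sports, `to_sparse_line(1, ["SF", "Giants", "amazing", "feel"], index)` reproduces the training line shown on the slide.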
11
Andy, Marti & The Twitter Team
13
Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Problems:
▪ Tweets were not purely “Sports” or “Business” related
▪ Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
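The slide does not say how the weights were computed. One plausible reading, sketched below purely as an assumption, is to score each tweet by the share of its tokens that appear in a category term list and to keep only tweets above a threshold. The term lists, the threshold and the function names are hypothetical.

```python
# Hypothetical term lists; the actual corpus used by the team is not given.
SPORTS_TERMS = {"giants", "game", "score", "team", "win"}
BUSINESS_TERMS = {"market", "stock", "shares", "profit", "bank"}

def category_weight(tokens, terms):
    """Fraction of a tweet's tokens that occur in the category term list."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in terms) / len(tokens)

def filter_on_topic(tweets, terms, threshold=0.25):
    """Keep tokenized tweets whose weight reaches the (assumed) threshold."""
    return [t for t in tweets if category_weight(t, terms) >= threshold]
```

A personal message with no sports or business vocabulary gets weight 0 and drops out of the training set under this scheme.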
14
Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example, BAAAAAAAAAAAASSKEttt!!!! bskball, futball, %, :D, \m/, ^xoxo
Solution: Regular expressions and NLP toolkit
Different words, same root:
▪ Playing, plays, playful - play
Solution: Stemming
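To illustrate what stemming does to the "playing, plays, playful" example, here is a deliberately crude suffix stripper. It is not Porter's algorithm, which applies a much larger, ordered rule set; the suffix list, the minimum stem length and the name `crude_stem` are assumptions.

```python
# Illustrative suffixes only; Porter's stemmer uses many more rules.
SUFFIXES = ("ing", "ful", "ed", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

After stemming, all three variants map to the same feature, so the classifier sees one term instead of three sparse ones.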
16
Input: mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Output: comma-separated values of the categories that each tweet belongs to
Accuracy here is 94%. Precision: 0.89. Recall: 0.89.
Next: experiment with different kernels for better accuracy
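Accuracy, precision and recall for a four-category classifier can be computed from the true and predicted labels as sketched below. The slides do not say how precision and recall were averaged across categories; macro-averaging (the unweighted mean over categories) is an assumption here, as is the function name `evaluate`.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, macro-averaged precision, macro-averaged recall)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)  # correctly labeled c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly labeled c
        fn = sum(t == c and p != c for t, p in pairs)  # missed members of c
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)
```

Accuracy can sit well above precision and recall when the categories are unevenly represented, which is one reason to report all three.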
17
Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s stemming
NLP toolkit