Twist: User Timeline Tweets Classifier
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer

• Auto-classify tweets on the user's timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
• Input: user timeline tweets
• Output: list of auto-classified tweets

• Twitter allows users to create custom Friend Lists based on user handles.

• Our application is a twist on this Twitter functionality: we auto-classify tweets on the user's timeline based only on the occurrence of terms in each tweet.

• Step 1: Data collection
• Step 2: Text mining
• Step 3: Creation of the training file for the library
• Step 4: Evaluation of several classifiers
• Step 5: Selecting the best classifier
• Step 6: Validating the classification
• Step 7: Tuning the parameters
• Step 8: Repeat until the classification is correct

• Remove special characters
• Tokenize
• Remove redundant letters in words
• Spell check
• Stemming
• Language identification
• Remove stop words
• Generate bigrams and change to lower case
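A few of the steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the team's code: the function names and the tiny stop-word list are assumptions, and spell check, stemming and language identification are left out because they need external resources.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is far longer).
STOP_WORDS = {"go", "such", "an", "a", "the", "is", "me", "to", "of"}


def preprocess(tweet):
    """Return cleaned, lower-case tokens for one tweet."""
    text = tweet.lower()
    # Remove special characters: keep only letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Remove redundant letters: collapse runs of 3+ identical letters to one.
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)
    # Tokenize on whitespace, then drop stop words and single-letter leftovers
    # (for example the "m" and "d" that remain from emoticons).
    return [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]


def bigrams(tokens):
    """Return adjacent token pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


tokens = preprocess(r"Go SF Giants! Such an amaazzzing feelin!!!! \m/ :D")
print(tokens)
print(bigrams(tokens))
```

Note that the regular expression only shortens "amaazzzing" to "amaazing"; turning that into "amazing" is the job of the spell-check step, which this sketch does not attempt.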

Original: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
After stop-word removal: SF Giants! amaazzzing feelin’!!!! \/ :D
After special-character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After stop-word removal: SF Giants amazing feel

• Logistic Regression Classifier
• Reasons:
  • Most popular linear classification technique for text classification
  • Ability to handle multiple categories with ease
  • Gave the best cross-validation accuracy and precision-recall score
• Library: LIBLINEAR for Python
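To make the idea of multi-category logistic regression concrete, here is a small hand-rolled one-vs-rest sketch in plain Python. It is a stand-in for illustration only and does not use LIBLINEAR; the function names, learning rate and epoch count are all assumptions.

```python
import math


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(rows, targets, n_features, epochs=200, lr=0.5):
    """Fit one binary logistic model with stochastic gradient steps.

    rows: list of sets of active (Boolean) feature indices, 1-based.
    targets: list of 0/1 labels, one per row.
    """
    w = [0.0] * (n_features + 1)  # w[0] is the bias term
    for _ in range(epochs):
        for active, y in zip(rows, targets):
            p = sigmoid(w[0] + sum(w[i] for i in active))
            err = y - p
            w[0] += lr * err
            for i in active:
                w[i] += lr * err
    return w


def train_one_vs_rest(rows, labels, n_features):
    """Train one binary model per category (that category vs. the rest)."""
    return {
        c: train_binary(rows, [1 if l == c else 0 for l in labels], n_features)
        for c in sorted(set(labels))
    }


def predict(models, active):
    """Return the category whose model gives the highest probability."""
    scores = {c: sigmoid(w[0] + sum(w[i] for i in active))
              for c, w in models.items()}
    return max(scores, key=scores.get)


# Toy data: features 1-3 appear in category 1, features 4-6 in category 2.
rows = [{1, 2}, {1, 3}, {4, 5}, {4, 6}]
labels = [1, 1, 2, 2]
models = train_one_vs_rest(rows, labels, n_features=6)
print(predict(models, {1, 2}), predict(models, {4, 6}))
```

A library solver would add regularization and a far more efficient optimizer; the sketch only shows how Boolean term features and per-category models fit together.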

Tokens: SF Giants amazing feel
Indexing: SF – 1, Giants – 2, amazing – 3, feel – 4
Boolean: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
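The indexing and Boolean encoding shown here can be sketched as follows. The helper names are hypothetical; the output follows the sparse `label index:value` line layout used in the example above.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_training_line(label, tokens, vocab):
    """Encode one tweet as 'label idx:1 idx:1 ...' with Boolean features."""
    # A set removes repeated terms; sorting keeps indices in ascending order.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


tweet = ["SF", "Giants", "amazing", "feel"]
vocab = build_vocabulary([tweet])
print(vocab)
print(to_training_line(1, tweet, vocab))
```

Terms that are missing from the vocabulary are skipped, which is one simple way to handle unseen words at prediction time.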

Andy, Marti & The Twitter Team

• Collected more than 2,000 tweets from the "Who to follow" interest lists on Twitter for "Sports" and "Business"
• Tweets were not purely "Sports" or "Business" related
• Personal messages were prominent
• Solution: compared tweets against a corpus of sports/business-related terms and assigned weights accordingly
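The slide does not say how the weights were computed. One simple possibility, shown purely as an assumption, is to score each tweet by the fraction of its tokens that appear in a category term list; the term list and function name below are hypothetical.

```python
# Hypothetical term list; the actual corpus used is not given in the slides.
SPORTS_TERMS = {"giants", "game", "score", "team", "coach"}


def term_weight(tokens, lexicon):
    """Return the fraction of tokens that occur in the category lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


print(term_weight(["sf", "giants", "amazing", "feel"], SPORTS_TERMS))
print(term_weight(["happy", "birthday", "mom"], SPORTS_TERMS))
```

A weight near zero would flag a likely personal message, which could then be down-weighted or dropped from the training set.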

• Noise in the data:
  ▪ Tweets are in inconsistent format
  ▪ Lots of meaningless words
  ▪ Misspellings
  ▪ More of individual expression
  ▪ For example: BAAAAAAAAAAAASSKEttt!!!! bskball, futball, %, :D, \m/, ^xoxo
  Solution: regular expressions and an NLP toolkit
• Different words, same root: playing, plays, playful → play
  Solution: stemming

• Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
• Comma-separated values of the categories that each tweet belongs to
• Accuracy here is 94%. Precision: 0.89, Recall: 0.89
• Experiment with different kernels for better accuracy
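For reference, accuracy, precision and recall can be computed from true and predicted category labels as in the sketch below. The labels in the usage lines are made up for illustration and are not the project's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single category label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = sports, 2 = finance.
y_true = [1, 1, 2, 2]
y_pred = [1, 2, 2, 2]
print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, label=1))
```

With four categories, per-category precision and recall are usually averaged to give a single pair of figures; the slides do not say which averaging was used.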

• Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
• Coding done in Python
• Database: sqlite3
• ML tool: LIBSVM
• Stemming: Porter's stemming
• NLP toolkit
