Project Deliverable-1
-Prof. Vincent Ng
-Girish Ramachandran
-Chen Chen
-Jitendra Mohanty
Agenda
Pre-processing of tweets
Research literature studied and motivation
Plans for the next 2 weeks
Pre-processing
Tasks completed:
Parsed all the files provided by Raytheon and extracted ~18GB of tweets.
The tweets do not have metadata associated with them for the time being.
Tweets containing non-ASCII characters or new-line characters are discarded.
–The POS tagger stopped processing when it reached tweets containing these characters.
Tasks to be addressed:
Approximately 2 weeks to run POS tagging, chunking, and NER on all the tweets currently at our disposal.
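The discard rule above can be sketched in a few lines of Python. This is only an illustration, not the script the team ran: the function names `is_clean` and `filter_tweets` are hypothetical, and the sketch assumes each tweet is already available as a string.

```python
def is_clean(tweet: str) -> bool:
    """Return True if the tweet is pure ASCII and contains no line breaks."""
    return tweet.isascii() and "\n" not in tweet and "\r" not in tweet


def filter_tweets(tweets):
    """Keep only the tweets that pass is_clean; all others are discarded."""
    return [t for t in tweets if is_clean(t)]


# Hypothetical sample: one clean tweet, one with a non-ASCII character,
# one with an embedded new-line.
sample = ["hello world", "caf\u00e9 time", "line one\nline two"]
kept = filter_tweets(sample)
```

Dropping such tweets is the simplest option; normalizing or escaping the offending characters would retain more data at the cost of extra pre-processing.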
Research Literature Studied
Several research papers have been studied to get an idea of the prior work in this field.
–Sentiment Analysis
–Opinion-Target pairs
–Latent user attributes
–Event Detection
–POS and NER for Twitter data sets
–Domain Adaptation
References to all of the literature can be found on the wiki maintained by our team.
Motivation behind studying research literature
Sentiment Analysis provides the background for examining a person's sentiment on a topic, an abstract, a discussion, etc.
–Classifying the polarity of a given text at the document, sentence, or feature/aspect level.
–Generally, sentiment is positive, negative, or neutral.
–This could be extended to a person's emotional state, such as angry, sad, or happy.
Latent user attributes
–For our project, we need to construct a profile.
–The profile is associated with metadata: name, profile ID, tweet ID, location (geo-stationary or profile creation), etc.
–Some attributes are not available as part of the tweet metadata: gender, age, political orientation, region.
Motivation behind studying research literature contd…
Event Detection
–An event is an observable phenomenon or occurrence, e.g. an earthquake, war, or flood.
–People have different opinions.
–Zero in on an event and analyze a person's sentiment over a definite period during the effect of the event.
POS and NER for Twitter data sets (continuing…)
–An existing tool (such as Alan Ritter's POS tagger for Twitter) is currently being used for part-of-speech tagging and named-entity recognition.
–Its output will be used as features in our learning algorithm.
Domain Adaptation
–How the model behaves on a different data set.
Plans for the next 2 weeks
Complete POS tagging and NER in the next 2-3 weeks using the existing tool.
Annotate tweets.
Identify the domains/issues we will concentrate on and find the active users in those domains/issues.
–Choose keywords to search for each domain/issue.
–Group the tweets by domain.
–Find the active users in each domain.
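The keyword search, grouping, and active-user steps planned above could look roughly like the following Python sketch. Everything concrete here is an assumption made for illustration: the domains and keyword lists in `DOMAIN_KEYWORDS` are placeholders (the real domains/issues are still to be chosen), the function names are hypothetical, and each tweet is assumed to arrive as a `(user, text)` pair, which requires user metadata that the current extract does not yet carry.

```python
from collections import Counter, defaultdict

# Placeholder keyword lists; the actual domains/issues are not decided yet.
DOMAIN_KEYWORDS = {
    "healthcare": {"healthcare", "insurance"},
    "economy": {"jobs", "taxes"},
}


def group_by_domain(tweets, domain_keywords):
    """Map each domain to the (user, text) tweets mentioning one of its keywords."""
    groups = defaultdict(list)
    for user, text in tweets:
        # Naive whitespace tokenization; punctuation is not stripped.
        tokens = set(text.lower().split())
        for domain, keywords in domain_keywords.items():
            if tokens & keywords:
                groups[domain].append((user, text))
    return groups


def active_users(groups, top_n=2):
    """Return, per domain, the top_n users ranked by number of matching tweets."""
    result = {}
    for domain, domain_tweets in groups.items():
        counts = Counter(user for user, _ in domain_tweets)
        result[domain] = [user for user, _ in counts.most_common(top_n)]
    return result
```

Here "active" is taken to mean "posts the most tweets matching the domain keywords"; other definitions, such as activity over a time window, would need timestamps as well.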
Difficulties Faced
Feature selection
POS tagging and NER
Removing non-ASCII characters