TEXT ANALYTICS - LABS
Maha Althobaiti, Udo Kruschwitz, Massimo Poesio
LABS
Basic text analytics: text classification using bags-of-words
– Sentiment analysis of tweets using Python's SciKit-Learn library
More advanced text analytics: information extraction using NLP pipelines
– Named Entity Recognition
Sentiment analysis using SciKit-Learn
Materials for this part of the tutorial:
– http://csee.essex.ac.uk/staff/poesio/Teach/TextAnalyticsTutorial/SentimentLab
– Based on chap. 6 of Building Machine Learning Systems with Python (Richert & Coelho)
TEXT ANALYTICS IN PYTHON
Text manipulation is not quite as easy in Python as in Perl, but there are a number of useful packages:
– SCIKIT-LEARN for machine learning, including basic text classification
– NLTK for NLP processing, including libraries for tokenization, POS tagging, chunking, parsing, and NE recognition; also support for ML-based methods, e.g. for text classification
SCIKIT-LEARN
An open-source library supporting machine learning work
– Based on numpy, scipy, and matplotlib
Provides implementations of:
– Several supervised ML algorithms, including e.g. regression, Naïve Bayes, SVMs
– Clustering
– Dimensionality reduction
It also includes several facilities to support text classification, including e.g. ways to create NLP pipelines out of components
Website:
– http://scikit-learn.org/stable/
REMINDER: SENTIMENT ANALYSIS (or opinion mining)
Develop algorithms that can identify the 'sentiment' expressed by a text
– Product X sucks
– I was mesmerized by film Y
SENTIMENT ANALYSIS AS TEXT CATEGORIZATION
Sentiment analysis can be viewed as just another type of text categorization, like spam detection or topic classification
Most successful approaches use SUPERVISED LEARNING:
– Use corpora annotated for subjectivity and/or sentiment
– To train models using supervised machine learning algorithms: Naïve Bayes, decision trees, SVMs
Good results can already be obtained using only WORDS as features
TEXT CATEGORIZATION USING A NAÏVE BAYES, WORD-BASED APPROACH
Attributes are text positions, values are words.
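In this formulation the classifier picks the class that maximizes the probability of the words observed at each position given the class. The standard decision rule is:

$$c_{NB} = \operatorname*{argmax}_{c \in C} \; P(c) \prod_{i \in \text{positions}} P(x_i = w_i \mid c)$$

The model assumes the word at each position is conditionally independent of the others given the class, which is what makes the product tractable.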
SENTIMENT ANALYSIS OF TWEETS
A very popular application of sentiment analysis is trying to extract sentiment towards products or organizations from people's comments about them on Twitter
Several datasets exist for this task
– E.g., SEMEVAL-2014
In this lab: Nick Sanders's dataset
– 5000 tweets
– Annotated as positive / negative / neutral / irrelevant
– A list of ID / sentiment pairs, plus a script to download the tweets on the basis of their IDs
First Script
Open the file 01_start.py (but do not run it yet!!)
Start an IDLE window
A word-based, Naïve Bayes sentiment analyzer using SciKit-Learn
The library sklearn.naive_bayes includes implementations of three Naïve Bayes classifiers:
– GaussianNB (for features that have a Gaussian distribution, e.g., physical traits such as height)
– MultinomialNB (when features are frequencies of words)
– BernoulliNB (for boolean features)
For sentiment analysis: MultinomialNB
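A toy illustration of the MultinomialNB interface on word-frequency features (hypothetical two-tweet data, just to show the calls):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["great phone love it", "terrible battery hate it"]
labels = ["positive", "negative"]

vect = CountVectorizer()              # word-frequency features
X = vect.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vect.transform(["love this phone"])))  # ['positive']
```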
Creating the model
The words contained in the tweets are used as features. They are extracted and weighted using the function create_ngram_model
– create_ngram_model uses the function TfidfVectorizer from the package feature_extraction in scikit-learn to extract terms from tweets
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
– create_ngram_model uses MultinomialNB to learn a classifier
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
– The function Pipeline of scikit-learn is used to combine the feature extractor and the classifier in a single object (an estimator) that can be used to extract features from data, create ('fit') a model, and use the model to classify
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
Tweet term extraction & classification
– TfidfVectorizer extracts the features and weights them
– MultinomialNB is the Naïve Bayes classifier
– Pipeline combines the two
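A minimal sketch of what create_ngram_model does (the parameter values here are illustrative, not necessarily the exact ones in the lab script):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # Extract word n-grams from the tweets and weight them with TF-IDF
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3), analyzer="word", binary=False)
    # Multinomial Naive Bayes learns from the weighted term frequencies
    clf = MultinomialNB()
    # Chain vectorizer and classifier into a single estimator
    return Pipeline([("vect", tfidf_ngrams), ("clf", clf)])
```

The returned Pipeline behaves like any other scikit-learn estimator: calling fit extracts the features and trains the classifier in one step.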
Training and evaluation
The function train_model:
– Uses a method from the cross_validation library in scikit-learn, ShuffleSplit, to calculate the folds to use in cross-validation
– At each iteration, the function creates a model using fit, then evaluates the results using score
Creating a model
– ShuffleSplit identifies the indices in each fold
– fit trains the model
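A sketch of train_model along the lines described above. Note that in current scikit-learn versions ShuffleSplit lives in sklearn.model_selection rather than cross_validation, and the iteration details here are illustrative:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def train_model(clf_factory, X, Y):
    # 10 random train/test splits, holding out 30% of the tweets each time
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X):
        clf = clf_factory()
        clf.fit(X[train_idx], Y[train_idx])                 # train on this fold
        scores.append(clf.score(X[test_idx], Y[test_idx]))  # accuracy on held-out tweets
    print("Mean accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```

X and Y are assumed to be numpy arrays of tweet texts and labels, so that indexing with the fold indices works; it would be called as train_model(create_ngram_model, X, Y).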
Execution
Optimization
The program above uses the default values of the parameters for TfidfVectorizer and MultinomialNB
In text analytics it's usually easy to build a first prototype, but lots of experimentation is needed to achieve good results
Alternative choices for TfidfVectorizer:
– Using unigrams, bigrams, trigrams (ngram_range parameter)
– Removing stopwords (stop_words parameter)
– Using binary counts rather than frequencies (binary parameter)
Alternative choices for MultinomialNB:
– Which type of SMOOTHING to use
Smoothing
Even a very large corpus remains a limited sample of language use, so many words, even quite common ones, will not occur in the training data
– The problem is particularly acute with tweets, where a lot of 'creative' use of words is found
Solution: SMOOTHING – redistribute the probability mass so that every word gets some
Most used: ADD-ONE or LAPLACE smoothing
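With add-one smoothing, the estimate of a word's probability given a class becomes (the standard formulation, with V the vocabulary):

$$P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}$$

In MultinomialNB this is controlled by the alpha parameter: alpha=1.0 gives Laplace smoothing, smaller values give Lidstone smoothing.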
Optimization
Looking for the best values of the parameters is a standard operation in machine learning
Scikit-learn, like Weka and similar packages, provides a function (GridSearchCV) to explore the results that can be achieved with different parameter configurations
Optimizing with GridSearchCV
– Note the syntax used to specify the values of the parameters
– The F metric is used for evaluation
– The grid includes which smoothing value to use
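A hedged sketch of such a grid search (the vect__ / clf__ prefixes address the named steps of the pipeline; the exact grid and scorer in the lab script may differ, and X, Y are the tweet texts and binary 0/1 labels as above):

```python
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.metrics import f1_score, make_scorer

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],  # unigrams, +bigrams, +trigrams
    "vect__stop_words": [None, "english"],          # keep vs. remove stopwords
    "vect__binary": [False, True],                  # frequency vs. binary counts
    "clf__alpha": [0.01, 0.1, 1.0],                 # amount of additive smoothing
}

grid = GridSearchCV(
    create_ngram_model(),
    param_grid,
    cv=ShuffleSplit(n_splits=10, test_size=0.3, random_state=0),
    scoring=make_scorer(f1_score),  # optimize the F measure rather than accuracy
)
grid.fit(X, Y)
print(grid.best_params_)
```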
Second Script
Open the file 02_tuning.py (but do not run it yet!!)
Start an IDLE window
Additional improvements: normalization, preprocessing
Further improvements may be possible by doing some form of NORMALIZATION
Example of normalization: emoticons
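The idea, sketched with a hypothetical (and much shortened) mapping: rewrite emoticons as ordinary sentiment-bearing tokens before the vectorizer sees the text.

```python
# Map emoticons to placeholder words the vectorizer can count
emo_repl = {
    ":)": " good ", ":-)": " good ", ":D": " good ",
    ":(": " bad ",  ":-(": " bad ",
}

def replace_emoticons(tweet):
    for emoticon, token in emo_repl.items():
        tweet = tweet.replace(emoticon, token)
    return tweet
```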
Normalization: abbreviations
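Abbreviations can be normalized the same way with regular expressions (again an illustrative sketch; the word-boundary anchors keep single letters from matching inside longer words):

```python
import re

# Expand common Twitter abbreviations
re_repl = {
    r"\br\b": "are",
    r"\bu\b": "you",
    r"\bdont\b": "do not",
    r"\bcant\b": "can not",
}

def replace_abbreviations(tweet):
    for pattern, replacement in re_repl.items():
        tweet = re.sub(pattern, replacement, tweet)
    return tweet
```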
Adding a preprocessing step to TfidfVectorizer
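One simple way to hook the normalization into the pipeline is TfidfVectorizer's preprocessor parameter (a sketch building on the two functions above; the lab script may wire this up differently, e.g. by subclassing the vectorizer):

```python
def create_ngram_model():
    def preprocessor(tweet):
        # Normalize before tokenization: lowercase, then emoticons and abbreviations
        return replace_abbreviations(replace_emoticons(tweet.lower()))

    tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                   ngram_range=(1, 3), analyzer="word")
    return Pipeline([("vect", tfidf_ngrams), ("clf", MultinomialNB())])
```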
Other possible improvements
– Using NLTK's POS tagger
– Using a sentiment lexicon such as SentiWordNet: http://sentiwordnet.isti.cnr.it/download.php (also available in the data/ directory)
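For illustration, SentiWordNet can also be queried through NLTK's corpus reader (assuming the sentiwordnet and wordnet corpora have been fetched with nltk.download):

```python
from nltk.corpus import sentiwordnet as swn

# Each synset carries positivity, negativity, and objectivity scores
for synset in swn.senti_synsets("good", "a"):
    print(synset, synset.pos_score(), synset.neg_score())
```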
Third Script
Open and run the file 03_clean.py (start an IDLE window first)
Overall results
TO LEARN MORE
SCIKIT-LEARN
http://scikit-learn.org/stable/
NLTK http://www.nltk.org/book