Sentiment analysis using deep learning methods Antti Keurulainen 14.2.2017
Sentiment analysis using deep learning methods
Two main approaches:
- Convolutional neural networks (CNN)
- Recurrent neural networks (RNN), which can be enhanced by using LSTM
Deep Learning
- One or more hidden layers, with trainable parameters in these layers
- An artificial network that is organized in hierarchical layers can build hierarchical representations of the input data
Convolutional neural network (CNN)
Simple example of a convolution operation: a 2x2 filter (kernel) with weights [[0.2, 0.7], [-0.5, 0.7]] slides over the input (e.g. an image), and each position produces one value of the feature map. f represents some non-linear activation function. Note: bias terms omitted!
$c_1 = f(0.2 \cdot 1 + 0.7 \cdot 3 - 0.5 \cdot 6 + 0.7 \cdot 0) = f(-0.7)$
$c_2 = f(0.2 \cdot 3 + 0.7 \cdot 5 - 0.5 \cdot 0 + 0.7 \cdot 2) = f(5.5)$
$c_3 = \ldots$
Feature map: f(-0.7), f(5.5), ...
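Below is a minimal NumPy sketch of this convolution step. The 2x2 filter is taken from the slide's equations; the exact input grid did not survive extraction, so the example image is an assumption chosen so that the first two patches reproduce c1 = f(-0.7) and c2 = f(5.5), with tanh standing in for the nonlinearity f.

```python
import numpy as np

# Assumed example input: only the first two 2x2 patches ([[1, 3], [6, 0]] and
# [[3, 5], [0, 2]]) are taken from the slide; the remaining values are made up.
image = np.array([[1, 3, 5, 2, 4],
                  [6, 0, 2, 4, 6],
                  [7, 1, 3, 5, 7]], dtype=float)
kernel = np.array([[0.2, 0.7],
                   [-0.5, 0.7]])            # filter (kernel) from the slide

def convolve2d_valid(x, k):
    """Slide the kernel over the input and sum the elementwise products."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

feature_map = np.tanh(convolve2d_valid(image, kernel))   # f = tanh, bias omitted
print(feature_map[0, :2])   # tanh(-0.7) and tanh(5.5), i.e. c1 and c2
```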
Convolutional neural network (CNN)
$y = \sum_{k=1}^{3} \sum_{i=1}^{5} \sum_{j=1}^{5} x_{kij}\,\theta_{kij}$
- After convolution, some other operations are performed, such as applying the activation function (nonlinearity) and pooling
- During training, the values used in the filters are updated and gradually learned
- Parameter sharing makes the learned features invariant to translation
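As a small illustration of the formula and the pooling step, the sketch below evaluates the triple sum for one filter position on a 3-channel 5x5 patch and then applies 2x2 max pooling to an example feature map; all array values are random placeholders.

```python
import numpy as np

# y = sum_{k=1..3} sum_{i=1..5} sum_{j=1..5} x_kij * theta_kij for one filter position
x = np.random.rand(3, 5, 5)        # a 5x5 input patch with 3 channels (e.g. RGB)
theta = np.random.rand(3, 5, 5)    # one filter spanning all 3 channels
y = np.sum(x * theta)              # one feature-map value (before the nonlinearity)

# 2x2 max pooling on an example 6x6 feature map -> 3x3 pooled map
feature_map = np.tanh(np.random.rand(6, 6))
pooled = feature_map.reshape(3, 2, 3, 2).max(axis=(1, 3))
```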
Recurrent Neural Network (RNN)
Shallow RNN: [unrolled diagram: inputs $x_{t-1}, x_t, x_{t+1}, \ldots$ feed hidden states $h_t$ through weights U, the hidden states are connected through time by recurrent weights W, and outputs $o_t$ are produced through weights V; each output is compared with the target $y_t$, giving the loss $L_t$]
Source: Goodfellow, I., Bengio, Y., Courville, A., Deep Learning
Recurrent Neural Network (RNN)
Deep RNN example: [unrolled diagram with two stacked hidden layers $h^1_t$ and $h^2_t$: inputs $x_t$ feed the first layer through U, the layers have recurrent weights W1 and W2 and are connected by V1, and the second layer produces outputs $o_t$ through V2, with targets $y_t$ and losses $L_t$]
Vanishing gradient problem and LSTM
Problem: gradients propagate over many stages and involve repeated multiplication by the weight matrix -> vanishing or exploding gradients
[Same unrolled shallow RNN diagram as above]
Standard RNN cell (vanilla RNN)
$h_t = \tanh(U_c x_t + W_c h_{t-1} + b_c)$
[Diagram: $h_{t-1}$ and $x_t$ enter a tanh unit through weights W and U, producing $h_t$]
Visualization idea by Christopher Olah
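A minimal NumPy sketch of this cell, unrolled over a short input sequence; the dimensions and random weights are illustrative placeholders.

```python
import numpy as np

def rnn_cell(x_t, h_prev, U, W, b):
    """One step of the vanilla RNN cell: h_t = tanh(U x_t + W h_{t-1} + b)."""
    return np.tanh(U @ x_t + W @ h_prev + b)

hidden, inp = 60, 25                       # illustrative sizes
U = 0.1 * np.random.randn(hidden, inp)     # input-to-hidden weights
W = 0.1 * np.random.randn(hidden, hidden)  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden)

h = np.zeros(hidden)                       # initial hidden state
for x_t in np.random.randn(5, inp):        # unroll over a 5-step input sequence
    h = rnn_cell(x_t, h, U, W, b)
```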
LSTM
$f_t = \sigma(U_f x_t + W_f h_{t-1} + b_f)$ (forget gate)
$i_t = \sigma(U_i x_t + W_i h_{t-1} + b_i)$ (input gate)
$o_t = \sigma(U_o x_t + W_o h_{t-1} + b_o)$ (output gate)
$\tilde{s}_t = \tanh(U_c x_t + W_c h_{t-1} + b_c)$ (candidate cell state)
$s_t = f_t \circ s_{t-1} + i_t \circ \tilde{s}_t$ (cell state)
$h_t = o_t \circ \tanh(s_t)$ (hidden state)
[Diagram: the cell state $s_t$ flows through the block, scaled by the forget gate and updated by the gated candidate; the output gate produces $h_t$]
Visualization idea by Christopher Olah
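The sketch below is a direct NumPy transcription of these equations as one LSTM step; the parameter names mirror the slide, and `p` is an assumed dictionary of weight matrices and biases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, s_prev, p):
    """One LSTM step following the equations above; p holds the U*, W*, b* arrays."""
    f_t = sigmoid(p["Uf"] @ x_t + p["Wf"] @ h_prev + p["bf"])     # forget gate
    i_t = sigmoid(p["Ui"] @ x_t + p["Wi"] @ h_prev + p["bi"])     # input gate
    o_t = sigmoid(p["Uo"] @ x_t + p["Wo"] @ h_prev + p["bo"])     # output gate
    s_cand = np.tanh(p["Uc"] @ x_t + p["Wc"] @ h_prev + p["bc"])  # candidate state
    s_t = f_t * s_prev + i_t * s_cand        # keep part of the old state, add new
    h_t = o_t * np.tanh(s_t)                 # expose a gated view of the cell state
    return h_t, s_t
```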
Sentiment analysis
Sentiment analysis is a collection of methods whose main goal is to determine the opinion or attitude expressed in natural language, for example in a sentence.
Sentiment analysis using CNNs
Analysis based on Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1746–1751. http://aclweb.org/anthology/D/D14/D14-1181.pdf
A simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. Good results are obtained by using pre-trained word vectors, and results are further improved by fine-tuning the word vectors for the specific task.
Sentiment analysis using CNNs
[Figure: simple CNN model for sentiment analysis, from Kim, Y. (2014), Convolutional Neural Networks for Sentence Classification]
Sentiment analysis using CNNs
- Multiple filter sizes (3, 4, 5) produce several feature maps (100 of each size) -> on the order of 0.3–0.4 M parameters
- Max-over-time pooling is used to select the most important feature from each feature map
- Two input channels are used, one with static word vectors and the other with trainable vectors
- A fully connected softmax layer on top produces probabilities for each class
- Dropout is used in the fully connected layer for regularization, with an L2-norm constraint on the weight vectors; early stopping is used
- Stochastic gradient descent updates using the Adadelta update rule
- Pre-trained word2vec vectors are used (300 dimensions, trained on 100B words from Google News)
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification.
A sketch of a single-channel variant of this architecture is shown below.
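This is a hedged TensorFlow 1.x sketch, not the authors' code: vocabulary size, sequence length, and initializers are assumptions, and the embedding is a randomly initialized trainable variable rather than pre-trained word2vec vectors.

```python
import tensorflow as tf  # TensorFlow 1.x style API

vocab_size, embed_dim = 20000, 300          # assumed vocabulary size; 300-dim vectors
seq_len, num_classes = 60, 2                # assumed max sentence length; binary output
filter_sizes, num_filters = [3, 4, 5], 100  # filter sizes and feature maps per size

x = tf.placeholder(tf.int32, [None, seq_len])
y = tf.placeholder(tf.int32, [None])

# In the paper this would be initialized with word2vec (and optionally kept static).
embedding = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
embedded = tf.expand_dims(tf.nn.embedding_lookup(embedding, x), -1)

pooled = []
for fs in filter_sizes:
    w = tf.Variable(tf.truncated_normal([fs, embed_dim, 1, num_filters], stddev=0.1))
    b = tf.Variable(tf.zeros([num_filters]))
    conv = tf.nn.conv2d(embedded, w, strides=[1, 1, 1, 1], padding="VALID")
    feat = tf.tanh(tf.nn.bias_add(conv, b))      # feature maps for this filter size
    pooled.append(tf.reduce_max(feat, axis=1))   # max-over-time pooling
features = tf.reshape(tf.concat(pooled, axis=-1), [-1, num_filters * len(filter_sizes)])
features = tf.nn.dropout(features, keep_prob=0.5)    # dropout regularization

logits = tf.layers.dense(features, num_classes)      # fully connected softmax layer
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdadeltaOptimizer(1.0).minimize(loss)
```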
Sentiment analysis using CNNs
Models in [Kim2014]:
- CNN-rand: all word vectors are initialized randomly and trained
- CNN-static: initialized with word2vec (unknown words initialized randomly) and kept static
- CNN-non-static: initialized with word2vec and trained further
- CNN-multichannel: initialized with word2vec; one channel stays static and the other channel is further trained
Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification.
Sentiment analysis using CNNs: datasets [Kim 2014]
- "Movie review data". Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, EMNLP 2002. Binary classification.
- "Stanford sentiment treebank 1". Extension of the above, with fine-grained labels added. Socher et al. 2013.
- "Stanford sentiment treebank 2". Same as above but with neutral sentences removed and binary labels.
- "Subjectivity dataset". 5000 subjective and 5000 objective processed sentences. Pang & Lee, ACL 2004.
- TREC question dataset: classifying a question into 6 question types.
- Customer review dataset: reviews of various products such as cameras, MP3 players etc. Hu & Liu 2004.
- MPQA dataset: opinion polarity subtask from the MPQA dataset. Wiebe et al. 2005.
Sentiment analysis using CNNs: results [Kim 2014]
Sentiment analysis using RNNs
Sentiment analysis using RNNs
Analysis based on [Wan2015]: Wang, X., Liu, Y., Sun, C., Wang, B., & Wang, X. (2015). Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1343–1353, Beijing, China. Association for Computational Linguistics. http://www.aclweb.org/anthology/P15-1130
Twitter sentiment prediction, using a simple RNN or an LSTM network.
Sentiment analysis using RNNs
- Word vectors created from co-occurrence statistics are not always suitable for sentiment analysis (e.g. the words "good" and "bad" are close in word2vec representations)
- Sentiments are expressed by phrases rather than individual words -> how to capture the representation of the whole sentence?
- Additional challenge: a Recurrent Neural Network (RNN) has difficulties maintaining longer time dependencies -> LSTM networks
- It has been shown that further task-specific training of the pre-trained word vectors helps capture the polarity information of the sentences
Sentiment analysis using RNNs
[Figure: basic RNN architecture used in [Wan2015]]
In [Wan2015], the sentence is represented by the hidden state of the last time step.
Sentiment analysis using RNNs
RNN-FLT (Recurrent Neural Network with Fixed Lookup-Table): a simple implementation of the recurrent sentiment classifier.
[Forward pass and backpropagation equations not reproduced here.] f represents the sigmoid function, w are the weights, e are the word embeddings, v contains the hidden-to-output weights, t is the time step, and T is the last time step. The loss O is calculated using the cross-entropy loss, and training is conducted with stochastic gradient descent (SGD). A sketch of the forward pass is shown below.
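The following NumPy sketch is a paraphrase of such a classifier, not the paper's exact formulation: the lookup table E stays fixed, the recurrence uses the sigmoid f, and the last hidden state is mapped to class probabilities. All parameter names and shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_flt_forward(word_ids, E, w_in, w_rec, b_h, v, b_o):
    """Forward pass: fixed embeddings E, sigmoid recurrence, softmax on last state."""
    h = np.zeros(w_rec.shape[0])
    for idx in word_ids:                 # run the recurrence over the tweet
        e_t = E[idx]                     # look up the fixed word embedding e
        h = sigmoid(w_in @ e_t + w_rec @ h + b_h)
    logits = v @ h + b_o                 # hidden-to-output weights v
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()           # class probabilities; O = -log probs[label]
```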
Sentiment analysis using RNNs
RNN-TLT (Recurrent Neural Network with Trainable Lookup-Table) and LSTM-TLT: implementations that also further train the pre-trained word vectors. In LSTM-TLT, each regular RNN block is replaced by an LSTM block -> much more complex functionality -> helps to combat the vanishing gradient problem.
Sentiment analysis using RNNs
- Experiments are run on the Stanford Twitter Sentiment corpus (STS): 800,000 positive and 800,000 negative tweets. The manually labeled test set includes 177 negative and 182 positive tweets. Training set sentiment labels are derived from emoticons.
- 25-dimensional word vectors, trained with word2vec on 1.56 M tweets from the training set. Hidden layer size 60.
- Compared systems: non-neural classifiers (Naive Bayes, Maximum Entropy, Support Vector Machine), Neural Bag-of-Words (summation of word vectors as input), Dynamic Convolutional Neural Network, Recursive Autoencoder, and the models presented in this paper.
Sentiment analysis using RNNs
Additional experiments are run on the human-labeled SemEval 2013 dataset: a training set of 4,099, a development set of 735, and a test set of 1,742 tweets. Fixed word vectors pre-trained with word2vec on the STS dataset are used, this time with 300 dimensions.
Sentiment analysis using RNNs
Which words change most when the pre-trained word2vec vectors are further trained?
Sentiment analysis using RNNs
How do the sentiment words move in 2-D space during training? The 20 most negative and 20 most positive words were tracked during training.
[Figure panels: before tuning / after tuning]
Sentiment analysis experiments using Python libraries and TensorFlow
Sentiment analysis experiments using Python libraries and TensorFlow
Dataset: IMDb movie review dataset, 25,000 labeled reviews in the training set and 25,000 unlabeled reviews in the test set.
Models:
- Bag of words with random forest (pandas, numpy, scikit-learn)
- Word2vec with random forest (pandas, numpy, scikit-learn, gensim)
- Word2vec with feed forward network (pandas, numpy, gensim, tensorflow)
Sentiment analysis experiments using Python libraries and TensorFlow
Baseline with bag of words and random forest:
Step 1: Download the IMDb movie review dataset from Kaggle.com and clean it
  - import the data into a pandas frame
  - use BeautifulSoup to remove HTML tagging
  - use a regular expression to remove non-letters
  - convert to lowercase
  - remove stopwords
Step 2: Create bag-of-words representations of the individual reviews (sklearn CountVectorizer)
Step 3: Fit a random forest model on the training set, run predictions, and submit to Kaggle.com for test set accuracy (see the sketch below)
Accuracy 85.6 %
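A hedged sketch of this pipeline; the file and column names ("labeledTrainData.tsv", "review", "sentiment") follow the usual Kaggle layout of this dataset and may differ from the actual setup, and the hyperparameters are illustrative.

```python
import re
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords           # requires the nltk stopwords corpus
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("labeledTrainData.tsv", sep="\t", quoting=3)   # assumed file name
stops = set(stopwords.words("english"))

def clean(raw_review):
    text = BeautifulSoup(raw_review, "html.parser").get_text()     # remove HTML tags
    text = re.sub("[^a-zA-Z]", " ", text).lower()                  # keep letters only
    return " ".join(w for w in text.split() if w not in stops)     # drop stopwords

train["clean"] = train["review"].apply(clean)

vectorizer = CountVectorizer(max_features=5000)                    # bag of words
X = vectorizer.fit_transform(train["clean"])

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X, train["sentiment"])
```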
Sentiment analysis experiments using Python libraries and TensorFlow
Word2vec with random forest:
Step 1: Download the IMDb movie review dataset from Kaggle.com and clean it
  - import the data into a pandas frame
  - use BeautifulSoup to remove HTML tagging
  - use a regular expression to remove non-letters
  - convert to lowercase
Step 2: Create word2vec representations of the individual words (gensim word2vec)
Step 3: Average all word vectors in a review to form one single vector per review
Step 4: Fit a random forest model on the training set, run predictions, and submit to Kaggle.com for test set accuracy (see the sketch below)
Accuracy 83.3 %
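A hedged sketch of steps 2–4, using the gensim 3.x word2vec API (newer gensim versions rename `size` to `vector_size`); `tokenized_reviews` and `labels` are assumed to come from the cleaning step above.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

# tokenized_reviews: list of word lists; labels: sentiment labels (assumed to exist)
model = Word2Vec(tokenized_reviews, size=300, window=10, min_count=40, sample=1e-3)

def review_vector(words, model, dim=300):
    """Average the word2vec vectors of all in-vocabulary words in one review."""
    vecs = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.array([review_vector(words, model) for words in tokenized_reviews])
forest = RandomForestClassifier(n_estimators=100).fit(X, labels)
```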
Sentiment analysis experiments using Python libraries and TensorFlow
Word2vec with deep learning:
Step 1: Download the IMDb movie review dataset from Kaggle.com and clean it
  - import the data into a pandas frame
  - use BeautifulSoup to remove HTML tagging
  - use a regular expression to remove non-letters
  - convert to lowercase
Step 2: Create word2vec representations of the individual words (gensim word2vec)
Step 3: Average all word vectors in a review to form one single vector for the review
Step 4: Fit a feed forward deep learning model on the training set, run predictions, and submit to Kaggle.com for test set accuracy
Accuracy 87.0 %
A lot of hyperparameters and other decisions
- Remove stopwords? (yes for word2vec, no for sentiment analysis)
- Remove punctuation? (yes)
- Dimension of word vectors (300)
- Word2vec window size (10)
- Downsampling for frequent words (1e-3)
- Minimum word count for word2vec (40)
- Deep learning (DL) number of layers (3)
- DL width of the hidden layers (300-150-50-2)
- DL activation functions (ReLU)
- DL use dropout? (tried, did not help -> no)
- DL initialization (random uniform between 0 and 1)
- DL optimizer (Adam)
- DL Adam optimizer parameters (learning rate + 3 others)
- DL number of training steps
- DL regularization method and its parameters (none)
- DL loss function (cross-entropy)
A sketch of the resulting feed-forward network with these choices is shown below.
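This is a hedged TensorFlow 1.x sketch of the feed-forward classifier with the choices listed above (300-150-50-2 layers, ReLU, uniform initialization, Adam, cross-entropy loss); the training loop is omitted and the learning rate is an assumed value.

```python
import tensorflow as tf  # TensorFlow 1.x style API

x = tf.placeholder(tf.float32, [None, 300])     # averaged word2vec review vectors
y = tf.placeholder(tf.int64, [None])            # 0 = negative, 1 = positive

init = tf.random_uniform_initializer(0.0, 1.0)  # random uniform between 0 and 1
h1 = tf.layers.dense(x, 300, activation=tf.nn.relu, kernel_initializer=init)
h2 = tf.layers.dense(h1, 150, activation=tf.nn.relu, kernel_initializer=init)
h3 = tf.layers.dense(h2, 50, activation=tf.nn.relu, kernel_initializer=init)
logits = tf.layers.dense(h3, 2, kernel_initializer=init)

loss = tf.reduce_mean(                          # cross-entropy loss
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)  # assumed lr
```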
Homework
Consider the CNN-based single-channel version of the model architecture presented in [Kim2014]. Consider a scenario where the pre-trained static word vectors have 300 dimensions, and three different filter sizes are used that span over 3, 4 and 5 words. Each filter size produces 100 feature maps. The feature maps are calculated using formula (2): the weight matrix is multiplied with the input vectors, a bias is added, and the result is passed through a non-linearity such as tanh. Then, the feature maps are max-pooled, and the pooled results are connected to the final two-dimensional output layer in a fully connected manner. Calculate how many trainable parameters (weights and biases) there are in this model.