CSE 291G : Deep Learning for Sequences

CSE 291G: Deep Learning for Sequences - Paper Presentation. Topic: Named Entity Recognition. Presenter: Rithesh

Outline: Named Entity Recognition and its applications; existing methods; character-level feature extraction; RNNs: BLSTM-CNNs.

Named Entity Recognition (NER)

Named Entity Recognition (NER) - WHAT?

Named Entity Recognition (also called entity identification, entity chunking, and entity extraction): locate and classify named entity mentions in unstructured text into predefined categories such as person names, organizations, locations, and time expressions. Example: "Kim bought 500 shares of IBM in 2010." - Kim: person name, IBM: organization, 2010: time. Why locating the entity in the sentence matters: in "Kim bought 500 shares of Bank of America in 2010." the organization mention spans several words ("Bank of America"), so the model must identify the entity boundaries, not just individual entity words.
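For illustration (my own example, not from the slides), the second sentence could be labeled token by token with BIO-style tags; the tag names below are just for demonstration.

```python
# Illustrative BIO-style labeling of the example sentence (hypothetical tag set).
example = [
    ("Kim", "B-PER"),      # person name
    ("bought", "O"),
    ("500", "O"),
    ("shares", "O"),
    ("of", "O"),
    ("Bank", "B-ORG"),     # multi-word organization starts here
    ("of", "I-ORG"),
    ("America", "I-ORG"),
    ("in", "O"),
    ("2010", "B-TIME"),    # time expression
]
```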

Named Entity Recognition (NER) - WHY?

Applications of NER: content recommendation, customer support, classifying content for news providers, efficient search algorithms, question answering (QA), machine translation, and automatic summarization.

Named Entity Recognition (NER) - HOW?

Approaches: (1) Classical ML classification techniques (e.g., SVM, perceptron, CRF - Conditional Random Fields). Drawback: they require hand-crafted features. (2) Neural network model (Collobert et al., "Natural Language Processing (Almost) from Scratch"). Drawbacks: (i) a simple feed-forward NN with a fixed window size; (ii) it depends solely on word embeddings and fails to exploit character-level features such as prefixes and suffixes. (3) RNN/LSTM: handles variable-length input and long-term dependencies; first applied to NER by Hammerton in 2003.
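To make drawback (i) concrete, here is a minimal sketch (my own illustration, assuming PyTorch, not code from the paper) of the fixed-window idea: the classifier for each word only sees a window of k words around it, so context outside the window is invisible.

```python
import torch
import torch.nn as nn

# Hypothetical window-based tagger in the spirit of the feed-forward window
# approach: each word is classified from a fixed-size context window only.
class WindowTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, window=5, hidden=300, n_tags=9):
        super().__init__()
        self.window = window
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(window * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_tags),
        )

    def forward(self, word_ids):                      # word_ids: (seq_len,)
        pad = self.window // 2
        padded = torch.nn.functional.pad(word_ids, (pad, pad), value=0)
        scores = []
        for i in range(word_ids.size(0)):
            win = padded[i:i + self.window]           # fixed window around word i
            feats = self.emb(win).view(-1)            # concatenate window embeddings
            scores.append(self.ff(feats))
        return torch.stack(scores)                    # (seq_len, n_tags)
```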

RNN/LSTM: overcomes the drawbacks of existing systems by handling variable-length input and long-term memory, but a unidirectional LSTM fails to handle cases in which the i-th word of a sentence depends on words at positions greater than i. Example: "Teddy bears are on sale." vs. "Teddy Roosevelt was a great president." Solution: a bi-directional LSTM (BLSTM), which captures information from both the past and the future. However, it still fails to exploit character-level features.
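A minimal sketch (assuming PyTorch; not the authors' implementation) of running a bidirectional LSTM over word embeddings so that each position's output reflects both left and right context:

```python
import torch
import torch.nn as nn

# Toy bidirectional LSTM over word embeddings: the output at each time step
# concatenates a forward (past) and a backward (future) hidden state.
emb = nn.Embedding(num_embeddings=10_000, embedding_dim=50)
blstm = nn.LSTM(input_size=50, hidden_size=100, bidirectional=True, batch_first=True)

word_ids = torch.randint(0, 10_000, (1, 6))   # batch of 1 sentence, 6 tokens
outputs, _ = blstm(emb(word_ids))             # outputs: (1, 6, 200) = forward || backward
```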

Techniques to capture character-level features: Santos and Labeau (2015) proposed models for character-level feature extraction using CNNs for NER and POS tagging; Ling (2015) proposed character-level feature extraction using a BLSTM for POS tagging. CNN or BLSTM? The BLSTM did not perform significantly better than the CNN, and it is computationally more expensive to train. Hence the combination used here: BLSTM for word-level feature extraction, CNN for character-level feature extraction.

Named Entity Recognition with Bidirectional LSTM-CNNs. Jason P. C. Chiu and Eric Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370. Inspired by: Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537. Cicero dos Santos and Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. Proceedings of the Fifth Named Entities Workshop, pages 25-33.

Reference paper: Boosting NER with Neural Character Embeddings. CharWNN is a deep neural network that uses word-level and character-level representations (embeddings) to perform sequential classification. Evaluated on HAREM I (Portuguese) and SPA CoNLL-2002 (Spanish). CharWNN extends Collobert et al.'s (2011) neural network architecture for sequential classification by adding a convolutional layer to extract character-level representations.

CharWNN. Input: a sentence S = &lt;w1, w2, ..., wN&gt;. Output: for each word in the sentence, a score for each class. Each word wn is converted into a joint representation un = [r^wrd; r^wch], the concatenation of a word-level embedding r^wrd and a character-level embedding r^wch.

CNN for character embedding: a word w is treated as a sequence of characters &lt;c1, c2, ..., cM&gt;. Each character is mapped to a character embedding, a convolution (matrix-vector operation) with window size k is applied over the character windows, and the resulting vectors are max-pooled over all positions to produce the fixed-size character-level representation r^wch.
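A minimal sketch of this character-level CNN (assuming PyTorch; the dimensions are illustrative, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

# Toy character-level CNN: embed characters, convolve with window size k,
# then max-pool over positions to get one fixed-size vector r_wch per word.
char_emb = nn.Embedding(num_embeddings=100, embedding_dim=25)      # char vocab of 100
conv = nn.Conv1d(in_channels=25, out_channels=50, kernel_size=3,   # window size k = 3
                 padding=1)

char_ids = torch.randint(0, 100, (1, 7))        # one word of 7 characters
x = char_emb(char_ids).transpose(1, 2)          # (1, 25, 7): channels = embedding dim
r_wch, _ = conv(x).max(dim=2)                   # max over character positions -> (1, 50)
```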

CharWNN (continued). With the character-level representation r^wch computed by the CNN, each word wn gets the joint representation un = [r^wrd; r^wch], so the sentence becomes the sequence &lt;u1, u2, ..., uN&gt;.

CharWNN: the sequence &lt;u1, u2, ..., uN&gt; is the input to the word-level convolution layer, followed by two neural network layers that produce, for each word wn, a score s(wn, t) for every tag t. With a transition score matrix A (entry A_{t,u} is the score for moving from tag t to tag u), the sentence-level score of a tag path $[t]_1^N$ is $\mathrm{score}([t]_1^N) = \sum_{n=1}^{N} \big( A_{t_{n-1}, t_n} + s(w_n, t_n) \big)$.

Network training for CharWNN: the network is trained by minimizing the negative log-likelihood over the training set D. The sentence score is interpreted as a conditional probability over a tag path: the score is exponentiated and normalized with respect to all possible paths. Stochastic gradient descent (SGD) is used to minimize the negative log-likelihood with respect to the network parameters θ.
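Written out (a standard formulation consistent with the slide, using the sentence-level score defined above), the path probability and training loss are:

```latex
% Conditional probability of a tag path [t]_1^N given sentence S,
% obtained by exponentiating and normalizing the sentence-level score:
p\big([t]_1^N \mid S, \theta\big)
  = \frac{\exp\, \mathrm{score}\big(S, [t]_1^N\big)}
         {\sum_{[t']_1^N} \exp\, \mathrm{score}\big(S, [t']_1^N\big)}

% Training minimizes the negative log-likelihood over the training set D by SGD:
\mathcal{L}(\theta) = - \sum_{(S,\,[t]_1^N) \in D} \log p\big([t]_1^N \mid S, \theta\big)
```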

Embeddings. Word-level embeddings: for Portuguese NER, word-level embeddings previously trained by Santos (2014) were used; for Spanish, embeddings were trained on the Spanish Wikipedia. Character-level embeddings: unsupervised pre-training of character-level embeddings was NOT performed; they are initialized by randomly sampling each value from a uniform distribution.

Corpus : Portuguese & Spanish

Hyperparameters

Comparison of different NNs for the SPA CoNLL-2002 corpus

Comparison with the state-of-the-art for the SPA CoNLL-2002 corpus

Comparison of different NNs for the HAREM I corpus Comparison with the State-of-the-art for the HAREM I corpus

Main paper: Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370. Architecture: BLSTM for word-level feature extraction, CNN for character-level feature extraction.

Character Level feature extraction

Word level feature extraction
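A minimal end-to-end sketch (assuming PyTorch; layer sizes are illustrative, not the paper's exact configuration) of combining the character-level CNN with word embeddings and a word-level BLSTM to produce per-word tag scores:

```python
import torch
import torch.nn as nn

# Illustrative BLSTM-CNN tagger: char CNN features are concatenated with word
# embeddings, a bidirectional LSTM runs over the word sequence, and a linear
# layer emits one score per tag for every word.
class BLSTMCNNTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=50, char_dim=25, char_filters=50, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(word_dim + char_filters, hidden,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, w)).transpose(1, 2)
        char_feats, _ = self.char_cnn(chars).max(dim=2)          # max over characters
        char_feats = char_feats.view(b, s, -1)
        feats = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        hidden, _ = self.blstm(feats)
        return self.out(hidden)                                   # (batch, seq_len, n_tags)

# Usage with random toy inputs:
model = BLSTMCNNTagger(n_words=10_000, n_chars=100, n_tags=9)
scores = model(torch.randint(0, 10_000, (2, 6)), torch.randint(0, 100, (2, 6, 8)))
```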

Embeddings. Word embeddings: the 50-dimensional word embeddings released by Collobert (2011b), trained on Wikipedia and the Reuters RCV-1 corpus; Stanford's GloVe and Google's word2vec embeddings are also evaluated. Character embeddings: a randomly initialized lookup table with values drawn from a uniform distribution with range [-0.5, 0.5], producing character embeddings of 25 dimensions.
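A small sketch of how such an embedding setup might be initialized (assuming PyTorch; variable names and sizes other than the 25-dimensional uniform [-0.5, 0.5] character table are illustrative):

```python
import torch
import torch.nn as nn

# Character lookup table initialized uniformly in [-0.5, 0.5] with 25 dimensions,
# as described on the slide; the word table would be filled from pretrained
# 50-d vectors in practice (loading code omitted).
n_chars, char_dim = 100, 25
char_table = nn.Embedding(n_chars, char_dim)
nn.init.uniform_(char_table.weight, a=-0.5, b=0.5)

word_table = nn.Embedding(30_000, 50)   # copy pretrained Collobert/GloVe/word2vec vectors here
```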

Additional features. Additional word-level features: a capitalization feature with values allCaps, upperInitial, lowercase, mixedCaps, noinfo; and lexicon features from SENNA and DBpedia.
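A possible implementation of the capitalization feature (my own sketch; the paper does not give code, and the edge-case handling here is an assumption):

```python
def capitalization_feature(token: str) -> str:
    """Map a token to one of: allCaps, upperInitial, lowercase, mixedCaps, noinfo."""
    if not any(c.isalpha() for c in token):
        return "noinfo"                      # no letters, e.g. numbers or punctuation
    if token.isupper():
        return "allCaps"
    if token.islower():
        return "lowercase"
    if token[0].isupper() and token[1:].islower():
        return "upperInitial"
    return "mixedCaps"

# e.g. capitalization_feature("IBM") -> "allCaps"; capitalization_feature("Kim") -> "upperInitial"
```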

Training and inference. Implementation: the torch7 library. The initial states of the LSTM are set to zero vectors. Objective: maximize the sentence-level log-likelihood; the objective function and its gradient can be computed efficiently by dynamic programming. At inference time, the Viterbi algorithm is used to find the tag sequence [i]_1^T that maximizes the sentence-level score. Learning: training uses mini-batch stochastic gradient descent (SGD) with a fixed learning rate, and each mini-batch consists of multiple sentences with the same number of tokens.
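For concreteness, a compact Viterbi decoder over per-word tag scores and a transition matrix (a generic sketch in NumPy, not the authors' torch7 code):

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Return the highest-scoring tag sequence.

    emissions:   (seq_len, n_tags) per-word tag scores from the network
    transitions: (n_tags, n_tags)  score of moving from tag i to tag j
    """
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()                      # best score ending in each tag
    backptr = np.zeros((seq_len, n_tags), dtype=int)

    for t in range(1, seq_len):
        # candidate[i, j] = best path through tag i at t-1, then tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)

    # Follow back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```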

Results

Results: F1 scores of BLSTM and BLSTM-CNN with various additional features (emb: Collobert word embeddings, char: character-type feature, caps: capitalization feature, lex: lexicon feature).

Results : Word embeddings

Results : Various dropout values

Questions to discuss: Why is BLSTM-CNN the best choice? Is the proposed model language-independent? Is it a good idea to use additional features (capitalization, prefix, suffix, etc.)? Possible future work.

Thank you!!