Learning Character Level Representation for POS Tagging

Learning Character Level Representation for POS Tagging. Cícero Nogueira dos Santos, Bianca Zadrozny. Presented by Anirban Majumder.

Introduction: Distributed word embeddings are a useful technique for capturing syntactic and semantic information about words. However, for many NLP tasks such as POS tagging, information about word morphology and shape is important, and it is not captured by these embeddings. The paper proposes a deep neural network that learns a character-level representation to capture this intra-word information.

CharWNN Architecture: Joins word-level and character-level embeddings for POS tagging. It is an extension of Collobert et al.'s (2011) neural network architecture and uses a convolutional layer to extract a character-level embedding for words of any length.

CharWNN Architecture. Input: a fixed-sized window of words centered on the target word. Output: for each word in the sentence, the network produces a score for each tag τ ∈ T (the tag set).

Word and Char-Level Embedding: Every word comes from a fixed-sized word vocabulary V^wrd, and every character comes from a fixed-sized character vocabulary V^chr. Two embedding matrices are used: W^wrd ∈ R^(d^wrd × |V^wrd|) and W^chr ∈ R^(d^chr × |V^chr|).

Word and Char-Level Embedding: Given a sentence consisting of N words {w1, w2, ..., wN}, each word wn is converted into a vector representation un = [r^wrd ; r^wch], where r^wrd ∈ R^(d^wrd) is the word-level embedding and r^wch ∈ R^(cl_u) is the character-level embedding.
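
A minimal numpy sketch of this lookup-and-concatenate step (the toy vocabulary and sizes are invented for illustration; the character-level encoder is stubbed out here and sketched after the convolution slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (not the paper's tuned hyper-parameters).
d_wrd, cl_u = 5, 3                      # word-level and character-level embedding sizes
V_wrd = {"the": 0, "cat": 1, "sat": 2}  # tiny word vocabulary

W_wrd = rng.normal(size=(d_wrd, len(V_wrd)))   # word embedding matrix W^wrd

def char_level_embedding(word):
    # Stub for the convolutional character-level encoder sketched later.
    return np.zeros(cl_u)

def embed_word(word):
    r_wrd = W_wrd[:, V_wrd[word]]          # column lookup: word-level embedding r^wrd
    r_wch = char_level_embedding(word)     # character-level embedding r^wch
    return np.concatenate([r_wrd, r_wch])  # u_n = [r^wrd ; r^wch]

U = [embed_word(w) for w in ["the", "cat", "sat"]]  # one vector u_n per word
print(U[0].shape)                          # (d_wrd + cl_u,) = (8,)
```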

Char-Level Embedding: Details. The convolutional layer produces local features around each character of the word and combines them to obtain a fixed-sized character-level embedding. Given a word w composed of M characters {c1, c2, ..., cM}, each character cm is transformed into a character embedding r^chr_m. The input to the convolutional layer is then the sequence of character embeddings {r^chr_1, ..., r^chr_M}.

Char-Level Embedding: Details. The convolution slides a window of size k^chr (the character context window) over successive positions in the sequence {r^chr_1, r^chr_2, ..., r^chr_M}. For position m, the vector z_m (the concatenation of the character embeddings in its window) is defined as: z_m = (r^chr_{m−(k^chr−1)/2}, ..., r^chr_{m+(k^chr−1)/2})^T

Char-Level Embedding: Details. The convolutional layer computes the j-th element of the character-level embedding r^wch of the word w as: [r^wch]_j = max_{1 ≤ m ≤ M} [W^0 z_m + b^0]_j. The matrix W^0 extracts local features around each character window of the given word, and the global fixed-sized feature vector is obtained by applying the max operator over all character windows.

Char-Level Embedding: Details. Parameters to be learned: W^chr, W^0 and b^0. Hyper-parameters: d^chr, the size of the character embedding; cl_u, the number of convolutional units (which is also the size of the character-level embedding); k^chr, the size of the character context window. A worked sketch of this whole step follows below.
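
A toy numpy sketch of the character-level convolution with max pooling described on the preceding "Details" slides (the sizes, the zero-padding at word boundaries, and the lowercase-only character vocabulary are all simplifying assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hyper-parameters, not the paper's tuned values.
d_chr, cl_u, k_chr = 4, 3, 3                   # char embedding size, conv units, context window
V_chr = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

W_chr = rng.normal(size=(d_chr, len(V_chr)))   # W^chr: character embedding matrix
W0 = rng.normal(size=(cl_u, k_chr * d_chr))    # W^0: convolution weights
b0 = np.zeros(cl_u)                            # b^0

def char_level_embedding(word):
    # Look up r^chr_m for every character c_m; pad with zero vectors so each
    # character has a full context window of size k_chr (the padding scheme is
    # an assumption; the slides do not say how word boundaries are handled).
    pad = (k_chr - 1) // 2
    embs = ([np.zeros(d_chr)] * pad
            + [W_chr[:, V_chr[c]] for c in word]
            + [np.zeros(d_chr)] * pad)

    # z_m concatenates the window around character m; the convolution applies
    # W^0 z_m + b^0, and the element-wise max over all m gives r^wch.
    outputs = []
    for m in range(pad, pad + len(word)):
        z_m = np.concatenate(embs[m - pad : m + pad + 1])
        outputs.append(W0 @ z_m + b0)
    return np.max(np.stack(outputs), axis=0)

print(char_level_embedding("tagging"))         # fixed-size vector of length cl_u
```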

Scoring. The model follows Collobert et al.'s (2011) window approach to score all tags in T for each word in a sentence, under the assumption that the tag of a word depends mainly on its neighboring words. To compute the tag scores for the n-th word in the sentence, we first create a vector xn resulting from the concatenation of a sequence of k^wrd embeddings, centered on the n-th word.

Scoring. The window vector is x_n = (u_{n−(k^wrd−1)/2}, ..., u_{n+(k^wrd−1)/2})^T. The vector x_n is processed by two neural network layers to compute the tag scores: s(x_n) = W^2 h(W^1 x_n + b^1) + b^2, where W^1 ∈ R^(hl_u × k^wrd(d^wrd + cl_u)), W^2 ∈ R^(|T| × hl_u), and h is the hidden-layer transfer function.
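
A numpy sketch of this window-based scoring layer (toy sizes; tanh is used as a stand-in for the transfer function h, and zero padding at the sentence edges is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not the paper's tuned values.
k_wrd, d_wrd, cl_u, hl_u, n_tags = 3, 5, 3, 8, 4
dim_u = d_wrd + cl_u                               # size of each u_n

W1 = rng.normal(size=(hl_u, k_wrd * dim_u)); b1 = np.zeros(hl_u)
W2 = rng.normal(size=(n_tags, hl_u));        b2 = np.zeros(n_tags)

def tag_scores(U, n):
    """s(x_n): one score per tag for the n-th word of the sentence.

    U is the list of per-word vectors u_n; x_n concatenates the k_wrd vectors
    centered on position n (zero vectors are assumed at the sentence edges)."""
    pad = (k_wrd - 1) // 2
    window = [U[i] if 0 <= i < len(U) else np.zeros(dim_u)
              for i in range(n - pad, n + pad + 1)]
    x_n = np.concatenate(window)
    h = np.tanh(W1 @ x_n + b1)       # hidden layer; tanh stands in for h
    return W2 @ h + b2               # scores over the tag set T

U = [rng.normal(size=dim_u) for _ in range(6)]     # dummy sentence of 6 words
print(tag_scores(U, 2))                            # |T| scores for the 3rd word
```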

Structured Inference: The tags of neighbouring words are strongly dependent, so the model uses a prediction scheme that takes the sentence structure into account (Collobert et al., 2011).

Structured Inference: We compute the score of a tag path [t]_1^N = {t1, t2, ..., tN} as S([w]_1^N, [t]_1^N, θ) = Σ_{n=1}^{N} ( A_{t_{n−1}, t_n} + s(x_n)_{t_n} ), where s(x_n)_{t_n} is the score of tag t_n for word w_n, A_{t_{n−1}, t_n} is the transition score for jumping from tag t_{n−1} to tag t_n, and θ is the set of all trainable network parameters (W^wrd, W^chr, W^0, b^0, W^1, b^1, W^2, b^2, A).
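
A small numpy sketch of this path score (handling the initial transition for t_0 via an extra "start" row in A is an assumption made here; the slide does not specify how the first word is treated):

```python
import numpy as np

def path_score(tag_scores, A, tags):
    """S([w]_1^N, [t]_1^N, θ): sum of transition and tag scores along a path.

    tag_scores: (N, |T|) array; row n holds s(x_n) for the n-th word.
    A:          (|T| + 1, |T|) transition matrix; row 0 plays the role of an
                assumed start transition for the first word.
    tags:       length-N sequence of tag indices t_1 ... t_N."""
    total, prev = 0.0, 0
    for n, t in enumerate(tags):
        total += A[prev, t] + tag_scores[n, t]
        prev = t + 1                 # +1 because row 0 of A is the start row
    return total

# Toy usage: a 4-word sentence with 3 possible tags.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 3))     # s(x_n) for each word
A = rng.normal(size=(4, 3))          # (|T| + 1) x |T| transitions
print(path_score(scores, A, [0, 2, 1, 1]))
```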

Network Training: The network is trained by minimizing the negative log-likelihood over the training set D, as in Collobert et al. (2011). The sentence score is interpreted as a conditional log-probability of a tag path: log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − log Σ_{∀ [u]_1^N ∈ T^N} e^{S([w]_1^N, [u]_1^N, θ)}. Stochastic gradient descent is used to minimize the negative log-likelihood with respect to θ.
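
A numpy sketch of this training objective for one sentence, reusing the assumed start-row convention for A from the previous sketch; the normalizer over all tag paths is computed with the standard forward recursion in log space (the recursion itself is not on the slide, only the objective):

```python
import numpy as np

def logsumexp(v):
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def neg_log_likelihood(tag_scores, A, gold_tags):
    """-log p([t]_1^N | [w]_1^N, θ) for one training sentence.

    tag_scores: (N, |T|) array of s(x_n) scores.
    A:          (|T| + 1, |T|) transitions, row 0 being an assumed start row.
    gold_tags:  length-N list of gold tag indices."""
    # Score of the gold path, S([w], [t], θ).
    gold, prev = 0.0, 0
    for n, t in enumerate(gold_tags):
        gold += A[prev, t] + tag_scores[n, t]
        prev = t + 1

    # log Σ over all tag paths u of e^{S([w], [u], θ)}, via the forward
    # recursion in log space instead of enumerating |T|^N paths.
    n_tags = tag_scores.shape[1]
    delta = A[0, :] + tag_scores[0, :]
    for n in range(1, len(tag_scores)):
        delta = (np.array([logsumexp(delta + A[1:, j]) for j in range(n_tags)])
                 + tag_scores[n, :])
    return -(gold - logsumexp(delta))

# Toy usage: 4 words, 3 tags.
rng = np.random.default_rng(0)
print(neg_log_likelihood(rng.normal(size=(4, 3)), rng.normal(size=(4, 3)), [0, 2, 1, 1]))
```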

Experimental Setup: POS Tagging Datasets (OOSV = out-of-supervised-vocabulary words, OOUV = out-of-unsupervised-vocabulary words)

English: WSJ Corpus
SET        SENT.    TOKENS    OOSV    OOUV
TRAINING   38,219   912,344   0       6,317
DEVELOP.   5,527    131,768   4,467   958
TEST       5,462    129,654   3,649   923

Portuguese: Mac-Morpho Corpus
SET        SENT.    TOKENS    OOSV    OOUV
TRAINING   42,021   959,413   0       4,155
DEVELOP.   2,212    48,258    1,360   202
TEST       9,141    213,794   9,523   1,004

English POS Tagging Results: Comparison of different NNs for POS tagging of the WSJ Corpus

SYSTEM    FEATURES    ACC.    ACC. OOSV   ACC. OOUV
CHARWNN   –           97.32   89.86       85.48
WNN       CAPS+SUF2   97.21   89.28       86.89
WNN       CAPS        97.08   86.08       79.96
WNN       SUF2        96.33   84.16       80.61
WNN       –           96.13   80.68       71.94

Portuguese POS Tagging Results: POS tagging of the Mac-Morpho Corpus

SYSTEM    FEATURES    ACC.    ACC. OOSV   ACC. OOUV
CHARWNN   –           97.47   92.49       89.74
WNN       CAPS+SUF3   97.42   92.64       89.64
WNN       CAPS        97.27   90.41       86.35
WNN       SUF3        96.35   85.73       81.67
WNN       –           96.19   83.08       75.40

Results: Most similar words using character-level embeddings learned with the WSJ Corpus (first row: query words; rows below: their nearest neighbours)

INCONSIDERABLE      83-YEAR-OLD   SHEEP-LIKE      DOMESTICALLY   UNSTEADINESS     0.0055
INCONCEIVABLE       43-YEAR-OLD   ROCKET-LIKE     FINANCIALLY    UNEASINESS       0.0085
INDISTINGUISHABLE   63-YEAR-OLD   FERN-LIKE       ESSENTIALLY    UNHAPPINESS      0.0075
INNUMERABLE         73-YEAR-OLD   SLIVER-LIKE     GENERALLY      UNPLEASANTNESS   0.0015
INCOMPATIBLE        49-YEAR-OLD   BUSINESS-LIKE   IRONICALLY     BUSINESS         0.0040
INCOMPREHENSIBLE    53-YEAR-OLD   WAR-LIKE        SPECIALLY      UNWILLINGNESS    0.025

Results: Most similar words using word-level embeddings learned using unlabeled English texts (same layout as above)

INCONSIDERABLE   00-YEAR-OLD          SHEEP-LIKE        DOMESTICALLY   UNSTEADINESS      0.0000
INSIGNIFICANT    SEVENTEEN-YEAR-OLD   BURROWER          WORLDWIDE      PARESTHESIA       0.00000
INORDINATE       SIXTEEN-YEAR-OLD     CRUSTACEAN-LIKE   000,000,000    HYPERSALIVATION   0.000
ASSUREDLY        FOURTEEN-YEAR-OLD    TROLL-LIKE        00,000,000     DROWSINESS        0.000000
UNDESERVED       NINETEEN-YEAR-OLD    SCORPION-LIKE     SALES          DIPLOPIA          ±
SCRUPLE          FIFTEEN-YEAR-OLD     UROHIDROSIS       RETAILS        BREATHLESSNESS    -0.00

Results: Most similar words using word-level embeddings learned using unlabeled Portuguese texts (same layout as above)

GRADAÇÕES         CLANDESTINAMENTE   REVOGAÇÃO               DESLUMBRAMENTO   DROGASSE
TONALIDADES       ILEGALMENTE        ANULAÇÃO                ASSOMBRO         –
MODULAÇÕES        ALI                PROMULGAÇÃO             EXOTISMO         –
CARACTERIZAÇÕES   ATAMBUA            CADUCIDADE              ENFADO           –
NUANÇAS           BRAZZAVILLE        INCONSTITUCIONALIDADE   ENCANTAMENTO     –
COLORAÇÕES        VOLUNTARIAMENTE    NULIDADE                FASCÍNIO         –

Future Work: Analyzing the interrelationship between the two embeddings in more detail. Applying this approach to other NLP tasks such as text chunking, NER, etc.

Thank You