Learning Character-Level Representation for POS Tagging
Cícero Nogueira dos Santos, Bianca Zadrozny
Presented by Anirban Majumder
Introduction: Distributed Word Embeddings
Word embeddings are a useful technique for capturing syntactic and semantic information about words.
However, for many NLP tasks such as POS tagging, information about word morphology and shape is important, and it is not captured by these embeddings.
The paper proposes a deep neural network that learns a character-level representation to capture this intra-word information.
CharWNN Architecture
Joins word-level and character-level embeddings for POS tagging.
An extension of Collobert et al.'s (2011) NN architecture.
Uses a convolutional layer to extract a character-level embedding for a word of any size.
CharWNN Architecture
Input: a fixed-size window of words centered on the target word.
Output: for each word in a sentence, the network produces a score for each tag τ ∈ T (the tag set).
Word and Char-Level Embeddings
Every word comes from a fixed-size word vocabulary V^wrd, and every character from a fixed-size character vocabulary V^chr.
Two embedding matrices are used:
W^wrd ∈ R^(d^wrd × |V^wrd|)
W^chr ∈ R^(d^chr × |V^chr|)
Word and Char-Level Embeddings
Given a sentence of N words {w_1, w_2, ..., w_N}, each word w_n is converted into a vector representation u_n as follows:
u_n = [ r^wrd ; r^wch ]
where r^wrd ∈ R^(d^wrd) is the word-level embedding and r^wch ∈ R^(cl_u) is the character-level embedding.
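A minimal sketch (not the authors' code) of how the joint representation u_n = [r^wrd ; r^wch] could be formed. The toy vocabulary, dimensions, and random matrices below are illustrative assumptions, and the character-level part is stubbed with zeros here because it is produced by the convolutional layer described next.

```python
import numpy as np

# Illustrative sketch only: toy vocabulary and dimensions, random embedding matrix.
d_wrd, cl_u = 5, 3                              # word-embedding and char-embedding sizes
V_wrd = {"the": 0, "cat": 1, "sat": 2}          # toy word vocabulary V^wrd

rng = np.random.default_rng(0)
W_wrd = rng.normal(size=(d_wrd, len(V_wrd)))    # word embedding matrix W^wrd

def embed_word(word, r_wch):
    """u_n = [r^wrd ; r^wch]: concatenation of word-level and char-level embeddings."""
    r_wrd = W_wrd[:, V_wrd[word]]               # column lookup in W^wrd
    return np.concatenate([r_wrd, r_wch])

u = embed_word("cat", np.zeros(cl_u))           # char-level part stubbed with zeros here
print(u.shape)                                  # (d_wrd + cl_u,) = (8,)
```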
Char-Level Embedding: Details
Produces local features around each character of the word and combines them into a fixed-size character-level embedding.
Given a word w composed of M characters {c_1, c_2, ..., c_M}, each character c_m is transformed into a character embedding r^chr_m.
The input to the convolutional layer is then the sequence of character embeddings of the M characters.
Char-Level Embedding: Details
The convolution slides a window of size k^chr (the character context window) over the sequence of character embeddings {r^chr_1, r^chr_2, ..., r^chr_M}.
The vector z_m (the concatenation of the character embeddings in the window centered on character m) is defined as:
z_m = (r^chr_{m−(k^chr−1)/2}, ..., r^chr_{m+(k^chr−1)/2})^T
Char-Level Embedding: Details
The convolutional layer computes the j-th element of the character-level embedding r^wch of the word w as:
[r^wch]_j = max_{1 ≤ m ≤ M} [ W^0 z_m + b^0 ]_j
The matrix W^0 extracts local features around each character window of the given word.
A global fixed-size feature vector is obtained by taking the max over all character windows.
Char-Level Embedding: Details
Parameters to be learned: W^chr, W^0 and b^0.
Hyper-parameters:
d^chr : the size of the character embedding
cl_u : the number of convolutional units (also the size of the character-level embedding)
k^chr : the size of the character context window
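The sketch below is a minimal NumPy illustration of this character-level convolution with max pooling. The hyper-parameter values are toy assumptions, and padding the character sequence with zero vectors at the word boundaries is an assumption made so that every character has a full window; it is not the authors' implementation.

```python
import numpy as np

# Illustrative hyper-parameters (assumed values, not the paper's).
d_chr, cl_u, k_chr = 4, 6, 3                       # char emb size, conv units, char window
V_chr = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

rng = np.random.default_rng(0)
W_chr = rng.normal(size=(d_chr, len(V_chr)))       # W^chr: character embedding matrix
W0 = rng.normal(size=(cl_u, k_chr * d_chr))        # W^0: convolution weight matrix
b0 = np.zeros(cl_u)                                # b^0

def char_level_embedding(word):
    """r^wch: fixed-size char-level embedding via convolution + max over positions."""
    pad = (k_chr - 1) // 2
    emb = [np.zeros(d_chr)] * pad \
        + [W_chr[:, V_chr[c]] for c in word] \
        + [np.zeros(d_chr)] * pad                  # zero-pad so every char has a full window
    scores = []
    for m in range(len(word)):
        z_m = np.concatenate(emb[m:m + k_chr])     # z_m: window of k_chr char embeddings
        scores.append(W0 @ z_m + b0)               # local features around character m
    return np.max(np.stack(scores), axis=0)        # max over all character windows

print(char_level_embedding("loving").shape)        # (cl_u,) regardless of word length
```

Because of the final max over positions, the output has size cl_u whatever the length of the word, which is what allows words of any size to be handled.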
Scoring
Follows Collobert et al.'s (2011) window approach to score all tags in T for each word in a sentence.
Assumption: the tag of a word depends mainly on its neighboring words.
To compute the tag scores for the n-th word in the sentence, first create a vector x_n by concatenating a sequence of k^wrd embeddings, centered on the n-th word.
Scoring
x_n = (u_{n−(k^wrd−1)/2}, ..., u_{n+(k^wrd−1)/2})^T
The vector x_n is processed by two neural network layers to compute the scores:
s(x_n) = W^2 h(W^1 x_n + b^1) + b^2
where W^1 ∈ R^(hl_u × k^wrd(d^wrd + cl_u)) and W^2 ∈ R^(|T| × hl_u).
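A hedged sketch of this window-level scoring step with made-up dimensions; the tanh transfer function h and the zero-padding at sentence boundaries are assumptions of this sketch rather than details taken from the slides.

```python
import numpy as np

# Illustrative dimensions only (k_wrd words per window, hl_u hidden units, |T| tags).
k_wrd, d_wrd, cl_u, hl_u, num_tags = 5, 5, 6, 10, 45
dim_u = d_wrd + cl_u                                 # size of each joint embedding u_n

rng = np.random.default_rng(1)
W1 = rng.normal(size=(hl_u, k_wrd * dim_u))          # W^1
b1 = np.zeros(hl_u)                                  # b^1
W2 = rng.normal(size=(num_tags, hl_u))               # W^2
b2 = np.zeros(num_tags)                              # b^2

def tag_scores(u_seq, n):
    """s(x_n): a score for every tag, from a window of k_wrd embeddings centered on word n."""
    pad = (k_wrd - 1) // 2
    padded = [np.zeros(dim_u)] * pad + list(u_seq) + [np.zeros(dim_u)] * pad
    x_n = np.concatenate(padded[n:n + k_wrd])        # concatenation of the window
    return W2 @ np.tanh(W1 @ x_n + b1) + b2          # assumed h = tanh

sentence = [rng.normal(size=dim_u) for _ in range(7)]   # joint embeddings u_1..u_7
print(tag_scores(sentence, 3).shape)                    # (num_tags,)
```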
Structured Inference
The tags of neighboring words are strongly dependent.
Uses a prediction scheme that takes the sentence structure into account (Collobert et al., 2011).
Structured Inference
The score for a tag path [t]_1^N = {t_1, t_2, ..., t_N} is computed as:
S([w]_1^N, [t]_1^N, θ) = Σ_{n=1}^{N} ( A_{t_{n−1}, t_n} + s(x_n)_{t_n} )
s(x_n)_{t_n} is the score of tag t_n for word w_n.
A_{t_{n−1}, t_n} is a transition score for jumping from tag t_{n−1} to tag t_n.
θ is the set of all trainable network parameters (W^wrd, W^chr, W^0, b^0, W^1, b^1, W^2, b^2, A).
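The path score and the corresponding best-path (Viterbi-style) decoding can be sketched as below. This is an illustrative implementation, not the authors' code; the separate A_init vector holding the score of the first tag is an assumption about how the initial transition is handled.

```python
import numpy as np

def path_score(tag_path, word_scores, A, A_init):
    """S([w],[t],theta): transition scores A[t_{n-1}, t_n] plus tag scores s(x_n)_{t_n}."""
    score = A_init[tag_path[0]] + word_scores[0, tag_path[0]]
    for n in range(1, len(tag_path)):
        score += A[tag_path[n - 1], tag_path[n]] + word_scores[n, tag_path[n]]
    return score

def viterbi(word_scores, A, A_init):
    """Highest-scoring tag path under the same scoring scheme (structured inference)."""
    N, T = word_scores.shape
    delta = A_init + word_scores[0]                  # best score ending in each tag so far
    back = np.zeros((N, T), dtype=int)               # back-pointers to previous tags
    for n in range(1, N):
        cand = delta[:, None] + A + word_scores[n]   # rows: previous tag, cols: current tag
        back[n] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):                    # follow back-pointers to recover path
        path.append(int(back[n][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(2)
scores = rng.normal(size=(6, 4))                     # s(x_n) for a 6-word sentence, 4 tags
A, A0 = rng.normal(size=(4, 4)), rng.normal(size=4)
best = viterbi(scores, A, A0)
print(best, path_score(best, scores, A, A0))
```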
Network Training
The network is trained by minimizing the negative log-likelihood over the training set D, as in Collobert et al. (2011).
The sentence score is interpreted as a conditional probability over a tag path:
log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) − log( Σ_{∀[u]_1^N ∈ T^N} e^{S([w]_1^N, [u]_1^N, θ)} )
Stochastic gradient descent is used to minimize the negative log-likelihood with respect to θ.
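A sketch of the negative log-likelihood under these definitions: the log of the sum over all T^N tag paths is computed with a forward recursion in log-space (log-sum-exp), so paths are never enumerated explicitly. This is an assumed implementation for illustration, not the authors' code, and the gradients used by SGD are not shown.

```python
import numpy as np

def log_sum_all_paths(word_scores, A, A_init):
    """log( sum_{[u] in T^N} exp S([w],[u],theta) ), the partition term of the NLL."""
    N, T = word_scores.shape
    alpha = A_init + word_scores[0]                  # log-scores of paths ending in each tag
    for n in range(1, N):
        # alpha_new[t] = logsumexp_prev( alpha[prev] + A[prev, t] ) + s(x_n)_t
        alpha = np.logaddexp.reduce(alpha[:, None] + A, axis=0) + word_scores[n]
    return float(np.logaddexp.reduce(alpha))

def neg_log_likelihood(tag_path, word_scores, A, A_init):
    """-log p([t]|[w],theta) = log-partition minus the gold path score; minimized by SGD."""
    gold = A_init[tag_path[0]] + word_scores[0, tag_path[0]]
    for n in range(1, len(tag_path)):
        gold += A[tag_path[n - 1], tag_path[n]] + word_scores[n, tag_path[n]]
    return log_sum_all_paths(word_scores, A, A_init) - gold

rng = np.random.default_rng(3)
scores = rng.normal(size=(5, 3))                     # s(x_n) for 5 words, 3 tags
A, A0 = rng.normal(size=(3, 3)), rng.normal(size=3)
print(neg_log_likelihood([0, 2, 1, 1, 0], scores, A, A0))   # always >= 0
```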
Experimental Setup: POS Tagging Datasets

English (WSJ Corpus)
SET        SENT.    TOKENS    OOSV    OOUV
TRAINING   38,219   912,344   0       6,317
DEVELOP.   5,527    131,768   4,467   958
TEST       5,462    129,654   3,649   923

Portuguese (Mac-Morpho Corpus)
SET        SENT.    TOKENS    OOSV    OOUV
TRAINING   42,021   959,413   0       4,155
DEVELOP.   2,212    48,258    1,360   202
TEST       9,141    213,794   9,523   1,004
English POS Tagging Results
Comparison of different NNs for POS tagging of the WSJ Corpus:

SYSTEM    FEATURES    ACC.    ACC. OOSV   ACC. OOUV
CHARWNN   –           97.32   89.86       85.48
WNN       CAPS+SUF2   97.21   89.28       86.89
WNN       CAPS        97.08   86.08       79.96
WNN       SUF2        96.33   84.16       80.61
WNN       –           96.13   80.68       71.94
Portuguese POS Tagging Results
POS tagging of the Mac-Morpho Corpus:

SYSTEM    FEATURES    ACC.    ACC. OOSV   ACC. OOUV
CHARWNN   –           97.47   92.49       89.74
WNN       CAPS+SUF3   97.42   92.64       89.64
WNN       CAPS        97.27   90.41       86.35
WNN       SUF3        96.35   85.73       81.67
WNN       –           96.19   83.08       75.40
Results: Most similar words using character-level embeddings learned with the WSJ Corpus

QUERY:   INCONSIDERABLE      83-YEAR-OLD   SHEEP-LIKE      DOMESTICALLY   UNSTEADINESS     0.0055
         INCONCEIVABLE       43-YEAR-OLD   ROCKET-LIKE     FINANCIALLY    UNEASINESS       0.0085
         INDISTINGUISHABLE   63-YEAR-OLD   FERN-LIKE       ESSENTIALLY    UNHAPPINESS      0.0075
         INNUMERABLE         73-YEAR-OLD   SLIVER-LIKE     GENERALLY      UNPLEASANTNESS   0.0015
         INCOMPATIBLE        49-YEAR-OLD   BUSINESS-LIKE   IRONICALLY     BUSINESS         0.0040
         INCOMPREHENSIBLE    53-YEAR-OLD   WAR-LIKE        SPECIALLY      UNWILLINGNESS    0.025
Results: Most similar words using word-level embeddings learned from unlabeled English texts

QUERY:   INCONSIDERABLE   00-YEAR-OLD          SHEEP-LIKE        DOMESTICALLY   UNSTEADINESS      0.0000
         INSIGNIFICANT    SEVENTEEN-YEAR-OLD   BURROWER          WORLDWIDE      PARESTHESIA       0.00000
         INORDINATE       SIXTEEN-YEAR-OLD     CRUSTACEAN-LIKE   000,000,000    HYPERSALIVATION   0.000
         ASSUREDLY        FOURTEEN-YEAR-OLD    TROLL-LIKE        00,000,000     DROWSINESS        0.000000
         UNDESERVED       NINETEEN-YEAR-OLD    SCORPION-LIKE     SALES          DIPLOPIA          ±
         SCRUPLE          FIFTEEN-YEAR-OLD     UROHIDROSIS       RETAILS        BREATHLESSNESS    -0.00
Results: Most similar words using word-level embeddings learned from unlabeled Portuguese texts

QUERY:   GRADAÇÕES         CLANDESTINAMENTE   REVOGAÇÃO               DESLUMBRAMENTO   DROGASSE
         TONALIDADES       ILEGALMENTE        ANULAÇÃO                ASSOMBRO         –
         MODULAÇÕES        ALI                PROMULGAÇÃO             EXOTISMO         –
         CARACTERIZAÇÕES   ATAMBUA            CADUCIDADE              ENFADO           –
         NUANÇAS           BRAZZAVILLE        INCONSTITUCIONALIDADE   ENCANTAMENTO     –
         COLORAÇÕES        VOLUNTARIAMENTE    NULIDADE                FASCÍNIO         –
Future Work
Analyzing the interrelationship between the two embeddings in more detail.
Applying this approach to other NLP tasks such as text chunking, NER, etc.
Thank You