Natural Language Processing (Almost) from Scratch
Ronan Collobert et al., J. of Machine Learning Research, 2011
Presented by Guanqun Yang, Pencheng Xu, Haiqi Xiao, Yuqing Wang
Overview
- Introduction
- Benchmark tasks
- Neural network architecture
- Techniques to improve the performance of the NN: unlabeled data, multi-task learning, task-specific engineering
1. Introduction
Many NLP tasks involve assigning labels to words.
“Almost from scratch”: excel on multiple benchmarks while avoiding task-specific engineering, and instead “use a single learning system able to discover adequate internal representations”.
2. Benchmark tasks
Sequence labeling tasks: assign discrete labels to discrete elements in a sequence.
- Part-of-speech tagging (POS)
- Chunking (CHUNK)
- Named entity recognition (NER)
- Semantic role labeling (SRL)
2.1 POS Tagging
Assign a tag to each word in the sentence to represent its grammatical role. A simplistic tag set may contain only Verb, Noun, Determiner, etc.; some systems are more fine-grained.
Previously solved with traditional machine learning algorithms:
- Averaged perceptron (used by NLTK)
- HMM, CRF (previously covered in class)
These taggers are prone to mistakes when syntactic or semantic ambiguity appears (see the sketch below), e.g. “Can of fish.” vs. “They can fish.”, or the garden-path sentence “The old man the boat.”
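To make the "averaged perceptron (used by NLTK)" point concrete, here is a minimal, hedged sketch; it assumes NLTK is installed and that its tokenizer and tagger resources can be downloaded (resource names may differ slightly across NLTK versions).

```python
# A minimal POS-tagging sketch using NLTK's default (averaged perceptron) tagger.
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # default POS tagger

for sentence in ["They can fish.", "The old man the boat."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))   # list of (word, Penn Treebank tag) pairs

# Garden-path sentences like "The old man the boat." often expose tagger
# mistakes: "man" is a verb here but is typically tagged as a noun.
```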
2.2 Chunking (CHUNK)
Chunking (shallow parsing) labels segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP). POS tags tell you whether words are nouns, verbs, etc., but give no clue about the structure of the sentence; chunking identifies short phrases built from those tags.
With NLTK this amounts to defining a chunk grammar (a regular expression over tags) and creating a chunk parser; the output is a tree showing all chunks present in the sentence.
Example of a noun-phrase chunker (see the sketch below):
Sentence = [(“a”, “DT”), (“beautiful”, “JJ”), (“young”, “JJ”), (“lady”, “NN”), (“is”, “VBZ”), (“walking”, “VBG”), (“on”, “IN”), (“the”, “DT”), (“pavement”, “NN”)]
Grammar = “NP: {<DT>?<JJ>*<NN>}”
What if the grammar is “NP: {<DT>?<JJ>+<NN>}”? With “+” at least one adjective is required, so “the pavement” would no longer be chunked as an NP.
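A minimal sketch of the chunker described above, assuming NLTK is installed; the tagged sentence and the grammar mirror the slide's example.

```python
# A minimal NP-chunking sketch with NLTK's RegexpParser.
import nltk

sentence = [("a", "DT"), ("beautiful", "JJ"), ("young", "JJ"), ("lady", "NN"),
            ("is", "VBZ"), ("walking", "VBG"), ("on", "IN"),
            ("the", "DT"), ("pavement", "NN")]

# "*" allows zero or more adjectives, so both "a beautiful young lady" and
# "the pavement" are chunked; with "+" (one or more adjectives),
# "the pavement" would not match.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(sentence)
print(tree)   # a Tree whose NP subtrees are the detected noun phrases
```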
2.3 Named Entity Recognition
Locate and classify named entities in unstructured text into predefined categories such as person names, organizations, locations, times, and so on.
Example: “Jim bought 300 shares of Acme Corp. in 2006.” → [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
2.4 Semantic Role Labeling
Given a predicate (typically a verb), label the semantic roles of the other words in the same sentence:
- core arguments: agent/patient/theme/experiencer/beneficiary...
- adjunctive arguments: time/location/direction/degree/frequency...
Traditional pipeline: syntactic parsing → pruning candidate arguments → argument recognition → argument labeling → post-processing.
Drawbacks of the traditional pipeline:
- Performance depends heavily on feature engineering, which needs domain knowledge and laborious feature extraction and selection.
- Even with sophisticated features, long-range dependencies in a sentence can hardly be modeled.
- A specific annotated dataset is limited in scale; heterogeneous resources (with very different semantic role labels and annotation schemas but related latent semantics) could alleviate this, yet traditional methods cannot relate distinct annotation schemas or incorporate such resources with ease.
Deep neural network alternatives: bidirectional LSTM RNNs, LSTM cells with dependency-tree structure, and more (architecture engineering rather than feature engineering).
3. Networks | overview
The main idea is to turn each NLP task into an optimization problem over a neural network. Two architectures are used: the window approach and the sentence approach.
3. Networks | word → word feature vector
Lookup-table layer: word → word vector, word sequence → matrix of word vectors.
The first layer of the network maps each word index into a feature vector by a lookup-table operation: LT_W(w) = ⟨W⟩¹_w, the w-th column of the parameter matrix W. Given a task of interest, a relevant representation of each word is thus the corresponding lookup-table feature vector, and the resulting matrix of word vectors can be fed to further neural network layers (see the sketch below).
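A minimal numpy sketch of the lookup-table operation; the toy vocabulary, the dimension d_wrd and the random initialization are illustrative assumptions, not the paper's settings.

```python
# Each word index selects a column of the parameter matrix W (d_wrd x |D|).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "<PAD>": 3}
d_wrd = 5                                             # word feature (embedding) size
W = 0.1 * rng.standard_normal((d_wrd, len(vocab)))    # lookup-table parameters

word_indices = [vocab[w] for w in ["the", "cat", "sat"]]
features = W[:, word_indices]                         # d_wrd x (sentence length)
print(features.shape)                                 # (5, 3): one column per word
```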
3. Networks | word feature vector → higher-level features
Window approach: fixed-size window → lookup table → linear layer → scoring.
The window approach assumes the tag of a word depends mainly on its neighboring words. Given a word to tag, we consider a fixed-size window of ksz words (a hyper-parameter) around this word. Each word in the window is first passed through the lookup-table layer, producing a matrix of word features of fixed size d_wrd × ksz.
HardTanh: a “hard” version of the hyperbolic tangent is used as the non-linearity; it has the advantage of being slightly cheaper to compute.
Scoring: the output size of the last layer of the network equals the number of possible tags for the task of interest, so each output can be interpreted as a score of the corresponding tag.
Lookup Table → HardTanh → Scoring
Concatenation: the d_wrd × ksz matrix of window word features can be viewed as a (d_wrd · ksz)-dimensional vector by concatenating its columns; this vector is then fed to the further neural network layers, as in the sketch below.
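A minimal PyTorch sketch of the window approach just described (lookup table → concatenation → linear → HardTanh → linear scoring); the vocabulary size, window size and layer widths are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

vocab_size, d_wrd, ksz, n_hidden, n_tags = 1000, 50, 5, 300, 10   # assumed sizes

class WindowTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, d_wrd)     # lookup table
        self.hidden = nn.Linear(d_wrd * ksz, n_hidden)    # first linear layer
        self.scorer = nn.Linear(n_hidden, n_tags)         # one score per tag

    def forward(self, window_ids):                        # (batch, ksz) word indices
        x = self.lookup(window_ids)                       # (batch, ksz, d_wrd)
        x = x.flatten(start_dim=1)                        # concatenate the columns
        x = nn.functional.hardtanh(self.hidden(x))        # "hard" tanh non-linearity
        return self.scorer(x)                             # (batch, n_tags) tag scores

scores = WindowTagger()(torch.randint(0, vocab_size, (2, ksz)))
print(scores.shape)   # torch.Size([2, 10])
```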
3. Networks | word feature vector → higher-level features
Sentence approach: addresses the problem raised by the SRL task by taking the whole sentence, rather than a window of words, as input.
- Most layers are the same as in the window approach.
- Additional layers: a convolutional layer and a max layer; the weight matrix W of the convolution is the same for all windows.
The window approach fails for SRL because the tag of a word depends on the chosen verb (predicate): if the predicate falls outside the window, the tagging prediction can be incorrect, so tagging a word requires considering the whole sentence.
The convolutional (sentence) approach takes the complete sentence, passes it through the lookup-table layer (just as in the window approach), produces local features around each word with the convolutional layer, and combines these features into a global feature vector via the max layer, which can then be fed to standard affine layers, as sketched below.
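A minimal PyTorch sketch of the extra layers of the sentence approach, a convolution over the word-feature sequence followed by a max over time; the sizes are illustrative assumptions, and only the new layers are shown.

```python
import torch
import torch.nn as nn

d_wrd, ksz, n_hu = 50, 3, 300                  # feature size, window, hidden units
conv = nn.Conv1d(in_channels=d_wrd, out_channels=n_hu,
                 kernel_size=ksz, padding=ksz // 2)    # same weights for every window

sentence = torch.randn(1, d_wrd, 17)           # (batch, d_wrd, sentence length)
local = conv(sentence)                         # local features around each word
global_feat, _ = local.max(dim=2)              # max layer: max over word positions
print(global_feat.shape)                       # torch.Size([1, 300])

# global_feat has a fixed size regardless of sentence length and can be fed to
# the same linear / HardTanh / scoring layers as in the window approach.
```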
3. Networks | training
Maximize the log-likelihood over the training data by stochastic gradient ascent: a random example is drawn from the dataset to make each parameter update.
Two loss functions:
- Word-level log-likelihood
- Sentence-level log-likelihood: captures the tag dependencies needed in some tasks, since some tag pairs are unlikely to occur for grammatical reasons.
3. Networks | training
Word-level log-likelihood (WLL): the same loss as in multiclass classification. The softmax gives non-negativity and unity (probabilities sum to one), and working with the log-likelihood eases the implementation of backpropagation.
Two types of log-likelihood are used: word-level and sentence-level. Given a word and a tag, the word-level log-likelihood is a regular softmax: the score of the label normalized by the “logadd” (log-sum-exp) of the scores of all possible labels (see the sketch below).
For the sentence-level log-likelihood, a similar score is defined for a (sentence, tag sequence) pair. To take the sentence structure into account, this score sums all word-level scores of the word-tag pairs together with transition scores A for jumping between tags of successive words. Thanks to the associativity and distributivity of “logadd”, a recursive technique (dynamic programming with memoization) reduces the time to compute “logadd” over all tag sequences of a sentence from exponential to linear in the sentence length t. Moreover, at inference time, replacing “logadd” with the “max” operation lets the Viterbi algorithm be applied directly.
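A minimal numpy sketch of the word-level log-likelihood as described above; the tag scores here are made-up numbers, not network outputs.

```python
# The log-probability of the true tag is its score minus the "logadd"
# (log-sum-exp) of all tag scores.
import numpy as np

def logadd(scores):
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())    # numerically stable log-sum-exp

tag_scores = np.array([2.0, -1.0, 0.5])            # one score per possible tag
true_tag = 0
word_log_likelihood = tag_scores[true_tag] - logadd(tag_scores)
print(word_log_likelihood)                         # log p(true tag | window)
```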
3. Networks | training
Sentence-level log-likelihood (SLL): analogous to WLL, but defined over a whole (word sequence, tag sequence) pair.
- score: sum of the per-word tag scores plus the tag transition scores A between the tags of successive words t-1 and t
- recursion + memoization reduces the cost of summing over all tag sequences from O(K^T) to O(K²·T), where K is the number of tags and T the sentence length (see the sketch below)
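A minimal numpy sketch of the sentence-level recursion; the scores, transitions and tag sequence are random/illustrative, and the initial-tag transition score used in the paper is omitted for brevity.

```python
# scores[t, k]: network score of tag k at position t; A[i, j]: transition score
# from tag i to tag j. delta accumulates the logadd over all tag paths ending in
# each tag, giving O(K^2 T) work instead of O(K^T).
import numpy as np

def logadd(x, axis=None):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sentence_log_likelihood(scores, A, tags):
    T, K = scores.shape
    # score of the given (word sequence, tag sequence) pair
    path_score = scores[np.arange(T), tags].sum() + A[tags[:-1], tags[1:]].sum()
    # forward recursion: logadd over all possible tag sequences
    delta = scores[0].copy()                            # (K,)
    for t in range(1, T):
        delta = scores[t] + logadd(delta[:, None] + A, axis=0)
    return path_score - logadd(delta)                   # log p(tags | sentence)

# Replacing logadd with max (plus argmax back-pointers) in the same recursion
# gives Viterbi decoding at inference time.
T, K = 4, 3
rng = np.random.default_rng(0)
print(sentence_log_likelihood(rng.standard_normal((T, K)),
                              rng.standard_normal((K, K)),
                              np.array([0, 2, 1, 1])))
```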
3. Networks | supervised benchmark results
NN: neural network; WLL: word-level log-likelihood; SLL: sentence-level log-likelihood.
3. Networks | glimpse at word embedding space/lookup table
“Ideally, we would like semantically similar words to be close in the embedding space represented by the word lookup table: by continuity of the neural network function, tags produced on semantically similar sentences would be similar.”
4. Techniques to improve the performance of NN
Unlabeled data Multi-task learning Task-specific engineering
4.1 Unlabeled data (word embeddings)
Model performance hinges on the initialization of the lookup table, so unlabeled data are used to improve the embeddings: 221 million unlabeled words versus 130 thousand labeled words, roughly 1700 times as much unlabeled as labeled data.
System: use the window-approach architecture to acquire word embeddings, trained with a ranking criterion instead of the entropy criterion (the WLL/SLL described previously), because
- computing the normalization term is demanding even with an efficient algorithm, and
- the entropy criterion tends to ignore infrequent words (the 15% most common words appear about 90% of the time).
About the ranking criterion: it encourages f_θ(x) − f_θ(x^(w)), the score difference between a “positive” window x and a “negative” window x^(w) with the centre word replaced, to reach a margin of 1; beyond that margin the example contributes no loss. -Pengcheng
4.1 Unlabeled data | ranking criterion
The loss has the same form as the hinge loss used in SVMs: the max with 0 enforces non-negativity of the loss, and minimizing it is equivalent to maximizing (up to the margin) the difference between
- the score of the window containing the true word, and
- the score of the same window with that word replaced.
In this way the meaning of an individual word becomes distinguishable from that of other words; a minimal sketch follows.
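A minimal sketch of the ranking criterion for a single (positive, corrupted) pair of windows; the scores below simply stand in for outputs of the window-approach network f_θ.

```python
# Pairwise ranking hinge loss: the true window should score at least 1 higher
# than the window whose centre word was replaced by a random word.
def ranking_loss(score_true, score_corrupted):
    # zero once score_true exceeds score_corrupted by the margin 1
    return max(0.0, 1.0 - score_true + score_corrupted)

print(ranking_loss(2.3, 0.1))   # margin satisfied -> 0.0, no gradient
print(ranking_loss(0.4, 0.2))   # margin violated  -> 0.8, pushes scores apart
```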
4.1 Unlabeled data | embeddings
Appealing: “syntactic and semantic properties of the neighbors are clearly related to those of the query word”
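A minimal sketch of how such neighbour lists can be produced from a trained lookup table by cosine similarity; the vocabulary and the (random) embedding matrix here are purely illustrative.

```python
# Rank all words by cosine similarity of their lookup-table columns
# to the query word's column.
import numpy as np

def nearest_neighbours(query, W, vocab, k=5):
    idx = {w: i for i, w in enumerate(vocab)}
    q = W[:, idx[query]]
    sims = (W.T @ q) / (np.linalg.norm(W, axis=0) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != query][:k]

vocab = ["france", "austria", "belgium", "jesus", "god", "xbox"]
W = np.random.default_rng(0).standard_normal((50, len(vocab)))   # toy embeddings
print(nearest_neighbours("france", W, vocab))
```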
4.1 Unlabeled data | semi-supervised benchmark results
The word lookup tables of the supervised networks are initialized with the embeddings computed by the language models; the supervised training stage is then free to modify the lookup tables.
WLL: word-level log-likelihood; SLL: sentence-level log-likelihood; NN: neural network; LM: language model.
4.2 Multi-task learning
Features trained for one task can be useful for related tasks, e.g. language model → word embeddings → supervised networks for NLP. A more systematic way is to jointly train the models of all tasks with an additional linkage between their trainable parameters (to improve generalization error), either through
- a regularization term in the joint cost function, biasing the models toward common representations, or
- certain shared parameters defined a priori.
The authors then consider such further improvements. Models for the related tasks of interest are jointly trained with an additional linkage between their trainable parameters in the hope of improving the generalization error. A simple approach to joint training is to share parameters among the different models. (In contrast, joint decoding considers additional probabilistic dependency paths between models, which defines an implicit supermodel describing all tasks in the same probabilistic framework; without joint training, the additional dependency paths cannot directly involve unobserved variables, so joint training is the natural way to discover common internal representations across tasks.) In the unified model trained by the authors, the word lookup table, the first linear layer, and part of the convolution layer are shared across the per-task models, and training minimizes the loss averaged over all tasks. Two observations: it produces a single unified network that performs well on all these tasks (sentence approach), but it only leads to marginal improvements over using a separate network for each task.
4.2 Multi-task learning | multitasking example
Shared parameters: the lookup table, the first linear layer, and (in the sentence approach) the convolution layer.
Training: minimize the loss averaged across all tasks; a minimal sketch follows.
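A minimal PyTorch sketch of the sharing scheme described above: two window-approach taggers (e.g. POS and CHUNK) share the lookup table and the first linear layer, and only the scoring heads are task-specific; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_wrd, ksz, n_hidden = 1000, 50, 5, 300
n_tags = {"pos": 45, "chunk": 23}               # assumed tag-set sizes per task

shared = nn.Sequential(
    nn.Embedding(vocab_size, d_wrd),            # shared lookup table
    nn.Flatten(start_dim=1),                    # concatenate the window's word vectors
    nn.Linear(d_wrd * ksz, n_hidden),           # shared first linear layer
    nn.Hardtanh(),
)
heads = nn.ModuleDict({t: nn.Linear(n_hidden, n) for t, n in n_tags.items()})

# A training step would alternate over (or average) the per-task losses;
# here we just show the forward pass for both tasks on the same window.
window = torch.randint(0, vocab_size, (8, ksz))
features = shared(window)
for task, head in heads.items():
    print(task, head(features).shape)           # task-specific tag scores
```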
4.2 Multi-task learning | multi-task benchmark results
Joint training produces a single unified network that performs well on all these tasks (sentence approach), but it only leads to marginal improvements over using a separate network for each task.
4.3 The temptation: Gazetteers
- Traditional NER feature: restricted to the gazetteers provided by the CoNLL challenge, matching not only single words but also chunks of words
- Resulting performance
- Explanation
4.3 The temptation: Engineering a Sweet Spot
A standalone version of the architecture covering the four tasks, released as “SENNA”, and its performance.
5. Conclusion
A unified framework for NLP using neural networks. Techniques to improve the performance of the NN:
- Architecture: window approach, sentence approach
- Likelihood function: word-level log-likelihood, sentence-level log-likelihood
6. Historical remark
This paper is regarded as a milestone in solving NLP problems with neural networks.
- Bengio et al., A Neural Probabilistic Language Model (covered by the professor)
- Collobert et al., Natural Language Processing (Almost) from Scratch (covered in this class)
- Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality (a.k.a. word2vec) (covered by the professor)
Q & A Thank you 😄