Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date:
From Word Representations:... (ACL 2010)
From Frequency... (JAIR 2010)
Representing Word... (Psychological Review 2007)
Outline
  Introduction
  Word representations
  Experimental comparisons (ACL 2010)
    Chunking, named entity recognition
  Conclusions
Introduction
  A word representation
    A mathematical object associated with each word, often a vector.
  Examples
    dog: animal, pet, four-leg,...
    cat: animal, pet, four-leg,...
    bird: animal, two-leg, fly,...
  Questions
    How do we build this matrix?
    Are there representations other than a matrix?
  [Figure: matrix with an axis labeled "Vocabulary"]
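
The feature lists above can be read as rows of a word-by-feature matrix. The sketch below is only an illustration: the binary encoding, the names `features`, `to_vector`, and `cosine`, and the use of cosine similarity are assumptions, not something the slide specifies.

```python
import math

# Feature lists copied from the slide; the binary (0/1) encoding is an assumption.
features = {
    "dog": {"animal", "pet", "four-leg"},
    "cat": {"animal", "pet", "four-leg"},
    "bird": {"animal", "two-leg", "fly"},
}

# Columns of the matrix: every feature seen for any word, in a fixed order.
columns = sorted(set().union(*features.values()))

def to_vector(word):
    """Return the binary row vector for `word` over `columns`."""
    return [1 if f in features[word] else 0 for f in columns]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# One row per vocabulary word.
matrix = {w: to_vector(w) for w in features}

dog_cat = cosine(matrix["dog"], matrix["cat"])
dog_bird = cosine(matrix["dog"], matrix["bird"])
```

With these three rows, dog and cat share every listed feature and so come out as more similar to each other than either is to bird, which shares only "animal".
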
Word Representations
  Categorizing word representations by source
    From humans
      Feature lists, semantic networks, ontologies (WordNet, SUMO, FrameNet,...)
    From texts
      Frequency-based
        Distributional representation, Latent Semantic Indexing
      Model-based
        Clustering (Brown clustering), Latent Dirichlet Allocation, embedding (neural language model, hierarchical log-bilinear model)
      Operations-based
        Random indexing (quantum informatics), holographic lexicon
Word Representations
  Some important considerations
    Dimension size
      Distributional representations: > 5000
      HLBL, random indexing, LSI: < 500
    Format
      Vector
      Network
    Encoded knowledge/relations/information
      World knowledge: ontology
      Word semantics: word similarity/distance/proximity
  Most important question in word representations
    What is meaning?
Word Representations – Distributional Representations
  From texts, frequency-based
  Matrix layout: rows are tokens, columns are events
  Examples of an event
    Occurring in the same document
    Occurring within a window of 5 words
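
A minimal sketch of the window-based event, assuming a symmetric window counted in tokens on each side. The function name `cooccurrence_counts` and the toy sentence are made up for illustration; the slide only states the window size.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=5):
    """For each token, count the words appearing within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                counts[word][tokens[j]] += 1
    return counts

# Toy corpus (hypothetical); a smaller window keeps the example easy to check by hand.
tokens = "the dog chased the cat and the cat chased the bird".split()
counts = cooccurrence_counts(tokens, window=2)
```

Each entry `counts[w]` is one row of the token-by-event matrix, stored sparsely. Because the window is symmetric, `counts[a][b]` equals `counts[b][a]`.
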
Word Representations – Distributional Representations
  An event can also be a pattern.
    Sentence: "A door is a part of a house."
    Token: door:house
    Event: is_a_part_of
  Some procedures applied to the matrix [From Frequency to Meaning: Vector Space Models of Semantics, 2010, JAIR]
    Preprocessing of texts (tokenization, annotation,...)
    Normalization/weighting
    Smoothing of the matrix (using SVD)
      Latent meaning, noise reduction, high-order co-occurrence, sparsity reduction
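
To make the weighting step concrete, here is a sketch using positive pointwise mutual information (PPMI). PPMI is an assumption chosen for illustration because it is a common weighting for such matrices; the slide does not name a specific scheme. The counts and the `is_near` and `is_a` events are hypothetical.

```python
import math

def ppmi(counts):
    """Map raw counts {row: {col: n}} to positive pointwise mutual information."""
    total = sum(n for cols in counts.values() for n in cols.values())
    row_sum = {r: sum(cols.values()) for r, cols in counts.items()}
    col_sum = {}
    for cols in counts.values():
        for c, n in cols.items():
            col_sum[c] = col_sum.get(c, 0) + n
    weighted = {}
    for r, cols in counts.items():
        weighted[r] = {}
        for c, n in cols.items():
            if n == 0:
                weighted[r][c] = 0.0
                continue
            # PMI = log2( P(r, c) / (P(r) * P(c)) ); negative values are clipped to 0.
            pmi = math.log2(n * total / (row_sum[r] * col_sum[c]))
            weighted[r][c] = max(0.0, pmi)
    return weighted

# Toy token-by-event counts (hypothetical numbers).
counts = {
    "door:house": {"is_a_part_of": 8, "is_near": 2},
    "dog:animal": {"is_a": 9, "is_near": 1},
}
weights = ppmi(counts)
```

Events that occur with a token more often than chance predicts get a positive weight; pairs that occur less often than chance are set to zero, which keeps the matrix sparse before any smoothing is applied.
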
Word Representations – Brown Clustering
  The Brown algorithm (Brown et al., 1992)
    is a hierarchical clustering algorithm that clusters words to maximize the mutual information of bigrams.
    is a class-based bigram language model.
    runs in time O(V·K^2), where V is the size of the vocabulary and K is the number of clusters.
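
The quantity being maximized can be sketched as the mutual information between the classes of adjacent words. The code below only evaluates that objective for a given word-to-class assignment; it is not the Brown algorithm itself, which searches over assignments by merging clusters. The function name, the toy corpus, and the two assignments are hypothetical.

```python
import math
from collections import Counter

def average_mutual_information(tokens, cluster_of):
    """Mutual information (in bits) between the classes of adjacent tokens."""
    classes = [cluster_of[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    left = Counter()   # how often each class appears as the first element
    right = Counter()  # how often each class appears as the second element
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n
    ami = 0.0
    for (c1, c2), n in bigrams.items():
        # p(c1, c2) * log2( p(c1, c2) / (p(c1) * p(c2)) )
        ami += (n / total) * math.log2(n * total / (left[c1] * right[c2]))
    return ami

tokens = "the dog runs the cat runs the dog sleeps the cat sleeps".split()

# Grouping by syntactic role: determiner, nouns, verbs.
by_role = {"the": 0, "dog": 1, "cat": 1, "runs": 2, "sleeps": 2}
# An arbitrary grouping that mixes nouns and verbs.
mixed = {"the": 0, "dog": 1, "cat": 2, "runs": 1, "sleeps": 2}
# Everything in one class carries no information about the next class.
single = {w: 0 for w in set(tokens)}

ami_by_role = average_mutual_information(tokens, by_role)
ami_mixed = average_mutual_information(tokens, mixed)
ami_single = average_mutual_information(tokens, single)
```

On this toy corpus the role-based grouping scores higher than the mixed one, and the single-class grouping scores zero, which is the behavior a clustering that maximizes this objective would exploit.
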
Word Representations – Embedding
  Collobert and Weston embedding (2008)
    Neural language model
    Discriminative and non-probabilistic
  Hierarchical log-bilinear embedding (HLBL) (2009)
    Neural language model
    Distributed representation
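
One way to read "discriminative and non-probabilistic" is as a ranking criterion: an observed n-gram should score higher than a corrupted copy by some margin, with no normalized probability involved. The sketch below assumes a margin of 1, corruption of the last word, and a linear scorer standing in for the neural network; all names (`embedding`, `weights`, `score`, `ranking_loss`) and sizes are hypothetical, and no training step is shown.

```python
import random

random.seed(0)
DIM = 4  # embedding size for the toy example; real embeddings are much larger
vocab = ["the", "dog", "cat", "barks", "of"]

# Lookup table: one dense vector per word, randomly initialized.
embedding = {w: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for w in vocab}

# Hypothetical linear scorer over a concatenated 3-gram (stand-in for a neural network).
weights = [random.uniform(-0.1, 0.1) for _ in range(3 * DIM)]

def score(ngram):
    """Concatenate the word vectors of the n-gram and apply the linear scorer."""
    x = [v for w in ngram for v in embedding[w]]
    return sum(a * b for a, b in zip(weights, x))

def ranking_loss(ngram, corrupt_word):
    """Hinge loss: the observed n-gram should outscore the corrupted one by a margin of 1."""
    corrupted = ngram[:-1] + [corrupt_word]
    return max(0.0, 1.0 - score(ngram) + score(corrupted))

loss = ranking_loss(["the", "dog", "barks"], "of")
```

Minimizing this loss by gradient descent would update both the scorer and the rows of the lookup table; the learned rows are what gets reused as word features.
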
Experimental Comparisons (ACL 2010)
  Chunking
    CoNLL-2000 shared task
    Linear CRF chunker (Sha and Pereira 2003)
    Data
      From Penn Treebank: 7936 sentences (training), 1000 sentences (development)
  Named Entity Recognition
    Regularized averaged perceptron model (Ratinov and Roth 2009)
    CoNLL03 shared task
      204k words for training, 51k words for development, 46k words for testing
    Out-of-domain evaluation: MUC7 formal run (59k words)
Experimental Comparisons – Features
  [Tables: chunking features; NER features]
Experimental Comparisons – Results
  [Results figures]
Conclusions
  Word features can be learned in advance in an unsupervised, task-inspecific, and model-agnostic manner.
  The disadvantage is that accuracy might not be as high as that of a semi-supervised method that includes task-specific information and jointly learns the supervised and unsupervised tasks (Ando & Zhang, 2005, ASO; Suzuki & Isozaki, 2008; Suzuki et al., 2009).
  Future work: inducing phrase representations.
Q&A