1 Gibbs Sampling with Treeness Constraint in Unsupervised Dependency Parsing
David Mareček and Zdeněk Žabokrtský, Institute of Formal and Applied Linguistics, Charles University in Prague. September 15, 2011, Hissar, Bulgaria

2 Motivations for unsupervised parsing
We want to parse texts for which we have no manually annotated treebanks: texts from different domains, different languages.
We want to learn sentence structures from the corpus alone.
What if the structures produced by linguists are not suitable for NLP?
Annotations are expensive.
It is a challenge: can we beat the supervised techniques in some application?

3 Outline
Parser description: priors, models, sampling
Sampling constraints: treeness, root fertility, noun-root dependency repression
Evaluation: on the Czech treebank; on all 19 CoNLL shared-task treebanks
Conclusions

4 Basic features of our approach
Learning is based on Gibbs sampling.
We approximate the probability of a tree by the product of the probabilities of its individual edges.
We use only POS tags for predicting dependency relations, but plan to use lexicalization and unsupervised POS tagging in the future.
We introduce treeness as a hard constraint in the sampling procedure.
The approach allows non-projective edges.

5 Models
We use two simple models in our experiments:
the parent POS tag conditioned on the child POS tag
the edge length (the signed distance between the two words) conditioned on the child POS tag

6 Gibbs sampling
We sample each dependency edge one at a time, for 50 iterations.
The rich get richer (self-reinforcing behavior): counts are taken from the history.
Exchangeability: we can treat each edge as if it were the last one in the corpus; the numerators and denominators in the product are exchangeable.
The Dirichlet hyperparameters α1 and α2 were set experimentally.
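One Gibbs update under exchangeability can be sketched as follows: the current edge's counts are removed, so the edge can be treated as the last one in the corpus, and a new parent is drawn from the resulting smoothed distribution. A hedged sketch with a single parent-tag model; the helper names and toy constants are ours:

```python
import random
from collections import Counter

counts = Counter()       # counts[(child_tag, parent_tag)] over current edges
ALPHA, N_TAGS = 1.0, 10  # Dirichlet hyperparameter and tagset size (toy values)

def resample_parent(child_tag, old_parent_tag, candidate_tags, rng=random):
    """One Gibbs step: forget the current edge, then redraw it."""
    counts[(child_tag, old_parent_tag)] -= 1     # treat this edge as the last one
    total = sum(counts[(child_tag, t)] for t in candidate_tags)
    weights = [(counts[(child_tag, t)] + ALPHA) / (total + ALPHA * N_TAGS)
               for t in candidate_tags]          # the rich get richer
    new_parent = rng.choices(candidate_tags, weights=weights)[0]
    counts[(child_tag, new_parent)] += 1
    return new_parent
```

Because the decrement happens before the draw and the increment after it, the total number of counted edges is preserved across updates.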

7 Basic sampling
For each node, sample its parent with respect to the probability distribution.
The sampling order of the nodes is random.
Problem: this may create cycles and discontinuous graphs.
[Figure: candidate edge probabilities for the example sentence "Její dcera byla včera v zoologické zahradě." ("Her daughter was in the zoological garden yesterday.")]
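The basic sampling step and the cycle problem can be illustrated with a minimal sketch, assuming parents are drawn node by node from categorical distributions; node 0 stands for the artificial ROOT and all names are illustrative:

```python
import random

def sample_parents(probs, rng=random):
    """For each node, draw a parent from its candidate distribution.
    probs[i] maps candidate parent index -> weight; 0 is the artificial ROOT."""
    parent = {}
    nodes = list(probs)
    rng.shuffle(nodes)  # the sampling order of the nodes is random
    for i in nodes:
        cands = list(probs[i])
        weights = [probs[i][c] for c in cands]
        parent[i] = rng.choices(cands, weights=weights)[0]
    return parent

def has_cycle(parent):
    """True if following parent links from some node never reaches ROOT (0)."""
    for start in parent:
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return True
            seen.add(node)
            node = parent[node]
    return False
```

Because each parent is drawn independently, nothing prevents two nodes from choosing each other, which is exactly the situation the treeness constraint repairs.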

8 Treeness constraint
In case a cycle is created:
choose one edge in the cycle (by sampling) and delete it
take the formed subtree and attach it to one of the remaining nodes (by sampling)
[Figure: cycle repair on the example sentence "Její dcera byla včera v zoologické zahradě."]
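The repair above can be sketched as: find the cycle by following parent links, delete one of its edges, and re-attach the detached child (with its subtree) outside the cycle. A minimal sketch; the uniform edge choice and the small floor weight are our simplifications, and a full implementation would also exclude descendants of the cycle from the re-attachment candidates:

```python
import random

def find_cycle(parent):
    """Return the nodes forming a cycle, or None. Node 0 is ROOT."""
    for start in parent:
        path, node = [], start
        while node != 0 and node not in path:
            path.append(node)
            node = parent[node]
        if node != 0:                # stopped on a repeated node: a cycle
            return path[path.index(node):]
    return None

def break_cycle(parent, weights, rng=random):
    """Treeness repair: delete one edge of the cycle (sampled) and attach the
    detached child, with its subtree, to a node outside the cycle (sampled)."""
    cycle = find_cycle(parent)
    if cycle is None:
        return parent
    child = rng.choice(cycle)        # edge child -> parent[child] is deleted
    outside = [n for n in list(parent) + [0] if n not in cycle]
    w = [weights.get((child, p), 1e-6) for p in outside]
    parent[child] = rng.choices(outside, weights=w)[0]
    return parent
```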

9 Root fertility constraint
Individual phrases tend to be attached to the technical root.
A sentence usually has only one word (the main verb) that dominates the others.
We constrain the root fertility to be one.
If the root has more than one child, we resample:
sample one child that will stay under the root
resample the parents of the other children
[Figure: root-fertility repair on the example sentence.]
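The resampling step can be sketched as follows, assuming a `weights` map over candidate edges; the names are illustrative, and a full implementation would also re-check treeness after re-attaching the displaced children:

```python
import random

def enforce_root_fertility(parent, weights, rng=random):
    """Keep exactly one child under the technical ROOT (node 0): sample the
    child that stays, then resample a non-ROOT parent for the others."""
    root_children = [n for n in parent if parent[n] == 0]
    if len(root_children) <= 1:
        return parent
    keep = rng.choice(root_children)       # this child stays under the root
    for child in root_children:
        if child == keep:
            continue
        cands = [n for n in parent if n != child]   # any other word
        w = [weights.get((child, p), 1e-6) for p in cands]
        parent[child] = rng.choices(cands, weights=w)[0]
    return parent
```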

10 Noun-ROOT dependency repression
Nouns (especially subjects) often substitute for verbs in governing positions.
The majority of grammars are verbocentric.
Nouns can easily be recognized as the most frequent coarse-grained tag category in the corpus.
We add a model that represses noun-ROOT dependencies.
This model is useless when unsupervised POS tagging is used.
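The exact formula of the repression model is not reproduced in this transcript; purely as an illustration, the effect can be approximated by a multiplicative penalty on noun-ROOT edges. The penalty constant and function name below are our invention, not the authors' model:

```python
NOUN_ROOT_PENALTY = 0.01   # assumed placeholder value, not from the paper

def repressed_edge_score(base_score, child_tag, parent_is_root, noun_tag="NOUN"):
    """Down-weight edges that attach a noun directly under the technical ROOT,
    pushing the sampler toward verbocentric trees."""
    if parent_is_root and child_tag == noun_tag:
        return base_score * NOUN_ROOT_PENALTY
    return base_score
```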

11 Evaluation measures
Evaluating an unsupervised parser against GOLD data is problematic: many linguistic decisions must be made before a corpus is annotated (how to deal with coordination structures, auxiliary verbs, prepositions, subordinating conjunctions?).
We use the three following measures:
UAS (unlabeled attachment score) - the standard metric for evaluating dependency parsers
UUAS (undirected unlabeled attachment score) - edge direction is disregarded (it is not a mistake if governor and dependent are switched)
NED (neutral edge direction; Schwartz et al., 2011) - treats not only a node's gold parent and child as the correct answer, but also its gold grandparent
UAS ≤ UUAS ≤ NED
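The first two measures can be made concrete in a few lines; a sketch assuming `gold` and `pred` are lists of parent indices for nodes 1..n, with 0 denoting ROOT (NED is omitted, since it also needs the gold child and grandparent relations):

```python
def uas(gold, pred):
    """Unlabeled attachment score: fraction of nodes with the correct parent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def uuas(gold, pred):
    """Undirected UAS: an edge counts as correct even if governor and
    dependent are switched."""
    gold_edges = {frozenset((i, g)) for i, g in enumerate(gold, start=1)}
    pred_edges = {frozenset((i, p)) for i, p in enumerate(pred, start=1)}
    return len(gold_edges & pred_edges) / len(gold_edges)
```

For a two-word sentence whose predicted tree reverses the gold chain (gold = [0, 1], pred = [2, 0]), UAS is 0.0 but UUAS is 0.5, illustrating why UAS never exceeds UUAS.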

12 Evaluation on Czech
Czech dependency treebank from the CoNLL 2007 shared task; punctuation removed; sentences of at most 15 words.

Configuration                                UAS    UUAS   NED
Random baseline                              12.0   19.9   27.5
LeftChain baseline                           30.2   53.6   67.2
RightChain baseline                          25.5   52.0   60.6
Base                                         36.7   50.1   55.1
Base+Treeness                                36.2   46.6   50.0
Base+Treeness+RootFert                       41.2   58.6   70.8
Base+Treeness+RootFert+NounRootRepression    49.8   62.6   73.0

13 Error analysis for Czech
Many errors are caused by reversed dependencies:
preposition - noun
subordinating conjunction - verb

14 Evaluation on 19 CoNLL languages
We took the dependency treebanks from the CoNLL 2006 and 2007 shared tasks.
POS tags from the fifth column were used.
The parser was run on the concatenated training and development sets.
Punctuation was removed.
Evaluation was done on the development sets only.
We compare our results with the state-of-the-art system based on DMV (Spitkovsky et al., 2011).

15 Evaluation on 19 CoNLL languages

16 Conclusions
We introduced a new approach to unsupervised dependency parsing.
Even though only a couple of experiments have been done so far, and only POS tags with no lexicalization are used, the results seem competitive with state-of-the-art unsupervised parsers (DMV).
We have a better UAS for 12 languages out of 19.
If we do not use noun-root dependency repression, which is useful only with supervised POS tags, we have better scores for 7 languages out of 19.

17 Future work
We would like to add:
Word fertility model - to model the number of children of each node
Lexicalization - the word forms themselves should be useful
Unsupervised POS tagging - some recent experiments show that using word classes instead of supervised POS tags can improve parsing accuracy

18 Thank you for your attention.

