A Statistical Model for Parsing Czech


A Statistical Model for Parsing Czech
Daniel Zeman, 19.8.1998
http://www.ms.mff.cuni.cz/~zeman/

Basic Idea
Read a manually annotated corpus — the treebank. Count the number of times each particular dependency was seen.
p(edge([ve], [dveřích])) = p1
p(edge([v], [dveřích])) = p2
p(edge([ve], [dveře])) = p3
p(edge([ve], [dveřím])) = p4
where likely p1 > p2 > p3 and p4 ≈ 0 (p is a relative frequency rather than a probability).
[Figure: corpus of manually annotated texts]
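The counting step can be sketched in a few lines; representing an edge as a (governor, dependent) word pair and the helper name are illustrative assumptions, not the talk's actual code.

```python
from collections import Counter

def dependency_frequencies(treebank_edges):
    """Relative frequency of each (governor, dependent) pair in the treebank."""
    counts = Counter(treebank_edges)
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}

# Tiny made-up treebank echoing the slide's example
edges = [("ve", "dveřích"), ("ve", "dveřích"), ("v", "dveřích"), ("ve", "dveře")]
p = dependency_frequencies(edges)
# Here p[("ve", "dveřích")] comes out highest, matching p1 > p2 > p3 on the slide.
```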

Building the tree
Simplification: a stack of the N best trees.
In each step, for each tree on the stack, take the M best edges that can be added and create M×N new trees.
Prune the new trees: keep the N best of them.
Repeat the above until all words are added and the tree is complete.
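The steps above amount to a beam search; here is a minimal sketch, where `candidate_edges` (yielding scored edges addable to a partial tree) and `add_edge` are hypothetical helpers, and the toy scoring in the demo is invented for illustration.

```python
import heapq

def build_tree(words, candidate_edges, add_edge, N=5, M=5):
    """Keep a stack of the N best partial trees; expand each with its M best edges."""
    stack = [((), 0.0)]                      # (tree so far, cumulative score)
    for _ in range(len(words) - 1):          # one new edge per remaining word
        expanded = []
        for tree, score in stack:
            for edge_score, edge in heapq.nlargest(M, candidate_edges(tree)):
                expanded.append((add_edge(tree, edge), score + edge_score))
        # prune: keep only the N best of the up-to-M*N new trees
        stack = heapq.nlargest(N, expanded, key=lambda item: item[1])
    return stack[0][0]

# Toy demo: edges headed by "a" score higher, so "b" and "c" attach under "a".
words = ["a", "b", "c"]

def candidate_edges(tree):
    attached = {dep for (_, dep) in tree}
    return [(1.0 if gov == "a" else 0.5, (gov, dep))
            for dep in words[1:] if dep not in attached
            for gov in words if gov != dep]

def add_edge(tree, edge):
    return tree + (edge,)

best = build_tree(words, candidate_edges, add_edge, N=3, M=3)
```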

What is a word?
Ambiguity examples: stan = tent; stanout = stand still, stop; stát = become; volební = election (adjective).
Number of tags: potentially 3000 (i.e. potentially 3.19×10⁶ different edges). In the workshop final training data: 1000; after reduction, 400.
Number of lemmas (dictionary headwords): Hajič's electronic dictionary contains approx. 150,000 before derivation (228,000 when distinguishing semantic differences in homonyms) and almost 700,000 after derivation. Poldauf's Velký česko-anglický slovník contains 68,000 (i.e. 4.6×10⁹ different edges). Siebenschein's two-volume Česko-německý slovník contains 80,000 (6.4×10⁹ edges) — but how many of these are idiomatic phrases?
Number of word forms: Hajič's dictionary covers roughly 20,000,000!
Highest known number of tags per one word: 108.

Ambiguous tags from the dictionary
[solí, NFS7A|NFP2A|VPX3A]
[bílou, AFS41A|AFS71A]
Training:
We don't know which lemma is the right one, so we take tags from all possible lemmas (avoiding duplicates).
We don't know which tag combination (dependency) is the right one, so we increment the counters of all possible combinations.
All combinations together form an occurrence of just one dependency — so each counter is incremented only by the combination's share of the occurrence (here 1/6).
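The fractional update can be sketched as follows; the function name is hypothetical, and the tag lists come from the slide's example (3 × 2 = 6 combinations, so each counter receives 1/6):

```python
from collections import defaultdict
from itertools import product

counts = defaultdict(float)

def count_ambiguous_dependency(governor_tags, dependent_tags):
    """One observed dependency with ambiguous tags on both words:
    every tag combination shares the single occurrence equally."""
    combos = list(product(governor_tags, dependent_tags))
    share = 1.0 / len(combos)
    for combo in combos:
        counts[combo] += share

# "solí" has 3 possible tags, "bílou" has 2 -> 6 combinations, 1/6 each
count_ambiguous_dependency(["NFS7A", "NFP2A", "VPX3A"], ["AFS41A", "AFS71A"])
```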

Constraints and Improvements
The dependencies cannot “cross”.
The tag set can be reduced. (Not all information is interesting for parsing, so tags can be merged.) Originally the treebank contained over 1200 different tags; after reduction, only over 400.
Additional model for valency: how likely is it that a word (a tag) has a particular number of child nodes?
Adjacency: for a dependency X→Y, separate counters are kept for adjacent and non-adjacent words.
Direction: separate counters for the case that X is to the left of Y and for the case that X is to the right of Y.
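The non-crossing (projectivity) constraint can be enforced edge by edge during the beam search; this sketch assumes words are identified by their positions in the sentence, which is an illustrative choice, not the talk's implementation.

```python
def edges_cross(e1, e2):
    """Two edges over word positions cross iff exactly one endpoint
    of one edge lies strictly inside the span of the other."""
    (a, b), (c, d) = sorted(e1), sorted(e2)
    return a < c < b < d or c < a < d < b

def can_add(tree_edges, new_edge):
    """A candidate edge is admissible only if it crosses no existing edge."""
    return not any(edges_cross(e, new_edge) for e in tree_edges)

can_add([(1, 4)], (2, 6))   # False: spans 1-4 and 2-6 cross
can_add([(1, 4)], (2, 3))   # True: 2-3 is nested inside 1-4
```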

Crossing dependency example
1.8% of dependencies cross
70% of trees have no crossing dependency
90% have one or zero
98% have two or fewer
…, protože doba přenosu více závisí na stavu telefonní linky než na rychlosti přístroje
(…, because the transmission time depends more on the state of the telephone line than on the speed of the device)

Tag reduction examples
Categories to consider: part of speech, detailed part of speech, gender, number, case, gender of possessor, number of possessor, person, tense, grade, negation, voice, var.
Examples of reduction:
negation does not matter
degree of comparison does not matter
detailed part of speech often does not matter (classes of adjectives, pronouns, numerals etc.) → some merged, others left intact
some pronouns and numerals are treated as adjectives or nouns
some numerals are treated as adverbs (e.g. “five times”)
present, future, and imperative forms of verbs merged together
vocalization of prepositions removed
punctuation split into classes

Valency examples
R4 (preposition with accusative, e.g. “na”)
ZSB (sentence root)
PRCX3 (reflexive pronoun “se”)
VPP1A (verb, present tense, plural, 1st person, e.g. “jedeme”)

Adjacency and edge direction examples
Prepositions are usually adjacent to nouns, or to adjectives that are part of a noun phrase.
Adjectives are adjacent to nouns.
Verbs are NOT adjacent to most of their modifiers.
Final punctuation is NOT adjacent to the root.
The root and all prepositions take their modifiers from the right.
Nouns are modified by adjectives from the left, and by other nouns in the genitive case from the right.
Numerals find the counted entity on the right.
Of course, for many other dependencies both cases are equally likely, so this distinction does not help there.

Different sources of tags and lemmas
from the dictionary (ambiguous)
manually assigned (not ambiguous; the “truth”; not available for testing)
automatically assigned by a tagger or a lemmatizer (not ambiguous; around 92% accurate tags and 98% accurate lemmas)

Results with tags from different sources

Lexicalization
A new model (table, register): dependencies are couples of words rather than tags. Words are either lemmas (dictionary headwords) or word forms.
Stay with automatically disambiguated data for both training and parsing. Here the ambiguity is not as high as with tags; the lemmatizer can achieve 98% accuracy.

Lexicalization: lemmas vs. word forms
Lemmas are slightly ambiguous; lemmatizer accuracy is about 98%. Word forms are not ambiguous at all.
Forms are sparser (700K possible lemmas, 20M possible forms).
Lemma + tag = form, so tags could be used for backing off from forms. Conversely, if lemmas are used, they should always be combined with tags.
Investigation: how much sparser are forms than lemmas, and lemmas than tags?

13481 sentences, 230450 words

13481 sentences, 230450 words
In fact, the number of forms is not 58348 but 44868, because the 13481 sentence headings such as “#53” should be collapsed to one “#”.

How to combine lemmas and tags
Tags are important — not just a back-off.
Estimation of the weight from held-out data: postponed to future work. For now, a manual estimate was used.
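One plausible reading of this manual weighting is a linear interpolation of the two models; the function and the weight value below are illustrative assumptions, not the exact formula from the talk.

```python
def combined_score(lemma_prob, tag_prob, w=0.5):
    """Interpolate the lexical (lemma) model with the tag model.
    w is the manually chosen lemma weight (hypothetical value)."""
    return w * lemma_prob + (1.0 - w) * tag_prob

# Even when the lemma model assigns zero to an unseen pair,
# the tag model still contributes probability mass.
combined_score(0.0, 0.02, w=0.6)
```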

Lemma weight influence

Minor improvements
Do not trust low counts: dependencies seen five times or fewer are considered unknown. (The threshold five was found experimentally.) Unknown lexical dependencies cannot be treated as impossible: their share of the whole probability does not remain zero; it is donated to the tag dependency instead. → 54%
Broader search beam when building the tree: N = 50 instead of 5. → 55%
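A simplified sketch of the count threshold, assuming the back-off simply substitutes the tag-model probability when the lexical count is too low (the slide's mass-donation scheme is more refined than this):

```python
THRESHOLD = 5  # dependencies seen this many times or fewer are "unknown"

def lexical_prob(count, total, tag_prob):
    """Relative frequency of a lexical dependency, backing off to the
    tag-dependency model when the count is not trusted."""
    if count <= THRESHOLD:
        return tag_prob
    return count / total
```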

Summary of results