Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank (3rd PIRE Meeting, June 6, 2007)

Presentation transcript:

Slide 1 (3rd PIRE Meeting, June 6, 2007): Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank
Lucie Mladová
Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský

Slide 2: Outline
● Functional Generative Description
● Parallel Treebanks
● PCEDT 2.0: Project Report
  - tectogrammatical level of annotation
  - valency treatment
  - annotation manual for English
  - interannotator agreement

Slide 3: Functional Generative Description
● Basic approach for the Prague treebanks
  - dependency
  - stratificational description of the language
● From structure to function (meaning), 3 layers of annotation:
  - morphological
  - analytical (= surface syntax)
  - tectogrammatical (= "deep" syntax, semantics)

Slide 4: Functional Generative Description
● Since 1995: Prague Dependency Treebank (PDT), Czech data (1.0 released by LDC in 2001, 2.0 by LDC in 2006)
● The idea of a parallel corpus: English data, Czech data (translated): Prague Czech-English Dependency Treebank (PCEDT) (1.0 released by LDC in 2004)

Slide 5: The Idea of a Parallel, Syntactically Annotated Corpus
● Build an English corpus in the same formalism as PDT (data resource: Wall Street Journal section of the Penn Treebank)
● Translate it into Czech
● Annotate both parts of the corpus manually
● Train tectogrammar-based machine translation

Slide 6: Phrasal vs. Dependency Tree
Example sentence: Mr. Payson, an art dealer and collector, sold Vincent van Gogh's "Irises" at a Sotheby's auction in November 1987 to Australian businessman Alan Bond.

Slide 7: Dependency Trees
● a-layer = surface syntax
● t-layer = underlying syntax, semantics
Example sentence: It may have been painted instead by a Rubens associate.

Slide 8: Dependency Trees
● a-layer = surface syntax
● t-layer = underlying syntax, semantics
Example sentence: It may have been painted instead of Rubens by a Rubens associate.

Slide 9: Tectogrammatical Representation (t-tree)
Contains:
● syntactic dependency and coordination: edges
● semantic relations: tectogrammatical functors
  - verb arguments (inner participants)
    - semantic: ACT, PAT
    - syntactic: ADDR, ORIG, EFF
  - free modifications (e.g. TWHEN, LOC, DIR, MANN, CAUS, CPR, ACMP...)
  - other: rhematizers, idiomatic expressions, foreign phrases...
● valency of the verbs: valency lexicon EngValLex

Slide 10: Tectogrammatical Representation (t-tree)
Contains:
● links to the lower layers
● grammatical (and textual) coreference
● topic-focus articulation
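
The items on slides 9 and 10 can be pictured as a small data structure. The following is only a minimal sketch, not the PCEDT file format: the class `TNode`, its field names, and the simplified sentence "It was painted by an associate." (cut down from the slide 7 example) are assumptions made for illustration. Each t-node carries a functor and links to the surface (a-layer) tokens it covers; auxiliaries and prepositions get no t-node of their own.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class TNode:
    """Hypothetical t-tree node: a lemma, a functor, and links to a-layer tokens."""
    t_lemma: str
    functor: str
    a_tokens: List[str] = field(default_factory=list)    # links to the lower layer
    children: List["TNode"] = field(default_factory=list)

    def add(self, child: "TNode") -> "TNode":
        """Attach a dependent node and return it."""
        self.children.append(child)
        return child


def functors(node: TNode) -> Iterator[Tuple[str, str]]:
    """Yield (t_lemma, functor) pairs in depth-first order."""
    yield node.t_lemma, node.functor
    for child in node.children:
        yield from functors(child)


def covered_tokens(node: TNode) -> List[str]:
    """Collect every a-layer token that some t-node links to."""
    tokens = list(node.a_tokens)
    for child in node.children:
        tokens.extend(covered_tokens(child))
    return tokens


# Illustrative analysis of "It was painted by an associate." (not treebank data).
root = TNode("paint", "PRED", a_tokens=["was", "painted"])
root.add(TNode("it", "PAT", a_tokens=["It"]))
root.add(TNode("associate", "ACT", a_tokens=["by", "an", "associate"]))

for lemma, functor in functors(root):
    print(f"{lemma}: {functor}")
```

In this sketch the passive auxiliary "was" and the preposition "by" survive only as links from content-word nodes, which is one way to read "t-layer = underlying syntax" against "a-layer = surface syntax".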

Slide 11: Building the PCEDT 2.0, the Current Annotation of the English Data
Work with the corpus data:
● input: WSJ texts (PTB), approx. [...] sentences (1.2 million words), automatically converted into a PDT-like shape (a-layer)
● automatic t-layer processing
● manual annotation running (approx. [...] trees annotated)
● meanwhile: annotation of the t-layer of the Czech section launched
Additional work:
● conversion of the PropBank lexicon into EngValLex (verbs only)
● tools adjustment (TrEd, unified macros for both Czech and English annotation)
● interannotator agreement measuring
● first version of the annotation manual (being revised)
● training of new annotators

Slide 12: EngValLex
● adaptation of PropBank into the format of PDT-Vallex (valency lexicon for Czech)
● manual correction
● continuous checking during the annotation
● current version contains only verbs
Future work on EngValLex:
● defining surface realizations: morphosyntactic characteristics of the semantic roles
● valency of nouns and adjectives
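
To make the idea of a valency lexicon entry concrete, here is a rough sketch of a frame as a list of functor slots, together with a check of the kind that "continuous checking during the annotation" might involve. Everything in it is assumed for illustration: the classes `Slot` and `Frame`, the frame identifier, and the obligatory/optional flags for "sell" are hypothetical and are not taken from EngValLex or PropBank.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Slot:
    """One valency slot: a functor plus an obligatoriness flag."""
    functor: str
    obligatory: bool


@dataclass(frozen=True)
class Frame:
    """Hypothetical valency frame for one verb sense."""
    lemma: str
    frame_id: str
    slots: Tuple[Slot, ...]


def missing_obligatory(frame: Frame, observed: Set[str]) -> List[str]:
    """Return obligatory functors of the frame that the annotated verb node lacks."""
    return [s.functor for s in frame.slots if s.obligatory and s.functor not in observed]


# Made-up entry; the slot inventory is an assumption, not lexicon data.
SELL = Frame(
    lemma="sell",
    frame_id="sell-hypothetical-1",
    slots=(Slot("ACT", True), Slot("PAT", True), Slot("ADDR", False)),
)

# Functors found on the dependents of a "sell" node in some annotated tree.
print(missing_obligatory(SELL, {"ACT", "ADDR"}))
```

A frame like this ties the verb-argument functors from slide 9 to a specific verb sense; the planned "surface realizations" would add morphosyntactic information to each slot.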

Slide 13: Annotation Manual
= "Annotation of English on the tectogrammatical level: Reference book"
● based on the abbreviated version of the annotation manual for PDT (Czech)
● chapters specific to English data annotation added
● first rough version 1.0.1: April 2007
● revision in progress
● extensions planned (concurrently with the annotation)

Slide 14: Interannotator Agreement
● monthly control of the annotation consistency
  - approx. 30 trees
● measured:
  - structure: agreement in parent node
  - functors
● further analysis:
  - list of unpaired nodes
  - statistics for diverging functors
  - elimination of detected annotation divergences at annotator meetings
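
The measurements listed on slide 14 could be computed roughly as follows. This is a sketch under stated assumptions, not the project's actual procedure: it assumes that nodes from two annotators are paired by a shared node identifier, that each annotation maps a node id to a `(parent_id, functor)` pair, and that agreement is the share of paired nodes on which both annotators coincide. The slides do not give the exact formula, and the sample data are invented.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Annotation = Dict[str, Tuple[Optional[str], str]]  # node id -> (parent id, functor)


def agreement(ann_a: Annotation, ann_b: Annotation) -> dict:
    """Compare two annotations of the same tree(s), node by node."""
    paired = sorted(set(ann_a) & set(ann_b))       # nodes both annotators have
    unpaired = sorted(set(ann_a) ^ set(ann_b))     # nodes only one annotator has
    same_parent = sum(ann_a[n][0] == ann_b[n][0] for n in paired)
    same_functor = sum(ann_a[n][1] == ann_b[n][1] for n in paired)
    # Statistics for diverging functors: which pairs of labels get confused.
    confusions = Counter(
        (ann_a[n][1], ann_b[n][1]) for n in paired if ann_a[n][1] != ann_b[n][1]
    )
    total = len(paired)
    return {
        "parent_agreement": same_parent / total if total else 0.0,
        "functor_agreement": same_functor / total if total else 0.0,
        "unpaired": unpaired,
        "confusions": dict(confusions),
    }


# Invented sample annotations of one small tree.
annotator_a = {
    "n1": (None, "PRED"), "n2": ("n1", "ACT"),
    "n3": ("n1", "PAT"), "n4": ("n3", "LOC"),
}
annotator_b = {
    "n1": (None, "PRED"), "n2": ("n1", "ACT"),
    "n3": ("n1", "EFF"), "n4": ("n1", "LOC"), "n5": ("n1", "TWHEN"),
}

result = agreement(annotator_a, annotator_b)
print(result)
```

The `unpaired` list and the `confusions` table correspond to the "list of unpaired nodes" and the "statistics for diverging functors" that the slide names as input for annotator meetings.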

Slide 15: Average Interannotator Agreement

Slide 16: Future Goals
● annotation expansion
  - 500 trees/annotator/month
  - increasing (or at least keeping) the interannotator agreement
  - training of new annotators
● EngValLex precision
● annotation manual precision and expansion

Slide 17: Acknowledgements
The work on the PCEDT project is supported by the grants PIRE ČR ME838 and GA405/06/0589.

Slide 18: Thank you for your attention!