The Prague Dependency Treebank (PDT): Introduction. Jan Hajič. Companions Semantic Representation and Dialog Interfacing Workshop, March 5, 2008.

Presentation transcript:

Slide 1: The Prague Dependency Treebank (PDT): Introduction
Jan Hajič
Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Companions Semantic Representation and Dialog Interfacing Workshop, March 5, 2008

Slide 2: Prague Dependency Treebank Introduction
- Theoretical background
- Introduction to the Prague Dependency Treebank family of projects
  - PDT (Czech)
  - PADT (Arabic)
  - PCEDT (parallel Czech-English)
  - PEDT (English)
- Markup
- Linking scheme
- Tokenization, sentence boundaries

Slide 3: Theoretical background
- Functional Generative Description
  - From the 1960s
  - “Basic book”: Sgall, Hajičová, Panevová: The Meaning of the Sentence in Its Semantic and Pragmatic Aspects, ed. by J. L. Mey, Dordrecht: Reidel, 1986
  - Core language system description
  - Representation suggested
  - Used in rule-based parsers of Czech and English
- Terminology: FGP => PDT

Slide 4: The Prague Dependency Treebank Project (Czech Treebank)
- PDT v. 0.5 released (JHU workshop)
  - 400k words annotated, unchecked
- 2001: PDT 1.0 released (LDC)
  - 1.3MW annotated, morphology & surface syntax
- 2006: PDT 2.0 released (LDC)
  - 0.8MW annotated (55k sentences)
  - the “tectogrammatical layer” -> underlying (deep) syntax

Slide 5: Related Projects (Treebanks)
- Prague Czech-English Dependency Treebank
  - WSJ portion of PTB, translated to Czech
  - automatically analyzed
  - English side (PTB), too -> Prague English Dep. Treebank
- Prague Arabic Dependency Treebank
  - apply same representation to annotation of Arabic
  - surface syntax so far, tectogrammatical in prep. (2008-9)
- Both were published in 2004 (LDC)
- Czech Academic Corpus v. 2.0 (2008)
  - Conversion of 70s-style annotation to PDT style (0.5 mil.)

Slide 6: PDT for Companions
- Scheme extensions (all layers), but specifically:
  - “lower” side: new layer
    - Speech Reconstruction and Transcription
  - “upper” side: the Tectogrammatical layer
    - Extended coreference, dialog representation
- Czech
  - Spoken data, dialog annotation
- English
  - Representation for English
  - Written, spoken, dialog data annotation

Slide 7: The Annotation Scheme
- XML + principles of linear- and tree-based standoff annotation -> PML (Prague Markup Language)
- Layer schemes (Relax NG)
  - PDT/PADT/PEDT: t(ecto), a(nalytic), m(orphology), …
  - English: + phrase-based (p-layer)

Slide 8: PML/XML Annotation Layers
- Strictly top-down links
- w+m+a can be easily “knitted”
- API for cross-layer access (programming)
- PML Schema / Relax NG
- [z and audio layers: used for spoken data (audio as layer “-1”)]
- LFG analogy: f-struct Φ c-struct
[Figure: layer diagram; surviving labels: z-layer, audio; example tokens: BYL BYS ČELO LESA …]
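The idea of strictly top-down links and of "knitting" w+m+a can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, the `knit` function, the `afun` value and all IDs are hypothetical and are not the PML API; only the form, lemma and tag values are borrowed from the m-layer example in this talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WToken:
    """w-layer unit: a raw token (hypothetical structure)."""
    id: str
    token: str


@dataclass
class MNode:
    """m-layer unit: morphology; points down to exactly one w-layer token."""
    id: str
    w_rf: str
    lemma: str
    tag: str


@dataclass
class ANode:
    """a-layer unit: surface syntax; points down to one m-layer node."""
    id: str
    m_rf: str
    afun: str
    parent: Optional[str]


def knit(a_nodes, m_nodes, w_tokens):
    """Follow the top-down links a -> m -> w and merge each chain into one record."""
    m_by_id = {m.id: m for m in m_nodes}
    w_by_id = {w.id: w for w in w_tokens}
    knitted = []
    for a in a_nodes:
        m = m_by_id[a.m_rf]   # a-layer refers to m-layer
        w = w_by_id[m.w_rf]   # m-layer refers to w-layer
        knitted.append({
            "a_id": a.id, "afun": a.afun, "parent": a.parent,
            "lemma": m.lemma, "tag": m.tag, "token": w.token,
        })
    return knitted


# Hypothetical IDs and afun value; form, lemma and tag come from the m-layer example.
w_layer = [WToken("w-s1w4", "pocházela")]
m_layer = [MNode("m-s1w4", "w-s1w4", "pocházet_:T", "VpQW---XR-AA")]
a_layer = [ANode("a-s1w4", "m-s1w4", "Pred", None)]
rows = knit(a_layer, m_layer, w_layer)
```

Because links only point downward, each lower layer can be stored and processed without knowing anything about the layers above it.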

Slide 9: The Prague Markup Language
Example m-layer data, linked to the w-layer (element values only, in document order):
manual
w#w-tr/_12941_01_00013.fs-s1w4
basic
pocházela
pocházet_:T
VpQW---XR-AA
Callouts on the slide: Pointer to w-layer; Element; m-layer ID
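To make the linking concrete, here is a minimal Python sketch that reads an m-layer-like XML fragment and resolves its pointer to the w-layer. The element names (`m`, `src.rf`, `w.rf`, `form`, `lemma`, `tag`) and the `id` value are assumptions made for illustration, since the tags of the slide's example are not available here; the value "basic" is left out because the element it belongs to is unknown. The helper names `read_m_node` and `split_ref` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed markup: element names and the id are illustrative, not taken from the slide.
M_FRAGMENT = """
<m id="m-example-s1w4">
  <src.rf>manual</src.rf>
  <w.rf>w#w-tr/_12941_01_00013.fs-s1w4</w.rf>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA</tag>
</m>
"""


def read_m_node(xml_text):
    """Return the m-layer unit as a flat dict of child element name -> text."""
    elem = ET.fromstring(xml_text)
    record = {"id": elem.get("id")}
    for child in elem:
        record[child.tag] = (child.text or "").strip()
    return record


def split_ref(ref):
    """Split a standoff reference 'layer#id' into (layer alias, target ID)."""
    layer, _, target = ref.partition("#")
    return layer, target


node = read_m_node(M_FRAGMENT)
layer, target = split_ref(node["w.rf"])
```

The reference value has two parts: a short alias naming the referenced layer file (`w`) and the ID of the unit inside it, which is what makes the annotation standoff rather than inline.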

Slide 10: XML Markup example (t-layer)
Example t-layer data (element values only, in document order):
a#EnglishA-wsj_0173-s1-t9
a#EnglishA-wsj_0173-s1-t5
a#EnglishA-wsj_0173-s1-t24
maker ACT 1 8
a#EnglishA-wsj_0173-s1-t8
toy PAT 7

Slide 11: Tokenization, Segmentation, Sentence Breaks (w-layer)
Basic Principles
- Fully automatic
  - Will have to be the same for the manually annotated part as well as for other plain-text data
  - Comes from ASR/transcripts for speech data
- No access to any linguistic knowledge
  - …beyond, say, really fail-safe lists of certain types of abbreviations, language identification, coding scheme, and letter classification (upper/lower/…)
- Standard output markup
  - unified coding scheme (Unicode/UTF-8)

Slide 12: Tokenization: Words
- What is a word? (word boundaries)
- Treatment of hyphens, apostrophes, periods, …
- Numbers w/digits (normalization)
  - “periods”, thousand separators
  - Types of numbers (?)
    - cardinal, ordinal, money, SSN, tel/fax/…, dates, …
- Mixed letters and digits
- Rule of thumb: Split whenever there is the slightest doubt!
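The rule of thumb can be illustrated with a deliberately aggressive tokenizer. This is a sketch, not the PDT tokenizer: the function name `tokenize` and the regular expression are assumptions. It keeps runs of letters together, keeps runs of digits together, and makes every other non-space character a token of its own, so hyphens, apostrophes, periods and thousand separators always split.

```python
import re

# Letters ([^\W\d_] = word characters minus digits and underscore), or digits,
# or any single remaining non-space character.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\S")


def tokenize(text):
    """Split whenever in doubt; return (token, start offset) pairs."""
    return [(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]


tokens = [tok for tok, _ in tokenize("mid-2008: 1,234.50 (A4)")]
# tokens == ['mid', '-', '2008', ':', '1', ',', '234', '.', '50', '(', 'A', '4', ')']
```

Keeping the start offsets means later layers can decide to re-join pieces (for example a decimal number) without the w-layer needing any linguistic knowledge.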

Slide 13: Tokenization: Capitalization
- Main issues (the “true case”):
  - Names (not identified yet!)
  - Start of sentence (don’t know it yet either!)
  - Typographical conventions (unmarked in most cases)
- Nontrivial
  - Headings
- Rule of thumb: don’t solve it (yet), just keep it & possibly mark it
- Not in speech data (on w-layer)
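"Keep it and possibly mark it" can be sketched as a function that leaves the token untouched and only attaches a label describing its letter case. The label set and the function name `case_mark` are assumptions for illustration; nothing here tries to recover the true case.

```python
def case_mark(token):
    """Describe the capitalization pattern of a token without changing it."""
    # Only characters that actually have a case take part in the decision.
    cased = [c for c in token if c.isupper() or c.islower()]
    if not cased:
        return "none"          # digits, punctuation, caseless scripts
    if all(c.islower() for c in cased):
        return "lower"
    if all(c.isupper() for c in cased):
        return "upper"
    if cased[0].isupper() and all(c.islower() for c in cased[1:]):
        return "capitalized"
    return "mixed"


marked = [(tok, case_mark(tok)) for tok in ["Prague", "PDT", "treebank", "eScience", "2008"]]
```

Whether a capitalized token is a name, a sentence start, or part of a heading is left to later processing, which is the point of the rule of thumb.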

Slide 14: Sentence Boundaries
- Chicken and egg problem:
  - To analyze a text linguistically, we need to know sentence boundaries… but…
  - To know sentence boundaries, we would need to have the text linguistically analyzed.
- Solution:
  - Do something good enough in most cases
  - …maybe redo it later in the manually annotated part
- For spoken data:
  - Training from reconstructed speech
  - Can use speaker info, silence, etc.
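A "good enough in most cases" splitter for written text might look like the following sketch. It uses only the kind of knowledge the w-layer is allowed: a short abbreviation list and letter classification. The abbreviation list, the function name `split_sentences` and the exact rule are assumptions for illustration, and the heuristic will make mistakes (for example after an abbreviation that really does end a sentence).

```python
# Hypothetical "fail-safe" abbreviation list; a real one would be language-specific.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Prof"}
TERMINATORS = {".", "!", "?"}


def split_sentences(tokens):
    """Break after . ! ? unless the period follows a known abbreviation
    or the next token does not start with an uppercase letter."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok not in TERMINATORS:
            continue
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
        after_abbreviation = tok == "." and prev_tok in ABBREVIATIONS
        next_looks_like_start = next_tok is None or next_tok[:1].isupper()
        if not after_abbreviation and next_looks_like_start:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)   # trailing material without a terminator
    return sentences


example = ["Dr", ".", "Hajič", "spoke", ".", "It", "was", "clear", "."]
result = split_sentences(example)
```

For spoken data the same interface could be kept while the decision is driven by other cues, such as speaker changes or silence, as the slide suggests.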