Treebank Troubles Eckhard Bick Southern Denmark University

Slides:



Advertisements
Similar presentations
CODE/ CODE SWITCHING.
Advertisements

Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007) Learning for Semantic Parsing Advisor: Hsin-His.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
For Monday Read Chapter 23, sections 3-4 Homework –Chapter 23, exercises 1, 6, 14, 19 –Do them in order. Do NOT read ahead.
What is a corpus?* A corpus is defined in terms of  form  purpose The word corpus is used to describe a collection of examples of language collected.
NLP and Speech Course Review. Morphological Analyzer Lexicon Part-of-Speech (POS) Tagging Grammar Rules Parser thethe – determiner Det NP → Det.
Probabilistic Parsing: Enhancements Ling 571 Deep Processing Techniques for NLP January 26, 2011.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
Features and Unification
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
Workshop on Treebanks, Rochester NY, April 26, 2007 The Penn Treebank: Lessons Learned and Current Methodology Ann Bies Linguistic Data Consortium, University.
Sanjukta Ghosh Department of Linguistics Banaras Hindu University.
Automatically Constructing a Dictionary for Information Extraction Tasks Ellen Riloff Proceedings of the 11 th National Conference on Artificial Intelligence,
Probabilistic Parsing Ling 571 Fei Xia Week 5: 10/25-10/27/05.
1 Basic Parsing with Context Free Grammars Chapter 13 September/October 2012 Lecture 6.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
CASE Tools And Their Effect On Software Quality Peter Geddis – pxg07u.
1 LOMGen: A Learning Object Metadata Generator Applied to Computer Science Terminology A. Singh, H. Boley, V.C. Bhavsar National Research Council and University.
9/8/20151 Natural Language Processing Lecture Notes 1.
UAM CorpusTool: An Overview Debopam Das Discourse Research Group Department of Linguistics Simon Fraser University Feb 5, 2014.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
Computational Linguistics Yoad Winter *General overview *Examples: Transducers; Stanford Parser; Google Translate; Word-Sense Disambiguation * Finite State.
For Friday Finish chapter 23 Homework: –Chapter 22, exercise 9.
1st Workshop on Natural Language Processing and Human Language Technologies Universidade do Algarve, Faro, Portugal June 16-17, 2010
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
Learner corpus analysis and error annotation Xiaofei Lu CALPER 2010 Summer Workshop July 13, 2010.
Syntactically annotated corpora of Estonian Heli Uibo Institute of Computer Science University of Tartu
Experiments on Building Language Resources for Multi-Modal Dialogue Systems Goals identification of a methodology for adapting linguistic resources for.
Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi and introduced in Tree-adjoining grammars are somewhat similar to context-free.
What it’s ? “parsing” Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages,
Intro to Lexing & Parsing CS 153. Two pieces conceptually: – Recognizing syntactically valid phrases. – Extracting semantic content from the syntax. E.g.,
XML A web enabled data description language 4/22/2001 By Mark Lawson & Edward Ryan L’Herault.
TALC Applying some Developments in Corpus Building Technology to Language Teaching and Learning TALC 2006 Paris.
7. Parsing in functional unification grammar Han gi-deuc.
An ICALL writing support system tunable to varying levels of learner initiative Karin Harbusch 1 & Gerard Kempen 2,3 1 University of Koblenz-Landau, Koblenz,
인공지능 연구실 황명진 FSNLP Introduction. 2 The beginning Linguistic science 의 4 부분 –Cognitive side of how human acquire, produce, and understand.
1 CSI 5180: Topics in AI: Natural Language Processing, A Statistical Approach Instructor: Nathalie Japkowicz Objectives of.
NLP ? Natural Language is one of fundamental aspects of human behaviors. One of the final aim of human-computer communication. Provide easy interaction.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
CPE 480 Natural Language Processing Lecture 4: Syntax Adapted from Owen Rambow’s slides for CSc Fall 2006.
CSA2050 Introduction to Computational Linguistics Parsing I.
MedKAT Medical Knowledge Analysis Tool December 2009.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Supertagging CMSC Natural Language Processing January 31, 2006.
Automatic Grammar Induction and Parsing Free Text - Eric Brill Thur. POSTECH Dept. of Computer Science 심 준 혁.
CS 4705 Lecture 7 Parsing with Context-Free Grammars.
Natural Language Processing Lecture 15—10/15/2015 Jim Martin.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
UI's for inputting and presenting the metadata of hypermedia documents Kai Kuikkaniemi HUT T
Natural Language Processing Lecture 14—10/13/2015 Jim Martin.
Parsing & Language Acquisition: Parsing Child Language Data CSMC Natural Language Processing February 7, 2006.
Chapter – 8 Software Tools.
JavaScript Introduction and Background. 2 Web languages Three formal languages HTML JavaScript CSS Three different tasks Document description Client-side.
Trouble? Can’t type: F11 Can’t hear & speakers okay or can’t see slide? Cntrl R or Go out & come back in 1 Sridhar Rajappan.
INTRODUCTION TO THE WIDA FRAMEWORK Presenter Affiliation Date.
The PALAVRAS parser and its Linguateca applications - a mutually productive relationship Eckhard Bick University of Southern Denmark
Natural Language Processing Vasile Rus
Human Computer Interaction Lecture 21 User Support
Search Engine Architecture
Introduction to Parsing (adapted from CS 164 at Berkeley)
Basic Parsing with Context Free Grammars Chapter 13
Improving a Pipeline Architecture for Shallow Discourse Parsing
Structural relations Carnie 2013, chapter 4 Kofi K. Saah.
CSCI 5832 Natural Language Processing
An ICALL writing support system tunable to varying levels
Using GOLD to Tracking L2 Development
Presentation transcript:

Treebank Troubles Eckhard Bick Southern Denmark University

Science or fiction: A treebank for everybody? Treebank uses Descriptive linguistics NLP Teaching Theory implementation Generative grammar Dependency gramm. Constraint grammar Descriptive research Diachronic synchronic speech, genre Statistics Parser development Coverage robustness documentation Parser evaluation Qualitative quantitative versatility/ compatibility Interaction testing Measuring

1. Theory & Teaching Floresta Sintá(c)tica in parallel theory dependent formats: Constraint Grammar, Constituent Grammar, Dependency Grammar (the first 2 are implemented, a prolog program is being writen at VISL to create the latter from the CG-version) Create filter programs for different formalisms within one super-family of theories, e.g. for generative constituent grammar: create word nodes for 1-constituent groups create S:np PRED:vp daughters for finite clauses create AUX constituents replace ==indentation with tabs and brackets etc. Use graphical front-ends for teaching, with user-driven interactive formatting Document both "filterable" and "un-filterable" theory-clashes Offer simplifying filters and user driven long-forms for the tags used

2. Descriptive Linguistics Adapt search tool to user needs (Águia) and/or filter format to other existing tools (XML-tools, Tiger …) e.g. qualitative/quantitative, conditioned multiple searches Mark points of special interest explicitly in the treebank (now e.g. ellipsis, errors, averbal constructions etc.) Problems: What are people's special interests? How to reach the linguists? Balance the data in terms of genre (now only news texts): speech data, dialectal data, fiction and science data …. Balance the data in terms of language variety (Lusitan - Brazilian) Add section for historical Portuguese?

An example STA:cu CJT:fcl =SUBJ:np ==>N:art(’o’ M P) Os ==>N:num(’quatro’ M P) quatro ==>N:adj(’primeiro’ M P) primeiros ==H:n(’tema’ M P) temas =P:v-fin(’destinar’ PR 3P IND) destinam- =ACC:pron-pers(’se’ M 3P ACC) se =PIV:ppa mostrar o papel de Portugal em o mundo CO:conj-c(’e’ ) e CJT:fcl =SUBJ:np ==>N:art(’o’ M S) o ==H:adj(’quinto’ M S) quinto =P:vpé justificado =PASS:pppor a experiência de Port-Aventura (Barcelona)

3. Parser development Sparse data problem Increase "Mata virgem", using a simplified (thus safer) tag set Revise manually only "crucial" tags (cave: parser dependent?) For training of probabilistic / automated learning systems: Fuse tag strings into units (a la CLAWS) Simplify tag set, e.g. as in VISL-lite (only one type of group, only 2 types of group constituents, H[ead] and D[ependent]) Compatibility problem Create user-specific tags from implicit information (e.g. N[oun] from H:adj in noun phrase, VT[ransitive] from Create user-specific layout (jf. 2.)

4. Evaluation & measurability Tag set incompatibility Compile set of core categories (PoS, syntactic function, attachment link) Build set of "unifier" programs to handle tag synonyms Simplify tags to the smallest common denominator (e.g. free adverbials, object adverbials, adverbial predicates, prepositional objects all as ADVL, or N<PRED/APP into D) Translate implicit information (e.g. on the mother) into explicit tags Researcher incompatibility (you know what I mean …) Prior to joint evaluation, cross-revision of to-be-used Floresta-chunks by members of all participating research groups Inter-annotator disagreement Identify "soft categories", with high inter-annotator disagreement, then a) either fuse them into neighbouring "hard categories", or b) ignore them in the evaluation

A call to arms The Floresta Sintá(c)tica may not be the perfect ressource, but it IS a ressource, possibly the only one of its kind for Portuguese (in terms of information richness) Therefore, let's make the most of it If it can't immediately be used for a given purpose, let's create filters and other solutions as discussed before, rather than not use it In order to make the Floresta more palatable to new users, allow them a say in - which data or genre will be used - which information will be incorporated - which tag set or formalism the treebank will be filtered into A sparse data problem is bad, but a sparce linguist problem is worse … So, let's invite everybody to contribute with data, categories & revision