Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.

Outline
● TiGer Treebank
● TiGer Search

The TiGer Treebank
● TIGER: LinguisTic Interpretation of a GERman Corpus
● Project partners: Institute of Natural Language Processing (IMS) in Stuttgart, Institut für Germanistik in Potsdam, Department of Computational Linguistics and Phonetics in Saarbrücken
● German treebanks: Verbmobil Corpus (only spoken language), NEGRA Corpus and Tuebingen Treebank (only 20,000 sentences)
● The need for a large and comprehensive German treebank:
  – Data for the testing and training of statistically based methods in natural language processing
  – Basis for empirical language research
● TIGER Corpus:
  – First release (mid 2003): 40,000 sentences of newspaper text (Frankfurter Rundschau, full articles)
  – Second release (X-mas 2005): 50,000 sentences
  – Together with 20,000 NEGRA sentences comparable to Penn Treebank in size (1.5 million words)

TiGer: Levels of annotation
● Annotation on word level: part-of-speech, morphology, lemmata
● Node labels: phrase categories (S, VP, NP, PP in the example)
● Edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK in the example)
● Crossing branches for discontinuous constituency types

Example sentence (tree diagram not reproduced): Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen.

word         tag      morph            lemma
Im           APPRART  Dat              in
nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
Jahr         NN       Dat.Pl.Neut      Jahr
will         VMFIN    3.Sg.Pres.Ind    wollen
die          ART      Nom.Sg.Fem       die
Regierung    NN       Nom.Sg.Fem       Regierung
ihre         PPOSAT   Acc.Pl.Masc      ihr
Reformpläne  NN       Acc.Pl.Masc      Plan
umsetzen     VVINF    Inf              umsetzen
.            $.

TiGer: Annotation method
● Interactive tagging and parsing
  – Tagging: TnT (97% reliable)
  – Parsing: Cascaded Markov Models (71% reliable)
  – Morphology: TigerMorph
● Independent annotation by 2 different annotators and comparison => consistency of corpus + improvement of annotation scheme
● Annotation time: 10 minutes per sentence

TiGer: Annotation formats

#BOS
%word         tag    morph            edge  parent
Ausgerechnet  ADJD   --               MO    502
Iggy          NE     Masc.Nom.Sg      PNC   500
Pop           NE     *.Nom.Sg         PNC   500
verkörpert    VVFIN  3.Sg.Pres.Ind    HD    503
gesanglich    ADJD   Pos              MO    503
den           ART    Def.Masc.Akk.Sg  NK    501
Staatsanwalt  NN     Masc.Akk.Sg.*    NK    501
.             $.
#500          MPN    --               NK    502
#501          NP     --               OA    503
#502          NP     --               SB    503
#503          S
#EOS 37

● Corpus annotation and storage on the basis of a MySQL database
● TIGER export format in a line-oriented and ASCII-based format
● Separate columns for words, part-of-speech tags, morphological information, edge labels and parent labels
● Encoded meta-information on date, source etc.
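The column layout described above (word, tag, morph, edge, parent) can be read with a few lines of Python. This is only a minimal sketch: it assumes whitespace-separated columns, and the names `Row` and `parse_sentence` are hypothetical helpers, not part of any TIGER tool. Rows whose trailing columns are missing in the slide (the punctuation row and the root row) are kept as they are and padded with `None`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sample sentence from the slide; whitespace-separated columns are an
# assumption of this sketch, not a statement about the official format.
SAMPLE = """\
#BOS
%word tag morph edge parent
Ausgerechnet ADJD -- MO 502
Iggy NE Masc.Nom.Sg PNC 500
Pop NE *.Nom.Sg PNC 500
verkörpert VVFIN 3.Sg.Pres.Ind HD 503
gesanglich ADJD Pos MO 503
den ART Def.Masc.Akk.Sg NK 501
Staatsanwalt NN Masc.Akk.Sg.* NK 501
. $.
#500 MPN -- NK 502
#501 NP -- OA 503
#502 NP -- SB 503
#503 S
#EOS 37
"""


@dataclass
class Row:
    form: str               # word form, or "#500"-style id for a nonterminal
    tag: str                # part-of-speech tag or phrase category
    morph: Optional[str]    # morphological information
    edge: Optional[str]     # edge label (syntactic function)
    parent: Optional[int]   # id of the parent node

    @property
    def is_nonterminal(self) -> bool:
        return self.form.startswith("#")


def parse_sentence(text: str) -> List[Row]:
    """Parse one sentence, skipping the header and the #BOS/#EOS markers."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("%", "#BOS", "#EOS")):
            continue
        fields = line.split()
        fields += [None] * (5 - len(fields))   # pad rows with missing columns
        form, tag, morph, edge, parent = fields[:5]
        rows.append(Row(form, tag, morph, edge, int(parent) if parent else None))
    return rows


rows = parse_sentence(SAMPLE)
# Children of node 501, i.e. the words of the object NP.
np_words = [r.form for r in rows if r.parent == 501]
print(np_words)
```

Because the parent column points at nonterminal ids, the tree can be rebuilt by grouping rows on `parent`; no bracketing is needed in the file itself.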

● A TIGER XML document is split up into header and body
● Header: contains meta-information on corpus name, date, author etc. and an annotation grammar
● Body: directed acyclic graphs are used as the underlying data model to encode the linguistic annotation
● Element terminals: carries the attributes word, part-of-speech and morphological tag
● Element nonterminals: information on phrase categories and syntactic functions
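To make the body structure concrete, the sketch below reads a small hand-written fragment with Python's standard `xml.etree.ElementTree`. The element and attribute names (`s`, `graph`, `terminals`, `t`, `nonterminals`, `nt`, `edge`, `idref`) and the ids are assumptions made for illustration; check them against the TIGER XML documentation before relying on them. `read_graph` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; names and ids are assumed.
SAMPLE = """
<s id="s37">
  <graph root="s37_501">
    <terminals>
      <t id="s37_6" word="den" pos="ART" morph="Def.Masc.Akk.Sg"/>
      <t id="s37_7" word="Staatsanwalt" pos="NN" morph="Masc.Akk.Sg.*"/>
    </terminals>
    <nonterminals>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def read_graph(xml_text):
    """Return (terminals, edges) for one sentence element.

    terminals maps a terminal id to its attribute dict;
    edges is a list of (parent id, parent category, edge label, child id).
    """
    sentence = ET.fromstring(xml_text.strip())
    terminals = {t.get("id"): dict(t.attrib) for t in sentence.iter("t")}
    edges = []
    for nt in sentence.iter("nt"):
        for edge in nt.findall("edge"):
            edges.append((nt.get("id"), nt.get("cat"),
                          edge.get("label"), edge.get("idref")))
    return terminals, edges


terminals, edges = read_graph(SAMPLE)
for parent_id, cat, label, child_id in edges:
    print(cat, label, terminals[child_id]["word"])
```

Storing edges as references from a nonterminal to its children is what lets the format express crossing branches: the children of a node do not have to be adjacent in the terminal sequence.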

TiGer: Annotation scheme
● Uses a hybrid framework which combines advantages of dependency grammar and phrase structure grammar
● Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities (e.g. the distinction between arguments and adjuncts is not expressed in the constituent structure, but encoded by means of syntactic functions)
● Based on the NEGRA annotation scheme
● Changes in TIGER:
  – improvement of linguistic adequacy
  – extension of linguistic inventory
● Cross-fertilization of corpus and annotation scheme: annotation and comparison => discrepancy between annotation scheme and data => changes in annotation scheme, test for operationalization

TiGer: Query tool
● TIGERSearch: query tool for treebanks using TIGER Query Language
● TIGERRegistry: format conversions into TIGER XML and indexing of the annotated corpus
● TIGER Graph Viewer: visualization of query results
● TIGERin: Graphical User Interface to simplify complex queries and to improve accessibility of the query language

TiGer: Query language

Node level:
● Nodes can be described by Boolean expressions over feature-value pairs
● Query: [word="lacht" & pos="VVFIN"]
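What a node-level description checks can be mimicked in a few lines of Python. This is a rough sketch of the idea, not how TIGERSearch is implemented: nodes are plain dictionaries, `matches` is a hypothetical helper, and the two-word sentence is invented for the example.

```python
def matches(node, **constraints):
    """True if every feature=value pair holds for the node (a conjunction, '&')."""
    return all(node.get(feature) == value for feature, value in constraints.items())


# Invented two-token sentence; each node is a set of feature-value pairs.
nodes = [
    {"word": "Er", "pos": "PPER"},
    {"word": "lacht", "pos": "VVFIN"},
]

# Counterpart of the query [word="lacht" & pos="VVFIN"]
hits = [n for n in nodes if matches(n, word="lacht", pos="VVFIN")]

# Disjunction and negation can be built from the same predicate.
not_finite_verb = [n for n in nodes if not matches(n, pos="VVFIN")]

print(hits)
print(not_finite_verb)
```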

Node relation level:
● Descriptions of two or more nodes are combined by a relation
● Query: [cat="NP"] >RC [cat="S"]
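The relation `>RC` reads as labelled direct dominance: the node on the left is the parent of the node on the right, and the connecting edge carries the label RC. A minimal sketch over a hand-built graph follows; the node ids, the example phrase and the helper names are all invented for illustration.

```python
def matches(node, spec):
    """True if the node carries every feature-value pair in spec."""
    return all(node.get(feature) == value for feature, value in spec.items())


def dominates(nodes, edges, parent_spec, child_spec, label=None):
    """Return (parent id, child id) pairs linked by a direct edge.

    If label is given, only edges carrying that label count.
    """
    return [
        (parent, child)
        for parent, edge_label, child in edges
        if (label is None or edge_label == label)
        and matches(nodes[parent], parent_spec)
        and matches(nodes[child], child_spec)
    ]


# Invented graph for "der Mann, der lacht": an NP containing a relative clause.
nodes = {
    1: {"word": "der", "pos": "ART"},
    2: {"word": "Mann", "pos": "NN"},
    3: {"word": "der", "pos": "PRELS"},
    4: {"word": "lacht", "pos": "VVFIN"},
    500: {"cat": "NP"},
    501: {"cat": "S"},
}
edges = [  # (parent id, edge label, child id)
    (500, "NK", 1), (500, "NK", 2), (500, "RC", 501),
    (501, "SB", 3), (501, "HD", 4),
]

# Counterpart of the query [cat="NP"] >RC [cat="S"]
result = dominates(nodes, edges, {"cat": "NP"}, {"cat": "S"}, label="RC")
print(result)
```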

Graph description level:
● Boolean expressions over node relations are allowed (without negation)
● Query: ([cat="S"] > [pos="PRELS"]) & ([cat="S"] > [pos="VVFIN"])
● Variables can be used to express coreference of nodes or feature values
● Query: (#n:[cat="S"] > [pos="PRELS"]) & (#n > [pos="VVFIN"])
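The difference between the two queries above is whether both conjuncts must be satisfied by the same S node. The sketch below illustrates this with set operations on an invented graph: without a variable, any pair of S nodes satisfies the conjunction, while binding `#n` amounts to intersecting the two result sets. The ids and helper names are hypothetical, and this is not a description of TIGERSearch internals.

```python
def matches(node, spec):
    """True if the node carries every feature-value pair in spec."""
    return all(node.get(feature) == value for feature, value in spec.items())


def parents_of(nodes, edges, parent_spec, child_spec):
    """Ids of nodes matching parent_spec that directly dominate a child_spec node."""
    return {
        parent
        for parent, _label, child in edges
        if matches(nodes[parent], parent_spec) and matches(nodes[child], child_spec)
    }


# Invented graph: S 501 is a relative clause, S 502 is a main clause.
nodes = {
    1: {"word": "der", "pos": "PRELS"},
    2: {"word": "lacht", "pos": "VVFIN"},
    3: {"word": "schläft", "pos": "VVFIN"},
    501: {"cat": "S"},
    502: {"cat": "S"},
}
edges = [(501, "SB", 1), (501, "HD", 2), (502, "HD", 3)]

with_prels = parents_of(nodes, edges, {"cat": "S"}, {"pos": "PRELS"})
with_vvfin = parents_of(nodes, edges, {"cat": "S"}, {"pos": "VVFIN"})

# Without a variable: the two S descriptions may be satisfied by different nodes.
unbound = {(a, b) for a in with_prels for b in with_vvfin}

# With #n: both relations must hold for one and the same S node.
bound = with_prels & with_vvfin

print(sorted(unbound))
print(sorted(bound))
```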

For further information (downloads, papers etc.):