Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra
Outline
● TiGer Treebank
● TiGer Search
The TiGer Treebank
● TIGER: LinguisTic Interpretation of a GERman Corpus
● Institute of Natural Language Processing (IMS) in Stuttgart, Institut für Germanistik in Potsdam, Department of Computational Linguistics and Phonetics in Saarbrücken
● German treebanks: Verbmobil Corpus (only spoken language), NEGRA Corpus and Tuebingen Treebank (only 20,000 sentences)
● The need for a large and comprehensive German treebank:
– Data for the testing and training of statistically based methods in natural language processing
– Basis for empirical language research
● TIGER Corpus:
– First release (mid 2003): 40,000 sentences of newspaper text (Frankfurter Rundschau, full articles)
– Second release (Christmas 2005): 50,000 sentences
– Together with the 20,000 NEGRA sentences, comparable in size to the Penn Treebank (1.5 million words)
TiGer: Levels of annotation
[Figure: annotated syntax graph for the sentence "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen."]
● Annotation on word level: part-of-speech, morphology, lemmata
    word         pos      morph            lemma
    Im           APPRART  Dat              in
    nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
    Jahr         NN       Dat.Pl.Neut      Jahr
    will         VMFIN    3.Sg.Pres.Ind    wollen
    die          ART      Nom.Sg.Fem       die
    Regierung    NN       Nom.Sg.Fem       Regierung
    ihre         PPOSAT   Acc.Pl.Masc      ihr
    Reformpläne  NN       Acc.Pl.Masc      Plan
    umsetzen     VVINF    Inf              umsetzen
    .            $.
● Node labels: phrase categories (S, VP, NP, PP)
● Edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK)
● Crossing branches for discontinuous constituency types
TiGer: Annotation method
● Interactive tagging and parsing
– Tagging: TnT (97% reliable)
– Parsing: Cascaded Markov Models (71% reliable)
– Morphology: TigerMorph
● Independent annotation by 2 different annotators and comparison => consistency of corpus + improvement of annotation scheme
● Annotation time: 10 minutes per sentence
TiGer: Annotation formats
    #BOS
    %word         tag    morph             edge  parent
    Ausgerechnet  ADJD   --                MO    502
    Iggy          NE     Masc.Nom.Sg       PNC   500
    Pop           NE     *.Nom.Sg          PNC   500
    verkörpert    VVFIN  3.Sg.Pres.Ind     HD    503
    gesanglich    ADJD   Pos               MO    503
    den           ART    Def.Masc.Akk.Sg   NK    501
    Staatsanwalt  NN     Masc.Akk.Sg.*     NK    501
    .             $.
    #500          MPN    --                NK    502
    #501          NP     --                OA    503
    #502          NP     --                SB    503
    #503          S
    #EOS 37
● Corpus annotation and storage on the basis of a MySQL database
● TIGER export format: a line-oriented, ASCII-based format
● Separate columns for words, part-of-speech tags, morphological information, edge labels and parent labels
● Encoded meta-information on date, source etc.
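The column layout of the export format can be illustrated with a minimal Python sketch. It assumes exactly five whitespace-separated columns (word, tag, morph, edge, parent), as listed on the slide; the actual export files may carry further columns and use tabs, so this is not a full reader. The names Row, parse_sentence and children_of are hypothetical, the sample is shortened, and the sentence number after #BOS is assumed to mirror "#EOS 37".

```python
from dataclasses import dataclass


@dataclass
class Row:
    word: str    # token, or a "#500"-style id for a phrase node
    tag: str     # part-of-speech tag, or phrase category for a phrase node
    morph: str   # morphological tag ("--" when absent)
    edge: str    # edge label: syntactic function towards the parent
    parent: int  # number of the parent node


# Shortened sample in the five-column layout (illustrative only).
SAMPLE = """\
#BOS 37
Ausgerechnet ADJD -- MO 502
Iggy NE Masc.Nom.Sg PNC 500
Pop NE *.Nom.Sg PNC 500
verkörpert VVFIN 3.Sg.Pres.Ind HD 503
#500 MPN -- NK 502
#502 NP -- SB 503
#EOS 37
"""


def parse_sentence(text):
    """Turn the rows between #BOS and #EOS into Row objects."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines, sentence delimiters and "%" comment/header lines.
        if not line.strip() or line.startswith(("#BOS", "#EOS", "%")):
            continue
        word, tag, morph, edge, parent = line.split()
        rows.append(Row(word, tag, morph, edge, int(parent)))
    return rows


def children_of(rows, node_id):
    """Return the words/ids of all rows whose parent is node_id."""
    return [r.word for r in rows if r.parent == node_id]


rows = parse_sentence(SAMPLE)
print(children_of(rows, 500))
print(children_of(rows, 502))
```

Because every row names its parent, the tree is recovered simply by grouping rows on the parent column; phrase nodes such as #500 appear both as parents and as children.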
● A TIGER XML document is split up into header and body
● Header: contains meta-information on corpus name, date, author etc. and an annotation grammar
● Body: directed acyclic graphs are used as the underlying data model to encode the linguistic annotation
● The terminals element carries the following attributes for each token: word, part-of-speech, morphological tag
● The nonterminals element carries information on phrase categories and syntactic functions
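A minimal Python sketch of how such a body fragment might be read with the standard library. The element and attribute names used here (s, graph, terminals, t, nonterminals, nt, edge, id, idref, cat, label) are assumptions modelled on the description above, and the fragment itself is a hand-written, shortened illustration rather than an excerpt from the corpus.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; names and ids are illustrative assumptions.
SAMPLE = """\
<s id="s37">
  <graph root="s37_502">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--"/>
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg"/>
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg"/>
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
      <nt id="s37_502" cat="NP">
        <edge label="MO" idref="s37_1"/>
        <edge label="NK" idref="s37_500"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""

sentence = ET.fromstring(SAMPLE)

# Index terminals by id -> word and nonterminals by id -> phrase category.
words = {t.get("id"): t.get("word") for t in sentence.iter("t")}
cats = {nt.get("id"): nt.get("cat") for nt in sentence.iter("nt")}


def describe(node_id):
    """Return the word for a terminal id, else the category of a nonterminal."""
    return words.get(node_id) or cats[node_id]


# One (parent category, syntactic function, child) triple per edge.
edges = [
    (nt.get("cat"), edge.get("label"), describe(edge.get("idref")))
    for nt in sentence.iter("nt")
    for edge in nt.iter("edge")
]
for triple in edges:
    print(triple)
```

In this layout the phrase category sits on the nonterminal node while the syntactic function sits on the edge to each child, which mirrors the node-label / edge-label split of the annotation.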
TiGer: Annotation scheme
● Uses a hybrid framework which combines advantages of dependency grammar and phrase structure grammar
● Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities (e.g. the distinction between arguments and adjuncts is not expressed in the constituent structure, but encoded by means of syntactic functions)
● Based on the NEGRA annotation scheme
● Changes in TIGER:
– improvement of linguistic adequacy
– extension of linguistic inventory
● Cross-fertilization of corpus and annotation scheme: annotation and comparison => discrepancy between annotation scheme and data => changes in annotation scheme, test for operationalization
TiGer: Query tool
● TIGERSearch: query tool for treebanks using TIGER Query Language
● TIGERRegistry: format conversions into TIGER XML and indexing of the annotated corpus
● TIGER Graph Viewer: visualization of query results
● TIGERin: Graphical User Interface to simplify complex queries and to improve accessibility of the query language
TiGer: Query language
Node level:
● Nodes can be described by Boolean expressions over feature-value pairs
● Query: [word="lacht" & pos="VVFIN"]
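What a node-level description expresses can be mimicked in a few lines of Python. This is only a sketch of the idea, not of how TIGERSearch evaluates queries: nodes are plain dictionaries, the helper names feat, both, either and negate are hypothetical, and the token list is invented for illustration.

```python
def feat(name, value):
    """Predicate for a single feature-value pair, e.g. pos="VVFIN"."""
    return lambda node: node.get(name) == value


def both(p, q):
    """Conjunction of two node predicates (the & in the query)."""
    return lambda node: p(node) and q(node)


def either(p, q):
    """Disjunction of two node predicates."""
    return lambda node: p(node) or q(node)


def negate(p):
    """Negation of a node predicate."""
    return lambda node: not p(node)


# Invented tokens; each node is a bundle of feature-value pairs.
tokens = [
    {"word": "Peter", "pos": "NE"},
    {"word": "lacht", "pos": "VVFIN"},
    {"word": "lacht", "pos": "VVIMP"},
]

# Counterpart of the query [word="lacht" & pos="VVFIN"].
query = both(feat("word", "lacht"), feat("pos", "VVFIN"))
hits = [t for t in tokens if query(t)]
print(hits)
```

Only the token that satisfies both feature constraints is returned; the second "lacht" is rejected because its part-of-speech value differs.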
TiGer: Query language
Node relation level:
● Descriptions of two or more nodes are combined by a relation
● Query: [cat="NP"] >RC [cat="S"]
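The query above reads as labelled direct dominance: an NP node that directly dominates an S node via an edge labelled RC. A minimal Python sketch of that relation, assuming a toy graph stored as a node table plus (parent, label, child) triples; the data and the names fits and labelled_dominance are hypothetical.

```python
# Toy graph: node id -> features, and labelled edges (parent, label, child).
nodes = {
    1: {"cat": "NP"},
    2: {"cat": "S"},
    3: {"cat": "S"},
    4: {"cat": "NP"},
}
edges = [
    (1, "RC", 2),  # NP dominates S with edge label RC
    (3, "SB", 4),  # S dominates NP with edge label SB
]


def fits(node, description):
    """True if the node carries every feature-value pair of the description."""
    return all(node.get(k) == v for k, v in description.items())


def labelled_dominance(parent_desc, label, child_desc):
    """All (parent, child) pairs joined by an edge with the given label."""
    return [
        (p, c)
        for p, edge_label, c in edges
        if edge_label == label
        and fits(nodes[p], parent_desc)
        and fits(nodes[c], child_desc)
    ]


# Counterpart of the query [cat="NP"] >RC [cat="S"].
result = labelled_dominance({"cat": "NP"}, "RC", {"cat": "S"})
print(result)
```

The second edge is not matched: it has the wrong label and runs from S to NP, so both the node descriptions and the edge label constrain the result.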
TiGer: Query language
Graph description level:
● Boolean expressions over node relations are allowed (without negation)
● Query: ([cat="S"] > [pos="PRELS"]) & ([cat="S"] > [pos="VVFIN"])
● Variables can be used to express coreference of nodes or feature values
● Query: (#n:[cat="S"] > [pos="PRELS"]) & (#n > [pos="VVFIN"])
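The difference between the two queries can be made concrete with a small Python sketch. It assumes that, without a variable, the two [cat="S"] descriptions may be satisfied by different S nodes, whereas #n forces both relations onto the same node. The toy graph and the helper name parents_of are hypothetical.

```python
# Toy graph: three S nodes and four tokens (invented for illustration).
nodes = {
    "s1": {"cat": "S"}, "s2": {"cat": "S"}, "s3": {"cat": "S"},
    "t1": {"pos": "PRELS"}, "t2": {"pos": "VVFIN"},
    "t3": {"pos": "PRELS"}, "t4": {"pos": "VVFIN"},
}
# Direct dominance as (parent, child) pairs.
dominates = {("s1", "t1"), ("s1", "t2"), ("s2", "t3"), ("s3", "t4")}


def parents_of(pos):
    """S nodes that directly dominate a token with the given part of speech."""
    return {
        p for p, c in dominates
        if nodes[p].get("cat") == "S" and nodes[c].get("pos") == pos
    }


with_prels = parents_of("PRELS")  # S nodes matching [cat="S"] > [pos="PRELS"]
with_vvfin = parents_of("VVFIN")  # S nodes matching [cat="S"] > [pos="VVFIN"]

# Without a variable: any pairing of an S from each side satisfies the query.
unbound = {(a, b) for a in with_prels for b in with_vvfin}

# With #n: the same S node must satisfy both relations.
bound = with_prels & with_vvfin

print(sorted(unbound))
print(sorted(bound))
```

In this toy graph the unbound query yields four pairings, while the variable narrows the result to the single S node that dominates both a relative pronoun and a finite verb.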
For further information (downloads, papers etc.):