109.05.2015 International Conference “Corpus linguistics – 2013” St. Petersburg, June 25–27, 2013 Roland Mittmann, M.A. Institute of Empirical Linguistics.

Presentation transcript:

International Conference “Corpus linguistics – 2013”, St. Petersburg, June 25–27, 2013
Roland Mittmann, M.A.
Institute of Empirical Linguistics, Goethe University, Frankfurt am Main, Germany
Old German and Old Lithuanian: The Creation of Two Deeply-Annotated Historical Text Corpora

Introduction
Aim: creation of deeply-annotated corpora of historical language stages
Approach: depending on
– existing resources from previous analyses
– qualities of the language itself
Comparison of approaches:
– Old German Reference Corpus (OG/OGRC)
– Old Lithuanian Reference Corpus (OL/OLRC)

Description of the corpora
Old German Reference Corpus (Referenzkorpus Altdeutsch)
all preserved texts from the oldest stages of German
– Old High German and Old Saxon (= Old Low German)
– ca. 750 – 1050 CE
– ca. 650,000 word tokens
cooperation of 3 German universities: 2008 – 2013
– Humboldt University (Berlin)
– Goethe University (Frankfurt am Main)
– Schiller University (Jena)
several subcorpora already searchable online

Description of the corpora OGRC:

Description of the corpora
Old Lithuanian Reference Corpus (Senosios lietuvių kalbos korpusas)
preserved texts from the oldest stage of Lithuanian
– ca – 1800 CE
– ca. 10,000,000 word tokens
pilot project covering 540,000 word tokens started in 2012
international cooperation
– Lithuanian Language Institute (LKI, Vilnius)
– Goethe University (Frankfurt am Main)
– University of Pisa, Italy
use of experience gained with the OGRC, thanks to the cooperation in Frankfurt

Description of the corpora
Qualities of the texts of both corpora
types of texts:
– religious and secular texts
– prose and poetry
– translated/adapted and independently composed texts
language:
– variation due to diachronic, diatopic and diastratic differences
foreign-language source texts and foreign-language words in the texts:
– annotated as similarly as possible to OG/OL word tokens
– included in the word token counts given above
Old Lithuanian: balanced choice of texts for pilot project

The unequal starting points
Divergence from modern languages
OL considerably closer to Modern Lithuanian than OG to Modern (High or Low) German
– not only due to different age: invention of the printing press in the 15th century and spread of written texts
  → slower pace of change in European literary languages
  → moderate language development from OL to Modern Lithuanian (however, large differences in spelling; many variants in OL)
  vs. extensive changes in the vowel system between OG and Early Modern Times (e.g. reduction of unstressed vowels to schwa/zero)

The unequal starting points
→ Impact on availability of resources
Old Lithuanian
– no historical dictionary of Lithuanian, no OL grammar (but OL dictionaries)
– dictionaries and grammars of Modern Lithuanian may be helpful
Old German
– specific dictionaries and grammars
– glossaries for every subcorpus: all attested inflected word forms, related to corresponding lemmata
→ OLRC: basis for compilation of OL grammar and glossary
→ OGRC: questioning and amending of existing works

The unequal starting points
Digital availability of the texts
OG: one printed edition per text, digitized by the TITUS project in Frankfurt
OL: 10 texts in pilot project
– 6 on TITUS
– 3 adopted from OL database of Lithuanian Language Institute (LKI)
– 1: edition being prepared
TITUS texts:
– structural annotation: e.g., chapters and lines for original document and edition
– information can be adopted directly, together with the texts

The unequal starting points titus.uni-frankfurt.de

The unequal starting points
Referential text version
OGRC:
– digitized edition as main reference layer
– manual addition of original text forms and graphical peculiarities postponed; so far only performed by way of example
OLRC:
– digitized edition extended by version of original manuscripts or prints
– detailed representation of amendments
→ digitization of original documents required


The courses of action: OGRC
Pre-annotation
digitization of glossaries for the subcorpora into XML format
linking part-of-speech and morphological data of the word forms with the word tokens in the texts:
– extraction of data from glossary files
– enrichment with additional part-of-speech and morphological information manually extracted from grammars
most glossaries give attestations with locations in text
→ one-to-one attribution
aim of consistent spelling and consistent modern German translation
→ adaptation of glossary lemmata to standard dictionaries of Old High German and Old Saxon
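The linking step above can be sketched roughly as follows. This is only an illustration in Python: the XML element and attribute names, the tags and the sample entries are assumptions made up for the example, not the project's actual glossary format.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary fragment; element names, attribute names,
# tags and locations are invented for illustration.
GLOSSARY_XML = """
<glossary>
  <entry lemma="faran" pos="VV">
    <form text="fuor" morph="3.Sg.Past.Ind">
      <loc ref="1.2"/>
    </form>
    <form text="faran" morph="Inf">
      <loc ref="1.5"/>
    </form>
  </entry>
</glossary>
"""

def build_index(xml_text):
    """Map (location, word form) to the lemma, POS and morphology from the glossary."""
    index = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        for form in entry.iter("form"):
            for loc in form.iter("loc"):
                key = (loc.get("ref"), form.get("text"))
                index[key] = {
                    "lemma": entry.get("lemma"),
                    "pos": entry.get("pos"),
                    "morph": form.get("morph"),
                }
    return index

def annotate(tokens, index):
    """Attach glossary data to each (location, form) token; unmatched tokens get None."""
    return [(loc, form, index.get((loc, form))) for loc, form in tokens]

tokens = [("1.2", "fuor"), ("1.3", "thar")]
for row in annotate(tokens, build_index(GLOSSARY_XML)):
    print(row)
```

Tokens without a glossary match come back with None, which is where manual annotation would take over.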

The courses of action: OGRC
Conversion and manual annotation
conversion into ELAN format
– software by Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
database structure
– with part-of-speech, morphological, lemma and structural pre-annotation
manual annotation:
– amendment of information
– resolution of ambiguities
– addition of simple syntactic annotation

The courses of action: OGRC
automated creation of standardized version of word tokens
– from lemmata plus part-of-speech and morphological data
– morphological knowledge of the language stages encoded in a Perl program
– standard word forms used to detect annotation mistakes by automated comparison with word forms in text edition
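A minimal sketch of this consistency check, written in Python rather than the Perl used in the project. The ending table, the tag names and the sample records are illustrative assumptions; a real rule set would have to cover full paradigms and spelling variation.

```python
# Toy ending table keyed by (POS, morphology); tags and endings are
# illustrative assumptions, not the project's rule set.
ENDINGS = {
    ("NOUN", "Nom.Sg"): "",
    ("NOUN", "Gen.Sg"): "es",
    ("NOUN", "Dat.Sg"): "e",
}

def standard_form(lemma, pos, morph):
    """Build the expected standardized form, or None if no rule covers the tag pair."""
    ending = ENDINGS.get((pos, morph))
    return None if ending is None else lemma + ending

def find_suspects(records):
    """Return (attested, expected) pairs where the attested form differs from the generated one."""
    suspects = []
    for rec in records:
        expected = standard_form(rec["lemma"], rec["pos"], rec["morph"])
        if expected is not None and expected != rec["form"].lower():
            suspects.append((rec["form"], expected))
    return suspects

records = [
    {"form": "tages", "lemma": "tag", "pos": "NOUN", "morph": "Gen.Sg"},
    {"form": "tage", "lemma": "tag", "pos": "NOUN", "morph": "Gen.Sg"},  # tag and form disagree
]
print(find_suspects(records))
```

A mismatch only flags a token for review: it may be an annotation mistake, but it may equally be a legitimate spelling variant.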

The courses of action: OLRC
Pre-annotation
no glossaries
→ annotation tool learning from manual annotation required
use of Toolbox (by SIL International, Dallas, Texas)
– applying expandable dictionaries
one dictionary with data of Lemuoklis
– morphological analyser, lemmatizer and tagger by the LKI
– enriched by semi-manually classified data from dictionaries on OL, Slavic loanwords in OL and Bible names
other dictionary with data of Lithuanian language dictionary
– retrieval of data on all lemmata in the corpus from its digital version

The courses of action: OLRC Annotation in Toolbox (OLRC)

The courses of action: OLRC
lemmatization of word forms of OL texts: automatic if possible, otherwise manual
creation of standardized word forms by Lemuoklis from lemmata, part-of-speech and morphological annotation
Modern Lithuanian-English dictionary
→ lemma translation
conversion of word tokens into standardized spelling: Consistent Changes Program (SIL)
– mainly for older texts, specific rules for every single author needed
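The idea of author-specific spelling rules can be illustrated with ordered string replacements. The sketch below is a Python stand-in, not the syntax of the Consistent Changes Program; the author keys and the rules themselves are hypothetical examples.

```python
# Ordered replacement rules per author; the author keys and the rules
# are hypothetical examples, not the project's rule files.
RULES = {
    "author_a": [("ſ", "s"), ("sz", "š"), ("cz", "č"), ("w", "v")],
    "author_b": [("ſ", "s")],
}

def standardize(token, author):
    """Apply the author's rules in order; unknown authors leave the token unchanged."""
    for old, new in RULES.get(author, []):
        token = token.replace(old, new)
    return token

print(standardize("ſzwentas", "author_a"))  # šventas
print(standardize("ſzwentas", "author_b"))  # szwentas
```

Rule order matters here: "ſ" has to be rewritten to "s" before the "sz" rule can apply.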

The courses of action: OLRC
Manual annotation and conversion
in Toolbox:
– joining of texts with Lemuoklis' data
– manual disambiguation
Toolbox: no chart structure, limited number of annotation layers
→ transfer of data into ELAN
– automated split-up of word forms into graphemes → annotation (also OGRC)
– e.g., addition of information on multiword expressions, quotations and glossing of words
→ conversion into image annotation tool ImAnTo (Frankfurt University)
– annotation of facsimiles of original documents
– selection of details of images and linking to annotations
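One simple way to approximate the automated split into graphemes is to attach combining marks to the preceding base character. This Python sketch is an assumption about how such a step could look, not the project's tool; it does not implement full Unicode grapheme cluster rules and does not treat digraphs as single units.

```python
import unicodedata

def split_graphemes(word):
    """Split a word into base characters, each with its combining marks attached."""
    graphemes = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and graphemes:
            graphemes[-1] += ch  # diacritic belongs to the preceding base letter
        else:
            graphemes.append(ch)
    return graphemes

units = split_graphemes("žodis")
print(len(units))  # 5 units; the first is "z" plus a combining caron
```

Treating a digraph as one grapheme would need an additional, language-specific list of multi-letter units.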


The courses of action: Parallel processing
Tagsets and annotation schemes
part-of-speech and morphological annotation:
OGRC: Deutsch Diachron Digital Tagset (DDDTS)
– adaptation of TIGER Morphology Annotation Scheme for Modern German, based on Stuttgart-Tübingen Tagset (STTS)
– DDDTS used as basis for creation of tagset for OL
– distinguishing between lemma-specific and record-specific qualities of word tokens
language of word tokens according to ISO (goh, osx; olt; lat)
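The distinction between lemma-specific and record-specific qualities can be pictured as two linked record types. The following Python sketch is hypothetical: the field names, the assignment of gender to the lemma level and of case and number to the token level, and the tag values are assumptions for illustration, not the DDDTS definition. Only the four language codes are taken from the slide.

```python
from dataclasses import dataclass

# Language codes listed on the slide.
LANGUAGE_CODES = {"goh", "osx", "olt", "lat"}

@dataclass(frozen=True)
class LemmaInfo:
    """Qualities assumed to hold for the lemma as a whole."""
    lemma: str
    pos: str
    gender: str

@dataclass(frozen=True)
class TokenRecord:
    """Qualities assumed to belong to one attestation of a lemma."""
    form: str
    lang: str
    lemma: LemmaInfo
    case: str
    number: str

def has_known_language(record):
    """Check that the token carries one of the language codes used in the corpora."""
    return record.lang in LANGUAGE_CODES

tag = LemmaInfo(lemma="tag", pos="NOUN", gender="Masc")
token = TokenRecord(form="tages", lang="goh", lemma=tag, case="Gen", number="Sg")
print(has_known_language(token))  # True
```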

The courses of action: Parallel processing
The ANNIS database
transfer of subcorpora of both projects into ANNIS database (Potsdam University, Germany)
joining of texts with extensive metadata description
– developed by Middle High German and OGRC, adapted by OLRC
complex search patterns possible, more convenient search tool in preparation

The courses of action: Parallel processing Representation in the ANNIS database (OGRC)

Conclusion
Comparison of approaches for OL and OG
work on OLRC benefits from course of action applied for OGRC
– in spite of various aspects diverging initially
OLRC can use digitized data and tools for Modern Lithuanian
– not applicable to OGRC
lack of glossaries for OLRC
→ additional adaptive annotation tool
special approaches required for objectives exceeding those of OGRC
– e.g. precise annotation of facsimiles of original documents
→ however, cooperation advantageous, more time for philological work

Thank you for your attention! Old German Reference Corpus: