Published by Laura Oliver. Modified over 9 years ago.
1
Types and Tokens Distribution in TITUS (Distribution of Word Forms in the TITUS Corpus)
Dr. Svetlana Ahlborn
Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main
E-Mail: l.ahlborn@em.uni-frankfurt.de
2
Outline
* TITUS Resource Data
* Peculiarities of TITUS texts
* Tokens and Types calculation in TITUS Resources
* Metadata for Tokens and Types distribution
Corpus Linguistics 2013 (Корпусная лингвистика 2013), 26.06.2013
3
TITUS Resource Data
TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien): http://titus.uni-frankfurt.de
TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.
A token is a concrete occurrence of a linguistic unit; a type bundles the tokens that belong together.
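The token/type distinction can be illustrated with a minimal Python sketch. It assumes that a text has already been split into word forms and that identical strings form one type. The function name and the sample sentence are illustrative only and do not come from TITUS.

```python
from collections import Counter


def count_tokens_and_types(words):
    """Return (number of tokens, number of types) for a list of word forms.

    Assumption: two tokens belong to the same type if their strings are identical.
    """
    type_counts = Counter(words)  # maps each distinct form to its frequency
    return len(words), len(type_counts)


# Hypothetical sample text, not taken from the corpus.
sample = "in the beginning was the word and the word was with god".split()
tokens, types = count_tokens_and_types(sample)
print(tokens, types)  # 12 tokens, 8 types
```

Here "the" occurs three times and "word" and "was" twice each, so twelve tokens collapse into eight types.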
4
TITUS Data
http://www.clarin.eu/node/1512 (added by J. Gippert, R. Mittmann)
5
TITUS Search Engine
The TITUS search engine does not determine the number of tokens in a given text, but the number of quotations of a word.
6
Peculiarities of TITUS texts: Gothic
The Biblia Gothica contains additional parallel passages in Latin and Greek.
Biblia Gothica: http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm
7
Peculiarities of TITUS texts: Old Church Slavonic
Old Church Slavonic texts are represented in two ways: in the Glagolitic alphabet, the original form of the text, and in the Cyrillic alphabet.
Codex Marianus: http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm
8
Peculiarities of TITUS texts: Old Polish
Old Polish texts display editions from different periods side by side.
Kazania Świętokrzyskie: http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm
9
Peculiarities of TITUS texts: Ossetian
The Ossetian Nart epic is represented in Latin script and in extended Cyrillic.
Ossetian Nart epic: http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm
10
Peculiarities of TITUS texts: Russian-Low German
Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variants.
11
Peculiarities of TITUS texts: Old Prussian
The Old Prussian corpus consists of at least 21 different languages or language variants (including Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).
12
Creation
A digitized source consists not only of source-language words but also contains various information that did not originally belong to the document: numbers, tags, punctuation marks, edition information, etc.

    # Perl substitutions applied to each line ($zeile, German for "line").
    # Remove optional digits, whitespace, and a markup sequence in angle brackets (case-insensitive).
    $zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;
    # Remove optional digits followed by whitespace and a space.
    $zeile =~ s/\d*\s+ //g;
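The kind of clean-up described here might be sketched in Python as follows. This is not the TITUS procedure: the three patterns, the function name `clean_line`, and the sample line are assumptions chosen for illustration, and real sources would need source-specific rules, for example for edition information.

```python
import re

# Assumed patterns for material that is not part of the source-language text.
TAG = re.compile(r"<[^>]*>")       # markup in angle brackets
NUMBER = re.compile(r"\d+")        # line, verse, or page numbers
NON_WORD = re.compile(r"[^\w\s]")  # punctuation marks


def clean_line(line):
    """Strip tags, numbers, and punctuation, then split into word forms."""
    line = TAG.sub(" ", line)       # remove tags first, including digits inside them
    line = NUMBER.sub(" ", line)
    line = NON_WORD.sub(" ", line)
    return line.split()


# Hypothetical digitized line with a number, tags, and punctuation.
print(clean_line("12 <b>Atta</b> unsar, þu in himinam."))
# ['Atta', 'unsar', 'þu', 'in', 'himinam']
```

Replacing unwanted material with a space rather than deleting it keeps neighbouring words from being fused. The sketch would also split forms that contain an apostrophe or hyphen, which a production rule set would have to handle.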
13
Examples: Gothic
Gothic Bible. Old Testament Fragments. Total: 1629 tokens and 893 types

Language  Tokens  Types
Gothic       420    240
Latin        572    325
Greek        627    319
14
Examples: Gothic
Gothic Bible. New Testament Books. Total: 170215 tokens and 28876 types

Language  Tokens  Types
Gothic     61167   9121
Latin      52648   9036
Greek      56400  10719
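As a quick arithmetic check, the per-language figures for the New Testament books add up to the stated totals for both tokens and types. This suggests, though the slide does not say so explicitly, that types are counted separately per language. The dictionary layout below is only an illustrative choice.

```python
# Per-language (tokens, types) for the New Testament books, as given on the slide.
counts = {
    "Gothic": (61167, 9121),
    "Latin": (52648, 9036),
    "Greek": (56400, 10719),
}

total_tokens = sum(tokens for tokens, _ in counts.values())
total_types = sum(types for _, types in counts.values())

print(total_tokens, total_types)  # 170215 28876
```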
15
Examples: Tönnies Fenne's Manual (17th century)
The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.
16
Examples: further application
17
Metadata
* DC: Dublin Core
* TEI: Text Encoding Initiative
* CEI: Corpus Encoding Initiative
* IMDI: ISLE Meta Data Initiative
* OLAC: Open Language Archives Community
* CMDI: Component MetaData Infrastructure
18
CMDI - Component MetaData Infrastructure
http://www.clarin.eu/cmdi
19
TITUS Metadata: HTML Format
TITUS Texts: Biblia gothica: Frame
20
New Metadata Set for TITUS
* Name: existing
* Author: new
* ProjectContactName: existing
* ProjectContactAddress: existing
* ProjectContactEmail: existing
* ProjectContactOranisation: existing
* ProjectDescription: existing
* Resource.Language: new
* Resource.ResourceLink: existing
* Resource.Access.Availability: existing
* Resource.Access.Date: existing
* Resource.Access.Owner: existing
* Resource.Access.Publisher: existing
* Resource.Publication.Time.Original.Manuscript: new
* Resource.Publication.Time.Original.Facsimile: new
* Resource.Publication.Time.Original.Published: new
* Resource.Publication.Time.Electronic: existing
* Resource.Wordcount.General.Tokens: new (CLARIN)
* Resource.Wordcount.General.Types: new
* Resource.Wordcount.Language.Tokens: new
* Resource.Wordcount.Language.Types: new
* Resource.Metadata.Encoding: new
21
Metadata Example for TITUS – XML CMDI
16.6.2002
Total: 1629 Tokens, 893 Types

Language   Tokens  Types
1_General      10      9
2_Gothic      420    240
4_Latin       572    325
5_Greek       627    319
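How such word counts could be serialized as XML is sketched below with Python's standard `xml.etree.ElementTree`. The element names (`Wordcount`, `General`, `Language`, `Tokens`, `Types`) and the `name` attribute are assumptions loosely modelled on the dotted field names of the metadata set. They are not the actual TITUS CMDI profile, whose markup is not preserved in this transcript.

```python
import xml.etree.ElementTree as ET


def build_wordcount(total, per_language):
    """Build a hypothetical <Wordcount> element from (tokens, types) pairs."""
    wordcount = ET.Element("Wordcount")
    general = ET.SubElement(wordcount, "General")
    ET.SubElement(general, "Tokens").text = str(total[0])
    ET.SubElement(general, "Types").text = str(total[1])
    for name, (tokens, types) in per_language.items():
        language = ET.SubElement(wordcount, "Language", {"name": name})
        ET.SubElement(language, "Tokens").text = str(tokens)
        ET.SubElement(language, "Types").text = str(types)
    return wordcount


# Figures from the metadata example for the Gothic Old Testament fragments.
per_language = {
    "1_General": (10, 9),
    "2_Gothic": (420, 240),
    "4_Latin": (572, 325),
    "5_Greek": (627, 319),
}
element = build_wordcount((1629, 893), per_language)
print(ET.tostring(element, encoding="unicode"))
```

With the `1_General` entry included, the per-language tokens and types sum to the overall totals of 1629 and 893.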
22
Metadata for TITUS – Browser
23
Metadata for TITUS – Browser
24
Metadata for TITUS – Browser
25
Thank you for your attention!

Links
* ARBIL (metadata editor): http://tla.mpi.nl/tools/tla-tools/arbil/
* CLARIN: http://www.clarin.eu
* CMDI: http://www.clarin.eu/cmdi
* Dublin Core: http://dublincore.org/documents/dcmi-terms/
* IMDI: http://www.mpi.nl/IMDI/
* OLAC: http://www.language-archives.org/
* TEI: http://www.tei-c.org/index.xml
* TITUS: http://titus.uni-frankfurt.de
26
Old Prussian Corpus
Tokens General: 17662 tokens
Types General: 8390 types