Published by Tamsyn Burke. Modified over 9 years ago.
1
SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH. Ivan Kanič, University of Ljubljana, Faculty of Economics. International scientific conference «Corpus linguistics», Saint-Petersburg State University, June 25 – 27, 2013
3
SLOVENIA Population: 1,992,690 Ljubljana (capital) 260,000 Independence: 25 June 1991 (from Yugoslavia) Surface: 20,273 sq km Border countries: Austria, Croatia, Hungary, Italy Adriatic coastline: 46.6 km Highest point: Triglav 2,864 m
4
SLOVENIA (2) Language: Slovene (var.: Slovenian) Ethnic composition: Slovene 83.1%, Serb 2%, Croat 1.8%, Bosniak 1.1%, other or unspecified 12% Religions: Catholic 57.8%, Muslim 2.4%, Orthodox 2.3%, other or unspecified 28%, none 10.1% (2002 census) GDP - per capita: $28,700 (2012) Currency: EURO (introduced in 2007)
5
SLOVENE LANGUAGE Slovenski jezik, slovenščina Western South Slavic language ca. 2.4 million speakers (1.85 million first language) 50 regional dialects (limited mutual understanding: „most diverse Slavic language“) Latin alphabet with Č, Š, Ž Highly inflected language Particularities: dual
6
SLOVENE TEXT CORPORA More than 20 corpora available online Representative (general) corpora Synchronous corpora: Nova Beseda – 240 million words, 2004 (ca. 10 years' coverage); GigaFida – 1.2 billion words, 1990-2011 Specialised corpora – DSI, Jos, Evrokorpus, VAYNA... – EduKorp, Bibliotekarstvo
7
Slovene LIS Terminology Long professional tradition Linguistic shortage in the subject field – Lack of written technical texts – German language tradition – Later English influences – NO dictionaries in LIS terminology – Terminology Project 1987 – Important tangible results
8
Usables International Project – Multilingual Dictionaries of Library Terminology English-Slovene Dictionary of Library Terminology (Slovene) Dictionary of Library Terminology – Printed edition – Electronic edition (web, public access) Text Corpus – Korpus bibliotekarstva
9
Korpus bibliotekarstva Specialized corpus Library and Information Science & practice Synchronous Open public access Dedicated in-house software – PC data processing – Web-based usage – Rich experience (e.g. Dictionaries of the Slovene Academy of Sciences and Arts)
10
Texts Defined selection criteria Subject & Level Written texts Electronically published texts only – Digital born – Digitized & published – NO scanning for the corpus Technical limitations and barriers
12
Selected texts & Functions
13
Basic functions Simple/basic search – Single words & phrases – N-grams (N = 1 – 5) – Concordances – Global corpus – selected document segment(s) – Exact matching – Truncation (*) – Upper / lower case Knjižnica - knjižnica
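The basic functions listed on this slide (n-grams, concordances, truncation with `*`, optional case sensitivity) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the corpus's in-house software: the function names `tokenize`, `ngrams`, `wildcard_to_regex` and `concordance` are hypothetical, tokenization is a naive split on word characters, and no lemmatisation is applied.

```python
import re

def tokenize(text):
    # Naive tokenizer: \w is Unicode-aware in Python 3, so č, š, ž stay inside words.
    return re.findall(r"\w+", text)

def ngrams(tokens, n):
    # All contiguous sequences of n tokens (the slide mentions N = 1 to 5).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def wildcard_to_regex(pattern, ignore_case=True):
    # '*' stands for any run of word characters (truncation); the rest is literal.
    parts = [re.escape(p) for p in pattern.split("*")]
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\w*".join(parts), flags)

def concordance(tokens, pattern, width=3, ignore_case=True):
    # Keyword-in-context lines: (left context, matched word, right context).
    rx = wildcard_to_regex(pattern, ignore_case)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample sentence, for illustration only.
sample = tokenize("Knjižnica ima katalog. Knjižnični katalog je v knjižnici.")
for left, word, right in concordance(sample, "knjižn*"):
    print(f"{left:>25} [{word}] {right}")
```

With `ignore_case=False` the same query distinguishes `Knjižnica` from `knjižnica`, which corresponds to the upper / lower case option on the slide.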
14
Basic functions (2) Advanced search – Frequency search: =, Fr>1000, Fr>200 in be:kata* – Word length: =, Do=15 – Word masking (*): adjective + substantive, * katalog, knjižnični *
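The advanced searches on this slide combine a frequency threshold, a word-length filter and word masking in two-word phrases. The sketch below imitates those three filters in Python. It does not reproduce the tool's own query syntax (`Fr>...`, `Do=...`), whose exact semantics are not given in the slides; the helper names `wildcard`, `frequent_words`, `words_of_length` and `masked_bigrams` are hypothetical.

```python
import re
from collections import Counter

def wildcard(pattern):
    # '*' matches any run of word characters; a bare '*' matches any word.
    parts = [re.escape(p) for p in pattern.split("*")]
    return re.compile(r"\w*".join(parts), re.IGNORECASE)

def frequent_words(freq, min_count, pattern="*"):
    # Words occurring more than min_count times, optionally restricted by a pattern.
    rx = wildcard(pattern)
    return {w: c for w, c in freq.items() if c > min_count and rx.fullmatch(w)}

def words_of_length(freq, length):
    # Words with exactly the given number of characters.
    return sorted(w for w in freq if len(w) == length)

def masked_bigrams(tokens, first="*", second="*"):
    # Count two-word sequences in which either slot may be masked with '*'.
    rx1, rx2 = wildcard(first), wildcard(second)
    pairs = zip(tokens, tokens[1:])
    return Counter((a, b) for a, b in pairs if rx1.fullmatch(a) and rx2.fullmatch(b))

# Invented toy token list, for illustration only.
tokens = "knjižnični katalog in listkovni katalog ter knjižnični katalog".split()
freq = Counter(tokens)
print(frequent_words(freq, 1, "kata*"))
print(masked_bigrams(tokens, second="katalog").most_common())
```

Masking the first slot (`* katalog`) lists the words that precede the noun, while masking the second slot (`knjižnični *`) lists the nouns that follow the adjective.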
15
Hyperlinked list of texts & authors
16
Concordance list
17
Citation
18
Full-text access
19
Single word
20
Bigrams
21
Bigrams (2)
22
4-grams
23
Insight 625 texts 353 authors (single or co-authors) 3.66 million words Lemmatisation Part-of-speech tagging 28,808 distinct words Highest frequency: 172,031 (auxiliary verb „to be“) Hapax legomena: 7,310
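Figures of this kind (number of running words, number of distinct words, highest frequency, hapax legomena) follow directly from a frequency list. The sketch below shows one way to derive them, assuming the input is already a list of tokens or lemmas; the function name `corpus_stats` is hypothetical and the sample data is invented, so the numbers it prints are not the corpus figures.

```python
from collections import Counter

def corpus_stats(tokens):
    # Frequency list: each distinct word with its number of occurrences.
    freq = Counter(tokens)
    top_word, top_count = freq.most_common(1)[0]
    # Hapax legomena: words that occur exactly once.
    hapaxes = sorted(w for w, c in freq.items() if c == 1)
    return {
        "tokens": len(tokens),          # running words
        "types": len(freq),             # distinct words
        "top_word": top_word,
        "top_count": top_count,
        "hapax_count": len(hapaxes),
        "hapax_share": len(hapaxes) / len(freq),
    }

# Invented toy data, for illustration only.
print(corpus_stats("je knjižnica je katalog je knjižnica gradivo".split()))

# Share of hapax legomena among distinct words, using the figures on the slide.
print(round(7310 / 28808, 3))
```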
24
Frequency distribution First 50
25
Zipf's Law vs. experience
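Zipf's law states that the frequency of a word is roughly inversely proportional to its rank in the frequency list, f(r) ≈ C / r^s with s close to 1. One common way to compare the law with observed data is to fit a straight line to log frequency against log rank and read the exponent off the slope. The sketch below does that with an ordinary least-squares fit; it is an assumed illustration, not the method used for the slide, and the names `rank_frequency` and `zipf_exponent` are hypothetical.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    # Pairs (rank, frequency), with rank 1 for the most frequent word.
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))

def zipf_exponent(rank_freq):
    # Least-squares slope of log(frequency) against log(rank), sign-flipped.
    # Needs at least two ranks; an ideal Zipf distribution gives a value of 1.
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic frequencies that follow the law exactly (C = 1200, s = 1).
ideal = [(r, 1200 / r) for r in range(1, 7)]
print(zipf_exponent(ideal))
```

Applied to a real frequency list, an estimate well away from 1, or a poor fit at the top or bottom ranks, shows where the corpus departs from the law.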
26
Parts of speech: Verbs
27
Nouns, Adjectives
28
Accessibility Open Access CC License BLOG Bibliotekarska terminologija http://terminologija.blogspot.com
29
Problems & Challenges Choice & acquisition of texts „Analogue“ texts Copyright issues Technical barriers – PDF protected data – Special characters – Special text formatting – Typing errors – Genuine OCR errors
30
Problems & Challenges (2) Linguistic – Highly inflected language: data processing, search, analysis, part-of-speech tagging – Foreign language „contamination“ General – Resources: human, financial
31
Plans Harvesting new texts – Recent / current digital born publications – Recently digitized (e.g. „Knjižnica“) – „Backlog“: 120 graduate theses, 28 master's theses, 25 monographs & proceedings – Scientific analysis – Dictionary updating and supplementing
32
THANK YOU FOR YOUR ATTENTION! Check: http://terminologija.blogspot.com Contact: ivan.kanic@gmail.com http://www2.arnes.si/~ljnuk4/kanic.html