1
Corpus Linguistics 2013, Lancaster University, July 25th 2013
Digital corpora and other electronic resources for Maltese
Albert Gatt, Institute of Linguistics, University of Malta
Slavomír /bulbul/ Čéplö, Institute of the Czech National Corpus, Charles University
www.bulbul.sk/cl2013
2
Part 1 Some general information
3
Maltese
  Affiliation: Afro-Asiatic > Semitic > Central Semitic > Arabic > North-African
  Writing system: Latin script
  Spoken in: Malta (~ 400.000), Australia (34.396); 2011 census
  Status: National language (Const. ch. 1, sec. 5.1), official language along with English (Const. ch. 1, sec. 5.2)
  Regulating body: Il-Kunsill Nazzjonali tal-Ilsien Malti
4
Part 2 A bit of history
5
Corpus linguistics and Maltese then
MaltiLex (Rosner et al. 2000)
  Groundwork for electronic lexicography in Maltese
  Total size: ca. 1.500.000 tokens, mostly newspapers
  Never publicly released
  Preliminary experiments on PoS tagging
PsyCol Maltese Lexical Corpus (Francom, Woudstra and Ussishkin 2009)
  Data retrieved from the web (mostly newspapers)
  Total size: 3.323.325 tokens
  Used primarily in lexical access experiments
6
Corpus linguistics and Maltese now
  Clarin: www.clarin.eu
  METANET4U: www.metanet4u.eu
  The two corpora discussed here
7
Part 3 The corpora
8
MLRS Corpus (University of Malta)
  http://mlrs.research.um.edu.mt/index.php?page=31
  Running on IMS Open Corpus Workbench
  Two versions:
    V1.0: 100 million tokens, mostly publicly available texts, no annotation
    V2.0 beta: 130 million tokens, PoS-tagged
bulbulistan corpus
  www.bulbul.sk/bonito2
  Running on NoSketchEngine
  Alpha version: ~ 150 million tokens, no annotation
  Coming soon: beta version, ~ 160 million tokens, PoS-tagged
9
Data collection
Web data
  MLRS: Automated keyword-based webcrawling
  bulbulistan: Domain-targeted retrieval
Belles lettres
  MLRS: Author submission (privately or through publisher) and webcrawling
  bulbulistan: Scanning > OCR > checking and processing; straight up purchase
Others
  User submission
10
Post-processing
Text extraction (web data)
  MLRS: HTML parsers
  bulbulistan: HTML parsers
Structure analysis
  MLRS: Paragraph and sentence splitting, tokenization
  bulbulistan: Sentence splitting, tokenization
Cleaning
  MLRS: Removal of non-Maltese text (semi-automatic)
  bulbulistan: Removal of non-Maltese text on sentence-level (see next step)
11
Post-processing (continued)
Deduplication
  MLRS: On VERT file (Onion deduplication tool) at paragraph level
  bulbulistan: At document level
PoS Tagging
  MLRS: TnT trained on ~28k words, 95% accuracy
  bulbulistan: SVMTool / Apache OpenNLP trained on ~25k words, 94% accuracy
Spellcheck
  MLRS: Custom dictionary-based spell check (Rosner et al. 2012)
  bulbulistan: Only to correct diacritics, done as a part of tagging
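The difference between the two deduplication granularities named on this slide can be illustrated with a minimal sketch. It is not how Onion works (Onion compares n-gram overlap between units); it only hashes whitespace-normalised text and drops exact repeats. The function names, the blank-line paragraph delimiter, and the sample data are all assumptions made for illustration.

```python
import hashlib


def _fingerprint(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies collide.
    normalised = " ".join(text.split()).lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def dedup_documents(documents):
    """Document level: keep only the first copy of each whole document."""
    seen, kept = set(), []
    for doc in documents:
        key = _fingerprint(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def dedup_paragraphs(documents):
    """Paragraph level: drop any paragraph already seen anywhere in the collection."""
    seen, kept = set(), []
    for doc in documents:
        unique = []
        # Assumption: paragraphs are separated by a blank line.
        for para in doc.split("\n\n"):
            if not para.strip():
                continue
            key = _fingerprint(para)
            if key not in seen:
                seen.add(key)
                unique.append(para)
        if unique:  # a document made only of repeated paragraphs disappears
            kept.append("\n\n".join(unique))
    return kept


# Hypothetical sample: the third document repeats the first one verbatim,
# the second shares only its opening paragraph with the first.
docs = [
    "Para A\n\nPara B",
    "Para A\n\nPara C",
    "Para A\n\nPara B",
]

by_document = dedup_documents(docs)
by_paragraph = dedup_paragraphs(docs)
```

On this sample, document-level deduplication keeps the second document whole, shared paragraph included, while paragraph-level deduplication strips the shared paragraph from it. That is the practical trade-off between the two settings: paragraph level removes more boilerplate but can fragment documents.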
12
Post-processing (continued)
MLRS tagset
  EAGLES-like division into 41 major categories with morphosyntactic features
  Two-level annotation scheme:
    Level I: annotation of major category only
    Level II: addition of morphosyntactic features
  Current release (MLRS V2.0 beta) only has Level I annotation.
  Ongoing work on automatic morphological analysis; aim to combine PoS tagging with this for Level II

Word               Level I Tag        Level II additional features
raġel ‘man’        NN (noun)          sg, masc
mar ‘he went’      VV (main verb)     3sg, masc, perfective
għandu ‘he has’    VG (pseudo-verb)   3sg, masc, perfective
13
Post-processing (conclusion)
bulbulistan tagset
  55 categories based on morphological and semantic criteria
  Three levels:
    Major category (NOUN, PRON, VERB)
    Subcategory (NOUN_PROP, PRON_PERS, PART_ACT)
    Some morphological information (VERB.PERF, VERB.IMPF, PRON_PERS.NEG)
  All very much work in progress with the ultimate goal to align the tagset with MLRS

Word                     Tagged                           Notes
raġel ‘man’              raġel|NOUN                       noun
mar ‘he went’            mar|VERB.PERF                    verb, perfective
m’hijiex ‘she is not’    m’|NEG hijiex|PRON_PERS.NEG      negative particle + negated personal pronoun
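The three tag levels can be read off the tagged examples mechanically. The sketch below assumes, based only on the examples on this slide, that `|` separates word from tag, `_` separates major category from subcategory, and `.` introduces the morphological information. `Tag` and `parse_token` are hypothetical names, not part of any released bulbulistan tooling.

```python
from typing import NamedTuple, Optional, Tuple


class Tag(NamedTuple):
    major: str            # e.g. NOUN, PRON, VERB
    sub: Optional[str]    # e.g. PROP in NOUN_PROP
    morph: Optional[str]  # e.g. PERF in VERB.PERF


def parse_token(token: str) -> Tuple[str, Tag]:
    """Split a 'word|TAG' string into the word and its three tag levels."""
    # Assumption: the last '|' separates the word form from its tag.
    word, sep, tag = token.rpartition("|")
    if not sep or not word or not tag:
        raise ValueError(f"not a word|TAG token: {token!r}")
    # Assumption: '.' introduces morphology, '_' introduces the subcategory.
    head, _, morph = tag.partition(".")
    major, _, sub = head.partition("_")
    return word, Tag(major, sub or None, morph or None)


# The tagged examples from the slide, split on whitespace into tokens.
tagged = "m’|NEG hijiex|PRON_PERS.NEG".split()
parsed = [parse_token(t) for t in tagged]
```

Keeping the three levels as separate fields makes it straightforward to query on the major category alone, which is the level at which an alignment with the MLRS Level I tags would have to start.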
14
Composition

MLRS Corpus
Text type                                                        Tokens
Journalistic texts                                               68.800.000
Parliamentary debates                                            43.400.000
Belles lettres                                                   375.000
Academic texts                                                   170.000
Legal texts                                                      4.800.000
Religious texts                                                  403.700
Speeches                                                         18.000
Web pages (blogs etc., including Maltese Wikipedia articles)     6.500.000
Miscellaneous other texts                                        123.000

bulbulistan corpus
Text type                                                        Tokens
Journalistic texts                                               80.000.000
Parliamentary debates                                            50.000.000
Belles lettres                                                   600.000
Academic texts                                                   100.000
Other (blogs, ads etc.)                                          50.000
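To make the imbalance in these figures concrete, the share of each text type can be computed from the token counts. The sketch below uses the MLRS figures exactly as printed on the slide, with the dot read as a thousands separator; the variable and function names are illustrative only.

```python
# MLRS token counts as given on the slide (dot = thousands separator).
mlrs_raw = {
    "Journalistic texts": "68.800.000",
    "Parliamentary debates": "43.400.000",
    "Belles lettres": "375.000",
    "Academic texts": "170.000",
    "Legal texts": "4.800.000",
    "Religious texts": "403.700",
    "Speeches": "18.000",
    "Web pages": "6.500.000",
    "Miscellaneous other texts": "123.000",
}


def parse_count(text: str) -> int:
    # Strip the thousands separators before converting to an integer.
    return int(text.replace(".", ""))


mlrs = {name: parse_count(raw) for name, raw in mlrs_raw.items()}
total = sum(mlrs.values())
shares = {name: count / total for name, count in mlrs.items()}

# Print the text types from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:28s} {mlrs[name]:>12,d} {share:7.2%}")
print(f"{'Total':28s} {total:>12,d}")
```

The listed categories add up to 124.589.700 tokens, and journalistic texts together with parliamentary debates account for roughly nine tenths of that total.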
15
Balance and representativeness
MLRS / bulbulistan corpus
  Opportunistic text collections
  Ongoing effort to achieve balance by expanding underrepresented text types
  What is balanced / representative in a bilingual society with languages in complementary roles?
    Maltese: belles lettres, humanities
    English: sciences, economics
  Collaboration with publishers
  Online submission system for registered members (followed by filtering and post-processing)
  Collaboration with authors
  Scanning and OCR (especially for out-of-copyright works)
16
Diachronic dimension
Current status:
  Majority of texts date to 1998-2013 (journalistic texts, records of 9th through 12th legislature)
  Literary works from late 19th through early 20th century (~ 200k), some from 1945-1980 (~ 100k)
Goal: Extend the corpora to cover the history of Maltese
Two major periods:
  1824 (first book in Maltese published) - 1924
  1924 (establishment of official orthography) - present
17
Diachronic dimension (1824-1924)
  Years: 1831, 1848, 1885, 1924
  Words: tiegħek, qiegħed, ħwejjeġ, mhux
18
Corpus as research tool
Some recent papers:
  Čéplö, S. 2013. “An overview of object reduplication in Maltese” (corpus-informed)
  Fabri, R. and Gatt, A. 2013. “Morphological Productivity in Maltese: A corpus-based investigation of Romance derivational processes”
The corpus is also used by:
  translators
  high-school and college students
  everybody interested in Maltese
19
Part 4 Beyond corpora
20
MLRS corpus-related tools
Maltese Language Resource Server (mlrs.research.um.edu.mt)
Maltese Language Software Services (metanet4u.research.um.edu.mt)
  Paragraph and sentence splitter
  Tokenizer
  PoS Tagger
  Chunker
21
Other tools
Grammatical Framework
  Maltese Resource Grammar Library for Grammatical Framework
  http://www.grammaticalframework.org/
Verbs
  An online database of root-and-pattern verbs (Camilleri and Spagnol 2013)
  http://mlrs.research.um.edu.mt/resources/verbalroots/
Multimodal corpus
  MAMCO (Maltese Multimodal Corpus) (Paggio, Galea and Vella 2013)
  Twelve video-recorded conversations, annotated with speech and gesture
Project Vassalli
  The Maltese equivalent of Project Gutenberg
22
Part 5 What’s next
23
Next steps
  Merge the two corpora into one
  More texts (duh), especially from areas not represented well
  More annotation levels (basic morphological analysis)
  Creation of a balanced subcorpus
  Syntactic parsing > treebank
  Integrate the two corpora into larger projects:
    SketchEngine
    InterCorp