Overview of corpora and other language resources

Overview of corpora and other language resources, The Árni Magnússon Institute for Icelandic Studies, March 9th, 2015
Over roughly the past 10 years we have been building various language resources for Icelandic, to be made available electronically. Much of this work has been done at, or in collaboration with, The Árni Magnússon Institute. We maintain a website (malfong.is) that lists all available language resources: not just our own, but everything else we know about too.
I will introduce the corpora and language resources that have been built in recent years and made available by the institute, and briefly describe the resources and their usage licenses. I may also mention resources made available by others. Some of our data sets have quite severe restrictions on use, but others can be used for almost anything. Some licenses were written specifically for the data set in question and some are Creative Commons (CC) licenses. Some of the data sets I talk about have only recently been released under open licenses, even though they were collected years earlier. That is our aim: to make our data as open as possible, so that as many people as possible can and will use them.

Overview: Text Corpora; Speech Corpora; Language Descriptions and Dictionaries; Language Tools

Text Corpora
Proprietary license; Proprietary license; CC BY 3.0 license
Others: Icelandic Parsed Historical Corpus; Wortschatz
The Icelandic Frequency Dictionary (IFD) was the first big project in which a considerable amount of contemporary Icelandic text was collected, analyzed and reported on. When we started training taggers, i.e. for tagging the Tagged Icelandic Corpus, this was the obvious data to use, and about 70% of the original data has been available to everyone for a few years now. Not all of it is available: when the data was collected, mostly from books (up to 50% is fiction), the copyright holders only agreed to its use for that research, so we had to get them to accept another license when we published it electronically 20 years later. That license is quite complicated but essentially only allows use for research.
The Tagged Icelandic Corpus (MIM) is more balanced. It includes texts from printed books (fiction and non-fiction), newspapers, periodicals, websites (blogs, educational, government, etc.), speeches made in Parliament, student essays, radio and TV scripts, e-mail lists, etc. Some of that data is freely available to everyone, but most of it is not, and therefore the whole corpus has a license similar to that of the IFD.
MIM-GOLD is a 1 million word subset of MIM. It was automatically tagged (using the IFD as training data) and the tags were then manually corrected. This work is in its final stages: the planned correction process was finished last year and we have estimated the tagging accuracy using this data. We found a few flaws that should not be too hard to fix, which would bring the accuracy of the tags to the same level as the IFD. The current version is available for download now. It has the same license as MIM: research use only.
The Saga Corpus is a corpus of Old Icelandic sagas, 41 texts in all. Most of the texts were published in this form between 1985 and 1991; they were normalized to Modern Icelandic spelling and several inflectional endings were also changed to their Modern Icelandic form. The corpus was tagged iteratively, first using a method developed for Modern Icelandic. The tagging accuracy was measured on random samples (88%, compared to 90.4% for IFD texts). Some texts were then selected for manual correction. They were added to the IFD data and a new model was created, finally reaching an accuracy of 92.7%. The Saga Corpus is distributed under a CC BY 3.0 license, which makes it pretty close to public domain.
The Icelandic Parsed Historical Corpus is a diachronic corpus with samples of written Icelandic from the 12th century to modern times. It contains 1 million words and is mostly comparable to the corpora of historical English developed at UPenn.
Wortschatz is a text corpus of more than 500 million running words, mostly from the National Library's web scraping archives (2005 and 2010). It was developed at the University of Leipzig.
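The accuracy figures quoted for the tagged corpora are token-level comparisons between automatically assigned tags and manually corrected ones. The following is a minimal sketch of that calculation, not the institute's actual evaluation code; the function name and the tag labels in the sample are made up for illustration and are not taken from the Icelandic tagset.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)


# Hypothetical five-token sample; the tag labels are placeholders.
gold = ["PRON", "VERB", "NOUN", "ADV", "PUNCT"]
predicted = ["PRON", "VERB", "ADJ", "ADV", "PUNCT"]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.1%}")  # one of five tags differs
```

Measuring on random samples, as described for the Saga Corpus, amounts to applying the same comparison to a manually checked subset rather than to the whole corpus.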

Speech Corpora
Parliament Speech Corpus: 20 hours of speech; CC BY 3.0
Hjal Corpus: collected in 2003 for speech recognition
Málrómur: currently 44 hours of clean speech; collected in cooperation with Google for speech recognition; CC BY 4.0
ISLEX Recordings: recordings of all the Icelandic words (48,500) in the ISLEX dictionary and roughly 700 phrases; CC BY NC ND 3.0
Others: Jensson Corpus, Thor Corpus, RÚV discussions
The Parliament Speech Corpus contains 20 hours of speech (180,000 running words) in synchronized text and sound files. The recordings are from 2004-2005 and come with detailed transcriptions in text files. Information about the speakers (age, gender) is provided as well. This data is intended to reflect natural spoken Icelandic under formal conditions. The discussion periods were chosen because they primarily consist of unprepared speeches that are unlikely to have been written in advance and read out loud. The transcription and processing of the material was mostly carried out by students.
The Hjal Corpus was collected in 2003 for training a speech recognition system. It contains over 90,000 sound files, each containing an utterance of one or more words recorded over the phone. Most of them contain only a single word (but there are numbers, place names, etc.). It is hard to estimate the total duration because the sound files contain long silences before and after the utterances.
Málrómur is the most recent speech corpus. It was collected 3 years ago by Reykjavik University and The Árni Magnússon Institute in collaboration with Google. Google used this data as a basis for training their recognizer for Icelandic, but the recordings are open source, that is, available to all. We recorded around 130 thousand utterances and include information on the speakers (age group, gender). We are in the process of cleaning the data, that is, cutting off long silences before and after the utterances and making sure the spoken text matches the prompts given. We have published 57 thousand utterances, in total around 44 hours of clean speech. This data was recorded using Samsung phones, but not through the phone line.
The ISLEX Recordings were made in a studio and read by one woman (50 years old).
The three other corpora listed at the bottom of the slide (Jensson Corpus, Thor Corpus, RÚV discussions) contain in total between 6 and 7 hours of speech, with multiple speakers under good recording conditions.
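The cleaning step described for Málrómur, cutting off silence before and after each utterance, can be illustrated with a simple amplitude threshold. This is only a sketch under assumptions: the function name, the threshold value and the sample values are invented here, and a real pipeline would more likely work on short-time frame energy than on individual samples.

```python
def trim_silence(samples, threshold):
    """Drop leading and trailing samples whose absolute amplitude is below threshold.

    Quiet samples in the middle of the utterance are kept.
    """
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


# Hypothetical recording: near-silence, a short utterance, near-silence.
recording = [0, 1, -2, 0, 500, -730, 12, 640, -3, 0, 1]
trimmed = trim_silence(recording, threshold=100)
print(trimmed)
```

Only the edges are trimmed, so short pauses inside an utterance survive, which matters when the audio has to stay aligned with the prompt text.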

Language Descriptions and Dictionaries
Pronunciation dictionary: over 50,000 phonetically written word forms; CC BY 3.0
BÍN (Database of Modern Icelandic Inflection): 276,512 paradigms; proprietary license
The Icelandic Terminology Bank: 42 termbases; CC BY SA 3.0
ISLEX (Icelandic-Scandinavian dictionary): 50,000 words; CC BY NC ND 3.0
IceWordNet
The pronunciation dictionary was built as part of the Hjal project, discussed earlier. It is a list of phonetically transcribed words read by the participants in the Hjal project.
The Database of Modern Icelandic Inflection is a collection of paradigms. The project started in 2002 and the work is still ongoing. It currently contains more than 276,000 paradigms. The data is available for download and can be used with certain restrictions; for example, the user is not allowed to distribute the database to others. This database has proved very useful and is used in a wide variety of projects, everything from web search and spellcheckers/grammar checkers to computer games.
The Icelandic Terminology Bank is a collection of termbases, each compiled by specialists in their field. The Árni Magnússon Institute has provided the infrastructure for keeping records of the terms and publishing them online. The terminology bank contains around 60 searchable termbases, and 42 of these can be downloaded and used under an open license. The termbases vary greatly in size and detail, with the smallest containing a few hundred terms and the biggest tens of thousands.
ISLEX is an online multilingual dictionary with modern Icelandic as the source language and Danish, Norwegian and Swedish as target languages, with Faroese opening this month and Finnish also being worked on. Access to the online dictionary is free of charge, and the data is available to researchers under a non-commercial, non-derivative license.
IceWordNet is based on Princeton WordNet. It consists of nearly 5000 Icelandic translations of the words from the Princeton core list, along with the Icelandic synonyms of those words.

Language Tools
Older tools: CombiTagger, IceNLP, Lemmald
New tools: Skrambi, Nefnir, Kvistur
A few tools have been developed for working with language resources. These include CombiTagger, Lemmald and IceNLP, for tagging, lemmatizing, tokenizing, parsing and recognizing named entities. These older tools could use some updating, as their accuracy is not always optimal, to say the least. Tomorrow we will hear about some of the new tools being developed. These include a spellchecker and a lemmatizer.
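To give a rough idea of how a collection of inflectional paradigms can support lemmatizing, the sketch below indexes a handful of word forms by lemma and word class. It is an illustration under assumptions only: the record layout, the function names and the toy rows are hypothetical, they are not taken from BÍN, and this is not how Lemmald or any of the other tools named above is implemented.

```python
from collections import defaultdict

# Hypothetical records: (lemma, word class, inflected form).
PARADIGM_ROWS = [
    ("hestur", "noun", "hestur"),
    ("hestur", "noun", "hest"),
    ("hestur", "noun", "hesti"),
    ("hestur", "noun", "hests"),
    ("á", "noun", "á"),
    ("eiga", "verb", "eiga"),
    ("eiga", "verb", "á"),
]


def build_index(rows):
    """Map each lowercased word form to the set of (lemma, word class) pairs it can realize."""
    index = defaultdict(set)
    for lemma, word_class, form in rows:
        index[form.lower()].add((lemma, word_class))
    return index


def candidate_lemmas(index, form):
    """Return all (lemma, word class) candidates for a form, or an empty list if unknown."""
    return sorted(index.get(form.lower(), set()))


index = build_index(PARADIGM_ROWS)
print(candidate_lemmas(index, "hesti"))
print(candidate_lemmas(index, "á"))
```

A form such as "á" maps to more than one lemma, which is why a lookup like this only yields candidates; choosing among them needs context, for instance the tag assigned by a tagger.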