DEVELOPING AND MANAGING RESOURCE SCARCE LANGUAGES: THE SOUTH AFRICAN CASE JUSTUS C ROUX IMS STUTTGART 13.07.2015.


OUTLINE
  Concept: resource scarce languages
  Overview of the language situation in South Africa
  Lack of language resources and high-level support for development of resources
  Co-ordination of activities in resource development and management
  The demand for localised language services over digital devices and related opportunities

Resource scarce languages
  “Under-resourced languages are generally described as languages that suffer from a chronic lack of available resources, from human, financial, and time resources to linguistic ones (language data and language technology), and often also experience the fragmentation of efforts in resource development.” (Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).

Resource scarce languages (2)
  "This situation is exacerbated by the realization that as technology progresses and the demand for localised languages services over digital devices increases, the divide between adequately- and under-resourced languages keeps widening." (Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).

Issues are
  A chronic lack of available resources, from human, financial, and time resources to linguistic ones
  Fragmentation of efforts in resource development
  As technology progresses, the demand for localised language services over digital devices increases
But first, consider the language situation in South Africa

Language Situation in South Africa
  Home language (n = 52 million speakers)
  11 official languages

The official African languages grouped
  Nguni group: isiZulu, isiXhosa, Siswati, isiNdebele
  Sotho group: Northern Sotho / Sepedi, Southern Sotho / Sesotho, Western Sotho / Setswana
  Tshivenda / Xitsonga group: Tshivenda, Xitsonga
  Cross-border languages: Mozambique, Zimbabwe, Swaziland, Lesotho, Botswana
  45% 24% 4%

Similarities at different levels within groups
  Sotho group - disjunctive spelling - lexical items
    Ke tla bolela Sepedi.    I will speak Sepedi.
    Ke tla bua Setswana.     I will speak Setswana.
    Ke tla bua Sesotho.      I will speak Sesotho.
  Nguni group - conjunctive spelling - lexical items
    Ngizokhuluma isiZulu.    I will speak isiZulu.
    Ndizothetha isiXhosa.    I will speak isiXhosa.
Implications for NLP
  Grammatical structures across language groups the same
  Regular spelling: grapheme-to-phoneme conversion is direct
  Tone languages: specific implications and challenges for TTS systems
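One practical consequence of the disjunctive / conjunctive distinction can be sketched in a few lines. The example below is only an illustration, not part of the original presentation: it splits the slide's own example sentences on whitespace and counts the resulting orthographic words. The names `SENTENCES`, `whitespace_tokens` and `token_counts` are made up for this sketch, and whitespace splitting is assumed here as the simplest possible tokeniser.

```python
# Example sentences copied from the slide; each one means "I will speak <language>."
SENTENCES = {
    "Sepedi": "Ke tla bolela Sepedi.",       # Sotho group, disjunctive spelling
    "Setswana": "Ke tla bua Setswana.",      # Sotho group, disjunctive spelling
    "Sesotho": "Ke tla bua Sesotho.",        # Sotho group, disjunctive spelling
    "isiZulu": "Ngizokhuluma isiZulu.",      # Nguni group, conjunctive spelling
    "isiXhosa": "Ndizothetha isiXhosa.",     # Nguni group, conjunctive spelling
}


def whitespace_tokens(sentence):
    """Drop the final full stop and split the sentence on whitespace."""
    return sentence.rstrip(".").split()


def token_counts(sentences):
    """Map each language name to its number of whitespace-separated words."""
    return {lang: len(whitespace_tokens(text)) for lang, text in sentences.items()}


if __name__ == "__main__":
    for lang, count in token_counts(SENTENCES).items():
        print(f"{lang}: {count} orthographic words")
```

For the same English sentence, the Sotho-group examples yield four orthographic words and the Nguni-group examples two, so a tokeniser tuned for one group cannot simply be reused for the other.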

Afrikaans and its Germanic roots
  English:    My hand is in warm water.
  Afrikaans:  My hand is in warm water.
  Dutch:      Mijn hand is in warm water.
  German:     Meine Hand ist in warmem Wasser.
  Danish:     Min hånd er i varmt vand.
  Norwegian:  Min hånd er i varmt vann.
  Swedish:    Min hand är i varmt vatten.
Implications
  Bootstrapping Afrikaans systems from e.g. Dutch.
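The closeness of these sentences can be made concrete with a rough surface measure. The sketch below is an illustration only and not a method from the presentation: it computes a character-level Levenshtein (edit) distance between the Afrikaans sentence and three of the others, and converts it to a similarity score between 0 and 1. The function names and the choice of edit distance as a proxy are assumptions; a real bootstrapping effort would rely on far richer evidence than one sentence.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Case-insensitive similarity: 1.0 for identical strings, lower otherwise."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Sentences copied from the slide (a subset, for brevity).
AFRIKAANS = "My hand is in warm water."
SENTENCES = {
    "English": "My hand is in warm water.",
    "Dutch": "Mijn hand is in warm water.",
    "Swedish": "Min hand är i varmt vatten.",
}

if __name__ == "__main__":
    for lang, text in SENTENCES.items():
        print(f"Afrikaans vs {lang}: {similarity(AFRIKAANS, text):.2f}")
```

On this single sentence the Afrikaans and Dutch strings differ by three character edits (My / Mijn), which is the kind of surface closeness that makes Dutch a plausible starting point.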

ISSUE #1
Chronic lack of available (digital) resources, from human, financial, and time resources to linguistic ones
  Digital resources for previously marginalised languages extremely limited: newspapers, periodicals, relatively low presence on the Web
  Lack of language expertise: no tradition of Computational Linguistics; limited number of students in local languages; only North-West University offers degree courses in language technologies ("Linguists are still needed" – Ed Grefenstette)
  Growing expertise in Computer Science and signal processing with a focus on natural languages in most of the larger universities
  Financial support mainly ad hoc, from private sources

  Various initiatives for text and speech data collection over a number of decades, mainly for linguistic / phonetic research at academic institutions; resources difficult to share
  Continued academic pressure (on grounds of the constitution) on government to support research and development of language technologies, so as not to marginalise the indigenous languages again
  Large data acquisition projects sponsored by national government since 1999, as part of the National Language Plan (RSA and India are the only countries with an official policy regarding LT development)

Ministerial Panel: HLT Strategy for South Africa (2002)
  Focus on digital resources: text & speech (SA official languages)
2008: Human Language Technology Expert Panel (HLTEP) established
  commissions HLT application projects annually with governmental funds
  these projects invariably create digital resources
  obvious that it was necessary to create a central depository for all newly created language resources
Ongoing major projects since 2000 in text and speech domains
Refer to RMA resources to be discussed

ISSUE #2
Fragmentation of efforts in resource development
  Various language projects across the country generating text and speech resources for different purposes; availability of the data (?)
  Resources from projects commissioned by the HLTEP (i.e. funded by taxpayers' money) needed to be deposited in a central place
  2012: The National Department of Arts and Culture (DAC) established the Resource Management Agency (RMA) at the North-West University (Potchefstroom) under the auspices of the Centre for Text Technology (CTexT) as a three-year project.

NEWSLETTER

Contents of the RMA
LANGUAGE
  Afrikaans (31)
  English (30)
  isiNdebele (20)
  isiXhosa (23)
  isiZulu (27)
  Sesotho sa Leboa (Sepedi) (22)
  Setswana (20)
  Sesotho (Southern Sotho) (22)
  Siswati (20)
  Tshivenda (20)
  Xitsonga (24)
  Dutch (4)
  Yoruba (3)
PROJECT
  Autshumato (18)
  Lwazi (36)
  NCHLT Text (43)
  NCHLT Speech (13)
  African Speech Technology (15)
DATABASE TYPE
  Monolingual Speech Corpora: Annotated (22)
  Multilingual Text Corpora: Aligned (3)
  Monolingual Text Corpora: Annotated (1)
RESOURCE TYPES
  Data
  Modules
  Applications
  Tools/Platforms

FROM RMA TO NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES (NCDLR)
  RMA: status 3-4 year project (2012 – 2015) (Dept of Arts & Culture)
  Untenable, as development of resources is ongoing (living archive)
  National Department of Science and Technology (DST) (2014): international panel to determine a new South African Research Infrastructure Roadmap (SARIR)
  Presentations made to include language (Humanities) and technology in a Roadmap dominated by natural science, medicine, engineering, earth sciences etc.
  June 2015: The National Centre for Digital Language Resources approved, with long-term funding (press statement of DST to follow soon)

National Centre for Digital Language Resources
  University of Pretoria, Department of African Languages
  CSIR Meraka Institute (Human Language Technologies Research Group)
  North-West University, Centre for Text Technology (CTexT)
  University of South Africa, Department of African Languages
  ICELDA partnership

NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES
Functions
  Single point of entry for information on SA language resources (portal)
  Free open access for academic research
  Licensed access for commercial applications
  Includes RMA resources
  Systematic digitisation of scientifically valuable language resources of a historical nature (Scientific committee)

  Systematic digitisation of different registers/modes of language resources by the Centre, as well as by academics and the public through open-call funded projects
  Combine these projects with MA / PhD studies, with data to be deposited at the Centre
  Resource centre for studies in the domain of Digital Humanities

ISSUE #3
Demand for localised language services over digital devices increases
Available
  At text level
    Spelling checkers for all SA languages: CTexT (Microsoft)
    Machine translation of government documents: CTexT (Autshumato IMT)
    On-line translations: e.g. www.Translate.org, www.Freelang.net and various other software programs ranging from word lists to communication phrases
  At speech/text level (interactive telephone-based systems) (major projects)
    African Speech Technology: hotel reservation system in 5 languages (prototype), www.lrec-conf.org/proceedings/lrec2004/summaries/445.htm
    LWAZI I and II: various community-based applications

Why do we need to speed up localised language services?
There is a demand for a wide array of language-based communication systems:
  Interactive multilingual voice systems as information systems
  Interactive text-to-speech systems
  Literacy training in different languages
  Language-specific reading support for the blind
  Machine translation systems for public use
  Speech-to-speech communication systems with various language pairs
  Etc.
There are specific research and business opportunities; consider the following

Mobile telephone penetration: selected countries

Mobile cellular subscriptions (million)
  Japan          149
  Nigeria        127
  Germany        100
  South Africa    76
  Korea (Rep)     55
  France          36

Mobile cellular subscriptions per 100 inhabitants
  South Africa   146
  Germany        121
  Japan          117
  Korea (Rep)    111
  France          98
  Nigeria         73

Conclusion
Challenges for the development and management of different types of language resources and applicable tools:
  Academic considerations: insights into language structures and use
  Commercial considerations: providing multilingual applications for a growing market, specifically in the African context
In order to meet these challenges it is necessary to develop and update language resources not only on a case-by-case basis, but also systematically, in a coordinated manner, over as long a period as possible.
This is what we are attempting to do in the South African context.

Thank you for listening.