Language Technologies What endangered languages can teach us

Slides:



Advertisements
Similar presentations
GMD German National Research Center for Information Technology Darmstadt University of Technology Perspectives and Priorities for Digital Libraries Research.
Advertisements

Computational Paradigms in the Humanities – eHumanities and their role and impact in transdisciplinary research Gerhard Budin University of Vienna.
INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING NLP-AI IIIT-Hyderabad CIIL, Mysore ICON DECEMBER, 2003.
January 12, Statistical NLP: Lecture 2 Introduction to Statistical NLP.
Natural Language and Speech Processing Creation of computational models of the understanding and the generation of natural language. Different fields coming.
Linguisitics Levels of description. Speech and language Language as communication Speech vs. text –Speech primary –Text is derived –Text is not “written.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Introduction I. Some interesting facts about language
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
What Linguists Want (we think) Helen Aristar Dry & Anthony Aristar LINGUIST List & E-MELD.
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora 12th International Conference on Web Information System Engineering.
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Sharing linguistic multi-media resources Jacquelijn Ringersma Paul Trilsbeek Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands.
Search Engines and Information Retrieval Chapter 1.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Lecture 12: 22/6/1435 Natural language processing Lecturer/ Kawther Abas 363CS – Artificial Intelligence.
Computational Investigation of Palestinian Arabic Dialects
Historical linguistics Historical linguistics (also called diachronic linguistics) is the study of language change. Diachronic: The study of linguistic.
Experiments on Building Language Resources for Multi-Modal Dialogue Systems Goals identification of a methodology for adapting linguistic resources for.
ICS-FORTH January 11, Thesaurus Mapping Martin Doerr Foundation for Research and Technology - Hellas Institute of Computer Science Bath, UK, January.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Linguistics The first week. Chapter 1 Introduction 1.1 Linguistics.
Acoustic Properties of Taiwanese High School Students ’ Stress in English Intonation Advisor: Dr. Raung-Fu Chung Student: Hong-Yao Chen.
THE NATURE OF TEXTS English Language Yo. Lets Refresh So we tend to get caught up in the themes on English Language that we need to remember our basic.
Levels of Language 6 Levels of Language. Levels of Language Aspect of language are often referred to as 'language levels'. To look carefully at language.
1 Determining query types by analysing intonation.
NLP ? Natural Language is one of fundamental aspects of human behaviors. One of the final aim of human-computer communication. Provide easy interaction.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
What you have learned and how you can use it : Grammars and Lexicons Parts I-III.
CSA2050 Introduction to Computational Linguistics Lecture 1 What is Computational Linguistics?
Introduction to Linguistics Class # 1. What is Linguistics? Linguistics is NOT: Linguistics is NOT:  learning to speak many languages  evaluating different.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Introduction A field survey of Dutch language resources has been carried out within the framework of a project launched by the Dutch Language Union (Nederlandse.
INTRODUCTION TO APPLIED LINGUISTICS
King Faisal University جامعة الملك فيصل Deanship of E-Learning and Distance Education عمادة التعلم الإلكتروني والتعليم عن بعد [ ] 1 جامعة الملك فيصل عمادة.
Text Linguistics. Definition of linguistics Linguistics can be defined as the scientific or systematic study of language. It is a science in the sense.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Dafydd Gibbon U Bielefeld ELKL-5, Ranchi, Jharkhand, India,
Information Retrieval in Practice
Dafydd Gibbon Universität Bielefeld Germany
Language Identification and Part-of-Speech Tagging
Automatic Writing Evaluation
Measuring Monolinguality
Sentiment analysis algorithms and applications: A survey
CORPUS LINGUISTICS Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. An approach to derive at a set of.
Digital Technologies: Curriculum Connections F to 6
Text-To-Speech System for English
Multimedia Content-Based Retrieval
22. Form-Focused Instruction
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Computational and Statistical Methods for Corpus Analysis: Overview
Natural Language Processing (NLP)
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
A Level English Language Research & Analysis Scrapbook…
Social Knowledge Mining
Statistical NLP: Lecture 9
Studying Spoken Language Text 17, 18 and 19
Text Mining & Natural Language Processing
Using GOLD to Tracking L2 Development
Applied Linguistics Chapter Four: Corpus Linguistics
Natural Language Processing (NLP)
Artificial Intelligence 2004 Speech & Natural Language Processing
Information Retrieval
Introduction to Search Engines
Extracting Why Text Segment from Web Based on Grammar-gram
Statistical NLP : Lecture 9 Word Sense Disambiguation
Natural Language Processing (NLP)
Presentation transcript:

Language Technologies What endangered languages can teach us Dafydd Gibbon U Bielefeld ELKL-4, Agra, 2016-02-

Focus: role reversal Not just: What can the Human Language Technologies offer to endangered (and other) languages? But: What can Human Language Engineering learn from endangered languages? Or: What do endangered languages require from the Human Language Technologies? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Applying technologies – a reminder Requirements specification: systems analysis A traditional ‘waterfall model’ of software development Design: Modules, algorithms, data structures Implementation: Programming methods, styles Evaluation: Black box, glass box, field testing Product dissemination: Reliability, sustainability, interoperability, updating ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Applying technologies – a reminder Requirements specification: systems analysis Feedback and revision cycle Design: Modules, algorithms, data structures Implementation: Programming methods, styles Evaluation: Black box, glass box, field testing Product dissemination: Reliability, sustainability, interoperability, updating ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

What can endangered languages teach lang tech? Summary Reversed roles: What can endangered languages teach lang tech? Background to Language Endangerment Technology roles: Documentary technologies Enabling technologies Productivity technologies Annotation and annotation mining How do languages differ in speech rhythm? How similar / different are languages? Language differences/distances among languages of Ivory Coast Endangered languages: What do science and technology (human knowledge in general) lose when languages die? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Language Technologies Engineering: speech: ASR, TTS, speaker id / recognition, ... language: NLP, NL parsing, Q&A, text mining, text classification, lexicon and grammar induction, machine translation … multimodal: speech I/0 (dictation, process control, speech computer UI), speech avatars (Siri, Cortana), gesture (touchpad, waving), biometric systems Computational linguistics & Computer Science: domain models of natural language syntax, semantics, pragmatics, language typology and genesis formalisms and algorithms for induction, parsing, generation of language corpus analysis for lexicon and grammar induction ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Language Technologies Engineering: speech: ASR, TTS, speaker id / recognition, ... language: NLP, NL parsing, Q&A, text mining, text classification, lexicon and grammar induction, machine translation … multimodal: speech I/0 (dictation, process control, speech computer UI), speech avatars (Siri, Cortana), gesture (touchpad, waving), biometric systems Computational linguistics & Computer Science: domain models of natural language syntax, semantics, pragmatics, language typology and genesis formalisms and algorithms for induction, parsing, generation of language corpus analysis for lexicon and grammar induction OVERLAP (also with linguistics) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

The broader interdisciplinary context ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Doc Ling and Digital Humanities Timeline Humanities: Theology, Philology, Literary Studies, Linguistics, Law, ... 1950 Computation Computational Linguistics Speech Technologies Mathematical Computational Linguistics NLP, Computational Corpus Linguistics Humanities Computing Documentary Linguistics Digital Humanities ? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Doc Ling and Digital Humanities Timeline Humanities: Theology, Philology, Literary Studies, Linguistics, Law, ... 1950 Computation Computational Linguistics Speech Technologies Mathematical Computational Linguistics NLP, Computational Corpus Linguistics Humanities Computing SGML TEI (HTML) XML Documentary Linguistics Digital Humanities ? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Roles for computational technologies Documentation technologies Enabling technologies Productivity technologies ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

1. Documentation Technologies Project planning tools Data collection tools scenario support for elicitation recording (multimodal) metadata collection document scanning Data archiving and access database and search model relational, object-oriented, ... Multilinear annotation for search, re-use analysis, application: sharable (sustainable, interoperable) standards annotation categories for phonetics, grammar, discourse, … semi-automatic annotation methods ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

2. Enabling Technologies Resource construction tools phonetic analysis lexicon induction from data word lists word frequency lists (and other word statistics) concordances collocations grammar induction from data Part of Speech (POS) tagging grammar induction parsing and generation translation multilingual dictionaries terminologies processing of parallel or comparable texts translator’s workbench ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

3. Productivity Technologies Recognition Automatic Speech Recognition (ASR) Visual scene and object recognition Information retrieval from text Identification Speaker identification Language identfication Authorship attribution Generation Text-to-Speech Synthesis (TTS) Written text geneneration from databases Products Dictation and information applications Translation applications ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Endangered languages as teachers: some topics ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Endangered languages as teachers: some topics Insights into and results concerning diversity vs. normalization of speech and language the intricacy, complexity of speech, language and languages similarities and differences between languages (typology, history, dispersion of language) scenario-dependent properties of languages gender, age, social role task orientation public vs. informal vs. intimate styles diversity of expression of emotion political status of languages wrt dominance, minorities ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Endangered languages as teachers: some outlets Speech Assessment Methodologies (EC Project) EAGLES: Expert Advisory Groups for Language Engineering Standards Many other resources oriented European Projects, including MATE IMDI ... LREC – Language Resources and Evaluation Conference Language Resources Map https://en.wikipedia.org/wiki/LRE_Map Krauwer’s BLARK: Basic Language Resource Kit ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Endangered languages as teachers: some models Steven Krauwer’s BLARK: Goal of equal status of European languages Generalisable to the world at large? Basic Language Resource Kit initial specification written language corpora spoken language corpora mono- and bilingual dictionaries terminology collections grammars modules (e.g. taggers, morphological analysers, parsers, speech recognisers, text-to-speech) annotation standards and tools corpus exploration and exploitation tools bilingual corpora etc ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Some models: Basic Language Resource Kit (BLARK) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Some models: Basic Language Resource Kit (BLARK) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

concurrent data and parallel processing: phonological tone Many more topics … A closer look at language typology reveals many more requirements for the technogies, for example: concurrent data and parallel processing: phonological tone morphological tone vs. pitch accent vs. stress; prosodic negation grammatical stress/accent, focus, intonation, word order gesture, sign language, discourse participants language similarity for priority setting in language selection for funding (usually: chance) and for system adaptation: role of language typology: similarity/difference in speech sound systems similarity/difference in grammar similarity/difference in the lexicon similarity/difference in discourse conventions similarity/difference in general cultural conventions ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Comparison of DocLing and LangTech scenarios The documentary linguistic scenarios are: rather individual, extremely heterogeneous rather hard to define and delimit de facto standards: Praat, ELAN, Wordsmith, TypeCraft,... somewhat ad hoc - ‘what you can get’ Language, speech, multimodal technology scenarios are: highly standardised, rather coherent tendentially easy to define and delimit very application / product oriented especially in speech technology: highly product specific text technology is more generic regulated standards: statistical evaluation procedures institutional standards (e.g. ISO) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Relations between the technologies: a simplified ‘workflow’ ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Reversed roles: what endangered languages offer to engineering Endangered (and other) language scenarios ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Reversed roles: what endangered languages offer to engineering Endangered (and other) language scenarios Interview, corpus, experimental phonetic and psycholinguistic methods Formal and computational models ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Reversed roles: what endangered languages offer to engineering Lexicons Endangered (and other) language scenarios Grammars Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Interview, corpus, experimental phonetic and psycholinguistic methods Text grammars Formal and computational models Discourse models ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Reversed roles: what endangered languages offer to engineering Lexicons Endangered (and other) language scenarios Grammars Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Repositories MPI (DoBeS) SOAS (HRELP) PARADISEC LDC ELRA/ELDA … … ... Interview, corpus, experimental phonetic and psycholinguistic methods Text grammars Formal and computational models Discourse models ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Reversed roles: what endangered languages offer to engineering Lexicons Endangered (and other) language scenarios Applications Literature Media Teaching HLT Grammars Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Repositories MPI (DoBeS) SOAS (HRELP) PARADISEC LDC ELRA/ELDA Interview, corpus, experimental phonetic and psycholinguistic methods Text grammars Formal and computational models Discourse models ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

What I will be looking at – a modest selection Endangered (and other) language scenarios Teaching: diagnosis HLT: TTS Prosodic Analysis Tone Intonation Timing Rhythm Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Interview, corpus, experimental phonetic and psycholinguistic methods Formal and computational models ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

What I will be looking at – a modest selection Endangered (and other) language scenarios Teaching: diagnosis HLT: TTS Prosodic Analysis Tone Intonation Timing Rhythm Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Interview, corpus, experimental phonetic and psycholinguistic methods Formal and computational models ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Background to Language Endangerment: A Global Perspective ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Ethnologue metadata repository of all known languages ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Background: Global Perspective on language endangerment Globalization travel, trade, economy, politics → language dominance Trade languages, pidgins: Greek → Latin → Arabic → Portuguese → Spanish → French → English → religion, culture crises Example: climate deterioration (Africa, West Asia) → lack of water → famine → poverty → hunger → disease → migration → diaspora “Clash of Civilizations” (Samuel Huntington) → conflicts: both local and global The role of language stereotypes, discrimination? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Background: Global Perspective on language endangerment Globalization travel, trade, economy, politics → language dominance Trade languages, pidgins: Greek → Latin → Arabic → Portuguese → Spanish → French → English → religion, culture crises Example: climate deterioration (Africa, West Asia) → lack of water → famine → poverty → hunger → disease → migration “Clash of Civilizations” (Samuel Huntington) → conflicts: both local and global The role of language stereotypes, discrimination? The ‘explanation’ for language endangerment as LOSS OF INTERGENERATIONAL TRANSMISSON OF LANGUAGES, CULTURE, RELIGION, SKILLS, … is too simple. It begs the question: Why is there loss of intergenerational transmission? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Clash of Civilizations? World religions How far is technology relevant? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Clash of Civilizations? World language families How far can technology go? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

ISO 639-3 Language Code Standard – 5 language types Living languages: A language is listed as living when there are people still living who learned it as a first language. Based on intelligibility not politics. Extinct languages: no longer living, gone extinct in recent times. (e.g. in the last few centuries), distinct languages identified based on intelligibility (as defined for individual languages). Ancient languages: If it went extinct in ancient times (> 1000 years), must have had a distinct literature and be treated distinctly by the scholarly community, must have an attested literature or be well-documented as a language known to have been spoken by some particular community at some point in history, may not be areconstructed language inferred from historical-comparative analysis. Historic languages: Must be distinct from any modern languages that are descended from it; for instance, Old English and Middle English. Here, too, the criterion is that the language have a literature that is treated distinctly by the scholarly community. Constructed languages: Must have a literature and it must be designed for the purpose of human communication. Specifically excluded are reconstructed languages and computer programming languages. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

UNESCO chart of endangered languages ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

UNESCO map of endangered languages ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Endangered languages - ‘in danger of disappearing’ UNESCO The language birth and death cycle Why and how do new languages appear / disappear? Why ‘danger’? The economics of language endangerment The politics of language endangerment Social consequences of language endangerment What can the Human Language Technologies do? Documentation Technologies? Enabling Technologies? Productivity Technologies? Which languages can be handled by HLT? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Enabling Technologies Annotation and Annotation Mining Language Similarity Analysis ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Enabling technologies Annotation (preferably automatic) associating text labels and time-stamps with speech recordings Annotation mining (preferably automatic) information extraction from annotations: text label list, text label frequencies, text label duration statistics visualisation of text label duration patterns, rhythm patterns Similarity analysis – virtual distance mapping Which languages have been (almost) documented? Which languages are likely to respond to adaptive techniques for modelling a language starting with a known model for another language. Similarity feature definition Geographic (areal contact) Typological (similarity of paradigmatic and syntagmatic structures) Genealogical (history of language families) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Enabling technologies Annotation and Annotation Mining ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

in order to make data systematically searchable Why annotation? And how? The primary reason for the annotation of text and speech data is to enable efficient search procedures to provide structure in order to make data systematically searchable by assigning perceptual/hermeneutic categories to data for the purposes of finding archived media linguistic and phonetic analysis development of speech and language systems Searching unstructured data is difficult but improving with the help of machine learning example: search on free text data (e.g. Google search as a machine for on-the-fly concordance construction from web data) example: Google’s image search ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

in order to make data systematically searchable Why annotation? And how? ALL THESE PROCEDURES ARE VERY PRECISELY DEFINABLE AND THEREFORE CAN BE (AND ARE OFTEN) AUTOMATISED WITH APPROPRIATE SOFTWARE (POSTAGGERS, CONCORDANCERS, AUTOMATIC SPEECH ANNOTATORS, ...) THIS IS STANDARD PRACTICE IN THE HUMAN LANGUAGE TECHNOLOGIES (HLT) BUT NOT (YET) IN DOCUMENTARY LINGUISTICS WHEN THIS HAPPENS, THE EFFICIENCY OF MANY ASPECTS OF DL WILL BE REVOLUTIONLSED The primary reason for the annotation of text and speech data is to enable efficient search procedures to provide structure in order to make data systematically searchable by assigning perceptual/hermeneutic categories to data for the purposes of finding archived media linguistic and phonetic analysis development of speech and language systems Searching unstructured data is difficult but improving with the help of machine learning example: search on free text data (e.g. Google search as a machine for on-the-fly concordance construction from web data) example: Google’s image search ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Speech annotation example: graphical display with Praat (Niger-Congo>Gur>Tem, ISO 639-3 kdh) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation mining: the syllable timing space A standard meme in speech timing descriptions is that the spoken forms of languages fall into 3 categories: stress (or foot) timing syllable timing mora timing What kind of speech rhythm – 1/1, 2/4, 3/4, 4/4 ... ? Embarrassing problem: physical correlates are elusive Measures of relative regularity or irregularity of timing relative to the syllable or the foot - global numbers, no detailed information Example: normalised Pairwise Variability Index (nPVI): average duration differences between neighbouring items English: nPVI = 70 (mean=186ms, SD= 127ms, coeffvar=69%) Mandarin: nPVI = 40 (mean=174ms, SD= 59ms, coeffvar=34%) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Automatic Annotation Mining: looking at the details (global descriptive statistics for British English recordings) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Automatic Annotation Mining: looking at the details (KWIC concordance of text labels in context, with statistics) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Duration relations between neighbouring syllables: Automatic Annotation Mining: looking at the details (Wagner Quadrant Graphs for Mandarin, Tem and English) Duration relations between neighbouring syllables: durationn x durationn+1 note the distribution: regular? random? clustering? position? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Numbers of duration maxima (peaks) at different item distances: Automatic Annotation Mining: looking at the details (rhythm spectrum graphs for Mandarin, Tem and English) Numbers of duration maxima (peaks) at different item distances: x = peak count y = peak distance ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Peak sequences at distances 1,...,3 Automatic Annotation Mining: looking at the details (stylised duration patterns – the reality of rhythm) Mandarin Peak sequences at distances 1,...,3 Tem ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Peak sequences at distances 1,...,3 Automatic Annotation Mining: looking at the details (stylised duration patterns – the reality of rhythm) Mandarin Peak sequences at distances 1,...,3 English ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Automatic Annotation mining: tone in Mandarin and Tem (difference for Tone 5, neutral tone) Tem (no difference) Significance: natural TTS, e.g. reading functionality for the blind, voice information services (e.g. Apple Siri, Microsoft Cortana, Google Now) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Automatic Annotation mining: tone/stress in Mandarin, English (difference for Tone 5, neutral tone) English (gradient difference tendency) Significance: natural TTS, e.g. reading functionality for the blind, voice information services (e.g. Apple Siri, Microsoft Cortana, Google Now) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Time Group Analysis TGA online annotation mining tool http://wwwhomes.uni-bielefeld.de/gibbon/TGA/ http://localhost/TGA/ ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation Mining: rhythm visualisation TGA online tool: visualisation of syllable time relations http://wwwhomes.uni-bielefeld.de/gibbon/TGA/ Duration difference tokens: Keyboard friendly transcription: 2D visualisation of durations: Syllable durations: Duration difference tokens for utterance #37: pos, neg, equal differences between neighbours: /, \, = difference threshold: 40ms. clear indication of syllable isochrony ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation Mining: rhythm visualisation Compare with English pattern ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Enabling Technologies Modelling tones: Niger-Congo > Gur > Tem (Togo) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation Mining: tone automaton induction for TTS Resources: Zakari Tchagbale: “Exercices et corrigés” (1984) Dafydd Gibbon: finite state model of tone sandhi (1987) Zakari Tchagbale: recording (2012) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation Mining: tone automaton induction for TTS Startup effect Automatic downstep Automatic upstep ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation Mining: tone automaton induction for TTS ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Annotation Mining: tone automaton induction for TTS Tone co-occurrences ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Case 2: Allotones: Tem (Gur; ISO 639-2 kth) resources start H; #h; high startup L; #l; low startup H; h; upsweep L; l; downdrift L; ^l; upstep H semi- terrace L semi- terrace H; !h; downstep H; h#; >baseline end L; l#; baseline end ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Case 2: Allotones: Tem (Gur; ISO 639-2 kth) resources start H; #h; high startup L; #l; low startup H; h; upsweep L; l; downdrift L; ^l; upstep H semi- terrace L semi- terrace H; !h; downstep H; h#; >baseline end L; l#; baseline end These contexts are sufficient for formalising phonological rules. More contexts are needed for natural phonetic detail! Allotone pitch modification #h, #l and h#, l# h and l terrace cycles ^l and !h terrace transitions ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Enabling technologies Language similarity and difference as virtual distance Towards the use of Machine Learning ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Enabling technologies: prioritisation Importance of determining language similarities and dissimilarities: re-use of existing materials in different communities beware of problems (e.g. Ibibio / Efik) adaptation of existing productivity technologies prioritisation of documentation funding ... ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

How similar are languages? - A little similar, but not very similar ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Selecting features for similarity distances Lexical Swadesh word list West African Lexical Dictionary Set (WALDS) … Phonetic / phonological vowel set comparison consonant set comparison (more stable – used for many traditional initial classifications, e.g. of Indo-European languages) Grammatical features World Atlas of Language Structures (WALS) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Kru languages – Ivory Coast: Method – consonant systems ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Kru languages – Ivory Coast: Method – consonant systems ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Kru languages – Ivory Coast: Feature – consonant systems Method: compare all consonant systems pairwise: Levenshtein Edit Distance / Hamming Distance visualise differences as ‘virtual distances’ in a chart ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Kru languages – Ivory Coast: Feature – consonant systems ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Kru languages – Ivory Coast: Feature – consonant systems ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Which features are most useful Which features are most useful? - Help from Machine Learning (Decision Tree Induction) ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Conclusion What do we lose if we lose languages? ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

What do we lose if we lose languages? Why is this an issue? What danger?! Imagine what we lose if a language disappears... Why not just use one language worldwide? Remember what language is for! Imagine what we lose if a language disappears: Structure: understanding of the architecture of the human mind – each language has a unique combination of cognitive patterns Language semantics: a language is a repository of knowledge about the world – skills, health Pragmatics: narratives of history and identity of a language community, law, literature (with music and art), religion, ... Lack of interregional communication means that the shared language will develop local differences anyway ... It’s boring ... ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?

Language is complex: Ranks, Interpretations, Languages, Context Syntagmatic properties Grammar – compositionality SEMANTICS/PRAGMATICS CONCEPTS, OBJECTS, EVENTS structural opacity DIALOGUE TEXT SENTENCE CLAUSE PROSODIC HIERARCHIES structural opacity PHRASE Paradigmatic properties Hypostatic properties in different modalities: speech text gesture COMPOUND WORD DERIVED WORD LEXICAL ROOT MORPHEME (MORPHO)PHONEME ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? LEXICON – partial regularity, holistic opacity

Diolch yn fawr! ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies?