Laura A. Janda UiT The Arctic University of Norway Francis M. Tyers


TWIRRLL Workshop: Targeting Word Forms In Research-based Russian Language Learning
Laura A. Janda, UiT The Arctic University of Norway
Francis M. Tyers, Higher School of Economics (Высшая школа экономики), Moscow

Overview
- Evidence for strategically focusing learning on key forms and constructions (instead of full paradigms)
- Presentation of the Learners’ Constructicon of Russian and the search functions of the Russian National Corpus
- Hands-on workshop in small groups
- Reporting our results and crowdsourcing the Constructicon

Evidence for strategically focusing learning on key forms and constructions (instead of full paradigms)
- Russian and the relationship between paradigm size and number of full paradigms for nouns
- There are 1-3 word forms that account for most of the frequency of any noun
- In aggregate, partially overlapping subsets of forms populate the space of Russian nouns, verbs, and adjectives: a computational experiment comparing training on full paradigms vs. single forms
- Memorizing full paradigms for all words is like overstuffing your suitcase

Zipf’s Law
- The frequency of a word is inversely proportional to its frequency rank
- Zipf’s Law scales up infinitely
- 50% or more of all unique words are hapaxes (words that occur only once in the corpus)
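The two claims above (frequency is roughly C / rank, and hapaxes make up a large share of the vocabulary) can be checked on any token list. The following is a minimal sketch, assuming a plain list of tokens as input; the function names and the toy sentence are hypothetical and do not come from the talk.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first.

    Under Zipf's Law, count is roughly C / rank for some constant C.
    """
    ranked = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def hapax_share(tokens):
    """Share of unique words that occur exactly once (hapaxes)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Toy token list (hypothetical, not taken from any corpus in the talk).
toy_tokens = "the cat saw the dog and the bird saw a cat".split()
top = rank_frequency(toy_tokens)[0]
share = hapax_share(toy_tokens)
```

On a real corpus, plotting count against rank on log-log axes should give an approximately straight line if the distribution is Zipfian.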

Zipf’s Law applies to word forms too

| Language & Corpus Name | Corpus Size | Paradigm Size | Total Lexemes | Lexemes with full Paradigm | % Lexemes with full Paradigm |
|---|---|---|---|---|---|
| English Web Treebank | 254,830 | 2 | 6,369 | 1,524 | 23.92% |
| Norwegian Dependency Treebank | 311,277 | 4 | 12,587 | 393 | 3.12% |
| Russian SynTagRus | 1,032,644 | 12 | 21,945 | 13 | 0.06% |
| Czech Prague Dependency Treebank | 1,509,242 | 14 | 17,904 | 3 | 0.02% |
| Estonian ArborEst | 234,351 | 28 | 14,075 | 0 | 0% |

Because Zipf’s Law scales up, these numbers will never change substantially, no matter how large the corpus is
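A count like "lexemes with full paradigm" can be sketched as follows. This is only an illustration under stated assumptions: the input is a list of (lemma, paradigm slot) pairs, one per attested word form, and a paradigm is "full" when every slot is attested at least once. How the talk actually defined slots and handled syncretic forms is not stated, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def full_paradigm_share(observations, paradigm_size):
    """Count lexemes whose attested slots cover the whole paradigm.

    observations: iterable of (lemma, slot) pairs, e.g. ("cat", "Plur").
    Returns (total_lexemes, lexemes_with_full_paradigm, percent).
    """
    slots_by_lemma = defaultdict(set)
    for lemma, slot in observations:
        slots_by_lemma[lemma].add(slot)
    total = len(slots_by_lemma)
    full = sum(1 for slots in slots_by_lemma.values() if len(slots) >= paradigm_size)
    percent = 100 * full / total if total else 0.0
    return total, full, percent


# Toy data (hypothetical): English nouns with a two-slot paradigm.
toy_observations = [
    ("cat", "Sing"), ("cat", "Plur"), ("cat", "Sing"),
    ("dog", "Sing"),
    ("fish", "Sing"),
]
total, full, percent = full_paradigm_share(toy_observations, paradigm_size=2)
```

Here only "cat" is attested in both slots, so one lexeme out of three has a full paradigm.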

High-frequency Russian Nouns

| | ‘fear’ | ‘soldier’ | ‘department’ | ‘concept’ | ‘memory’ |
|---|---|---|---|---|---|
| Nsg | страх | солдат | отделение | концепция | память |
| Gsg | страха | солдата | отделения | концепции | памяти |
| Dsg | страху | солдату | отделению | | |
| Asg | | | | концепцию | |
| Isg | страхом | солдатом | отделением | концепцией | памятью |
| Lsg | страхе | | отделении | | |
| Npl | страхи | солдаты | | | |
| Gpl | страхов | | отделений | концепций | |
| Dpl | | солдатам | | | |
| Apl | | | | | |
| Ipl | страхами | | отделениями | концепциями | |
| Lpl | страхах | солдатах | отделениях | | |

Key: bold >20%, plain >10%, grey 1-9%, (blank) unattested

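More High-Frequency Russian Nouns

| | ‘background’ | ‘champion’ | ‘extent’ | ‘frame’ | ‘difficulty’ |
|---|---|---|---|---|---|
| Nsg | фон | чемпион | | | трудность |
| Gsg | фона | чемпиона | | | трудности |
| Dsg | | чемпиону | | | |
| Asg | | чемпиона | | | |
| Isg | | чемпионом | | | трудностью |
| Lsg | фоне | | протяжении | | |
| Npl | | чемпионы | | рамки | |
| Gpl | | чемпионов | | рамок | трудностей |
| Dpl | | чемпионам | | | |
| Apl | | | | | |
| Ipl | | чемпионами | | рамками | трудностями |
| Lpl | | | | рамках | трудностях |

Key: bold >20%, plain >10%, grey 1-9%, (blank) unattested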

Masculine animates
- Typically a lexeme is found in only 1-3 wordforms
- The typical wordforms are motivated by constructions
- NomPl: аналитики отмечают ‘analysts make the point that’
- InsSg: стать/быть чемпионом ‘become/be the champion’

Computational experiment: nouns, verbs, adjectives
Based on an ordered list of the most frequent forms in SynTagRus
Machine learning:
- Given the 100 most frequent forms (the training data), predict the next 100 most frequent forms (the testing data)
- Given the 200 most frequent forms, predict the next 100 most frequent forms
- Given the 300 most frequent forms, predict the next 100 most frequent forms
- Given the 400 most frequent forms, predict the next 100 most frequent forms
- Given the 500 most frequent forms, predict the next 100 most frequent forms
- … until 5400, when SynTagRus runs out of data
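The growing train/test schedule described above can be sketched as a generator. This is a minimal illustration, assuming the forms are already sorted by descending frequency and assuming that 5400 is the largest training size (the slide does not say whether 5400 is the last training size or the end of the data). The function name and the stand-in list are hypothetical, and no learning model is included.

```python
def incremental_splits(ranked_forms, step=100, max_train=5400):
    """Yield (train, test) pairs over a frequency-ranked list of forms.

    train = the n most frequent forms, test = the next `step` forms,
    for n = step, 2 * step, ... up to max_train or until the data runs out.
    """
    for n in range(step, max_train + 1, step):
        test = ranked_forms[n:n + step]
        if not test:  # no forms left to predict
            break
        yield ranked_forms[:n], test


# Stand-in for a ranked list of 1000 forms (hypothetical data).
ranked = list(range(1000))
splits = list(incremental_splits(ranked))
```

With 1000 forms the schedule stops after the split that trains on 900 forms and tests on the last 100.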

Data for training and testing from SynTagRus
Frequency, form (lemma, POS): parse of form
1447 может (мочь, VERB): Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act
1286 года (год, NOUN): Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing
999 лет: Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur
832 году: Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing
813 время (время): Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing
678 россии (россия): Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
571 могут: Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act
571 люди (человек): Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur
543 россии: Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing
436 является (являться)
416 случае (случай)
411 людей: Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur
403 страны (страна)
400 жизни (жизнь)
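The parse strings above use the `Feature=Value|Feature=Value` layout, so they can be split into a dictionary for further processing. This is a small sketch; the helper names are hypothetical, and treating (Case, Number) as the paradigm slot of a noun is an assumption made here for illustration.

```python
def parse_feats(feats):
    """Turn 'Case=Gen|Number=Sing' into {'Case': 'Gen', 'Number': 'Sing'}."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def noun_slot(feats):
    """Return a (Case, Number) slot label for a noun parse (assumed slot definition)."""
    parsed = parse_feats(feats)
    return parsed.get("Case"), parsed.get("Number")


# Parse string copied from the row for "года" above.
goda = parse_feats("Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing")
slot = noun_slot("Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing")
```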

So the model that gets the most input should be the most successful, right? Maybe not…

1800-5400: Single forms model outperforms full paradigms

Excess data is probably overpopulating the search domain

After 11 iterations, the errors committed by the single forms model are consistently smaller

What this means
- A given word typically appears in only a handful of forms
- Those word forms are motivated by the constructions and collocations most typical for the word
- Learning is potentially enhanced by focusing only on the most typical word forms attested for given lexemes: accuracy increases and severity of errors decreases

So how can we escape from this overstuffed suitcase?
- Textbooks have always focused on certain forms and constructions
- Now we can do this in a scientific, consistent way

- Find the 1-3 most common forms of the high-frequency words students need to know
- Find the grammatical constructions that motivate those 1-3 word forms

1-3 most common forms of high-frequency words
- We’ve already made some samples for you
- Each handout lists 9 high-frequency words (≥50 occurrences in SynTagRus)
- For each word, the list shows the 3 most frequent forms
- Please form pairs or small groups; each group can use 1 of 20 lists
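A handout of this kind could be produced from a frequency list along the following lines. This is a sketch only: it assumes rows of (frequency, form, lemma), assumes the ≥50 threshold applies to the summed frequency of a lemma, and uses placeholder data; it is not the procedure the authors used to build their lists.

```python
from collections import defaultdict


def build_handout(rows, min_lemma_freq=50, top_n=3):
    """Map each sufficiently frequent lemma to its top_n forms.

    rows: iterable of (frequency, form, lemma).
    Returns {lemma: [(form, frequency), ...]} with forms sorted by frequency.
    """
    forms_by_lemma = defaultdict(lambda: defaultdict(int))
    for freq, form, lemma in rows:
        forms_by_lemma[lemma][form] += freq
    handout = {}
    for lemma, forms in forms_by_lemma.items():
        if sum(forms.values()) >= min_lemma_freq:
            ranked = sorted(forms.items(), key=lambda item: item[1], reverse=True)
            handout[lemma] = ranked[:top_n]
    return handout


# Placeholder rows (hypothetical, not corpus data).
toy_rows = [
    (30, "form_a", "lemma1"), (15, "form_b", "lemma1"),
    (10, "form_c", "lemma1"), (5, "form_d", "lemma1"),
    (20, "form_x", "lemma2"),
]
handout = build_handout(toy_rows)
```

In the toy data, "lemma1" totals 60 occurrences and is kept with its three most frequent forms, while "lemma2" falls below the threshold.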

Find the grammatical constructions that motivate those 1-3 word forms
- Use the Russian National Corpus: http://ruscorpora.ru/
- Suggest entries for the Learner’s Constructicon for Russian: https://spraakbanken.gu.se/karp/#?mode=konstruktikon-rus
- Let’s try a demo first

слово ‘word’ appears 814 times in SynTagRus
Can you guess what its most frequent form is?
- словам (280 = 34.4%)
- слова (212 = 26%)
- слово (118 = 14.5%)
We can also try связи…
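The percentages above follow directly from the counts. As a quick check, using only the numbers given on the slide (the function name is hypothetical):

```python
def form_shares(form_counts, lemma_total):
    """Return (form, count, percent of lemma_total), most frequent form first."""
    ranked = sorted(form_counts.items(), key=lambda item: item[1], reverse=True)
    return [(form, count, round(100 * count / lemma_total, 1)) for form, count in ranked]


# Counts for the lemma "слово" as given above; 814 is its total in SynTagRus.
slovo_counts = {"словам": 280, "слова": 212, "слово": 118}
shares = form_shares(slovo_counts, 814)

# Occurrences of "слово" in all other forms combined.
remaining = 814 - sum(slovo_counts.values())
```

The three listed forms together account for 610 of the 814 occurrences, roughly three quarters, leaving 204 occurrences for every other form of the word.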

An Entry in the Constructicon:
- NAME renders the construction both schematically and with a brief example
- DEFINITION explains the meaning of the construction (all definitions will also be translated into English)
- STRUCTURE provides a dependency grammar analysis of both the schematic and brief example renderings of the construction
- At least three corpus EXAMPLES illustrate the construction
- CEFR is the Common European Framework of Reference for Languages level to guide learners and instructors
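When drafting entries in a group, it can help to hold the five fields in one record and check a draft before submitting it. The following is a hypothetical sketch: the class, its field names, and the checks mirror the slide's description, not the actual schema or API of the Constructicon database, and the draft content consists of placeholders.

```python
from dataclasses import dataclass, field
from typing import List

# The six CEFR levels, from beginner to proficient.
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass
class ConstructiconEntry:
    """Draft entry with the five fields named on the slide (assumed layout)."""
    name: str
    definition: str
    structure: str
    examples: List[str] = field(default_factory=list)
    cefr: str = ""

    def problems(self):
        """List what is still missing before the draft is complete."""
        issues = []
        for attr in ("name", "definition", "structure"):
            if not getattr(self, attr).strip():
                issues.append(f"{attr} is empty")
        if len(self.examples) < 3:
            issues.append("fewer than three corpus examples")
        if self.cefr not in CEFR_LEVELS:
            issues.append("CEFR level missing or unrecognized")
        return issues


# Placeholder draft (hypothetical content, to be replaced with real corpus data).
draft = ConstructiconEntry(
    name="<schematic form + brief example>",
    definition="<meaning of the construction>",
    structure="<dependency analysis>",
    examples=["<corpus example 1>", "<corpus example 2>"],
)
issues = draft.problems()
```

The draft above is flagged for having only two examples and no CEFR level.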

Try out your lists!

What can we do?
- Use corpora to find the word forms that are most strategic for our students
- Crowdsource the Constructicon for Russian
- Build learning materials that focus on the typical word forms, avoiding unlikely word forms
Thank you!