Language use as a window to understand L1 differences in L2 writing


Language use as a window to understand L1 differences in L2 writing Liberato Santos, Roz Hirch, and Sowmya Vajjala Iowa State University

Native Language Identification (NLI)
Goal: Identify and classify L1s based on L2 writing
Assumption: The L1 plays an active role in L2 acquisition and production
Where is NLI useful? Customized ESL instruction for learners with specific L1s; stylistic studies; forensic linguistics
Existing work: Systematic lexical and phrasal choices by L1 groups (Kyle, Crossley, & Kim, 2015); the role of n-grams/lexical bundles in L1 identification (Jarvis & Paquot, 2012); L2 learners’ idiosyncratic use of lexical bundles (Paquot, 2013)
Notes: How many times have you heard what someone said, or read what someone wrote, and thought, "Oh, I think I know where this person is from, based on the words they use"? When it comes to Non-Native English Speakers (NNES), some of us have an intuition that they choose certain words when they speak or write because of their L1, or that their L1 influences how they speak and write. This research might shed some light on these intuitions. L1 identification: present it briefly (add references)

Research Questions
RQ #1: Are there latent groupings of L1s based on how frequently they use the same L2 trigrams?
RQ #2: Do these L1 groupings occur similarly across different L2 proficiency levels?

Corpus: TOEFL 11 (Blanchard et al., 2013)
ESL student writing from 11 language (L1) groups: ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR
Training data: 11,000 txt files (3.5 million word tokens, 55k word types, 1k files per L1)
Test data: 1,100 txt files (345k word tokens, 14k word types, 100 files per L1)

Language Feature: word trigrams from student essays
Notes: Trigrams: what they are, what they look like, why trigrams and not four-grams
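To make "word trigram" concrete, here is a minimal Python sketch that counts the three-word sequences in a piece of text. The tokenizer (lowercasing, keeping only letters and apostrophes) and the example sentence are assumptions made for illustration. The slides do not describe how the essays were actually tokenized.

```python
import re
from collections import Counter


def word_trigrams(text):
    """Return every run of three consecutive words as a space-joined string."""
    # Simplified tokenizer (an assumption): lowercase, letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


# Hypothetical example sentence, not taken from the corpus.
essay = "I agree with the statement because I agree with the idea."
trigram_counts = Counter(word_trigrams(essay))

for trigram, count in trigram_counts.most_common(3):
    print(count, trigram)
```

A text with fewer than three words yields no trigrams, so very short inputs simply contribute nothing to the counts.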


Pre-processing
Language feature: Word trigrams we considered language-specific (and not prompt-specific)
Full trigram list: all trigrams + frequencies from the entire corpus (3.5M tokens)
Generated 101 prompt-specific trigrams and removed them from the full list
Total trigrams: 521
Normalization: per 100K
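The two pre-processing steps, removing prompt-specific trigrams and normalizing frequencies, can be sketched in Python as follows. This is only an illustration under stated assumptions: "per 100K" is read as "per 100,000 word tokens", and the counts and the prompt-specific set are invented. The slides do not say how the 101 prompt-specific trigrams were identified.

```python
from collections import Counter


def drop_prompt_specific(counts, prompt_trigrams):
    """Keep only trigrams that are not in the prompt-specific set."""
    return Counter({tri: n for tri, n in counts.items() if tri not in prompt_trigrams})


def normalize_per_100k(counts, total_tokens):
    """Convert raw counts to rates per 100,000 word tokens (assumed unit)."""
    return {tri: n * 100_000 / total_tokens for tri, n in counts.items()}


# Hypothetical raw counts and prompt-specific list, for illustration only.
raw_counts = Counter({"i agree with": 700, "a lot of": 350, "young people enjoy": 140})
prompt_specific = {"young people enjoy"}

kept = drop_prompt_specific(raw_counts, prompt_specific)
rates = normalize_per_100k(kept, total_tokens=3_500_000)
print(rates)
```

When comparing L1 groups, each group's counts would presumably be divided by that group's own token total, so that groups with longer essays do not look like heavier users of a trigram.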

Methods: Statistical Analysis
Statistics #1: Exploratory Factor Analysis (EFA)
Done on training data
Helps identify patterns in the data
L1 groupings: frequency of L2 trigram usage by L1 group
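A factor analysis typically starts from a correlation matrix among the observed variables. The Python sketch below builds that matrix under the assumption that each L1 is a variable and each trigram is an observation. The slides do not state this layout explicitly, and the rates and group labels are invented. The factor extraction itself would be done in a statistics package and is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical normalized rates: one list per L1 group, one position per trigram.
rates_by_l1 = {
    "group_a": [20.0, 10.0, 5.0, 8.0],
    "group_b": [22.0, 11.0, 6.0, 9.0],
    "group_c": [5.0, 12.0, 20.0, 3.0],
}

labels = sorted(rates_by_l1)
corr = {(a, b): pearson(rates_by_l1[a], rates_by_l1[b]) for a in labels for b in labels}

for a in labels:
    print(a, [round(corr[(a, b)], 2) for b in labels])
```

Groups whose trigram rates rise and fall together correlate highly, and it is this shared variation that a factor model would attribute to a common latent factor.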

Results: 2-factor EFA on training data

Methods: Statistical Analysis
Statistics #2: Confirmatory Factor Analysis (CFA)
Done on test data
Can the 2-factor EFA be confirmed by a CFA?

Results: Does CFA confirm EFA?
CFA model fit: RMR: .137; SRMR: .0448; CFI: .937; RMSEA: .102; Chi-square: 230.633, df = 40, p < .001
Model fit guidelines:
SRMR: values below .10 are indicative of good model fit
CFI: .95 = good fit, and .90 = marginal/acceptable fit
RMSEA: .06 and lower is considered good; between .06 and .08 is considered acceptable
Chi-square: a significant chi-square (p < .001) indicates that the hypothesized model differs from the observed data, which is not what we want

Results: Does CFA confirm EFA?
CFA model fit: RMR: .387; SRMR: .0723; CFI: .810; RMSEA: .101; Chi-square: 781.879, df = 203, p < .001
Model fit guidelines:
SRMR: values below .10 are indicative of good model fit
CFI: .95 = good fit, and .90 = marginal/acceptable fit
RMSEA: .06 and lower is considered good; between .06 and .08 is considered acceptable
Chi-square: a significant chi-square (p < .001) indicates that the hypothesized model differs from the observed data, which is not what we want
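As a quick check of the two sets of fit indices against the cutoffs quoted on the slides, the comparison can be written out in Python. The function name and the label "outside cutoff" are illustrative choices. The thresholds and the index values are the ones given on the two results slides.

```python
def judge_fit(srmr, cfi, rmsea):
    """Label each fit index using the cutoffs quoted on the results slides."""
    verdict = {}
    # SRMR: values below .10 indicate good fit.
    verdict["SRMR"] = "good" if srmr < 0.10 else "outside cutoff"
    # CFI: .95 = good fit, .90 = marginal/acceptable fit.
    if cfi >= 0.95:
        verdict["CFI"] = "good"
    elif cfi >= 0.90:
        verdict["CFI"] = "marginal/acceptable"
    else:
        verdict["CFI"] = "outside cutoff"
    # RMSEA: .06 and lower is good; between .06 and .08 is acceptable.
    if rmsea <= 0.06:
        verdict["RMSEA"] = "good"
    elif rmsea <= 0.08:
        verdict["RMSEA"] = "acceptable"
    else:
        verdict["RMSEA"] = "outside cutoff"
    return verdict


first_model = judge_fit(srmr=0.0448, cfi=0.937, rmsea=0.102)
second_model = judge_fit(srmr=0.0723, cfi=0.810, rmsea=0.101)
print(first_model)
print(second_model)
```

By these cutoffs the indices disagree with one another: SRMR is within range for both models, RMSEA is outside it for both, and CFI is marginal for the first model only.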

Conclusions so far
EFA & CFA: Moderate fit
Trigrams indicate 2 groups of L1s (overlap: CHI, JPN, KOR, ARA, TUR)
Groupings: (1) L1s; (2) L1s across proficiency levels
What other statistics can help analyze this dataset?

Hierarchical Agglomerative Cluster Analysis (HAC)
Exploratory method
Frequency measures: similarity matrix (Euclidean distance, Ward’s method)
Simple to understand: visual output (dendrogram, heatmap)
Problems: potentially susceptible to poor early combinations; small samples lack stability
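The following pure-Python sketch shows what agglomerative clustering with Ward's method does: at each step it merges the two clusters whose union adds the least to the within-cluster sum of squared Euclidean distances. The labels and frequency profiles are invented, and this naive version is meant for reading rather than for real data. It is not the authors' implementation. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be used.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_cost(a, b):
    """Ward's criterion: increase in within-cluster sum of squares if a and b merge."""
    (_, centroid_a, size_a), (_, centroid_b, size_b) = a, b
    return size_a * size_b / (size_a + size_b) * sq_dist(centroid_a, centroid_b)


def ward_hac(profiles):
    """Return the merge history as (members_a, members_b, cost) tuples."""
    # Each cluster is (member labels, centroid, size).
    clusters = [((label,), list(vec), 1) for label, vec in profiles.items()]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: merge_cost(*pair))
        (members_a, centroid_a, size_a), (members_b, centroid_b, size_b) = a, b
        merges.append((members_a, members_b, merge_cost(a, b)))
        size = size_a + size_b
        centroid = [(size_a * x + size_b * y) / size for x, y in zip(centroid_a, centroid_b)]
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append((members_a + members_b, centroid, size))
    return merges


# Hypothetical trigram-rate profiles for four unnamed groups.
profiles = {"A": [20.0, 10.0], "B": [20.5, 10.5], "C": [5.0, 30.0], "D": [6.0, 29.0]}
merges = ward_hac(profiles)
for members_a, members_b, cost in merges:
    print(members_a, "+", members_b, "cost:", round(cost, 3))
```

The merge history is what a dendrogram draws. Because each merge is final, a poor early combination cannot be undone later, which is the weakness noted on the slide.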

HAC – 11 languages
[Figure: dendrograms for training data and test data; leaves: ITA, GER, FRE, SPA, HIN, TEL, JPN, KOR, TUR, ARA, CHI]

HAC – Languages by Levels
[Figure: dendrograms for training data and test data]

HAC - Heatmap (test data - 11 languages)

HAC - Heatmap
[Figure: heatmap; L1 labels: Japanese, German, Spanish, Chinese, Telugu, French, Korean, Italian, Arabic, Turkish, Hindi; trigram labels: "with the statement", "agree with the", "I agree with", "I think that", "a lot of"; callout: "I agree with the statement..."]

Conclusions
L1 groupings are real
L1 groupings can be identified across proficiency levels
EFA and HAC have similar results for the 2-factor model, except for Arabic

Future steps
New research question: Are these observations generalizable, or are they specific to a corpus?
Test the CFA model on different student data: does it support a 3- or 4-factor model?
Going beyond words: looking at deeper syntactic patterns (e.g., POS, phrase structure, long-distance dependencies)

Acknowledgements
Peer Review Group (PRG): Kelly Cunningham, Kim Becker, Idée Edalatishams, Erin Todey
Brown Bag team: Ananda Muhammad, Erin Todey
ISU Engl. Dept. Faculty: Gary Ockey, Bethany Gray