
Language use as a window to understand L1 differences in L2 writing




1 Language use as a window to understand L1 differences in L2 writing
Liberato Santos, Roz Hirch, and Sowmya Vajjala
Iowa State University

2 Native Language Identification (NLI)
Goal: Identify and classify L1s based on L2 writing
Assumption: L1 plays an active role in L2 acquisition and production
Where is NLI useful?
  Customized ESL instruction for learners with specific L1s
  Stylistic studies
  Forensic linguistics
Existing work:
  Systematic lexical and phrasal choices by L1 groups (Kyle, Crossley, & Kim, 2015)
  The role of n-grams/lexical bundles in L1 identification (Jarvis & Paquot, 2012)
  L2 learners’ idiosyncratic use of lexical bundles (Paquot, 2013)
How many times have you heard what someone said, or read what someone wrote, and then thought, "Oh, I think I know where this person is from based on the words they use"? When it comes to Non-Native English Speakers (NNES), some of us have an intuition that they choose certain words when they speak or write because of their L1, or that their L1 influences how they speak and/or write. This research might shed some light on these intuitions.
L1 identification: present it briefly (add references)

3 Research Questions
RQ #1: Are there latent groupings of L1s based on how frequently they use the same L2 trigrams?
RQ #2: Do these L1 groupings occur similarly across different L2 proficiency levels?

4 Corpus: TOEFL 11 (Blanchard et al., 2013)
ESL student writing from 11 language (L1) groups: ARA, CHI, FRE, GER, HIN, ITA, JPN, KOR, SPA, TEL, TUR
Training data: 11,000 txt files (3.5 million word tokens, 55k word types, 1k files per L1)
Test data: 1,100 txt files (345k word tokens, 14k word types, 100 files per L1)

5 Language Feature: word trigrams from student essays
Trigrams - what they are, what they look like, why trigrams and not four-grams

6 Language Feature: word trigrams from student essays
Trigrams - what they are, what they look like, why trigrams and not four-grams

7 Pre-processing
Language feature: Word trigrams we considered language-specific (and not prompt-specific)
Pre-processing:
  Full trigram list: all trigrams + frequencies from entire corpus (3.5M tokens)
  Generated 101 prompt-specific trigrams & removed them from the full list
  Total trigrams: 521
  Normalization: per 100K
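The pre-processing steps above (count word trigrams, drop prompt-specific ones, normalize per 100K) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the tokenizer, the function names `tokenize` and `trigram_rates`, the toy essays, and the excluded trigram are all assumptions, and the slides do not say how the 101 prompt-specific trigrams were identified.

```python
import re
from collections import Counter

def tokenize(text):
    """Very simple lowercase word tokenizer (an assumption, not the authors' tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def trigram_rates(texts, excluded=frozenset(), per=100_000):
    """Count word trigrams within each text, drop excluded ones,
    and normalize the counts to a rate per `per` word tokens."""
    counts = Counter()
    n_tokens = 0
    for text in texts:
        tokens = tokenize(text)
        n_tokens += len(tokens)
        # Trigrams never span two essays because each text is zipped separately.
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    if n_tokens == 0:
        return {}
    return {tri: c * per / n_tokens
            for tri, c in counts.items() if tri not in excluded}

# Toy essays and a hypothetical prompt-specific trigram to exclude.
essays = ["I agree with the statement.",
          "I agree with the idea that travel is good."]
rates = trigram_rates(essays, excluded={("travel", "is", "good")})
```

Computing one such rate table per L1 group would give the frequency-by-L1 matrix that the later analyses operate on.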

8 Methods: Statistical Analysis
Statistics #1: Exploratory Factor Analysis (EFA)
  Done on training data
  Helps identify patterns in the data
  L1 groupings: frequency of L2 trigram usage by L1 group

9 Results: 2-factor EFA on training data

10 Methods: Statistical Analysis
Statistics #2: Confirmatory Factor Analysis (CFA)
  Done on test data
  Can the 2-factor EFA be confirmed by a CFA?

11 Results: Does CFA confirm EFA?
CFA model fit:
  RMR: .137
  SRMR: .0448
  CFI: .937
  RMSEA: .102
  Chi: p = .000
  DF: 40
MODEL FIT:
  SRMR: values below .10 are indicative of good model fit
  CFI: .95 = good fit, and .9 = marginal/acceptable fit
  RMSEA: .06 and lower is considered good; between .06 and .08 is considered acceptable
  Chi-square: a significant Chi at p = .000 indicates the two models are different, which is not what we want

12 Results: Does CFA confirm EFA?
CFA model fit:
  RMR: .387
  SRMR: .0723
  CFI: .810
  RMSEA: .101
  Chi: p = .000
  DF: 203
MODEL FIT:
  SRMR: values below .10 are indicative of good model fit
  CFI: .95 = good fit, and .9 = marginal/acceptable fit
  RMSEA: .06 and lower is considered good; between .06 and .08 is considered acceptable
  Chi-square: a significant Chi at p = .000 indicates the two models are different, which is not what we want
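To make the comparison against the cutoffs explicit, the rules of thumb listed above can be applied mechanically to the two sets of reported CFA values. This is only a sketch: the function name `judge_fit` and the label "not met" are assumptions introduced here, and the cutoffs are the ones quoted on the slides rather than universal standards.

```python
def judge_fit(srmr, cfi, rmsea):
    """Label each fit index using the cutoffs quoted on the slides.
    'not met' is a label introduced here for values outside those cutoffs."""
    return {
        "SRMR": "good" if srmr < 0.10 else "not met",
        "CFI": ("good" if cfi >= 0.95
                else "acceptable" if cfi >= 0.90
                else "not met"),
        "RMSEA": ("good" if rmsea <= 0.06
                  else "acceptable" if rmsea <= 0.08
                  else "not met"),
    }

# Values as reported for the two CFA models above.
first = judge_fit(srmr=0.0448, cfi=0.937, rmsea=0.102)
second = judge_fit(srmr=0.0723, cfi=0.810, rmsea=0.101)

for name, result in (("first model", first), ("second model", second)):
    print(name, result)
```

Under these cutoffs both models meet the SRMR criterion and miss the RMSEA criterion, and only the first reaches the acceptable range for CFI.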

13 Conclusions so far
Trigrams indicate 2 groups of L1s (overlap: CHI, JPN, KOR, ARA, TUR)
Groupings: (1) L1s (2) L1s across proficiency levels
EFA & CFA: Moderate fit
What other statistics can help analyze this dataset?

14 Hierarchical Agglomerative Cluster Analysis (HAC)
Exploratory method
Frequency measures: Similarity matrix
  Euclidean, Ward’s method
Simple to understand: Visual (Dendrogram, Heatmap)
Problems:
  Potentially susceptible to poor early combinations
  Small samples lack stability
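As a rough illustration of agglomerative clustering with Euclidean distance and Ward's criterion, the sketch below repeatedly merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. It is a plain-Python toy, not the software used for the slides: the function name `ward_hac`, the group labels A to D, and the two-dimensional "trigram rate" vectors are all made up for the example.

```python
def ward_hac(points):
    """Agglomerative clustering with Ward's criterion.
    points: dict mapping a label to a numeric vector.
    Returns a list of merges as (members_a, members_b, cost), where cost is
    the increase in within-cluster sum of squares caused by the merge."""
    # Each cluster is keyed by a tuple of member labels -> (centroid, size).
    clusters = {(label,): (list(vec), 1) for label, vec in points.items()}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * sq_dist  # Ward merge cost
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[a + b] = (centroid, na + nb)
        merges.append((a, b, cost))
    return merges

# Hypothetical groups with made-up normalized rates for two trigrams.
toy = {"A": [10.0, 2.0], "B": [11.0, 2.5], "C": [3.0, 9.0], "D": [2.0, 8.0]}
merges = ward_hac(toy)
for a, b, cost in merges:
    print(a, "+", b, "cost", round(cost, 3))
```

Because each merge is greedy and never undone, an unlucky early merge carries through the rest of the tree, which is the "poor early combinations" problem noted above. Library implementations may report merge heights on a different scale than the raw cost used here.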

15 HAC – 11 languages
[Dendrograms: Training Data and Test Data; leaf labels: ITA, GER, FRE, SPA, HIN, TEL, JPN, KOR, TUR, ARA, CHI]

16 HAC – Languages by Levels
[Dendrograms: Training Data and Test Data]

17 HAC - Heatmap (test data - 11 languages)

18 HAC - Heatmap
[Heatmap; languages: Japanese, German, Spanish, Chinese, Telugu, French, Korean, Italian, Arabic, Turkish, Hindi; trigrams shown: "with the statement", "agree with the", "I agree with", "I think that", "a lot of"; highlighted example: "I agree with the statement..."]

19 Conclusions
L1 groupings are real
L1 groupings can be identified across proficiency levels
EFA and HAC have similar results for 2-factor model, except for Arabic

20 Future steps
New research question: Are these observations generalizable? Or are they specific to a corpus?
Test CFA model on different student data
Does it support a 3- or 4-factor model?
Going beyond words: Looking at deeper syntactic patterns (e.g., POS, phrase structure, long-distance dependencies)

21 Acknowledgements
Peer Review Group (PRG): Kelly Cunningham, Kim Becker, Idée Edalatishams, Erin Todey
Brown Bag team: Ananda Muhammad, Erin Todey
ISU Engl. Dept. Faculty: Gary Ockey, Bethany Gray

