Developing Theory-Based Diagnostic Tests of English Grammar: Application of Processability Theory
Rosalie Hirch
April 26, 2013
Order of the Presentation
  Introduction
  Literature Review: Processability Theory (PT) & Diagnostic Language Tests
    Hierarchies
    Errors
    Task Types
  Method
    Participants
    Instruments
    Analyses
  Results
  Discussion, Limitations, & Conclusions
Introduction: Background & Motivation
  Bridging the gap between testing and the classroom
  Previous research in diagnostic language assessment
    Empirically based
    Theory-based
  Processability Theory
    Already used for tests (RapidProfile)
    Is it sufficient for diagnostic tests?
Introduction: Major Goals & Aims of the Study
  To evaluate the reliability of a diagnostic grammar test for middle school students
  To explore theoretical approaches to diagnostic language assessment
  To investigate the application of Processability Theory for diagnostic grammar tests
Literature Review: Processability Theory & Diagnostic Language Tests
Processability Theory: Hierarchies
  Based on Lexical Functional Grammar
  Levels are implicational
  Levels come from the grammar tree
  Problem: the PT hierarchy is very limited
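To make "levels are implicational" concrete: mastery of a level implies mastery of every level below it. The following is a minimal sketch, not part of the original study; the function names and the list-of-booleans representation are assumptions made for illustration.

```python
from typing import Sequence


def is_implicational(mastery: Sequence[bool]) -> bool:
    """Return True if no level is mastered above an unmastered one.

    `mastery` is ordered from the lowest level to the highest
    (hypothetical representation, one boolean per level).
    """
    seen_gap = False
    for mastered in mastery:
        if mastered and seen_gap:
            # A higher level is mastered although a lower one is not.
            return False
        if not mastered:
            seen_gap = True
    return True


def highest_consecutive_level(mastery: Sequence[bool]) -> int:
    """Count the levels mastered without interruption from the bottom up."""
    count = 0
    for mastered in mastery:
        if not mastered:
            break
        count += 1
    return count
```

For example, a learner who has mastered levels 1 and 2 but not 3 and 4 fits the hierarchy, while a learner who has mastered level 3 but not level 2 does not.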
Processability Theory: Hierarchies
[Diagram: PT processing procedures applied to an example sentence]
  Procedures, highest to lowest: S’-Procedure, S-Procedure, Phrasal Procedure, Category Procedure, Word/Lemma
  Example: Susan (N) decorated (V) a (D) cake (N) while (SC) John (N) was (V) playing (PrP) tennis (N).
Diagnostic Tests: Hierarchies
  Other educational diagnostic tests also use hierarchies
    Used for analyzing problems
    Some are implicational
  Tend to be very broad (covering as much as possible)
  Suggestion that grammar, in particular, must cover a lot
Processability Theory: Errors
  Learners tend to make 2 types of errors
  These account for interlanguages
  Is she at home? (Target Sentence)
  She Ø at home? (Deletion)
  She is at home? (Overuse)
Diagnostic Tests: Errors
  The primary focus of diagnostic tests
  Can potentially show 2 elements in learner performance
    Where the problem lies (error: observable outcome)
    What thinking led to the error (weakness: underlying problem)
  Requires careful planning
    Before: Item Design
    After: Rubric Design
Processability Theory: Types of Tasks
  Emphasis on implicit knowledge (automaticity)
  Based on Levelt’s Speaking Model
  Tasks tend to be productive (speaking, writing)
  Analysis is done afterwards
Diagnostic Tests: Types of Tasks
  It is possible to use productive tasks, but not optimal
    Difficult to control contexts
  More likely to be discrete and, as a result, “inauthentic”
  Tasks from Norris (2005) and Chapelle et al. (2010)
    Some qualities of multiple choice
    Attempt to imitate productive tasks
Research Questions
  Can we achieve an acceptable level of reliability for the grammatical diagnostic test used for this study?
  Do the items for the grammatical diagnostic test work well at an item level in terms of item discrimination and difficulty?
  Were there unexpected patterns?
  What is the relationship between the subtest, full test, and self-assessment?
  Were mastery and non-mastery patterns consistent with predictions based on the Processability Theory hierarchy?
Method Participants Instrument Analyses
Participants: Subjects
  219 middle school students
  Outside Seoul
  No overseas education

                                  Grammar Test              Writing Test
Group     N    % Girls  % Boys    Mean  StDev  Range        Mean  StDev  Range
Gr. 3-5   72   52.7     47.2      0.46  0.18   0.10-0.85    3.3   1.8    0-7.5
Gr. 6     89   59.6     40.4      0.50  0.20   0.13-0.87    -     -      0-8
Gr. 7     39   51.3     48.7      0.47  0.19   0.02-0.79    3.8   1.6    0-7
Gr. 8&9   19   36.8     63.2      0.58  0.22   0.04-0.90    4.2   2.4    -
Total     219  53.9     46.1      0.49  -      0.02-0.90    3.5   -      -
Participants: Raters
  2 rounds of rating
  Round 1: Grammar
    All experienced in teaching; 4 in preparing tests
    Scored the grammar tests and writing tests for the specific grammar points
    Rated once (absolute answers)
  Round 2: Holistic
    5 Raters
    All experienced in scoring writing tests
    Rated twice (3 times where raters differed by 2 or more)
Instruments Grammar Test (see handout) Writing test: picture task Comparison purposes PT grammar and additional levels
Analyses
  Descriptive Statistics
    Central tendency & dispersion measures
    T-unit analysis
  Test and subsection reliability (Alpha)
  Item difficulty and discrimination
  Correlation with the writing test
  Fit to PT hierarchy
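The reliability analysis can be illustrated with a short sketch, assuming that "Alpha" refers to Cronbach's alpha and that each test taker's item scores are available as a row of numbers. The function name and data layout are hypothetical, not taken from the study.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    `scores` is a list of rows, one per test taker; each row holds
    that person's score on every item (e.g. 0/1 for wrong/right).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across test takers.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total scores across test takers.
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Population variance is used for both the items and the totals; using sample variance for both gives the same alpha, because the correction factor cancels in the ratio.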
Results
Descriptive Statistics: Grammar Test & Writing Test
Ave. Word Count Range Word Count Ave t-unit Count Words per t-unit Words per Clause Clauses per Target Clauses
219 67.83 0-242 10.78 6.30 5.69 0.11 0.19

            N    Items  Mean  SD    Median  Mode  Range
Version 1   219  52     25.6  10.1  25      15    1-47
Version 2   -    42     20.3  9.0   20      19    0-40
Writing     -    1      3.0   1.8   4.0     4.5   0-8
Reliability Statistics: Grammar Test and Subsections & Writing Test
Section: Det NC PN Past PrC SVsg SVpl Prep SCA SCB SCC SCT Test PTest
Number of items: 5 6 4 12 52 42
Alpha score: 0.18 0.7 0.88 0.85 0.93 0.92 0.73 0.76 0.74 0.61 0.83
Writing Test rater statistics: N, Correlation, Kappa, Perfect Agreement, Adjacent Scores, Perfect + Adjacent, Rho, Alpha, P-B Proph (3-rater)
Writing Test: 219 0.92 0.41 0.49 0.99 0.91 0.96 0.98
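The perfect and adjacent agreement figures for the writing raters can be illustrated as follows. This is a sketch under stated assumptions: the slides do not define how far apart two scores may be and still count as "adjacent", so the `step` parameter (one scale point by default) is hypothetical, as are the function name and the paired-list input.

```python
def agreement_rates(rater_a, rater_b, step=1.0):
    """Proportion of perfect and adjacent agreement between two raters.

    `rater_a` and `rater_b` are equally long lists of scores for the
    same scripts. Scores that differ by more than 0 and at most `step`
    count as adjacent (the threshold is an assumption).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(rater_a)
    perfect = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    adjacent = sum(1 for a, b in zip(rater_a, rater_b) if 0 < abs(a - b) <= step)
    return {
        "perfect": perfect / n,
        "adjacent": adjacent / n,
        "perfect_or_adjacent": (perfect + adjacent) / n,
    }
```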
Item Difficulty and Discrimination: Grammar Test
[Figure: index values plotted by item number]
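A minimal sketch of how item difficulty and discrimination might be computed from dichotomous (0/1) responses. The slides do not say which indices were used; here difficulty is taken as the proportion of correct answers and discrimination as the correlation between an item and the total of the remaining items. Both choices, and all names, are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """Difficulty and discrimination for each item.

    `responses` is a list of rows, one per test taker, with 1 for a
    correct answer and 0 for an incorrect one.
    """
    stats = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        # Total score on all other items, so the item is not
        # correlated with itself.
        rest = [sum(row) - row[i] for row in responses]
        stats.append({
            "difficulty": sum(item) / len(item),
            "discrimination": pearson(item, rest),
        })
    return stats
```

Under this definition a higher difficulty value means an easier item, since it is the share of test takers who answered correctly.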
Correlation with the Writing Test: Grammar Test and Subsections
PlN Past PrC SVSg SVPl Prep SubCl Test Total Writing Score A B C Tot 1 .37** .29** .34** SVsg .28** .42** .43** SVpl .38** .36** .27** .25** .33** .46** .45** .26** SCA .21** .40** .53** SCB .23** .56** SCC .15* .18** .39** .11 .50** SCT .48** .60** .87** .86** .69** Test .55** .65** .70** .75** .51** .73** .67** .64** .76** Writing .44** .47** .31** .61**
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Fit to Implicational Hierarchies
$\dfrac{\text{Total \# of cells} - \text{Exceptions}}{\text{Total \# of cells}} = 90\%+$
Coefficient of Scalability:
  PT Only = 94.1%
  PT + Proposed Levels = 89.3%
1 2 3 N 3 levels 5 2 levels 8 1 level 10 0 levels
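The ratio above can be sketched in code for a learner-by-level mastery table. The slides do not say how exceptions were counted; this sketch assumes an exception is any cell that differs from the ideal implicational pattern for a learner with the same number of mastered levels. The function name and matrix layout are hypothetical.

```python
def scalability(matrix):
    """(total cells - exceptions) / total cells for a mastery matrix.

    `matrix` has one row per learner and one column per level, ordered
    from the lowest level to the highest; 1 means mastered, 0 means not.
    """
    total_cells = 0
    exceptions = 0
    for row in matrix:
        mastered = sum(row)
        # Ideal pattern: the same number of mastered levels, all at the bottom.
        ideal = [1] * mastered + [0] * (len(row) - mastered)
        exceptions += sum(1 for cell, expected in zip(row, ideal) if cell != expected)
        total_cells += len(row)
    return (total_cells - exceptions) / total_cells
```

Other conventions for counting exceptions exist and would give somewhat different values for the same table.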
Discussion, Limitations, & Conclusion
Discussion
  Overall reliability was quite good
    Determiner and non-count sections did not work
    Exposed a problem with determiners generally
  Task-types have good potential for diagnostic information
  Grammar correlated fairly well with writing scores
    Follows from complexity and accuracy
    May also explain determiners & non-count nouns
  Fit to PT of proposed levels suggests tasks are plausible
Limitations
  Results are generalizable only to Koreans
    Methods may be universal
  Should have had a larger writing sample
    Also, more feedback from students and teachers
    More high-level students
Conclusions
  Most of the grammar tasks can work well, but require more planning & research
    Particular attention to error types
  It may be possible to expand the PT hierarchy
    Needed in order to be useful for diagnostic purposes