
1 Methodological Issues in Developing a Learning Progression-based Assessment System
Jennifer Doherty, Karen Draney, and Andy Anderson
Michigan State University; BEAR Center, UC Berkeley
NARST 2012

We've shown that we can use consistent patterns in students' accounts of carbon-transforming processes as a basis for coding or classifying those accounts, and that the insights into student reasoning afforded by these analyses are useful for guiding instruction.

2 Research Goal
Develop a valid and reliable assessment system that measures student progress in terms of learning progression levels.

For the work I'll talk about today, we focus on another potential use of our learning progression framework. We aim to develop a valid and reliable assessment system that measures student progress in terms of learning progression levels.

3 Key Methodological Problem
Each component of the system must meet two sets of criteria:
Statistical criteria based on measurement theory and practice
Conceptual criteria based on learning progression theory and practice, including conceptual coherence with other components of the assessment system

The key methodological problem we must solve to create this system is that each component of the assessment system must meet two sets of criteria: first, statistical criteria based on measurement theory and practice; second, conceptual criteria based on learning progression theory and practice (for example, all coding rubrics must be aligned with the learning progression theory).

4 Components of assessment system
We started building our assessment system using the assessment triangle, which is a model of the essential connections and dependencies present in a coherent and useful assessment system. Assessment activities (point to the observation vertex) must be aligned with the knowledge and cognitive processes one wishes to affect through the instructional process (point to the cognition vertex), and the scoring and interpretation of student work must reflect measures of the same knowledge and cognitive processes (point to the interpretation vertex). (NRC 2001)

5 Components of assessment system
To implement the assessment triangle, we used the BEAR Assessment System. Using this system, we address the key methodological problem in three ways: (1) by developing a set of assessment items closely aligned with our learning progression framework, (2) by developing a detailed system for creating and using coding rubrics to code student responses to assessments, and (3) by using a measurement model to test our predictions about student performance and to quantify performance with respect to the items and the learning progression framework. (Wilson 2005)

6 Component I: Construct Maps: Learning Progression Framework
So, let’s start at the beginning. Component one of the BEAR assessment system: the construct maps. Construct maps are sets of qualitatively ordered levels of performance in a concept or skill of particular importance. Constructs and levels of performance are derived partly from theories about how knowledge and practice are organized and partly from empirical research data. Our view is that a construct map may be seen as a learning progression framework. (Wilson 2005)

7 Component I: Construct Maps: Learning Progression Framework
Learning Progression Hypothesis: Consistency across processes with respect to principles and models. Students who learn scientific discourse see how systems and processes are connected, applying principles and models across processes. Lower-level students will not see scientific connections among processes, but their accounts will have similarities because they draw on a common pool of linguistic and conceptual resources.

Our learning progression framework incorporates a number of features that can be described as hypotheses. They are hypotheses not in the sense that they are unsupported by data, but in the sense that we still must confirm their usefulness as foundations for an assessment system. In this presentation I will focus on only one of these hypotheses and its implications for assessment development. Our hypothesis is that student performances in accounting for various carbon-transforming processes are linked by underlying approaches to making sense of the world (either scientific discourse or force-dynamic discourse), and that this leads to predictable similarities, or consistency, in their accounts across processes.

8 Component I: Construct Maps: Learning Progression Framework
Implication of the hypothesis: An assessment system should be able to measure levels of proficiency in the ability to trace matter and energy through systems and to reason across scales, for different processes, that are consistent across processes.

The implication of this hypothesis for our assessment system is that our system should be able to measure levels of proficiency that are consistent across processes. If our data from the development process are consistent with these implications, then we gain confidence in the validity of the learning progression framework as the basis for an assessment system. On the other hand, if patterns in our data are inconsistent with these predictions, that may raise questions either about specific items or about the framework itself.

9 Component 2: Item design
Component two of the BEAR assessment system is item design. (Wilson 2005)

10 Component 2: Item design
We want to measure students' understanding of principles and models with minimal effects from scaffolding and local knowledge.

11 Example Items
ENERPLNT (energy practice, plant growth process, MTF+CR): Which of the following is (are) energy source(s) for plants? Circle yes or no for each of the following: Water, Light, Air, Nutrients in soil, Plants make their own energy. Please explain ALL your answers, including why the things you circled "No" for are NOT sources of energy for plants.

THINGTREE (mass/gas practice, plant growth process, MTF+CR): A small oak tree was planted in a meadow. After 20 years, it has grown into a big tree, weighing 500 kg more than when it was planted. Do you think the tree will need any of the following things to grow and gain weight? Circle yes or no for each of the following: Sunlight, Soil, Water, Air. If you circled yes, explain how the tree uses it.

TREEDECAYC (energy practice, decomposer process, CR): A tree falls in the forest. After many years, the tree will appear as a long, soft lump on the forest floor. Is energy involved when the tree decays? Circle one: Yes / No. If your answer is yes, please explain how energy is involved.

Most of our items have a general format: the student makes a choice, which they then have to explain. Here are three example items, two of which address the same process and two of which address the same practice.
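To make the item tagging concrete, here is a minimal sketch (Python, with a hypothetical AssessmentItem record of our own devising, not the project's actual item bank) of the three example items and their practice, process, and format tags:

```python
# Hypothetical item records; the dataclass and field names are illustrative,
# but the tags are taken from the item labels above.
from dataclasses import dataclass

@dataclass
class AssessmentItem:
    name: str      # item identifier, e.g. "ENERPLNT"
    practice: str  # e.g. "energy", "mass/gas"
    process: str   # e.g. "plant growth", "decomposer"
    fmt: str       # "MTF+CR" (multiple true/false + constructed response) or "CR"

ITEMS = [
    AssessmentItem("ENERPLNT", "energy", "plant growth", "MTF+CR"),
    AssessmentItem("THINGTREE", "mass/gas", "plant growth", "MTF+CR"),
    AssessmentItem("TREEDECAYC", "energy", "decomposer", "CR"),
]
```

Read this way, ENERPLNT and THINGTREE share a process (plant growth), while ENERPLNT and TREEDECAYC share a practice (energy).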

12 Component 3: Outcome space (Coding Rubrics)
Component 3 of the assessment system is the outcome space. That is, the set of categorical outcomes into which student performances are categorized for all the items. In practice, these are presented as coding rubrics for student responses to assessment tasks. (Wilson 2005)

13 Component 3: Outcome space (Coding Rubrics)
Challenges:
Coding rubrics need to be aligned among items: among items for a single type of carbon-transforming process (e.g., plant growth items), and among items for a single practice (e.g., tracing energy).
Coding rubrics need to be aligned with the learning progression framework: responses coded at Level 2 are not just partially incorrect; they must contain indicators of Level 2.

Alignment between the theoretical learning progression framework and the scoring guides for individual items is crucial, and it leads to the following challenges: coding rubrics need to be aligned among items and with the learning progression framework. The typical approach to analyzing student performance on written open-response items is to create a scoring rubric for each item that allows you to assign variable credit for different responses. However, rubrics for items are not necessarily linked to one another or to an underlying cognitive theory, such as a learning progression framework. Creating this link, so that coding rubrics are aligned with the cognitive theory, is the core challenge of creating coding rubrics to assess the validity of a learning progression framework for an assessment system. Is what we're calling sophisticated reasoning on one item equally sophisticated reasoning on other items? (Alonzo et al. 2012)

14 Component 3: Outcome space: Developing Coding Rubrics
Methods:
Development of rubrics that combine the general learning progression framework levels with item-specific level indicators
Iterative process of developmental coding and rubric revision
Iterative process of full coding, reliability checks, rubric revision, and recoding

So, what have we done to develop these coding rubrics?
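As a rough sketch of what combining general framework levels with item-specific indicators could look like (the indicator wording below is invented for illustration and is not the project's actual rubric), each item's rubric keys its own indicators to the same four learning progression levels:

```python
# Illustrative rubric structure: every item rubric is keyed to the shared
# learning-progression levels (1-4); the indicator text here is hypothetical.
RUBRICS = {
    "TREEDECAYC": {
        1: "hypothetical Level 1 indicator for this item",
        2: "hypothetical Level 2 indicator (e.g., force-dynamic account of decay)",
        3: "hypothetical Level 3 indicator",
        4: "hypothetical Level 4 indicator (e.g., energy traced through decay)",
    },
    # ... one entry per item, all keyed to the same four levels
}

def check_alignment(rubrics: dict) -> None:
    """Verify that every item rubric defines indicators for the same four levels,
    so that a code of 2 on one item means the same level of reasoning as a 2 on another."""
    for item, levels in rubrics.items():
        missing = {1, 2, 3, 4} - set(levels)
        assert not missing, f"{item} lacks indicators for levels {missing}"
```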

15 Interviews and written assessments that assess two dimensions of complex accounts (process and practice)
This table, which Hui presented earlier, shows how we designed assessment items in two dimensions: processes and practices. So when we do our coding rubric development and revision, we are always looking to align rubrics vertically and horizontally in this grid, in addition to aligning them with our general learning progression framework. Here are our three example items.
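Continuing the hypothetical item records sketched earlier, aligning rubrics "vertically and horizontally" amounts to working cell by cell in the process × practice grid; a minimal grouping sketch:

```python
# Group the hypothetical item records from the earlier sketch by (process, practice)
# to see which cell of the two-dimensional design each item occupies.
from collections import defaultdict

def design_grid(items):
    grid = defaultdict(list)
    for item in items:
        grid[(item.process, item.practice)].append(item.name)
    return dict(grid)

# With the three example items this yields:
# {('plant growth', 'energy'):   ['ENERPLNT'],
#  ('plant growth', 'mass/gas'): ['THINGTREE'],
#  ('decomposer',  'energy'):    ['TREEDECAYC']}
```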

16 Component 4: Measurement Model
After we have coded the data using the rubrics (component 3), we then create a measurement model using Item Response Theory. (Wilson 2005)

17 How well do codes of the sample items fit the statistical measurement model?
Item         Discrimination   Weighted Mean Square
ENERPLNT     0.54             1.02
THINGTREE    0.50             0.88
TREEDECAYC   0.47             1.11

We use this model to assess individual items and how well they fit the measurement model. For example, all of these items show good statistical fit indices, as is the case with almost all of our items.
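For readers unfamiliar with the weighted mean square (infit) index, here is a minimal, self-contained sketch of how it can be computed for a polytomous item under a partial credit model. This is a generic illustration, not the project's estimation software; the person estimates and step difficulties are assumed inputs, and values near 1.0 indicate good fit:

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Category probabilities for one person on one partial-credit item.
    theta: person proficiency (logits); deltas: step difficulties (logits)."""
    # Cumulative sums of (theta - delta_j); category 0 has cumulative sum 0.
    cumsums = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    expnum = np.exp(cumsums - cumsums.max())  # subtract max for numerical stability
    return expnum / expnum.sum()

def weighted_mean_square(thetas, deltas, observed):
    """Infit (weighted mean square) for one item: the sum of squared residuals
    divided by the sum of the model variances across persons."""
    num = den = 0.0
    for theta, x in zip(thetas, observed):
        p = pcm_probs(theta, deltas)
        cats = np.arange(len(p))
        expected = np.sum(cats * p)                    # model-expected score
        variance = np.sum((cats - expected) ** 2 * p)  # model variance of the score
        num += (x - expected) ** 2
        den += variance
    return num / den

# Example with made-up numbers: three step difficulties, four persons.
print(weighted_mean_square(thetas=[-1.5, 0.2, 1.1, 2.6],
                           deltas=[-1.1, 0.96, 2.47],
                           observed=[0, 1, 2, 3]))
```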

18 We can also assess individual items by how well their step thresholds align with those of other items.
The average thresholds for our model are -1.1 logits for the L1/L2 separation, 0.96 for L2/L3, and 2.47 for L3/L4. You can see our example items fit pretty well. This means that, for a person below the lowest line, the most common response is going to be Level 1. Above the first line but below the second, the most common will be Level 2, but with the occasional 1 or 3 (especially on hard or easy items). Above the second line, there will be a mix of 3s and 4s; above the top line, mostly 4s. The L1/L2 threshold (the red dots on the Wright map) is separated reasonably well, with a few exceptions which we are investigating. There is less distinction between the L2/L3 and L3/L4 separations, due in part to the scarcity of 4s in our current data set, but also to the high difficulty of Level 3 on some of the items. Given the means indicated, and the fact that they are 1.5 to 2 logits apart, we can do a reasonable job of classifying persons into the sorts of responses they are most likely to make, even though on particular items that may not be true. Even though there is a pattern of general bands, with the means for the step thresholds spaced out the way we hoped they would be, there are a number of items where the step thresholds are at substantially higher or lower proficiencies than is typical for other items. This leads to a set of questions about individual items and about the instrument as a whole.
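As a minimal sketch of what these bands imply (using only the average thresholds quoted above; the function and cut points are illustrative, not the project's reporting code), a person's overall proficiency estimate can be mapped to the level of response they are most likely to give:

```python
# Average step thresholds (in logits) reported above: L1/L2, L2/L3, L3/L4.
THRESHOLDS = [(-1.1, 1), (0.96, 2), (2.47, 3)]  # (cut point, level just below it)

def most_likely_level(theta: float) -> int:
    """Return the learning-progression level (1-4) a person at proficiency
    `theta` (logits) is most likely to show on a typical item."""
    level = 1
    for cut, level_below in THRESHOLDS:
        if theta > cut:
            level = level_below + 1
    return level

# Example: a student estimated at 1.5 logits is most likely to give Level 3 responses.
print(most_likely_level(1.5))  # -> 3
```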

19 Questions about individual items
Check coding rubrics for conceptual validity
Check the data on which step thresholds are based
Recognize the limitations of individual items

What do we make of items with step thresholds that are "out of line" with the others, especially given that these items generally look fine with respect to fit to the model? We can:

Check for conceptual validity. When we look at items, responses, and scoring rubrics, we sometimes see that students are not interpreting the item the way we intended, or that the scoring rubrics misinterpret student responses or are not well aligned with the framework. In these cases, we may need to drop the item or recode responses. Our interview data are helpful here. We are also using text analytics software to investigate alignment.

Check the data on which step thresholds are based. Most of the responses were Levels 1-3, so especially for items that were administered to fewer students, the L3/L4 threshold may be based on very few Level 4 responses (sometimes only 4-8 students). These items' step thresholds may keep the Wright map from looking nicely aligned, but they affect proficiency estimates for very few students. We anticipate that this year we will have more high-level students to work with, so we can check these items with more extensive data.

Recognize the limitations of individual items. Each response provides just a little bit of information about a student's thinking: a couple of sentences at most, more commonly a few words or some forced choices. So we will probably have to live with some items that provide useful information about students' proficiencies but are simply easier or harder than others.
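The second check, whether a threshold rests on very few responses, can be illustrated with a small counting sketch (the response data and the cutoff of 10 are invented for illustration):

```python
from collections import Counter

def thin_threshold_items(coded_responses, level=4, min_count=10):
    """Flag items whose top-level threshold rests on very few responses.

    coded_responses : dict mapping item name -> list of coded levels (1-4)
    Returns items with fewer than `min_count` responses at `level`.
    (The cutoff and the data below are illustrative, not from the study.)"""
    flagged = {}
    for item, codes in coded_responses.items():
        n_top = Counter(codes)[level]
        if n_top < min_count:
            flagged[item] = n_top
    return flagged

# Example with made-up data: TREEDECAYC's L3/L4 threshold would rest on only 5 responses.
codes = {"ENERPLNT": [2, 3, 3, 4] * 30,
         "TREEDECAYC": [1, 2, 2, 3] * 30 + [4] * 5}
print(thin_threshold_items(codes))  # -> {'TREEDECAYC': 5}
```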

20 Questions about whole assessment
Criteria for results of validation analyses:
Multidimensionality and discrimination: a single dimension should account for most of the non-random variance, and items should be aligned with that dimension.
Wright maps and item difficulty: step thresholds in approximately the same region of vertical space support the claim of an underlying level of proficiency.
Alignment with student interviews: there should be a correlation between a student's written and interview codes.

So we will probably never be able to construct a test that consists exclusively of items that are each individually reliable and valid indicators of an individual student's proficiency. Individual students and individual items will be inconsistent in various ways. However, that is not the claim we want to make. What we really want to know is how well students' overall performances on clusters of items, or on the test as a whole, measure more general patterns of discourse and principle-based reasoning. So that leads us to questions about the measurement model of the whole assessment. Criteria for results of validation analyses of the measurement model include multidimensionality, Wright maps, and alignment with student interviews. We already talked about the Wright maps, so let's talk about the others.

21 Process Dimension Correlation Matrix
High correlations among processes and practices support the claim of an underlying level of proficiency.

                 Animal Growth  Combustion  Cross Process  Decay  Plant Growth
Animal Function  0.900          0.789       0.816          0.754  0.828
Animal Growth                   0.862       0.818          0.764  0.837
Combustion                                  0.847          0.743  0.856
Cross Process                                              0.924  0.880

Here you can see the process dimension correlations, which are quite high.
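As an illustration of how such a matrix is produced (the data below are simulated, not the study's): each student receives a proficiency estimate on every process dimension, and the dimensions are then correlated pairwise.

```python
import numpy as np

# Simulated example only: generate per-dimension proficiency estimates that share
# a common underlying proficiency, then correlate the dimensions pairwise.
dimensions = ["Animal Function", "Animal Growth", "Combustion",
              "Cross Process", "Decay", "Plant Growth"]
rng = np.random.default_rng(0)
common = rng.normal(size=200)                                   # shared proficiency
estimates = np.array([common + 0.4 * rng.normal(size=200) for _ in dimensions])

corr = np.corrcoef(estimates)                                   # 6 x 6 correlation matrix
for name, row in zip(dimensions, corr):
    print(f"{name:15s}", np.round(row, 2))
```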

22 Correlation between codes of Interviews and IRT student ability estimates
Another type of validity evidence is to compare the results of the measurement model to another assessment, in this case student interviews. As you can see, we have a pretty good correlation between the summative written scores and the interview codes.

23 Conclusions and Next Steps
We have multiple lines of evidence that these written assessments do a good job of categorizing most of the students that we have: Level 2 and Level 3 students.
Not a lot of good evidence yet that we are able to discriminate between Level 3 and Level 4 students.
Investigate the conceptual validity of items that are too hard or too easy.

We have multiple lines of evidence that these written assessments do a good job of categorizing most of the students that we currently have, namely Level 2 and Level 3 students. We do not yet have much evidence that we are able to discriminate between Level 3 and Level 4 students. This could change this year: in our current teaching experiments we have preliminary evidence that more students will be able to achieve Level 4. We are also investigating the conceptual validity of items that appear too hard or too easy. In addition to manual rubric analysis, we are using text analytics software to help us investigate.

