A fuzzy clustering approach to improve the accuracy of Italian students' data: An experimental procedure to correct the impact of the outliers on assessment test scores

Presentation transcript:

A fuzzy clustering approach to improve the accuracy of Italian students' data
An experimental procedure to correct the impact of the outliers on assessment test scores
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
UNIVERSITY OF NAPLES “PARTHENOPE”

OUTLINE
This work considers data on students' performance assessments collected by the Italian National Evaluation Institute of the Ministry of Education (INVALSI).
The data contain OUTLIER UNITS, at class level, which lead to biased distributions of the average scores by class.
The AIM is to MITIGATE THE PRESENCE of outliers and to correct the overestimation of children's ability.
THE INVALSI SURVEY
3 AREAS: reading, mathematics and science
5 SCHOOL LEVELS
– 2nd and 4th year of primary school
– 1st year of lower secondary
– 1st and 3rd year of upper secondary

DISTRIBUTIONS OF MEAN SCORES AT CLASS LEVEL (MATHEMATICS ASSESSMENT)
[Figure: mathematics class mean score, s.y. 2004/05. Panels: II class primary school; IV class primary school; I class lower secondary school; I class upper secondary school; III class upper secondary school.]

CLASS MEAN SCORE, II CLASS - PRIMARY SCHOOL
[Figure: class mean score distributions. Panels: Reading, Mathematics and Science for s.y. 2004/05; Reading, Mathematics and Science for s.y. 2005/06.]

STEP I
Deletion of micro units (students) considered as “PSEUDO NON RESPONDENTS”: students who haven't given the minimum number of answers needed to compute a performance score.
The presence of these units varies from 9% to 16%.
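The deletion step can be illustrated with a minimal Python sketch. The record layout (`valid_answers`, `class_id`) and the threshold `min_answers` are hypothetical; the slides do not state the minimum number of answers actually required.

```python
def drop_pseudo_non_respondents(students, min_answers):
    """Keep only students with at least `min_answers` valid answers.

    `min_answers` is a hypothetical threshold: the slides only say that a
    minimum number of answers is needed to compute a performance score.
    """
    return [s for s in students if s["valid_answers"] >= min_answers]


# Toy records, invented for illustration only.
students = [
    {"id": 1, "class_id": "A", "valid_answers": 20},
    {"id": 2, "class_id": "A", "valid_answers": 3},
    {"id": 3, "class_id": "B", "valid_answers": 15},
]

kept = drop_pseudo_non_respondents(students, min_answers=10)
dropped_share = 1 - len(kept) / len(students)  # share of deleted micro units
```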

COMPUTATION OF CLASS LEVEL INDICATORS
For each student class the following indexes are computed:
– Class mean score: computed from the score of the i-th student of the j-th class and the number of respondent students of the j-th class
– Standard deviation of mean score
– Class non-response rate: computed from the number of both item non-responses and invalid responses for the i-th student of the j-th class, the number of respondent students of the j-th class, and the number of items administered to the j-th class
– Index of answers' homogeneity: based on the Gini measure of heterogeneity computed for each s-th test question administered to each student of the j-th class
SUMMARY
At the first step the micro units considered as “pseudo-non respondents” are dropped from the dataset; then the following indexes are computed at class level: class mean score, standard deviation of mean score, class non-response rate, index of answers' homogeneity.
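The formulas for these indicators are not preserved in the transcript, so the following Python sketch is only one plausible reading of the definitions above. It assumes the standard deviation uses the population form (division by the number of respondents) and that the non-response rate divides the total count of missing or invalid answers by respondents times administered items; the function and argument names are hypothetical.

```python
import math


def class_indicators(scores, missing_counts, n_items):
    """Class-level indicators for one class j (assumed formulas).

    scores          -- scores of the respondent students of the class
    missing_counts  -- per student, item non-responses plus invalid responses
    n_items         -- number of items administered to the class
    """
    n = len(scores)  # number of respondent students
    mean = sum(scores) / n
    # Population form; a sample form (n - 1) would be equally plausible.
    sd = math.sqrt(sum((p - mean) ** 2 for p in scores) / n)
    # Assumed denominator: respondents times administered items.
    non_response_rate = sum(missing_counts) / (n * n_items)
    return mean, sd, non_response_rate


# Toy class with three respondents and ten items.
mean, sd, nr = class_indicators([60.0, 80.0, 100.0], [2, 0, 1], n_items=10)
```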

PRINCIPAL COMPONENT ANALYSIS (PCA)
With PCA we are able to describe the answer behaviour of each student class through two variables:
– FIRST component (contraposition): OUTLIERS IDENTIFICATION AXIS
– SECOND component (class non-response rate): INDEX OF CLASS COLLABORATION TO SURVEY
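A minimal sketch of this reduction, assuming the PCA is run on the standardised class-level indicators (that is, on their correlation matrix) and implemented with NumPy. The indicator values below are invented, and the slides do not say which software or scaling the authors used.

```python
import numpy as np


def first_two_components(X):
    """Scores of each class on the first two principal components.

    X has one row per class and one column per class-level indicator.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise each indicator
    corr = np.corrcoef(X, rowvar=False)           # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # largest variance first
    components = eigvecs[:, order[:2]]            # signs of the axes are arbitrary
    explained = eigvals[order[:2]] / eigvals.sum()
    return Z @ components, explained


# Hypothetical indicators: mean score, standard deviation,
# non-response rate, index of answers' homogeneity.
X = np.array([
    [74.0, 12.0, 0.05, 0.40],
    [68.0, 15.0, 0.08, 0.45],
    [71.0, 14.0, 0.03, 0.42],
    [98.0,  2.0, 0.01, 0.05],
    [65.0, 16.0, 0.20, 0.50],
    [70.0, 13.0, 0.06, 0.44],
])

scores, explained = first_two_components(X)
```

Each row of `scores` gives the coordinates of one class on the two factorial axes, which is the input the clustering step works on.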

PRINCIPAL COMPONENT ANALYSIS (PCA)
It is possible to detect, graphically, the outlier classes of students.
[Figure: projection of second-class primary students on the plane of the first two factorial axes, with the OUTLIER CLASSES marked.]

THE FUZZY K-MEANS APPROACH
On the basis of the two factorial dimensions, the students' classes are classified into 8 clusters by a FUZZY K-MEANS algorithm.
A fuzzy partition matrix is computed in which, for each students' class (rows of the matrix), the degree of belonging to each cluster (columns of the matrix) is given.
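The fuzzy partition matrix can be sketched with the standard fuzzy c-means updates. This is a generic textbook version, not necessarily the authors' implementation: the fuzzifier `m = 2`, the fixed number of iterations, the random initialisation and the toy data are all assumptions, and two clusters are used here instead of the 8 in the slides to keep the example small.

```python
import numpy as np


def fuzzy_kmeans(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Return (U, centroids): U[j, k] is the degree of belonging of unit j to cluster k."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)             # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance of every unit from every centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)               # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)  # membership update
    return U, centroids


# Toy factorial scores: two well-separated groups of classes.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])

U, centroids = fuzzy_kmeans(points, n_clusters=2)
labels = U.argmax(axis=1)                         # hard assignment, for inspection only
```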

DETECTION OF OUTLIERS
[Figure: projection of the centroids computed by fuzzy k-means.]
The OUTLIER CLUSTER is characterised by:
– High negative scores on the “outliers identification axis” (x-axis), which indicate high class average scores and minimum within-class variability with respect to scores and test answers
– Factorial scores close to zero with respect to the “index of class collaboration to survey”

DETECTION OF OUTLIERS
Indicating with “a” the outlier cluster, the degree of belonging to this cluster is µ_ja.
This measure is considered as the “outlier probability” of the j-th class. Alternatively, it can be interpreted as the “outlier level” of each class.

CORRECTION PROCEDURE
On the basis of the degree of belonging to the outlier cluster, a weighting factor is developed:
W_j = 1 - µ_ja (weighting factor = 1 - outlier probability)
W_j varies from 0 to 1. A students' class with a high probability of belonging to the outlier cluster will have a low weight, while a class very far from this cluster will have a weight close to 1.
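As a small worked example of the weighting factor, the Python sketch below computes W_j = 1 - µ_ja and uses it in a weighted average of class mean scores. The normalisation by the sum of the weights is an assumption (the slides do not show how the weights enter the adjusted averages), and the numbers are invented.

```python
def outlier_weights(mu_outlier):
    """W_j = 1 - mu_ja for each class j."""
    return [1.0 - mu for mu in mu_outlier]


def weighted_mean(values, weights):
    """Weighted average, normalised by the sum of the weights (assumed form)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Toy data: the third class is almost certainly an outlier.
class_means = [70.0, 72.0, 100.0]
mu_a = [0.05, 0.10, 0.95]        # degree of belonging to the outlier cluster

w = outlier_weights(mu_a)
unweighted = sum(class_means) / len(class_means)
adjusted = weighted_mean(class_means, w)
```

In this toy case the suspicious class keeps a weight of about 0.05, so the adjusted average falls from roughly 80.7 to roughly 71.7.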

EFFECTS OF THE CORRECTION PROCEDURE
[Figure: ORIGINAL DISTRIBUTION compared with ADJUSTED DISTRIBUTION.]

THE INSPIRATION PRINCIPLE
Go beyond the dichotomous logic (OUTLIER / NOT OUTLIER).
FUZZY APPROACH: compute an “OUTLIER LEVEL” measure for each unit to calibrate the correction.

RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION AND THE PRESENCE OF OUTLIER CLASSES
[Figure: box plot of the outlier level µ_ja, the degree of belonging to the outlier cluster (cluster n. 2).]

RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION AND THE PRESENCE OF OUTLIER CLASSES
[Figure: class average score distributions only for the northern and central regions.]

REGIONAL SCORES
[Figure: NOT WEIGHTED AVERAGE compared with WEIGHTED AVERAGE.]

Index of answers' homogeneity
The index is the mean of the Q Gini indexes (E_sj) computed for each s-th test question administered to each student of the j-th class. [Formula not preserved in the transcript.]
E_sj is a Gini measure of heterogeneity, computed from the ratio of students of the j-th class that have given the t-th answer to the s-th question. [Formula not preserved in the transcript.]
The Gini measure is equal to zero when all students of the j-th class have given the same answer to the s-th question. It reaches its maximum value, (h-1)/h (where h is the number of alternative answers to the s-th question), when there is perfect heterogeneity of answers to the s-th question in the j-th class.
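The properties stated above (zero when all answers coincide, maximum (h-1)/h under perfect heterogeneity) match the usual Gini heterogeneity measure, one minus the sum of the squared answer shares. The sketch below assumes that form; it is a reconstruction, since the formula itself is missing from the transcript, and the function names are hypothetical.

```python
from collections import Counter


def gini_heterogeneity(answers):
    """E_sj for one question: 1 minus the sum of squared answer shares (assumed form)."""
    n = len(answers)
    return 1.0 - sum((count / n) ** 2 for count in Counter(answers).values())


def homogeneity_index(answers_by_question):
    """Mean of the Gini measures over the Q questions administered to a class."""
    ginis = [gini_heterogeneity(a) for a in answers_by_question]
    return sum(ginis) / len(ginis)


# Toy class of four students, questions with h = 4 alternative answers.
same = gini_heterogeneity(["A", "A", "A", "A"])      # every student gives the same answer
spread = gini_heterogeneity(["A", "B", "C", "D"])    # perfect heterogeneity
index = homogeneity_index([["A", "A", "A", "A"], ["A", "B", "C", "D"]])
```

With h = 4 the second question reaches (h-1)/h = 0.75, in line with the stated maximum, so a class whose students all give identical answers on most questions gets an index near zero.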

EFFECTS OF THE CORRECTION PROCEDURE

                 Original distribution   Adjusted distribution
MEAN             74.71                   71.67
MODE             100.00                  68.75
I QUARTILE       64.42                   63.12
MEDIAN           73.61                   71.09
III QUARTILE     85.94                   80.69
KURTOSIS         [not preserved]         [not preserved]
SKEWNESS         [not preserved]         [not preserved]