Statistical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan 1 Statistical Methods in Computer Science Data 1: Frequency Distributions Ido.

Presentation transcript:

1. Statistical Methods in Computer Science
Data 1: Frequency Distributions
Ido Dagan

2. Concrete Theory: Relates Variables to Each Other
Examples:
- Mathematically accurate
  - Memory = 2*sizeof(input) + 3
  - Runtime = *sizeof(input) + 20
- Asymptotically correct
  - Memory = O(sizeof(input)) in the worst case
  - Runtime = O(log(sizeof(input))) in the best case
  - Accuracy is proportional to run-time
- Qualitative
  - User performance increases with reduced cognitive load
  - The number of bugs discovered is monotonically decreasing, but positive, if the same programmer is used; otherwise, it increases

3. Behavior Parameters/Variables (typical of Computer Science)
- Hardware parameters
  - CPU model and organization, cache organization, latencies in the system
- System parameters
  - Memory availability, usage
  - CPU running time (sometimes approximated by wall-clock time)
  - Communication bandwidth, usage
- Program characteristics
  - Requires floating-point, heavy disk usage, integer math, graphics
  - Large heap, large stack, uses non-local information, ...

4. Additional Behavior Variables
- Algorithm parameters
  - Algorithm choice, correctness/accuracy of results
  - Performance curves (accuracy vs. run-time)
  - Size of input
  - Worst case, best case, average case (!!)
- Other
  - Development person-hours
  - User (programmer) satisfaction, productivity
  - Lines of code, number of components, ...
  - Robotics: speed of movement, accuracy of positioning
  - Learning: precision and recall

5. Scales of Measurements
- Nominal (also called categorical): no order, just labels
  - e.g., “Algorithm Name”
- Ordinal (also called rank): order, but not numerical
  - The difference between ranks is not necessarily the same
  - e.g., ranks in a (hierarchical/military) organization
- Interval: the difference between values has the same meaning everywhere
  - e.g., temperature in Celsius (a rise of 10 degrees is the same everywhere)
  - But 100C is not twice as hot as 50C, and 0C is not a lack of heat
- Ratio: interval + fixed zero point
  - e.g., temperature in Kelvin, robot position, memory usage, run-time
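The interval/ratio distinction can be checked numerically. The sketch below is only an illustration in Python: the two temperatures are the ones on the slide, while the helper name `celsius_to_kelvin` is made up here.

```python
# Illustrative sketch: temperatures are from the slide, the helper name is hypothetical.
def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin (a ratio scale with a true zero)."""
    return c + 273.15

low_c, high_c = 50.0, 100.0

# Ratios on an interval scale depend on the arbitrary zero point, so this 2.0 is meaningless.
ratio_celsius = high_c / low_c

# On the ratio scale the same two temperatures differ by a factor of about 1.15.
ratio_kelvin = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Differences, by contrast, agree on both scales (the interval property).
diff_celsius = high_c - low_c
diff_kelvin = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)
```

The point of the comparison is that only the differences survive the change of zero point, which is why "twice as hot" is not a valid statement in Celsius.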

6. Scale Hierarchy
Nominal < Ordinal < Interval < Ratio (interval and ratio scales together are called “numerical”)
- Propositions that are true at some level are true at the levels above it
  - But not necessarily the other way around
- e.g., we can calculate the mean (average) value for numerical variables
  - But not for nominal and ordinal variables
- e.g., we can calculate the most frequent value for all variables
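As a rough illustration of the hierarchy, the Python sketch below computes the most frequent value on a nominal and on a ratio variable, but the mean only on the ratio one. The algorithm names and run-times are invented sample values, not data from the course.

```python
from statistics import mean, mode

# Hypothetical sample data, for illustration only.
algorithm_names = ["quicksort", "mergesort", "quicksort", "heapsort"]  # nominal scale
run_times_ms = [12.0, 15.5, 11.0, 15.5]                                # ratio scale

# The most frequent value is defined at every level of the hierarchy.
most_common_algorithm = mode(algorithm_names)
most_common_time = mode(run_times_ms)

# The mean is only meaningful for numerical (interval/ratio) variables,
# so it is computed for the run-times and not for the names.
average_time = mean(run_times_ms)
```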

7. Variables
- Discrete: can take on only certain values: symbols, exact numbers
  - For ordinal, interval and ratio scales, this means there will be gaps
  - e.g., user satisfaction surveys, memory usage
- Continuous: can take on any value within its range: no gaps
  - e.g., run-time, CPU temperature, robot velocity and position
  - In practice: limited by measurement accuracy
  - It is up to the researcher to determine the needed accuracy and to approximate carefully

8. Data
The collection of values that a variable X took during the measurement

9. Describing Data
Our task: describe the data we have collected
- Find ways to characterize it, represent it
- Find properties that are true of the data
- So that we can relate the values to those of other variables

10. Data Distribution
The collection of data is called the sample distribution.
We will investigate distributions:
- Find values that “best” represent a distribution
- Measure their dispersion, range, shape
- Identify extraordinary values in a distribution
- Find visual representations for a distribution
Remember the hierarchy: Nominal < Ordinal < Interval < Ratio
- Think about how the following techniques apply


13. Frequency Distribution
- Examine the frequency of values: f(x) = number of times the variable took on value x
- Convention (ordinal/numerical): sort by value
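A frequency distribution of this kind can be sketched in a few lines of Python. The scores below are invented for illustration; `collections.Counter` does the counting, and sorting the items follows the sort-by-value convention.

```python
from collections import Counter

# Hypothetical scores, for illustration only.
scores = [85, 90, 85, 70, 90, 85, 100, 70]

# f(x) = number of times the variable took on value x.
frequency = Counter(scores)

# Convention for ordinal/numerical variables: list the values in sorted order.
frequency_table = sorted(frequency.items())
```

For a nominal variable the same `Counter` works, but the sorted order carries no meaning.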


15. Grouped Frequency Distributions
- For ordinal/numerical variables, it is possible to group values together
- This creates grouped frequency distributions
- Warning: loss of information
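One possible way to build a grouped frequency distribution for integer scores is sketched below in Python. The scores, the interval width of 5, and the function name `group_frequencies` are assumptions chosen for illustration.

```python
from collections import Counter

def group_frequencies(values, width, start):
    """Count integer values per interval (lo, lo + width - 1), starting at `start`."""
    counts = Counter()
    for v in values:
        lo = start + ((v - start) // width) * width
        counts[(lo, lo + width - 1)] += 1
    return sorted(counts.items())

# Hypothetical scores, for illustration only.
scores = [72, 78, 81, 85, 86, 88, 91, 95, 97]
grouped = group_frequencies(scores, width=5, start=70)
```

The loss of information is visible in the result: the entry for 85-89 records that three scores fell there, but not that they were 85, 86 and 88.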

16. Real and Apparent Limits
- Continuous values are more difficult to divide into intervals
  - A score of 95 falls within 95-99, not within the neighboring interval
  - But what about a temperature between 94 and 95 (94 < x < 95)?
- By convention, the real limits of a score are within ½ the measurement resolution
  - If our resolution is 0.1, then the limits are within 0.05
  - If our resolution is 100, then the limits are within 50
- We break the convention only in exceptional cases
  - e.g., age: “I am 35” is true of ages from 35 up to, but not including, 36

17. Real/Apparent Limits
For example:
- Resolution of : interval really covers values to
  - Apparent limits:
  - Real limits: to
- Resolution of 10: really covers values 735 to 805.
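The half-resolution convention can be written as a small Python helper. This is a sketch under assumptions: the function name `real_limits` is made up, the 85-89 interval is the one used in the percentile slides, and the apparent limits 740 and 800 are inferred by working backwards from the real limits 735 to 805 with a resolution of 10.

```python
def real_limits(apparent_low, apparent_high, resolution):
    """Widen apparent limits by half the measurement resolution on each side."""
    half = resolution / 2
    return apparent_low - half, apparent_high + half

# Integer scores (resolution 1): the interval 85-89 really covers 84.5 to 89.5.
score_interval = real_limits(85, 89, resolution=1)

# Resolution 10: apparent limits 740-800 (an inferred assumption) cover 735 to 805.
coarse_interval = real_limits(740, 800, resolution=10)

# A single hypothetical reading of 94.5 at resolution 0.1 covers 94.45 to 94.55.
single_reading = real_limits(94.5, 94.5, resolution=0.1)
```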

18. Relative Frequency Distributions
- A frequency count can be misleading
  - Algorithm X was fastest on 60,000 trials: is this good?
  - 100,000 people voted for candidate A: is she the winner?
- We need a way to compare values, i.e., relate them to each other
- Relative frequency distributions: translate f into a proportion or a percentage
  - rel f (proportion) = f/N
  - rel f (%) = 100 * f/N
- Warning: can be misleading if the magnitude of the count is ignored
  - 50% of all test cases succeeded (with only two cases…)

19. Relative Frequency Distributions: f/N
Example:
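A minimal Python sketch of the f/N translation follows. The outcome labels, the sample sizes, and the function name `relative_frequencies` are invented for illustration. Keeping the raw count f next to the proportion is one way to guard against the "50% of two cases" trap.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each value to (f, f/N, 100*f/N), with keys in sorted order."""
    n = len(values)
    counts = Counter(values)
    return {x: (f, f / n, 100 * f / n) for x, f in sorted(counts.items())}

# Hypothetical test outcomes: the same 50% with very different count magnitudes.
two_cases = relative_frequencies(["pass", "fail"])
many_cases = relative_frequencies(["pass"] * 500 + ["fail"] * 500)
```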

20. Cumulative Frequency Distribution
- For ordinal/numerical variables
- Shows where values are with respect to others: how many are below or above

21. Cumulative Frequency Distribution
Based on the cumulative distribution, we can answer questions such as:
- What percentage of scores fall below 80?
- How many scores are below 95?
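Questions like these can be answered from a running sum of the frequencies. The Python sketch below uses a made-up grouped table: only N = 50, the 32 scores below 84.5, and the 15 cases in 85-89 match the figures quoted in the percentile slides, and the remaining frequencies are assumptions chosen to be consistent with them.

```python
from itertools import accumulate

# Hypothetical grouped table (apparent limits); see the caveat above.
intervals = [(65, 69), (70, 74), (75, 79), (80, 84), (85, 89), (90, 94), (95, 99)]
freqs = [2, 5, 10, 15, 15, 2, 1]

cum_f = list(accumulate(freqs))          # cumulative frequency per interval
n = cum_f[-1]                            # total number of scores
cum_pct = [100 * c / n for c in cum_f]   # cumulative percentage per interval

def count_below(lower_apparent_limit):
    """Number of scores in intervals that start below the given apparent limit."""
    return sum(f for (lo, _hi), f in zip(intervals, freqs) if lo < lower_apparent_limit)

scores_below_95 = count_below(95)
percent_below_80 = 100 * count_below(80) / n
```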

22. Percentiles, Percentile Ranks
- Percentile X: the value for which X percent of values are lower
  - e.g., baby height
  - We use P_X to denote the Xth percentile, e.g., P_98 is in range
- Percentile rank of X: the percent of values that fall below X
  - e.g., percentile rank of the interval is 12.

23. Computing Percentiles, Percentile Ranks
- How do we compute percentiles and percentile ranks from grouped data?
- What is the score that defines the top 20% of scores?
  - Is it between 84 and 85?


27. Computing Percentiles
We want to compute P_80.
- 80% of 50 cases = 40 cases.
- We look under the cum f heading: 32 of the 40 scores are less than 84.5 (the real limit). We need 8 more.
- The interval 85-89 (lower real limit 84.5) contains 15 cases.
- These are spread over a width of 5 (= 89.5 - 84.5).
- Assume the scores are evenly distributed within the interval: 8 more cases ==> 8/15 * 5 = 2.67 (linear interpolation)
- P_80 = 84.5 + 2.67 = 87.17
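The interpolation steps above can be sketched as a Python function. This is an illustration, not the course's own code: the function name and the table are assumptions. Only N = 50, the 32 scores below 84.5, and the 15 cases in 85-89 come from the slide, and the other frequencies are invented to be consistent with those figures.

```python
def percentile_from_grouped(p, intervals, freqs, resolution=1):
    """Estimate the p-th percentile from grouped data by linear interpolation."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    n = sum(freqs)
    target = p / 100 * n          # number of cases that must lie below the percentile
    cum = 0                       # cases below the current interval's lower real limit
    for (lo, hi), f in zip(intervals, freqs):
        if f > 0 and cum + f >= target:
            real_lo = lo - resolution / 2
            width = (hi + resolution / 2) - real_lo
            # Assume scores are spread evenly across the interval.
            return real_lo + (target - cum) / f * width
        cum += f
    raise ValueError("table contains no cases")

# Hypothetical grouped table (apparent limits); see the caveat above.
intervals = [(65, 69), (70, 74), (75, 79), (80, 84), (85, 89), (90, 94), (95, 99)]
freqs = [2, 5, 10, 15, 15, 2, 1]

p80 = percentile_from_grouped(80, intervals, freqs)
```

With this table the function walks the same path as the slide: 32 cases lie below 84.5, 8 more are needed out of 15, and the result rounds to 87.17.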


29. Computing Percentile Ranks
We want to compute the percentile rank of 86.
- It lies in the interval 85-89, with real limits 84.5 to 89.5.
- 86 - 84.5 = 1.5 score points. Width of the interval = 5.
- 1.5/5 = 0.3 ==> 30% of the scores in the interval (0.3 * 15 = 4.5)
- So we have 32 scores up to 84.5, plus 4.5 scores from 84.5 to 86.
- Total: 32 + 4.5 = 36.5 scores. 36.5/50 = 73%. This is the percentile rank of 86.
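The reverse computation, from a score to its percentile rank, can be sketched the same way in Python. As before, the function name and most of the table are assumptions. Only N = 50, the 32 scores below 84.5, and the 15 cases in 85-89 are taken from the slide.

```python
def percentile_rank_from_grouped(score, intervals, freqs, resolution=1):
    """Estimate the percent of cases below `score` from grouped data."""
    n = sum(freqs)
    half = resolution / 2
    cum = 0  # cases below the current interval's lower real limit
    for (lo, hi), f in zip(intervals, freqs):
        real_lo, real_hi = lo - half, hi + half
        if real_lo <= score < real_hi:
            # Fraction of the interval lying below the score, assuming an even spread.
            fraction = (score - real_lo) / (real_hi - real_lo)
            return 100 * (cum + fraction * f) / n
        cum += f
    raise ValueError("score lies outside the table")

# Hypothetical grouped table (apparent limits); see the caveat above.
intervals = [(65, 69), (70, 74), (75, 79), (80, 84), (85, 89), (90, 94), (95, 99)]
freqs = [2, 5, 10, 15, 15, 2, 1]

rank_of_86 = percentile_rank_from_grouped(86, intervals, freqs)
```

For a score of 86 this reproduces the slide's arithmetic: 32 cases below 84.5 plus 0.3 * 15 = 4.5 cases inside the interval, giving 36.5 out of 50.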

30. Frequency Distributions and Scales

31. Displaying Frequency Distributions: Nominal Data

32. Displaying Frequency Distributions: Ordinal/Numerical Data
Histogram

33. Displaying Frequency Distributions: Ordinal/Numerical Data
Histogram: different grouping
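How much the grouping changes the picture can be seen even with a plain-text histogram. The Python sketch below is an illustration only: the scores and the function name `text_histogram` are made up, and bins are assumed to start at multiples of the chosen width.

```python
from collections import Counter

def text_histogram(values, width):
    """Return one text line per bin of the given width, one '#' per value."""
    counts = Counter((v // width) * width for v in values)
    lines = []
    for lo in range(min(counts), max(counts) + width, width):
        lines.append(f"{lo:>3}-{lo + width - 1:<3} {'#' * counts[lo]}")
    return lines

# Hypothetical scores, for illustration only.
scores = [72, 78, 81, 85, 86, 88, 91, 95, 97]

fine = text_histogram(scores, width=5)     # six narrow bins
coarse = text_histogram(scores, width=10)  # three wide bins
```

Printing `fine` and `coarse` line by line shows the same data with a peak at 85-89 in one grouping and a broader peak at 80-89 in the other.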

34. Lying with Visuals

35. Characteristics of Distributions
- Shape, central tendency, variability
- Different central tendency
- Different variability