MATH-138 Elementary Statistics

Emily C. Francis
Howard Community College
Unit 1 Lecture Slides

What is Statistics?
Statistics is the science of:
  Collecting data
  Analyzing data
  Drawing conclusions and making decisions as a result of the data analysis. This is referred to as “statistical inference”.

What is a “Statistic”?
A statistic is a function of the data:
  Data -> [function] -> statistic
For example, suppose we have a data set of student heights. Taking the average, or mean, of the heights is a function. Thus the mean height is a statistic of the data.
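A minimal sketch of the "data -> function -> statistic" idea. The heights below are made-up values used only for illustration.

```python
from statistics import mean

# Hypothetical heights (in inches) for a small sample of students.
heights = [62, 65, 68, 70, 75]

# The mean is a function applied to the data; its output is a statistic.
mean_height = mean(heights)
print(mean_height)
```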

Phases in Statistical Analysis
Data Collection: The process of collecting data (samples) via surveys, observational studies, and/or designed experiments
Data Analysis: Graphing and summarizing key features of the data to discover major patterns in the data
Statistical Inference: Drawing inferences (conclusions) and making decisions based on the data

Population vs. Sample
For a given statistical inquiry:
  The population consists of all items of interest (people, places, companies, etc.)
  A sample is a (hopefully representative and random) subset of the population
  A numerical value/characteristic of a population is called a parameter. These are usually unknown.
  A numerical value/characteristic of a sample is called a statistic

Components of a Data Set
Cases: people, places, companies, colleges, etc.
Variables: characteristics/measurements of each individual case

Variable Types
Categorical variables:
  Have values that are described by words
  Represent categories
  Can be represented with #’s (the actual #’s assigned are irrelevant). The #’s have no units and no mathematical operations can be performed on these #’s.
Quantitative variables:
  Have numerical values and units

Displaying & Describing Categorical Data
No mathematical operations can be performed on categorical data
Categorical data can only be counted and then described/displayed using:
  Frequency (and relative frequency) tables
  Bar charts
  Pie charts
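As a rough illustration of counting categorical data, the sketch below builds a frequency and relative frequency table. The `majors` list is hypothetical.

```python
from collections import Counter

# Hypothetical categorical variable: each student's declared major.
majors = ["Biology", "Business", "Biology", "Nursing", "Business", "Biology"]

# Frequency table: how many cases fall in each category.
counts = Counter(majors)

# Relative frequency table: each count divided by the number of cases.
n = len(majors)
relative = {category: count / n for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category:10s} {count:3d} {relative[category]:.1%}")
```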

Contingency Tables
A contingency table shows how cases are distributed along each variable, contingent on the value of another variable
Marginal and conditional distributions
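One way to see the difference between a marginal and a conditional distribution is to compute both from the same cell counts. The cases and category names below are invented for this sketch.

```python
from collections import Counter

# Hypothetical cases: (enrollment status, commute mode) for 8 students.
cases = [
    ("full-time", "car"), ("full-time", "bus"), ("full-time", "car"),
    ("part-time", "car"), ("part-time", "car"), ("part-time", "bus"),
    ("full-time", "bus"), ("part-time", "car"),
]

# Cell counts of the contingency table.
cell_counts = Counter(cases)

# Marginal distribution of commute mode: ignores enrollment status.
n = len(cases)
mode_totals = Counter(mode for _, mode in cases)
marginal_mode = {mode: count / n for mode, count in mode_totals.items()}

# Conditional distribution of commute mode, given each enrollment status.
status_totals = Counter(status for status, _ in cases)
conditional_mode = {
    status: {mode: cell_counts[(status, mode)] / total for mode in mode_totals}
    for status, total in status_totals.items()
}

print(marginal_mode)
print(conditional_mode)
```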

Displaying & Summarizing Quantitative Data
A frequency (& relative frequency) distribution is an excellent initial data analysis tool
A histogram is a visual representation of a frequency distribution
A relative frequency histogram is a visual representation of a relative frequency distribution
Dotplot
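The sketch below groups hypothetical quantitative data into equal-width bins and prints a crude text histogram. The bin width of 10 is an arbitrary choice for this example.

```python
from collections import Counter

# Hypothetical ages of 10 students.
ages = [18, 19, 19, 21, 22, 24, 27, 31, 33, 38]
bin_width = 10  # assumed bin width; other widths give other pictures

# Label each value by the lower edge of its bin (10-19, 20-29, 30-39).
bins = Counter((age // bin_width) * bin_width for age in ages)

n = len(ages)
for lower in sorted(bins):
    count = bins[lower]
    label = f"{lower}-{lower + bin_width - 1}"
    # Frequency, relative frequency, and one star per case.
    print(f"{label}: {count:2d} ({count / n:.0%}) " + "*" * count)
```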

Describing a Distribution
Shape
Center
Spread

Distribution Shape
“Modality”
Symmetry
Outliers

Measures of Center
Median
Mean

Median
The median of a variable is the midpoint of the sorted data values
  For odd n, the median equals the middle data value
  For even n, the median equals the average of the middle two values
Is useful when the variable of interest has a skewed distribution and/or has outliers (it is not sensitive to these outliers)
Does not have to be a data value
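A short check of the odd-n and even-n rules, and of the median's resistance to an outlier. All data values are made up.

```python
from statistics import median

odd_n = [3, 1, 7, 5, 9]          # sorted: 1 3 5 7 9, middle value is 5
even_n = [3, 1, 7, 5]            # sorted: 1 3 5 7, middle two are 3 and 5
with_outlier = [3, 1, 7, 5, 900] # same as odd_n, but 9 replaced by 900

median_odd = median(odd_n)
median_even = median(even_n)     # 4.0, which is not one of the data values
median_outlier = median(with_outlier)

print(median_odd, median_even, median_outlier)
```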

Mean
The mean is the sum of all the data values divided by the # of data values: $\bar{x} = \frac{\sum x}{n}$
Treats all values equally and can therefore be influenced by outliers
Does not have to be a data value
Deviations of the data points from the mean always sum to zero
Is useful when the variable of interest is symmetric with no outliers
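The properties listed for the mean can be checked on a small made-up data set:

```python
from statistics import mean

data = [2, 4, 6, 8, 10]
x_bar = mean(data)

# Deviations from the mean always sum to zero.
deviations = [x - x_bar for x in data]
deviation_sum = sum(deviations)

# Replacing one value with an outlier pulls the mean toward it.
with_outlier = [2, 4, 6, 8, 100]
x_bar_outlier = mean(with_outlier)

print(x_bar, deviation_sum, x_bar_outlier)
```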

Distribution Shape (Contd.)
Symmetric data:
  Mean is approx. equal to the median
  Tails of the distribution are balanced
Skewed left data:
  Mean < Median
  Long tail of distribution “points” left
  A few low values, but most data on right
Skewed right data:
  Median < Mean
  Long tail of distribution “points” right
  A few high values, but most data on left
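To illustrate how skew separates the mean from the median, here is a small sketch with two invented data sets, one with a long right tail and one with a long left tail.

```python
from statistics import mean, median

# Hypothetical right-skewed data: a few high values, most data on the left.
incomes = [30, 32, 35, 38, 40, 45, 150]

# Hypothetical left-skewed data: a few low values, most data on the right.
exam_scores = [20, 85, 88, 90, 92, 95, 97]

right_mean, right_median = mean(incomes), median(incomes)
left_mean, left_median = mean(exam_scores), median(exam_scores)

print(right_median < right_mean)  # skewed right: Median < Mean
print(left_mean < left_median)    # skewed left: Mean < Median
```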

Five-Number Summary
Max
Q3
Median
Q1
Min
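A possible way to compute the five-number summary is sketched below. Textbooks and calculators differ on how quartiles are defined; this sketch assumes the "median of each half" rule, leaving the middle value out of both halves when n is odd. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Assumption: for odd n, the middle value belongs to neither half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

scores = [13, 1, 9, 5, 11, 3, 7]  # hypothetical data
summary = five_number_summary(scores)
print(summary)
```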

Measures of Spread
Range
Interquartile range (IQR)
Variance
Standard deviation

Range & IQR
The range is the difference between the maximum and minimum data values
IQR = Q3 - Q1
The IQR is useful when the variable of interest has a skewed distribution and/or has outliers (it is not sensitive to these outliers)
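The following sketch compares the range and the IQR on made-up data, with and without an outlier. The quartiles are computed with the median-of-halves rule, which is an assumption; other quartile conventions can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    # Assumption: for odd n, skip the middle value when forming the halves.
    upper_start = half + 1 if len(data) % 2 else half
    return median(data[:half]), median(data[upper_start:])

def data_range(values):
    return max(values) - min(values)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

typical = [1, 3, 5, 7, 9, 11, 13]
with_outlier = [1, 3, 5, 7, 9, 11, 130]

# The outlier inflates the range but leaves the IQR unchanged.
print(data_range(typical), iqr(typical))
print(data_range(with_outlier), iqr(with_outlier))
```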

Variance
The variance is basically the average of the squared deviations from the mean (the sum is divided by n - 1 rather than n): $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The units of this statistic are in squared units of the original data values
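A worked sketch of the variance calculation, assuming the sample version with an n - 1 divisor (which is what `statistics.variance` computes). The data are invented.

```python
from statistics import mean, variance

data = [2, 4, 6, 8, 10]
n = len(data)
x_bar = mean(data)

# Squared deviations from the mean: 16, 4, 0, 4, 16.
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1.
manual_variance = sum(squared_deviations) / (n - 1)

# The standard library uses the same n - 1 divisor.
library_variance = variance(data)

print(manual_variance, library_variance)
```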

Standard Deviation
The SD is the square root of the variance: $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Is a single # that helps us understand how spread out the data is
Units of measurement are the same as the original data

Standard Deviation (Contd.)
The standard deviation (and variance) statistics are never negative
If every data value is equal, then there is no variation, and hence SD=Var=0
Is useful when the variable of interest is symmetric with no outliers
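These properties of the standard deviation can be checked numerically. The sketch assumes the sample standard deviation (n - 1 divisor), as computed by `statistics.stdev`, and uses made-up data.

```python
from math import sqrt
from statistics import stdev, variance

data = [2, 4, 6, 8, 10]

# The SD is the square root of the variance.
sd = stdev(data)
sd_from_variance = sqrt(variance(data))

# If every value is equal there is no variation, so SD = Var = 0.
constant = [7, 7, 7, 7]
constant_sd = stdev(constant)
constant_var = variance(constant)

print(sd, sd_from_variance, constant_sd, constant_var)
```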

Boxplots
A boxplot is a graphical display of the five-number summary
The procedure to construct a boxplot can be found on pgs of the text

Standardized Variables: Z Scores
To “standardize” a variable, calculate each observation’s distance from the mean in units of the standard deviation. That is, define variable Z as: $z = \frac{x - \bar{x}}{s}$
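A brief sketch of standardizing a made-up data set, using the sample mean and sample standard deviation. After standardizing, the z scores have mean 0 and standard deviation 1.

```python
from statistics import mean, stdev

data = [2, 4, 6, 8, 10]
x_bar = mean(data)
s = stdev(data)

# Each z score is the distance from the mean in units of the SD.
z_scores = [(x - x_bar) / s for x in data]

print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 6), round(stdev(z_scores), 6))
```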

Normal Models
A normal model:
  Is symmetric and “bell” shaped
  Is commonly used to model many things in the business and physical worlds
  Is defined by 2 parameters, μ (the mean) and σ (the standard deviation)
  Its distribution peaks at μ
A normal distribution with mean=0 and std. dev.=1 is called “standard”

The 68-95-99.7 Rule
For data from a NORMAL model:
  ~68% will lie within 1 std. dev. of the mean
  ~95% will lie within 2 std. dev’s of the mean
  ~99.7% (virtually all the data) will lie within 3 std. dev’s of the mean
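The rule can be verified with Python's `statistics.NormalDist`. The mean of 100 and standard deviation of 15 are arbitrary values chosen for this sketch; any normal model gives the same percentages.

```python
from statistics import NormalDist

mu, sigma = 100, 15  # arbitrary parameters for illustration
model = NormalDist(mu, sigma)

# Probability of falling within k standard deviations of the mean.
within = {
    k: model.cdf(mu + k * sigma) - model.cdf(mu - k * sigma)
    for k in (1, 2, 3)
}

for k, p in within.items():
    print(f"within {k} SD: {p:.1%}")
```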

Normalcdf & Invnorm
If you are given a value(s) and you want a percentage under the normal model, you use “normalcdf” on your calculator:
  normalcdf(left value, right value, mean, std. dev.)
If you are given a percentage under the normal model and you want a value, you use “invnorm” on your calculator:
  invnorm(percentage, mean, std. dev.)
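For readers without the calculator, the two functions can be imitated in Python. The helpers below are hypothetical stand-ins written for this sketch, not calculator code, and `invnorm` assumes the percentage is the area to the LEFT of the value being sought.

```python
from statistics import NormalDist

def normalcdf(left, right, mean=0.0, sd=1.0):
    """Area under the normal curve between left and right."""
    model = NormalDist(mean, sd)
    return model.cdf(right) - model.cdf(left)

def invnorm(area_to_left, mean=0.0, sd=1.0):
    """Value with the given area to its left (assumed convention)."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

# Share of values between 85 and 115 when mean=100 and std. dev.=15.
share = normalcdf(85, 115, 100, 15)

# Value that cuts off the lowest 97.5% of a standard normal model.
cutoff = invnorm(0.975)

print(round(share, 4), round(cutoff, 2))
```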

Scatter Plots
A scatter plot shows n pairs of bivariate data observations on an X-Y graph
A scatter plot is usually the starting point for bivariate data analysis
We create scatter plots to investigate the relationship between two variables:
  Direction
  Form
  Strength

Correlation
In our discussion of correlation (and regression), we will be talking about paired sample data
A correlation exists between 2 variables when one of them is related to the other in some way
The linear correlation coefficient, r, measures the strength of the LINEAR relationship between two variables
Before you calculate r, the following should hold:
  Quantitative variables condition
  “Straight Enough” condition
  Outlier condition

Correlation Properties
The value of r is always between -1 and 1, inclusive. That is, -1 <= r <= 1.
The value of r is not affected by which variable is called x and which is called y
r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
Correlation is sensitive to outliers
Correlation does not imply causality!
Correlation does not measure slope
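Several of these properties can be checked with a hand-written correlation function. The paired data (`hours`, `scores`) are invented, and `correlation` is a helper defined here for illustration.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Linear correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]   # hypothetical x values
scores = [2, 4, 5, 4, 5]  # hypothetical y values

r = correlation(hours, scores)

# Swapping the roles of x and y does not change r.
r_swapped = correlation(scores, hours)

# Changing units (hours to minutes) changes the slope but not r.
r_rescaled = correlation([60 * h for h in hours], scores)

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```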

Regression
If 2 variables have a “significant” linear correlation, it is appropriate to estimate their exact linear relationship; regression does this
A regression estimates a and b so that the linear relationship between x and y can be expressed as: $\hat{y} = a + bx$
Note that $\hat{y}$ is the PREDICTED value of y. Thus, you can use this equation to predict values of y for given values of x (though not for all values of x)
The residual for any data point is: $e = y - \hat{y}$
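A minimal least-squares sketch follows. It assumes the line is written as y-hat = a + b*x, with a as the intercept and b as the slope; that labeling is an assumption of this example, and some calculators and texts label the coefficients the other way around. The data and the helper `fit_line` are made up.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

hours = [1, 2, 3, 4, 5]   # hypothetical x values
scores = [2, 4, 5, 4, 5]  # hypothetical y values

a, b = fit_line(hours, scores)

# Predicted values and residuals (observed y minus predicted y).
predicted = [a + b * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

print(round(a, 2), round(b, 2))
print([round(e, 2) for e in residuals])
```

For a least-squares line the residuals sum to zero, which is a quick sanity check on the fit.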

Regression (Cont.)
When predicting a value of y based on some given value of x, do the following:
  If there is NOT a linear correlation, the best predicted y-value is the sample average of y
  If there IS a linear correlation, the best predicted y-value is found using the regression equation
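The decision rule above could be sketched as follows. Whether a linear correlation is present is passed in as a flag decided by the analyst, since the slides do not specify a test here. The function `predict_y`, the data, and the a + b*x form of the line are all assumptions of this example.

```python
from statistics import mean

def predict_y(x_new, xs, ys, linear_correlation):
    """Best predicted y for x_new under the rule described above."""
    if not linear_correlation:
        # No linear correlation: fall back to the sample average of y.
        return mean(ys)
    # Linear correlation: use the least-squares line y-hat = a + b*x.
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a + b * x_new

hours = [1, 2, 3, 4, 5]   # hypothetical x values
scores = [2, 4, 5, 4, 5]  # hypothetical y values

with_line = predict_y(4, hours, scores, linear_correlation=True)
with_mean = predict_y(4, hours, scores, linear_correlation=False)

print(round(with_line, 2), with_mean)
```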
