BA 598 Stats Workshop Nairanjana (Jan) Dasgupta

Professor, Dept. of Math and Stats; Boeing Distinguished Professor of Mathematics and Science; Director, Center for Interdisciplinary Statistics Education and Research (CISER); Washington State University, Pullman, WA

Part 1: CISER: What we do and how we can help Graduate Students

Statistical Help at WSU: Center for Interdisciplinary Statistics Education and Research (CISER)

Assistance types

Part 2: Logic and Statistics of Univariate Analysis

Topics for the Univariate Stats: what, how and why
Types of Data
Why do we collect data?
Population versus Sample
Experiments, observational studies (which ones are common in your field)
Exploratory studies versus confirmatory studies
The Idea of inference
Distinction: Uni-variate, Bi-variate, Multi-variate, multiple (with example data sets)
Graphical Summary
Numerical summary
From center to one sample tests
The idea of testing and p-values
Random variables and distributions

Data: GOOD, BAD or the culprit?
I will misquote Samuel Taylor Coleridge here (his quote was about water), when the ancient mariner was stuck in the middle of the ocean: Data data everywhere, and it really makes us blink Data, data everywhere but we’ve gotta stop and think… This is the theme of today’s lecture: understanding data, types of data and what we can and cannot do with data. The Mathematics that we need to understand to deal with data…

Types of Data Let us start with some information about you:
What is your major?
How many Statistics classes have you taken?
On a scale of 1 to 5, rate your liking for Stats.
Your score in the last Stats class you took
Your average blood pressure when faced with a Stats problem

Types of data: Data is either Numerical (Discrete or Continuous) or Categorical (Nominal).

What are these? Nominal: Name, category. Discrete: what you count. Continuous: what you measure. Ordinal is in some no-man’s land: mostly categorical but with a numerical flavor. Which type is each of these?
What is your major?
How many Statistics classes have you taken?
On a scale of 1 to 5, rate your liking for Stats.
Your score in the last Stats class you took
Your blood pressure when faced with a Stats problem

The big question is: WHY did I collect this information?
The reason we collected this information was to get some idea about all of you so I could come up with a plan of what to talk about, how much detail I go into etc. So, the idea is I take the data I collected and LEARN something from this data? Data by itself is just a bunch of numbers or categories and by itself it doesn’t mean much.

Why collect Data? In the past, to answer a specific question or questions. However, nowadays data is often collected without a specific objective in mind, just because everything is data oriented and collection is easy. But in general we have a purpose for collecting data: a specific question or questions. We want to learn something that we do not know from data.

Population versus Sample
Our question of interest is almost always about the big picture: something hard to study, but knowing it would help us make decisions. Population: Sum total of all individuals and objects in a study. Sample: Part of the population selected for the study. Idea: Get a good sample, study this sample carefully and infer about the population based on the sample. Sample: make sure you stress that if we don’t study it, it is not part of the sample.

Where does statistics come in?
We need to take a GOOD sample: what counts as good is defined by Statistics. What type of sample can I take? Is my study observational or experimental? Is it exploratory or confirmatory? How exactly can I “infer” attributes about the population (parameter) based on a sample (statistic)? The information I took on you: observational or experimental?

Experiments versus Observational Studies
Experiments: You change the environment to see the effect your change had, trying to control all other potential factors that might affect your study. Observational Study: You study the environment as is, and collect data on all possible factors that might be of interest to you. Talk about designing the study where we are comparing three brands of cake mixes. Discuss Type of mix, temperature, oven effect, the placement effect

Exploratory versus Confirmatory studies
Exploratory studies: We do not have an idea about what we expect to find. So we study a bunch of factors to see how they affect what we are studying. It can be an experimental or an observational study (though generally observational). Confirmatory studies: We generally have an idea what we are expecting to find and do a very focused study to give credence to our beliefs. It can be an experimental or an observational study (though generally experimental). Keep in mind we cannot use data collected in an exploratory study to confirm our belief or hypothesis. Liken this to a fishing expedition and give examples of these kinds of study.

Inference: Use data and statistical methods to infer (get an idea, resolve a belief, estimate) something about the population based on results of a sample. Inference splits into Estimation (Point Estimation, Interval Estimation) and Hypothesis Testing.

Estimation: We have no idea at all about the parameter of interest and we use the sample information to get an idea of (estimate) the population parameter. Point Estimation: Using a single point to estimate the population parameter. Interval Estimation: We use an interval of values to estimate the population parameter.

Hypothesis Testing: We have some idea about a population parameter. We want to test this claim based on data we collect.

The Details: Summarizing, inferring and random variables

Summarizing Data: Statistics will help you understand how we summarize different kinds of data. Let us go back to the questions we started with: What is your major? How many Statistics classes have you taken? On a scale of 1 to 5, rate your liking for Stats. Your score in the last Stats class you took. Your blood pressure when facing a Stats problem. Let us think about answering these questions and address things like experimental, observational, exploratory, confirmatory, univariate, bivariate and multivariate in this context.

Univariate, Bivariate and Multivariate
Univariate: ONLY one variable is measured or discussed at a particular time. Bivariate: Two variables are measured or discussed together. Multivariate: Multiple variables are measured and discussed together. What we collected here is multivariate data.

Response and Explanatory Variables
For the bivariate and multivariate case we can have two different types of scenarios: (1) The variables are equally important. (2) We are REALLY interested in one variable but collect the others to understand the variable of interest. The one that we are really interested in is called the RESPONSE variable. The others are called EXPLANATORY variables; they are collected to explain the response. The response variable is taken to be a RANDOM variable (or stochastic). Explanatory variables are assumed to be constants.

Multiple versus Multivariate
These are associated with whether we have multiple response or explanatory variables. If we have multiple response variables and we are equally interested in them: multivariate. If we have ONE response variable and multiple EXPLANATORY variables: multiple. Give examples here: if we are interested in all the variables we collected, it is multivariate. If we want to predict the Stats score based on the others, we call it multiple regression.

Analyzing UNIVARIATE data
With all the terminology in place, let us consider the simplest case: ONE response variable. We can graph it, do estimation or test a hypothesis about this ONE response. How we do it and what we do depend upon the data type we have.

Summarizing Univariate Categorical Data
What would be some methods used to summarize categorical data? Graphical summary and numerical summary: What graphs are relevant for univariate categorical data? Pie chart Bar chart Line chart etc…

High lead concentration and % of whites and Zip Codes

Summarizing Categorical data: Univariate
What numerical summaries would be relevant for categorical data: For example let us take your MAJORs How would you summarize all the information in our data by one (or a few numbers)? Idea of Central Tendency: Most naturally arising data has a tendency to clump in the middle of the range of possible values.

Measures of Central Tendency
How does one measure what is happening in the CENTER of the data? Thoughts? Mean Median Mode

Mean, Median, Mode What are the physical interpretations of these:
Center of Gravity Middle-most point Most frequently appearing number, category or group

Pictures of mean, median and mode

Measuring Central tendency for Categorical Data
When dealing with your majors: Does Mean make sense? Does Median make sense? How about mode?
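As a rough illustration of the three questions above, here is a minimal Python sketch. The `majors` and `scores` lists are hypothetical answers invented for the example, not data collected in the workshop. It shows that the mode is defined for category labels, while mean and median need numbers.

```python
from statistics import mean, median, mode

# Hypothetical survey answers (invented for illustration only).
majors = ["Finance", "Marketing", "Finance", "Accounting", "Finance", "Marketing"]
scores = [72, 85, 90, 85, 64, 98, 85]

# The mode only needs to count labels, so it works for categorical data.
major_mode = mode(majors)

# Mean and median require arithmetic or ordering, so they apply to the
# numerical scores; calling mean(majors) would raise a TypeError.
score_mean = mean(scores)
score_median = median(scores)
score_mode = mode(scores)

print("Most common major:", major_mode)
print("Score mean, median, mode:", round(score_mean, 2), score_median, score_mode)
```

For nominal data such as majors, the mode is therefore the natural measure of center.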

Numerical Data: Summary
Graphical summary: Box plots Histograms

Summarizing Lead Data over all 8 zipcodes

Another example: Variable N Mean StDev Minimum Median Maximum
Diameter

Numerical Data Let us consider numerical data:
Which measure of central tendency makes sense here? Which would you prefer? Mean Median Mode

More Summarization: Measure of Center, provide us with a first step for summarizing data. But often it cannot differentiate between data sets. Consider the following data sets: Set1: Set2: Set3: They all have same mean and median=40. Are they identical? What makes them different?

Other summary measures
Measures of Spread Shapes of the distribution of data Where is the peak Measures of symmetry Percentiles

Measures of Spread Standard Deviation Variance Range
Inter-quartile range Median Absolute Deviation

Standard Deviation:
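The sample standard deviation is commonly defined as s = sqrt( sum of (x_i - mean)^2 / (n - 1) ). The sketch below computes it alongside the other spread measures listed earlier. The three data sets are hypothetical stand-ins chosen so that each has mean and median 40; they are not the sets from the original slide.

```python
from statistics import mean, median, stdev, variance, quantiles

def spread_summary(data):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4)   # quartiles, default 'exclusive' method
    med = median(data)
    return {
        "mean": mean(data),
        "median": med,
        "sd": stdev(data),             # sample SD: divides by n - 1
        "variance": variance(data),    # sample variance: sd squared
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        # Median absolute deviation: median distance from the median.
        "mad": median([abs(x - med) for x in data]),
    }

# Hypothetical data sets, all with mean = median = 40.
set1 = [40, 40, 40, 40, 40]
set2 = [30, 35, 40, 45, 50]
set3 = [0, 20, 40, 60, 80]

summaries = [spread_summary(s) for s in (set1, set2, set3)]
for name, summary in zip(("Set1", "Set2", "Set3"), summaries):
    print(name, summary)
```

The centers agree across the three sets, but every spread measure separates them, which is why a measure of center alone cannot distinguish data sets.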

Inference for univariate response
We are interested in some attribute of a population (a parameter). Let us use 2 examples here: Interested in the proportion of students who are finance majors among all the students in CCOB. Interested in the average score of all students in CCOB in the last stats class they took.

Population and Parameter
Interested in the proportion of students who are finance majors among all the students in CCOB. Interested in the average score of all students in CCOB in the last stats class they took.

Sample and Statistic: The data we collected gives us the proportion of students in THIS group who are finance majors. It also gives us the average score in their last stats class for students in this class. Is this a sample? A good sample? What is the Statistic?

Estimation: If we use the sample statistic to get an idea of the population parameter, what we are doing is inference, specifically ESTIMATION. What assures us that the sample statistic will be a good estimate of the population parameter? This leads us to unbiasedness and precision. Do the bull’s-eye plot here.

Point Estimation: The idea of point estimation seems intuitive: we use the sample value for the population value. The reason we can do this is that we make certain assumptions about the probability distribution of the sample statistic. Generally we assume that the sampling scheme we pick gives the statistic an unbiased, high-precision distribution. If our method is indeed unbiased, then the expectation (the population mean) of our sample statistic equals the population parameter.
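To make unbiasedness concrete, here is a small sketch that enumerates every possible sample rather than simulating. The four-value `population` is hypothetical, and the sampling scheme assumed is simple random sampling of size 2 without replacement.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of four stats scores (illustration only).
population = [70, 80, 90, 100]
population_mean = mean(population)

# Every possible sample of size 2, each equally likely under
# simple random sampling without replacement.
sample_means = [mean(sample) for sample in combinations(population, 2)]

# The expectation of the sample mean is the average over all samples.
expected_sample_mean = mean(sample_means)

print("Population mean:", population_mean)
print("Sample means:", sample_means)
print("Expectation of the sample mean:", expected_sample_mean)
```

Individual sample means range from 75 to 95, yet their average matches the population mean: any single estimate can miss, but the method does not miss systematically.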

Interval Estimation Even if we believe in the unbiasedness of our estimator, we still often want an interval rather than just a single value for the estimates. This allows us to have interval estimation. This technique takes into account the spread as well as the distribution in the estimation. It gives us an interval of values, in which we feel that our parameter is contained with high confidence. Talk about the fact that the interval is random rather than the parameter. Liken this to trying to capture a target with a horseshoe

Confidence Interval: In general a confidence interval for the population mean is given by: Sample mean ± margin of error. The question is: how does one calculate the “margin of error”? Answer: we need distributions and random variables to do that.
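One common way to get the margin of error, assuming a large sample so that the normal distribution applies, is z × (standard deviation / √n). The sketch below is one possible version; the function name `mean_confidence_interval` and the numbers passed to it are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample (normal-based) confidence interval for a mean."""
    # z cuts off (1 - confidence) / 2 in each tail of the standard normal.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_sd / sqrt(n)   # margin of error
    return sample_mean - margin, sample_mean + margin, margin

# Hypothetical summary numbers: mean 88, SD 22, sample of 100.
low, high, margin = mean_confidence_interval(88, 22, 100)
print(f"95% CI: {low:.2f} to {high:.2f} (margin of error {margin:.2f})")
```

A higher confidence level or a larger standard deviation widens the interval, while a larger sample narrows it.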

Hypothesis Testing: We have some knowledge about the parameter: a claim, a warranty, what we would like it to be. We test the claim using our data. First step: formulating the hypothesis. Always a pair: Research and Nullification of research (affectionately called H0 and Ha).

How to formulate your hypothesis
First state your claim or research. If we believe that the mean stat score of students in CCOB in their last Stats class is over 80%, let us pose this as a claim. Mean > 80 What nullifies this? Mean ≤ 80 (Remember the “=“ always resides with the null)

Logic of Testing: To actually test the hypothesis, what we try to do is to disprove or reject the null hypothesis. If we can reject the null, we conclude in favor of our Ha (which is our research). Think of how the legal system works: H0: Not Guilty, Ha: Guilty.

How do we do this? We take a sample and look at the sample values. Then we ask: if the null were true, would this be a likely value of the sample statistic? If our observed value is not a likely value, we reject the null. How likely or unlikely a value is, is determined by the sampling distribution of that statistic.

Example In our example we were interested in the hypothesis about the average score in their last Stats class: H0: µ≤80 Ha: µ > 80 If we observed a sample with a mean of 88 and a standard deviation of 2 from a sample of 100 would you consider the null likely?? How about if the mean was 88 and the standard deviation was 22 from a sample of 100?
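One way to compare the two scenarios is to measure how many standard errors the sample mean lies above the null value of 80. The sketch below does this; the helper name `standardized_distance` is made up for the illustration.

```python
from math import sqrt

def standardized_distance(sample_mean, null_mean, sample_sd, n):
    """Number of standard errors between the sample mean and the null mean."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - null_mean) / standard_error

# Scenario 1: mean 88, SD 2, n = 100.
z_small_sd = standardized_distance(88, 80, 2, 100)
# Scenario 2: mean 88, SD 22, n = 100.
z_large_sd = standardized_distance(88, 80, 22, 100)

print("SD = 2: ", round(z_small_sd, 2), "standard errors above 80")
print("SD = 22:", round(z_large_sd, 2), "standard errors above 80")
```

The same sample mean of 88 is far more surprising under the null when the standard deviation is 2 than when it is 22, although both are several standard errors above 80.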

Errors in testing: Since we make our decisions about the parameter based on sample values, we are likely to commit some errors. Type I error: Rejecting H0 when it is true. Type II error: Failing to reject H0 when Ha is true. In any given situation we want to minimize these errors. P(Type I error) = α, also called the size or level of significance. P(Type II error) = β. Power = 1 - β: HERE we reject H0 when the claim (Ha) is true. We want power to be LARGE.
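As a sketch of how the two error probabilities relate, the function below computes the power of a one-sided test of H0: mean ≤ null value under simplifying assumptions not stated in the slides: the sample mean is normally distributed and the standard deviation is treated as known. The function name and the example numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(null_mean, true_mean, sd, n, alpha=0.05):
    """Power of the one-sided z-test that rejects for large sample means."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)            # cutoff fixing Type I error at alpha
    shift = (true_mean - null_mean) / (sd / sqrt(n))  # true effect in standard errors
    # P(reject H0) when the true mean is true_mean.
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical setting: null mean 80, SD 22, n = 100.
for true_mean in (80, 82, 85, 88):
    power = power_one_sided_z(80, true_mean, 22, 100)
    print(f"true mean {true_mean}: P(reject H0) = {power:.3f}, beta = {1 - power:.3f}")
```

When the true mean equals the null value, the rejection probability is just α (the Type I error rate). As the true mean moves further into Ha, power rises and β falls.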

Example: I am introducing a new drug into the market. The drug may have some serious side effects. Before I do so, I will go through tests to see if it is effective in curing disease. H0: drug is not effective. Ha: drug is effective. What is a Type I error and a Type II error in this case? Which is worse? More importantly, think of the consequences of these errors.

One more example: Ann Landers, in her advice column on the reliability of DNA testing for determining paternity, advises, “To get a completely accurate result you would have to be tested, so would the man and your mother. The test is 100% accurate if the man is NOT your father, and 99.9% accurate if he is.” Consider the hypotheses: H0: a particular man is the father. Ha: a particular man is not the father. Discuss the probabilities of Type I and Type II errors.

Decision Making using Hypotheses:
In general, this is the way we make decisions. The idea is we want to minimize both Type I and II errors. However, in practice we cannot minimize both errors simultaneously. What is done is that we fix our Type I error at some small level, e.g. 0.1, 0.05 or 0.01. Then the Type II error is minimized for this fixed value, which gives the most powerful test. So in solving a hypothesis problem, we formulate our decision rule using the fixed value of Type I error. The cutoff that defines the decision rule is called the CRITICAL VALUE.

How does rejection of null work with Critical values?
Here we calculate from the sample a statistic whose distribution we know. Then we compare this value with the distribution of the statistic under the null, allowing ourselves a Type I error of alpha. Based on this, if our observed value is beyond our critical value, we feel justified in rejecting the null. CRITICISM: Choice of alpha is arbitrary. We can make alpha big or small depending on what we want our outcome to be… Draw the picture here.
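A minimal sketch of the critical-value rule for H0: µ ≤ 80 versus Ha: µ > 80 follows. It assumes a large sample so that the normal distribution applies, and it reuses the sample summary from the earlier example (mean 88, standard deviation 22, n = 100). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def critical_sample_mean(null_mean, sd, n, alpha=0.05):
    """Smallest sample mean that leads to rejecting H0 at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # upper-tail cutoff of the standard normal
    return null_mean + z_crit * sd / sqrt(n)

observed_mean = 88
for alpha in (0.10, 0.05, 0.01):
    cutoff = critical_sample_mean(80, 22, 100, alpha)
    decision = "reject H0" if observed_mean > cutoff else "fail to reject H0"
    print(f"alpha = {alpha}: critical value {cutoff:.2f}, {decision}")
```

The cutoff moves as alpha changes, which is exactly the arbitrariness the criticism points to: a borderline observed value can be rejected at one alpha and not at another.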

P-values: Sometimes hypothesis testing can be thought to be subjective. This is because the choice of α may alter a decision. Hence it is thought that one should report p-values and let the readers decide for themselves what the decision should be. The p-value, or probability value, is the probability, computed assuming the null is true, of getting a value as extreme as or more extreme than the one observed. If this probability is small, then our observed value is unlikely under the null and we should reject the null. Otherwise we cannot reject the null.
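For a one-sided test of H0: µ ≤ 80 versus Ha: µ > 80, the p-value can be sketched as an upper-tail normal probability. This assumes a large sample and uses the boundary value µ = 80 to represent the null; the function name is hypothetical and the inputs repeat the earlier example (mean 88, standard deviation 22, n = 100).

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(sample_mean, null_mean, sd, n):
    """P(sample mean at least this large) if the true mean equals null_mean."""
    z = (sample_mean - null_mean) / (sd / sqrt(n))
    return 1 - NormalDist().cdf(z)

p = one_sided_p_value(88, 80, 22, 100)
print(f"p-value = {p:.5f}")

# A sample mean exactly at the null value gives no evidence against H0.
print(f"p-value at the null mean = {one_sided_p_value(80, 80, 22, 100):.2f}")
```

Reporting the p-value itself lets a reader apply whichever alpha they consider appropriate.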

Criticism of p-values As more and more people used p-values and with an effort to guard against the premise that “we can fool some of the people some of the time”, journals started having strict rules about p-value. To publish you needed to show small p-values. No SMALL p-values no publication… So, let us now take a look at distribution theory, understand one of the most misunderstood concepts the RANDOM variable and then understand p-values…

Random Variables and Distributions

Random Variable: Any numerical realization of a random experiment results in a Random Variable. In layman’s terms, it is a number which has a certain probability attached to it. Consider a very simple random variable, y, the number of heads when we toss 4 coins simultaneously. The outcomes are: TTTT TTTH TTHT THTT HTTT TTHH THTH THHT HTHT HHTT HTTH THHH HTHH HHTH HHHT HHHH

Random Variable and Probabilities
Hence, the random variable y takes the possible values 0, 1, 2, 3, 4, with the probabilities f(y). We say f(y) denotes the mass function of y.
y1    P(Y=y1)=f(y1)
0     1/16
1     4/16
2     6/16
3     4/16
4     1/16
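The mass function can be checked by listing all 16 outcomes and counting heads, as in this short sketch. It assumes fair coins, so every outcome is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^4 = 16 equally likely outcomes of tossing 4 fair coins.
outcomes = list(product("HT", repeat=4))

# y = number of heads in each outcome; count how often each value occurs.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# f(y) = (number of outcomes with y heads) / (total number of outcomes).
pmf = {y: Fraction(count, len(outcomes)) for y, count in sorted(head_counts.items())}

for y, probability in pmf.items():
    print(y, probability)
```

`Fraction` prints values in lowest terms, so 4/16 appears as 1/4 and 6/16 as 3/8; the probabilities add up to 1.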

Probabilities of Discrete, categorical and Continuous data
So here the RANDOM variable is y (the number of heads in the 4 tosses), and for each realization y1 of the random variable we have a probability or likelihood attached to it. The probability is calculated based on how many possibilities there were (Sample Space) and how many supported the event we are interested in. We also made a tacit assumption that all the outcomes were equally likely (i.e., we were just as likely to get heads as tails). Probability for discrete and categorical outcomes is attached to a specific value of the random variable.

Continuous Random variables
For continuous random variables, probability is not attached to individual points on the number line, as there would be an infinite number of points. So here we think of probabilities over an interval. In general these variables are identified by their probability density function. The most common continuous distribution, especially in Stats, is the NORMAL distribution.

If y is our random variable we often use a curve to depict f(y)

What does f(y) look like for Normal?
Let Y follow a Normal distribution with mean $\mu$ and standard deviation $\sigma$: $f(y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}$. If $Y^* = a + bY$, then $Y^*$ follows a Normal distribution with $E(Y^*) = a + b\mu$ and $\mathrm{Var}(Y^*) = (b\sigma)^2$.

Properties of Normal Distribution:
Actually this also holds true if Y asymptotically follows a Normal distribution by a result called the Mann-Wald Theorem. A normal distribution with mean 0 and variance 1 is called the STANDARD normal distribution.
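Standardizing is the special case of the linear transformation a + bY with a = -mean/sd and b = 1/sd, which produces mean 0 and variance 1. The sketch below checks this numerically, along with the usual normal density formula. The values 80, 10 and 95 are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(y, mu, sigma):
    """Normal density: exp(-(y - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical parameters and point of interest.
mu, sigma, y = 80, 10, 95
Y = NormalDist(mu, sigma)

# Standardize: z = (y - mu) / sigma, i.e. a + b*y with a = -mu/sigma, b = 1/sigma.
z = (y - mu) / sigma

prob_original = Y.cdf(y)              # P(Y <= 95) on the original scale
prob_standard = NormalDist().cdf(z)   # P(Z <= 1.5) on the standard normal scale

print("z =", z)
print("P(Y <= y) =", round(prob_original, 4), "and P(Z <= z) =", round(prob_standard, 4))
print("density at y:", normal_pdf(y, mu, sigma))
```

Because the two probabilities agree, one standard normal table (or function) is enough for every normal distribution.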

Other Distributions: Chi-square

Other Distributions: t and F

In the middle of it all is the CLT:

Back to inference We want to understand something about the population
So we take a sample. The standardized sample mean approximately follows a normal distribution as long as the sample size is large. So, for the time being, as long as the sample size is large, we can do inference for the population mean. If the sample size is not large, we can use the t-distribution (provided the data themselves are roughly normal).
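A quick simulation can make the large-sample claim tangible. This sketch draws repeated samples from a uniform distribution, which is clearly not normal, standardizes each sample mean, and checks how often it falls within ±1.96, where a standard normal puts about 95% of its probability. The sample size, number of replications and seed are arbitrary choices.

```python
import random
from math import sqrt

def standardized_sample_means(n, replications, seed=0):
    """Standardized means of repeated Uniform(0, 1) samples of size n."""
    rng = random.Random(seed)
    mu, sigma = 0.5, sqrt(1 / 12)   # mean and SD of Uniform(0, 1)
    results = []
    for _ in range(replications):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        results.append((sample_mean - mu) / (sigma / sqrt(n)))
    return results

z_values = standardized_sample_means(n=50, replications=2000)
share_within_196 = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)

print("Share of standardized means within ±1.96:", share_within_196)
```

The share should come out close to 0.95, even though the individual observations are uniform rather than normal.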

Critical regions and p-values
The normal distribution and the t-distribution allow us to test by letting us calculate the critical regions. They also allow us to calculate p-values.

For next time: Finish up actually calculating p-values and critical values. Learn how to summarize data in R (numerically and graphically) Learn how to do hypothesis testing, confidence intervals in R.
