1
PDI Data Literacy: Busting Myths of Big Data
Nairanjana (Jan) Dasgupta Professor, Dept. of Math and Stats Boeing Distinguished Professor of Math and Science Director, Center of Interdisciplinary Statistics Education and Research (CISER) Washington State University, Pullman, WA
2
Part 1: Data in general
Data: big or small
Source of data
Types of data
Population versus sample
Experiments versus observational studies
Exploratory and confirmatory studies
3
Part 2: Making sense of data
Distinction: univariate, bivariate, multivariate, multiple
Graphical summary of data
Numerical summary of data
Measures of center
Measures of spread
Measures of dimensionality
Summarizing multivariate data sets using clusters
Population versus sample: what do we have data on?
4
Part 3: Making decisions from data
Going from sample to population
Inference and decision making
Estimation and intervals
Testing and confidence intervals
Errors in testing: Type I and Type II
Power
Statistical significance
P-values: good, bad or misused
ASA's statement about p-values
5
Part 4: Big data and its pros and cons
What are the advantages of big data?
What do we mean by big? Big n or big p
Decision making with big data
Predictive analytics
Back to population versus sample
Overview and recap
6
Part 1: Data and its collection
Types of Data: the good, the bad and the ugly
7
Statistics and Data: Statistics has often been defined as the science (or art) of collecting, compiling, summarizing, analyzing and inferring from data. By this definition, it is the science (or art) that is meant to deal with data. So what is data science? The science dealing with big data?
8
Big Data: the elephant in the room
9
BIG Data? What is big data? How big is big?
Is it necessarily a good thing? Some thoughts on how we deal with it from the Statistics point of view. We will start and end with this topic, but the middle will focus mostly on "good" data.
10
Some facts about BIG data
Most of the time big data is generated, not collected. There is no study design associated with its collection. It is often unclear what we want it to tell us: we are often taking a stab in the dark. I would like to coin the phrase "opportunistic data" for big data that is not collected with a specific aim in mind, like social media data or phone data.
11
BIG Data: Some thoughts
Not much ACTUAL data analysis, as the challenge is to manage and extract the data. Mostly pretty pictures and "dashboards". Now an obsession with decision makers: a buzzword. Having more doesn't solve the problem if the data are not "GOOD" to start with. More problems with BIAS, as the data are not collected in a systematic way. Issues with dimensionality. Extreme problems of multiple testing and false positives.
12
Sobering Thoughts: Some Findings from BIG data that didn’t gel:
Prediction of flu outbreaks was off by a factor of 2.
Academy Award predictions were more often wrong than right.
For the 2016 election, Forbes predicted: "If you believe in Big Data analytics, it's time to begin planning for a Hillary Clinton presidency and all that entails."
Many other examples…
13
Data: GOOD, BAD or the culprit?
I will misquote Samuel Taylor Coleridge here (his lines were about water, when the ancient mariner was stuck in the middle of the ocean):
Data, data everywhere, and it really makes us blink;
Data, data everywhere, but we've got to stop and think…
This is the theme of today's lecture: understanding data, types of data, what we can and cannot do with data, and the Mathematics/Statistics we need in order to deal with data.
14
How can data be used? Just because we have data, does it mean we know something? Can we say anything at an individual level? I want to turn to W. H. Auden to address this question…
15
THE UNKNOWN CITIZEN BY W. H. AUDEN
(To JS/07 M 378: This Marble Monument Is Erected by the State)
He was found by the Bureau of Statistics to be One against whom there was no official complaint, And all the reports on his conduct agree That, in the modern sense of an old-fashioned word, he was a saint, For in everything he did he served the Greater Community. Except for the War till the day he retired He worked in a factory and never got fired, But satisfied his employers, Fudge Motors Inc. Yet he wasn't a scab or odd in his views, For his Union reports that he paid his dues, (Our report on his Union shows it was sound) And our Social Psychology workers found That he was popular with his mates and liked a drink. …
16
The Unknown Citizen… contd.
He was married and added five children to the population, Which our Eugenist says was the right number for a parent of his generation. And our teachers report that he never interfered with their education. Was he free? Was he happy? The question is absurd: Had anything been wrong, we should certainly have heard. From Another Time by W. H. Auden.
17
Data collected because or just because
There is a distinction between collecting data with a specific objective in mind and using existing data that is already available. Even a few years ago data was expensive and valuable, and collected mostly with specific objectives in mind. Now there is a deluge of data available for use, as it is being collected ANYWAY.
18
Data Source: Stats is well equipped to deal with
19
Sources of Opportunistic data:
20
Let us start with collecting some data
Let us start with some information about you:
What is your department or unit?
How many Statistics classes have you taken?
On a scale of 1 to 5, rate your liking for Statistics.
Your average blood pressure when faced with a Stats problem.
There is no source that I know of where such data are available, so let us physically collect it.
21
Types of data: Numerical (Discrete, Continuous) and Categorical (Nominal, Ordinal)
22
What are these?
Nominal: name, category
Discrete: what you count
Continuous: what you measure
Ordinal is in some no-man's land: mostly categorical, but with a numerical flavor.
What is your unit?
On a scale of 1 to 5, rate your liking for Stats.
How many Statistics classes have you taken?
Your blood pressure when faced with a Stats problem.
23
How to collect this data?
Questions 1-3 are easily self-reported. How about question 4? Let us think about that.
24
The big question is: WHY did I collect this information?
The reason we collected this information was to get some idea about all of you, so I could come up with a plan for what to talk about, how much detail to go into, etc. So the idea is that I take the data I collected and LEARN something from it. Data by itself is just a bunch of numbers or categories, and by itself it doesn't mean much. What we need to do is figure out how we LEARN from data.
25
Why collect data? In the past, to answer a specific question or questions. Nowadays, however, data is often collected without a specific objective in mind, just because everything is data oriented and collection is easy: internet, cell phone, credit card, Walmart, etc. But in general we have a purpose for collecting data, a specific question or questions. We want to learn something from data that we do not already know.
26
Population versus Sample
Our question of interest is almost always about the big picture: something hard to study, but something that, if known, would help us make decisions. Population: the sum total of all individuals and objects in a study. Sample: the part of the population selected for the study. Idea: get a good sample, study this sample carefully, and infer about the population based on the sample. The data I just collected: is this a population or a sample? (Sample: make sure to stress that if we don't study it, it is not the sample.)
27
Population, Sample, Inference
28
Where does statistical science come in?
If we could always study the population directly, we wouldn't need Statisticians except for clerical jobs like summarizing… If we are relying on samples, we need to take a GOOD sample: the "good" is defined by Statistics. What type of sample can I take? How exactly can I "infer" attributes of the population (parameters) based on a sample (statistics)? Caveat: the sample needs to be REALLY representative of the population. The information I took on you: observational or experimental?
29
A few sample types
30
Experiments versus Observational Studies
Experiments: You change the environment to see the effect your change had, trying to control all other potential factors that might affect your study. Observational study: You study the environment as is, and collect data on all possible factors that might be of interest to you. Questions 1-3 are definitely observational. Question 4 could have been an experiment, but was probably observational. Talk about designing a study comparing three brands of cake mixes. Discuss type of mix, temperature, oven effect, and placement effect.
31
Experiment versus Observational study
32
Questions to think about:
If you had a choice: would you want to do an experiment or an observational study? Why? Is BIG data always better data?
33
Why does it matter what type of study we conduct?
It matters because how we proceed to analyze the data should differ according to the type of study we had. Nowadays it is common to have data collected "just because" (opportunistic data), and these are the extreme type of observational study. As the data were collected without any specific aim, it is hard to figure out whether they are a population or a sample.
34
Exploratory versus Confirmatory studies
Exploratory studies: We do not have an idea about what we expect to find, so we study a bunch of factors to see how they affect what we are studying. Can be experimental or observational (though generally observational). Confirmatory studies: We generally have an idea of what we are expecting to find and do a very focused study to give credence to our beliefs. Can be experimental or observational (though generally experimental). Keep in mind that we cannot use data collected in an exploratory study to confirm our belief or hypothesis. Liken this to a fishing expedition, and give examples of these kinds of studies.
35
Exploratory versus confirmatory studies
Exploratory studies generate hypotheses, and afterwards we often know or expect certain patterns. We confirm these using confirmatory studies, though in practice people often skip the confirmatory studies…
36
For big data: what are we trying to learn?
This is one of the harder questions about big data. Is it a population that we have, or is it a sample? If the latter, then what is the population? How can this sample be used to infer about a population when no frame was used to draw the sample? As statisticians and data scientists, we need to think about this.
37
Part 1: recap It matters what TYPE of data we have.
It matters how the data were collected.
It matters whether we have a population or a sample.
It matters whether you randomized the process of data collection.
If the population is studied, all you need to do is summarize; with a sample we need to think about inference.
If we are really dealing with a population when we talk about big data, then all we need to do is visualize and summarize. No inference required.
38
Worksheet for Part 1: What type of data are the following:
Zip code, height, phone number, yearly income, size of family. If we randomly choose 50 apple trees in an orchard and measure their height, canopy cover, and number of apples: Would this be an observational study or an experiment? Would the data be univariate or multivariate? Would this be an exploratory study or a confirmatory one?
39
Part 2: Summarization: Making sense of numbers
40
Summarizing Data Let us go back to the questions we started with:
What is your unit? On a scale of 1 to 5, rate your liking for Stats. How many Statistics classes have you taken? Your blood pressure when facing a Stats problem. Let us think about answering these questions and address things like univariate, bivariate and multivariate in this context.
41
Univariate, Bivariate and Multivariate
Univariate: ONLY one variable is measured or discussed at a particular time. Bivariate: two variables of equal importance are measured or discussed together. Multivariate: multiple variables are measured and discussed together; each variable is of equal importance and collected using the same general method. Most data we collect are multivariate in nature, but we can choose to discuss one variable at a time. Not a great idea, but often done in science. What we collected here is multivariate data.
42
Response and Explanatory Variables
For the bivariate and multivariate cases we can have two different scenarios: ALL variables are equally important, or we are REALLY interested in one variable but collect the others to understand the variable of interest. The one we are really interested in is called the RESPONSE variable. The others are called explanatory variables; they are collected to explain the response. The response variable is taken to be a RANDOM (stochastic) variable. Explanatory variables are assumed non-random.
43
Multiple versus Multivariate
These terms are associated with whether we have multiple responses or multiple explanatory variables. If we have multiple response variables and we are equally interested in them: multivariate. If we have ONE response variable and multiple EXPLANATORY variables: multiple. Examples: if we are interested in all the variables we collected, the analysis is multivariate. If we want to predict the Stats score based on the others, we call it multiple regression.
44
Analyzing UNIVARIATE data
With all the terminology in place, let us consider the simplest case: ONE response variable. We can graph it and summarize it if the data we have constitute a population. We can conduct inference if it is a sample. How and what we do depends upon the data type we have; hence, knowledge of data types is crucial.
45
Summarizing Univariate Categorical Data
What would be some methods used to summarize categorical data? Graphical summary and numerical summary. What graphs are relevant for univariate categorical data? Pie chart, bar chart, line chart, etc.
46
Numerical Data: Summary
Graphical summary: box plots, histograms
47
Snapshot (out of 100 entries)
[Table with columns Unit, Classes_taken, Scale, Blood_Pressure; row values not reproduced here]
48
Graphs and Charts for our data
Pie chart for the units; bar chart for classes taken
49
Histogram and box plots
Histogram for Scale; box plot summarizing BP
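Below is a minimal sketch of how one might draw these kinds of charts in Python with pandas and matplotlib. The survey values are made up for illustration; they are not the class data from the slides.

```python
# Hypothetical survey responses (not the slides' data) and the four chart types
# discussed above: pie chart, bar chart, histogram and box plot.
import pandas as pd
import matplotlib.pyplot as plt

survey = pd.DataFrame({
    "unit": ["A", "B", "B", "C", "A", "D", "B", "A"],           # categorical
    "classes_taken": [0, 2, 1, 3, 2, 0, 1, 2],                  # discrete
    "scale": [3, 4, 2, 5, 4, 1, 3, 4],                          # ordinal liking (1-5)
    "bp": [118, 130, 125, 140, 122, 135, 128, 131],             # continuous
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
survey["unit"].value_counts().plot.pie(ax=axes[0, 0], title="Units (pie chart)")
survey["classes_taken"].value_counts().sort_index().plot.bar(
    ax=axes[0, 1], title="Classes taken (bar chart)")
survey["scale"].plot.hist(ax=axes[1, 0], bins=5, title="Liking scale (histogram)")
survey.boxplot(column="bp", ax=axes[1, 1])
axes[1, 1].set_title("Blood pressure (box plot)")
plt.tight_layout()
plt.show()
```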
50
Summarizing Categorical data: Univariate
What numerical summaries would be relevant for categorical data? For example, let us take your MAJORs. How would you summarize all the information in our data with one (or a few) numbers? Idea of central tendency: most naturally arising data have a tendency to clump in the middle of the range of possible values.
51
Measures of Central Tendency
How does one measure what is happening in the CENTER of the data? Thoughts? Mean Median Mode
52
Mean, Median, Mode What are the physical interpretations of these:
Center of Gravity Middle-most point Most frequently appearing number, category or group
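As a quick illustration, here is a minimal Python sketch of the three measures of center, using the standard library on made-up numerical and categorical samples (not the class data).

```python
# Mean = center of gravity, median = middle-most point, mode = most frequent value.
from statistics import mean, median, mode

classes_taken = [0, 1, 1, 2, 2, 2, 3, 5]                        # numerical (discrete)
majors = ["Stats", "Biology", "Stats", "Engineering", "Stats"]  # categorical

print(mean(classes_taken))    # 2.0
print(median(classes_taken))  # 2.0
print(mode(classes_taken))    # 2
print(mode(majors))           # 'Stats' -> the only sensible center for categorical data
```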
53
Pictures of mean, median and mode
54
Measuring Central tendency for Categorical Data
When dealing with your majors: Does Mean make sense? Does Median make sense? How about mode?
55
Numerical Data Let us consider numerical data:
Which measure of central tendency makes sense here? Which would you prefer? Mean Median Mode
56
More summarization: Measures of center provide us with a first step for summarizing data, but often they cannot differentiate between data sets. Consider the following data sets: Set 1: … Set 2: … Set 3: … They all have the same mean and median (40). Are they identical? What makes them different?
57
Other summary measures
Measures of spread; shape of the distribution of the data; where is the peak; measures of symmetry; percentiles
58
Measures of Spread Standard Deviation Variance Range
Inter-quartile range Median Absolute Deviation
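A small sketch of the spread measures listed above, computed with NumPy and SciPy on made-up numbers (median_abs_deviation assumes a reasonably recent SciPy).

```python
# Standard deviation, variance, range, inter-quartile range and MAD for one sample.
import numpy as np
from scipy import stats

x = np.array([20, 30, 40, 50, 60])

sd = np.std(x, ddof=1)                  # sample standard deviation
var = np.var(x, ddof=1)                 # sample variance
rng = x.max() - x.min()                 # range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                           # inter-quartile range
mad = stats.median_abs_deviation(x)     # median absolute deviation

print(sd, var, rng, iqr, mad)
```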
59
Summarizing Bivariate and Multivariate data
Categorical: frequency tables. Numerical: scatter plots. Many new types of visualization are coming up because of "big data". Most of these dashboards use a combination of known techniques, like graphs, with a few bells and whistles thrown in.
60
Table of Major and Scale
[Frequency table of Major (rows A through H, plus All) by Scale; counts not reproduced here]
61
Scatter Plot
62
Measures of Association
For bivariate data we can calculate measures of association: Kendall's tau (for ordinal data) and the correlation coefficient (for numerical data). For our data, Kendall's tau for Major and Scale is .07, and the correlation coefficient between BP and Classes is .03.
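A sketch of both association measures using SciPy. The paired values below are stand-ins, so they will not reproduce the .07 and .03 computed from the class data.

```python
# Kendall's tau for ordinal pairs, Pearson correlation for numerical pairs.
import numpy as np
from scipy import stats

scale = np.array([3, 4, 2, 5, 4, 1, 3, 4])       # ordinal liking score
classes = np.array([1, 2, 0, 3, 2, 0, 1, 2])     # classes taken
bp = np.array([118, 130, 125, 140, 122, 135, 128, 131])

tau, tau_p = stats.kendalltau(scale, classes)    # Kendall's tau and its p-value
r, r_p = stats.pearsonr(bp, classes)             # Pearson correlation and its p-value
print(tau, r)
```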
63
Summarizing multivariate data
The following are loosely considered summarizing methods: Principal Component Analysis and Clustering.
64
What is PCA? This is a MATHEMATICAL procedure that transforms a set of correlated responses into a smaller set of uncorrelated variables called PRINCIPAL COMPONENTS. It essentially gives us an idea of the dimensionality of the data and is used for summarizing the overlap in the variables. Uses: data screening, clustering, discriminant analysis.
65
Objectives of PCA: reduce dimensionality, or rather try to understand the TRUE dimensionality of the data, and identify "meaningful" variables. If you have a VARIANCE-COVARIANCE MATRIX S, PCA returns new variables called principal components that are uncorrelated; the first component explains MOST of the variability, and the remaining PCs explain decreasing amounts of variability.
66
Caveats: The whole idea of PCA is to transform a set of correlated variables into a set of uncorrelated variables; hence, if the data are already uncorrelated, there is not much additional advantage in doing PCA. One can do PCA on the correlation matrix or on the covariance matrix.
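A minimal PCA sketch with scikit-learn on simulated data; standardizing first corresponds to doing PCA on the correlation matrix, one of the two choices mentioned above.

```python
# PCA returns uncorrelated components ordered by the variability they explain.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 observations, 4 variables
X[:, 1] += 0.8 * X[:, 0]                 # induce some correlation

Z = StandardScaler().fit_transform(X)    # standardize -> PCA on the correlation matrix
pca = PCA().fit(Z)

print(pca.explained_variance_ratio_)     # first component explains the most variability
scores = pca.transform(Z)                # the uncorrelated principal components
```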
67
What is Clustering? Clustering is an EXPLORATORY statistical technique used to break up a data set into smaller groups or “clusters” with the idea that objects within a cluster are similar and objects in different clusters are different. It uses different distance measures between units of a group and across groups to decide which units fall in a group.
68
Clustering in general: The idea is to group observations that are "similar" based on predefined criteria. Clustering can be applied to rows and/or columns. Clustering allows for reordering of the rows/columns to make the data suitable for visualization. Although fundamentally an exploratory tool, clustering is firmly embedded in many people's minds as the statistical method for the analysis…
69
Why cluster? There are very few formal theories about clustering, though intuitively the idea is: clusters should show internal cohesion and external isolation. Time-course experiments are often clustered to see if there are developmental similarities. Useful for visualization. Generally considered appropriate in typical clinical experiments.
70
How is "closeness" decided?
For clustering we generally need two ideas. Distance: the metric used to measure how far apart two points are. Linkage: how each group of observations is condensed into a single representative point.
72
Types of Clustering Hierarchical and Non-hierarchical methods:
Non-hierarchical (partitioning): start with an initial set of cluster seed points and then build clusters around those points, using one of the distance measures. If a cluster is too large, it can be split into smaller ones. Hierarchical: observed data points are grouped into clusters in a nested sequence of groups.
75
Dendrograms: The dendrogram should be interpreted with care; remember each branch of the dendrogram is really like a mobile and can rotate without altering the mathematical structure of the tree. Neighboring nodes are "close" ONLY if they lie on the same branch. It has been proposed that one should slice the tree and look at the clusters produced therein. However, WHERE to cut the tree is subjective and there is no consensus about this. Issue: mistakes made early have no way of being corrected later in this approach.
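A sketch of hierarchical clustering and a dendrogram with SciPy on simulated data; cutting the tree with fcluster is the subjective step discussed above.

```python
# Hierarchical clustering = a distance + a linkage; the dendrogram shows the nesting.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="average", metric="euclidean")   # linkage and distance choices
dendrogram(Z)
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")         # cut the tree into 2 clusters
print(labels)
```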
76
Some remarks on clustering- 1
Simplistically, clustering cannot fail. That is, every clustering method will return clusters, whether the data are organized in clusters or not. Clustering helps to group / order information and is a visualization tool for learning about the data. However, clustering results do not provide any kind of “proof” of anything.
77
Some remarks-II: One of the more paradoxical aspects of clustering is that it gets used in practice, even when class labels are available, instead of using a discrimination method. The idea is that it is somehow seen as less "biased" to demonstrate the ability of the data to produce the class differences without using class labels. When the inferred clusters largely coincide with the known classes, this is thought to "validate" the class labels. The illogicality and inefficiency of this process does not seem to have become widely appreciated. One sees different "classifiers" (e.g. different gene sets) compared w.r.t. their ability to separate known classes simply by inspecting the clusterings they produce, rather than by building classifiers.
78
Comparing clusters:
79
Back to Population versus Sample
The data that we collected: was it a population or was it a sample? IF it is a population, ALL we need to do is summarize it or graph it; we have studied the population, hence we do not need to do any more. So the question now is: what do we do with the data that we have? Do we stop now (we are done summarizing), or should we do something more?
80
Recap of Part 2
Use pie charts and bar graphs for visualizing univariate categorical data.
Use box plots and histograms for univariate numerical data.
For bivariate data we can do scatter plots.
For multivariate data we can do clusters.
For numerical data use the mean and median as measures of center.
For categorical data use the mode.
Use the standard deviation and IQR for the spread of numerical data.
For categorical data use frequency plots or tables to summarize.
Summarization allows us to make sense of raw data and is crucial before we analyze it.
81
Worksheet for Part 2: How would you summarize, numerically and graphically, the measures from the apple trees: height, canopy cover, number of apples, type of apple? There are 3 different varieties of apple in your data, and when you cluster the apples you get the following dendrogram. Do you think your belief that Varieties 2 and 3 are more similar than Varieties 1 and 3 is justified?
82
Dendrogram
83
Overview: What we wanted to learn and what we did learn
84
Take home messages (hopefully)
There are several types of data and each type has its own nuances.
Big data is still data and can be BAD data. Big isn't always better.
The concept of population versus sample.
Experimental, observational and opportunistic studies.
Exploratory and confirmatory studies.
Distinction between univariate and multivariate.
When summarizing, type and dimension matter.
Graphs are worth a thousand words.
85
Part 3: Inference and analysis: Making decisions from data
86
When and Why we need inference
IF the data we collected were really a population, we do not need to do any inference. If, however, we assume this was just a sample from a larger population of ALL WSU graduate students, staff and faculty, then we need to take this sample and use it to make inferences about the population.
87
Inference for univariate response
Let us again consider this from a univariate, bivariate and multivariate standpoint, for both categorical and numerical data. What are the questions we might want answered for the variable MAJOR? What about BP?
88
Inference: use data and statistical methods to infer (get an idea of, resolve a belief about, estimate) something about the population based on the results of a sample. Inference breaks down into Estimation (Point Estimation and Interval Estimation) and Hypothesis Testing.
89
Estimation: We have no idea at all about the parameter of interest, and we use the sample information to get an idea of (estimate) the population parameter. Point estimation: using a single point to estimate the population parameter. Interval estimation: using an interval of values to estimate the population parameter.
90
Hypothesis testing: We have some idea about a population parameter.
We want to test this claim based on data we collect. Probably one of the most used and most misunderstood methods in science. This provides us with the "dreaded p-value".
91
Parameter: To understand inference, we really need to get a very clear idea of what a parameter is. By definition, a parameter is a numerical characteristic (or characteristics, if multivariate) of the population that is of interest. To make this specific, let us consider the following example. Example: we are interested in the average number of statistics classes students have taken when they come to graduate school at WSU.
92
Population and Parameter:
Here the population is all graduate students at WSU. We have to be careful here: is it all CURRENT graduate students, or all students past, present and future? To make matters easy, let us say it is CURRENT graduate students. The parameter is the average number of statistics classes taken by the students.
93
Sample and statistic: Our choices are to do a census and then compute the average from the entire census (this is the parameter), or to take a sample and calculate the number FROM the sample. If we use the sample and compute the number from the sample, we call the sample average our STATISTIC.
94
How do we sample? Here we need to think of how we sample this very well-defined population. Thoughts? Hallmarks of a good sample: representative, unbiased, reliable.
95
Estimation: Complete Ignorance about parameter
If we use the sample statistic to get an idea of the population parameter, what we are doing is inference, specifically ESTIMATION. What assures us that the sample statistic will be a good estimate of the population parameter? This leads us to unbiasedness and precision. (Do the bull's-eye plot here.)
96
Where does Statistics come in?
97
Point estimation: The idea of point estimation seems intuitive: we use the sample value for the population value. The reason we can do this is that we make certain assumptions about the probability distribution of the sample statistic. Generally we assume that the sampling scheme we pick gives us an unbiased, high-precision distribution for the statistic. If our method is indeed unbiased, then the sample mean should, on average, be a good estimator of the population mean.
98
Interval estimation: Even if we believe in the unbiasedness of our estimator, we often want an interval rather than just a single value for the estimate. This is interval estimation. The technique takes the spread as well as the distribution into account. It gives us an interval of values in which we feel our parameter is contained with high confidence. Talk about the fact that the interval is random rather than the parameter; liken this to trying to capture a target with a horseshoe.
99
Confidence interval: In general a confidence interval for the population mean is given by: sample mean ± margin of error. The question is: how does one calculate the "margin of error"? Answer: we need distributions and random variables to do that. This means some mathematics and probability theory.
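Here is a sketch of "sample mean ± margin of error" as a 95% confidence interval, using a t critical value from SciPy; the numbers are invented for illustration.

```python
# 95% CI for a mean: xbar +/- t_crit * (s / sqrt(n)).
import numpy as np
from scipy import stats

x = np.array([2, 3, 1, 4, 2, 5, 3, 2, 4, 3])   # e.g., number of stats classes taken
n = len(x)
xbar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)                # standard error of the mean

t_crit = stats.t.ppf(0.975, df=n - 1)          # 97.5th percentile of t with n-1 df
margin = t_crit * se
print(f"95% CI: {xbar - margin:.2f} to {xbar + margin:.2f}")
```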
100
Confidence interval Used quite a bit in the past
Gives similar information to hypothesis tests and can often be inverted to construct tests. However, it is theoretically quite different, as here we talk about the SIZE of the effect rather than the significance of the effect. With all the bad press that p-values have received, confidence intervals might make a comeback.
101
Hypothesis Testing: some idea about the parameter
We have some knowledge about the parameter: a claim, a warranty, what we would like it to be. We test the claim using our data. First step: formulating the hypotheses. Always a pair: the research hypothesis and the nullification of the research (affectionately called Ha and H0).
102
How to formulate your hypothesis
First state your claim or research hypothesis. Let us say we believe that the average number of stats classes taken by graduate students coming into WSU is greater than 2. Here our parameter is the population mean number of statistics classes taken by graduate students at WSU. Claim: mean > 2. What nullifies this? Mean ≤ 2. (Remember, the "=" always resides with the null.)
103
Logic of Testing To actually test the hypothesis, what we try to do is to disprove or reject the null hypothesis. If we can reject the null, by default our Ha (which is our research) is true. Think of how the legal system works: H0: Not Guilty Ha: Guilty
104
How do we do this? We take a sample and look at the sample values. Then we ask: if the null were true, would this be a likely value of the sample statistic? If our observed value is not a likely value, we reject the null. How likely or unlikely a value is, is determined by the sampling distribution of that statistic.
105
Example: In our example we were interested in the hypothesis about the average number of Statistics classes taken: H0: µ ≤ 2, Ha: µ > 2. If we observed a sample with a mean of 4 and a standard deviation of 2 from a sample of 100, would you consider the null likely? How about if the mean was 4 and the standard deviation was 20 from a sample of 100?
106
Players in decision making
Your observed statistic, your sample size, your observed standard deviation, and your capacity to find, probabilistically, how likely the observed value is under the null.
107
Errors in testing: Since we make our decisions about the parameter based on sample values, we are likely to commit some errors. Type I error: rejecting H0 when it is true (false positive). Type II error: failing to reject H0 when Ha is true (false negative). In any given situation we want to minimize these errors. P(Type I error) = α, also called the size or level of significance. P(Type II error) = β. Power = 1 − β: here we reject H0 when Ha is true. We want power to be LARGE. Power is the TRUE positive we want.
108
Example: I am introducing a new drug into the market. The drug may have some serious side effects. Before I do so, I will run tests to see if it is effective in curing the disease. H0: the drug is not effective. Ha: the drug is effective. What are the Type I and Type II errors in this case? Which is worse? More importantly, think of the consequences of these errors.
109
One more example: Ann Landers, in her advice column on the reliability of DNA testing for determining paternity, advises: "To get a completely accurate result you would have to be tested, so would the man and your mother. The test is 100% accurate if the man is NOT your father, and 99.9% accurate if he is." Consider the hypotheses: H0: a particular man is the father. Ha: a particular man is not the father. Discuss the probabilities of Type I and II errors.
110
Decision Making using Hypotheses:
In general, this is the way we make decisions. The idea is that we want to minimize both Type I and Type II errors. However, in practice we cannot minimize both errors simultaneously. What is done is to fix our Type I error at some small level, e.g. 0.1, 0.05 or 0.01, and then find the test that minimizes our Type II error for this fixed level of Type I error. This gives us the most powerful test. So in solving a hypothesis-testing problem, we formulate our decision rule using the fixed value of Type I error. The cutoff in the decision rule is also called the CRITICAL VALUE.
111
How does rejection of null work with Critical values?
Here we calculate, from the sample, a statistic whose distribution we know. Then we compare this value against the distribution of the sample statistic, allowing ourselves a Type I error of alpha. Based on this, if our observed value is beyond our critical value, we feel justified in rejecting the null. CRITICISM: the choice of alpha is arbitrary. We can make alpha big or small depending on what we want our outcome to be… (Draw the picture here.)
112
P-values: Elephant in the room
Sometimes hypothesis testing can be thought of as subjective, because the choice of α-values may alter a decision. Hence it is thought that one should report p-values and let the readers decide for themselves what the decision should be. The p-value, or probability value, is the probability of getting a value at least as extreme as the one observed, assuming the null is true. If this probability is small, then our observed value is an unlikely one under the null and we should reject the null. Otherwise we cannot reject the null.
113
P-value for our example
For the hypothesis we talked about earlier: H0: µ ≤ 2, Ha: µ > 2. If we observed a mean of 4 and a standard deviation of 2 from a sample of 100, would you consider the null likely? How about if the mean was 4 and the standard deviation was 20 from a sample of 100? P-value = P(Z > (4 − 2)/(2/√100)) = P(Z > 10) < .001. P-value = P(Z > (4 − 2)/(20/√100)) = P(Z > 1) ≈ .16.
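The two p-values above can be reproduced with SciPy's normal distribution; this is a sketch of the large-sample z calculation, not the software used for the slides.

```python
# Upper-tail p-value for a one-sample z test of H0: mu <= mu0 vs Ha: mu > mu0.
from scipy import stats

def z_pvalue(xbar, mu0, sd, n):
    z = (xbar - mu0) / (sd / n ** 0.5)
    return stats.norm.sf(z)          # P(Z > z)

print(z_pvalue(4, 2, 2, 100))        # z = 10, p-value < .001
print(z_pvalue(4, 2, 20, 100))       # z = 1,  p-value about .16
```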
114
Criticism of p-values: As more and more people used p-values, and in an effort to guard against the premise that "we can fool some of the people some of the time", journals started having strict rules about p-values. To publish, you needed to show small p-values: no SMALL p-values, no publication… This has often led to the publication of ONLY significant results. It has also led to a "let us get the p-value small by hook or by crook" attitude.
115
ASA statement about p-values
A p-value really tells us how incompatible the data are with a specified statistical model. P-values do not measure the probability that the studied hypothesis is true, or that the data were produced by random chance alone. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. Proper inference requires full reporting and transparency. The p-value does NOT measure the size of an effect or the importance of a result. By itself, a p-value cannot provide a good measure of evidence regarding a model or hypothesis.
116
A simple example: [regression output table with columns Term, Coef, SE Coef, T-Value, P-Value for the Constant and X terms; values not reproduced here]
117
Power: The other elephant in the room
Power is the TRUE positive: in other words, the probability that you would reject the null under a specified value of the alternative. So first we need to figure out what value of the alternative we choose to calculate the power. This choice is up to us, and we often call it the effect size.
118
Example of power: For the hypothesis we talked about earlier: H0: µ ≤ 2,
Ha: µ > 2. Using the standard deviation of 2 and the sample size of 100 from the earlier example, calculate the power when µ = 2.5, 3, 3.5, 4. With α = 0.05 we reject H0 when the sample mean exceeds 2 + 1.645(2/√100) = 2.33, so Power(µ = 2.5) = P(Z > (2.33 − 2.5)/0.2) = P(Z > −0.85) ≈ 0.80, and so on for the other values of µ.
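A sketch of the same power calculation in Python, assuming (as above) α = 0.05, σ = 2 and n = 100; these assumptions are not stated on the original slide.

```python
# Power of the one-sided z test H0: mu <= 2 vs Ha: mu > 2 at several alternatives.
from scipy import stats

def power(mu_alt, mu0=2.0, sigma=2.0, n=100, alpha=0.05):
    se = sigma / n ** 0.5
    crit = mu0 + stats.norm.ppf(1 - alpha) * se     # reject H0 when xbar > crit
    return stats.norm.sf((crit - mu_alt) / se)      # P(xbar > crit | mu = mu_alt)

for mu in (2.5, 3.0, 3.5, 4.0):
    print(mu, round(power(mu), 3))
# roughly 0.80 at mu = 2.5, and essentially 1 for the larger effect sizes
```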
119
Power in pictures
120
What are the players for power?
Sample size, effect size, and standard deviation. So to really calculate power one needs to have data, to understand the distribution and have a feel for the standard deviation. A priori ("pre-hoc") power calculation is often "trying to fool some of the people some of the time".
121
Recap of Part 3
To make inferences, our population of interest and parameter need to be well defined.
Errors exist in testing and have to be considered.
P-values measure the incompatibility of the observed data with the null hypothesis and cannot be used to PROVE anything.
To calculate power we need to look at values under the alternative, and this can be subjective.
122
Worksheet for Part 3: True or false: Type I error is always the worst, hence we should focus on controlling it rather than Type II error. You believe that the average time it takes students to walk from one class to another at WSU is more than the 10 minutes you are allotted. Write out your null and alternative hypotheses. Write down what the Type I error would be in this context. You test it and get a p-value of .13; does this indicate that the null is true?
123
Part 4: Big data, its pros and cons
125
What determines big data: the 5 V's
Volume: considered too large for regular software.
Variety: often a mix of many different data types.
Velocity: the extreme speed at which the data are generated.
Variability: inconsistency of the data set.
Veracity: how reliable the data are.
126
How big is big? By big we mean the volume is such that it is hard to analyze on a single computer. That in itself shouldn't be problematic, but requiring specialized machines for the analysis has added to the myth and enigma of big data. The problem with big data, at least as I see it, is that some very pertinent statistical questions are bypassed when dealing with it.
127
Some statistical thoughts?
Is the big data a sample or a population? If it is really a population, then analysis means constructing summary statistics; this is bulky but not too difficult. If it is a sample, what was the sampling frame? If no population was considered when collecting the data, it is definitely not a representative sample. So should one really do inference on BIG data? And if one is allowed to do inference, wouldn't the sheer size of the data give us so much power that we could justify pretty much any decision we test for?
128
Structure of data: Generally, most data sets are rectangular in nature, with p variables and n observations. For the data we collected, p = 4 and n = 100. In big data we often have many more predictors than observations (the big p problem), or many more (orders of magnitude more) observations than predictors (the big n problem), or both n and p are big and fluid, as the data are constantly updated and amassed.
129
The Variety and Velocity Piece
Generally opportunistic data are a mix of categorical, discrete, ordinal and continuous variables. So if we treat them as multivariate we have to think about how to proceed; while not trivial, this can be surmounted without too much difficulty. The big issue is that the data are often being amassed (I am intentionally not using "collected") at a faster rate than they can actually be analyzed and summarized.
130
Variability and Veracity
This type of data is extremely variable and there is no systematic model in place to capture the components of variability. Modeling is very hard when you have no idea about the sources of variability in these types of data sets. Veracity: is it measuring what we think it is? How truthful is this data? Just because it is big, is it really good? We all need to ponder these questions and thoughts.
131
Visualization of big data
Often called dashboards. Really a collection of well-known, age-old graphs that most of you can do in Excel! It is really just summary data in pretty colors. Don't be fooled by the fancy terms.
132
Example of a Dashboard.
133
Analysis versus Inference
As the whole question of whether it is a sample or a population is itself muddy, let us leave inference out for now and focus on analysis. A common analysis method associated with opportunistic data is predictive analytics.
134
Predictive Analytics and big data
Encompasses prediction models, machine learning and data mining for prediction of the unknown using the history of the past. If we are predicting, are we inferring? I will assume it is okay to do that. It exploits patterns found in historical data and allows assessment of the risk associated with a particular set of conditions. Credit scoring has used predictive analytics for a long time; however, there, at least in the past, sampling was done to perform inference.
135
Techniques used in Predictive Analytics or supervised learning
Analytical methods (modeled by humans): regression techniques, logistic regression, time series models, survival or duration analysis, classification and discrimination, regression trees.
Machine learning methods (done by machines, no explicit model): neural networks, multilayer perceptron, radial basis functions, support vector machines, naïve Bayes, k-nearest neighbors, geospatial predictive modeling.
137
Supervised learning: The idea is learning from a known data set to predict the unknown. Essentially, we know the class labels ahead of time. What we need to do is find a RULE, using features in the data, that DISCRIMINATES effectively between the classes, so that if we have a new observation with its features we can correctly classify it. Machine learning uses this idea, and so it is very popular now. I will briefly cover this topic, mostly with examples.
138
Example 1: Turkey thief. There was a legal case in Kansas where a turkey farmer accused his neighbor of stealing turkeys from the farm. When the neighbor was arrested and the police looked in the freezer, there were multiple frozen turkeys there. The accused claimed these were WILD turkeys that he had caught. A statistician was called in to give evidence, as there are some biological differences between domestic and wild turkeys. So a biologist measured the bones and other body characteristics of domestic and wild turkeys, and the statistician built a DISCRIMINANT function. They used the classification function to see whether the turkeys in the freezer fell into the WILD or the DOMESTIC class. THEY ALL fell in the DOMESTIC class!
139
Steps
Discriminant Analysis: selection of features; model fitting; model validation using prediction of known classes.
Machine Learning: feature selection is done by the computer; no model, but the computer determines the functions of the predictors used; the model is validated based on prediction of known classes.
140
Feature selection: Find which of the observed variables can actually distinguish between the classes of interest. This is variable selection. Don't be confused or awed when people throw terms like Elastic Net or Lasso at you in this context; these are fairly straightforward methods for model selection.
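A hedged sketch of lasso-based variable selection with scikit-learn on simulated data; LassoCV and the "informative features" below are illustrative choices, not the lecture's example.

```python
# Lasso shrinks unhelpful coefficients to exactly zero, so the non-zero ones
# are the "selected" variables.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))                           # 20 candidate features
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)     # only features 0 and 3 matter

lasso = LassoCV(cv=5).fit(X, y)                          # penalty chosen by cross-validation
selected = np.flatnonzero(lasso.coef_ != 0)              # features with non-zero coefficients
print(selected)
```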
141
MODEL FITTING. Commonly used: LDA, QDA, K-Nearest Neighbors.
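A minimal sketch of fitting the three classifiers named above with scikit-learn, on simulated two-class data (stand-ins for, say, wild versus domestic turkeys).

```python
# LDA, QDA and k-nearest neighbors fit to labeled data, then used to classify.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)              # known class labels

for model in (LinearDiscriminantAnalysis(),
              QuadraticDiscriminantAnalysis(),
              KNeighborsClassifier(n_neighbors=5)):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:2]))   # classify two observations
```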
142
Without models we can use Machine Learning methods
Neural networks Naïve Bayes Support Vector machines
144
Validation: See how well the classifiers classify the observations into the different classes. The most commonly used method is leave-one-out cross-validation, though test data sets (holdout samples) and resubstitution are still used.
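A sketch of leave-one-out cross-validation with scikit-learn: each observation is held out once and predicted from the rest. The data and the choice of classifier here are illustrative.

```python
# LOOCV accuracy for a discriminant-analysis classifier on simulated data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(2, 1, (30, 3))])
y = np.array([0] * 30 + [1] * 30)

scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(scores.mean())    # proportion of held-out observations classified correctly
```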
145
Recap of Part 4: The sticky problem is whether the data we have are a sample or a population. Inference is tough, as it is hard to figure out what population we are inferring about. Predictive analytics is often associated with big data. At the end of the day, machines are faster and more efficient but cannot create interpretative models (not yet). We still don't know whether big data is good data; it depends upon who is collecting it and for what purpose.
146
Worksheet for Part 4: A genome-wide association study was undertaken to see if we could identify the SNPs (single nucleotide polymorphisms) that differ between two groups, using a case-control setup. Each group had 5000 units and we looked at 1 million SNPs in each case. We also collected other control data from both groups. Would you consider this big data? Would you consider it opportunistic data? Do you think we can do inference in this case?
147
Overview: What we wanted to learn and what we did learn
148
Take home messages (hopefully)
There are several types of data and each type has its own nuances.
The concept of population versus sample.
Experimental, observational and opportunistic studies.
Exploratory and confirmatory studies.
Distinction between univariate and multivariate.
When summarizing, type and dimension matter.
Graphs are worth a thousand words.
149
More take home messages
Estimation and testing.
Errors in testing.
Decision making, power and p-values.
Big data is still data, and it can be bad data.
Machine learning and Statistics.
Dashboards are not only in cars; they are a data visualization method used by "analytics" firms to produce well-known graphs.
150
Jargon not to be awed by:
Elastic net, lasso, bagging, dashboard, machine learning, neural networks, feature selection, deep learning. These are all statistical terms that are easy to explain and well known, and they are used to fool some of the people some of the time.
151
Myth of big data: There is no myth; it is just unwieldy, unstructured, undesigned data that is already being amassed. It still has to be good data for us to make good analyses and predictions. At the end of the day, to make inferences from data (big or small) we need the data to be representative.
152
Part 6: CISER: What we do and how we can help graduate students, post-docs and faculty
153
Statistical Help at WSU: Center for Interdisciplinary Statistics Education and Research (CISER)
Our mission: education, collaborative research, building a community
154
Assistance types