PDI Data Literacy: Busting Myths of Big Data: Part 2


1 PDI Data Literacy: Busting Myths of Big Data: Part 2
Nairanjana (Jan) Dasgupta, Professor, Dept. of Math and Stats; Boeing Distinguished Professor of Math and Science; Director, Center for Interdisciplinary Statistics Education and Research (CISER), Washington State University, Pullman, WA

2 Part 1: What we covered in September 2018
Types of data
Population versus sample
Exploratory and confirmatory studies
Experiments versus observational studies
Distinction: uni-variate, bi-variate, multi-variate, multiple
Graphical/numerical summary of data
Measures of center and spread
Measures of dimensionality

3 Synopsis of Part 1
It matters what TYPE of data we have.
It matters how the data were collected.
It matters whether we have a population or a sample.
It matters whether you randomized the process of data collection.
If the population is studied, all you need to do is summarize; with a sample, we need to think of inference.
If we are really dealing with a population when we talk about big data, then all we need to do is visualize and summarize. No inference required.

4 Synopsis continued
Use pie charts and bar graphs for visualizing univariate categorical data.
Use box plots and histograms for univariate numerical data.
For bivariate data we can do scatter plots; for multivariate data we can do clusters.
For numerical data use the mean and median as measures of center; for categorical data use the mode.
Use the standard deviation or IQR for the spread of numerical data; for categorical data use a frequency plot or table.
Summarization allows us to make sense of raw data and is crucial before we analyze it.

5 Synopsis continued
Population versus sample: what do we have data on?
Big data is often opportunistic data, and so an extreme form of observational study.
However, if we assume we have a sample and we want to answer questions about a population, we need to think inference.
Today's discussion will be on inference.

6 Today's topics: Inference: Making decisions from data
Going from sample to population
Inference and decision making
Estimation and intervals
Testing and confidence intervals
Errors in testing: Type I and Type II
Power
Statistical significance
P-value: good, bad or misused
ASA's statement about p-values

7 Today's topics continued: Big data and its pros and cons
What are the advantages of big data?
What do we mean by big? Big n or big p
Decision making with big data
Predictive analytics
Back to population versus sample
Overview and recap

8 Part 2, Section 1: Inference and analysis: Making decisions from data

9 When and why we need inference
If the data we collected really constitute a population, we do not need to do any inference.
Let us consider the question: what is the average number of statistics classes taken by students before entering graduate school?
If we consider the audience, would this be a population or a sample? Would my current audience be a GOOD sample?

10 Inference
Use data and statistical methods to infer (get an idea, resolve a belief, estimate) something about the population based on the results of a sample.
Inference branches into:
Estimation (point estimation and interval estimation)
Hypothesis testing

11 Estimation
We have no idea at all about the parameter of interest, and we use the sample information to get an idea of (estimate) the population parameter.
Point estimation: using a single point to estimate the population parameter.
Interval estimation: using an interval of values to estimate the population parameter.

12 Hypothesis testing
We have some idea about a population parameter, and we want to test this claim based on data we collect.
Probably one of the most used and most misunderstood methods in science.
This provides us with the "dreaded p-value".

13 Parameter
To understand inference, we really need a very clear idea of what a parameter is.
By definition, a parameter is a numerical characteristic (or characteristics, if multivariate) of the population that is of interest.
Let us go back to our example: we are interested in the average number of statistics classes students have taken when they come to graduate school here.

14 Population and parameter
Here the population is all graduate students at WSU.
We have to be careful: is it all CURRENT graduate students, or all students past, present and future? To make matters easy, let us say it is CURRENT graduate students.
The parameter is the average number of statistics classes taken by these students.

15 Sample and statistic
Our choices are to do a census and compute the average from the entire census (this is the parameter), or to take a sample and calculate the number FROM the sample.
If we compute the number from the sample, we call the sample average our STATISTIC.

16 How do we sample?
Here we need to think of how we sample this very well defined population. Thoughts?
Hallmarks of a good sample: representative, unbiased, reliable.

17 Estimation: complete ignorance about the parameter
If we use the sample statistic to get an idea of the population parameter, what we are doing is inference, specifically ESTIMATION.
What assures us that the sample statistic will be a good estimate of the population parameter? This leads us to unbiasedness and precision.
(Presenter note: do the bull's-eye plot here.)

18 Where does Statistics come in?

19 Point estimation
The idea of point estimation seems intuitive: we use the sample value for the population value.
The reason we can do this is that we make certain assumptions about the probability distribution of the sample statistic.
Generally we assume that the sampling scheme we pick gives the statistic an unbiased, high-precision distribution.
If our method is indeed unbiased, then the sample mean should, on average, be a good estimator for the population mean.

20 Interval estimation
Even if we believe in the unbiasedness of our estimator, we often want an interval rather than just a single value for the estimate. This is interval estimation.
This technique takes into account the spread as well as the distribution of the estimator.
It gives us an interval of values in which, with high confidence, we feel our parameter is contained.
(Presenter note: talk about the fact that the interval is random rather than the parameter; liken this to trying to capture a fixed target with a horseshoe.)

21 Confidence interval
In general, a confidence interval for the population mean is given by:
    sample mean ± margin of error
Question: how does one calculate the "margin of error"?
Answer: we need distributions and random variables to do that. This means some mathematics and probability theory.
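
A minimal sketch of this computation in Python. The data are simulated for illustration, and the normal-based margin of error is an assumption, not something the slide prescribes:

```python
import numpy as np
from scipy import stats

# Hypothetical data: number of statistics classes reported by 100 students
rng = np.random.default_rng(0)
sample = rng.poisson(lam=2.5, size=100)

xbar = sample.mean()                             # sample mean
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
z = stats.norm.ppf(0.975)                        # ~1.96 for 95% confidence

margin = z * se                                  # margin of error
print(f"95% CI: ({xbar - margin:.2f}, {xbar + margin:.2f})")
```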

22 Confidence intervals
Used quite a bit in the past.
Give similar information to hypothesis tests, and often can be inverted to construct tests.
However, theoretically they are quite different, as here we talk about the SIZE of the effect rather than the significance of the effect.
With all the bad press that p-values have received, confidence intervals might make a comeback.

23 Hypothesis testing: some idea about the parameter
We have some knowledge about the parameter: a claim, a warranty, what we would like it to be.
We test the claim using our data.
First step: formulating the hypotheses.
Always a pair: the research hypothesis and the nullification of the research (affectionately called Ha and H0).

24 How to formulate your hypotheses
First state your claim or research question. Let us say we believe that the average number of stats classes taken by graduate students coming into WSU is greater than 2.
Here our parameter is the population mean number of statistics classes taken by graduate students at WSU.
Claim: mean > 2. What nullifies this? Mean ≤ 2.
(Remember: the "=" always resides with the null.)

25 Logic of testing
To test the hypothesis, what we try to do is disprove or reject the null hypothesis.
If we can reject the null, by default our Ha (which is our research) is true.
Think of how the legal system works: H0: not guilty; Ha: guilty.

26 How do we do this?
We take a sample and look at the sample values. Then we ask: if the null were true, would this be a likely value of the sample statistic?
If our observed value is not a likely value, we reject the null.
How likely or unlikely a value is, is determined by the sampling distribution of that statistic.

27 Example
In our example we were interested in the hypothesis about the average number of classes taken by incoming graduate students:
H0: µ ≤ 2    Ha: µ > 2
If we observed a sample mean of 4 and a standard deviation of 1 from a sample of 100, would you consider the null likely?
How about if the mean was 4 and the standard deviation was 20 from a sample of 100?
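
A quick way to see why the two scenarios differ is to standardize the observed mean. A small Python sketch using the slide's numbers:

```python
import math

def z_stat(xbar, mu0, sd, n):
    """Standardized distance of the sample mean from the null value."""
    return (xbar - mu0) / (sd / math.sqrt(n))

print(z_stat(4, 2, 1, 100))   # 20.0: wildly incompatible with the null
print(z_stat(4, 2, 20, 100))  # 1.0: quite plausible under the null
```

The first observed mean sits 20 standard errors above the null value; the second sits only 1 standard error above it.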

28 Players in decision making
Your observed statistic
Your sample size
Your observed standard deviation
Your capacity to determine, probabilistically, how likely the observed value is under the null

29 Errors in testing
Since we base our decisions about the parameter on sample values, we are liable to commit some errors.
Type I error: rejecting H0 when it is true (false positive).
Type II error: failing to reject H0 when Ha is true (false negative).
In any given situation we want to minimize these errors.
P(Type I error) = α, also called the size or level of significance.
P(Type II error) = β. Power = 1 − β: the probability of rejecting H0 when Ha is true. We want power to be LARGE; power is the TRUE positive we want.
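
To make "fixing the Type I error" concrete, here is a simulation sketch: generate many samples for which H0 is actually true and count how often a one-sided z-test at α = 0.05 falsely rejects. The normal model, µ = 2 and σ = 2 are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, reps = 0.05, 100, 10_000
cutoff = stats.norm.ppf(1 - alpha)   # one-sided z critical value, ~1.645

# Simulate data where H0 is true (mu = 2) and count false rejections
rejections = 0
for _ in range(reps):
    x = rng.normal(loc=2, scale=2, size=n)
    z = (x.mean() - 2) / (x.std(ddof=1) / np.sqrt(n))
    rejections += z > cutoff

print(rejections / reps)   # close to 0.05, the Type I error rate we fixed
```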

30 Example
I am introducing a new drug into the market. The drug may have some serious side effects. Before I do so, I will run tests to see if it is effective in curing the disease.
H0: the drug is not effective. Ha: the drug is effective.
What are the Type I and Type II errors in this case? Which is worse?
More importantly, think of the consequences of these errors.

31 One more example
Ann Landers, in her advice column on the reliability of DNA testing for determining paternity, advises: "To get a completely accurate result you would have to be tested, so would the man and your mother."
Consider the hypotheses: H0: a particular man is the father. Ha: a particular man is not the father.
Discuss the probabilities of Type I and Type II errors.

32 Decision making using hypotheses
In general, this is the way we make decisions. The idea is that we want to minimize both Type I and Type II errors.
However, in practice we cannot minimize both errors simultaneously.
What is done is to fix our Type I error at some small level, i.e. 0.10, 0.05 or 0.01, and then find the test that minimizes the Type II error for this fixed level of Type I error. This gives us the most powerful test.
So in solving a hypothesis problem, we formulate our decision rule using the fixed value of Type I error. The decision rule is also called the CRITICAL VALUE.

33 How does rejection of the null work with critical values?
First we calculate the value of the sample/test statistic.
Then we compare this value with the distribution of the sample statistic, allowing ourselves a Type I error of α.
Based on this, if our observed value is beyond the critical value, we feel justified in rejecting the null.
CRITICISM: the choice of α is arbitrary. We can make α big or small depending on what we want our outcome to be...
(Presenter note: draw the picture here.)

34 P-values: the elephant in the room
Hypothesis testing can be thought of as subjective, because the choice of α may alter the decision. Hence it is thought that one should report p-values and let readers decide for themselves what the decision should be.
The p-value, or probability value, is the probability of getting a value at least as extreme as ("worse than") the observed one, assuming the null is true.
If this probability is small, then our observed value is unlikely under the null and we should reject the null. Otherwise we cannot reject the null.


36 P-value for our example
For the hypothesis we talked about earlier: H0: µ ≤ 2, Ha: µ > 2.
If we observed a sample mean of 4 and a standard deviation of 2 from a sample of 100, would you consider the null likely?
How about if the mean was 4 and the standard deviation was 20 from a sample of 100?
P-value = P(Z > (4 - 2)/(2/√100)) = P(Z > 10) < .001
P-value = P(Z > (4 - 2)/(20/√100)) = P(Z > 1) ≈ .16
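
The same arithmetic in Python, using scipy's normal survival function (1 − CDF):

```python
from math import sqrt
from scipy.stats import norm

def one_sided_p(xbar, mu0, sd, n):
    """P(Z > observed z) for the one-sided test Ha: mu > mu0."""
    z = (xbar - mu0) / (sd / sqrt(n))
    return norm.sf(z)   # survival function = 1 - cdf

print(one_sided_p(4, 2, 2, 100))    # ~7.6e-24, far below .001
print(one_sided_p(4, 2, 20, 100))   # ~0.159, about .16
```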

37 Criticism of p-values
As more and more people used p-values, and in an effort to guard against the premise that "we can fool some of the people some of the time", journals started having strict rules about p-values. To publish, you needed to show small p-values. No small p-value, no publication...
This has often led to publication of ONLY significant results.
It has also led to a "let us get the p-value small by hook or by crook" attitude.

38 ASA statement about p-values
A p-value really tells us how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A p-value does NOT measure the size of an effect or the importance of a result.
By itself, a p-value cannot provide a good measure of evidence regarding the model.

39 Power: the other elephant in the room
Power is the TRUE positive: the probability that you reject the null under a specified value of the alternative.
So first we need to figure out what value of the alternative to use in calculating the power.
This choice is up to us, and we often call it the effect size.

40 Example of power
For the hypothesis we talked about earlier: H0: µ ≤ 2, Ha: µ > 2.
Suppose the standard deviation is 2, the sample size is 100, and we test at α = 0.05, so we reject when the sample mean exceeds 2 + 1.645(2/√100) = 2.329. Calculate the power when µ = 2.5, 3, 3.5, 4.
Power(µ = 2.5) = P(Z > (2.329 - 2.5)/(2/√100)) = P(Z > -0.86) ≈ 0.80, etc.
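
A sketch of this power calculation in Python, looping over the alternative means (it assumes the one-sided z-test at α = 0.05 described above):

```python
from math import sqrt
from scipy.stats import norm

mu0, sd, n, alpha = 2, 2, 100, 0.05
# Reject H0 when the sample mean exceeds this cutoff (one-sided test)
cutoff = mu0 + norm.ppf(1 - alpha) * sd / sqrt(n)   # 2 + 1.645*0.2 = 2.329

for mu in (2.5, 3, 3.5, 4):
    power = norm.sf((cutoff - mu) / (sd / sqrt(n)))  # P(reject H0 | true mean mu)
    print(f"mu = {mu}: power = {power:.3f}")
# mu = 2.5 gives power ~0.80; by mu = 3 the power is essentially 1
```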

41 Power in pictures

42 What are the players for power?
Sample size
Effect size
Standard deviation
So to really calculate power, one needs data: to understand the distribution and to have a feel for the standard deviation.
A priori ("pre-hoc") power calculation is often "trying to fool some of the people some of the time".

43 Recap of Part 2, Section 1
To make inferences, our population of interest and our parameter need to be well defined.
Errors exist in testing and have to be considered.
P-values are a measure of the incompatibility of the data with the null hypothesis and cannot be used to PROVE anything.
To calculate power we need to look at values under the alternative, and this can be subjective.

44 Worksheet for Section 1
True or false: Type I error is always the worst, hence we should focus on controlling it rather than Type II error.
You believe that the average time it takes students to walk from one class to another at WSU is more than the 10 minutes you are allotted.
Write out your null and alternative hypotheses.
Write down what the Type I error would be in this context.
You test it and get a p-value of .13. Does this indicate that the null is true?

45 Part 4: Big data, its pros and cons


47 What determines big data: the 5 V's
Volume: considered too large for regular software
Variety: often a mix of many different data types
Velocity: the extreme speed at which the data are generated
Variability: inconsistency of the data set
Veracity: how reliable the data are

48 How big is big?
By big we mean the volume is such that it is hard to analyze the data on a single computer.
That in itself shouldn't be problematic, but requiring specialized machines for the analysis has added to the myth and enigma of big data.
The problem with big data, at least as I see it, is that some very pertinent statistical questions are bypassed when dealing with it.

49 Some statistical thoughts
Is the big data a sample or a population?
If it is really a population, then analysis means constructing summary statistics. This is bulky but not too difficult.
If it is a sample, what was the sampling frame? If no population was considered when collecting the data, it is definitely not a representative sample. So should one really do inference on BIG data?
Even if we allow inference, wouldn't the sheer size of the data give us so much power that we can come to pretty much any decision we test for? See the sketch below.
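
A minimal sketch of that power problem: hold a trivially small effect fixed (a true mean of 2.01 against a null of 2, with σ = 2; all numbers invented for illustration) and watch the p-value collapse as n grows:

```python
from math import sqrt
from scipy.stats import norm

effect, sd = 0.01, 2   # a trivially small departure from the null

for n in (100, 10_000, 1_000_000):
    z = effect / (sd / sqrt(n))
    print(f"n = {n:>9,}: z = {z:.2f}, p = {norm.sf(z):.3g}")
# n = 100 -> p ~ 0.48; n = 1,000,000 -> p ~ 2.9e-7: "significant" yet trivial
```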

50 Structure of data
Generally, data sets are rectangular, with p variables and n observations.
In big data we often have many more predictors than observations (the big p problem), or many more (orders of magnitude more) observations than predictors (the big n problem).
Often both n and p are big, and they are fluid, as the data are constantly updated and amassed.

51 The variety and velocity pieces
Opportunistic data is generally a mix of categorical, discrete, ordinal and continuous variables. So if we treat it as multivariate, we have to think about how to proceed. While not trivial, this can be surmounted without too much difficulty.
The big issue is that the data is often being amassed (I am intentionally not saying "collected") at a faster rate than it can be analyzed and summarized.

52 Variability and veracity
This type of data is extremely variable, and there is no systematic model in place to capture the components of variability. Modeling is very hard when you have no idea about the sources of variability in these types of data sets.
Veracity: is the data measuring what we think it is? How truthful is it? Just because it is big, is it really good?
O'Donoghue and Herbert (in the context of medical data): "Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth."

53 Visualization of big data
Often called dashboards.
Really a collection of well-known, age-old graphs, most of which you can make in Excel! It is really just summary data in pretty colors.
Don't be fooled by the fancy terms.

54 Example of a Dashboard.

55 Analysis versus inference
As the whole question of whether we have a sample or a population is itself muddy, let us leave inference aside for now and focus on analysis.
A common analysis method associated with opportunistic data is predictive analytics.

56 Predictive analytics and big data
Encompasses prediction models, machine learning and data mining: predicting the unknown using the history of the past.
If we are predicting, are we inferring? I will assume it is okay to do that.
Exploits patterns found in historical data and allows assessment of the risk associated with a particular set of conditions.
Credit scoring has used predictive analytics for a long time. However, there, at least in the past, sampling was done to perform inference.

57 Techniques used in predictive analytics (supervised learning)
Analytical methods (modeled by humans):
Regression techniques
Logistic regression
Time series models
Survival or duration analysis
Classification and discrimination
Regression trees
Machine learning methods (done by machines, no explicit model):
Neural networks
Multilayer perceptron
Radial basis functions
Support vector machines
Naïve Bayes
k-nearest neighbors
Geospatial predictive modeling


59 Supervised learning
The idea is to learn from a known data set in order to predict the unknown. Essentially, we know the class labels ahead of time.
What we need to do is find a RULE, using features in the data, that DISCRIMINATES effectively between the classes, so that when we have a new observation with its features we can classify it correctly.
Machine learning uses this idea, and so it is very popular now. I will briefly cover this topic, mostly with examples.

60 Example 1: Turkey thief
There was a legal case in Kansas where a turkey farmer accused his neighbor of stealing turkeys from the farm. When the neighbor was arrested and the police looked in his freezer, there were multiple frozen turkeys there. The accused claimed these were WILD turkeys that he had caught.
A statistician was called in to give evidence, as there are some biological differences between domestic and wild turkeys. A biologist measured the bones and other body characteristics of domestic and wild turkeys, and the statistician built a DISCRIMINANT function.
They used the classification function to see whether the turkeys in the freezer fell into the WILD or the DOMESTIC class. THEY ALL fell in the DOMESTIC class!

61 Steps
Discriminant analysis:
Selection of features
Model fitting
Model validation using prediction of known classes
Machine learning:
Feature selection is done by the computer
No model; the computer determines the functions of the predictors used
Model is validated based on prediction of known classes

62 Feature selection
Find which of the observed variables can actually distinguish between the classes of interest. This is variable selection.
Don't be confused or awed when people throw terms like Elastic Net or Lasso at you in this context. These are fairly straightforward methods for model selection that differ in what type of error they minimize (see the sketch below).
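
A minimal Lasso sketch with scikit-learn, on synthetic data where only two of fifty predictors carry any signal. Every name and number here is an illustrative assumption, not part of the talk:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50                     # many more predictors than we need
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)   # only 2 real signals

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)   # Lasso zeroes out weak predictors
print(selected)                          # typically picks up features 0 and 1
```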

63 Model fitting
Commonly used:
LDA
QDA
K nearest neighbor
Logistic regression
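
As a toy version of the turkey example, here is a sketch fitting an LDA classifier with scikit-learn; the two "bone measurement" features and all numbers are invented for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
# Hypothetical bone measurements: domestic turkeys run larger on average
domestic = rng.normal(loc=[12.0, 8.0], scale=1.0, size=(50, 2))
wild = rng.normal(loc=[10.0, 6.5], scale=1.0, size=(50, 2))

X = np.vstack([domestic, wild])
y = np.array(["domestic"] * 50 + ["wild"] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)    # builds the discriminant function
print(lda.predict([[11.8, 7.9], [9.9, 6.4]]))   # classify two new birds
```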

64 Without models, we can use machine learning methods
Neural networks
Naïve Bayes
Support vector machines


66 Validation
See how well the classifiers classify observations into the different classes.
The most commonly used method is leave-one-out cross-validation, though test data sets (holdout samples) and resubstitution are still used.
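
A minimal leave-one-out cross-validation sketch with scikit-learn, using the built-in iris data purely as a stand-in for a real classification problem:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LinearDiscriminantAnalysis()

# Leave-one-out: fit on n-1 observations, classify the one held out, repeat
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy: {scores.mean():.3f}")
```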

67 Recap of Part 4
The sticky problem is whether the data we have is a sample or a population.
Inference is tough, as it is hard to figure out what population we are inferring about.
Predictive analytics is often associated with big data.
At the end of the day, machines are faster and more efficient, but they cannot (yet) create interpretable models.
We still don't know if big data is good data; it depends on who is collecting it and for what purpose.

68 Worksheet for Part 4
A genome-wide association study was undertaken to see if we could identify the different SNPs (single nucleotide polymorphisms) from two groups using a case-control setup. Each group had 5000 units, and we looked at 1 million SNPs in each case. We also collected other control data from both groups.
Would you consider this big data?
Would you consider this opportunistic data?
Do you think we can do inference in this case?

69 Overview: What we wanted to learn and what we did learn

70 Take-home messages (hopefully)
There are several types of data, and each type has its own nuances.
The concept of population versus sample.
Experimental, observational and opportunistic studies.
Exploratory and confirmatory studies.
The distinction between univariate and multivariate.
When summarizing, type and dimension matter.
Graphs are worth a thousand words.

71 More take-home messages
Estimation and testing.
Errors in testing.
Decision making, power and p-values.
Big data is still data, and it can be bad data.
Machine learning and statistics.
Dashboards are not only in cars: they are a data visualization method used by "analytics" firms to produce well-known graphs.

72 Myth of big data
There is no myth; it is just unwieldy, unstructured, under-designed data that is already being amassed.
It still has to be good data for us to make good analyses and predictions.
At the end of the day, to make inferences from data (big or small) we need it to be representative.

73 Section 3: CISER: What we do and how we can help graduate students, post-docs and faculty

74 Statistical help at WSU: Center for Interdisciplinary Statistics Education and Research (CISER)
Our mission:
Education
Collaborative research
Building a community

75 Assistance types

