Being data literate: Are big data and good data synonymous?

Nairanjana (Jan) Dasgupta, Professor, Dept. of Math and Stats; Boeing Distinguished Professor of Math and Science; Director, Center of Interdisciplinary Statistics Education and Research (CISER), Washington State University, Pullman, WA

Topics
Source of data. Types of data. Population versus sample. Experiments versus observational studies. Exploratory and confirmatory studies. Summarizing data. Inference: going from sample to population. Errors in testing: Type I and Type II. P-value: good, bad or misused. A few comments about BIG data.

Data: GOOD, BAD or the culprit?
I will misquote Samuel Taylor Coleridge here (his quote was about water), when the ancient mariner was stuck in the middle of the ocean: “Data, data everywhere, and it really makes us blink; data, data everywhere, but we’ve got to stop and think…” This is the theme of today’s lecture: understanding data, types of data and what we can and cannot do with data. More of a holistic view of data.

Why collect Data?
The short answer: to answer a question we have in mind. In the past, data was collected via experiments or observational studies. However, nowadays SOME data is collected without a specific objective in mind, just because everything is data oriented and collection is easy. In general, we want to learn from data something that we do not know.

Data Source: experiments and observational studies

However now data sources are different:
Data are generated, not collected. No study design is associated with its collection. It is often unclear what we want it to tell us: we are often taking a stab-in-the-dark approach. This is often what we call BIG data. I would like to coin the phrase “opportunistic data” for this type of data that is not collected with a specific aim in mind, like social media data or phone data.

BIG Data: Some thoughts
Not much ACTUAL data analysis is done, as the challenge is to actually manage and extract the data. Mostly graphs and “dashboards”. Having more doesn’t solve the problem if the data is not “GOOD” to start off with. It has more problems with BIAS as it is not collected in a systematic way. Issues with dimensionality. Extreme problems of multiple testing and false positives.
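
To make the multiple-testing point concrete, here is a minimal sketch. It assumes every null hypothesis is true and that the tests are independent (an idealization that real big-data settings rarely satisfy); the function name is hypothetical. It shows how quickly the chance of at least one false positive grows with the number of tests.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) over independent tests when every null is true."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 10, 100, 1000):
    rate = familywise_error_rate(m)
    print(f"{m:>5} tests at alpha = 0.05: P(at least one false positive) = {rate:.3f}")
```

With a single test the chance stays at the nominal 5%, but it is already about 40% with 10 tests and close to certainty with 100.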

Types of data
Numerical: Discrete, Continuous
Categorical: Nominal, Ordinal

What are these?
Nominal: name, category. Ordinal: ordered categories. Discrete: what you count. Continuous: what you measure.
What is your eye color? How many Statistics classes have you taken? Your blood pressure when faced with a Stats problem. On a scale of 1 to 5, rate your liking for Stats.

Population versus Sample
Population: sum total of all individuals and objects in a study. Sample: part of the population selected for the study. Idea: get a good sample, study this sample carefully and infer about the population based on the sample. Statistics is involved in both of these arrows (from population to sample, and from sample back to population). Sample: make sure to stress that if we don’t study it, it is not the sample.

Where does statistical science come in?
If we could always study the population directly, we wouldn’t need Statisticians except for clerical jobs. If we are relying on samples, we need to take a GOOD sample: the “good” is defined by Statistics. What type of sample can I take? How exactly can I “infer” attributes about the population (parameter) based on a sample (statistic)? Caveat: the population at hand needs to be a REAL representative population. The information I took on you: observational or experimental?

Experiments versus Observational Studies
Experiments: You change the environment to see the effect your change had, trying to control all other potential factors that might affect your study. Observational Study: You study the environment as is, and collect data on all possible factors that might be of interest to you. Talk about designing the study where we are comparing three brands of cake mixes. Discuss type of mix, temperature, oven effect and the placement effect.

What does it matter the type of study we conduct?
It matters because how we proceed to analyze the data should differ according to the type of study we had. Nowadays it is common to have data collected “just because”, or “opportunistic data”, and these are the extreme types of observational studies. As it was collected without any aim, it is hard to figure out if it is a population or a sample.

Exploratory versus Confirmatory studies
Exploratory studies: We do not have an idea about what we expect to find, so we study a bunch of factors to see how they affect what we are studying. It can be an experimental or an observational study (though generally observational). Confirmatory studies: We generally have an idea of what we are expecting to find and do a very focused study to give credence to our beliefs. It can be an experimental or an observational study (though generally experimental). Keep in mind that we cannot use data collected in an exploratory study to confirm our belief or hypothesis. Liken this to a fishing expedition and give examples of these kinds of study.

Summarizing Data
Data, whatever the source, type, structure and purpose, is just a bunch of numbers. Before we can do anything with data, we need to summarize it somehow. We can summarize data numerically or graphically. Though loads of importance is given to numerical summaries, a graphical summary is just as important and often more powerful (and easier to deceive with).

Data: Graphical Summary
Box plots, histograms, pie charts, line graphs, scatter plots, dendrograms, heat maps, network plots. The list goes on and on, and it gets more complicated: people will dazzle you with colors and pictures and the interactive-ness.

Example of Misleading graph!

Summarizing: Measures of Central Tendency
One way to summarize any kind of data is to see what is happening in the center. Measures of center: mean, median, mode.

More Summarization: Measures of center provide us with a first step for summarizing data. But often they cannot differentiate between data sets. Consider the following data sets: Set1: Set2: Set3: They all have the same mean and median = 40. Are they identical? What makes them different?
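
As an illustration of this point, here is a small sketch with three made-up data sets. These values are hypothetical stand-ins chosen so that the mean and median are both 40; they are not the sets from the original slide. The center is identical in all three, and only the spread tells them apart.

```python
import statistics

# Hypothetical data sets: same mean and median (40), very different spread.
data_sets = {
    "Set1": [40, 40, 40, 40, 40],
    "Set2": [30, 35, 40, 45, 50],
    "Set3": [0, 20, 40, 60, 80],
}

summaries = {}
for name, values in data_sets.items():
    summaries[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.pstdev(values),  # population standard deviation
    }
    s = summaries[name]
    print(f"{name}: mean={s['mean']}, median={s['median']}, "
          f"range={s['range']}, sd={s['sd']:.2f}")
```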

Other summary measures
Measures of spread. Shapes of the distribution of data. Where is the peak? Measures of symmetry. Percentiles.

When and Why we need inference
IF the data we collected was really a population we do not need to do any inference. This is an issue with opportunistic data. Can we do inference? But in general we do not know the population, and we study the sample to INFER something about the population.

Inference: Use data and statistical methods to infer (get an idea, resolve a belief, estimate) something about the population based on the results of a sample.
Inference splits into Estimation (Point Estimation, Interval Estimation) and Hypothesis Testing.

Estimation
We have no idea at all about the parameter of interest and we use the sample information to get an idea (estimate) of the population parameter. Point Estimation: using a single point to estimate the population parameter. Interval Estimation: using an interval of values to estimate the population parameter.
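
A minimal sketch of both kinds of estimate, using a hypothetical sample of "number of statistics classes taken" (the values and the function name are made up for illustration). The interval uses a normal approximation; for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics
from statistics import NormalDist


def mean_with_interval(sample, confidence=0.95):
    """Return the point estimate of the mean and a normal-approximation interval."""
    n = len(sample)
    point = statistics.mean(sample)                       # point estimate
    std_error = statistics.stdev(sample) / math.sqrt(n)   # estimated standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)        # about 1.96 for 95%
    return point, (point - z * std_error, point + z * std_error)


classes_taken = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]  # hypothetical sample
point, (low, high) = mean_with_interval(classes_taken)
print(f"point estimate: {point:.2f}")
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")
```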

Hypothesis Testing
We have some idea about a population parameter. We want to test this claim based on data we collect. This is probably one of the most used and most misunderstood methods in science. It provides us with the “dreaded p-value”.

Parameter
To understand inference, we really need to get a very clear idea about what a parameter is. By definition, a parameter is a numerical characteristic (or characteristics, if multivariate) of the population that is of interest. To make this specific, let us consider the following example. Example: We are interested in the average number of statistics classes students have taken when they come to graduate school at WSU.

Population and Parameter:
Here the population is all graduate students at WSU. The parameter is: the average number of statistics classes taken by the students. Question: IF I could study every single graduate student at WSU now, would this be a population or would it be a sample??

Hypothesis Testing: some idea about the parameter
We have some knowledge about the parameter: a claim, a warranty, what we would like it to be. We test the claim using our data. First step: formulating the hypotheses. Always a pair: Research and Nullification of research (affectionately called Ha and H0, respectively).

Logic of Testing
To actually test the hypothesis, what we try to do is disprove or reject the null hypothesis. If we can reject the null, we conclude in favor of Ha (which is our research hypothesis). Think of how the legal system works: H0: Not Guilty, Ha: Guilty.

How do we do this?
We take a sample and look at the sample values. Then we ask: if the null were true, would this be a likely value of the sample statistic? If our observed value is not a likely value, we reject the null. How likely or unlikely a value is, is determined by the sampling distribution of that statistic.
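
The idea of a sampling distribution can be sketched by simulation. Everything below is an assumed setup for illustration only: a normal population, a null mean of 40, a standard deviation of 10, samples of size 25 and an observed sample mean of 45. We draw many samples as if the null were true and see how often a sample mean is at least as large as the observed one.

```python
import random
import statistics


def simulate_sample_means(null_mean, sigma, n, num_samples, seed=0):
    """Draw many samples under the null and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.gauss(null_mean, sigma) for _ in range(n)])
        for _ in range(num_samples)
    ]


null_means = simulate_sample_means(null_mean=40.0, sigma=10.0, n=25, num_samples=10_000)
observed_mean = 45.0  # hypothetical observed sample mean

# Share of simulated sample means at least as large as what we observed.
share_as_large = sum(m >= observed_mean for m in null_means) / len(null_means)
print(f"share of null sample means >= {observed_mean}: {share_as_large:.4f}")
```

If that share is very small, the observed value is an unlikely one under the null, which is the reasoning used to reject it.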

Errors in testing: Since we make our decisions about the parameter based on sample values, we are likely to commit some errors. Type I error: rejecting H0 when it is true (false positive). Type II error: failing to reject H0 when Ha is true (false negative). In any given situation we want to minimize these errors. P(Type I error) = α, also called the size or level of significance. P(Type II error) = β. Power = 1 - β: HERE we reject H0 when Ha is true. We want power to be LARGE. Power is the TRUE positive we want.
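
As a rough numerical sketch of these quantities, consider a one-sided z-test of H0: mean = mu0 against Ha: mean > mu0 with a known standard deviation. The test, the function name and all numbers below are assumptions chosen for illustration, not something taken from the lecture.

```python
import math
from statistics import NormalDist


def one_sided_z_test_errors(mu0, mu1, sigma, n, alpha=0.05):
    """Critical value, beta and power for H0: mean = mu0 vs Ha: mean > mu0 (known sigma)."""
    std_error = sigma / math.sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff; P(reject | H0 true) = alpha.
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * std_error
    # beta = P(fail to reject | true mean is mu1).
    beta = NormalDist(mu1, std_error).cdf(critical)
    return critical, beta, 1 - beta


critical, beta, power = one_sided_z_test_errors(mu0=40, mu1=45, sigma=10, n=25)
print(f"reject H0 when the sample mean exceeds {critical:.2f}")
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

With these numbers the power is roughly 0.8; raising the sample size or moving mu1 further from mu0 increases it, while alpha stays fixed at 0.05.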

An example: Ann Landers, in her advice column on the reliability of DNA testing for determining paternity, advises: “To get a completely accurate result you would have to be tested, so would the man and your mother. The test is 100% accurate if the man is NOT your father, and 99.9% accurate if he is.” Consider the hypotheses H0: a particular man is the father, Ha: a particular man is not the father. Discuss the probabilities of Type I and Type II errors.
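
One possible reading of the quoted accuracies, written out as arithmetic. This is a sketch of one interpretation to start the discussion, assuming "accurate" means the test reaches the correct conclusion; the variable names are made up.

```python
# Accuracies as quoted in the column.
accuracy_if_father = 0.999      # H0 true: the man is the father
accuracy_if_not_father = 1.0    # Ha true: the man is not the father

# Type I error: rejecting H0 (declaring "not the father") when he is the father.
type_i_error = 1 - accuracy_if_father

# Type II error: failing to reject H0 (declaring "father") when he is not.
type_ii_error = 1 - accuracy_if_not_father

print(f"P(Type I error)  = {type_i_error:.3f}")
print(f"P(Type II error) = {type_ii_error:.3f}")
```

Under this reading the test never misses a non-father, so its power is 1, while it wrongly excludes a true father about once in a thousand tests.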

Decision Making using Hypotheses:
In general, this is the way we make decisions. The idea is that we want to minimize both Type I and Type II errors. However, in practice we cannot minimize both errors simultaneously. What is done is that we fix our Type I error at some small level, e.g. 0.1, 0.05 or 0.01. Then we find the test that will minimize our Type II error for this fixed level of Type I error. This gives us the most powerful test. So in solving a hypothesis problem, we formulate our decision rule using the fixed value of Type I error. The cutoff used in the decision rule is called the CRITICAL VALUE.

P-values: what are they?
Sometimes hypothesis testing can be thought to be subjective. This is because the choice of α-value may alter a decision. Hence it is thought that one should report p-values and let the readers decide for themselves what the decision should be. The p-value, or probability value, is the probability, computed assuming the null is true, of getting a value as extreme as or more extreme than the one observed. If this probability is small, then our observed value is an unlikely value under the null and we should reject the null. Otherwise we cannot reject the null.
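
A minimal sketch of a p-value calculation, assuming a z-test for a mean with a known standard deviation. The function name and the numbers (null mean 40, standard deviation 10, 25 observations, observed mean 44) are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_p_values(sample_mean, mu0, sigma, n):
    """z statistic with upper-tail and two-sided p-values for a known-sigma z-test."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    standard_normal = NormalDist()
    upper_tail = 1 - standard_normal.cdf(z)            # Ha: mean > mu0
    two_sided = 2 * (1 - standard_normal.cdf(abs(z)))  # Ha: mean != mu0
    return z, upper_tail, two_sided


z, upper_tail, two_sided = z_test_p_values(sample_mean=44, mu0=40, sigma=10, n=25)
print(f"z = {z:.2f}")
print(f"upper-tail p-value = {upper_tail:.4f}")
print(f"two-sided p-value  = {two_sided:.4f}")
```

Note that the same data can land on either side of a 0.05 cutoff depending on whether 0.05 or 0.01 is chosen as α, which is the subjectivity mentioned above.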

Criticism of p-values
As more and more people used p-values, and in an effort to guard against the premise that “we can fool some of the people some of the time”, journals started having strict rules about p-values. To publish, you needed to show small p-values. No SMALL p-values, no publication… This has often led to publication of ONLY significant results. It has also led to a “let us get the p-value small by hook or by crook” attitude.

For Big data: what are we trying to learn
Is it a population that we have or is it a sample? If the latter, then what is the population? How can this sample be used to infer about a population when no frame was used to draw the sample? As statisticians and data scientists, we need to think about this. And big data is not always good data.

Take home messages (hopefully)
There are several types of data and each type has its own nuances. The concept of population versus sample. Experimental, observational and opportunistic studies. Exploratory and confirmatory studies. When summarizing data, type and dimension matter. Graphs are worth a thousand words. Bigger is not always better…

CISER: A Brief overview
