Statistics Chapter 1 Introduction to Statistics
Keep this in mind: Statistics questions have very logical answers. Statistics can be linked with psychology and sociology. Keep an open mind; there are always 2 sides to a coin (positive and negative).
Quick Talk “1 out of 3 people cheat in a relationship” Discuss this statement What does it mean? Do you believe it? Why or why not?
Based on your discussion, why do you think it’s important to know about statistics?
Potential answers: Know how trustworthy the statement is. Don’t get “cheated” or “tricked”. Know where the data comes from. Know how effective something is.
What is statistics? Statistics: the study of how to collect, organize, analyze, and interpret numerical information from data
So then, what is needed? Individual: people or objects included in the study Variable: characteristic of the individual to be measured or observed
Quick Talk Think about dating. What “variables” do people look for when finding a boyfriend or girlfriend? List them
Variables come in two types. Quantitative variable: has a value or numerical measurement for which operations such as addition or averaging make sense (usually has numbers). Qualitative variable: describes an individual by placing the individual into a category or group, such as male or female.
Based on your list, identify each variable as quantitative or qualitative.
Sample Answers Age (quantitative) Weight (quantitative) Height (quantitative) Race (qualitative) Income (quantitative) Looks (qualitative) Body type (qualitative) Personality (qualitative)
Data Population data: data from “every” individual of interest Sample data: data from “only some” of the individuals of interest
Quick Talk Compare the definitions. Which type of data are you more likely to be able to collect? Why?
Parameter: a numerical measure that describes an aspect of a population Statistic: numerical measure that describes an aspect of a sample
Easy way to remember Population with parameter Sample with statistic
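A minimal Python sketch of the parameter/statistic pairing. The population of 1,000 ages is made up for illustration, and the names `population`, `sample`, `parameter`, and `statistic` are hypothetical, not from the textbook.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 1,000 people (made-up data).
population = [random.randint(18, 80) for _ in range(1000)]

# Sample: only some of the individuals (50 chosen at random, no repeats).
sample = random.sample(population, 50)

parameter = mean(population)  # describes the whole population
statistic = mean(sample)      # describes only the sample

print("parameter (population mean):", parameter)
print("statistic (sample mean):", statistic)
```

The two numbers are usually close but not equal, which is why the sample value is treated only as an estimate of the population value.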
Group work: Example #1 A car dealer wants to know what type of car people drive in the desert. He sent out 5000 surveys to random people living in the desert.
A) Identify the individual of study and the variable
B) Do the data comprise a sample? If so, what is the underlying population?
C) Is the variable qualitative or quantitative?
D) Identify a quantitative variable that might be of interest
E) Would a measure computed from this random sample be a statistic or a parameter?
Answer
A) Individual: a person living in the desert. Variable: type of car driven
B) The data comprise a sample of the population of all people living in the desert
C) Qualitative
D) Income, age
E) Statistic, because it is computed from sample data
Example #2 Television station QUE wants to know the proportion of TV owners in Virginia who watch the station’s new program at least once a week. The station asked a group of 1000 TV owners in Virginia if they watch the program at least once a week.
A) Identify the individual of study and the variable
B) Do the data comprise a sample? If so, what is the underlying population?
C) Is the variable qualitative or quantitative?
D) Identify a quantitative variable that might be of interest
E) Would a measure computed from this sample be a statistic or a parameter?
Levels of Measurement
Nominal level of measurement: applies to data that consist of names, labels, or categories. There are no implied criteria by which the data can be ordered from smallest to largest.
Ordinal level of measurement: applies to data that can be arranged in order. However, differences between data values either cannot be determined or are meaningless.
Interval level of measurement: applies to data that can be arranged in order. In addition, differences between data values are meaningful.
Ratio level of measurement: applies to data that can be arranged in order. In addition, both differences between data values and ratios of data values are meaningful. Data at the ratio level have a true zero. (A value of zero means there is none of the quantity.)
Example: Identify the level of measurement A) Taos, Acoma, Zuni, and Cochiti are names of four Native American pueblos from the population of names of all Native American pueblos in Arizona and New Mexico B) In a high school graduating class of 600 students, Jeff ranked 1st, Melissa ranked 38th, Patrick ranked 150th, and Ashley ranked 3rd, where 1 is the highest rank C) Body temperatures of trout in the Yellowstone River D) Lengths of sharks swimming in the Pacific Ocean
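One way to see the difference between the interval and ratio levels is to check whether a ratio survives a change of units. The sketch below is only an illustration. The values are made up, and `c_to_f` is a helper name introduced here.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval level: temperature in Celsius has no true zero.
temp_a_c, temp_b_c = 10.0, 20.0
ratio_in_c = temp_b_c / temp_a_c                  # 20 / 10
ratio_in_f = c_to_f(temp_b_c) / c_to_f(temp_a_c)  # 68 / 50
# The "ratio" changes with the unit, so "twice as hot" is not meaningful.

# Ratio level: length has a true zero (0 means no length at all).
len_a_cm, len_b_cm = 10.0, 20.0
ratio_in_cm = len_b_cm / len_a_cm
ratio_in_inches = (len_b_cm / 2.54) / (len_a_cm / 2.54)
# The ratio is 2 in both units (up to floating-point rounding),
# so "twice as long" is meaningful.

print(ratio_in_c, ratio_in_f)
print(ratio_in_cm, ratio_in_inches)
```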
Answer A) nominal B) ordinal C) interval D)ratio
Example #2 Name the levels of measurement A) My name is Mr. Liu B) I am 28 years old C) High school 1999-2003, College 2003-2007, Masters 2007-2008 D) I make $35,000 after tax E) I ranked 100th in high school, 58th in college, 27th in Masters F) Some of my friends’ names are Michael, Katherine, Patrick, Ashley, Sarah, Mya, Chris G) I am 5’8” tall
Answers A) Nominal B) Ratio C) Interval D) Ratio E) Ordinal F) Nominal G) Ratio
Homework Practice Pg 10-11 #1-13 odd
Quick Talk Mr. Liu looked at the grades of the first 15 male students (which average to a C) and concluded that all the students in the school should have a C average. Discuss why this conclusion might not be correct. What is wrong with this study?
Things to remember If there is a study or data collection, it cannot be BIASED in any way. You need a decent sample size and fair randomness. Fair = equal chance
1st type of data collection Simple random sample: A simple random sample of n measurements from a population is a subset of the population selected in such a manner that every sample of size n from the population has an equal chance of being selected. Basically, every individual has the same chance of getting selected.
Simple random sample example Suppose I assign a number to each of the students here (40 students) and then randomly choose 5 numbers. Would the number 7 be as likely to be selected as the number 37? Could all 5 numbers be odd? Could the sample ever be 27, 28, 29, 30, 31?
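For the discussion above, the counting can be checked with a short Python sketch. It assumes the 5 numbers are drawn without repeats from 1 to 40, so every set of 5 numbers is equally likely. The variable names are made up for illustration.

```python
from math import comb

total_samples = comb(40, 5)     # number of possible samples of size 5
all_odd_samples = comb(20, 5)   # samples that use only the 20 odd numbers

# Any one specific sample, such as 27, 28, 29, 30, 31, is just as
# likely as any other specific sample.
p_specific = 1 / total_samples

# All five numbers odd: unlikely, but possible.
p_all_odd = all_odd_samples / total_samples

# Chance that a particular number (7, or 37) is in the sample:
# fix that number, then pick the other 4 from the remaining 39.
p_contains_7 = comb(39, 4) / total_samples

print(total_samples, p_specific, p_all_odd, p_contains_7)
```

Under that assumption, 7 and 37 have the same chance of being picked, and both the all-odd sample and the run 27 to 31 are possible.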
How to Draw a Random Sample 1) Number all members of the population sequentially 2) Use a table, calculator, or computer to select random numbers from the numbers assigned to the population members 3) Create the sample by using population members with numbers corresponding to those randomly selected
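The three steps above can be sketched in Python, using the computer as the source of random numbers. The function name `draw_simple_random_sample` and the seed value are hypothetical choices for this illustration.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return the numbers of the population members chosen for the sample."""
    rng = random.Random(seed)
    # Step 1: number all members of the population sequentially.
    numbered = list(range(1, population_size + 1))
    # Step 2: let the computer select random numbers (no repeats).
    chosen = rng.sample(numbered, sample_size)
    # Step 3: the sample is the members with those numbers.
    return sorted(chosen)

picked = draw_simple_random_sample(40, 5, seed=2024)
print(picked)
```

`random.sample` draws without replacement, so no student can be picked twice.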
Read Example 3 on pg 13 Random-Number Table It is one of the ways to create “randomness” with numbers. Using random numbers this way is called a simulation
Another way: Random Integer (randInt) on a TI-83 or TI-84 calculator
1) Go to MATH
2) Slide over to PRB
3) Choose #5. It should show randInt(
4) If you want ONE random number out of a total of 500, type randInt(1,500). This will give you a random integer from 1 to 500
5) If you want 30 random numbers out of a total of 500, type randInt(1,500,30)
2nd type of data collection Simulation (usually with numbers): a numerical facsimile or representation of a real-world phenomenon Note: Productive in studying nuclear reactors, cloud formation, cardiology, highway design, production control, shipbuilding, airplane design, war games, economics, and electronics.
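For anyone without the calculator, a rough Python stand-in for the same idea is sketched below. The helper `rand_int` is a made-up name that only mimics the calling pattern shown above. It is not the calculator's implementation.

```python
import random

def rand_int(low, high, count=1, seed=None):
    """Return `count` random integers from low to high, inclusive."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(count)]

one_number = rand_int(1, 500)                # like randInt(1,500)
thirty_numbers = rand_int(1, 500, 30, seed=7)  # like randInt(1,500,30)
print(one_number)
print(thirty_numbers)
```

Each number in this sketch is drawn independently, so the same value can appear more than once. If the numbers are meant to pick distinct individuals, discard repeats and draw again.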
Quick Talk: Why do you think it is important to use simulation as a data collection method? (think about the application field we just discussed)
Group Activity In your group, simulate tossing a coin 10 times using each of three methods. One person will record. One person will toss a coin (head or tail). One person will use the calculator (1=head, 2=tail). One person will use the table from the back of the book (odd=head, even=tail). You should have a total of 30 trials. Answer these questions: What is the theoretical probability of getting a head? What is the experimental probability of getting a head?
Answer Theoretical probability: 50% Experimental probability: depends on your group
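The calculator version of the activity can also be sketched in Python, using the same coding (1=head, 2=tail). The function name and the seeds are assumptions made for this illustration.

```python
import random

def experimental_probability_of_heads(num_tosses, seed=None):
    """Simulate coin tosses and return the fraction that came up heads."""
    rng = random.Random(seed)
    tosses = [rng.randint(1, 2) for _ in range(num_tosses)]  # 1=head, 2=tail
    heads = tosses.count(1)
    return heads / num_tosses

# Compare a group-sized run with much longer runs.
for n in (30, 300, 30000):
    print(n, experimental_probability_of_heads(n, seed=n))
```

With only 30 trials the experimental probability can land noticeably away from 50%. Longer runs tend to sit closer to the theoretical value.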
Sampling: Different ways to create “randomness”
Stratified sampling: Divide the entire population into distinct subgroups called strata. The strata are based on a specific characteristic such as age, income, education level, and so on. All members of a stratum share the specific characteristic. Draw random samples from each stratum.
Systematic sampling: Number all members of the population sequentially. Then, from a starting point selected at random, include every kth member of the population in the sample.
Cluster sampling: Divide the entire population into pre-existing segments or clusters. The clusters are often geographic. Make a random selection of clusters. Include every member of each selected cluster in the sample.
Multistage sampling: Use a variety of sampling methods to create successively smaller groups at each stage. The final sample consists of clusters.
Convenience sampling: Create a sample by using data from population members that are readily available (potential to have lots of bias).
Vocabulary dealing with sampling
Sampling frame: a list of individuals from which a sample is actually selected
Undercoverage: results from omitting population members from the sampling frame
Sampling error: the difference between measurements from a sample and corresponding measurements from the respective population. It is caused by the fact that the sample does not perfectly represent the population.
Nonsampling error: the result of poor sample design, sloppy data collection, faulty measuring instruments, bias in questionnaires, and so on
Note: Remember, is it practical to collect data from the whole population? Usually not, so we have to use a sample to predict the population. A sample is not a perfect representation of the population! Sampling errors do not represent mistakes! They are just the consequence of using samples instead of populations. Nonsampling errors do occur, so be aware of them! Avoid bias and sloppy data collection, which can lead to false conclusions (false positives or false negatives).
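Three of the methods above (stratified, systematic, and cluster sampling) can be sketched in Python as follows. The population of 40 students, the grade levels, the homerooms, and all function names are hypothetical and exist only to make the example runnable.

```python
import random

def stratified_sample(population, stratum_of, per_stratum, rng):
    """Draw a random sample of `per_stratum` members from each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def systematic_sample(population, k, rng):
    """Random starting point among the first k, then every kth member."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(population, cluster_of, num_clusters, rng):
    """Pick clusters at random and keep every member of each one."""
    clusters = {}
    for member in population:
        clusters.setdefault(cluster_of(member), []).append(member)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [m for c in chosen for m in clusters[c]]

rng = random.Random(0)
# Hypothetical population: 40 students as (id, grade level, homeroom).
students = [(i, 9 + i % 4, "room" + str(i % 5)) for i in range(1, 41)]

by_grade = stratified_sample(students, lambda s: s[1], 3, rng)  # 3 per grade
every_8th = systematic_sample(students, 8, rng)                 # k = 8
two_rooms = cluster_sample(students, lambda s: s[2], 2, rng)    # 2 homerooms

print(len(by_grade), len(every_8th), len(two_rooms))
```

Note the design difference: stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.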
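To make the idea of sampling error concrete, here is a small Python sketch. The population of 10,000 test scores is simulated (made-up data), and `sampling_error` is a helper name introduced for this illustration.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population: 10,000 made-up test scores.
population = [rng.gauss(70, 10) for _ in range(10_000)]
population_mean = mean(population)

def sampling_error(sample_size):
    """Sample mean minus population mean for one random sample."""
    sample = rng.sample(population, sample_size)
    return mean(sample) - population_mean

# Repeat the sampling many times and look at the typical size of the error.
errors_small = [abs(sampling_error(10)) for _ in range(200)]
errors_large = [abs(sampling_error(1000)) for _ in range(200)]

print("typical error, n = 10:  ", mean(errors_small))
print("typical error, n = 1000:", mean(errors_large))
```

Nothing was measured incorrectly here, yet every sample misses the population mean by a little. In this sketch the larger samples typically miss by less.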
Homework Practice Pg 17-19 #1-5, 7, 13, 15
Quick Talk Why is planning a good experimental design important? Think about what we learned.
More types of data collection techniques 1) Experiment (most stringent and restrictive) 2) Observational study (somewhat convenient), for example a census 3) Survey (most convenient way to collect data)
Basic guidelines for planning a statistical study
1) Identify the individuals or objects of interest
2) Specify the variables as well as protocols for taking measurements or making observations
3) Determine if you will use an entire population or a representative sample. Decide on a viable sampling method
4) In your data collection plan, address issues of ethics, subject confidentiality, and privacy. If you are collecting data at a business, store, college, or other institution, be sure to be courteous and to obtain permission as necessary
5) Collect the data
6) Use appropriate descriptive statistics methods and make decisions using appropriate inferential statistics methods
7) Finally, note any concerns you might have about your data collection methods and list any recommendations for future studies
Quick talk: You are a researcher in a biotech company. You are trying to find out the efficacy and the effectiveness of a vaccine. How would you conduct the experiment? Note: statistics is needed for ALL medical companies to test the effectiveness of their technology or medicine
3rd type of data collection: Experiments (zombie creation?!?!) Completely randomized experiment: one in which a random process is used to assign each individual to one of the treatments Block: a group of individuals sharing some common features that might affect the treatment Randomized block experiment: individuals are first sorted into blocks, and then a random process is used to assign each individual in the block to one of the treatments.
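The two designs above can be sketched in Python. The 20 subjects, the age-group blocks, and the treatment labels are hypothetical, and the function names are made up for this illustration.

```python
import random

def completely_randomized(individuals, treatments, rng):
    """Shuffle everyone, then deal the treatments out in turn."""
    shuffled = individuals[:]
    rng.shuffle(shuffled)
    return {person: treatments[i % len(treatments)]
            for i, person in enumerate(shuffled)}

def randomized_block(individuals, block_of, treatments, rng):
    """Sort individuals into blocks, then randomize within each block."""
    blocks = {}
    for person in individuals:
        blocks.setdefault(block_of(person), []).append(person)
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment

rng = random.Random(5)
# Hypothetical subjects: (label, age group), 10 in each age group.
subjects = [("S%02d" % i, "under 40" if i <= 10 else "40 and over")
            for i in range(1, 21)]

plan = randomized_block(subjects, lambda s: s[1], ["vaccine", "placebo"], rng)
for subject, treatment in sorted(plan.items()):
    print(subject, treatment)
```

Blocking by age group guarantees that each age group is split evenly between the treatments, which a completely randomized assignment does not.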
Two types of groups Experimental group: a treatment is deliberately imposed on the individuals in order to observe a possible change in the response or variable being measured (the one receiving the actual treatment). Control group: receives a dummy treatment, enabling the researchers to control for the placebo effect. It is used to account for the influence of other known or unknown variables that might be an underlying cause of a change in response in the experimental group (the one receiving the fake treatment).
Placebo effect (aka sugar pill effect): occurs when a subject receives no treatment but (incorrectly) believes he or she is in fact receiving treatment and responds favorably. (In the control group only)
Remember, a good experiment needs: 1) Randomization: a random process is used to assign individuals to the treatment groups. This helps prevent bias in selecting members for each group and so keeps the data collection fair. 2) Replication (re-test): repeating the same experiment on many individuals reduces the possibility that a difference between the two groups (for example, in pain relief) occurred by chance alone. Note: Many experiments are also “double-blind”. This means that neither the individuals nor the observers know which subjects are receiving the treatment. It controls the biases that a doctor might have about a patient.
Quick talk What is the downside of using experiments as a data collection technique? What is the downside of using sampling as a data collection technique?
4th type of data collection Census (a type of observational study): measurements or observations from the entire population are used. The US conducts one every 10 years. Even so, it is still impossible to reach everyone, e.g. homeless people, so estimates are used. Sample (sampling; also used with observational studies): measurements or observations from part of the population are used (the most commonly used approach).
Observational study: observations and measurements of individuals are conducted in a way that doesn’t change the response or the variable being measured.
5th type of data collection: Survey A survey is a useful tool for gathering data without experimenting. It is a type of observational study.
Downsides of surveys
Nonresponse: Individuals either cannot be contacted or refuse to participate. This can result in significant undercoverage of a population.
Truthfulness of response: Respondents may lie intentionally or inadvertently.
Faulty recall: Respondents may not remember accurately.
Hidden bias: The question may be worded in such a way as to elicit a specific response.
Vague wording: Words such as “often”, “seldom”, and “occasionally” mean different things to different people.
Interviewer influence: Factors such as tone of voice, body language, dress, gender, authority, and ethnicity of the interviewer might influence responses.
Voluntary response: Individuals with strong feelings about a subject are more likely than others to respond. Such a study is interesting but not reflective of the population.
Downside of all data collection Lurking variable: one for which no data have been collected but that has influence on the other variables in the study Two variables are confounded when the effects of one cannot be distinguished from the effects of the other. Confounding variables may be part of the study, or they may be outside lurking variables.
Quick Talk Have you guys ever used Yelp or another similar app? Discuss whether the comment section has ever “influenced” a decision you made. How does it relate to the “downsides of surveys”?
Review We have learned different data collection methods. Look in your notes, and in your group list all the collection methods.
Answer experiment, census, simulation, sampling (w/ survey)
Which type of data collection do you think is the most appropriate for the following studies? 1) Study of the effect of stopping the cooling process of a nuclear reactor 2) Study of the amount of time students spend watching TV while studying 3) Study of the effects of a weight-loss pill given to women 4) Study of the number of credits earned by each student enrolled at the high school at the end of the 1st semester
Group work Comment on the usefulness of the data (both positive and negative) 1) An interviewer asks interviewees if they have taken drugs this year. 2) Jessica saw some data showing that cities with more low-income housing have more homeless people. Does building low-income housing cause homelessness? 3) You look at the reviews on Yelp to determine the quality of a restaurant 4) An extensive study on cancer conducted using only men over 40
Homework Practice Pg 26-27 #1-5 odd
Review Practice P29 #1-9