1
Overview of probability and statistics
2
Data scientists constantly deal with collections of facts, or data. The discipline of statistics provides methods for organizing and summarizing data and for drawing conclusions based on the information in the data.
3
Population and samples
An investigation will typically focus on a well-defined collection of objects that make up the population of interest. When the desired information is available for all objects in the population, we have a census. Constraints on time, money, availability, etc. usually make a census impractical. Instead, a subset of the population, a sample, is selected.
4
Branches of statistics
In descriptive statistics, the investigator simply summarizes and describes important features of the data. Graphs (e.g., histograms, boxplots) and numerical summaries (e.g., sample means, standard deviations, and correlations) are used. In inferential statistics, the investigator uses information from the sample to draw conclusions (make inferences) about the population.
5
Probability versus statistics
In probability, properties of the population are modeled and the parameters of the model are assumed known. Questions about a sample drawn from that population are then answered by reasoning from the model. The model is typically an approximation (hopefully a very good one) of the true process generating the data.
6
Probability versus statistics (cont.)
In statistics, characteristics of a sample are used to draw conclusions about the population. Before we can understand what a particular sample can tell us about the population, we should first understand the uncertainty associated with taking a sample from a given population. That is one reason why we study probability before statistics.
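The uncertainty that comes from sampling can be illustrated with a small simulation. This is only a sketch: the population (the integers 1 to 100), the sample size, the number of repetitions, and the helper name `sample_means` are all assumptions chosen for illustration and do not come from the slides.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated samples without replacement and return each sample's mean."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population with a known mean of 50.5.
population = list(range(1, 101))
means = sample_means(population, sample_size=10, n_samples=1000)

# The sample means scatter around the population mean; that spread is
# the sampling uncertainty probability theory lets us quantify.
print("population mean:", statistics.mean(population))
print("smallest and largest sample mean:", min(means), max(means))
print("average of the sample means:", statistics.mean(means))
```

Because the population is fully known here, the spread of the sample means can be examined directly, which is the probability side of the picture. In practice only one sample is available and the reasoning runs the other way.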
7
Collecting data
Statistics also deals with methods to properly collect data so that the investigator will be able to answer relevant questions. One problem is that the target population may be different from the population that is actually sampled.
8
Ways to collect data
Simple random sampling
Stratified sampling
Convenience sampling
Designed experiment
9
Simple random sampling
This method requires a frame (a list of the population units).
Every unit has exactly the same chance of being in the sample.
One can pick numbers out of a hat or use a random number generator to pick the sample.
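A minimal sketch of drawing a simple random sample with a random number generator. The frame contents and the function name `simple_random_sample` are hypothetical; the only library call used is `random.Random.sample`, which draws without replacement.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Return n distinct units drawn from the frame, each equally likely."""
    rng = random.Random(seed)  # pass a seed to make the draw reproducible
    return rng.sample(frame, n)


# Hypothetical frame: a list of all 20 population units.
frame = [f"unit-{i}" for i in range(1, 21)]
sample = simple_random_sample(frame, n=5, seed=42)
print(sample)
```

Sampling without replacement guarantees that no unit appears twice, which matches drawing numbers out of a hat.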
10
Stratified sampling
The population is separated into non-overlapping groups and a sample is taken from each group.
This helps to ensure that no one group is over- or under-represented in the sample.
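One way to sketch stratified sampling is shown below. The slides do not say how many units to take from each group, so this example assumes proportional allocation (the same fraction from every stratum); the data, the function name `stratified_sample`, and the `stratum_of` parameter are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Take a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)

    # Separate the population into non-overlapping groups.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    # Assumption: proportional allocation, with at least one unit per stratum.
    sample = {}
    for name, members in strata.items():
        k = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population: 60 units in stratum "A" and 40 in stratum "B".
units = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
sample = stratified_sample(units, stratum_of=lambda u: u[0], fraction=0.1, seed=3)
print({name: len(chosen) for name, chosen in sample.items()})
```

With proportional allocation each group's share of the sample matches its share of the population; other allocation rules are possible and may suit a given study better.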
11
Convenience sampling
The individuals are selected without systematic randomization.
An example is a phone-in poll or an internet poll.
There is always the question of whether this type of sample is representative of the population (e.g., only people with strong opinions take the time to phone in to a telephone poll).
12
Designed experiment
Different treatments (such as fertilizers or coatings for corrosion protection) are allocated to various experimental units (plots of land or pieces of pipe).
The levels of the factors making up the treatments are varied to study their effects.
We have means of dealing with variables that we don't want to affect the outcomes (e.g., we can keep them fixed across the experiment).
This design is better than sampling if we want to establish causation.
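As an illustration of allocating treatments to experimental units, the sketch below assigns treatments at random in a balanced way. Random allocation is an assumption added here (the slides only say that treatments are allocated), and the plot names, treatment labels, and the function name `allocate_treatments` are hypothetical.

```python
import random
from collections import Counter


def allocate_treatments(units, treatments, seed=None):
    """Shuffle the units, then deal the treatments out in turn (balanced design)."""
    rng = random.Random(seed)
    shuffled = list(units)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return {
        unit: treatments[i % len(treatments)]
        for i, unit in enumerate(shuffled)
    }


# Hypothetical experiment: 12 plots of land and 3 fertilizers.
plots = [f"plot-{i}" for i in range(1, 13)]
fertilizers = ["A", "B", "C"]
allocation = allocate_treatments(plots, fertilizers, seed=1)

# Each fertilizer ends up on the same number of plots.
print(Counter(allocation.values()))
```

Dealing treatments over a shuffled list keeps the group sizes equal when the number of units is a multiple of the number of treatments.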
13
Numerical measures of location
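The sample mean is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The sample median is the $\left(\frac{n+1}{2}\right)$th ordered value if $n$ is odd, and the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th ordered values if $n$ is even.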
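A short sketch of both measures of location, using the usual definitions: the mean is the sum of the observations divided by their number, and the median is the middle ordered value (or the average of the two middle ordered values when the count is even). The function names and the data are hypothetical.

```python
def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle ordered value, or the average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the (n/2)th and (n/2 + 1)th


# Hypothetical data with an even number of observations.
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(sample_mean(data))    # 3.875
print(sample_median(data))  # 3.5
```

Python's standard library offers `statistics.mean` and `statistics.median`, which cover the same two definitions.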
14
Measures of variability
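The simplest measure of variability is the range, the difference between the largest and smallest observations. Another measure is the sample variance, defined as $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean. The sample standard deviation is $s = \sqrt{s^2}$.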
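The three measures of variability can be sketched as follows. The variance here uses the conventional divisor n - 1, which is an assumption since the formula did not survive in the slide text; the function names and the data are hypothetical.

```python
import math


def sample_range(xs):
    """Difference between the largest and smallest observations."""
    return max(xs) - min(xs)


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed divisor)."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical data: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_range(data))     # 7
print(sample_variance(data))  # 32 / 7, about 4.571
print(sample_std(data))       # about 2.138
```

The standard deviation is often easier to interpret than the variance because it is expressed in the same units as the observations.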