Senior Statistical Criminologist

Presentation on theme: "Senior Statistical Criminologist"— Presentation transcript:

1 Senior Statistical Criminologist
Statistics: The Basics. By Daniel Downs, Senior Statistical Criminologist. February 2017

2 What Are We Going to Cover? The Basics
Data Structure, Central Tendency, Dispersion, Distributions, Statistical Significance, Basic Stat Testing (T-test, Regression)

3 Data Prep – Basics – Questions to Ask
How many missing values are there?
Are there strange min/max values? (Unusual patterns can matter for fraud detection.)
Are there strange mean values, or large differences between the median and the mean?
Is there large skew?
Are the values of categorical variables in range?
Are any variables highly correlated with each other? (Redundancy)
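Several of these questions can be answered with a few lines of code. The following is a minimal sketch using only the Python standard library; the records and the field names (`store_id`, `apprehensions`) are hypothetical and invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing value.
records = [
    {"store_id": 1, "apprehensions": 12},
    {"store_id": 2, "apprehensions": None},
    {"store_id": 3, "apprehensions": 15},
    {"store_id": 4, "apprehensions": 11},
    {"store_id": 5, "apprehensions": 240},  # suspiciously large value
]


def profile(rows, field):
    """Return basic data-prep diagnostics for one numeric field."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


summary = profile(records, "apprehensions")
print(summary)
```

Here the mean is far above the median, which points to the single extreme value and is the kind of gap the checklist asks about.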

4 Some Common Symbols
N: The sample size or total number of cases.
Xi: A single score or case, where i is the score's location in the distribution. This tells us what to add.
∑: The summation operator (sigma notation). Add all the raw scores, starting with the first and ending when all numbers have been added.
S: Standard deviation
Y: Dependent variable (response)
X: Independent variable (predictor variable)

5 Variables
Analyses are dictated by your variables. There are two types:
Numeric (aka continuous): can take values on the real number line. Examples: # of stolen items, return $, # of arrests, difference between dates.
Categorical: takes on the value of a characteristic. Examples: type of theft (internal or external), type of property (coded): 1 – Big Box, 2 – Small Box, 3 – Discount Store, 4 – Warehouse Store, 5 – Other.
Note: keep in mind that the code numbers are labels, not measurements.

6 Variables cont.
A data set may have a mixture of data types.
Types of Data:
Attribute (qualitative): Verbal label, X = economics (your major); Coded, X = 3 (i.e., economics).
Numerical (quantitative): Discrete, X = 2 (your siblings); Continuous, X = 3.15 (your GPA).

7 Measures of Central Tendency
Focus is on describing the distribution: an arrangement of the values of a variable showing their observed frequency of occurrence.
Mean (average), mode (most frequently occurring value), median (50th percentile).
Which of the three measures of central tendency should be used to represent a distribution? The answer is not straightforward. It is contingent on the distribution, the representation of the data, and the meaningfulness of the data.
Mean: x̄ = (sum of the observations) / n
Example: the mean (average) number of apprehensions per store is 151/8 ≈ 18.9, the mode is 10, and the median is 19.5.
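The three measures can be computed directly with Python's `statistics` module. The transcript does not include the underlying store data, so the eight values below are hypothetical, chosen only so that they reproduce the slide's summary figures (a total of 151 over 8 stores, mode 10, median 19.5).

```python
from statistics import mean, median, mode

# Hypothetical apprehension counts for 8 stores (not the slide's actual data).
apprehensions = [10, 10, 12, 19, 20, 24, 26, 30]

avg = mean(apprehensions)            # sum of the observations / n = 151 / 8
mid = median(apprehensions)          # average of the two middle values
most_common = mode(apprehensions)    # most frequently occurring value

print(avg, mid, most_common)
```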

8 Measures of Dispersion
How spread out is the distribution?
Range: order the values from least to greatest and take the difference between the largest and the smallest. It can be misleading due to outliers.
Standard deviation: how spread out the data are (variability), i.e., how far the values typically are from the mean. The larger the standard deviation, the bigger the spread.

9 Measures of Dispersion Cont.
Take for example the following three data sets:
5, 5, 5, 5 → Mean = 5, Standard Deviation = 0
4, 4, 6, 6 → Mean = 5, Standard Deviation = 1
3, 3, 7, 7 → Mean = 5, Standard Deviation = 2
The mean is 5 for all three sets of data, but the standard deviations are 0, 1, and 2, respectively (computed with the population formula, which divides by N). The larger the standard deviation, the more the values in the set vary from the mean.
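The three standard deviations can be checked with the standard library. This sketch assumes the slide used the population formula (dividing by N), since that is what yields exactly 0, 1, and 2; the sample formula (dividing by N − 1) is shown for comparison.

```python
from statistics import pstdev, stdev

data_sets = [[5, 5, 5, 5], [4, 4, 6, 6], [3, 3, 7, 7]]

# Population standard deviation (divide by N).
population_sds = [pstdev(d) for d in data_sets]

# Sample standard deviation (divide by N - 1) is never smaller.
sample_sds = [stdev(d) for d in data_sets]

for d, p, s in zip(data_sets, population_sds, sample_sds):
    print(d, "population SD:", p, "sample SD:", round(s, 3))
```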

10 Normal Distribution
The empirical rule states that almost all of the data fall within 3 standard deviations of the mean. Outliers are often defined as observations that lie beyond the mean ± 3 standard deviations.
If a data set has an approximately bell-shaped (normal) distribution, then:
approximately 68% of the observations lie within one standard deviation of the mean, that is, in the range x̄ ± 1s
approximately 95% of the observations lie within two standard deviations of the mean, that is, in the range x̄ ± 2s
approximately 99.7% of the observations lie within three standard deviations of the mean, that is, in the range x̄ ± 3s

11 Normal Distribution Example
Suppose we have a normal distribution of apprehensions for 1,000 stores, with a mean number of apprehensions of 100 and a standard deviation S of 10. How does this help you?
If the average # of apprehensions is 100 and S is 10, then about 68% of the stores had between 90 and 110 apprehensions, since 100 − 10 = 90 and 100 + 10 = 110. In other words, about 680 of the 1,000 stores have between 90 and 110 apprehensions.
The apprehension counts within two standard deviations of the mean range from 80 to 120, since 100 − 2(10) = 80 and 100 + 2(10) = 120. From the empirical rule, we know that about 95% of all stores' apprehensions will fall within this range. Thus, about 950 of the 1,000 stores fall in this range.
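The store counts in this example can be reproduced with `statistics.NormalDist`, using the slide's mean of 100, standard deviation of 10, and 1,000 stores. The helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

stores = 1000
dist = NormalDist(mu=100, sigma=10)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    share = share_within(k)
    print(f"within {k} SD: {share:.4f} -> about {round(stores * share)} stores")
```

The exact proportions (roughly 0.683, 0.954, and 0.997) are slightly different from the rounded 68%, 95%, and 99.7% of the empirical rule, which is why the rule is stated as approximate.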

12 Standardizing – Z-Score
Sometimes we may want to look at one specific store with regard to risk: is it above the average in risk? We may also want to compare stores across distributions.
The z-score for a store indicates how far, and in what direction, that store deviates from its distribution's mean, expressed in units of its distribution's standard deviation. This is helpful because standardizing data puts your variables on one scale, so comparisons are easy to make.
Standardization formula: z = (x − x̄) / s, where x is the store's value, x̄ is the mean, and s is the standard deviation.
Data that are two standard deviations below the mean will have a z-score of −2, and data that are two standard deviations above the mean will have a z-score of +2.
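A z-score is the value minus the mean, divided by the standard deviation. The sketch below applies that to two hypothetical stores drawn from two different distributions; all numbers are invented for illustration.

```python
def z_score(value, mean, sd):
    """Distance of value from the mean, in units of the standard deviation."""
    return (value - mean) / sd


# Hypothetical store A: 120 apprehensions in a distribution with mean 100, SD 10.
z_a = z_score(120, 100, 10)

# Hypothetical store B: 45 apprehensions in a distribution with mean 50, SD 5.
z_b = z_score(45, 50, 5)

print(z_a, z_b)  # store A is 2 SDs above its mean, store B is 1 SD below
```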

13 Correlations
Correlation (r) measures the linear association between two numeric variables and takes on values between −1.0 and 1.0. A value of 0 means no linear relationship; the sign shows whether the relationship is positive or negative. Knowing the value of one variable tells you about the other variable. As r moves closer to −1 or 1, the data look more like a line. Scatter diagrams are used to view the relationship. Misconception: correlation ≠ causation. Example: store sq ft and sales.
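The correlation coefficient can be computed by hand from the deviations of each variable from its mean. This is a minimal sketch of Pearson's r; the square-footage and sales figures are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: store size (thousand sq ft) and sales ($ million).
sq_ft = [10, 20, 30, 40, 50]
sales = [1.2, 1.9, 3.2, 3.8, 5.1]

r = pearson_r(sq_ft, sales)
print(round(r, 3))
```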

14 P-values and Hypothesis Testing
A key element of inferential statistics is hypothesis testing.
H0: No relationship. Our goal as researchers is to try to disprove this.
H1: Relationship.
P-value: the p-value always takes on values between 0 and 1. It is generated during a hypothesis test by using the distribution of a test statistic. Small p-values are evidence against H0 and in favor of H1. A standard cutoff to conclude that "the data tend to indicate that H0 is not true" is a p-value less than 0.05. This can be interpreted as: "if H0 were true, there would be less than a 5% chance of seeing results at least this extreme." In other words, the observed difference is unlikely to be due to chance alone.

15 T-test
One sample
H0: μ = μ0 (the mean of the group is equal to the target value)
H1: μ ≠ μ0 (the mean of the group is different from the target value)
If the p-value < .05, we reject the null hypothesis. If the observed difference is larger than what would be expected just by chance, it is labeled statistically significant.
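The one-sample test statistic is the difference between the sample mean and the target value, divided by the standard error of the mean. The sketch below computes that statistic with the standard library; the sample and the target value of 15 are hypothetical, and the critical value 2.365 is the two-sided 5% cutoff for 7 degrees of freedom taken from a standard t table. Comparing against the critical value is equivalent to checking whether the p-value is below .05.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, target):
    """Return the t statistic and degrees of freedom for H0: mean == target."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # sample SD divides by n - 1
    return (mean(sample) - target) / standard_error, n - 1


# Hypothetical apprehension counts for 8 stores and a hypothetical target of 15.
sample = [10, 10, 12, 19, 20, 24, 26, 30]
t_stat, df = one_sample_t(sample, 15)

# Two-sided 5% critical value for df = 7, from a standard t table.
T_CRITICAL_DF7 = 2.365
reject_null = abs(t_stat) > T_CRITICAL_DF7

print(round(t_stat, 3), df, reject_null)
```

With these made-up numbers the statistic is well inside the critical value, so the null hypothesis would not be rejected.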

16 T-test cont.
Two sample
H0: μ1 = μ2 (the means of group 1 and group 2 are equal)
H1: μ1 ≠ μ2 (the means of group 1 and group 2 are not equal)
If the p-value < .05, we reject the null hypothesis.
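For two groups, the test statistic compares the difference in means to its standard error. The slide does not say which variant of the two-sample t-test is intended, so this sketch assumes Welch's version (unequal variances); the two groups are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_1, group_2):
    """Welch's two-sample t statistic for H0: the two group means are equal."""
    se = sqrt(variance(group_1) / len(group_1) + variance(group_2) / len(group_2))
    return (mean(group_1) - mean(group_2)) / se


# Hypothetical apprehension counts for two groups of stores.
group_a = [12, 15, 11, 14, 13]
group_b = [18, 20, 17, 21, 19]

t_two_sample = welch_t(group_a, group_b)
print(t_two_sample)
```

A statistic this far from zero would correspond to a small p-value; obtaining the p-value itself requires the t distribution, which the standard library does not provide.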

17 Intro to Modeling Predictive and Inferential Models
A statistical model is typically built for one of two reasons: either the modeler would like to describe the relationship between the predictor variable(s) and the response variable, or he/she would like to make predictions. The process for fitting an inferential model and a predictive model is the same, but the usage of the model may be dramatically different. An inferential model is used for inferring information about a population based on a sample, whereas a predictive model is generally concerned with making predictions about Y for future observations based on knowing only the values X1, X2, …, Xk.

18 Intro to Modeling Building a description of something
A model is a function of the predictor variables used to predict or explain a response, generally of the form Y = g(X1, X2, …, Xk) + random error. The fixed mathematical formula within the model is known as the explainable portion of the model, and the random fluctuations represent the unexplainable part of the model. In this equation, the explainable portion of the model is the function "g()", which combines the predictor variables to explain as much as it can about the response variable.

19 Regression
A simple linear regression can be defined by the equation: Y = a + bX + error
Parameters: a and b
a is the intercept of the line, the value of Y when X = 0
b is the slope of the line, the amount by which Y changes when X increases by one unit
Error = residual
For each point we can calculate: actual Y, predicted Y, and the residual (actual − predicted).
The smaller the error/residual, the better our prediction.
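The intercept a, slope b, and the residuals can be computed with ordinary least squares. The slide does not name a fitting method, so least squares is an assumption here (it is the usual choice for simple linear regression), and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [actual - pred for actual, pred in zip(ys, predicted)]  # actual - predicted

for x, y, p, e in zip(xs, ys, predicted, residuals):
    print(x, y, round(p, 2), round(e, 2))
```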

20 Regression cont.
A simple linear regression can be defined by the equation: Y = a + b × X + error, where b is the slope of the line, the amount by which Y changes when X increases by one unit, and a is the intercept of the line, the value of Y when X = 0.
Example: Y = −126.40 + 52.23 × X, where X is height in feet and Y is weight in pounds.
The intercept (a = −126.40) is the value of the dependent variable when X = 0. For every unit increase in the IV (X), the DV will increase by 52.23. In this model, a person's weight increases by 52.23 pounds with each additional foot of height.
Prediction: for a person whose height is 6 feet, the predicted weight is 187 pounds (Y = −126.40 + 52.23 × 6 feet).
How well does the IV predict the DV or explain the variance in the DV?
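The prediction for a 6-foot person can be checked with the intercept (−126.40) and slope (52.23) given on the slide. The function name `predict_weight` is an illustrative choice.

```python
INTERCEPT = -126.40  # a: value of Y when X = 0
SLOPE = 52.23        # b: change in weight (pounds) per additional foot of height


def predict_weight(height_ft):
    """Predicted weight in pounds for a given height in feet."""
    return INTERCEPT + SLOPE * height_ft


weight_6ft = predict_weight(6)
print(round(weight_6ft, 2))  # about 186.98, which the slide rounds to 187
```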

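21 Residual Plots
The residuals should be centered on zero throughout the range of fitted values; that is, the model should be correct on average for all fitted values. The spread of the residuals should also be roughly the same across the fitted values (constant variance).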

22 Extrapolation
When we make predictions for Y based on values of X, we want to stay within the range of the X values used to build the model (interpolation). Going outside the range of the X values to make a prediction about Y is known as extrapolation. Consider, for example, a model of height (Y) on age (X) built on kids aged 5 to 15: we would not want to use that model to predict the height of individuals who are 45, because it would likely estimate them to be extremely tall.
Interpolation = Good. Extrapolation = Bad.
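One simple safeguard is to refuse predictions for X values outside the range the model was built on. The sketch below illustrates that idea; the intercept and slope for the height-by-age model are hypothetical, and only the age range of 5 to 15 comes from the slide's example.

```python
def predict_within_range(x, a, b, x_min, x_max):
    """Predict Y = a + b * x, but only when x lies inside the fitted range."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x={x} is outside the fitted range [{x_min}, {x_max}]: extrapolation"
        )
    return a + b * x


# Hypothetical model: height in inches = 30 + 2.5 * age, fitted on ages 5 to 15.
A, B = 30.0, 2.5

print(predict_within_range(10, A, B, 5, 15))  # interpolation: allowed

try:
    predict_within_range(45, A, B, 5, 15)  # extrapolation: rejected
except ValueError as err:
    print(err)
```

Without the guard, this hypothetical line would predict 142.5 inches at age 45, which shows why the slide warns against extrapolating.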

