Texas a&m transportation institute Safe-D project

Slides:



Advertisements
Similar presentations
Statistical Analysis SC504/HS927 Spring Term 2008
Advertisements

Simple Logistic Regression
Regression Analysis Using Excel. Econometrics Econometrics is simply the statistical analysis of economic phenomena Here, we just summarize some of the.
Multiple Linear Regression Model
Chi-square Test of Independence
Nemours Biomedical Research Statistics April 23, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.
1 Chapter 16 Linear regression is a procedure that identifies relationship between independent variables and a dependent variable.Linear regression is.
Notes on Logistic Regression STAT 4330/8330. Introduction Previously, you learned about odds ratios (OR’s). We now transition and begin discussion of.
Copyright © 2014 Pearson Education, Inc.12-1 SPSS Core Exam Guide for Spring 2014 The goal of this guide is to: Be a side companion to your study, exercise.
SW388R6 Data Analysis and Computers I Slide 1 Chi-square Test of Goodness-of-Fit Key Points for the Statistical Test Sample Homework Problem Solving the.
© Willett, Harvard University Graduate School of Education, 8/27/2015S052/I.3(c) – Slide 1 More details can be found in the “Course Objectives and Content”
Chapter 13: Inference in Regression
Chapter 13 – 1 Chapter 12: Testing Hypotheses Overview Research and null hypotheses One and two-tailed tests Errors Testing the difference between two.
Multiple Choice Questions for discussion
The Argument for Using Statistics Weighing the Evidence Statistical Inference: An Overview Applying Statistical Inference: An Example Going Beyond Testing.
Chapter 15 Data Analysis: Testing for Significant Differences.
April 4 Logistic Regression –Lee Chapter 9 –Cody and Smith 9:F.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 13 Multiple Regression Section 13.3 Using Multiple Regression to Make Inferences.
11/16/2015Slide 1 We will use a two-sample test of proportions to test whether or not there are group differences in the proportions of cases that have.
6.1 Inference for a Single Proportion  Statistical confidence  Confidence intervals  How confidence intervals behave.
Chapter 6: Analyzing and Interpreting Quantitative Data
12/23/2015Slide 1 The chi-square test of independence is one of the most frequently used hypothesis tests in the social sciences because it can be used.
1/5/2016Slide 1 We will use a one-sample test of proportions to test whether or not our sample proportion supports the population proportion from which.
URBDP 591 I Lecture 4: Research Question Objectives How do we define a research question? What is a testable hypothesis? How do we test an hypothesis?
1 Probability and Statistics Confidence Intervals.
CHAPTER 7: TESTING HYPOTHESES Leon-Guerrero and Frankfort-Nachmias, Essentials of Statistics for a Diverse Society.
Class Seven Turn In: Chapter 18: 32, 34, 36 Chapter 19: 26, 34, 44 Quiz 3 For Class Eight: Chapter 20: 18, 20, 24 Chapter 22: 34, 36 Read Chapters 23 &
LOGISTIC REGRESSION. Purpose  Logistical regression is regularly used when there are only two categories of the dependent variable and there is a mixture.
Dr.Theingi Community Medicine
Transportation Planning Asian Institute of Technology
Logistic Regression: Regression with a Binary Dependent Variable.
Other tests of significance. Independent variables: continuous Dependent variable: continuous Correlation: Relationship between variables Regression:
Looking for statistical twins
Chi Square Test Dr. Asif Rehman.
Chi-square test.
Logic of Hypothesis Testing
Correlation and Linear Regression
BINARY LOGISTIC REGRESSION
Chapter 14 Inference on the Least-Squares Regression Model and Multiple Regression.
Logistic Regression APKC – STATS AFAC (2016).
University of Warwick, Department of Sociology, 2014/15 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Logistic Regression II/ (Hierarchical)
Lecture Slides Elementary Statistics Twelfth Edition
Notes on Logistic Regression
INTRODUCTORY STATISTICS FOR CRIMINAL JUSTICE
Chi-Square X2.
Hypothesis Testing Review
APPROACHES TO QUANTITATIVE DATA ANALYSIS
Analyzing and Interpreting Quantitative Data
Week 14 Chapter 16 – Partial Correlation and Multiple Regression and Correlation.
Generalized Linear Models
R Data Manipulation Bootstrapping
Multiple logistic regression
Jeffrey E. Korte, PhD BMTRY 747: Foundations of Epidemiology II
Tabulations and Statistics
Inference for Categorical Data
Chapter 11: Inference for Distributions of Categorical Data
Categorical Data Analysis Review for Final
Logistic Regression.
Lesson 11 - R Chapter 11 Review:
15.1 The Role of Statistics in the Research Process
Texas A&M Transportation Institute Safe-D Project AVID Slides
Multiple Regression – Split Sample Validation
Interpreting Epidemiologic Results.
One-Factor Experiments
ESTIMATION.
Comparing Two Proportions
Comparing Two Proportions
Chapter 6 Logistic Regression: Regression with a Binary Dependent Variable Copyright © 2010 Pearson Education, Inc., publishing as Prentice-Hall.
Exercise 1 (a): producing individual tables, using the cross-tabs menu
Stratified Covariate Balancing Using R
Presentation transcript:

Vehicle Occupants and Driver Behavior: An Assessment of Vulnerable User Groups Texas a&m transportation institute Safe-D project 02-009 Learning module-educational transfer

overview Project Objectives Introducing the Dataset Data Cleaning & Processing Model Development Process Creating a Model: Example in R Interpreting Model Results

Project objectives

Speeding & occupancy Determine whether speeding is more prevalent while traveling with occupants for: Teenagers Drivers (age ≥ 25 & <50) Traveling with Children Seniors (Age 65+) Use data traditionally used for transportation planning purposes for safety research Based on results obtained from models, recommend safety countermeasures

Introducing the dataset

Texas Department of Transportation Travel Survey Program Performed on a rotational basis for study areas across Texas Traditionally used in developing rates for planning and travel demand modeling Sample sizes range from 2,000 to 6,000 households GPS Data of approx. 10% of the HH

Data Study Areas Abilene Bryan-College Station Corpus Christi El Paso Houston Midland-Odessa San Angelo Sherman-Denison Texarkana Victoria Wichita Falls

HERE Network data Highly detailed roadway network Posted speed limits Example Variables: LinkID Functional class Number lanes Lane type (e.g., bike lanes) Pavement markings Direction of travel Z-levels (grade separated)

Data cleaning & processing

Travel Diary / GPS Data Linking Original purpose Spatial link of origin & destination locations Temporal link of departure and arrival times Imperfections between observed and revealed responses Various kinds of linkages based on different parameters O D Lat/Long Coordinates Departed: 10:00 AM Arrived: 10:45 AM

Map matching

Creating a Model: Example in r Install and load libraries Set up working directory Load input files Prepare tables for analysis Perform binomial logistic regression

1. Install and load libraries Move the cursor to install.packages(“dpylr”) and run the lines one by one until all the packages are installed and loaded. Note : While installing packages, it is advised to run the install statements one by one rather than selecting multiple and running them all at once. Useful Shortcuts for the exercise: Description Windows & Linux Mac Run current line/ selection Ctrl+Enter Command+Enter Comment/ uncomment current line/ selection Ctrl+Shift+C Command+Shift+C

2. Set up working directory Set the working directory as your folder where you have saved the data files. Using setwd(“C:/Working Directory/Data Files”) statement. Note: If you are using Windows, make sure you swap the slashes to forward slash(/) after copying the address from file explorer.

3. Load input files To run the model, samples from various study areas need to be merged and split into age groups. Merge all the rows using ‘rbind’ to create one dataframe called ‘myMergedData’

4. Prepare tables for analysis Replace NAs with zeroes. Note : While working on R, always replace or remove NAs in a columns if you intend to perform a calculation. Columns containing NAs always return NAs while performing calculations.

4. Prepare tables for analysis (contd.) For the purpose of modeling, the following new variables are needed. Some of these are converted to referent groups and are identified by the value of ‘0’. A referent group is a group to which another group is compared. unique_driver_id: speeding_10pct_ff_dur: Speeding 10% of the trip duration. 1: true; 0: false

4. Prepare tables for analysis (contd.) speeding_20pct_ff_dur: Speeding 20% of the trip duration. 1: true; 0: false trip_occupancy_type: Number of passengers child_occupancy_type: Number of children trash_flag: a flag to exclude entries with zero trips ethnicity_code: White is the referent group study_area_code: Sherman Dennison is the referent group

4. Prepare tables for analysis (contd.) Covert existing variables to referent group: Driver’s sex and employment status Convert the above variables to factors (categorical variables in R) female was 2 now 0 unemployed (i.e. no) was 2 now 0

4. Prepare tables for analysis (contd.) Create Data Subsets: Older drivers 65+ years old Younger drivers between 16 to 24 years old All drivers 16+ years old Create a cross table to assess the sample sizes in each group.

5. Perform binomial logistic regression Use ‘glm.cluster’ function on various variables to calculate robust standard errors Older Drivers: 20% Speeding drivers ->ethnicity, study area, employment, age, sex, adult passengers Younger Drivers: 20% Speeding drivers -> study area, adult passenger, child passenger All Drivers with Child: 20% Speeding drivers -> adult passengers, child passenger, ethnicity, study area, employment, age, sex, adult passengers Calculate Odds Ratio and 95% Confidence Interval using the robust standard errors calculated above

5. Perform binomial logistic regression NOTES: Binomial logistic regression was selected because the outcome was dichotomous (yes/no) Robust standard errors were used since data were clustered by driver (e.g., multiple trips per driver) thus violating the assumption of independence of observations The terms for study areas were forced (i.e., allowed to remain in the model regardless of their statistical significance) into the model since data also were sampled by study area.

5. Perform binomial logistic regression NOTES continued: Prior to the model building process, determine the approach (e.g., forward selection, backward elimination) Set your level of statistical significance or alpha (α). A level of 0.05 which corresponds to a confidence interval of 95% is common

Interpreting Model ResultS

Odds Ratios Logistic regression models produce coefficients that can transformed into odds ratios (ORs). ORs can be easier and more intuitive to interpret. Odds ratios range from 0 to infinity. The null value of an OR is 1.00. This value means that there is no indication of a relationship between an independent and dependent variable Positive relationship: OR greater than 1.00 Negative or inverse relationship: OR less than 1.00 If the 95% CI includes 1.00, the result is not statistical significant at the α=0.05 level

Odds Ratios Examples The odds ratio below was produced from a model (with more than one independent variable) constructed to estimate the association between having a child passenger in the vehicle and speeding among drivers ages 16-55 years old. Presence of child passenger: adjOR=0.73; 95% CI=0.60-0.87 0.73 is less than 1.00; speeding is less common when a child passenger is present This is called a protective effect since speeding is less when a child passenger is present The odds of speeding were 27% lower (1-0.73=0.27) among drivers with a child passenger present compared to drivers without a child passenger present Said another way, the odds of speeding among those with a child passenger are .73 times the odds of speeding among those without a child passenger This estimate is statistically significant because the 95% CI does not include 1.00

Odds Ratios Examples The odds ratio below was produced from a model (with more than one independent variable) constructed to estimate the association between having a child passenger in the vehicle and speeding among drivers ages 16-55 years old. Employed driver: adjOR=1.77; 95% CI=1.48-2.12 1.77 is greater than 1.00; speeding is more common among employed drivers versus unemployed drivers The odds of speeding were 77% higher (1.77=1.00=0.77) among employed drivers compared to unemployed drivers Said another way, the odds of speeding among employed drivers are 1.77 times the odds of speeding among unemployed drivers This estimate is statistically significant because the 95% CI does not include 1.00

Odds Ratios A thorough and accessible discussion of the interpretation of odds ratio in logistic regression can be found at the UCLA Institute for Digital Research and Education website: https://stats.idre.ucla.edu/other/mult- pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/