SQL for Calculating Likelihood Ratios

Presentation transcript:

SQL for Calculating Likelihood Ratios. Farrokh Alemi, Ph.D. This section shows how to calculate likelihood ratios in SQL. This brief presentation was organized by Dr. Alemi. It was narrated by xxx

Suppose we want to predict patients' odds of mortality from the patients' diagnoses. We can use naïve Bayes to make these predictions. To do so, we need to calculate the likelihood ratio associated with each diagnosis. Since there are thousands of diagnoses, one needs a simple procedure that can calculate these likelihood ratios within the electronic health record. It would be difficult to do this using logistic regression or other statistical tools. This section walks you through SQL code for doing it within electronic health records.
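For context, the way naïve Bayes combines likelihood ratios can be written in odds form. This formula is not shown in the slides; it is the standard odds form under the independence assumption, and the index i over a patient's n diagnoses is notation assumed here.

```latex
\text{Posterior odds of mortality}
  = \text{Prior odds of mortality} \times \prod_{i=1}^{n} LR_i
```

Here LR_i stands for the likelihood ratio of the patient's i-th diagnosis.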

Under the assumption of independence, we can calculate the likelihood ratio associated with each diagnosis as the ratio of the prevalence of the diagnosis among dead patients to its prevalence among alive patients.

Four counts are needed:

1. The number dead
2. The number with the diagnosis among dead patients
3. The number alive
4. The number with the diagnosis among alive patients

We show how these counts are calculated in the SQL code.
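Written as a formula, the four counts combine as follows. The symbol names are borrowed from the variable names used in the SQL code (nDeadAndDx, nDead, nAliveAndDx, nAlive); the typesetting itself is an addition and not part of the slides.

```latex
LR_{\text{Dx}}
  = \frac{\text{nDeadAndDx} / \text{nDead}}
         {\text{nAliveAndDx} / \text{nAlive}}
```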

Exceptions Exist. In addition, the likelihood ratio is not defined when we would be dividing by zero, and a ratio of exactly zero is also a problem. The SQL code shows how to avoid both situations.

Not All of Code is Shown. In the following SQL we do not show all of the code: the steps for cleaning the data, removing duplicates, and rank ordering the diagnoses are not shown. Some intermediate steps may also be missing. Use the attached code to understand the ideas, not as exact code for your assignments.

Set Training Data

    Print 'Generate a random number'
    Drop table #tmp1
    -- newid() gives every row a different seed for Rand()
    SELECT DISTINCT [ScrSSN], Rand(cast(newid() as varbinary)) AS RR
    INTO #tmp1
    FROM [Src].[CohortScrSSN]
    Go
    -- (948,236 row(s) affected)

    Print 'Select 80% of cases'
    DROP TABLE #tmp2
    SELECT ScrSSN
    INTO #tmp2
    FROM #tmp1
    WHERE RR<.8
    Go
    -- (759107 row(s) affected)

The first step is to create the training data set by splitting the data into training and validation sets. Note that in this code the random number generator needs a seed that is different for each person. We therefore generate a new ID with newid() for each row and, since the seed needs to be a number, cast it as a binary value.

We randomly select 80% of cases for training and the remainder for validation. The likelihood ratios are estimated from the training data set and cross-validated on the 20% of data that we set aside.
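The same splitting idea can be sketched outside the database. The following Python is only an illustration, not the author's code: the function name split_patients, its parameters, and the made-up patient IDs are all assumptions. As in the SQL, each distinct patient gets one uniform random draw and goes to training when the draw is below 0.8, so the training share is close to 80% but not exactly 80%.

```python
import random


def split_patients(patient_ids, train_fraction=0.8, seed=None):
    """Assign each distinct patient to training or validation by one uniform draw."""
    rng = random.Random(seed)
    training, validation = [], []
    for pid in sorted(set(patient_ids)):      # one draw per distinct patient
        if rng.random() < train_fraction:     # mirrors WHERE RR < .8
            training.append(pid)
        else:
            validation.append(pid)
    return training, validation


# Hypothetical patient IDs 0..9999, used only for illustration.
train_ids, valid_ids = split_patients(range(10000), seed=1)
print(len(train_ids), len(valid_ids))
```

Fixing the seed makes the split reproducible, which the newid() approach in the SQL does not offer.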

Has Patient Died within 6 Months of Diagnosis?

    Print 'Has patient died within 6 months of diagnosis?'
    DROP TABLE #Dx0
    SELECT ssnID, i.SCRSSN, i.sta3n
      , admitdatetime, lastEnc, deathdate, icd9sid
      -- Dead182: 1 = died within 182 days, 0 = did not, null = too little follow-up
      , iif(deathdate is null
          , iif(datediff(day, admitdatetime, LastEnc)<182, null, 0)
          , iif(datediff(day, admitdatetime, deathdate)<182, 1, 0)) AS Dead182
      , datediff(day, birthdate, [AdmitDateTime]) AS AdmAgeInDays
      , datediff(year, birthdate, [AdmitDateTime]) AS AdmAgeInYears
    INTO #Dx0
    FROM #tcases t
      left join [Src].[CohortCrosswalk] c on t.ssnID = c.scrssn
      left join [Src].[Inpat_InpatientDiagnosis] i on c.SCRSSN = i.SCRSSN
    Go
    -- (35889752 row(s) affected)

In these steps we check whether the patient has died within 6 months of the diagnosis.

First, for patients whose date of death is null and who have at least 182 days of follow-up after admission, we assign 0, indicating that the patient has not died within 6 months of the diagnosis.

Second, for patients with no date of death whose date of diagnosis, here shown as date of admission, is within 182 days of the date of last encounter, we assign a null value. We do not have a 6-month follow-up period for these patients and therefore cannot use their information.

Third, if the date of death is within 182 days of the admit date, we assign a 1; patients who died later than that are assigned 0. We call this variable Dead182, dead within 182 days.
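The three rules can be restated as a small function. This is a sketch under stated assumptions rather than the author's code: the function name dead_within_182 and its argument names are hypothetical, dates are assumed to carry no time of day, and the admission and last encounter dates are assumed to be non-null.

```python
from datetime import date


def dead_within_182(admit, last_encounter, death):
    """Return 1, 0, or None, following the nested IIF logic in the SQL."""
    if death is None:
        # No recorded death: we need 182 days of follow-up to call the patient alive.
        if (last_encounter - admit).days < 182:
            return None   # too little follow-up, record cannot be used
        return 0          # followed long enough and not dead
    # Recorded death: 1 only if it falls within 182 days of admission.
    return 1 if (death - admit).days < 182 else 0


# Made-up dates for illustration only.
print(dead_within_182(date(2015, 1, 1), date(2015, 3, 1), None))
print(dead_within_182(date(2015, 1, 1), date(2016, 1, 1), None))
print(dead_within_182(date(2015, 1, 1), date(2015, 2, 1), date(2015, 2, 1)))
```

Returning None for the short follow-up case plays the role of the SQL null: those records drop out of the later sums.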

Note that we get these data from training cases. We select from our cohort the cases whose ID matches the ID of a training case.

We get the diagnoses from the inpatient encounter file. The code assumes that you have removed duplicate diagnoses within the encounter file; otherwise this join will result in many duplicates.

    Print 'Calculate co-occurrences of repeated diagnosis and death'
    DROP TABLE #dx1
    SELECT icd6, Repeated
      , count(distinct id) as nDx          -- patients with this diagnosis and repetition
      , sum(dead182) as nDeadAndDx         -- of those, dead within 182 days
      , sum(1-dead182) as nAliveAndDx      -- of those, alive at 182 days
    INTO #dx1
    FROM dflt.tdx
    GROUP BY icd6, Repeated
    HAVING count(distinct id)>29
    Go
    -- (10928 row(s) affected)

In this step we count the number of times the diagnosis co-occurs with death within 182 days. The analysis is done for each repetition of each diagnosis. We have not shown how the repetition variable is organized in these slides; an earlier introduction to the rank function describes the process. The GROUP BY command gives us a separate set of statistics for each diagnosis and its repetition.

We also ignore any diagnosis that occurs in fewer than 30 distinct patients; the HAVING clause keeps only counts above 29.

For each repeated diagnosis, we calculate the number of times the diagnosis occurs among dead patients and among alive patients.

First Remove Duplicates. Keep in mind that these sums will not be accurate if we have not previously removed duplicate IDs and duplicate diagnosis codes for the same person.
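The counting step can be tried on a few made-up rows. The sketch below runs a query of the same shape in SQLite through Python's sqlite3 module, which is an assumption for illustration since the slides target SQL Server. The rows, the diagnosis labels DxA to DxC, and the lowered threshold MIN_PATIENTS are hypothetical.

```python
import sqlite3

# Hypothetical toy rows: (patient id, diagnosis code, repetition, dead182).
# Duplicates are assumed to have been removed already.
rows = [
    (1, "DxA", 1, 1),
    (2, "DxA", 1, 1),
    (3, "DxA", 1, 0),
    (4, "DxB", 1, 0),
    (5, "DxB", 1, 0),
    (6, "DxC", 1, 1),
]
MIN_PATIENTS = 2  # the slides require more than 29; lowered for the toy data

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tdx (id INTEGER, icd6 TEXT, repeated INTEGER, dead182 INTEGER)"
)
conn.executemany("INSERT INTO tdx VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT icd6, repeated,
           COUNT(DISTINCT id) AS nDx,
           SUM(dead182)       AS nDeadAndDx,
           SUM(1 - dead182)   AS nAliveAndDx
    FROM tdx
    GROUP BY icd6, repeated
    HAVING COUNT(DISTINCT id) >= ?
    ORDER BY icd6, repeated
"""
counts = conn.execute(query, (MIN_PATIENTS,)).fetchall()
conn.close()

for row in counts:
    print(row)
```

DxC is dropped because only one patient has it, which is the role the HAVING clause plays in the original query.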

                      Dead in 182 Days
    Dx & Repeat     Yes     No      Total
    Present         a       b       a+b
    Absent          c       d       c+d
    Total           a+c     b+d     a+b+c+d

We can show what we have calculated as cell values in a contingency table. The rows indicate whether the repeated diagnosis is present or absent and the columns indicate whether the patient died within 182 days.

Using the contingency table, we can see that the number dead with the repeated diagnosis is the cell value shown as "a", and the number alive with the repeated diagnosis is the cell value shown as "b".

So far we have calculated two cell values in our contingency table.

The remaining cell values can be calculated as the difference between the total values and the cell values already calculated. For example, the number of patients who do not have the repeated diagnosis and are dead within 182 days, i.e. c in the table, is the total dead minus our value for a. Make sure that the total dead is calculated for distinct patients.
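The subtraction can be made concrete with a short sketch. The function name complete_table and the example counts are made up for illustration; the cell letters follow the contingency table in the slides.

```python
def complete_table(n_dead_and_dx, n_alive_and_dx, n_dead, n_alive):
    """Fill in cells c and d from the two counted cells and the column totals."""
    a = n_dead_and_dx        # dead, diagnosis present
    b = n_alive_and_dx       # alive, diagnosis present
    c = n_dead - a           # dead, diagnosis absent
    d = n_alive - b          # alive, diagnosis absent
    return {"a": a, "b": b, "c": c, "d": d}


# Hypothetical counts: 100 dead and 900 alive distinct patients overall.
cells = complete_table(n_dead_and_dx=20, n_alive_and_dx=30, n_dead=100, n_alive=900)
print(cells)
```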

    Print 'Calculate Likelihood Ratio for repeated diagnosis'
    DECLARE @nDead int, @nAlive int
    -- Totals are counted over distinct patient IDs
    SET @nDead  = (SELECT count(distinct id) FROM #tDx WHERE dead182=1)
    SET @nAlive = (SELECT count(distinct id) FROM #tDx WHERE dead182=0)
    DROP TABLE Dflt.LR
    SELECT Diagnosis, Repeated
      , iif(nDeadAndDx=0, 1.0/cast((nDx+1) as float)   -- nobody with the diagnosis died
      , iif(nAliveAndDx=0, nDx+1                       -- nobody with the diagnosis survived
      , (cast(nDeadAndDx as float)/cast(@nDead as float))
        /(cast(nAliveAndDx as float)/cast(@nAlive as float)))) as LR
    INTO Dflt.LR
    FROM #dx4
    Go
    -- (32118 row(s) affected)

In this step we calculate the likelihood ratio for each repeated diagnosis.

First, we calculate the totals. We start by declaring the totals as integer variables.

Second, total dead is calculated where dead182 is 1. Note that the count is for distinct patient IDs.

Third, total alive is calculated where dead182 is zero. Note that the count is again for distinct patient IDs.

Now we are ready to calculate the likelihood ratios. First we handle the exceptions to our method of calculating the likelihood ratio. In the first line we are saying that if no patient with the diagnosis died within 182 days, then the likelihood ratio is calculated as 1 divided by 1 plus the total number of patients with the diagnosis. This guarantees a number close to 0 that moves closer to 0 as the sample size grows.

Next we take care of the situation where none of the patients with the diagnosis were alive, that is, where nAliveAndDx is zero and the usual formula would divide by zero. In this situation the likelihood ratio is calculated as 1 plus the number of cases with the diagnosis.

Finally, we calculate the likelihood ratio as the proportion of dead people who have the diagnosis divided by the proportion of alive people who have the same diagnosis. Note that for these calculations to work we need to cast all integer values into floats; otherwise the divisions are carried out as integer divisions.
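The three branches of the calculation can be checked by hand with a small function. This is a sketch that mirrors the nested IIF in the SQL, not a replacement for it; the function name likelihood_ratio and the example counts are hypothetical.

```python
def likelihood_ratio(n_dx, n_dead_and_dx, n_alive_and_dx, n_dead, n_alive):
    """Likelihood ratio of a diagnosis, with the two exception rules from the SQL."""
    if n_dead_and_dx == 0:
        # Nobody with the diagnosis died: small ratio that shrinks with sample size.
        return 1.0 / (n_dx + 1)
    if n_alive_and_dx == 0:
        # Nobody with the diagnosis survived: avoid dividing by zero.
        return float(n_dx + 1)
    # Usual case: prevalence among the dead over prevalence among the alive.
    return (n_dead_and_dx / n_dead) / (n_alive_and_dx / n_alive)


# Hypothetical counts: 100 dead and 900 alive patients in total.
usual = likelihood_ratio(50, 20, 30, 100, 900)       # (20/100) / (30/900)
none_died = likelihood_ratio(30, 0, 30, 100, 900)    # 1 / (30 + 1)
none_alive = likelihood_ratio(30, 30, 0, 100, 900)   # 30 + 1
print(usual, none_died, none_alive)
```

In Python 3 the / operator always performs floating-point division, so the explicit casts needed in the SQL have no counterpart here.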

These slides have shown the ideas behind how SQL can be used to calculate the likelihood ratio for thousands of predictors.