SQL for Calculating Likelihood Ratios


1 SQL for Calculating Likelihood Ratios
Farrokh Alemi, Ph.D. This section provides the SQL for calculating likelihood ratios. This brief presentation was organized by Dr. Alemi. It was narrated by xxx

2 Suppose we want to predict a patient's odds of mortality from the patient's diagnoses. We can use naïve Bayes to make these predictions. To do so, we need to calculate the likelihood ratio associated with each diagnosis. Since there are thousands of diagnoses, one needs a simple procedure to calculate these likelihood ratios within the electronic health record. It would be difficult to do this using logistic regression or other statistical tools. This section walks you through SQL code for doing this within the electronic health record.

3 Under the assumption of independence, we can calculate the likelihood ratio associated with each diagnosis as the ratio of the prevalence of the diagnosis among dead patients to its prevalence among alive patients.
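As a sketch of how these ratios enter the naïve Bayes prediction, the standard odds form can be written as follows. The symbols $Dx_i$ and $LR_i$ are notation introduced here for illustration and do not appear on the slides; the product runs over the patient's diagnoses.

```latex
\[
LR_i = \frac{P(Dx_i \mid \text{Dead})}{P(Dx_i \mid \text{Alive})},
\qquad
\text{Posterior odds of death} = \text{Prior odds of death} \times \prod_{i} LR_i
\]
```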

4 Four counts are needed: (1) the number dead,

5 (2) the number with the diagnosis among dead patients,

6 (3) the number alive,

7 (4) and the number with the diagnosis among alive patients. We show how these counts are calculated in the SQL code.
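Written with the four counts, the likelihood ratio for a diagnosis takes roughly the following form. The subscripts borrow the names used in the SQL later in these slides (nDeadAndDx, nAliveAndDx, and the totals @nDead and @nAlive); treating them as math symbols is a convenience assumed here.

```latex
\[
LR = \frac{n_{\text{DeadAndDx}} \,/\, n_{\text{Dead}}}{n_{\text{AliveAndDx}} \,/\, n_{\text{Alive}}}
\]
```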

9 Exceptions Exist
In addition, the likelihood ratio is not defined when we divide by zero, and a ratio of exactly zero would force the predicted odds to zero regardless of the patient's other diagnoses. The SQL code shows how to avoid these situations.

10 Not All of Code is Shown
No cleaning of the data. No removing duplicates. No rank order of diagnoses.
In the following SQL we do not show all of the code. The code for cleaning the data and removing duplicates is not shown. Some intermediary steps may not be shown; use the attached code to understand the ideas and not as exact code for your assignments.

11 Set Training Data
The first step is to create the training data set by splitting the data into training and validation sets. Note that in this code the random number generator needs a seed that is different for each person. Since the seed needs to be a number, we generate a new ID for each person with newid() and cast it as a binary variable.
    Print 'Generate a random number'
    Drop table #tmp1
    -- newid() is evaluated for every row, so each patient gets a different seed
    SELECT DISTINCT [ScrSSN], Rand(cast(newid() as varbinary)) AS RR
    INTO #tmp1
    FROM [Src].[CohortScrSSN]
    Go
    -- (948,236 row(s) affected)
    Print 'Select 80% of cases'
    DROP TABLE #tmp2
    SELECT ScrSSN
    INTO #tmp2
    FROM #tmp1
    WHERE RR<.8
    Go
    -- ( row(s) affected)
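The reason for the per-person seed can be seen in a small sketch like the one below. The table variable @ids and its three IDs are made up for illustration and are not part of the cohort code. In SQL Server, RAND() without a seed is evaluated once per query, so every row receives the same value, while a seed derived from NEWID() changes from row to row.

```sql
-- Made-up patient IDs standing in for the cohort table
DECLARE @ids TABLE (ScrSSN int);
INSERT INTO @ids VALUES (101), (102), (103);

SELECT ScrSSN
  , RAND() AS SameForAllRows                          -- evaluated once per query
  , RAND(CAST(NEWID() AS varbinary)) AS DiffersByRow  -- reseeded for every row
FROM @ids;
```

Because the values are random, the exact numbers differ on every run; only the pattern (one repeated value versus three different values) is the point.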

12 We randomly select 80% of cases for training and the remainder for validation. The likelihood ratios are estimated from the training data set and cross-validated on the 20% of data that we set aside.
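The slides show only the training selection. A minimal sketch of the complementary validation selection might look like this; the table variable @tmp1 and its values are made-up stand-ins for the #tmp1 table, and the slides do not show how the validation set was actually stored.

```sql
-- Made-up stand-in for #tmp1: one random number per patient
DECLARE @tmp1 TABLE (ScrSSN int, RR float);
INSERT INTO @tmp1 VALUES (101, 0.12), (102, 0.79), (103, 0.85), (104, 0.95);

-- Training set: same rule as in the slide
SELECT ScrSSN FROM @tmp1 WHERE RR < .8;    -- returns 101 and 102

-- Validation set: the complement of the training rule
SELECT ScrSSN FROM @tmp1 WHERE RR >= .8;   -- returns 103 and 104
```

Using the same random number for both selections keeps the two sets disjoint.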

13 Has Patient Died within 6 Months of Diagnosis?
In these steps we are checking to see if the patient has died within 6 months of the diagnosis.
    Print 'Has patient died within 6 months of diagnosis?'
    DROP TABLE #Dx0
    SELECT ssnID, i.SCRSSN, i.sta3n
      , admitdatetime, lastEnc, deathdate, icd9sid
      -- Dead182: 1 = died within 182 days, 0 = did not, null = fewer than 182 days of follow-up
      , iif(deathdate is null
          , iif(datediff(dd, admitdatetime, LastEnc)<182, null, 0)
          , iif(datediff(dd, admitdatetime, deathdate)<182, 1, 0)) AS Dead182
      , datediff(dd, birthdate, [AdmitDateTime]) AS AdmAgeInDays
      -- datediff(yy, ...) counts calendar-year boundaries, not completed years of age
      , datediff(yy, birthdate, [AdmitDateTime]) AS AdmAgeInYears
    INTO #Dx0
    FROM #tcases t
      left join [Src].[CohortCrosswalk] c on t.ssnID = c.scrssn
      left join [Src].[Inpat_InpatientDiagnosis] i on c.SCRSSN = i.SCRSSN
    Go
    -- ( row(s) affected)

14 First, for patients whose date of death is null and who have at least 182 days of follow-up after the diagnosis, we assign 0, indicating that the patient has not died within 6 months of the diagnosis.

15 Second, for patients with no date of death whose date of diagnosis, here shown as the date of admission, is within 182 days of the date of last encounter, we assign a null value. We do not have a 6-month follow-up period for these patients and therefore cannot use their information.

16 Third, if the date of death is within 182 days of the admit date, we assign a 1; if the patient died later than that, we assign a 0. We call this variable Dead182, for dead within 182 days.
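The three rules can be checked on a few made-up rows. The sketch below is an illustration only: the table variable @adm and its dates are invented, and the nested iif() from the slide is rewritten as an equivalent CASE expression so each rule sits on its own line.

```sql
-- Made-up admissions: one row per outcome of the Dead182 rule
DECLARE @adm TABLE (id int, admitdatetime date, lastEnc date, deathdate date);
INSERT INTO @adm VALUES
  (1, '2010-01-01', '2011-06-01', NULL),          -- alive, long follow-up
  (2, '2010-01-01', '2010-03-01', NULL),          -- alive, short follow-up
  (3, '2010-01-01', '2010-04-01', '2010-04-01'),  -- died 90 days after admission
  (4, '2010-01-01', '2012-01-01', '2012-01-01');  -- died 2 years after admission

SELECT id
  , CASE
      WHEN deathdate IS NULL AND DATEDIFF(day, admitdatetime, lastEnc) < 182 THEN NULL
      WHEN deathdate IS NULL THEN 0
      WHEN DATEDIFF(day, admitdatetime, deathdate) < 182 THEN 1
      ELSE 0
    END AS Dead182
FROM @adm
ORDER BY id;
-- Expected Dead182 by id: 1 -> 0, 2 -> NULL, 3 -> 1, 4 -> 0
```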

17 Note that we get these data from the training cases. From our cohort we select the records whose ID matches the ID of a training case.

18 We get the diagnoses from the inpatient encounter file. The code assumes that you have removed duplicate diagnoses within the encounter file; otherwise this join will result in many duplicates.
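The slides do not show how duplicates were removed. One possible approach, sketched here under the assumption that Dead182 has a single value per patient, is a SELECT DISTINCT over patient and diagnosis. The table variable @Dx0 and its rows are made up to stand in for #Dx0.

```sql
-- Made-up rows standing in for #Dx0; patient 1 has diagnosis 500 recorded twice
DECLARE @Dx0 TABLE (ssnID int, icd9sid int, Dead182 int);
INSERT INTO @Dx0 VALUES (1, 500, 1), (1, 500, 1), (1, 600, 1), (2, 500, 0);

-- Keep one row per patient and diagnosis;
-- rows without 182 days of follow-up (Dead182 is null) are dropped
SELECT DISTINCT ssnID, icd9sid, Dead182
FROM @Dx0
WHERE Dead182 IS NOT NULL;
-- Returns 3 rows: (1, 500, 1), (1, 600, 1), (2, 500, 0)
```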

19 In this step we count the number of times each diagnosis co-occurs with death within 182 days. The analysis is done for each repetition of each diagnosis. We have not shown how the repetition variable is organized in these slides; an earlier introduction to the rank function describes the process. The GROUP BY command lets us compute a separate set of statistics for each diagnosis and its repetition.
    Print 'Calculate co-occurrences of repeated diagnosis and death'
    DROP TABLE #dx1
    SELECT icd6, Repeated
      , count(distinct id) as nDx         -- distinct patients with the diagnosis
      , sum(dead182) as nDeadAndDx        -- of these, dead within 182 days
      , sum(1-dead182) as nAliveAndDx     -- of these, alive at 182 days
    INTO #dx1
    FROM dflt.tdx
    GROUP BY icd6, Repeated
    HAVING count(distinct id)>29
    Go
    -- (10928 row(s) affected)

20 We are also ignoring any repeated diagnosis that occurs in fewer than 30 distinct patients; the HAVING clause keeps only groups with more than 29 distinct IDs.

21 For each repeated diagnosis, we calculate the number of times the diagnosis occurs among dead patients.

22 And among alive patients.

23 First Remove Duplicates
Keep in mind that these sums will not be accurate if we have not previously removed duplicate IDs and duplicate diagnosis codes for the same person.
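A small sketch shows how a duplicate row distorts the sums. The table variable @tdx and its three rows are made up for illustration; the aggregate expressions mirror the ones in the slide.

```sql
-- Made-up rows: patient 1 died and has diagnosis 'A' recorded twice
DECLARE @tdx TABLE (id int, icd6 varchar(10), dead182 int);
INSERT INTO @tdx VALUES (1, 'A', 1), (1, 'A', 1), (2, 'A', 0);

SELECT icd6
  , COUNT(DISTINCT id) AS nDx         -- 2 patients
  , SUM(dead182) AS nDeadAndDx        -- 2, although only 1 patient died
  , SUM(1 - dead182) AS nAliveAndDx   -- 1
FROM @tdx
GROUP BY icd6;
```

Here nDeadAndDx plus nAliveAndDx is 3 while nDx is 2, which is one quick way to spot leftover duplicates.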

24 We can show what we have calculated as cell values in a contingency table. The rows indicate whether the repeated diagnosis is present or absent, and the columns indicate whether the patient died within 182 days.
                       Dead in 182 Days
    Dx & Repeat    Yes    No     Total
    Present        a      b      a+b
    Absent         c      d      c+d
    Total          a+c    b+d    a+b+c+d

25 Using the contingency table, we can see that the number dead with the repeated diagnosis (nDeadAndDx in the code) is the cell value shown as “a” in the contingency table.

26 The number alive with the repeated diagnosis (nAliveAndDx in the code) is the cell value shown as “b” in the contingency table.

27 So far we have calculated two cell values in our contingency table.

28 The remaining cell values can be calculated from the difference between the calculated cell values and the totals. For example, the number of patients who do not have the repeated diagnosis and died within 182 days, i.e. c in the table, can be calculated by subtracting a from the total dead. Make sure that the total dead is calculated over distinct patients.
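In the cell notation of the contingency table, the subtraction can be written out as below. The subscripted names are borrowed from the SQL columns and the @nDead and @nAlive totals computed in the next step; using them as math symbols is an assumption made here for readability.

```latex
\[
c = (a + c) - a = n_{\text{Dead}} - n_{\text{DeadAndDx}},
\qquad
d = (b + d) - b = n_{\text{Alive}} - n_{\text{AliveAndDx}}
\]
```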

29 In this step we calculate the likelihood ratios for each repeated diagnosis.
    Print 'Calculate Likelihood Ratio for repeated diagnosis'
    DECLARE @nDead Int, @nAlive Int
    SET @nDead = (SELECT count(distinct id) FROM #tDx WHERE dead182=1)
    SET @nAlive = (SELECT count(distinct id) FROM #tDx WHERE dead182=0)
    DROP TABLE Dflt.LR
    SELECT Diagnosis, Repeated
      , iif(nDeadAndDx=0, 1.0/cast((nDx+1) as float)
      , iif(nAliveAndDx=0, nDx+1
      -- each count is divided by its total so that the result is a ratio of proportions
      , (cast(nDeadAndDx as float)/@nDead)/(cast(nAliveAndDx as float)/@nAlive))) as LR
    INTO Dflt.LR
    FROM #dx4
    Go
    -- (32118 row(s) affected)

30 First, we calculate the totals. We start by declaring the totals as integer variables.

31 Second, total dead is calculated where dead182 is 1. Note that the count is for distinct patient IDs.

32 Third, total alive is calculated where dead182 is zero. Note that the count is again for distinct patient IDs.

33 Now we are ready to calculate the likelihood ratios. First we handle the exceptions to our method of calculating the likelihood ratio. In the first line we are saying that if no patient with the diagnosis died within 182 days, then the likelihood ratio is calculated as 1 divided by 1 plus the total number of patients with the diagnosis. This guarantees a number close to 0 that moves closer to 0 as the sample size grows.

34 Next we take care of the situation where none of the patients with the diagnosis were alive, that is, all of them died within 182 days. In this situation the likelihood ratio is calculated as 1 plus the number of cases with the diagnosis.

35 Finally, we calculate the likelihood ratio as the proportion of dead people who have the diagnosis divided by the proportion of alive people who have the same diagnosis. Note that for these calculations to work we need to cast the integer values into floats; otherwise SQL Server performs integer division and drops the fractional part.
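To see the two exceptions and the general case side by side, the calculation can be tried on made-up counts. In the sketch below the totals (100 dead, 900 alive), the table variable @dx, and its three rows are all invented for illustration; the general case follows the ratio of proportions described in the narration.

```sql
-- Made-up totals and counts; only the LR expression follows the slides
DECLARE @nDead int = 100, @nAlive int = 900;
DECLARE @dx TABLE (Diagnosis varchar(10), nDx int, nDeadAndDx int, nAliveAndDx int);
INSERT INTO @dx VALUES
  ('A', 40,  0, 40),   -- nobody with the diagnosis died
  ('B', 30, 30,  0),   -- everybody with the diagnosis died
  ('C', 50, 20, 30);   -- mixed outcomes

SELECT Diagnosis
  , IIF(nDeadAndDx = 0, 1.0 / CAST(nDx + 1 AS float)
  , IIF(nAliveAndDx = 0, nDx + 1
  , (CAST(nDeadAndDx AS float) / @nDead) / (CAST(nAliveAndDx AS float) / @nAlive))) AS LR
FROM @dx
ORDER BY Diagnosis;
-- A: 1/41, about 0.024;  B: 31;  C: (20/100)/(30/900), about 6
```

Diagnosis A and B never reach the division in the last branch, so neither a zero ratio nor a division by zero can occur.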

36 SQL can be used to calculate likelihood ratios for thousands of predictors
These slides have shown the ideas behind how SQL can be used to calculate the likelihood ratio for thousands of predictors.

