1 Incorporating N-gram Statistics in the Normalization of Clinical Notes By Bridget Thomson McInnes
2 Overview
- Ngrams
- Ngram Statistics for Spelling Correction
- Spelling Correction
- Ngram Statistics for Multi Term Identification
- Multi Term Identification
3 Ngram
Example sentence: Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.
Bigrams: her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient
Trigrams: her dobutamine stress, dobutamine stress echo, stress echo showed, echo showed mild, showed mild aortic, mild aortic stenosis, aortic stenosis with, stenosis with a, with a subaortic, a subaortic gradient
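A minimal sketch of how the bigrams and trigrams for the example sentence could be generated. The helper name `ngrams` and the lowercase, whitespace-based tokenization are assumptions made for illustration, not part of the original slides.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: tokens are produced by lowercasing and splitting on whitespace.
sentence = ("Her dobutamine stress echo showed mild aortic stenosis "
            "with a subaortic gradient")
tokens = sentence.lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(bigrams), bigrams[:3])
print(len(trigrams), trigrams[:2])
```

A sentence of 12 tokens yields 11 bigrams and 10 trigrams.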
4 Contingency Tables
            Word 2    !Word 2
Word 1      n11       n12       n1p
!Word 1     n21       n22       n2p
            np1       np2       npp
n11 = the joint frequency of word 1 and word 2
n12 = the frequency with which word 1 occurs and word 2 does not
n21 = the frequency with which word 2 occurs and word 1 does not
n22 = the frequency with which neither word 1 nor word 2 occurs
npp = the total number of ngrams
n1p, np1, np2, n2p are the marginal counts
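5 Contingency Tables
Bigram counts for the example sentence:
her dobutamine 1, dobutamine stress 1, stress echo 1, echo showed 1, showed mild 1, mild aortic 1, aortic stenosis 1, stenosis with 1, with a 1, a subaortic 1, subaortic gradient 1
Contingency table for the bigram "stress echo":
            echo      !echo
stress      1         0         1
!stress     0         10        10
            1         10        11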
6 Contingency Tables
Expected Values
m11 = (np1 * n1p) / npp
m12 = (np2 * n1p) / npp
m21 = (np1 * n2p) / npp
m22 = (np2 * n2p) / npp
7 Contingency Tables
Expected values for the bigram "stress echo":
m11 = ( 1 * 1 ) / 11 = 0.09
m12 = (10 * 1 ) / 11 = 0.91
m21 = ( 1 * 10) / 11 = 0.91
m22 = (10 * 10) / 11 = 9.09
What is this telling you? "stress echo" occurs once in our example. The expected occurrence of "stress echo" if the two words are independent is 0.09 (m11).
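The expected values can be checked with a short sketch. The function name `expected_values` is hypothetical; the counts are those of the "stress echo" bigram in the example sentence (n11 = 1, n12 = 0, n21 = 0, n22 = 10).

```python
def expected_values(n11, n12, n21, n22):
    """Expected cell counts of a 2x2 contingency table under independence."""
    n1p, n2p = n11 + n12, n21 + n22   # row marginals
    np1, np2 = n11 + n21, n12 + n22   # column marginals
    npp = n1p + n2p                   # total number of bigrams
    return {
        "m11": np1 * n1p / npp,
        "m12": np2 * n1p / npp,
        "m21": np1 * n2p / npp,
        "m22": np2 * n2p / npp,
    }

m = expected_values(1, 0, 0, 10)
for cell, value in m.items():
    print(cell, round(value, 2))
```

The four expected values always sum to npp, which is a quick sanity check on the marginals.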
8 Ngram Statistics
Measures of Association
- Log Likelihood Ratio
- Chi Squared Test
- Odds Ratio
- Phi Coefficient
- T-Score
- Dice Coefficient
- True Mutual Information
9 Log Likelihood Ratio
Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) )
The log likelihood ratio measures the difference between the observed values and the expected values. It is twice the sum, over the four cells of the table, of each observed value multiplied by the log of the ratio of the observed value to the expected value.
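A minimal sketch of the log likelihood formula. The function name is hypothetical, the natural logarithm is assumed, and cells with an observed count of 0 are assumed to contribute 0 to the sum.

```python
import math

def log_likelihood(n11, n12, n21, n22):
    """Log likelihood ratio: 2 * sum(nij * log(nij / mij)) over the four cells."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    observed = [n11, n12, n21, n22]
    expected = [np1 * n1p / npp, np2 * n1p / npp,
                np1 * n2p / npp, np2 * n2p / npp]
    # Assumption: a cell with an observed count of 0 contributes 0.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

print(log_likelihood(1, 0, 0, 10))     # "stress echo" counts from the example
print(log_likelihood(10, 20, 30, 60))  # hypothetical table with observed == expected
```

When every observed count equals its expected count, each log term is 0 and the score is 0.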
10 Chi Squared Test
x2 = ∑ pow( (nij - mij), 2 ) / mij
The chi squared test also measures the difference between the observed values and the expected values. It is the sum, over the four cells, of the squared difference between the observed and expected values divided by the expected value.
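A minimal sketch of the chi squared formula, under the assumption that every marginal count is nonzero so that no expected value is 0. The function name is hypothetical.

```python
def chi_squared(n11, n12, n21, n22):
    """Chi squared: sum((nij - mij)^2 / mij) over the four cells."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    observed = [n11, n12, n21, n22]
    expected = [np1 * n1p / npp, np2 * n1p / npp,
                np1 * n2p / npp, np2 * n2p / npp]
    # Assumption: all marginals are nonzero, so no division by zero occurs.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(chi_squared(1, 0, 0, 10))     # "stress echo" counts from the example
print(chi_squared(10, 20, 30, 60))  # hypothetical table with observed == expected
```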
11 Odds Ratio
Odds Ratio = (n11 * n22) / (n21 * n12)
The odds ratio is the ratio of the number of times an event takes place to the number of times it does not take place. It is the cross product ratio of the 2x2 contingency table and measures the magnitude of association between two words.
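A minimal sketch of the odds ratio. Returning infinity when the denominator is 0 is an assumption made here for illustration; the slides do not say how that case is handled.

```python
import math

def odds_ratio(n11, n12, n21, n22):
    """Cross product ratio of the 2x2 table: (n11 * n22) / (n21 * n12)."""
    denominator = n21 * n12
    if denominator == 0:
        # Assumption: treat an empty off-diagonal cell as unbounded association.
        return math.inf
    return (n11 * n22) / denominator

print(odds_ratio(10, 20, 30, 60))  # hypothetical table with no association
print(odds_ratio(1, 0, 0, 10))     # "stress echo" counts from the example
```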
12 Phi Coefficient
Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt(np1 * n1p * n2p * np2)
The bigrams are considered positively associated if most of the data falls along the diagonal (meaning n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.
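A minimal sketch of the phi coefficient, assuming all four marginals are nonzero. The function name and the second and third tables are hypothetical.

```python
import math

def phi_coefficient(n11, n12, n21, n22):
    """Phi: (n11*n22 - n21*n12) / sqrt(np1 * n1p * n2p * np2)."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    # Assumption: no marginal is 0, so the square root is nonzero.
    return (n11 * n22 - n21 * n12) / math.sqrt(np1 * n1p * n2p * np2)

print(phi_coefficient(1, 0, 0, 10))     # all data on the diagonal
print(phi_coefficient(0, 5, 5, 0))      # all data off the diagonal
print(phi_coefficient(10, 20, 30, 60))  # no association
```

The three calls illustrate the positive, negative, and neutral cases described on the slide.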
13 T Score
T Score = ( n11 - m11 ) / sqrt( n11 )
The tscore determines whether there is some non random association between two words. It is the difference between the observed and expected joint frequencies divided by the square root of the observed joint frequency.
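A minimal sketch of the t score, assuming n11 is greater than 0. The function name is hypothetical.

```python
import math

def t_score(n11, n12, n21, n22):
    """T score: (n11 - m11) / sqrt(n11), where m11 = (np1 * n1p) / npp."""
    n1p = n11 + n12
    np1 = n11 + n21
    npp = n11 + n12 + n21 + n22
    m11 = np1 * n1p / npp
    # Assumption: the bigram was observed at least once (n11 > 0).
    return (n11 - m11) / math.sqrt(n11)

print(t_score(1, 0, 0, 10))  # "stress echo" counts from the example
```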
14 Dice Coefficient
Dice coefficient = 2 * n11 / (np1 + n1p)
The dice coefficient depends on the frequency of the events occurring together and their individual frequencies.
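A minimal sketch of the dice coefficient. The function name and the second table are hypothetical.

```python
def dice_coefficient(n11, n12, n21, n22):
    """Dice: 2 * n11 / (np1 + n1p)."""
    n1p = n11 + n12  # how often word 1 occurs in first position
    np1 = n11 + n21  # how often word 2 occurs in second position
    return 2 * n11 / (np1 + n1p)

print(dice_coefficient(1, 0, 0, 10))     # "stress echo": the words only occur together
print(dice_coefficient(10, 20, 30, 60))  # hypothetical, weaker association
```

The score depends only on n11 and the two marginals, so n22 has no effect on it.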
15 True Mutual Information
TMI = ∑ ( (nij / npp) * log( nij / mij ) )
True Mutual Information measures to what extent the observed frequencies differ from the expected frequencies.
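A minimal sketch of true mutual information as the sum over the four cells of (nij / npp) * log(nij / mij). The base 2 logarithm and the skipping of cells with a count of 0 are assumptions; the function name is hypothetical.

```python
import math

def true_mutual_information(n11, n12, n21, n22):
    """TMI: sum((nij / npp) * log2(nij / mij)) over the four cells."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    observed = [n11, n12, n21, n22]
    expected = [np1 * n1p / npp, np2 * n1p / npp,
                np1 * n2p / npp, np2 * n2p / npp]
    # Assumptions: log base 2, and a cell with a count of 0 contributes 0.
    return sum((o / npp) * math.log2(o / e)
               for o, e in zip(observed, expected) if o > 0)

print(true_mutual_information(1, 0, 0, 10))     # "stress echo" counts
print(true_mutual_information(10, 20, 30, 60))  # hypothetical, observed == expected
```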
16 Spelling Correction
- Use context sensitive information from bigrams to rank a given set of possible spelling corrections for a misspelled word.
- Given:
  - First content word prior to the misspelled word
  - First content word after the misspelled word
  - List of possible spelling corrections
17 Spelling Correction Example
- Example sentence:
  - Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient.
- List of possible corrections:
  - artic
  - aortic
- Statistical analysis:
  - Basic idea: her dobutamine stress echo showed mild POS stenosis with subaortic gradient, where POS marks the slot filled by each possible correction
18 Spelling Correction Statistics
Possible 1:
  mild artic         0.40
  artic stenosis     0.03
  Weighted average
Possible 2:
  mild aortic        0.66
  aortic stenosis    0.30
  Weighted average   0.46
This takes into account both the bigram formed with the word prior to the misspelling and the bigram formed with the word after it. Each possible word is then returned with its score.
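A minimal sketch of ranking the possible corrections by a weighted average of the two bigram scores. The slides do not state the weights, so the equal weighting used here is an assumption and its averages will not necessarily match the slide's figures. The function and variable names are hypothetical.

```python
def context_score(left_score, right_score, left_weight=0.5):
    """Weighted average of the (prior word, candidate) and
    (candidate, following word) bigram scores."""
    # Assumption: equal weights by default; the original weights are not given.
    return left_weight * left_score + (1 - left_weight) * right_score

# Bigram scores from the slide: (mild X, X stenosis).
candidates = {
    "artic": (0.40, 0.03),
    "aortic": (0.66, 0.30),
}

ranked = sorted(candidates,
                key=lambda word: context_score(*candidates[word]),
                reverse=True)
for word in ranked:
    print(word, round(context_score(*candidates[word]), 3))
```

With these scores "aortic" ranks first, since both of its bigrams score higher than the corresponding "artic" bigrams.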
19 Types of Results
- Gspell only
- Context sensitive only
- Hybrid of both Gspell and context
  - Takes the average of the Gspell and context sensitive scores
  - Note: this turns into a backoff method when no statistical data is found for any of the possibilities
- Backoff method
  - Uses only the context sensitive score unless it does not exist, then reverts to the Gspell score
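One possible reading of the hybrid and backoff methods, sketched under stated assumptions: a missing context sensitive score is represented by `None`, both scores are on a comparable scale, and the fallback is applied per candidate. The function names and the example scores are hypothetical.

```python
def hybrid_score(gspell_score, context_score):
    """Average the Gspell and context sensitive scores.
    With no context score available, only the Gspell score remains."""
    if context_score is None:
        return gspell_score
    return (gspell_score + context_score) / 2

def backoff_score(gspell_score, context_score):
    """Use the context sensitive score when it exists, else the Gspell score."""
    return gspell_score if context_score is None else context_score

# Hypothetical scores for one candidate correction.
print(hybrid_score(0.8, 0.4), backoff_score(0.8, 0.4))
print(hybrid_score(0.8, None), backoff_score(0.8, None))
```

When no context score exists, both functions return the Gspell score, which matches the note that the hybrid then behaves like the backoff method.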
20 Preliminary Test Set
- Test set: partially scrubbed clinical notes
- Size: 854 words
- Number of misspellings: 82
- Includes abbreviations
21 Preliminary Results
GSPELL results: [table of Precision, Recall and Fmeasure for GSPELL]
Context sensitive results: [table of Precision, Recall and Fmeasure for each measure of association: PHI, LL, TMI, ODDS, X2, TSCORE, DICE]
22 Preliminary Results
Hybrid method results: [table of Precision, Recall and Fmeasure for each measure of association: PHI, LL, TMI, ODDS, X2, TSCORE, DICE]
23 Notes on Log Likelihood
- Log Likelihood is used quite often in context sensitive spelling correction
- Problem with large sample sizes
  - The marginal values are very large due to the sample size
  - This increases the expected values, so the actual values are commonly much lower than the expected values
  - Very independent and very dependent ngrams end up with the same value
- Similar characteristics were noticed with True Mutual Information
24 Example of Problem
[Contingency table for the bigram "follow hip" (rows: follow, !follow; columns: hip, !hip), with a table relating n11 to Log Likelihood]
25 Conclusions with Preliminary Results
- Dice coefficient returns the best results
- Phi coefficient returns the second best
- Log Likelihood and True Mutual Information should not be used
- The program now needs to be tested with a more extensive test bed, which is in the process of being created
26 NGram Statistics for Multi Term Identification
- Cannot use the previous statistics package
  - Memory constraints due to the amount of data
  - Would like to look for longer ngrams
- Alternative: Suffix Arrays (Church and Yamamoto)
  - Reduces the amount of memory
  - Two arrays
    - One contains the corpus
    - One contains identifiers to the ngrams in the corpus
  - Two stacks
    - One contains the longest common prefix
    - One contains the document frequency
  - Allows ngrams up to the size of the corpus to be found
27 Suffix Arrays
Corpus: to be or not to be
Suffixes:
[0] to be or not to be
[1] be or not to be
[2] or not to be
[3] not to be
[4] to be
[5] be
Each array element is considered a suffix. An ngram runs from the start of a suffix toward the end of the array.
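28 Suffix Arrays
Suffixes in corpus order: to be or not to be; be or not to be; or not to be; not to be; to be; be
Actual suffix array (suffixes sorted alphabetically, each entry holding the corpus position where the suffix starts):
[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be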
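A minimal sketch that builds the word-level suffix array for "to be or not to be" by sorting the start positions of the suffixes. This naive construction copies each suffix while sorting, so it illustrates the data structure only and does not give the memory savings that motivate suffix arrays; the variable names are assumptions.

```python
corpus = "to be or not to be".split()

# Sort the start positions by the suffix that begins at each position.
# Naive sketch: slicing copies every suffix, which a real implementation avoids.
suffix_array = sorted(range(len(corpus)), key=lambda start: corpus[start:])

for rank, start in enumerate(suffix_array):
    print(f"[{rank}] = {start} => {' '.join(corpus[start:])}")
```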
29 Term Frequency
- Term frequency (tf) is the number of times an ngram occurs in the corpus
- To determine the tf of an ngram:
  - Sort the suffix array
  - tf = j - i + 1
    - i = index of the first suffix that begins with the ngram
    - j = index of the last suffix that begins with the ngram
  - Example: for "to be", i = 4 and j = 5, so tf = 2
[0] = 5 => be
[1] = 1 => be or not to be
[2] = 3 => not to be
[3] = 2 => or not to be
[4] = 4 => to be
[5] = 0 => to be or not to be
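A minimal sketch of computing tf = j - i + 1 with a binary search over the sorted suffix array. Materializing the list of truncated suffixes is done here only for clarity and is an assumption of this sketch; the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right

corpus = "to be or not to be".split()
suffix_array = sorted(range(len(corpus)), key=lambda start: corpus[start:])

def term_frequency(ngram):
    """Count suffixes that begin with ngram: tf = j - i + 1."""
    n = len(ngram)
    # Truncating sorted suffixes to n tokens keeps them in sorted order.
    prefixes = [tuple(corpus[start:start + n]) for start in suffix_array]
    i = bisect_left(prefixes, ngram)        # first suffix starting with ngram
    j = bisect_right(prefixes, ngram) - 1   # last suffix starting with ngram
    return j - i + 1

print(term_frequency(("to", "be")))
print(term_frequency(("be",)))
print(term_frequency(("or", "not")))
```

An ngram that never occurs gives j = i - 1, so its tf comes out as 0.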
30 Measures of Association
- Residual Inverse Document Frequency (RIDF)
  - RIDF = -log(df / D) + log(1 - exp(-tf / D)), where df is the document frequency, tf is the term frequency and D is the number of documents
  - Compares the distribution of a term over documents to what would be expected of a random term
- Mutual Information (MI)
  - MI(xYz) = log( (tf(xYz) * tf(Y)) / (tf(xY) * tf(Yz)) )
  - Compares the frequency of the whole to the frequency of the parts
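A minimal sketch of both measures. The base 2 logarithm is an assumption, the function names are hypothetical, and all counts in the usage lines are invented for illustration.

```python
import math

def ridf(tf, df, D):
    """RIDF = -log(df / D) + log(1 - exp(-tf / D)), using log base 2."""
    return -math.log2(df / D) + math.log2(1 - math.exp(-tf / D))

def mutual_information(tf_xYz, tf_Y, tf_xY, tf_Yz):
    """MI(xYz) = log((tf(xYz) * tf(Y)) / (tf(xY) * tf(Yz))), using log base 2."""
    return math.log2((tf_xYz * tf_Y) / (tf_xY * tf_Yz))

# Hypothetical counts: two terms found in 10 of 1000 documents.
print(ridf(tf=100, df=10, D=1000))  # occurrences concentrated in few documents
print(ridf(tf=10, df=10, D=1000))   # one occurrence per document

# Hypothetical counts for an ngram xYz and its parts.
print(mutual_information(tf_xYz=10, tf_Y=40, tf_xY=20, tf_Yz=20))
print(mutual_information(tf_xYz=20, tf_Y=40, tf_xY=20, tf_Yz=20))
```

A term whose occurrences cluster in few documents gets a higher RIDF than one spread evenly, and MI rises when the whole ngram is frequent relative to its parts.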
31 Present Work
- Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
  - Retrieved the respective text for each heading
  - Calculated the RIDF and MI of each possible ngram with a term frequency greater than 10 in the data under each section
- Noticed that different multi terms appear for each of the different sections
32 Conclusions
- Ngram statistics can be applied directly and indirectly to various problems
  - Directly
    - Spelling correction
    - Compound word identification
    - Term extraction
    - Name identification
  - Indirectly
    - Part of Speech tagging
    - Information Retrieval
    - Data Mining
33 Packages
- Two Statistical Packages
  - Contingency Table approach
    - Measures for bigrams
      - Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T Score, and Dice Coefficient
    - Measures for trigrams
      - Log Likelihood and True Mutual Information
  - Suffix Array approach
    - Measures for all lengths of ngrams
      - Residual Inverse Document Frequency and Mutual Information