1 Incorporating N-gram Statistics in the Normalization of Clinical Notes By Bridget Thomson McInnes.


1 Incorporating N-gram Statistics in the Normalization of Clinical Notes By Bridget Thomson McInnes

2 Overview
  Ngrams
  Ngram Statistics for Spelling Correction
  Spelling Correction
  Ngram Statistics for Multi Term Identification
  Multi Term Identification

3 Ngram
Her dobutamine stress echo showed mild aortic stenosis with a subaortic gradient.
Bigrams: her dobutamine, dobutamine stress, stress echo, echo showed, showed mild, mild aortic, aortic stenosis, stenosis with, with a, a subaortic, subaortic gradient
Trigrams: her dobutamine stress, dobutamine stress echo, stress echo showed, echo showed mild, showed mild aortic, mild aortic stenosis, aortic stenosis with, stenosis with a, with a subaortic, a subaortic gradient
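
The bigram and trigram lists on this slide can be reproduced with a sliding window over the tokens. This is a minimal sketch; the tokenization (lowercasing, dropping the final period, splitting on whitespace) is an assumption and not necessarily what the author's tools do.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Her dobutamine stress echo showed mild aortic stenosis "
            "with a subaortic gradient.")

# Assumed tokenization: lowercase, strip the final period, split on whitespace.
tokens = sentence.lower().rstrip(".").split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for gram in bigrams:
    print(" ".join(gram))
```

A sentence of k tokens yields k - n + 1 ngrams of length n, so longer ngrams are always rarer than shorter ones in the same text.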

4 Contingency Tables
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
n11 = the joint frequency of word 1 and word 2
n12 = the frequency with which word 1 occurs and word 2 does not
n21 = the frequency with which word 2 occurs and word 1 does not
n22 = the frequency with which neither word 1 nor word 2 occurs
npp = the total number of ngrams
n1p, np1, np2, n2p are the marginal counts

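5 Contingency Tables
Bigram counts in the example sentence:
  her dobutamine       1
  dobutamine stress    1
  stress echo          1
  echo showed          1
  showed mild          1
  mild aortic          1
  aortic stenosis      1
  stenosis with        1
  with a               1
  a subaortic          1
  subaortic gradient   1
Contingency table for "stress echo", filled in from the counts above:
              echo   !echo
  stress         1       0       1
  !stress        0      10      10
                 1      10      11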

6 Contingency Tables: Expected Values
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
Expected Values
  m11 = (np1 * n1p) / npp
  m12 = (np2 * n1p) / npp
  m21 = (np1 * n2p) / npp
  m22 = (np2 * n2p) / npp

7 Contingency Tables
Rows: stress, !stress. Columns: echo, !echo.
Expected Values
  m11 = ( 1 *  1) / 11 = 0.09
  m12 = (10 *  1) / 11 = 0.91
  m21 = ( 1 * 10) / 11 = 0.91
  m22 = (10 * 10) / 11 = 9.09
What is this telling you? 'stress echo' occurs once in our example. The expected number of occurrences of 'stress echo' if the two words were independent is 0.09 (m11).
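
The observed and expected counts for 'stress echo' can be recomputed from the example sentence. The sketch below is an illustration under stated assumptions: the helper names `contingency` and `expected` are hypothetical, and the tokenization is simple whitespace splitting of the lowercased sentence.

```python
from collections import Counter

tokens = ("her dobutamine stress echo showed mild aortic stenosis "
          "with a subaortic gradient").split()
bigrams = list(zip(tokens, tokens[1:]))

def contingency(bigrams, w1, w2):
    """Return (n11, n12, n21, n22) for the bigram (w1, w2)."""
    counts = Counter(bigrams)
    npp = len(bigrams)                                        # total bigrams
    n11 = counts[(w1, w2)]                                    # joint frequency
    n1p = sum(c for (a, _), c in counts.items() if a == w1)   # w1 in first slot
    np1 = sum(c for (_, b), c in counts.items() if b == w2)   # w2 in second slot
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n11 - n12 - n21
    return n11, n12, n21, n22

def expected(n11, n12, n21, n22):
    """Return (m11, m12, m21, m22), the counts expected under independence."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    return (np1 * n1p / npp, np2 * n1p / npp,
            np1 * n2p / npp, np2 * n2p / npp)

observed = contingency(bigrams, "stress", "echo")
m11, m12, m21, m22 = expected(*observed)
print(observed)
print(round(m11, 2), round(m12, 2), round(m21, 2), round(m22, 2))
```

The expected values always sum to npp, which is a quick sanity check on any table.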

8 Ngram Statistics
Measures of Association
  Log Likelihood Ratio
  Chi Squared Test
  Odds Ratio
  Phi Coefficient
  T-Score
  Dice Coefficient
  True Mutual Information

9 Log Likelihood Ratio
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
Log Likelihood = 2 * ∑ ( nij * log( nij / mij ) )
The log likelihood ratio measures the difference between the observed values and the expected values. It is twice the sum, over the four cells, of each observed value multiplied by the log of the ratio of the observed value to the expected value.
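
A minimal sketch of the log likelihood formula follows. Two choices here are assumptions rather than something the slide states: the natural logarithm is used, and a cell with an observed count of zero contributes nothing to the sum (the usual convention, since n * log(n) tends to 0 as n tends to 0).

```python
import math

def expected(n11, n12, n21, n22):
    """Counts expected under independence: mij = (column total * row total) / npp."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    return (np1 * n1p / npp, np2 * n1p / npp,
            np1 * n2p / npp, np2 * n2p / npp)

def log_likelihood(n11, n12, n21, n22):
    """2 * sum(nij * log(nij / mij)), skipping cells where nij is zero."""
    observed = (n11, n12, n21, n22)
    total = 0.0
    for n, m in zip(observed, expected(*observed)):
        if n > 0:                      # assumed convention: 0 * log(0) = 0
            total += n * math.log(n / m)
    return 2 * total

# Table for "stress echo" in the example sentence: n11=1, n12=0, n21=0, n22=10.
ll_stress_echo = log_likelihood(1, 0, 0, 10)
print(ll_stress_echo)
```

A table whose observed counts equal the expected counts scores exactly 0, so larger values indicate a stronger departure from independence.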

10 Chi Squared Test
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
x2 = ∑ pow( (nij - mij), 2 ) / mij
The chi squared test also measures the difference between the observed values and the expected values. It is the sum, over the four cells, of the squared difference between the observed and expected values divided by the expected value.
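
The chi squared formula translates almost directly into code. This sketch assumes every marginal count is non-zero, so that no expected value is zero; the function names are illustrative.

```python
def expected(n11, n12, n21, n22):
    """Counts expected under independence: mij = (column total * row total) / npp."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    return (np1 * n1p / npp, np2 * n1p / npp,
            np1 * n2p / npp, np2 * n2p / npp)

def chi_squared(n11, n12, n21, n22):
    """Sum over the four cells of (nij - mij)^2 / mij."""
    observed = (n11, n12, n21, n22)
    return sum((n - m) ** 2 / m
               for n, m in zip(observed, expected(*observed)))

# Table for "stress echo" in the example sentence: n11=1, n12=0, n21=0, n22=10.
x2_stress_echo = chi_squared(1, 0, 0, 10)
print(x2_stress_echo)
```

For a 2x2 table this statistic equals npp times the square of the phi coefficient, which is why a perfectly diagonal table such as the one above scores npp.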

11 Odds Ratio
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
Odds Ratio = (n11 * n22) / (n21 * n12)
The odds ratio is the ratio of the total number of times an event takes place to the total number of times that it does not take place. It is the cross product ratio of the 2x2 contingency table and measures the magnitude of association between two words.

12 Phi Coefficient
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
Phi = ( (n11 * n22) - (n21 * n12) ) / sqrt( np1 * n1p * n2p * np2 )
The bigrams are considered positively associated if most of the data falls along the diagonal (meaning n11 and n22 are larger than n12 and n21) and negatively associated if most of the data falls off the diagonal.

13 T Score
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
T Score = ( n11 - m11 ) / sqrt( n11 )
The t-score determines whether there is some non-random association between two words. It is the difference between the observed and expected joint frequencies divided by the square root of the observed joint frequency.

14 Dice Coefficient
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
Dice coefficient = 2 * n11 / (np1 + n1p)
The Dice coefficient depends on the frequency of the events occurring together and their individual frequencies.
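
The odds ratio, phi coefficient, t-score and Dice coefficient are all closed-form functions of the four cell counts, so they can be sketched together. The counts in `table` are made up for illustration only. Returning infinity when the odds ratio's denominator is zero is an assumption of this sketch, not something the slides specify, and `t_score` requires n11 > 0.

```python
import math

def odds_ratio(n11, n12, n21, n22):
    """Cross product ratio (n11 * n22) / (n21 * n12)."""
    denominator = n21 * n12
    # Assumption: report infinity instead of failing when a cell is zero.
    return math.inf if denominator == 0 else (n11 * n22) / denominator

def phi(n11, n12, n21, n22):
    """(n11*n22 - n21*n12) / sqrt(np1 * n1p * n2p * np2)."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    return (n11 * n22 - n21 * n12) / math.sqrt(np1 * n1p * n2p * np2)

def t_score(n11, n12, n21, n22):
    """(n11 - m11) / sqrt(n11), with m11 the expected joint frequency."""
    n1p, np1 = n11 + n12, n11 + n21
    npp = n11 + n12 + n21 + n22
    m11 = np1 * n1p / npp
    return (n11 - m11) / math.sqrt(n11)

def dice(n11, n12, n21, n22):
    """2 * n11 / (np1 + n1p)."""
    n1p, np1 = n11 + n12, n11 + n21
    return 2 * n11 / (np1 + n1p)

# Hypothetical counts, chosen only so that no cell is zero.
table = (10, 20, 30, 940)
for measure in (odds_ratio, phi, t_score, dice):
    print(measure.__name__, round(measure(*table), 4))
```

Unlike the log likelihood and chi squared scores, phi and Dice are bounded (phi between -1 and 1, Dice between 0 and 1), so they do not grow with the size of the corpus.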

15 True Mutual Information
             Word 2   !Word 2
  Word 1      n11      n12      n1p
  !Word 1     n21      n22      n2p
              np1      np2      npp
TMI = ∑ ( (nij / npp) * log( nij / mij ) )
True Mutual Information measures to what extent the observed frequencies differ from the expected frequencies.
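
A sketch of True Mutual Information, reading the formula as a sum over all four cells of (nij / npp) * log(nij / mij). The base-2 logarithm and the convention that empty cells contribute nothing are assumptions of this sketch.

```python
import math

def expected(n11, n12, n21, n22):
    """Counts expected under independence: mij = (column total * row total) / npp."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    return (np1 * n1p / npp, np2 * n1p / npp,
            np1 * n2p / npp, np2 * n2p / npp)

def true_mutual_information(n11, n12, n21, n22):
    """Sum over cells of (nij / npp) * log2(nij / mij), skipping empty cells."""
    observed = (n11, n12, n21, n22)
    npp = sum(observed)
    return sum((n / npp) * math.log2(n / m)
               for n, m in zip(observed, expected(*observed))
               if n > 0)               # assumed convention: 0 * log(0) = 0

# Table for "stress echo" in the example sentence: n11=1, n12=0, n21=0, n22=10.
tmi_stress_echo = true_mutual_information(1, 0, 0, 10)
print(tmi_stress_echo)
```

Written this way, TMI differs from the log likelihood ratio only by a factor of 2 * npp and the choice of log base, so the two measures order the same tables in closely related ways.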

16 Spelling Correction
Using context sensitive information through the bigrams to determine the ranking of a given set of possible spelling corrections for a misspelled word.
Given:
  First content word prior to the misspelled word
  First content word after the misspelled word
  List of possible spelling corrections

17 Spelling Correction Example
Example Sentence:
  Her dobutamine stress echo showed mild aurtic stenosis with a subaortic gradient.
List of Possible Corrections:
  artic
  aortic
Statistical Analysis:
  Basic Idea
  her dobutamine stress echo showed mild POS stenosis with subaortic gradient

18 Spelling Correction Statistics
Possible 1:
  mild artic         0.40
  artic stenosis     0.03
  Weighted average
Possible 2:
  mild aortic        0.66
  aortic stenosis    0.30
  Weighted average   0.46
This allows us to take into consideration both the bigram formed with the word prior to the misspelling and the bigram formed with the word after it.
Each possible word is then returned with its score.

19 Types of Results
Gspell only
Context sensitive only
Hybrid of both Gspell and Context
  Taking the average of the Gspell and context sensitive scores
  Note: this turns into a backoff method when no statistical data is found for any of the possibilities
Backoff method
  Use only the context sensitive score unless it does not exist, in which case revert to the Gspell score
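
The hybrid and backoff strategies can be sketched as two scoring functions plugged into a common ranking step. Everything concrete here is hypothetical: the scores in `candidates` are invented, Gspell and context scores are assumed to lie on comparable scales, and a missing context score is represented by `None`. Falling back to the Gspell score per candidate in the hybrid method is also an assumption.

```python
def hybrid_score(gspell_score, context_score):
    """Average of both scores; Gspell alone if there is no context score."""
    if context_score is None:
        return gspell_score
    return (gspell_score + context_score) / 2

def backoff_score(gspell_score, context_score):
    """Context sensitive score if it exists, otherwise the Gspell score."""
    return gspell_score if context_score is None else context_score

def rank(candidates, score):
    """Order candidate corrections from best to worst under `score`."""
    return sorted(candidates, key=lambda word: score(*candidates[word]),
                  reverse=True)

# Hypothetical (gspell_score, context_score) pairs for the misspelling "aurtic".
candidates = {"artic": (0.90, 0.20), "aortic": (0.80, 0.46)}
print(rank(candidates, hybrid_score))
print(rank(candidates, backoff_score))

# With no bigram statistics at all, both methods reduce to the Gspell ranking.
no_context = {"artic": (0.90, None), "aortic": (0.80, None)}
print(rank(no_context, hybrid_score))
```

Keeping the scoring function separate from the ranking step makes it easy to compare all three result types on the same candidate list.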

20 Preliminary Test Set
Test set: partially scrubbed clinical notes
Size: 854 words
Number of misspellings: 82
Includes abbreviations

21 Preliminary Results
GSPELL Results:
  Columns: Precision, Recall, Fmeasure
  Rows: GSPELL
Context Sensitive Results:
  Columns: Measure of Association, Precision, Recall, Fmeasure
  Rows: PHI, LL, TMI, ODDS, X2, TSCORE, DICE

22 Preliminary Results
Hybrid Method Results:
  Columns: Measure of Association, Precision, Recall, Fmeasure
  Rows: PHI, LL, TMI, ODDS, X2, TSCORE, DICE

23 Notes on Log Likelihood
Log Likelihood is used quite often with context sensitive spelling correction
Problem with large sample sizes
  The marginal values are very large due to the sample size
  This increases the expected values, so the actual values are commonly much lower than the expected values
  Very independent and very dependent ngrams end up with the same value
Noticed similar characteristics with True Mutual Information

24 Example of Problem
Contingency table with rows follow, !follow and columns hip, !hip
Table columns: n, n11, Log Likelihood

25 Conclusions with Preliminary Results
Dice coefficient returns the best results
Phi coefficient returns the second best
Log Likelihood and True Mutual Information should not be used
Need to now test the program with a more extensive test bed, which is in the process of being created

26 NGram Statistics for Multi Term Identification
Cannot use previous statistics package
  Memory constraints due to the amount of data
  Would like to look for longer ngrams
Alternative: Suffix Arrays (Church and Yamamoto)
  Reduces the amount of memory
  Two Arrays
    One contains the corpus
    One contains identifiers to the ngrams in the corpus
  Two Stacks
    One contains the longest common prefix
    One contains the document frequency
  Allows for ngrams up to the size of the corpus to be found

27 Suffix Arrays
Corpus: to be or not to be
Corpus array: to | be | or | not | to | be
Suffixes:
  to be or not to be
  be or not to be
  or not to be
  not to be
  to be
  be
Each array element is considered the start of a suffix.
A suffix is the ngram that runs from that element until the end of the array.

28 Suffix Arrays
Suffixes:
  to be or not to be
  be or not to be
  or not to be
  not to be
  to be
  be
Actual Suffix Array (sorted):
  [0] = 5 => be
  [1] = 1 => be or not to be
  [2] = 3 => not to be
  [3] = 2 => or not to be
  [4] = 4 => to be
  [5] = 0 => to be or not to be
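
The suffix array on this slide can be reproduced by sorting the starting positions of the suffixes by the token sequence that begins at each position. This is a naive sketch for illustration; it compares whole suffixes and is not the memory-efficient construction that the suffix array approach relies on for large corpora.

```python
tokens = "to be or not to be".split()

# Each entry is the corpus position where a suffix starts; positions are
# ordered by comparing the token lists tokens[i:] lexicographically.
suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])

for rank, start in enumerate(suffix_array):
    print(f"[{rank}] = {start} => {' '.join(tokens[start:])}")
```

Only the integer positions are stored; the suffix text is recovered on demand from the corpus array, which is what keeps the structure small.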

29 Term Frequency
Term frequency (tf) is the number of times an ngram occurs in the corpus
To determine the tf of an ngram:
  Sort the suffix array
  tf = j - i + 1
    i = index of the first suffix that starts with the ngram
    j = index of the last suffix that starts with the ngram
  [0] = 5 => be
  [1] = 1 => be or not to be
  [2] = 3 => not to be
  [3] = 2 => or not to be
  [4] = 4 => to be
  [5] = 0 => to be or not to be
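
Because the suffixes are sorted, all suffixes that begin with a given ngram sit next to each other, so tf = j - i + 1 with i the first and j the last matching index. The sketch below finds both indices by binary search. The function name is illustrative, and materializing the truncated keys is a simplification that a memory-conscious implementation would avoid.

```python
from bisect import bisect_left, bisect_right

tokens = "to be or not to be".split()
suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])

def term_frequency(tokens, suffix_array, ngram):
    """Count occurrences of `ngram` (a tuple of tokens) via tf = j - i + 1."""
    n = len(ngram)
    # Truncating every sorted suffix to n tokens keeps the list sorted.
    keys = [tuple(tokens[start:start + n]) for start in suffix_array]
    i = bisect_left(keys, ngram)         # first suffix starting with the ngram
    j = bisect_right(keys, ngram) - 1    # last suffix starting with the ngram
    return j - i + 1                     # 0 when the ngram never occurs

print(term_frequency(tokens, suffix_array, ("to", "be")))
print(term_frequency(tokens, suffix_array, ("or", "not")))
```

For "to be" the matching suffixes are at indices 4 and 5 of the sorted array, giving tf = 5 - 4 + 1 = 2.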

30 Measures of Association
Residual Inverse Document Frequency (RIDF)
  RIDF = - log( df / D ) + log( 1 - exp( -tf / D ) )
  Compares the distribution of a term over documents to what would be expected by a random term
Mutual Information (MI)
  MI(xYz) = log( ( tf(xYz) * tf(Y) ) / ( tf(xY) * tf(Yz) ) )
  Compares the frequency of the whole to the frequency of the parts
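
Both measures need only a handful of counts, as the sketch below shows. The base-2 logarithm is an assumption, and all counts in the example calls are hypothetical. Here tf is the term frequency, df the number of documents containing the ngram, and D the total number of documents.

```python
import math

def ridf(tf, df, D):
    """Observed IDF minus the IDF predicted for a randomly scattered term."""
    observed_idf = -math.log2(df / D)
    predicted_idf = -math.log2(1 - math.exp(-tf / D))
    return observed_idf - predicted_idf

def mutual_information(tf_xYz, tf_Y, tf_xY, tf_Yz):
    """log2 of (tf(xYz) * tf(Y)) / (tf(xY) * tf(Yz))."""
    return math.log2((tf_xYz * tf_Y) / (tf_xY * tf_Yz))

# Hypothetical ngram: 20 occurrences concentrated in 5 of 100 documents.
print(ridf(tf=20, df=5, D=100))

# Hypothetical ngram xYz that occurs every time xY or Yz occurs.
print(mutual_information(tf_xYz=10, tf_Y=100, tf_xY=10, tf_Yz=10))
```

A term whose occurrences cluster in few documents has a df lower than chance would predict, which gives a positive RIDF; MI is high when the whole ngram is nearly as frequent as its overlapping parts.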

31 Present Work
Calculated the MI and RIDF for the clinical notes for each of the possible sections: CC, CM, IP, HPI, PSH, SH and DX
  Retrieved the respective text for each heading
  Calculated the RIDF and MI for each possible ngram with a term frequency greater than 10 in the data under each section
Noticed that different multi terms appear for each of the different sections

32 Conclusions
Ngram statistics can be applied directly and indirectly to various problems
Directly
  Spelling correction
  Compound word identification
  Term extraction
  Name identification
Indirectly
  Part of Speech tagging
  Information Retrieval
  Data Mining

33 Packages
Two Statistical Packages
Contingency Table approach
  Measures for bigrams
    Log Likelihood, True Mutual Information, Chi Squared Test, Odds Ratio, Phi Coefficient, T Score, and Dice Coefficient
  Measures for trigrams
    Log Likelihood and True Mutual Information
Suffix Array approach
  Measures for all lengths of ngrams
    Residual Inverse Document Frequency and Mutual Information