Determining the Syntactic Structure of Medical Terms in Clinical Notes. Bridget T. McInnes¹, Ted Pedersen², and Serguei V. Pakhomov¹. University of Minnesota¹, University of Minnesota Duluth².

Syntactic Structure of Terms (w1 w2 w3): Monolithic, Non-branching, Right-branching, Left-branching; diagram legend: black = independence, green = dependence. Example terms: difficulty finding words, serum dioxin level, urinary tract infection, low back pain.

Goal: A simple but effective approach to identifying the syntactic structure of three-word medical terms.

Motivation: potentially improve the analysis of unrestricted medical text
- unsupervised syntactic parsing
- mapping of medical terms to standardized terminologies

Related Work
- Previously: Resnik, 1993; Resnik and Hearst, 1993; Pustejovsky, Anick and Bergler, 1993; Lauer, 1995
- Currently: Lapata and Keller, 2004; Nakov and Hearst, 2005
- Medical domain: Nakov and Hearst, 2005

Example: small bowel obstruction

Syntactic Structure of "small bowel obstruction": Monolithic, Non-branching, Right-branching, or Left-branching?

Method used to determine the structure of a term: the Log Likelihood Ratio, which compares the observed probability of a term occurring with the probability it would be expected to occur under a given model.

Log Likelihood Ratio: The expected probability of a term is often based on the Non-branching (Independence) Model. Observed probability: P(small bowel obstruction). Expected probability: P(small) P(bowel) P(obstruction).

Extended Log Likelihood Ratio: The expected probabilities can also be calculated using two other models.
- Non-branching: P(small) P(bowel) P(obstruction)
- Left-branching: P(small bowel) P(obstruction)
- Right-branching: P(small) P(bowel obstruction)

Three Log Likelihood Ratio Equations (observed probability over expected probability):
- Non-branching: P(small bowel obstruction) / [P(small) P(bowel) P(obstruction)]
- Left-branching: P(small bowel obstruction) / [P(small bowel) P(obstruction)]
- Right-branching: P(small bowel obstruction) / [P(small) P(bowel obstruction)]
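To make the three equations concrete, here is a minimal Python sketch, assuming maximum-likelihood probability estimates from raw n-gram counts; it is an illustration, not the authors' implementation, and the helper names (mle, expected_probability, log_ratio) are invented for this example.

```python
import math

def mle(count, total):
    """Maximum-likelihood estimate of an n-gram probability."""
    return count / total

def expected_probability(model, p_w1, p_w2, p_w3, p_w1w2, p_w2w3):
    """Expected probability of a three-word term under one structural model."""
    if model == "non-branching":       # P(w1) P(w2) P(w3)
        return p_w1 * p_w2 * p_w3
    if model == "left-branching":      # P(w1 w2) P(w3)
        return p_w1w2 * p_w3
    if model == "right-branching":     # P(w1) P(w2 w3)
        return p_w1 * p_w2w3
    raise ValueError("unknown model: %s" % model)

def log_ratio(p_observed, p_expected):
    """Log of the observed-to-expected probability ratio for one model."""
    return math.log(p_observed / p_expected)
```

The model whose expected probability comes closest to the observed probability of the full term yields the smallest ratio, which is the intuition behind choosing the model with the lowest score.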

Expected Probability: The expected probability of a term differs from model to model, and so does the Log Likelihood Ratio.
- Non-branching: P(small) P(bowel) P(obstruction), LL = 11,…
- Left-branching: P(small bowel) P(obstruction), LL = 5,169.81
- Right-branching: P(small) P(bowel obstruction), LL = 8,532.90

Model Fitting: The model with the lowest Log Likelihood Ratio is taken to best describe the underlying structure of the term.
- Non-branching: P(small) P(bowel) P(obstruction), LL = 11,…
- Left-branching: P(small bowel) P(obstruction), LL = 5,169.81 (lowest)
- Right-branching: P(small) P(bowel obstruction), LL = 8,532.90

Recap (see the sketch below):
- The Log Likelihood Ratio is calculated for each possible model: Non-branching, Left-branching, Right-branching.
- The probabilities for each model are calculated using frequency counts from a corpus.
- The term is assigned the structure whose model has the lowest Log Likelihood Ratio.
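A minimal sketch of that selection step, assuming the per-model scores have already been computed. The scores below reuse the "small bowel obstruction" values shown above; the Non-branching value is a stand-in, since only "11,…" appears on the slide, and the code is illustrative rather than the authors' implementation.

```python
def assign_structure(ll_scores):
    """Return the structural model with the lowest Log Likelihood Ratio."""
    return min(ll_scores, key=ll_scores.get)

# Scores for "small bowel obstruction"; the non-branching value is a
# placeholder, known only to begin with 11,...
scores = {
    "non-branching": 11000.00,
    "left-branching": 5169.81,
    "right-branching": 8532.90,
}

print(assign_structure(scores))  # -> left-branching
```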

Test Set: 708 three-word terms from SNOMED-CT.
Monolithic: 73 terms; Non-branching: 6 terms; Right-branching: 378 terms; Left-branching: 251 terms.

Test Set
- Syntactic structure determined by two medical text indexers (Kappa = …)
- Frequency counts obtained from over 10,000 clinical notes from the Mayo Clinic

Results with Monolithic Terms: [chart: percentage agreement with human experts, by technique]

Results without Monolithic Terms: [chart: percentage agreement with human experts, by technique]

Limitations
- Does not identify Monolithic terms; possible remedies include collocation extraction or dictionary lookup.
- As the number of words in a term grows, so does the number of models; the length of terms is limited to five words.

Conclusions
- A simple but effective method for identifying the syntactic structure of three-word medical terms.
- The method uses the Log Likelihood Ratio.
- Easily extended to four- and five-word terms.

Future Work
- Improve accuracy: explore other measures of association (Dice coefficient, phi, ...) and incorporate multiple measures (see the sketch below).
- Extend the method to four- and five-word terms.
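As a point of reference for the measures named above, a minimal sketch of the Dice coefficient for a word pair, assuming unigram and bigram counts are available; this is illustrative only, and the phi coefficient and any multi-measure combination are left out.

```python
def dice(n_w1w2, n_w1, n_w2):
    """Dice coefficient for a word pair: 2 * n(w1 w2) / (n(w1) + n(w2))."""
    return 2 * n_w1w2 / (n_w1 + n_w2)
```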

Thank you.
Software: Ngram Statistics Package (NSP)

Log Likelihood Ratio Models

Log Likelihood Equation: LL = 2 * Σ_xyz [ n_xyz * log(n_xyz / m_xyz) ], where n_xyz is the observed count for a cell of the trigram contingency table and m_xyz is the expected count under the model.

Expected Values
- Non-branching: m_xyz = (n_x++ * n_+y+ * n_++z) / n_+++^2
- Left-branching: m_xyz = (n_xy+ * n_++z) / n_+++
- Right-branching: m_xyz = (n_x++ * n_+yz) / n_+++
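For completeness, a sketch of these expected-value formulas as code, using the marginal-count notation above (n_x++, n_+y+, and so on, with n_+++ the total number of trigrams in the corpus); this is an assumed rendering for illustration, not the NSP source.

```python
import math

def expected_counts(n_x, n_y, n_z, n_xy, n_yz, n_total):
    """Expected trigram counts m_xyz under each structural model.

    n_x     : marginal count n_x++ (w1 in the first position)
    n_y     : marginal count n_+y+ (w2 in the second position)
    n_z     : marginal count n_++z (w3 in the third position)
    n_xy    : marginal count n_xy+ (bigram w1 w2 in the first two positions)
    n_yz    : marginal count n_+yz (bigram w2 w3 in the last two positions)
    n_total : total number of trigrams in the corpus, n_+++
    """
    return {
        "non-branching": n_x * n_y * n_z / n_total ** 2,
        "left-branching": n_xy * n_z / n_total,
        "right-branching": n_x * n_yz / n_total,
    }

def ll_contribution(n_xyz, m_xyz):
    """Contribution of one contingency-table cell to LL = 2 * sum(n * log(n / m))."""
    return 2 * n_xyz * math.log(n_xyz / m_xyz)
```

The full statistic sums such contributions over every cell of the trigram's contingency table, not just the cell for the observed term.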