Andreea Bodnari¹, Peter Szolovits¹, Ozlem Uzuner²
¹ MIT, CSAIL, Cambridge, MA, USA
² Department of Information Studies, University at Albany SUNY, Albany, NY, USA
MCORES: a system for noun phrase coreference resolution for clinical records
2012 SHARPn Summit “Secondary Use”, Rochester, MN

Medical coreference resolution system (MCORES)
Experimental results
Conclusion

Electronic Medical Records (EMRs) – large information repositories
Clinical information requires processing
  ▪ Lower level: sentence parsing, tokenization
  ▪ Higher level: coreference resolution, semantic disambiguation
Coreference resolution: a fundamental step in text processing

English medical corpus provided by i2b2 National Center for Biomedical Computing
De-identified medical discharge summaries
  ▪ Source: PH & BIDMC
  ▪ Content: 230 (PH) + 196 (BIDMC) discharge summaries
Annotated concepts and coreference chains
Concept types
  ▪ Persons
  ▪ Problems
  ▪ Treatments
  ▪ Tests
  ▪ Pronouns

NP → Instance Creation → Feature Generation → Classification → Clustering → Output

Markables of the same semantic category are paired together
MCORES creates positive instances only from neighboring markable pairs in a chain¹
¹ Instance creation akin to McCarthy and Lehnert

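The instance-creation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the MCORES implementation: the `Markable` record, the `make_instances` function, and the rule for negative instances (same-category pairs that are not in the same chain) are hypothetical. The slides only state that markables are paired within a semantic category and that positive instances come from neighboring pairs in a chain.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Markable:
    """Hypothetical record for an annotated noun phrase."""
    position: int            # order of appearance in the document (unique)
    text: str
    category: str            # e.g. "problem", "treatment", "test", "person"
    chain_id: Optional[int]  # gold coreference chain, None for singletons


def make_instances(markables: List[Markable]) -> List[Tuple[Markable, Markable, bool]]:
    """Pair markables of the same category and label each pair.

    Positive: the two markables are neighbors in the same chain.
    Negative (assumption): the two markables are not in the same chain.
    Non-neighboring pairs from the same chain are skipped, so positives
    come only from neighboring pairs.
    """
    ordered = sorted(markables, key=lambda m: m.position)

    # Record which (antecedent, anaphor) positions are neighbors in a chain.
    last_in_chain = {}
    neighbor_pairs = set()
    for m in ordered:
        if m.chain_id is None:
            continue
        if m.chain_id in last_in_chain:
            neighbor_pairs.add((last_in_chain[m.chain_id].position, m.position))
        last_in_chain[m.chain_id] = m

    instances = []
    for antecedent, anaphor in combinations(ordered, 2):
        if antecedent.category != anaphor.category:
            continue  # only same-category markables are paired
        same_chain = (antecedent.chain_id is not None
                      and antecedent.chain_id == anaphor.chain_id)
        if (antecedent.position, anaphor.position) in neighbor_pairs:
            instances.append((antecedent, anaphor, True))
        elif not same_chain:
            instances.append((antecedent, anaphor, False))
    return instances
```

With a chain of three mentions, only the first-second and second-third pairs become positive instances; the first-third pair is left out rather than labeled.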
Table 3: Distribution of coreferent and non-coreferent instances per semantic category over instances containing exact, partial, and no textual overlap.

Multi-perspective features
  ▪ Antecedent perspective
  ▪ Anaphor perspective
  ▪ Greedy perspective
  ▪ Stingy perspective

Phrase-level lexical
Sentence-level lexical
Syntactic
Semantic
Miscellaneous

Phrase-level lexical
  ▪ Token overlap*
  ▪ Normalized token overlap
  ▪ Edit distance
  ▪ Normalized edit distance
Sentence-level lexical
  ▪ Sentence-level token overlap*
  ▪ Filtered sentence-level token overlap*
  ▪ Left and right mention overlap (stingy and greedy perspectives only)
* multi-perspective feature

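The phrase-level lexical features can be illustrated with a short sketch. The slides name the features and the four perspectives but do not define them, so everything below is an assumed reading: whitespace tokenization, antecedent and anaphor perspectives normalizing the shared-token count by that mention's own length, greedy taking the larger of the two values and stingy the smaller, and edit distance normalized by the longer string. All function names are hypothetical.

```python
from typing import Dict, List


def tokenize(phrase: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return phrase.lower().split()


def token_overlap(antecedent: str, anaphor: str) -> Dict[str, float]:
    """Shared-token count normalized from four perspectives.

    Assumed reading: antecedent/anaphor normalize by that mention's own
    token count, greedy takes the larger value, stingy the smaller.
    """
    a, b = set(tokenize(antecedent)), set(tokenize(anaphor))
    if not a or not b:
        return {"antecedent": 0.0, "anaphor": 0.0, "greedy": 0.0, "stingy": 0.0}
    shared = len(a & b)
    from_antecedent = shared / len(a)
    from_anaphor = shared / len(b)
    return {
        "antecedent": from_antecedent,
        "anaphor": from_anaphor,
        "greedy": max(from_antecedent, from_anaphor),
        "stingy": min(from_antecedent, from_anaphor),
    }


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(s: str, t: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0
```

For the pair "chest pain" / "the chest pain", both tokens of the antecedent are shared, so the antecedent perspective gives 1.0 while the anaphor perspective gives 2/3.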
Syntactic
  ▪ Number agreement
  ▪ Noun overlap*
  ▪ Surname match
Semantic
  ▪ UMLS CUI overlap*
  ▪ UMLS CUI token overlap*
  ▪ UMLS semantic type overlap*
  ▪ Anaphor UMLS semantic type
* multi-perspective feature

Miscellaneous
  ▪ Token distance
  ▪ Mention distance
  ▪ All-mention distance
  ▪ Sentence distance
  ▪ Section match
  ▪ Section distance

C4.5 decision tree algorithm
  ▪ Flexible
  ▪ Readable prediction model
  ▪ Classifies pairs of markables based on the values of their feature vectors

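C4.5 chooses each split by gain ratio, that is, information gain divided by split information. The sketch below computes that criterion for a discrete feature over labeled markable pairs. It is an illustration of the split criterion only, not a full C4.5 implementation (no threshold search for continuous features, no pruning), and the function names and toy feature values in the usage note are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(feature_values: List[Hashable], labels: List[bool]) -> float:
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)

    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(feature_values)
    if split_information == 0:
        return 0.0  # the feature takes a single value and cannot split
    return information_gain / split_information
```

On four pairs labeled `[True, True, False, False]`, a feature with values `[1, 1, 0, 0]` separates the classes perfectly and scores 1.0, while a feature with values `[1, 0, 1, 0]` carries no information and scores 0.0.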
Classifier makes pairwise predictions only
Pairwise predictions are clustered into coreference chains
Aggressive-merge¹ clustering algorithm
  ▪ A positive prediction [M1] - [M2] is merged with all preceding pairwise predictions linked to [M1] or [M2]
¹ Aggressive-merge algorithm proposed by McCarthy and Lehnert

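One way to realize this kind of merging is a union-find pass over the positive pairwise predictions, so that any two markables connected through a sequence of positive pairs end up in the same chain. This is a sketch of that reading, not the authors' code; the function name `aggressive_merge` and the input format (a list of positively classified pairs) are assumptions.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def aggressive_merge(
    positive_pairs: Iterable[Tuple[Hashable, Hashable]]
) -> List[List[Hashable]]:
    """Merge positively classified pairs into chains (transitive closure)."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in positive_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # link the two partial chains

    # Group every markable seen in a positive pair under its root.
    chains: Dict[Hashable, List[Hashable]] = {}
    for markable in parent:
        chains.setdefault(find(markable), []).append(markable)
    return list(chains.values())
```

Markables that never appear in a positive pair are not returned by this sketch; a caller would treat them as singletons.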
Feature set evaluation
Perspectives evaluation
Performance evaluation against
  ▪ In-house baseline
  ▪ Third-party systems (RECONCILE ACL09 & BART)
Evaluation metric: unweighted averages of recall, precision, and F-measure over
  ▪ MUC
  ▪ B³
  ▪ CEAF
  ▪ BLANC

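As an illustration of the evaluation setup, the sketch below computes B³ (one of the four metrics listed) and then averages recall, precision, and F-measure triples with equal weight. It assumes that key and response chains cover the same set of mentions, with singletons written as one-element chains, and it reads "unweighted average" as averaging across the metrics. The function names are hypothetical, and MUC, CEAF, and BLANC are not implemented here.

```python
from typing import Dict, Hashable, List, Set, Tuple

Chain = Set[Hashable]


def _mention_to_chain(chains: List[Chain]) -> Dict[Hashable, Chain]:
    """Map every mention to the chain that contains it."""
    return {mention: chain for chain in chains for mention in chain}


def b_cubed(key: List[Chain], response: List[Chain]) -> Tuple[float, float, float]:
    """B-cubed recall, precision, and F-measure.

    Assumes key and response cover the same mentions, with singletons
    written as one-element chains.
    """
    key_of = _mention_to_chain(key)
    response_of = _mention_to_chain(response)
    mentions = list(key_of)

    # Per-mention overlap between its key chain and its response chain.
    recall = sum(len(key_of[m] & response_of[m]) / len(key_of[m])
                 for m in mentions) / len(mentions)
    precision = sum(len(key_of[m] & response_of[m]) / len(response_of[m])
                    for m in mentions) / len(mentions)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure


def unweighted_average(scores: List[Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Average (recall, precision, F) triples with equal weight per metric."""
    return tuple(sum(column) / len(scores) for column in zip(*scores))
```

For a key of `[{"a", "b", "c"}, {"d"}]` and a response of `[{"a", "b"}, {"c", "d"}]`, B³ recall is 2/3 and precision is 3/4.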
MCORES’ advantage comes from linking markables with no token overlap
Phrase-level sub-MCORES performs similarly to MCORES
Greedy perspective system is the most favorable single-perspective system
Multi-perspective system performs as well as or better than single-perspective systems
Error analysis
  ▪ MCORES fails to classify misspelled person pairs
  ▪ Medical problems: false positives due to the difference between new and recurring events
  ▪ Treatments: false positives due to medications with different routes of administration
  ▪ Tests: false positives due to the large number of full-overlap instances that did not corefer

Developed a coreference resolution system for the medical domain (MCORES)
MCORES innovates through a multi-perspective and knowledge-based feature set
MCORES outperforms third-party systems and an in-house baseline, improving coreference resolution on clinical records