Faculty of Computer Science © 2006 CMPUT 605
March 31, 2008

Towards Applying Text Mining and Natural Language Processing for Biomedical Ontology Acquisition
Inniss T., Light M., Thomas G., Lee J., Grassi M., Williams A.
TMBIO (2006)

Amit Satsangi

© 2006 Department of Computing Science CMPUT 605

Focus
Ontology for describing age-related macular degeneration (AMD)
Comparison of the accuracy of three methods for ontology acquisition:
– Natural Language Processing (NLP)
– Text Mining (SAS Text Miner)
– Human expert: manual and ad hoc knowledge acquisition
IDOCS (Intelligent Distributed Ontology Consensus System)

Introduction
No common, standardized vocabulary exists for classifying disease types for certain eye diseases
Clinicians, dispersed geographically, may use different terms to describe the same condition
The research aims to extract the feature and attribute descriptions for the vocabulary of AMD and to build an ontology from them

Related Work
Much research since the 1990s on applying NLP techniques in medicine, biomedicine, etc.
NLP and text data mining are recognized as playing an important role in this endeavor
Research has focused on online repositories such as Medline and PubMed
NLP systems developed: MedLEE, UMLS, GENIES, etc.

IDOCS

Methodology
Four clinical experts in retinal diseases were enlisted to view 100 sample eye images of AMD
The experts were in different geographic locations
They described their observations using digital voice recorders, with no artificially imposed vocabulary constraints
Another retinal expert manually parsed the transcribed text: extracting keywords, organizing the keywords into categories, etc.

Results: Human Experts

Methodology: NLP
NLP: used for information extraction and automatic summarization
Identify short sequences of words whose meaning goes over and above the meaning composed directly from their parts, e.g. “extreme programming”
Ngram Statistics Package (NSP) used for collocation discovery in the case of bigrams
Word-pair associations measured by pointwise mutual information (PMI)

Methodology: NLP
A larger PMI indicates a stronger association between the words

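The PMI score for a word pair (x, y) is commonly defined as log2( P(x, y) / (P(x) P(y)) ). The sketch below shows that definition for adjacent word pairs, assuming whitespace tokens and simple relative-frequency probability estimates. It is not the NSP implementation; the function name `bigram_pmi` and the toy sentence are made up for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # estimated joint probability of the pair
        p_x = unigrams[w1] / n_uni   # estimated probability of the first word
        p_y = unigrams[w2] / n_uni   # estimated probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy sentence invented for illustration; not taken from the study's transcripts.
tokens = "extreme programming is not extreme ironing".split()
scores = bigram_pmi(tokens)

# Pairs sorted from strongest to weakest association.
for pair, pmi in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(pmi, 3))
```

On very small samples PMI tends to favor rare pairs, so a minimum-frequency cutoff is often applied before ranking; that cutoff is left out here to keep the sketch short.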
Results: NLP

Methodology: Text Mining (SAS Text Miner)
A collection of documents (corpus) is the input to any text mining algorithm
The corpus is broken into tokens, or terms (tokens in a particular language)
Term weighting measures: Entropy, Inverse Document Frequency (IDF), Global Frequency (GF)-IDF, None (global weight of 1), and Normal term weight

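To make the term weighting measures concrete, here is a minimal sketch of three of them using commonly cited textbook definitions: IDF as log2(n / d) + 1, GF-IDF as g / d, and entropy as 1 + sum_j p_j log2(p_j) / log2(n), where n is the number of documents, d the number of documents containing the term, g the term's total count, and p_j the share of that count falling in document j. These formulas are an assumption; the exact definitions used by SAS Text Miner may differ, and the Normal weight is omitted. The function name `term_weights` and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

def term_weights(docs):
    """docs: list of token lists. Returns {term: {weight_name: value}}."""
    n_docs = len(docs)
    per_doc = [Counter(doc) for doc in docs]
    global_freq = Counter()  # g: total occurrences of each term in the corpus
    doc_freq = Counter()     # d: number of documents containing each term
    for counts in per_doc:
        global_freq.update(counts)
        doc_freq.update(counts.keys())

    weights = {}
    for term, g in global_freq.items():
        d = doc_freq[term]
        # Sum of p * log2(p) over the documents that contain the term.
        plogp = 0.0
        for counts in per_doc:
            f = counts.get(term, 0)
            if f:
                p = f / g
                plogp += p * math.log2(p)
        entropy = 1.0 + plogp / math.log2(n_docs) if n_docs > 1 else 1.0
        weights[term] = {
            "idf": math.log2(n_docs / d) + 1.0,
            "gf_idf": g / d,
            "entropy": entropy,
            "none": 1.0,  # "None" means every term gets a global weight of 1
        }
    return weights

# Toy corpus invented for illustration; not from the study.
docs = [
    ["lesion", "large"],
    ["lesion", "small"],
    ["lesion", "pigment", "pigment"],
]
weights = term_weights(docs)
for term in sorted(weights):
    print(term, {k: round(v, 3) for k, v in weights[term].items()})
```

Under these definitions a term spread evenly across all documents gets an entropy weight near 0, while a term concentrated in a single document gets a weight of 1, so the rarer, more discriminating terms are emphasized.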
Results: Text Miner
Frequency weight: None
Term weight: Normal

Common Terms

Comparison
Text mining is thus a viable and effective method for determining the vocabulary to describe a particular disease
Text mining found many of the terms that NLP found
The human expert is the best ground truth

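One simple way to quantify a comparison like this is to treat the human expert's term list as the reference and measure set overlap. The sketch below assumes each method yields a plain set of terms and reports precision and recall against the reference. The helper `compare_to_reference` and the placeholder term sets are hypothetical and do not reproduce the study's data or its evaluation procedure.

```python
def compare_to_reference(candidate, reference):
    """Overlap of a candidate term set with a reference (ground truth) set."""
    candidate, reference = set(candidate), set(reference)
    shared = candidate & reference
    precision = len(shared) / len(candidate) if candidate else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    return {"shared": shared, "precision": precision, "recall": recall}

# Placeholder term sets, invented purely for illustration.
hypothetical_expert_terms = {"t1", "t2", "t3", "t4"}
hypothetical_text_mining_terms = {"t2", "t3", "t5"}
hypothetical_nlp_terms = {"t3", "t6"}

tm_vs_expert = compare_to_reference(hypothetical_text_mining_terms,
                                    hypothetical_expert_terms)
nlp_vs_expert = compare_to_reference(hypothetical_nlp_terms,
                                     hypothetical_expert_terms)
# Terms found by both automated methods.
tm_and_nlp = hypothetical_text_mining_terms & hypothetical_nlp_terms

print("text mining vs expert:", tm_vs_expert)
print("NLP vs expert:", nlp_vs_expert)
print("found by both automated methods:", tm_and_nlp)
```

Terms that an automated method finds but the expert list lacks are worth manual review, since they may be descriptors the experts missed rather than errors.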
Ontology Generation

Conclusion and Future Work
Human experts are the best, but they did miss some key descriptors
Text mining and NLP can enhance the generation of feature descriptions by catching descriptors that the experts miss
As a consequence, a more robust vocabulary can be generated
Extension: evaluate the effectiveness of the automated tools (text mining and NLP)
Different weighting schemes to be tried in the future

Thank You For Your Attention!