Development of a Pediatric Text-Corpus for Part-of-Speech Tagging
John Pestian 1, Lukasz Itert 1,2, and Włodzisław Duch 2,3
1 BioMedical Informatics, 3333 Burnet Avenue, Children’s Hospital Research Foundation, Cincinnati, OH 45229, USA, {jpestian, litert}@cchmc.org
2 Department of Informatics, Nicolaus Copernicus University, 87-100 Torun, Poland, duch@ieee.org
3 School of Computer Engineering, Nanyang Technological University, Singapore
Zakopane, V-2004

Outline
The project
Tools
Data description and problems
Software
Results
Plans…

CCHMC project outline (simplified)
INPUT (free medical text)
Preprocessing
MetaMap software: UMLS concept discovery and indexing
Annotations: Concept Space (UMLS concepts)
MetaMap input
Conclusions, important relations, any useful information, i.e. hypothesis generation and validation
Decision support system

CCHMC project
Semantic Retrieval System:
– Retrieving knowledge from the clinical annotations database: discovering relations, extracting rules, facts and any other useful information; semantic analysis of medical text.
– Ontology-based approach (merging UMLS and a common-sense ontology?)
– XML as the standard
– In fact: background for expert systems (an artificial physician's helper?)

Goals (too ambitious?)
In the final stage we would like our system to help answer questions like:
– Is X related to Y?
– Will X help a patient with Y?
– What causes X?
– What causes changes of X?
– What are the therapy options for X?

UMLS * – the biggest medical ontology
Unified Medical Language System, started in 1986 and maintained by the National Library of Medicine (NLM).
Goal: “to aid the development of systems that help health professionals and researchers retrieve and integrate electronic biomedical information from a variety of sources.”
Consists of three main parts:
– METATHESAURUS
– SEMANTIC NETWORK
– THE SPECIALIST LEXICON
* http://www.nlm.nih.gov/research/umls/

UMLS in numbers
Metathesaurus: 875,255 concepts and 2.14 million concept names in its source vocabularies.
Semantic Network: 135 semantic types, 54 relationships.
SPECIALIST lexicon: 183,000 lexical entries covering more than 292,000 strings.

UMLS – Example (keyword: “virus”)
Metathesaurus:
– Concept: Virus, CUI: C0042776, Semantic Type: Virus
– Definition (1 of 3): “Group of minute infectious agents characterized by a lack of independent metabolism and by the ability to replicate only within living host cells; have capsid, may have DNA or RNA (not both)”. (CRISP Thesaurus)
– Synonyms: Virus, Vira Viridae
Semantic Network:
– Relates concepts, not words!
– "Virus" causes "Disease or Syndrome"
– Other relations: “interacts with”, “contains”, “consists of”, “result of”, “related to”, …
– Other types: “Body location or region”, “Injury or Poisoning”, “Diagnostic procedure”, …
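A minimal sketch of what "relates concepts, not words" can look like in code, assuming a toy in-memory model: each concept carries a semantic type, and relations are stored between types. Only the Virus concept and the "causes" relation come from the slide; the `Concept` class, the `type_relations` helper and the second concept (with its placeholder identifier) are hypothetical illustrations, not part of any UMLS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    cui: str            # concept unique identifier
    name: str
    semantic_type: str


# Relations live between semantic types, not between surface words.
# Only this one triple is taken from the slide.
TYPE_RELATIONS = {("Virus", "causes", "Disease or Syndrome")}


def type_relations(source: Concept, target: Concept) -> list[str]:
    """Return the relations that hold from source's type to target's type."""
    return sorted(
        relation
        for (src_type, relation, dst_type) in TYPE_RELATIONS
        if src_type == source.semantic_type and dst_type == target.semantic_type
    )


virus = Concept("C0042776", "Virus", "Virus")
# Hypothetical concept; the identifier is a placeholder, not a real CUI.
illness = Concept("CUI-PLACEHOLDER", "some illness", "Disease or Syndrome")

print(type_relations(virus, illness))
print(type_relations(illness, virus))
```

Because the lookup goes through semantic types, any synonym that maps to the same concept inherits the same relations.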

UMLS – Example (cont’d)
SPECIALIST lexicon:
{base=virus
entry=E0064702
cat=noun
variants=reg}
“reg” means regular plural form.
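The lexicon record above is simple enough to parse by hand. The following is a rough sketch, assuming that every field is a whitespace-separated `key=value` token (so multi-word values are not handled) and that "reg" can be approximated by a simplified English pluralisation rule. The function names are made up for this illustration and are not part of the SPECIALIST tools.

```python
def parse_lexicon_record(record: str) -> dict[str, list[str]]:
    """Parse a '{key=value ...}' record into a dict of value lists."""
    body = record.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("record must be wrapped in braces")
    fields: dict[str, list[str]] = {}
    for token in body[1:-1].split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not a key=value token: {token!r}")
        # A key may repeat, so values are collected in a list.
        fields.setdefault(key, []).append(value)
    return fields


def regular_plural(base: str) -> str:
    """Simplified stand-in for the 'reg' variant rule (assumption)."""
    if base.endswith(("s", "x", "z", "ch", "sh")):
        return base + "es"
    if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
        return base[:-1] + "ies"
    return base + "s"


record = "{base=virus entry=E0064702 cat=noun variants=reg}"
parsed = parse_lexicon_record(record)
print(parsed)
if "reg" in parsed.get("variants", []):
    print(regular_plural(parsed["base"][0]))
```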

UMLS – additional features
Extended API (Java) with 20 packages and 151 classes.
Easy support for other languages (currently 15).
The UMLS Knowledge Server: a set of web-based interaction tools and a programmer interface that give users and developers access to the biomedical terminologies found within the UMLS.

MetaMap Transfer (MMTx) *
MetaMap: a set of tools (programs) that map arbitrary text to concepts in the UMLS Metathesaurus. Equivalently, it discovers Metathesaurus concepts in text.
Built-in part-of-speech tagger (helps in syntactic analysis).
Goal in using MetaMap: to get the best possible accuracy, i.e. to map as many concepts correctly as possible.
* http://mmtx.nlm.nih.gov/

Medical data
Clinical annotations: nurses' and surgical notes, discharge summaries. Information about symptoms, procedures, findings, therapeutic response.
Typical annotation: 4 y/o with H/O JRA and 6d H/O fever, headache, photophobia. Presented originally to Grant County and St Luke`s West and was started on Ceftriaxone on 7/30 for presumed sepsis.
Problems:
– Multiple abbreviations, synonyms, punctuation, misspellings (handwritten text), capitalization, spacing, non-letter characters
– Understandable by specialists only.
– It’s huge

Initial data processing – Encryption Broker software
Features:
Split raw input text into paragraphs, sentences and words.
Search for medical and common-sense abbreviations and symbols in the text and replace them with the appropriate definitions.
Search for ambiguous abbreviations and replace them with the proper meaning based on the surrounding text.
Proper processing of multi-words.
Make confidential data harmless.
Proper handling of punctuation, special symbols and exceptions.
Output text is ready to be tagged.
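The abbreviation steps listed above can be sketched roughly as follows. This is not the Encryption Broker code, only an illustration under stated assumptions: the expansion tables are tiny hand-made samples, and ambiguity is resolved by counting cue words in the surrounding text, which is one possible strategy among many. All names (`ABBREVIATIONS`, `AMBIGUOUS`, `expand_abbreviations`) are hypothetical.

```python
# Unambiguous abbreviations (illustrative sample only).
ABBREVIATIONS = {
    "y/o": "year old",
    "h/o": "history of",
    "jra": "juvenile rheumatoid arthritis",
}

# Ambiguous abbreviations: each sense comes with cue words (assumed values).
AMBIGUOUS = {
    "ra": [
        ("right atrium", {"heart", "cardiac", "echo"}),
        ("rheumatoid arthritis", {"joint", "joints", "arthritis"}),
    ],
}

PUNCTUATION = ".,;:"


def resolve_ambiguous(core: str, context: set[str]) -> str:
    """Pick the sense with the most cue words in context, else keep as-is."""
    best, best_hits = core, 0
    for expansion, cues in AMBIGUOUS.get(core.lower(), []):
        hits = len(cues & context)
        if hits > best_hits:
            best, best_hits = expansion, hits
    return best


def expand_abbreviations(text: str) -> str:
    tokens = text.split()
    # Lower-cased words of the whole text serve as disambiguation context.
    context = {tok.rstrip(PUNCTUATION).lower() for tok in tokens}
    expanded = []
    for token in tokens:
        core = token.rstrip(PUNCTUATION)
        tail = token[len(core):]  # trailing punctuation is kept
        if core.lower() in ABBREVIATIONS:
            core = ABBREVIATIONS[core.lower()]
        else:
            core = resolve_ambiguous(core, context)
        expanded.append(core + tail)
    return " ".join(expanded)


print(expand_abbreviations("4 y/o with H/O JRA and 6d H/O fever, headache, photophobia."))
print(expand_abbreviations("RA enlarged on cardiac echo."))
```

An abbreviation with no matching cue words is left unchanged rather than guessed, which keeps errors visible for later review.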

Tagging process
Creating the training set:
– Random sample of data from a 20,000,000-word set
– Hand tagging in India by a linguistic group
– Use of the Penn Treebank tagset
– Finding UMLS multi-words in the text
Training:
– Training TreeTagger
Validation:
– 10-fold cross-validation (10 CV): each 1/10th test part is created by the bootstrapping method, the rest gives the training part
– The final result is the average of three independent 10 CV procedures
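The validation scheme (three independent 10 CV runs, averaged) can be sketched as below. Several things here are assumptions: a most-frequent-tag baseline stands in for TreeTagger, the corpus is a made-up toy, and the folds are plain shuffled partitions of sentences rather than the bootstrapped test parts described on the slide.

```python
import random
from collections import Counter, defaultdict

Sentence = list[tuple[str, str]]  # (word, tag) pairs


def train_baseline(sentences: list[Sentence]):
    """Most-frequent-tag baseline; a stand-in for a real tagger."""
    counts: dict[str, Counter] = defaultdict(Counter)
    all_tags: Counter = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    return lexicon, default_tag


def accuracy(model, sentences: list[Sentence]) -> float:
    lexicon, default_tag = model
    pairs = [pair for sentence in sentences for pair in sentence]
    correct = sum(lexicon.get(word, default_tag) == tag for word, tag in pairs)
    return correct / len(pairs)


def cross_validate(sentences: list[Sentence], folds: int = 10, seed: int = 0) -> float:
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    scores = []
    for k in range(folds):
        test = shuffled[k::folds]  # every folds-th sentence, offset k
        train = [s for i, s in enumerate(shuffled) if i % folds != k]
        scores.append(accuracy(train_baseline(train), test))
    return sum(scores) / folds


def repeated_cv(sentences: list[Sentence], repeats: int = 3, folds: int = 10) -> float:
    """Average of several independent cross-validation runs."""
    return sum(cross_validate(sentences, folds, seed) for seed in range(repeats)) / repeats


# Toy corpus (made up): three sentences repeated ten times each.
toy_corpus: list[Sentence] = [
    [("patient", "NN"), ("has", "VBZ"), ("fever", "NN")],
    [("the", "DT"), ("fever", "NN"), ("persists", "VBZ")],
    [("severe", "JJ"), ("headache", "NN"), ("reported", "VBN")],
] * 10

score = repeated_cv(toy_corpus)
print(f"mean accuracy over 3 x 10 CV: {score:.3f}")
```

Splitting by sentence rather than by token keeps each sentence's context intact, which matters for a context-based tagger even though this baseline ignores context.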

TreeTagger *
Part-of-speech tagger based on the ID3 decision tree.
Tagging is performed by analyzing the context of a word using trigrams.
Needs both a lexicon (a base list of words) and a training set (which provides information about correlations between part-of-speech tags within a sentence).
Fast, and copes well with misspelled words as well as words that are not in the lexicon.
As a base lexicon we use its own lexicon together with the UMLS lexicon converted to a suitable format.
* H. Schmid, Probabilistic Part-of-Speech Tagging Using Decision Trees, In Proceedings of the Conference on New Methods in Language Processing, 1994.
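Combining the tagger's own lexicon with converted UMLS entries might look roughly like this. The sketch rests on two assumptions that are not confirmed by the slides: that a lexicon line consists of a word form, a tab, and whitespace-separated tag/lemma pairs, and that SPECIALIST categories can be mapped to Penn Treebank tags with the small table below. The function names and the mapping are illustrative only.

```python
# Assumed mapping from SPECIALIST categories to Penn Treebank base tags.
CATEGORY_TO_PENN = {"noun": "NN", "verb": "VB", "adj": "JJ", "adv": "RB"}

Lexicon = dict[str, set[tuple[str, str]]]  # word -> {(tag, lemma), ...}


def merge_lexicons(base: Lexicon, umls_entries: list[tuple[str, str]]) -> Lexicon:
    """Add (word, category) entries to a copy of the base lexicon."""
    merged: Lexicon = {word: set(pairs) for word, pairs in base.items()}
    for word, category in umls_entries:
        tag = CATEGORY_TO_PENN.get(category)
        if tag is None:
            continue  # categories without a mapping are skipped
        merged.setdefault(word, set()).add((tag, word))
    return merged


def render_lexicon(lexicon: Lexicon) -> list[str]:
    """Render one line per word: word, tab, then tag/lemma pairs."""
    lines = []
    for word in sorted(lexicon):
        pairs = " ".join(f"{tag} {lemma}" for tag, lemma in sorted(lexicon[word]))
        lines.append(f"{word}\t{pairs}")
    return lines


base_lexicon: Lexicon = {"virus": {("NN", "virus")}}
umls_entries = [("virus", "noun"), ("febrile", "adj"), ("xyz", "unknown-cat")]

lexicon_lines = render_lexicon(merge_lexicons(base_lexicon, umls_entries))
for line in lexicon_lines:
    print(line)
```

Storing tag/lemma pairs in a set means a word that appears in both sources is listed once, with the union of its tags.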

Tagging results (395,000-word set)
Figure 1. The tagging accuracy vs. the training set size. All results are from the 10-CV tests. Black squares show points where actual calculations were done.

Training vs. testing accuracy
Figure 2. Accuracies on the training set (dashed line) and the test set (solid line) for 215,000 words.

Unique (occurring once) trigrams
Figure 3. The percentage of unique trigrams vs. the training set size.
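A measure like the one plotted in Figure 3 could be computed along these lines. The slides do not say whether the trigrams are over words or tags, nor what the denominator is, so this sketch assumes a generic token sequence and reports singletons as a share of distinct trigrams. The example sequence is made up.

```python
from collections import Counter


def trigrams(sequence: list[str]) -> list[tuple[str, ...]]:
    """All overlapping windows of three consecutive items."""
    return [tuple(sequence[i:i + 3]) for i in range(len(sequence) - 2)]


def unique_trigram_percentage(sequence: list[str]) -> float:
    """Percentage of distinct trigrams that occur exactly once."""
    counts = Counter(trigrams(sequence))
    if not counts:
        return 0.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return 100.0 * singletons / len(counts)


# Made-up tag sequence: (DT, NN, VBZ) occurs twice, the other three once.
tags = ["DT", "NN", "VBZ", "DT", "NN", "VBZ", "JJ"]
print(unique_trigram_percentage(tags))
```

Running the function on growing prefixes of a corpus gives one point per training-set size.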

Conclusions
The accuracy of the tagging system goes up as the size of the training set increases.
The number of unique trigrams decreases as the training set grows, so ML methods should give better results.
Multi-word support helps to increase the accuracy; a bigger contribution from multi-words is expected in the MetaMap mapping process.
Results from the training set indicate that the accuracy limit for tests is near 93%.

Plans
1. Cleaning the text: many misspellings.
2. About 1000 ambiguous acronyms in >700K trigrams – disambiguation rules?
3. Recognition memory techniques applied to sequence => term mapping.
4. Semantic corrections via high-dimensional vector representation of the words; later episodic memory.
5. Creation of XML and later DAML versions of annotations from unstructured text.
6. Discovery System for intelligent decision support.
