cTAKES
The clinical Text Analysis and Knowledge Extraction System
cTAKES Overview
- Open source software
- Natural Language Processing (NLP)
- Developed at Mayo Clinic
- Contributed to the Open Health Natural Language Processing (OHNLP) Consortium
- Built on the Apache UIMA framework
  - Unstructured Information Management Architecture
  - UIMA framework itself is also open source
Open Health Natural Language Processing (OHNLP) Consortium
Goal: Foster an open-source collaborative community around clinical NLP that can deliver best-of-breed annotators, leverage the dynamic features of UIMA flow control, and establish the infrastructure for clinical NLP.
www.ohnlp.org
www.ohnlp.org
Gateway to:
- News
- Documentation
- Downloads
- Forums for asking questions
- Bug tracker for reporting issues
- List of publications
cTAKES Goals
- Phenotype extraction
- Generic – to be used for a variety of retrievals and use cases
- Expandable – at the information model level and methods
- Modular
- Cutting edge technologies – best methods combining existing practices and novel research with rapid technology transfer
- Best software practices (80M+ notes)
- Commitment to both R and D in R&D
Original cTAKES Components
- Sentence boundary detection (OpenNLP technology)
- Tokenization (rule-based)
- Morphologic normalization (NLM’s LVG)
- POS tagging (OpenNLP technology)
- Shallow parsing (OpenNLP technology)
- Named Entity Recognition
  - Dictionary mapping (lookup algorithm)
- Negation and context identification (both based on NegEx)
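The dictionary mapping and negation steps can be illustrated with a toy sketch. This is not the cTAKES implementation: the two-entry dictionary, the short trigger list, the five-token window, and all function names are assumptions made for illustration, and the regular-expression sentence splitter and tokenizer stand in for the OpenNLP and rule-based components listed above.

```python
import re
from dataclasses import dataclass

# Hypothetical two-entry dictionary: surface term -> named entity type.
DICTIONARY = {
    "unstable angina": "disease/disorder",
    "chest pain": "sign/symptom",
}
# Small illustrative subset of NegEx-style negation triggers.
NEGATION_TRIGGERS = ["no evidence of", "denies", "no"]
WINDOW = 5  # assumed number of tokens to look back for a trigger


@dataclass
class Mention:
    text: str
    entity_type: str
    negated: bool


def split_sentences(text):
    """Naive sentence boundary detection on ., ! and ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercased word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def find_mentions(text):
    """Dictionary lookup per sentence, then a look-back for negation triggers."""
    mentions = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for term, entity_type in DICTIONARY.items():
            term_tokens = term.split()
            n = len(term_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == term_tokens:
                    left = tokens[max(0, i - WINDOW):i]
                    negated = any(contains_phrase(left, t.split())
                                  for t in NEGATION_TRIGGERS)
                    mentions.append(Mention(term, entity_type, negated))
    return mentions


if __name__ == "__main__":
    for m in find_mentions("No evidence of unstable angina. Patient reports chest pain."):
        print(m)
```

Restricting the trigger search to the same sentence and to a short window before the mention keeps a negation in one sentence from leaking into the next. A real system would use the full NegEx trigger lists and a proper clinical dictionary.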
Original cTAKES Named Entities
- Drug mentions
- Disease/disorder mentions
- Sign/symptom mentions
- Anatomical site mentions
With these attributes:
- RxNorm code or Concept Unique Identifier (CUI) and SNOMED-CT codes
- Negation (denies chest pain)
- Status (history of, family history of, possible/probable)
Additional cTAKES Components
- Smoking status classifier
- More detailed drug mention annotator
  - dosage
  - route
  - form
  - drug change status
  - and more
- Peripheral Artery Disease (PAD) annotator
- Dependency parser
Output Example: Disorder Object
“No evidence of unstable angina.”
- Text: unstable angina
- Associated codes: SNOMED 4557003, UMLS CUI C0002965
- Named entity type: disease/disorder
- Negation: true
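One way to picture this output is as a small record type. The sketch below is an assumption for illustration only: the class name NamedEntity, its field names, and the helper affirmed are hypothetical and do not reflect the actual cTAKES type system. The values are the ones from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NamedEntity:
    """Hypothetical record mirroring the attributes listed on the slide."""
    text: str
    entity_type: str
    codes: Dict[str, str] = field(default_factory=dict)
    negated: bool = False
    status: Optional[str] = None  # e.g. "history of"; not given in the example


def affirmed(entities: List[NamedEntity]) -> List[NamedEntity]:
    """Keep only the mentions that are not negated."""
    return [e for e in entities if not e.negated]


# The disorder object for: "No evidence of unstable angina."
disorder = NamedEntity(
    text="unstable angina",
    entity_type="disease/disorder",
    codes={"SNOMED": "4557003", "UMLS_CUI": "C0002965"},
    negated=True,
)

if __name__ == "__main__":
    print(disorder)
    print(affirmed([disorder]))
```

Because the mention is negated, a downstream retrieval that filters with affirmed would not count this note as evidence of unstable angina.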
cTAKES Configuration Options
- XML configuration files
- Control many things, such as
  - Dictionary location
  - Dictionary format
  - Which dictionaries to use
  - Type of input (plain text or CDA)
- Forums contain details on creating your own dictionary
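As a rough illustration of how such options might be read from XML, the sketch below parses a made-up configuration. The element and attribute names (pipelineConfig, input, dictionary, enabled, and so on) and the file paths are invented for this example and are not the real cTAKES or UIMA descriptor schema; consult the cTAKES documentation for the actual file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; names and paths are illustrative assumptions.
CONFIG_XML = """
<pipelineConfig>
  <input type="plaintext"/>
  <dictionaries>
    <dictionary name="disorders" format="csv"
                location="resources/disorders.csv" enabled="true"/>
    <dictionary name="drugs" format="csv"
                location="resources/drugs.csv" enabled="false"/>
  </dictionaries>
</pipelineConfig>
"""


def load_config(xml_text):
    """Return the input type and the list of enabled dictionaries."""
    root = ET.fromstring(xml_text.strip())
    input_type = root.find("input").get("type")
    dictionaries = [
        {
            "name": d.get("name"),
            "format": d.get("format"),
            "location": d.get("location"),
        }
        for d in root.iter("dictionary")
        if d.get("enabled") == "true"
    ]
    return {"input_type": input_type, "dictionaries": dictionaries}


if __name__ == "__main__":
    print(load_config(CONFIG_XML))
```

The sketch covers the four options named above: dictionary location, dictionary format, which dictionaries to use, and the type of input.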
cTAKES Methods
Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications
Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, Christopher G Chute.
JAMIA 2010;17:507-513
References
- http://www.ohnlp.org
- http://uima.apache.org