Data Normalization Dr. Stan Huff
# 2 Acknowledgements
Tom Oniki, Joey Coyle, Craig Parker, Yan Heras, Cessily Johnson, Roberto Rocha, Lee Min Lau, Alan James, and many, many others…
# 3 What are detailed clinical models? Why do we need them?
# 4 A diagram of a simple clinical model
Clinical Element Model for Systolic Blood Pressure
SystolicBPObs (SystolicBP)
  data: 138 mmHg
  quals:
    BodyLocation, data: Right Arm
    PatientPosition, data: Sitting
# 5 Need for a standard model
A stack of coded items is ambiguous (SNOMED CT)
–Numbness of right arm and left leg
  Numbness ( ) Right ( ) Arm ( ) Left ( ) Leg ( )
–Numbness of left arm and right leg
  Numbness ( ) Left ( ) Arm ( ) Right ( ) Leg ( )
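The ambiguity can be made concrete with a small sketch. It is illustrative only: the concept names stand in for the SNOMED CT codes, and the tuple layout (finding, body part, laterality) is an assumed structure, not a CEM definition. An unordered stack of codes cannot tell the two statements apart, while a structure that binds each laterality to a body part can.

```python
from collections import Counter

# Flat "stack" of coded items: each statement is only a bag of concepts.
stack_a = Counter(["Numbness", "Right", "Arm", "Left", "Leg"])  # right arm, left leg
stack_b = Counter(["Numbness", "Left", "Arm", "Right", "Leg"])  # left arm, right leg

# Structured form (assumed layout): laterality is bound to its body part.
model_a = {("Numbness", "Arm", "Right"), ("Numbness", "Leg", "Left")}
model_b = {("Numbness", "Arm", "Left"), ("Numbness", "Leg", "Right")}

stacks_equal = stack_a == stack_b   # the bags are indistinguishable
models_equal = model_a == model_b   # the structured statements differ

print(stacks_equal, models_equal)
```

The two stacks compare equal even though the clinical statements differ, which is the case the model is meant to prevent.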
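# 6 What if there is no model?
[Figure: hematocrit entry fields at two sites. Site #1: “Hct, manual” (%) and “Hct, auto” (%). Site #2: “Hct” (%) with method options Manual, Auto, Estimated. Sample value shown: 35.]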
HL7 V2.X Messages
Site 1:
OBX|1|CE|4545-0^Hct, manual||37||%|
OBX|1|CE|4544-3^Hct, auto||35||%|
Site 2:
OBX|1|CE| ^Hct||37||%|….|manual|
OBX|1|CE| ^Hct||35||%|….|auto|
# 8 Too many ways to say the same thing
A single name/code and value
–Hct, manual is 37 %
Two names/codes and values
–Hct is 37 %
–Method is manual (spun)
# 9 Model fragment in XML
Pre-coordinated representation
  Hct, manual (LOINC )
  37 %
Post-coordinated (compositional) representation
  Hct (LOINC )
  Method Manual
  37 %
# 10 Isosemantic Models
Precoordinated Model
  HematocritManualModel: HematocritManual (LOINC ), data 37 %
Post coordinated Model (Storage Model)
  HematocritModel: Hematocrit (LOINC ), data 37 %
    quals: HematocritMethodModel: Hematocrit Method, data Manual
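Because the two models carry the same meaning, an instance of one can be rewritten mechanically as an instance of the other. The sketch below assumes a simple `Observation` record and a lookup table, both hypothetical. The codes 4545-0 and 4544-3 are the ones shown in the Site 1 HL7 messages, and `"HCT"` is a placeholder for the generic hematocrit code, which the slides leave blank.

```python
from dataclasses import dataclass
from typing import Optional

# Pre-coordinated code -> (base concept, method).
# "HCT" is a hypothetical placeholder, not a real LOINC code.
PRECOORD_TO_POSTCOORD = {
    "4545-0": ("HCT", "manual"),
    "4544-3": ("HCT", "auto"),
}

@dataclass(frozen=True)
class Observation:
    code: str
    value: float
    units: str
    method: Optional[str] = None  # qualifier used by the post-coordinated form

def to_postcoordinated(obs: Observation) -> Observation:
    """Rewrite a pre-coordinated observation as base concept plus method."""
    if obs.code not in PRECOORD_TO_POSTCOORD:
        return obs  # already post-coordinated, or unknown: leave untouched
    base, method = PRECOORD_TO_POSTCOORD[obs.code]
    return Observation(base, obs.value, obs.units, method)

site1 = Observation("4545-0", 37, "%")
print(to_postcoordinated(site1))
```

Mapping everything to the post-coordinated form is one option; the same table could be inverted if the pre-coordinated form were chosen as the storage model.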
# 11 Relational database implications
If the patient’s hematocrit is <= 35 then ….

Pre-coordinated storage:
| Patient Identifier | Date and Time | Observation Type | Observation Value | Units |
| | /4/2005 | Hct, manual | 37 | % |
| | /19/2005 | Hct, auto | 35 | % |

Post-coordinated storage:
| Patient Identifier | Date and Time | Observation Type | Method | Observation Value | Units |
| | /4/2005 | Hct | manual | 37 | % |
| | /19/2005 | Hct | auto | 35 | % |
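A minimal sketch of how the rule "hematocrit <= 35" has to be written against each layout. It uses an in-memory SQLite database as a stand-in. The table names, column names, patient ID, and date labels are hypothetical; only the observation types, values, and units come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obs_precoord (
        patient_id TEXT, obs_date TEXT, obs_type TEXT, value REAL, units TEXT);
    CREATE TABLE obs_postcoord (
        patient_id TEXT, obs_date TEXT, obs_type TEXT, method TEXT,
        value REAL, units TEXT);
    INSERT INTO obs_precoord VALUES
        ('P1', 'day-1', 'Hct, manual', 37, '%'),
        ('P1', 'day-2', 'Hct, auto',   35, '%');
    INSERT INTO obs_postcoord VALUES
        ('P1', 'day-1', 'Hct', 'manual', 37, '%'),
        ('P1', 'day-2', 'Hct', 'auto',   35, '%');
""")

# Pre-coordinated layout: the query must enumerate every hematocrit variant.
low_pre = conn.execute(
    "SELECT obs_date, value FROM obs_precoord "
    "WHERE obs_type IN ('Hct, manual', 'Hct, auto') AND value <= 35"
).fetchall()

# Post-coordinated layout: one observation type covers all methods.
low_post = conn.execute(
    "SELECT obs_date, value FROM obs_postcoord "
    "WHERE obs_type = 'Hct' AND value <= 35"
).fetchall()

print(low_pre, low_post)
```

Both queries find the same row, but the first silently misses data if a new pre-coordinated variant appears and the list of types is not updated.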
# 12 More complicated items:
Signs, symptoms
Diagnoses
Problem list
Family History
Use of negation – “No Family Hx of Cancer”
Description of a heart murmur
Description of breath sounds
–“Rales in right and left upper lobes”
–“Rales, rhonchi, and egophony in right lower lobe”
# 13 What do we model?
All health care data, including:
–Allergies
–Problem lists
–Laboratory results
–Medication and diagnostic orders
–Medication administration
–Physical exam and clinical measurements
–Signs, symptoms, diagnoses
–Clinical documents
–Procedures
–Family history, medical history and review of symptoms
# 14 How are the models used?
EMR: data entry screens, flow sheets, reports, ad hoc queries
–Basis for application access to clinical data
Data normalization
–Creation of maps from models in the local system to the standard model
Target for the output of structured data from NLP
–Validation of data as it is stored in the database
Phenotype algorithms (decision logic)
–Basis for referencing data in phenotype definitions
Does NOT dictate physical storage strategy
# 15 Model Source Expression (CDL)
model BloodPressurePanel is panel {
  key code(BloodPressurePanel_KEY_ECID);
  statement SystolicBloodPressureMeas systolicBloodPressureMeas optional
    systolicBloodPressureMeas.methodDevice.conduct(methodDevice)
    systolicBloodPressureMeas.bodyLocationPrecoord.conduct(bodyLocationPrecoord)
    systolicBloodPressureMeas.bodyPosition.conduct(bodyPosition)
    systolicBloodPressureMeas.relativeTemporalContext.conduct(relativeTemporalContext)
    systolicBloodPressureMeas.subject.conduct(subject)
    systolicBloodPressureMeas.observed.conduct(observed)
    systolicBloodPressureMeas.reportedReceived.conduct(reportedReceived)
    systolicBloodPressureMeas.verified.conduct(verified);
  statement DiastolicBloodPressureMeas diastolicBloodPressureMeas optional ….
  statement MeanArterialPressureMeas meanArterialPressureMeas optional ….
  qualifier MethodDevice methodDevice optional;
    md.code.domain(BloodPressureMeasurementDevice_DOMAIN_ECID);
  qualifier BodyLocationPrecoord bodyLocationPrecoord optional;
    blp.code.domain(BloodPressureBodyLocationPrecoord_DOMAIN_ECID);
  modifier Subject subject optional;
  attribution Observed observed optional;
  attribution ReportedReceived reportedReceived optional;
  attribution Verified verified optional;
}
# 16 Compiler
[Diagram: CE Source File → Compiler → “In Memory” Form → CE Translator → output formats: HTML, SMArt RDF?, openEHR Archetype?, HL7 RIM Static Models?, Java Class, XML Template, .xsd, OWL?, UML?]
Artifacts Used
–CDL Model Definition
–CEM XML Schema
–HL7 Data Source
–CEM XML Instance
StandardLabObsQuantitative - CDL Definition
import StandardLabObs;
import ReferenceRangeNar;
model StandardLabObsQuantitative is statement extends StandardLabObs {
  key domain(StandardLabObsQuantitative_KEY_VALUESET_ECID);
  data PQ primaryPQValue unit.domain (UnitsOfMeasure_VALUESET_ECID)
    alternate {
      match CD secondaryCDValue code.domain(LabValue_VALUESET_ECID);
      match CD altCDValue code.domain(LabValue_VALUESET_ECID);
      otherwise ST altSTValue;
    };
  qualifier ReferenceRangeNar referenceRangeNar card(0..1);
  constraint primaryPQValue.isNullReasonCode.domain(LabNullFlavor_VALUESET_ECID);
  constraint abnormalInterpretation.CD.code.domain (AbnormalInterpretationNumericNom_VALUESET_ECID);
  constraint deltaFlag.CD.code.domain (DeltaFlagNumericNom_VALUESET_ECID);
}
StandardLabObsQuantitative - Schema Snippet
HL7 Source Instance
MSH|^~\&|OADD|153|DADD|XNEPHA| ||ORU^R01| |T|2.2||||
EVN|R01| |
PID|| | |007261|WHYLING^KAYLIE^O'TEST|| |F| |W|||(801) |(866) |||| | |
PV1||O|XNEPHA^XNEPHA^^IM||||28826^Allyson^Josephine^ O'TEST |^||||||||||OP|||||||||||||||||||||||||| ||||||||
ORC|RE||F506556|||||||||28826^Allyson^Josephine^ O'TEST ||||^|
OBR||^|F506556^|HCT^HEMATOCRIT|R|| |||70011^ROSEN,AUBRY^ O'TEST ||| |^|28826^Allyson^Josephine^ O'TEST ||||M ||||C|F|RFP^RFP|^^^^^R|^~^~^|||||||
OBX|1|NM|HCT^HEMATOCRIT|1.1|48|%|||R||F||| |IM^Performed at Inte|58528^ANDERSON^MARK|
LabObsQuantitative - XML Instance Snippet
[XML markup not preserved. Values shown in the instance: LOINC, HCT, equals, %, 48, … …]
# 22 Issues
Different groups use models differently
–NLP versus EMR
Structuring the models to meet more than one use
Options for different granularities of models
–Hematocrit model, model of pneumonia
–Quantitative lab result model, x-ray finding
Terminology integration – use of standards and terminology services
Models for “rare” kinds of data
–Medication being taken by a friend, not recommended by the physician
# 23 Questions?
Data Normalization Dr. Christopher Chute
IHC-Medication, Mayo, IHC LAB to CEM
Medication pipeline: HL7 (Meds) → HL7 Initializer → IHC-GCN TO-RXNORM Annotator → Drug CEM CAS Consumer
  Resource: IHC RXNORM resource
Lab pipeline: HL7 (Labs) → HL7 Initializer → Generic-LAB-Annotator → LAB CEM CAS Consumer
  Resources: Mayo LOINC resource, IHC LOINC resource
Other components shown: Mirth, SharpDb
UIMA Normalization Pipeline
Convert HL7 V2.x Lab / Med Order Messages into CEM XML instances
–Load SofA with HL7 message
–Create Segment Objects in CAS
–Normalize Segments in CAS
–Transform Segments into CEM instances
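The four steps can be sketched in plain Python. This is not UIMA code: there is no CAS or SofA here, only functions that mirror the load, segment, normalize, and transform stages. The XML element names are hypothetical and do not follow the CEM schema. The sample OBX segment is abbreviated from the HL7 source instance shown earlier.

```python
import xml.etree.ElementTree as ET

def parse_segments(message):
    """Split a pipe-delimited HL7 v2 message into (segment_id, fields) pairs."""
    segments = []
    for line in message.replace("\r", "\n").split("\n"):
        line = line.strip()
        if line:
            fields = line.split("|")
            segments.append((fields[0], fields))
    return segments

def normalize_obx(fields):
    """Pick out OBX-3 (observation identifier), OBX-5 (value), OBX-6 (units)."""
    code = fields[3].split("^")[0]  # first component of the coded identifier
    return {"code": code, "value": fields[5].strip(), "units": fields[6].strip()}

def to_xml(obs):
    """Serialize one normalized observation (element names are illustrative)."""
    root = ET.Element("Observation")
    for name, text in obs.items():
        ET.SubElement(root, name).text = text
    return ET.tostring(root, encoding="unicode")

def run_pipeline(message):
    """Load -> segment -> normalize -> transform, for OBX segments only."""
    return [to_xml(normalize_obx(fields))
            for seg_id, fields in parse_segments(message)
            if seg_id == "OBX"]

sample = "EVN|R01\rOBX|1|NM|HCT^HEMATOCRIT|1.1|48|%|||R||F"
instances = run_pipeline(sample)
print(instances[0])
```

In the real pipeline each stage would be a UIMA annotator working on a shared CAS, so stages could be swapped or added without touching the others.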
Mayo, IHC LAB to CEM
One of the new pipelines created to normalize HL7 2.x Lab Messages into CEM instances. We pre-processed the HL7 messages, converting them from HL7 pipe syntax into HL7 XML format.
Components shown: Mirth (HL7 Pipe Delimited input), HL7 (XML), HL7 Initializer, Generic-LAB-Annotators, LAB CEM CAS Consumer, SharpDb
Resources: Mayo LOINC resource, IHC LOINC resource
UIMA Pipeline Flow: Mayo, IHC LAB to CEM
Stages: Initialize → Parse → Normalize → Transform
Data: HL7 message → CAS (SOFA=HL7-XML) → CAS segments (PID, PV1, OBX) → CEM
Normalization Anatomy
Lab Annotators
–HL7 Segment Parser
–Date-Time To ISO Format
–Syntactic Integrity
–LOINC lookups
  IHC codes to LOINC table
  Mayo codes to LOINC table
LexGrid/CTS2 Terminology Services
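Two of these annotator tasks are small enough to sketch: converting an HL7 v2 timestamp to ISO 8601, and looking up a site-local lab code in a LOINC table. The table contents below are hypothetical examples (the site names and local codes are invented for illustration); the real IHC and Mayo mapping tables are not shown in the slides.

```python
from datetime import datetime

# HL7 v2 timestamps are digit strings; only three common precisions handled here.
_TS_FORMATS = {8: "%Y%m%d", 12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}

def hl7_ts_to_iso(ts):
    """Convert YYYYMMDD[HHMM[SS]] to an ISO 8601 string."""
    ts = ts.strip()
    fmt = _TS_FORMATS.get(len(ts))
    if fmt is None:
        raise ValueError(f"unsupported HL7 timestamp: {ts!r}")
    parsed = datetime.strptime(ts, fmt)
    return parsed.date().isoformat() if len(ts) == 8 else parsed.isoformat()

# Hypothetical (site, local code) -> LOINC table; entries are illustrative only.
LOCAL_TO_LOINC = {
    ("SITE_A", "HCT-AUTO"): "4544-3",
    ("SITE_B", "HCTM"): "4545-0",
}

def lookup_loinc(site, local_code):
    """Return the mapped LOINC code, or None when the local code is unmapped."""
    return LOCAL_TO_LOINC.get((site, local_code))

print(hl7_ts_to_iso("200504191530"), lookup_loinc("SITE_A", "HCT-AUTO"))
```

Returning `None` for unmapped codes keeps the decision about what to do with them (reject, queue for review, pass through) outside the lookup itself.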
Architectural Opportunities
[Diagram: Mirth and “CAS To XML” components; formats shown: HL7 2.x, CDA, Mayo CDA, CEM format; normalization layers labeled “Time, Syntax Etc.” and “Semantic”.]
Tactical Next Step Enhancements
Single CEM for multiple OBX segments
Efficiently utilize terminology services
Incorporate a library for HL7 clean-up routines
Increase scope of vocabulary standardization
Enhancements for the Drug Annotator
–Context enhancement issue
–Drug name surprises
Additional Vocabularies
Review sources used for normalization opportunities
E.g.
–In HL7 OBR Segments
  Standardize Service ID (Codes)
–In HL7 OBX Segments
  Standardize Units
  Standardize Reference Ranges
  Standardize Normal Flags
Drug Name Disambiguation
Real patient data presented a unique case in drug names: “ToDAY” is a brand name for cephapirin sodium. This is an interesting named entity disambiguation use case.
Where Persistence Fits In…
[Diagram: IHC (Backend CDR Systems) → Mirth Connect → IHC NwHIN Aurion Gateway → SHARP NwHIN Aurion Gateway → Mirth Connect → UIMA Pipeline → CEM Instance Database. A Mayo EDT System is also shown.]
Persistence Channels
One Channel per model
Data stored as an XML Instance of the model
Fields extracted from XML to use as indices
XML Schema defined for each model
Stored using database transactions

| CEM Model | Mirth Channel |
| Administrative Diagnosis | CemAdminDxToDatabase |
| Standard Lab Panel | CemLabToDatabase |
| Ambulatory Medication Order | CemMedicationToDatabase |
General Channel Design
[Diagram: Input Message Directory → Channel (CEM XML Instance) → Persistence Store Connector. The Channel also writes to a Processed Message Directory and an Error Message Directory.]
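The channel layout can be approximated with a short directory-polling sketch. This is not Mirth configuration. It assumes that each input file holds one XML instance, that well-formed files go to the store and then to the processed directory, and that files failing to parse go to the error directory. The `store` callback is a hypothetical stand-in for the persistence connector.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def process_directory(input_dir, processed_dir, error_dir, store):
    """Route each *.xml file to the store, then to processed/ or error/."""
    for path in sorted(Path(input_dir).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
            store(path.name, ET.tostring(root, encoding="unicode"))
            target = Path(processed_dir)
        except ET.ParseError:
            target = Path(error_dir)  # malformed instance: set aside for review
        shutil.move(str(path), str(target / path.name))

# Small demonstration in a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    inbox, processed, errors = (Path(tmp) / n for n in ("in", "processed", "error"))
    for d in (inbox, processed, errors):
        d.mkdir()
    (inbox / "good.xml").write_text("<Lab><value>48</value></Lab>")
    (inbox / "bad.xml").write_text("<Lab><value>48</value>")  # unclosed element

    stored = {}
    process_directory(inbox, processed, errors, stored.__setitem__)
    processed_names = sorted(p.name for p in processed.iterdir())
    error_names = sorted(p.name for p in errors.iterdir())

print(processed_names, error_names)
```

Keeping failed messages in their own directory means nothing is dropped silently, and a corrected file can be placed back in the input directory.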
SharpDB: a CEM Instance Database
Database Tables

| Table | Purpose |
| Demographics | Patient demographics (one row per patient) |
| PatientCrossReference | Associates internal Patient ID with Site Patient ID (one row per cross map) |
| SourceData | Information about the original source data (one row per instance message) |
| PatientData | CEM Instance XML with some source information (one row per instance message) |
| IndexData | Indices into the XML instance (multiple rows per instance message). AdminDx: one per message. Lab: one per observation. Medication: one per orderable item. |
Patient Demographics
Each message contains patient demographics
Demographics created on first received message based on site patient ID
Internal Patient ID is created and cross mapped to site patient ID
SharpDB is keyed off internally generated Patient ID
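The cross-mapping rule can be sketched with an in-memory SQLite database standing in for the MySQL store. The table names follow the Database Tables slide, but the column names and the `resolve_patient` helper are assumptions made for illustration. The first message for a site patient ID creates a demographics row and an internal ID, and later messages reuse it.

```python
import sqlite3

def open_db():
    """Create the two tables involved in patient identity (columns assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Demographics (
            patient_id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT);
        CREATE TABLE PatientCrossReference (
            site TEXT,
            site_patient_id TEXT,
            patient_id INTEGER REFERENCES Demographics(patient_id),
            PRIMARY KEY (site, site_patient_id));
    """)
    return conn

def resolve_patient(conn, site, site_patient_id, name):
    """Return the internal patient ID, creating it on the first message."""
    row = conn.execute(
        "SELECT patient_id FROM PatientCrossReference "
        "WHERE site = ? AND site_patient_id = ?",
        (site, site_patient_id)).fetchone()
    if row:
        return row[0]
    with conn:  # one transaction: demographics row plus cross-reference row
        cur = conn.execute("INSERT INTO Demographics (name) VALUES (?)", (name,))
        patient_id = cur.lastrowid
        conn.execute("INSERT INTO PatientCrossReference VALUES (?, ?, ?)",
                     (site, site_patient_id, patient_id))
    return patient_id

conn = open_db()
first = resolve_patient(conn, "IHC", "007261", "Test Patient")
again = resolve_patient(conn, "IHC", "007261", "Test Patient")
other = resolve_patient(conn, "Mayo", "007261", "Other Patient")
print(first, again, other)
```

Keying the cross reference on the pair (site, site patient ID) keeps identical local IDs from different sites apart.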
Running in a Cloud…
Various images were installed:
–NwHIN Gateway provided by Aurion
–MIRTH Connect, our interface engine
–UIMA Pipelines of various sorts
–MySQL database for persistence
–JBOSS / Drools rules engine
All open source, running in an Ubuntu Cloud!
SHARP Hardware Infrastructure
[Diagram: a Cloud Server (Cloud Controller, Walrus Controller, Cluster Controller, Storage Controller); Node Servers 1, 2, 3, … 11, each running a Node Controller and VMs; a Build/Backup Server; Persistence Storage and Image Storage; all connected through a Private Switch. An Admin Client Interface connects over VPN / LAN to manage the cloud; users connect over VPN / LAN to instances.]

| Hardware | No. of Physical Machines | CPU | Memory | Disk | Disk Space | Networking | Functionality | No. of NICs |
| Cloud Server | 1 | 8 | 12 GB | 10000 RPM SAS | 1 TB | 1 Gbps | Cloud, Walrus, Cluster and Storage Controller | 4 |
| Node Server | 1 | 8 | 32 GB | 10000 RPM SAS | 1 TB | 1 Gbps | Node Controller | 4 |
| Node Server | | | GB | 10000 RPM SAS | 600 GB/600 GB | 1 Gbps | Node Controller | 4 |
| Node Server | 1 | 8 | 64 GB | 7200 RPM SATA | 1 TB/1 TB | 1 Gbps | Node Controller | 4 |
| Node Server | 1 | 8 | 32 GB | 10000 RPM SAS | 4 TB | 1 Gbps | Node Controller | 4 |
| Build/Backup Server | 1 | 2 | 8 GB | 7200 RPM SATA | 2 TB | 1 Gbps | Build and Backup | 2 |
| Storage | | | | RPM SAS | 7.5 TB | 1 Gbps | Persistence and Image Storage | |
| Storage | | | | RPM SAS | 3.6 TB | 1 Gbps | Volume Storage | |
| Cisco 48 Port Switch | 2 | | | | | 1 GB | | |
Data Normalization Summary
Initial “tracer shot” at Data Normalization
–Cloud based processing using open source tools
–Proof of concept, UIMA for Data Normalization
–Move on to new problems / solutions…
–Opportunities exist:
  Add new annotators (modules) to the pipelines
  Widen usage and scope of vocabulary services
  Switch to real live flows and add HOSS clean up routines
  Various tweaks in NLP algorithms