Ontology-based Annotation & Query of TMA data
Nigam Shah
Stanford Medical Informatics
Tissue Microarrays
Stanford tissue microarray database
Key analysis issue
Tissue microarrays query a large number of samples/patients for one protein. The key query dimension in TMA data is the tissue sample. Because there is no commonly used ontology to describe the diagnosis [or annotations] for a given TMA sample in TMAD, it is not easy to perform such a query.
Ontologies considered
- The NCI Thesaurus, version 05.09g
- SNOMED-CT, from UMLS 2005AA
Available annotations for a block
Each donor block in the TMA has semi-structured text associated with it.

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Subclass
2334  Ovary        MMMT
3335  Prostate     Carcinoma  Adeno              intraductal
7022  Bladder      Carcinoma  Transitional cell  In situ
7288  Testis       teratoma   immature           Embryonal carcinoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid
6663  lung         Sarcoma    Leiomyo            epithelioid
4713  stomach      carcinoma  unknown
Map text to ontology terms
- Make all possible permutations
- Rules to weed out bad permutations
- Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms)
- Rules to weed out bad matches

Example: Prostate / Carcinoma / Adeno / intraductal gives 24 permutations, including:
  Prostate Carcinoma Adeno intraductal
  Carcinoma Prostate intraductal Adeno
  Adeno Carcinoma intraductal Prostate
  Prostate intraductal Adeno Carcinoma
Matched term: Prostate_Ductal_Adenocarcinoma
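The permutation-and-match step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for TMAD: the function names, the toy term table, and the synonym "Prostate Intraductal Adenocarcinoma" are assumptions made for the example, and the rules for weeding out bad permutations and bad matches are not modelled.

```python
from itertools import permutations


def normalize(text):
    """Lower-case and keep only letters/digits, so 'Adeno Carcinoma' equals 'Adenocarcinoma'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def build_index(term_synonyms):
    """Map every normalized label (term name or synonym) to its ontology term."""
    index = {}
    for term, synonyms in term_synonyms.items():
        for label in [term, *synonyms]:
            index[normalize(label)] = term
    return index


def map_term_set(words, index):
    """Try every ordering of the annotation words; return the exactly matched terms."""
    matches = set()
    for ordering in permutations(words):
        key = normalize(" ".join(ordering))
        if key in index:
            matches.add(index[key])
    return sorted(matches)


# Toy ontology fragment. The synonym is an assumption for illustration only;
# a real run would load NCI Thesaurus and SNOMED-CT terms and synonyms.
TOY_TERMS = {
    "Prostate_Ductal_Adenocarcinoma": ["Prostate Intraductal Adenocarcinoma"],
}

index = build_index(TOY_TERMS)
words = ["Prostate", "Carcinoma", "Adeno", "intraductal"]
n_orderings = len(list(permutations(words)))  # 4! orderings of four words
result = map_term_set(words, index)
print(n_orderings, result)
```

Matching only full-length orderings keeps the sketch short; finding terms such as Bladder_Carcinoma from a longer term-set would also require permuting subsets of the words.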
Sample matches (from NCI-T)

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Ontology Terms
2334  Ovary        MMMT                                                                Malignant_Mixed_Mesodermal_Mullerian_Tumor
3335  Prostate     Carcinoma  Adeno              intraductal                           Prostate_Ductal_Adenocarcinoma
7022  Bladder      Carcinoma  Transitional cell  In situ                               Stage_0_Transitional_Cell_Carcinoma, Transitional_Cell_Carcinoma, Bladder_Carcinoma, Carcinoma_in_situ
7288  Testis       teratoma   immature           Embryonal carcinoma                   Immature|Teratoma, Testicular_Embryonal_Carcinoma, Immature_Teratoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis  Hepatocellular_Carcinoma
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid                           Soft_Tissue_Sarcoma, Leiomyosarcoma, Epithelioid_Sarcoma
6663  lung         Sarcoma    Leiomyo            epithelioid                           Lung_Sarcoma, Leiomyosarcoma, Epithelioid_Sarcoma
4713  stomach      carcinoma  unknown                                                  Gastric_carcinoma
Results and validation
- Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets.
- 577 term-sets (6614 records) matched to the NCI Thesaurus.
- 365 term-sets (3465 records) matched to SNOMED-CT.
- In total, 6871 records (80% of annotated records in TMAD; 641 distinct term-sets) mapped to one or more ontology terms.

Validation
             NCI                         SNOMED-CT
             Appropriate  Inappropriate  Appropriate  Inappropriate
Set
Set
Set
Total
Average (%)  43.0 (86%)   7.0 (14%)      40.66 (81%)  9.33 (19%)
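The coverage figures can be cross-checked with simple arithmetic. The sketch below assumes that the "in total" counts are the union of the NCI and SNOMED-CT matches; under that assumption, inclusion-exclusion gives the number of records and term-sets matched in both ontologies. The "both" counts are derived here and are not stated on the slide.

```python
# Counts quoted on the slide.
total_records = 8495
nci_records, snomed_records, any_records = 6614, 3465, 6871
nci_sets, snomed_sets, any_sets = 577, 365, 641


def pct(part, whole):
    """Percentage of `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
# Assumes the "in total" counts are the union of the two ontologies' matches.
both_records = nci_records + snomed_records - any_records
both_sets = nci_sets + snomed_sets - any_sets

coverage = {
    "NCI": pct(nci_records, total_records),
    "SNOMED-CT": pct(snomed_records, total_records),
    "any": pct(any_records, total_records),
}
print(coverage, both_records, both_sets)
```

By this arithmetic the combined coverage comes to roughly the 80% quoted above, with the NCI Thesaurus accounting for most of it.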
Browsing interface
Parent & sibling nodes with data (burlywood)
Child nodes with data (yellow)
Child nodes with no data (grey)
Click on the “anchor” link to get data
Updates since February
How does ontology-based annotation help?
- Better search: we can retrieve samples of all the retroperitoneal tumors or malignant uterine neoplasms, for example.
- Better integration of data: we can correlate gene expression with protein expression across multiple tumor types.
  - Tissue microarray data from TMAD
  - Gene expression data from GEO
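A rough sketch of why ontology terms give better search: once records carry ontology terms, a query for a general term can also return records annotated with any of its descendants. The record annotations below follow the sample matches shown earlier; the parent links in the toy hierarchy and the function names are assumptions for illustration, not the actual NCI Thesaurus structure.

```python
# Toy is-a hierarchy (child -> parents). These links are illustrative assumptions.
PARENTS = {
    "Leiomyosarcoma": ["Sarcoma"],
    "Epithelioid_Sarcoma": ["Sarcoma"],
    "Soft_Tissue_Sarcoma": ["Sarcoma"],
    "Lung_Sarcoma": ["Sarcoma"],
    "Hepatocellular_Carcinoma": ["Carcinoma"],
}

# Record ID -> ontology terms, taken from the sample matches.
ANNOTATIONS = {
    6662: {"Soft_Tissue_Sarcoma", "Leiomyosarcoma", "Epithelioid_Sarcoma"},
    6663: {"Lung_Sarcoma", "Leiomyosarcoma", "Epithelioid_Sarcoma"},
    8060: {"Hepatocellular_Carcinoma"},
}


def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def search(query, annotations, parents):
    """Records annotated with `query` itself or with any descendant of it."""
    hits = []
    for record_id, terms in annotations.items():
        if any(t == query or query in ancestors(t, parents) for t in terms):
            hits.append(record_id)
    return sorted(hits)


print(search("Sarcoma", ANNOTATIONS, PARENTS))
print(search("Lung_Sarcoma", ANNOTATIONS, PARENTS))
```

With free-text annotations the first query would need to know every way a sarcoma can be spelled out; with the hierarchy, one general term is enough.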
Integrating mRNA and protein expression
[Figure: two data matrices, labelled Proteins x Samples and Genes x Samples]
Partial alignment of NCI-T and SNOMED-CT as a “bonus”
Steps in Alignment
- Anchor identification: identify similar class labels in the ontologies to be aligned. Usually done by string matching.
- Ontology structure: use the “similar” classes as anchors and examine the local [graph] structure around them to inform the “similarity” metric.
[Figure: two example ontologies, one with nodes Root, Term-1 to Term-5 and one with nodes R, t1 to t7]
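The two steps can be sketched as follows. This is one simple way to do it under stated assumptions, not the alignment method used in the talk: anchors are found by exact match on normalized labels, and the structural step is a Jaccard overlap between the anchored neighbours of a class in one ontology and the neighbours of a class in the other. The two toy graphs and all function names are hypothetical.

```python
def normalize(label):
    """Lower-case and keep only letters/digits for label comparison."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def nodes(graph):
    """All classes mentioned in a child -> parents mapping."""
    found = set(graph)
    for parents in graph.values():
        found.update(parents)
    return found


def find_anchors(graph_a, graph_b):
    """Step 1: pair classes whose normalized labels are identical."""
    by_label = {normalize(b): b for b in nodes(graph_b)}
    return {a: by_label[normalize(a)] for a in nodes(graph_a) if normalize(a) in by_label}


def neighbours(graph, node):
    """Parents and children of `node`."""
    result = set(graph.get(node, []))
    result.update(child for child, parents in graph.items() if node in parents)
    return result


def structural_score(a, b, graph_a, graph_b, anchors):
    """Step 2: Jaccard overlap of a's anchored neighbours (translated to B) and b's neighbours."""
    translated = {anchors[n] for n in neighbours(graph_a, a) if n in anchors}
    around_b = neighbours(graph_b, b)
    union = translated | around_b
    return len(translated & around_b) / len(union) if union else 0.0


# Toy ontologies (child -> parents); labels and links are illustrative only.
graph_a = {"Carcinoma": ["Neoplasm"], "Sarcoma": ["Neoplasm"],
           "Hepatocellular_Carcinoma": ["Carcinoma"]}
graph_b = {"carcinoma": ["neoplasm"], "sarcoma": ["neoplasm"],
           "liver cell carcinoma": ["carcinoma"]}

anchors = find_anchors(graph_a, graph_b)
good = structural_score("Hepatocellular_Carcinoma", "liver cell carcinoma", graph_a, graph_b, anchors)
bad = structural_score("Hepatocellular_Carcinoma", "sarcoma", graph_a, graph_b, anchors)
print(anchors, good, bad)
```

In this toy case the two liver classes share no label, yet they score highest because both sit directly under an anchored pair.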
We might improve alignment …
- Ontology [graph] structure based step
- Provide anchors from annotated data
[Figure: the two example ontologies (Root, Term-1 to Term-5; R, t1 to t7) with the pairs Term-2 / t1 and Term-5 / t5 marked; sample S2 is annotated with both t5 and Term-5]
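One way to read "anchors from annotated data": when the same sample is mapped to a term in each ontology, that pair of terms is a candidate anchor, and pairs supported by several samples are more credible. The sketch below counts such co-annotations. The sample IDs, the term labels (borrowed from the figure), the `min_support` threshold, and the function name are all assumptions for illustration.

```python
from collections import Counter
from itertools import product


def candidate_anchors(terms_a_by_sample, terms_b_by_sample, min_support=2):
    """Count (term_a, term_b) pairs that annotate the same sample; keep well-supported pairs."""
    counts = Counter()
    shared = sorted(terms_a_by_sample.keys() & terms_b_by_sample.keys())
    for sample in shared:
        pairs = product(sorted(terms_a_by_sample[sample]), sorted(terms_b_by_sample[sample]))
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy annotations: sample -> terms from ontology A and from ontology B.
terms_a = {"S1": {"Term-2"}, "S2": {"Term-5"}, "S3": {"Term-5"}, "S4": {"Term-2"}}
terms_b = {"S1": {"t1"}, "S2": {"t5"}, "S3": {"t5"}, "S5": {"t3"}}

strong = candidate_anchors(terms_a, terms_b, min_support=2)
all_pairs = candidate_anchors(terms_a, terms_b, min_support=1)
print(strong, all_pairs)
```

Pairs found this way could be handed to the structure-based step as extra anchors, alongside those found by string matching.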
Better Text-mapping
Better Alignment
[Table; cell values not preserved. Column headers: 2/177/. Rows: Distinct Terms; Terms with NCI match; Terms with SNOMEDCT match; Terms with any match; Terms with both match]
Summary Ability to map word-groups to ontology terms
Credits and acknowledgements
Pathology: Robert Marinelli, Matt van de Rijn
Medical Informatics: Kaustubh Supekar, Daniel Rubin, Mark Musen
Funding: NIH