Ontology-based Annotation & Query of TMA data
Nigam Shah, Stanford Medical Informatics



Tissue Microarrays

Stanford tissue microarray database

Key analysis issue
- Tissue microarrays query a large number of samples/patients for one protein.
- The key query dimension in TMA data is the tissue sample.
- Because no commonly used ontology describes the diagnosis [or annotations] of a given TMA sample in TMAD, such a query is not easy to perform.

Ontologies considered
- The NCI Thesaurus, version 05.09g
- SNOMED-CT, from UMLS 2005AA

Available annotations for a block
- Each donor block in the TMA has semi-structured text associated with it.

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Subclass 4
2334  Ovary        MMMT
3335  Prostate     Carcinoma  Adeno              intraductal
7022  Bladder      Carcinoma  Transitional cell  In situ
7288  Testis       teratoma   immature           Embryonal carcinoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid
6663  lung         Sarcoma    Leiomyo            epithelioid
4713  stomach      carcinoma  unknown

Map text to ontology terms
- Make all possible permutations
- Rules to weed out bad permutations
- Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms)
- Rules to weed out bad matches

Example: Prostate | Carcinoma | Adeno | intraductal gives 24 permutations:
  Prostate Carcinoma Adeno intraductal
  :
  Carcinoma Prostate intraductal Adeno
  :
  Adeno Carcinoma intraductal Prostate
  :
  Prostate intraductal Adeno Carcinoma
Matched term: Prostate_Ductal_Adenocarcinoma
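The permutation-and-match steps above can be sketched in a few lines of Python. This is a minimal illustration, not the rule set used for TMAD: the names `normalize`, `candidate_strings`, `exact_matches` and `TOY_LEXICON` are hypothetical, the lexicon holds only a handful of term names taken from the slides, and generating permutations of every subset (rather than only full-length ones) is an assumption made so that shorter terms such as Bladder_Carcinoma can match. The weeding-out rules and synonym lookup are not modelled.

```python
from itertools import permutations

def normalize(text):
    """Lowercase and treat underscores as spaces so labels and term-sets compare equal."""
    return " ".join(text.replace("_", " ").lower().split())

def candidate_strings(terms, min_len=1):
    """Yield every ordered arrangement of every subset of the annotation terms."""
    for r in range(min_len, len(terms) + 1):
        for perm in permutations(terms, r):
            yield normalize(" ".join(perm))

def exact_matches(terms, lexicon):
    """Return lexicon labels whose normalized name equals some candidate string."""
    index = {normalize(label): label for label in lexicon}
    found = {index[c] for c in candidate_strings(terms) if c in index}
    return sorted(found)

# Hypothetical toy lexicon; a real run would load NCI-T / SNOMED-CT names and synonyms.
TOY_LEXICON = [
    "Bladder_Carcinoma",
    "Transitional_Cell_Carcinoma",
    "Carcinoma_in_situ",
    "Prostate_Ductal_Adenocarcinoma",
]

# Four terms give 4! = 24 full-length permutations, as on the slide.
full_length = list(permutations(["Prostate", "Carcinoma", "Adeno", "intraductal"]))

bladder = exact_matches(["Bladder", "Carcinoma", "Transitional cell", "In situ"], TOY_LEXICON)
print(len(full_length), bladder)
```

With this toy lexicon the bladder record matches Bladder_Carcinoma, Carcinoma_in_situ and Transitional_Cell_Carcinoma. The prostate example does not match by plain string equality, which is why synonyms are part of the matching step.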

Sample matches (from NCI-T)

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Ontology Terms
2334  Ovary        MMMT                                                                Malignant_Mixed_Mesodermal_Mullerian_Tumor
3335  Prostate     Carcinoma  Adeno              intraductal                           Prostate_Ductal_Adenocarcinoma
7022  Bladder      Carcinoma  Transitional cell  In situ                               Stage_0_Transitional_Cell_Carcinoma
                                                                                       Transitional_Cell_Carcinoma
                                                                                       Bladder_Carcinoma
                                                                                       Carcinoma_in_situ
7288  Testis       teratoma   immature           Embryonal carcinoma                   Immature|Teratoma
                                                                                       Testicular_Embryonal_Carcinoma
                                                                                       Immature_Teratoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis  Hepatocellular_Carcinoma
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid                           Soft_Tissue_Sarcoma
                                                                                       Leiomyosarcoma
                                                                                       Epithelioid_Sarcoma
6663  lung         Sarcoma    Leiomyo            epithelioid                           Lung_Sarcoma
                                                                                       Leiomyosarcoma
                                                                                       Epithelioid_Sarcoma
4713  stomach      carcinoma  unknown                                                  Gastric_carcinoma

Results and validation
- Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets.
- 577 term-sets (6614 records) matched the NCI Thesaurus.
- 365 term-sets (3465 records) matched SNOMED-CT.
- In total, 6871 records (80% of the annotated records in TMAD; 641 distinct term-sets) mapped to one or more ontology terms.

Validation
             NCI                         SNOMED-CT
             Appropriate  Inappropriate  Appropriate  Inappropriate
Set
Set
Set
Total
Average (%)  43.0 (86%)   7.0 (14%)      40.66 (81%)  9.33 (19%)
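The coverage figures can be checked with a short calculation. The sketch below uses only the record counts quoted on the slide; the function name `coverage` is hypothetical, and the overlap estimate assumes that the 6871 figure is the union of the NCI and SNOMED-CT matches, which the slide implies but does not state.

```python
TOTAL_RECORDS = 8495  # records whose term-sets were mapped (from the slide)

matched_records = {
    "NCI Thesaurus": 6614,
    "SNOMED-CT": 3465,
    "either ontology": 6871,
}

def coverage(matched, total=TOTAL_RECORDS):
    """Percentage of records matched, rounded to one decimal place."""
    return round(100.0 * matched / total, 1)

for name, count in matched_records.items():
    print(f"{name}: {count}/{TOTAL_RECORDS} = {coverage(count)}%")

# Inclusion-exclusion, ASSUMING 6871 is the union of the two match sets.
matched_by_both = (
    matched_records["NCI Thesaurus"]
    + matched_records["SNOMED-CT"]
    - matched_records["either ontology"]
)
print("matched by both (estimate):", matched_by_both)
```

Under that assumption, roughly 3208 records would carry a term from both ontologies.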

Browsing interface

- Parent and sibling nodes with data (burlywood)
- Child nodes with data (yellow)
- Child nodes with no data (grey)

Click on the “anchor” link to get data

Updates since February

How does ontology-based annotation help?
- Better search: we can retrieve samples of all the retroperitoneal tumors or malignant uterine neoplasms, for example.
- Better integration of data: we can correlate gene expression with protein expression across multiple tumor types.
  - Tissue microarray data from TMAD
  - Gene expression data from GEO
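The "better search" point amounts to expanding a query term to all of its descendants in the ontology before looking up samples. A minimal sketch follows, under loud assumptions: the is-a edges in `IS_A` are illustrative and not copied from NCI-T, the annotations are a subset of the sample-matches table, and the names `descendants` and `samples_for` are hypothetical.

```python
from collections import defaultdict

# Toy is-a edges (child -> parent); illustrative only, NOT taken from NCI-T.
IS_A = {
    "Soft_Tissue_Sarcoma": "Sarcoma",
    "Lung_Sarcoma": "Sarcoma",
    "Leiomyosarcoma": "Sarcoma",
    "Epithelioid_Sarcoma": "Sarcoma",
    "Bladder_Carcinoma": "Carcinoma",
    "Gastric_carcinoma": "Carcinoma",
}

# Sample ID -> ontology terms (a subset of the sample-matches table).
ANNOTATIONS = {
    6662: {"Soft_Tissue_Sarcoma", "Leiomyosarcoma", "Epithelioid_Sarcoma"},
    6663: {"Lung_Sarcoma", "Leiomyosarcoma", "Epithelioid_Sarcoma"},
    7022: {"Bladder_Carcinoma"},
    4713: {"Gastric_carcinoma"},
}

def descendants(term, is_a=IS_A):
    """Collect every term below `term` by walking the child lists depth-first."""
    children = defaultdict(set)
    for child, parent in is_a.items():
        children[parent].add(child)
    seen, stack = set(), [term]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def samples_for(term):
    """Samples annotated with the term itself or with any of its descendants."""
    wanted = descendants(term) | {term}
    return sorted(sid for sid, terms in ANNOTATIONS.items() if terms & wanted)

print(samples_for("Sarcoma"))    # samples under the general term
print(samples_for("Carcinoma"))
```

A query on the general term returns samples that were only ever annotated with its more specific children, which a plain text search on the original annotations would miss.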

Integrating mRNA and protein expression
[Figure: a Proteins by Samples matrix alongside a Genes by Samples matrix]

Partial alignment of NCI-T and SNOMED-CT as a “bonus”

Steps in Alignment
- Anchor identification
  - Identify similar class labels in the ontologies to be aligned
  - Usually done by string matching
- Ontology structure
  - Use the “similar” classes as anchors and examine the local [graph] structure around them to inform the “similarity” metric

[Figure: two ontology graphs, one with nodes Root and Term-1 to Term-5, the other with nodes R and t1 to t7]
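The two alignment steps can be sketched as follows. This is an illustration under stated assumptions rather than the method used in the talk: both toy ontologies and all their labels are hypothetical, anchors are found by exact equality of normalized labels, and the structural score is a simple Jaccard overlap of immediate neighbours, chosen here only because it is easy to follow.

```python
def normalize(label):
    """Lowercase and unify separators so label variants compare equal."""
    return " ".join(label.replace("_", " ").replace("-", " ").lower().split())

def all_labels(parent_of):
    """Every class mentioned in a child -> parent map."""
    return set(parent_of) | set(parent_of.values())

def find_anchors(labels_a, labels_b):
    """Step 1: pair classes whose normalized labels are identical."""
    index_b = {normalize(b): b for b in labels_b}
    return {a: index_b[normalize(a)] for a in labels_a if normalize(a) in index_b}

def neighbours(term, parent_of):
    """Parent and direct children of a term."""
    near = {c for c, p in parent_of.items() if p == term}
    if term in parent_of:
        near.add(parent_of[term])
    return near

def structural_similarity(a, b, parents_a, parents_b, anchors):
    """Step 2: Jaccard overlap of a's anchored neighbours with b's neighbours."""
    near_a = {anchors[n] for n in neighbours(a, parents_a) if n in anchors}
    near_b = neighbours(b, parents_b)
    union = near_a | near_b
    return len(near_a & near_b) / len(union) if union else 0.0

# Hypothetical toy ontologies (child -> parent).
PARENTS_A = {"Carcinoma": "Neoplasm", "Sarcoma": "Neoplasm", "Gastric_Carcinoma": "Carcinoma"}
PARENTS_B = {"carcinoma": "neoplasm", "sarcoma": "neoplasm", "Carcinoma of stomach": "carcinoma"}

anchors = find_anchors(all_labels(PARENTS_A), all_labels(PARENTS_B))
score = structural_similarity("Gastric_Carcinoma", "Carcinoma of stomach",
                              PARENTS_A, PARENTS_B, anchors)
print(anchors, score)
```

Here the two stomach-carcinoma classes share no matching label, but they sit under anchored parents, so the structural step scores them as similar while scoring an unrelated pair at zero.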

We might improve alignment …
- Ontology [graph] structure based step
- Provide anchors from annotated data

[Figure: the two ontology graphs (Root, Term-1 to Term-5; R, t1 to t7) with anchor pairs Term-2/t1 and Term-5/t5; sample S2 is linked to both t5 and Term-5]
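One way to read "anchors from annotated data" is that a sample annotated with a term from each ontology suggests those two terms as a candidate anchor pair. The sketch below counts such co-annotations. It is an assumption-laden illustration: `data_anchors` and `min_support` are hypothetical names, and apart from sample S2 with Term-5 and t5, which appear in the figure, the samples and their annotations are invented for the example.

```python
from collections import Counter
from itertools import product

def data_anchors(annotations_a, annotations_b, min_support=1):
    """Count (term_a, term_b) pairs that annotate the same sample in both ontologies."""
    counts = Counter()
    for sample, terms_a in annotations_a.items():
        for pair in product(terms_a, annotations_b.get(sample, ())):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Sample -> terms from the first ontology (S1 and S3 are hypothetical).
ANNOTATIONS_A = {"S1": {"Term-2"}, "S2": {"Term-5"}, "S3": {"Term-5"}}
# Sample -> terms from the second ontology (S1 and S3 are hypothetical).
ANNOTATIONS_B = {"S1": {"t1"}, "S2": {"t5"}, "S3": {"t5"}}

candidates = data_anchors(ANNOTATIONS_A, ANNOTATIONS_B)
strong = data_anchors(ANNOTATIONS_A, ANNOTATIONS_B, min_support=2)
print(candidates, strong)
```

Raising `min_support` keeps only pairs backed by several samples, a simple guard against anchors that come from a single noisy annotation.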

Better text-mapping → better alignment

2/177/
Distinct Terms
Terms with NCI match
Terms with SNOMEDCT match
Terms with any match
Terms with both match

Summary
- Ability to map word-groups to ontology terms

Credits and acknowledgements
- Pathology
  - Robert Marinelli
  - Matt van de Rijn
- Medical Informatics
  - Kaustubh Supekar
  - Daniel Rubin
  - Mark Musen
- Funding
  - NIH