1 A text-mining analysis of the human phenome Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2 European Journal.

Presentation transcript:

1 A text-mining analysis of the human phenome. Marc A van Driel 1, Jorn Bruggeman 2, Gert Vriend 1, Han G Brunner *,3 and Jack AM Leunissen 2. European Journal of Human Genetics (2006) 14. 1 Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegen, the Netherlands; 2 Department of Bioinformatics, Wageningen University and Research Centre; 3 Department of Human Genetics, University Medical Centre Nijmegen. Speaker: Yu-Ching Fang. Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen

2 Outline: Introduction, Methods, Results, Discussion

3 Introduction Functional annotation of genes is an important challenge once the sequence of a genome has been completed. Previous studies have correlated various attributes of human genes with the chance of causing a disease.

4 Introduction (cont.) However, few attempts have been made to systematically classify relationships between genes and proteins at the phenotype level.

5 Introduction (cont.) The Online Mendelian Inheritance in Man (OMIM) database contains human disease phenotype data and record-based textual information, one gene or one genetic disorder per record. Goal: Systematic grouping of genes by their associated phenotypes from the OMIM database.

6 Methods – The OMIM database Full text (TX) field: 5132 disease records out of 16357 OMIM records

7 Methods – The OMIM database (cont.) Clinical synopsis (CS) field

8 Creation of ‘feature vectors’ MeSH terms and their components are concepts. MeSH concepts serve as phenotype features characterizing OMIM records. Ex: OMIM_1->[MeSH_1,MeSH_2,…]
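A minimal sketch of how such a feature vector could be built, assuming simple whole-word matching of vocabulary terms against a record's text. The four-term vocabulary and the example sentence are hypothetical; the actual analysis used the full MeSH thesaurus, and its matching rules are not described on the slide.

```python
from collections import Counter
import re

# Hypothetical mini-vocabulary standing in for the MeSH thesaurus.
MESH_TERMS = ["eye", "retina", "photoreceptors", "brain"]

def feature_vector(record_text, vocabulary=MESH_TERMS):
    """Count how often each vocabulary term occurs in one record's text."""
    text = record_text.lower()
    counts = Counter()
    for term in vocabulary:
        # Whole-word, case-insensitive match (an assumption; real MeSH
        # matching would also need synonyms and plural forms).
        n = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        if n:
            counts[term] = n
    return counts

# Hypothetical record text, not taken from OMIM.
omim_1 = feature_vector("Degeneration of the retina; the retina and brain are affected.")
```

Here `omim_1` maps each matched concept to its count, which corresponds to the OMIM_1->[MeSH_1,MeSH_2,…] notation on the slide.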

9 Refinement of the feature vectors MeSH concepts can be very broad, like ‘Eye’, or more specific, like ‘Retina’. The MeSH concept hierarchy describes relationships such as ‘Eye’-‘Retina’-‘Photoreceptors’. ‘Retina’ is a hyponym of ‘Eye’.

10 Refinement of the feature vectors (cont.) To ensure that the concepts ‘Eye’ and ‘Retina’ are recognized as similar, the MeSH hierarchy was used to encode this similarity in the feature vectors by increasing the value of all hypernyms. r_c: relevance of concept c; r_c,counted: count of concept c in a document; r_hypo: relevance of a hyponym of concept c; n_hypo,c: the number of hyponyms of concept c.

11 Refinement of the feature vectors (cont.) Example of concept expansion using the MeSH hierarchical structure.
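The expansion formula itself did not survive in the transcript. One form consistent with the listed symbols is that a concept's relevance equals its own count plus the average relevance of its direct hyponyms. The sketch below assumes that form; the toy hierarchy and counts are hypothetical.

```python
# Hypothetical toy hierarchy: concept -> direct hyponyms (more specific concepts).
HYPONYMS = {
    "eye": ["retina", "lens"],
    "retina": ["photoreceptors"],
    "lens": [],
    "photoreceptors": [],
}

def relevance(concept, counted, hyponyms=HYPONYMS):
    """Assumed form: own count + (sum of hyponym relevances) / (number of hyponyms)."""
    children = hyponyms.get(concept, [])
    own = counted.get(concept, 0)
    if not children:
        return float(own)
    child_sum = sum(relevance(child, counted, hyponyms) for child in children)
    return own + child_sum / len(children)

# Raw counts for one hypothetical record: 'eye' itself is never mentioned.
counted = {"retina": 2, "photoreceptors": 1}
expanded = {c: relevance(c, counted) for c in HYPONYMS}
```

With these toy counts, 'eye' receives a nonzero value even though it never occurs in the record, so a record mentioning only 'retina' and one mentioning only 'eye' end up sharing a feature.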

12 Refinement of the feature vectors (cont.) Not all concepts in the OMIM records are equally informative. Example: ‘retina pigment epithelium’ occurs rarely, and thus provides more specific information than very frequent terms such as ‘Brain’. This is corrected with an inverse document frequency measure. gw_c: inverse document frequency, or global weight, of concept c; N: the total number of records (5080); n_c: the number of records that contain concept c.
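The slide's formula is not in the transcript. A common definition of inverse document frequency is the logarithm of the total number of records divided by the number of records containing the concept. The sketch below assumes that definition with a base-2 logarithm (the base is an assumption); the four toy records are hypothetical.

```python
import math

def global_weights(records):
    """Assumed form: gw_c = log2(N / n_c) for every concept seen in the records."""
    n_records = len(records)
    doc_freq = {}
    for vec in records:
        for concept in vec:
            # Each record counts once per concept, however often it mentions it.
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
    return {c: math.log2(n_records / n) for c, n in doc_freq.items()}

# Hypothetical feature vectors (concept -> count) for four records.
records = [
    {"brain": 3, "retina": 1},
    {"brain": 1},
    {"brain": 2},
    {"brain": 1, "retina pigment epithelium": 1},
]
gw = global_weights(records)
```

A concept present in every record, such as 'brain' here, gets weight 0, while rare concepts get the highest weights.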

13 Refinement of the feature vectors (cont.) Not all OMIM records contain equally extensive descriptions (record size differences). These differences make a comparison between records difficult, because the diversity and the frequency of concepts in the larger records will exceed those in the smaller records. r_c: relevance of concept c; r_mf: the frequency of the most frequently occurring MeSH concept in that record.
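The correction formula is missing from the transcript, so the sketch below shows two plausible options rather than the authors' exact choice: plain division by r_mf, and the augmented variant 0.5 + 0.5 * r_c / r_mf that is widely used in information retrieval. Both function names and the example vector are hypothetical.

```python
def normalize_by_max(vec):
    """Scale every concept value by the largest value in the record (r_c / r_mf)."""
    if not vec:
        return {}
    r_mf = max(vec.values())
    return {c: r / r_mf for c, r in vec.items()}

def augmented_normalize(vec):
    """Augmented variant: 0.5 + 0.5 * r_c / r_mf, which keeps values in [0.5, 1]."""
    if not vec:
        return {}
    r_mf = max(vec.values())
    return {c: 0.5 + 0.5 * r / r_mf for c, r in vec.items()}

# Hypothetical record in which 'brain' is the most frequent concept.
record = {"brain": 4, "retina": 1}
plain = normalize_by_max(record)
augmented = augmented_normalize(record)
```

Either way, the most frequent concept in a record maps to 1, so long and short records become comparable.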

14 Comparing OMIM records The similarity between OMIM records can be quantified by comparing the expanded and corrected feature vectors. Similarities between feature vectors were determined by the cosine of the angle between them. s(X,Y): the similarity between the feature vectors X and Y; x_i, y_i: concept frequencies.
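The cosine of the angle between two vectors is the standard quantity s(X,Y) = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2))). A minimal sketch over sparse vectors stored as dictionaries follows; the dictionary representation and the example vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors given as {concept: value}."""
    # Only concepts present in both vectors contribute to the dot product.
    dot = sum(value * y.get(concept, 0.0) for concept, value in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_x * norm_y)

# Hypothetical feature vectors.
same = cosine_similarity({"eye": 3, "retina": 4}, {"eye": 3, "retina": 4})
disjoint = cosine_similarity({"eye": 1}, {"brain": 1})
partial = cosine_similarity({"eye": 1}, {"eye": 1, "retina": 1})
```

Because all feature values are non-negative, the score ranges from 0 (no shared concepts) to 1 (identical direction).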

15 Results – Comparing OMIM records 5080 of the 5132 OMIM records matched one or more MeSH terms. The 5080x5080 pair-wise feature vector similarities form the phenomap (all-against-all similarities). Most phenotype-phenotype pairs have a low similarity score.

16 Comparing OMIM records - The best scores for all phenotypes in the disease phenotype data set For each OMIM record, the most similar of the other 5079 records was identified. Moderately similar phenotype pairs might still yield reasonable hypotheses. Example: ‘Fibromuscular Dysplasia of Arteries’ and ‘Cardiomyopathy, Familial Hypertrophic’ have a similarity score of 0.31.
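Finding the best-scoring partner for each record amounts to taking, in every row of the similarity matrix, the maximum over all columns except the record itself. A minimal sketch under that reading follows; the 3x3 matrix holds made-up values, not results from the study.

```python
def best_matches(sim):
    """For each row i of a square similarity matrix, return (j, score) of the
    most similar other record (j != i)."""
    best = []
    for i, row in enumerate(sim):
        # Skip the diagonal, which is each record's similarity with itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        score, j = max(candidates)
        best.append((j, score))
    return best

# Hypothetical symmetric similarity matrix for three records.
sim = [
    [1.00, 0.31, 0.05],
    [0.31, 1.00, 0.12],
    [0.05, 0.12, 1.00],
]
matches = best_matches(sim)
```

Note that the relation is not symmetric: in this toy matrix record 2's best match is record 1, while record 1's best match is record 0.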

17 Comparing OMIM records (cont.) Conclusion: The more two phenotypes resemble each other, the more likely their underlying genes are to share an interaction.

18 Discussion Developed a text-mining approach to map relationships between more than 5000 human genetic disease phenotypes from the OMIM database. Phenotype clustering reflects the modular nature of human disease genetics. Thus, the phenomap may be used to predict candidate genes for diseases.