Discovering Gene-Disease Association using On-line Scientific Text Abstracts. Raj Adhikari Advisor: Javed Mostafa.


1 Discovering Gene-Disease Association using On-line Scientific Text Abstracts. Raj Adhikari Advisor: Javed Mostafa

2 Motivation (18 October 2015, Bioinformatics capstone project)
A central problem in bioinformatics is how to capture information from the vast scientific literature and build an automated system for “knowledge discovery” that can be used in various areas. I address the special case of gene-disease interactions and show that the frequency and relevance of words in PubMed abstracts can be used to find genes related to a disease.

3 Goal
Use a combination of statistical methods and a database to:
Retrieve research abstracts from PubMed.
Extract relevant information from the free text using statistical methods.
Measure the accuracy of the results and display them through a Web-based system.
Complement and support existing knowledge-base systems such as GeneCards.

4 Resources used in creating database
PubMed: the US National Library of Medicine's database, which contains more than 11 million references to journal articles in the health sciences. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
GeneCards: a database of human genes, their products, and their involvement in diseases. http://bioinfo.weizmann.ac.il/cards/index.shtml
HGNC: the HUGO Gene Nomenclature Committee, which has approved over 19,000 human gene symbols, consistent with OMIM and LocusLink. http://www.gene.ucl.ac.uk/nomenclature
Tools used: Perl, CGI, Java, MySQL

5 Creating the database
Data I used:
A relatively small list of human genes and diseases.
An article set (around 8,000 articles). For each PubMed article: PMID, article title, and abstract (filtered with a list of stop words).
The HUGO dataset.
A list of around 3,500 related gene-disease pairs from GeneCards.

6 Populating the database tables
Use the book Genes and Disease at OMIM to generate a list of around 60 diseases and 90 genes.
Search PubMed for each gene-disease pair in the Title/Abstract field.
Use ESearch (a tool that provides access to the PubMed database outside of the web interface) to retrieve data in XML format.
Use the XML::Simple Perl package to parse the XML file.
Filter the text using stop words and store each title and abstract, along with the related PMID, in a database table.
Add more genes using HUGO.
OMIM: a database of genetic diseases with references to molecular medicine, cell biology, biochemistry, and clinical details of the diseases.
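The retrieval and filtering steps on this slide can be sketched as follows. This is an illustration in Python rather than the author's Perl; the endpoint is the current public E-utilities address (an assumption, since the slides predate it), the function names are hypothetical, and the stop-word list is a tiny stand-in for the author's list, which is not given. The sketch only builds the query URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: current public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumption: a tiny illustrative stop-word list, not the author's list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def build_esearch_url(gene, disease, retmax=100):
    """Build an ESearch URL for one gene-disease pair, limited to Title/Abstract."""
    term = f"{gene}[Title/Abstract] AND {disease}[Title/Abstract]"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def filter_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


url = build_esearch_url("BRCA2", "pancreatic cancer")
tokens = filter_stop_words("Mutations in the BRCA2 gene and risk of pancreatic cancer")
```

The filtered tokens, together with the PMID, are what would be stored in the database table described above.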

7 Populating the database tables
Parse the retrieved text files and create the following tables.
HUGO table structure: HGNC | genesymbol | alias
GeneCards table structure: Genesymbol | disease
Derivative table structure: Term | PMID | Tfreq | Dfreq | Tfidf | LSI

8 Generating term weights
Basic idea: compare the co-occurrence of terms within a document and across a set of documents by generating term weights.
Within a document: term frequency (tf) measures the density of a term within a document.
Across the document set: inverse document frequency (idf) measures the “informativeness” of a term across the dataset.
Thus the weight of a term in a document combines the two: tfidf = tf × idf.
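A minimal sketch of the term weighting, assuming raw counts for tf and log(N / df) for idf. The exact variant used in the project is not stated in the transcript, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy documents (assumed, already stop-word filtered).
docs = [
    ["brca2", "pancreatic", "cancer"],
    ["brca2", "breast", "cancer"],
    ["psen1", "alzheimer"],
]
weights = tfidf(docs)
```

With this weighting, a term that appears in fewer documents ("pancreatic") scores higher than one that appears in many ("brca2", "cancer").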

9 Latent Semantic Indexing
Calculating the co-occurrence of terms might not suffice because of possible “noise” in the dataset. Use LSI, a statistical technique, to estimate a latent structure: assume there is some underlying semantic structure in the dataset that is partially obscured.
Implementation:
Build a term-by-document matrix (which tends to be sparse).
Convert the matrix entries to weights, e.g. tfidf.
Analyze the matrix by singular value decomposition (SVD) to derive a model of the latent semantic structure.

10 SVD
A unique mathematical decomposition of a matrix into the product of three matrices: two with orthonormal columns and one with the singular values on the diagonal.
Finds the optimal projection into a low-dimensional space.
A tool for dimension reduction.

11 SVD
Singular value decomposition: $A = U \Sigma V^{T}$, where:
$U$ has orthonormal, unit-length columns: $U^{T} U = I$.
$\Sigma$ is a diagonal matrix of non-negative real numbers (the singular values).
$V$ has orthonormal, unit-length columns: $V^{T} V = I$.

12 SVD
Approximate $A$ by $A_k$, keeping only the first $k$ singular values and the corresponding columns of the $U$ and $V$ matrices.
The new matrix $A_k$ does not exactly match the original term-by-document matrix $A$ (it gets closer and closer as more singular values are kept).
This is what we want: we do not want a perfect fit, since we think some of the 0's in $A$ should not be 0, and vice versa.
Limitations of SVD: very memory intensive; cannot handle large datasets.
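The rank-k approximation can be sketched with NumPy as below. The 4 x 3 matrix is an assumed toy term-by-document matrix, and `rank_k_approx` is a hypothetical helper name; the project's own implementation is not described in the slides.

```python
import numpy as np


def rank_k_approx(A, k):
    """Return A_k, built from the k largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Assumed toy term-by-document matrix (rows: terms, columns: documents).
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Frobenius-norm error of the approximation for k = 1, 2, 3.
errors = [np.linalg.norm(A - rank_k_approx(A, k)) for k in (1, 2, 3)]
```

The error shrinks as k grows and vanishes (up to rounding) once all singular values are kept; for small k, entries that were 0 in A generally become non-zero in A_k, which is the smoothing effect LSI relies on.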

13 Scoring Matrix Generation
A scoring matrix is generated for each term-weighting method using the data stored in the database. This matrix is used to find the relationships between genes and diseases. The process is relatively fast, since the weights are pre-computed and stored in the database.

14 Finding relationships
Use the doc-term matrix to establish relationships between genes and diseases.
Doc-term matrix:
     T1  T2  T3  …  Tn
D1   1   1
D2   1   1
…    1   0
Dn   1   0
Term-term matrix:
     T1  T2  T3  …  Tn
T1       2
T2
…
Tn

15 Results

16 Verification of the relationship
Data from GeneCards and HUGO has been stored in a database. For each gene:
If the symbol is an official gene symbol (according to HUGO), search for it in GeneCards and display the diseases associated with it.
Otherwise (the symbol is an alias), use HUGO to find the official gene symbol, search GeneCards for that symbol, and display the diseases associated with the gene.
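The lookup logic on this slide could look roughly like the sketch below. The dictionaries are hypothetical in-memory stand-ins for the HUGO and GeneCards database tables, filled with a few illustrative rows, and `diseases_for` is an assumed helper name.

```python
# Hypothetical stand-ins for the HUGO and GeneCards tables (toy rows only).
OFFICIAL_SYMBOLS = {"PSEN1", "BRCA2", "FANCD2"}
ALIAS_TO_OFFICIAL = {"PS1": "PSEN1", "S182": "PSEN1", "FAD1": "BRCA2"}
GENECARDS = {
    "PSEN1": ["Alzheimer disease"],
    "BRCA2": ["breast cancer"],
}


def diseases_for(symbol):
    """Resolve a symbol to its official form, then look up its diseases."""
    if symbol in OFFICIAL_SYMBOLS:
        official = symbol
    else:
        # Not official: treat it as an alias and ask HUGO for the official symbol.
        official = ALIAS_TO_OFFICIAL.get(symbol)
    if official is None:
        return None, []
    return official, GENECARDS.get(official, [])


result_alias = diseases_for("PS1")
result_official = diseases_for("BRCA2")
result_unknown = diseases_for("XYZ")
```

A symbol that is neither official nor a known alias yields no diseases rather than an error.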

17 Verification results

18 Using gene alias
Make use of gene aliases from HUGO to increase the chances of detecting the correct genes for a given disease.
Method: increment the weight of an official gene symbol by adding the weights of its aliases, grouping each alias together with the official gene.
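A minimal sketch of the alias grouping, assuming per-symbol weights have already been computed for a disease. The weights and the helper name `merge_alias_weights` are hypothetical.

```python
from collections import defaultdict


def merge_alias_weights(weights, alias_to_official):
    """Add each alias's weight to the weight of its official gene symbol."""
    merged = defaultdict(float)
    for symbol, weight in weights.items():
        # Symbols with no alias entry are assumed to be official already.
        merged[alias_to_official.get(symbol, symbol)] += weight
    return dict(merged)


# Hypothetical weights for one disease; FAD1 is listed as an alias of BRCA2.
weights = {"BRCA2": 3.0, "FAD1": 1.5, "PSEN1": 2.0}
merged = merge_alias_weights(weights, {"FAD1": "BRCA2"})
```

Here BRCA2 ends up with the combined weight 4.5, while PSEN1 is unchanged.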

19 Results for Pancreatic Cancer
Top five genes, without considering aliases.
Top five genes, considering aliases.

20 Using gene alias - problems
Problem: HUGO might list the same alias under multiple official gene symbols. Such an alias could increase the weight of a gene that is not related to the disease.
Example (the alias FAD appears under three genes):
HGNC   genesymbol   alias
3585   FANCD2       FAD, FA-D2
1101   BRCA2        FAD, FAD1
9508   PSEN1        FAD, S182, PS1
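One way to spot such aliases before they distort the weights is to invert the alias table and flag any alias owned by more than one official symbol. The sketch below uses the three rows from this slide; the function name is hypothetical.

```python
from collections import defaultdict

# The three example rows from the slide: (HGNC id, official symbol, aliases).
HUGO_ROWS = [
    (3585, "FANCD2", ["FAD", "FA-D2"]),
    (1101, "BRCA2", ["FAD", "FAD1"]),
    (9508, "PSEN1", ["FAD", "S182", "PS1"]),
]


def ambiguous_aliases(rows):
    """Return {alias: sorted official symbols} for aliases shared by several genes."""
    owners = defaultdict(set)
    for _hgnc_id, symbol, aliases in rows:
        for alias in aliases:
            owners[alias].add(symbol)
    return {alias: sorted(symbols) for alias, symbols in owners.items() if len(symbols) > 1}


ambiguous = ambiguous_aliases(HUGO_ROWS)
```

Flagged aliases could then be excluded from the weight merging or handled separately; which policy fits best is left open here.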

21 Problem using alias

22 Verification
In addition, the number of PubMed articles containing both a disease and a gene symbol can indicate how strong the association between the disease and the gene is. The same reasoning applies to gene-gene relationships.

23 Gene-Gene Relationships
In addition, we can use the doc-term matrix to find genes that are related to any given gene. Using the matrices below, we see that g2 is related to g3 with a weight of 2.
Doc-gene matrix:
     g1  g2  g3  …  gn
D1   1   1   1
D2   1   1   1
…    1   0   1
Dn   1   0   0
Gene-gene matrix:
     g1  g2  g3  …  gn
g1
g2           2
…
gn
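The gene-gene weights can be read as co-occurrence counts over documents. The sketch below assumes the slide's doc-gene matrix has the rows (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 0) for the columns g1, g2, g3, which is one reading of the flattened table; the function name is hypothetical.

```python
# Assumed reading of the slide's doc-gene matrix (columns: g1, g2, g3).
doc_gene = [
    [1, 1, 1],  # D1
    [1, 1, 1],  # D2
    [1, 0, 1],  # ...
    [1, 0, 0],  # Dn
]


def cooccurrence(matrix):
    """Count, for each pair of columns, the documents in which both are present."""
    n = len(matrix[0])
    return [
        [sum(row[i] * row[j] for row in matrix) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]


gene_gene = cooccurrence(doc_gene)
```

Under that reading, g2 and g3 appear together in two documents, matching the weight of 2 stated on the slide.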

24 Discovering additional gene-gene relationships
Two genes might be related to each other via a disease, as in:
gene1 -> disease1 -> gene2
gene1 -> disease2 -> gene2
We can use such paths to establish a relationship between gene1 and gene2. In our case, the fact that gene1 and gene2 are related via two different diseases makes the relationship between them even stronger.
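A small sketch of this idea, assuming gene-disease associations are available as a mapping from each gene to its set of diseases. The strength of a gene-gene link is taken here to be the number of shared diseases, which is an assumption; the slides do not give a scoring rule.

```python
from collections import Counter
from itertools import combinations


def shared_disease_links(gene_to_diseases):
    """Score each gene pair by the number of diseases both genes are linked to."""
    links = Counter()
    for g1, g2 in combinations(sorted(gene_to_diseases), 2):
        shared = gene_to_diseases[g1] & gene_to_diseases[g2]
        if shared:
            links[(g1, g2)] = len(shared)
    return links


# Placeholder associations mirroring the slide's gene1/gene2 example.
associations = {
    "gene1": {"disease1", "disease2"},
    "gene2": {"disease1", "disease2"},
    "gene3": {"disease2"},
}
links = shared_disease_links(associations)
```

gene1 and gene2 share two diseases and therefore get a stronger link than either has with gene3.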

25 Architecture

26 System Demonstration
http://biokdd.informatics.indiana.edu/radhikar/search.html
Related URLs:
GeneCards: http://bioinfo.weizmann.ac.il/cards/index.shtml
HGNC: http://www.gene.ucl.ac.uk/nomenclature/

27 Summary
By combining statistical methods with a database, the process of establishing gene-disease relationships from literature data is fast and efficient. With minimal changes, our system can be extended to discover other relationships, such as protein-protein interactions.

28 Future Work
Extend our system to incorporate the entire Medline dataset.
Incorporate full gene names.
Find a better way to verify the gene-gene relationships.
Incorporate other on-line scientific literature databases.

29 Acknowledgments
Professor Javed Mostafa
Professor Sun Kim
Professor Memo Dalkilic
Professor Haixu Tang

