Discovering Gene-Disease Association using On-line Scientific Text Abstracts. Raj Adhikari Advisor: Javed Mostafa.



18 October 2015 Bioinformatics capstone project
Motivation
A central problem in bioinformatics is how to capture information from the vast scientific literature and build an automated system for “knowledge discovery” that can be used in various areas. I address the special case of gene-disease interactions and show that the frequency and relevance of words in PubMed abstracts can be used to find genes related to a disease.

Goal
Use a combination of statistical methods and a database to:
Retrieve research abstracts from PubMed.
Extract relevant information from the free text using statistical methods.
Measure the accuracy of the results and display them using a Web-based system.
Complement and support existing knowledge base systems like GeneCards.

Resources used in creating the database
PubMed: the US National Library of Medicine's database, which contains more than 11 million references to journal articles in the health sciences.
GeneCards: a database of human genes, their products and their involvement in diseases.
HGNC: HUGO Gene Nomenclature Committee (approved over human gene symbols); consistent with OMIM and LocusLink.
Tools used: Perl, CGI, Java, MySQL.

Creating the database
Data I used:
A relatively small list of genes and diseases in humans.
An article set (around 8000 articles). For each PubMed article: PMID, article title, and abstract (filtered with a list of stop words).
The HUGO dataset.
A list of around 3500 related gene-disease pairs from GeneCards.

Populating the database tables
Use the book Genes and Disease at OMIM to generate a list of around 60 diseases and 90 genes.
Search PubMed for each gene-disease pair on the Title/Abstract field.
Use ESearch (a tool that provides access to the PubMed database outside of the web interface) to retrieve data in XML format.
Use the XML::Simple Perl package to parse the XML file.
Filter the text using stop words and store each title and abstract, along with the related PMID, in a database table.
Add more genes using HUGO.
OMIM: a database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.
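The retrieval steps above can be sketched in Python (the project itself used Perl with XML::Simple, so this is only an illustration). The sketch assumes the public NCBI E-utilities ESearch endpoint and its `db`, `term` and `retmax` parameters. The helper names and the tiny stop-word list are hypothetical. ESearch returns the matching PMIDs; fetching the title and abstract for each PMID would need a further request (for example EFetch), which is not shown here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public E-utilities ESearch endpoint (assumed here; the slides do not give the URL).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Illustrative stop-word list only; the list used in the project is not shown in the slides.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "with"}


def build_esearch_url(gene, disease, retmax=100):
    """Build an ESearch query for one gene-disease pair on the Title/Abstract field."""
    term = f'"{gene}"[Title/Abstract] AND "{disease}"[Title/Abstract]'
    query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return ESEARCH_URL + "?" + query


def parse_pmids(xml_text):
    """Extract the PMIDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def filter_stop_words(text):
    """Drop stop words, keeping the original casing of the remaining tokens."""
    return [word for word in text.split() if word.lower() not in STOP_WORDS]
```

The URL builder and the parser are kept separate so the parsing step can be tried on a saved XML response without any network access.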

Populating the database tables
Parse the retrieved text files and create the following tables.
HUGO table structure: HGNC | genesymbol | alias
GeneCards table structure: Genesymbol | disease
Derivative table structure: Term | PMID | Tfreq | Dfreq | Tfidf | LSI

Generating term weights
Basic idea: compare the co-occurrence of terms within a document and across a set of documents by generating term weights.
Within a document: term frequency (tf) measures term density within a document.
Across the document set: inverse document frequency (idf) measures the “informativeness” of a term across a dataset.
Thus the two are combined into a single tf-idf weight for each term in each document.
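As a rough illustration of the weighting idea, the sketch below computes one common tf-idf variant: term count divided by document length, multiplied by the natural log of (number of documents / number of documents containing the term). The exact formula used in the project is not given in this transcript, so this variant and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf weights.

    docs maps a PMID to its list of (stop-word-filtered) tokens.
    Returns a dict mapping (term, pmid) to a weight.
    Assumed variant: tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    weights = {}
    for pmid, tokens in docs.items():
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            weights[(term, pmid)] = tf * idf
    return weights
```

With this variant, a term that occurs in every document gets weight 0, which matches the intuition that such a term carries no information about any particular document.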

Latent Semantic Indexing
Calculating the co-occurrence of terms might not suffice because of possible “noise” in the dataset.
Use LSI, a statistical technique, to estimate a latent structure.
Assume some underlying semantic structure in the dataset, which could be partially obscured.
Implementation:
Build a term-by-document matrix (it tends to be sparse).
Convert the matrix entries to weights, e.g. tf-idf.
Analyze the matrix by singular value decomposition (SVD) to derive a latent semantic structure model.

SVD
A mathematical decomposition (unique in its singular values) of a matrix into the product of three matrices: two with orthonormal columns, and one with the singular values on the diagonal.
Finds the optimal projection into a low-dimensional space.
A tool for dimension reduction.

SVD
Singular Value Decomposition: $A = U E V^{T}$, where:
$U$ has orthonormal, unit-length columns: $U^{T} U = I$
$E$ is a diagonal matrix of positive real numbers (the singular values)
$V$ has orthonormal, unit-length columns: $V^{T} V = I$

SVD
Approximate $A$ by $A_k$, keeping only the first $k$ singular values and the corresponding columns of the $U$ and $V$ matrices.
The new matrix $A_k$ does not exactly match the original term-by-document matrix $A$ (it gets closer and closer as more singular values are kept).
This is what we want: we don't want a perfect fit, since we think some of the 0's in $A$ should not be 0 and vice versa.
Limitations of SVD: very memory intensive; cannot handle large datasets.
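The truncation described above can be written out explicitly. The block below uses the slides' symbol $E$ for the diagonal matrix; the symbols $\sigma_i$, $u_i$, $v_i$ and $r$ (the rank of $A$) are introduced here for illustration. The error formula is the standard Frobenius-norm result for a rank-$k$ truncation, followed by a small worked example.

```latex
\begin{align*}
A   &= U E V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
       \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 \\
A_k &= U_k E_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
       \qquad k \le r \\
\lVert A - A_k \rVert_F &= \sqrt{\sum_{i=k+1}^{r} \sigma_i^{2}}
\end{align*}

% Worked example with k = 1:
\[
A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \quad
A_1 = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}, \quad
\lVert A - A_1 \rVert_F = \sqrt{1^{2}} = 1
\]
```

Keeping more singular values shrinks the sum under the square root, which is why $A_k$ approaches $A$ as $k$ grows.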

Scoring Matrix Generation
A scoring matrix is generated for each term-weighting method using the data stored in the database.
This matrix is used to find the relationships between genes and diseases.
The process is relatively fast, since the weights are pre-computed and stored in the database.

Finding relationships
Doc-term matrix:
      T1  T2  T3  …  Tn
D1     1   1
D2     1   1
…      1   0
Dn     1   0
Term-term matrix:
      T1  T2  T3  …  Tn
T1         2
T2
…
Tn
Use the doc-term matrix to establish relationships between genes and diseases.

Results

Verification of the relationship
Data from GeneCards and HUGO has been stored in a database.
For each gene, if the symbol is an official gene symbol (according to HUGO), search for the symbol in GeneCards and display the diseases associated with it.
Otherwise (if the symbol is an alias), use HUGO to find the official gene symbol, search GeneCards using that symbol, and display the diseases associated with the gene.
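The lookup logic above might look like the following sketch. The dictionaries are hypothetical in-memory stand-ins for the HUGO and GeneCards database tables, with a couple of illustrative rows, and `diseases_for` is an assumed helper name.

```python
# Hypothetical stand-ins for the database tables (illustrative rows only).
OFFICIAL_SYMBOLS = {"PSEN1", "BRCA2"}                  # official HUGO symbols
HUGO_ALIASES = {"PS1": "PSEN1", "S182": "PSEN1"}       # alias -> official symbol
GENECARDS = {                                          # official symbol -> diseases
    "PSEN1": ["Alzheimer disease"],
    "BRCA2": ["breast cancer"],
}


def diseases_for(symbol):
    """Resolve a symbol to its official form, then look up its diseases.

    Returns (official_symbol, diseases); (None, []) if the symbol is unknown.
    """
    if symbol in OFFICIAL_SYMBOLS:
        official = symbol
    else:
        # Not official, so treat it as an alias and ask HUGO for the official symbol.
        official = HUGO_ALIASES.get(symbol)
    if official is None:
        return None, []
    return official, GENECARDS.get(official, [])
```

For example, `diseases_for("PS1")` resolves the alias to `PSEN1` before consulting the GeneCards stand-in.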

Verification results

Using gene alias
Make use of gene aliases from HUGO to increase the chances of detecting the correct genes for a given disease.
Method:
Increment the weight of an official gene by adding the weight of its alias.
Group the alias together with the official gene.
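A minimal sketch of this grouping step, assuming the per-symbol weights for one disease are already available as a dictionary. The function names and the example weights are hypothetical.

```python
from collections import defaultdict


def merge_alias_weights(weights, alias_to_official):
    """Add each alias's weight to its official gene symbol.

    weights: dict mapping a symbol (official or alias) to its weight for one disease.
    alias_to_official: dict mapping an alias to its official symbol.
    """
    merged = defaultdict(float)
    for symbol, weight in weights.items():
        official = alias_to_official.get(symbol, symbol)
        merged[official] += weight
    return dict(merged)


def top_genes(weights, n=5):
    """Return the n highest-weighted genes as (symbol, weight) pairs."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n]
```

With weights `{"BRCA2": 2.0, "FAD1": 1.5, "KRAS": 3.0}` and the alias map `{"FAD1": "BRCA2"}`, the merged weight of BRCA2 becomes 3.5, which moves it ahead of KRAS in the ranking.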

Results for Pancreatic Cancer
Top five genes, without considering alias
Top five genes, considering alias

Using gene alias: problems
Problem: HUGO might list the same alias under multiple official gene symbols. Such an alias could increase the weight of a gene that is not related to the disease.
Example (HGNC | genesymbol | alias):
3585 | FANCD2 | FAD, FA-D2
1101 | BRCA2 | FAD, FAD1
9508 | PSEN1 | FAD, S182, PS1
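One way to spot such aliases before merging weights is to index the HUGO rows by alias and flag any alias that maps to more than one official symbol. The sketch below uses the three example rows from the slide; the function name is hypothetical.

```python
from collections import defaultdict

# The three example rows from the slide: (HGNC, genesymbol, aliases).
HUGO_ROWS = [
    (3585, "FANCD2", ["FAD", "FA-D2"]),
    (1101, "BRCA2", ["FAD", "FAD1"]),
    (9508, "PSEN1", ["FAD", "S182", "PS1"]),
]


def ambiguous_aliases(rows):
    """Return aliases that belong to more than one official gene symbol."""
    owners = defaultdict(set)
    for _hgnc_id, symbol, aliases in rows:
        for alias in aliases:
            owners[alias].add(symbol)
    return {alias: sorted(symbols)
            for alias, symbols in owners.items()
            if len(symbols) > 1}
```

On these rows only `FAD` is flagged, since it is shared by FANCD2, BRCA2 and PSEN1. Excluding or down-weighting flagged aliases is one possible mitigation, though the slides do not say which approach was taken.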

Problem using alias

Verification
In addition, the number of PubMed articles containing both a disease and a gene symbol can indicate how strong the association between the disease and the gene is.
The same reasoning applies to a gene-gene relationship.

Gene-Gene Relationships
In addition, we can use the doc-term matrix to find genes that are related to any given gene.
Doc-gene matrix:
      g1  g2  g3  …  gn
D1     1   1   1
D2     1   1   1
…      1   0   1
Dn     1   0   0
Gene-gene matrix:
      g1  g2  g3  …  gn
g1
g2             2
…
gn
Using the matrices above, we see that g2 is related to g3 and the weight is 2.
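The gene-gene weights can be read as co-occurrence counts: for each pair of genes, count the documents in which both appear. The sketch below assumes that the three values shown per row on the slide belong to the g1, g2 and g3 columns, which is consistent with the stated weight of 2 between g2 and g3. The function name is hypothetical.

```python
# Doc-gene matrix from the slide, assuming the values belong to columns g1, g2, g3.
GENES = ["g1", "g2", "g3"]
DOC_GENE = [
    [1, 1, 1],  # D1
    [1, 1, 1],  # D2
    [1, 0, 1],  # ...
    [1, 0, 0],  # Dn
]


def cooccurrence(matrix):
    """Gene-gene co-occurrence counts: entry [i][j] sums row[i] * row[j] over all documents."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


gene_gene = cooccurrence(DOC_GENE)
g2, g3 = GENES.index("g2"), GENES.index("g3")
print(gene_gene[g2][g3])  # co-occurrence weight between g2 and g3
```

The same function applies unchanged to a doc-term matrix whose columns mix gene and disease terms.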

Discovering additional gene-gene relationships
We can make use of the possibility that two genes might be related to each other via a disease, as in:
gene1 -> disease1 -> gene2
gene1 -> disease2 -> gene2
to establish a relationship between gene1 and gene2.
In our case, the fact that gene1 and gene2 are related via two different diseases makes the relationship between them even stronger.
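A sketch of this indirect linking, assuming the gene-disease associations are available as a mapping from each gene to its set of diseases. Using the number of shared diseases as the strength of the link is one simple reading of "even stronger", and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def genes_linked_by_disease(gene_diseases):
    """Find gene pairs connected through at least one shared disease.

    gene_diseases: dict mapping a gene to the set of diseases it is associated with.
    Returns a dict mapping (gene_a, gene_b) to the set of diseases linking them.
    """
    # Invert the mapping: disease -> genes associated with it.
    genes_by_disease = defaultdict(set)
    for gene, diseases in gene_diseases.items():
        for disease in diseases:
            genes_by_disease[disease].add(gene)

    # Every pair of genes under the same disease is linked through that disease.
    links = defaultdict(set)
    for disease, genes in genes_by_disease.items():
        for gene_a, gene_b in combinations(sorted(genes), 2):
            links[(gene_a, gene_b)].add(disease)
    return dict(links)
```

For the situation on the slide, the pair `("gene1", "gene2")` maps to `{"disease1", "disease2"}`, so its link count of 2 ranks it above a pair that shares only one disease.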

Architecture

System Demonstration
r/search.html
Related URLs: Genecards: HGNC:

Summary
Using a combination of statistical methods and a database, the process of establishing gene-disease relationships from literature data is fast and efficient.
With minimal changes, our system can be extended to discover other relationships, such as protein-protein interactions.

Future Work
Extend our system to incorporate the entire Medline dataset.
Incorporate full gene names.
Find a better way to verify the gene-gene relationships.
Incorporate other on-line scientific literature databases.

Acknowledgments
Professor Javed Mostafa
Professor Sun Kim
Professor Memo Dalkilic
Professor Haixu Tang