Detecting Phenotype-Specific Interactions Between Biological Processes Nadeem A. Ansari Department of Computer Science Wayne State University Detroit,

Presentation transcript:

Detecting Phenotype-Specific Interactions Between Biological Processes Nadeem A. Ansari Department of Computer Science Wayne State University Detroit, MI

Outline
- Biological background
- Motivation and problem description
- Challenges and limitations
- Mathematical background
- Detecting changed interactions between biological processes in a phenotype
- Improvements
- Results
- Summary


Cells, proteins, and DNA
- Cells: fundamental units of life that contain all the working machinery necessary for their functioning
- Proteins: the main contributors of this working machinery
- Deoxyribonucleic acid (DNA): contains the blueprint for making the working machinery
- Gene expression: the process of making the working machinery

DNA
- Linear molecule of two strands, each composed of subunits called nucleotides
- Nucleotide types:
  - Adenine (A)
  - Cytosine (C)
  - Guanine (G)
  - Thymine (T)

DNA
- Base pairing:
  … A A C G G A T …
  … T T G C C T A …

Transcription
- Information stored in DNA letters is transcribed into ribonucleic acid (RNA)
- RNA: a chain of nucleotides A, C, G, and U (uracil)

  DNA: … G T G C A T …
  RNA: … C A C G U A …
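The pairing rule behind the DNA/RNA example can be sketched in a few lines of Python. The helper name `transcribe` and the lookup table are illustrative assumptions, not part of the talk; the DNA string is read as the template strand, as in the slide's example.

```python
# Complement used when a DNA template strand is transcribed into RNA:
# A pairs with U (uracil replaces thymine), T with A, G with C, C with G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """Return the RNA complementary to a DNA template strand (hypothetical helper)."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


print(transcribe("GTGCAT"))  # the slide's example strand
```

Applied to the slide's strand G T G C A T, the mapping yields the RNA letters shown on the slide.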

Translation
- Information stored in RNA is translated into chains of amino acids (proteins)

Gene expression
- The process of making the working machinery of a cell

- Regions of DNA that are synthesized into functional RNA and proteins are known as genes
- An observable characteristic (or trait) of an organism caused by gene expression is known as a phenotype

Gene expression measurement – why?
- Various stimuli cause changes in gene expression
- A change in expression level results in under- or over-production of the working machinery, which can lead to diseases / phenotypes
- All cells contain the same DNA but express genes selectively
- Measuring gene expression can help us understand the underlying biological phenomena

Gene expression measurements
- Researchers typically measure gene expression in two different tissues or cell samples
  - e.g., cells treated with a drug vs. untreated cells
- Genes expressed differently than in a control sample are called differentially expressed (DE) genes
- High-throughput technologies such as DNA microarrays measure the expression levels of thousands of genes

Genes and annotations
- Functional characteristics of gene products are stored in annotation databases such as the Gene Ontology
- Gene Ontology (GO): a controlled and structured vocabulary
  - Molecular functions, biological processes, and cellular components
- Structured as directed acyclic graphs (DAGs)
  - Nodes represent terms
  - Edges represent relationships
- Parent-child relations (a term can have more than one parent)
  - Is-a, part-of, and regulates (negatively, positively)

Biological processes – GO subset
- GO is a set of terms and their definitions organized in a structure that reflects their relationships
- GO also provides a set of annotations describing what is known about each gene (product)


Motivation and problem description
- Various stimuli cause differential gene expression, which results in the over- and under-production of proteins
- Over- and under-production of proteins can result in a disease and a disease-specific phenotype
- Understanding gene behavior can help us understand diseases in new ways
  - e.g., identifying drug targets for curing diseases

Motivation and problem description
- Current approaches look for the biological functions that are under- or over-represented in the phenotype-specific gene expression patterns
- However, life is complex, and biological functions also interact
- These interactions change in a phenotype
- Understanding changed interactions between biological functions is important for understanding the underlying biological mechanism that resulted in the phenotype


Goals
- Our goal is to detect the interactions between biological functions that have changed significantly in a given phenotype
- We detect these interactions between the GO biological processes annotated with the genes differentially expressed in the phenotype

Challenges and limitations
- There is no simple way to establish which biological functions are important
  - No universally accepted statistical model exists
- Finding relationships between biological processes using mathematical models is challenging
- No known statistical model detects changed interactions in a given phenotype
- Using GO annotations presents its own challenges

Challenges and limitations
- GO is incomplete and continuously updated
  - Information about gene annotations is missing
- GO contains inconsistencies
  - New research may make previous annotations obsolete
- The GO hierarchy introduces dependencies
  - Genes annotated with a specific term are assumed to be annotated with all ancestors of that term
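The dependency in the last point (an annotation to a term implies annotation to every term above it in the DAG) can be illustrated with a toy graph. This is a minimal sketch: the term names, gene names, and the helpers `ancestors` and `propagate` are made up for illustration and are not real GO identifiers or part of the talk.

```python
# Toy DAG: each term maps to its parents (a term may have more than one).
# Term names are placeholders, not real GO identifiers.
PARENTS = {
    "biological_process": [],
    "metabolic_process": ["biological_process"],
    "cellular_process": ["biological_process"],
    "dna_metabolic_process": ["metabolic_process", "cellular_process"],
    "dna_repair": ["dna_metabolic_process"],
}


def ancestors(term: str) -> set[str]:
    """Collect every term reachable by following parent edges upward."""
    found: set[str] = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def propagate(direct: dict[str, set[str]]) -> dict[str, set[str]]:
    """Extend each gene's direct annotations with all ancestor terms."""
    return {
        gene: terms | {a for t in terms for a in ancestors(t)}
        for gene, terms in direct.items()
    }


full = propagate({"geneA": {"dna_repair"}, "geneB": {"cellular_process"}})
print(sorted(full["geneA"]))
```

A gene annotated only with the most specific term ends up annotated with every term on the paths to the root, which is why columns of a gene-function matrix built from GO are not independent.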


Information retrieval (IR)
- Problem: given a query, find relevant documents from a collection
- Vector space model (VSM)
  - Represent documents and keywords in a matrix
    - Documents as columns with keywords as components, so the columns are document vectors
  - Represent the query as a (column) vector
  - Find the document vectors closest to the query vector
    - These documents are relevant to the query

Example – document retrieval

Document collection:
  D1  How to bake bread without recipes
  D2  The classic art of Viennese pastry
  D3  Numerical recipes: the art of scientific computing
  D4  Breads, pastries, pies, and cakes: quality baking recipes
  D5  Pastry: a book of best French recipes

Terms:
  T1  bake
  T2  recipe
  T3  bread
  T4  cake
  T5  pastry
  T6  pie

Term-by-document matrix A: rows T1–T6, columns D1–D5 (the entries are not preserved in this transcript)

Example taken from Berry et al., SIAM Review 41, 2 (1999)

Example (IR VSM)
- Query: a user searching for documents related to “baking bread”
- The query is represented as a vector over the terms T1–T6 and compared with each document vector (a column of A)

Finding relevant (similar) documents

  A     D1    D2    …    Dn
  T1    a11   a12   …    a1n
  T2    a21   a22   …    a2n
  …     …     …     …    …
  Tm    am1   am2   …    amn
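As a rough illustration of how the vector space model ranks documents, the sketch below scores the five example titles against the query "baking bread" with cosine similarity. The binary weights are an assumption read off the titles (plural and "-ing" forms counted as the base term); the slide's own matrix entries and similarity formula are not preserved, so this is one plausible instance rather than the talk's exact numbers.

```python
import math

TERMS = ["bake", "recipe", "bread", "cake", "pastry", "pie"]

# Columns D1..D5 as binary term-incidence vectors over TERMS (assumed weights).
DOCS = {
    "D1": [1, 1, 1, 0, 0, 0],
    "D2": [0, 0, 0, 0, 1, 0],
    "D3": [0, 1, 0, 0, 0, 0],
    "D4": [1, 1, 1, 1, 1, 1],
    "D5": [0, 1, 0, 0, 1, 0],
}

QUERY = [1, 0, 1, 0, 0, 0]  # "baking bread" -> bake, bread


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


scores = {name: cosine(QUERY, vec) for name, vec in DOCS.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

Under these assumptions only D1 and D4 share terms with the query, so they are the only documents with a nonzero score.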

Correlation
- Determines whether two random variables vary together
- Linear correlation between X and Y (Pearson correlation coefficient):
  - Positive correlation: X increases as Y increases
  - Negative correlation: X decreases as Y increases
  - No linear correlation: no linear relationship

Pearson correlation coefficient – geometric interpretation
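The figure for this slide is not preserved. The standard geometric reading, stated here with assumed symbols (x and y are two vectors with n components, and θ is the angle between them after each has its mean subtracted), is that the Pearson coefficient equals the cosine of that angle:

```latex
r_{xy}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \cos\theta
```

This ties correlation to the cosine measure of the vector space model: correlating two columns amounts to comparing their mean-centered vectors by angle.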


Detecting interactions that have changed significantly in the phenotype
- Represent the genes differentially expressed in a phenotype and their biological functions as a matrix
  - Vector space model with biological processes as column vectors
- Find associations between pairs of biological processes
- Compare these associations with the corresponding associations in the absence of the phenotype
- Detect associations that are significantly different in the phenotype

Data inputs - genes and functions
- Reference set of genes and functions (R)
  - M genes on a microarray
  - N GO terms annotated with the M genes
- Biological condition under study (E)
  - m < M differentially expressed (DE) genes
  - n <= N GO terms annotated with the m DE genes

Gene function matrix – reference data

  GF    f1    f2    …    fN
  g1    a11   a12   …    a1N
  g2    a21   a22   …    a2N
  …     …     …     …    …
  gM    aM1   aM2   …    aMN

Gene function matrix – reference data
- Example gene-function matrix

Gene function matrix – experiment data
- Example gene-function matrix

Gene function matrix – reference and experiment data
- The experiment gene-function matrix is a sub-part of the reference gene-function matrix
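The relationship between the two matrices can be made concrete with a small sketch. The gene names, function names, annotations, and the helper `gene_function_matrix` are all hypothetical; binary entries are assumed (1 if the gene is annotated with the function, 0 otherwise).

```python
# Hypothetical reference annotations: gene -> GO biological processes.
REFERENCE = {
    "g1": {"f1", "f2"},
    "g2": {"f2"},
    "g3": {"f1", "f3"},
    "g4": {"f3", "f4"},
}
DE_GENES = ["g1", "g3"]  # assumed differentially expressed genes


def gene_function_matrix(genes, annotations):
    """Binary matrix: rows are genes, columns are the functions they annotate."""
    functions = sorted({f for g in genes for f in annotations[g]})
    rows = [[1 if f in annotations[g] else 0 for f in functions] for g in genes]
    return functions, rows


ref_functions, ref_matrix = gene_function_matrix(sorted(REFERENCE), REFERENCE)
exp_functions, exp_matrix = gene_function_matrix(DE_GENES, REFERENCE)

print(ref_functions, ref_matrix)
print(exp_functions, exp_matrix)
```

In this toy case M = 4, N = 4, m = 2, and n = 3, and every entry of the experiment matrix equals the entry for the same gene and function in the reference matrix.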


Our approach to solve challenges
- Use singular value decomposition (SVD)
- SVD is a factorization of a matrix into three matrices consisting of the singular vectors and singular values of the original matrix
- SVD can find missing relationships between genes and annotations in the latent semantic space and also remove noise from the data
  - Noise: multiple words describing the same concept

Singular value decomposition (SVD)
- SVD of a GF matrix: $GF = G\,S\,F^{T}$
- The columns of matrix G (F) are the left (right) singular vectors of GF
- S is a diagonal matrix of singular values $s_i$
  - The values on the main diagonal are in non-increasing order and represent the variability in the data

Matrix approximation – dimensionality reduction
- An approximated matrix can be computed by keeping only the k largest singular values
- k is chosen so that the retained singular values preserve the desired share of the data variance (say x%)
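A minimal NumPy sketch of this truncation step follows. The slide's selection equation is not preserved, so the criterion used here (the cumulative share of squared singular values) is an assumption, as are the function names `choose_k` and `approximate` and the toy matrix.

```python
import numpy as np


def choose_k(singular_values, keep=0.9):
    """Smallest k whose leading singular values retain `keep` of the variance.

    Assumption: the variance share is measured with squared singular values.
    """
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    k = int(np.searchsorted(energy, keep) + 1)
    return min(k, len(singular_values))


def approximate(gf, keep=0.9):
    """Rank-k approximation of a gene-function matrix via truncated SVD."""
    g, s, ft = np.linalg.svd(gf, full_matrices=False)
    k = choose_k(s, keep)
    return (g[:, :k] * s[:k]) @ ft[:k, :], k


gf = np.array([[1, 1, 0, 0],
               [0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 0, 1, 1]], dtype=float)
gf_k, k = approximate(gf, keep=0.9)
print(k, np.round(gf_k, 2))
```

The approximated matrix has the same shape as the input but generally contains non-binary values, including nonzero entries where the original had zeros; this is how latent gene-function relationships can surface.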

Approximated matrix – column view
- We approximate both the reference and the experiment matrices
- The approximated experiment gene-function matrix is not a sub-part of the approximated reference gene-function matrix

Correlation Between Functions
- Indicates the strength and direction of a linear relationship between two biological processes
- The Pearson correlation coefficient $r_{f_i,f_j}$ is computed for each pair of functions $f_i$ and $f_j$
- Matrices of correlation coefficients, $R^{R}_{N \times N}$ and $R^{E}_{n \times n}$, are computed for the reference and experiment data, respectively
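The pairwise computation can be sketched in plain Python. The input values are invented stand-ins for an approximated gene-function matrix (rows are genes, columns are functions), and the helper names `pearson` and `correlation_matrix` are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant column
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Pairwise correlations between the columns (functions) of a matrix."""
    cols = list(zip(*rows))
    return [[pearson(ci, cj) for cj in cols] for ci in cols]


approx = [[0.9, 0.8, 0.1],
          [0.2, 0.1, 0.7],
          [0.8, 0.9, 0.2],
          [0.1, 0.3, 0.9]]
r = correlation_matrix(approx)
print([[round(v, 2) for v in row] for row in r])
```

The result is symmetric with ones on the diagonal. Running it once on the reference matrix and once on the experiment matrix gives the two correlation matrices that are compared later.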

Pair-wise Correlation Coefficients for Reference and Experiment Data
- $R^{R}_{n \times n}$ contains the pair-wise correlation coefficients between the first n functions in the absence of the phenotype

Fisher Z Transform – Correlation Coefficient to Z-values
- Correlation coefficients from samples of a large population can be mapped to z-values with the Fisher z-transform; the transformed values are approximately normally distributed
- For a correlation coefficient r, the Fisher z-transform $Z_r$ is computed as: $Z_r = \frac{1}{2}\ln\frac{1+r}{1-r}$
- Compute $Z^{R}_{r}$ from $R^{R}_{N \times N}$ and $Z^{E}_{r}$ from $R^{E}_{n \times n}$

Detecting Changes Between Functional Interactions
- Hypothesis: the correlation between two biological processes in the given phenotype differs from the correlation in the reference data
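The slide's test statistic is not preserved in the transcript. One common way to test such a hypothesis, shown here purely as an assumption, is the two-sided z-test for the difference between two Fisher-transformed correlations. The sketch treats the two samples as independent, takes the sample sizes to be the numbers of genes behind each coefficient, and uses made-up correlation values; the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient, |r| < 1."""
    return 0.5 * math.log((1 + r) / (1 - r))


def correlation_change_test(r_exp, n_exp, r_ref, n_ref):
    """Two-sided z-test for a difference between two correlations.

    Assumption: standard error sqrt(1/(n_exp - 3) + 1/(n_ref - 3)),
    which treats the experiment and reference samples as independent.
    """
    z = (fisher_z(r_exp) - fisher_z(r_ref)) / math.sqrt(
        1 / (n_exp - 3) + 1 / (n_ref - 3)
    )
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p


# Illustrative inputs only: the correlations are invented.
z, p = correlation_change_test(r_exp=0.8, n_exp=231, r_ref=0.2, n_ref=24000)
print(round(z, 2), p < 0.05)
```

A pair of processes would be reported as having a changed interaction when p falls below the chosen significance level.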


Improvements
- The dependencies between GO terms can be partly removed by using weights in our matrix

Scheme 1-1
- This is the binary scheme that was discussed while describing our main method

Scheme 1-e
- $e_i$ is the normalized log-transformed fold-change measured for gene $g_i$ in the given condition
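The exact weighting formula for this scheme is not preserved. One plausible reading, offered only as an assumption, is that each 1 in a gene's row of the binary matrix is replaced by that gene's weight e_i. The sketch below also assumes log base 2 and normalization by the largest absolute log fold-change; the function name `scheme_1e` and the fold-change values are hypothetical.

```python
import math


def scheme_1e(binary_rows, fold_changes):
    """Weight each gene's annotations by its normalized log2 fold-change.

    Assumptions: log base 2, and normalization by the largest absolute
    log fold-change so that the weights fall in [-1, 1].
    """
    logs = [math.log2(fc) for fc in fold_changes]
    scale = max(abs(v) for v in logs) or 1.0
    e = [v / scale for v in logs]
    return [
        [a * e_i if a else 0.0 for a in row]
        for row, e_i in zip(binary_rows, e)
    ]


binary = [[1, 1, 0],
          [1, 0, 1]]
weighted = scheme_1e(binary, fold_changes=[4.0, 0.5])
print(weighted)
```

With weights like these, strongly changed genes contribute more to a function's column vector than genes that barely pass the differential-expression threshold.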

Scheme IR 1-1
- gb: gene (annotation) bias, related to the GO DAG
- iab: inverse annotation bias, related to the experiment

Scheme IR 1-e


Breast cancer data set
- van 't Veer et al. (2002) found differentially expressed genes in breast cancer
  - 24,000 reference genes on the microarray
  - 13,201 GO biological processes annotated with the reference genes
  - 231 genes were found to be differentially expressed
  - 246 biological processes annotated with the DE genes
- Since then no satisfactory prediction has been made in this regard

Breast Cancer Data Set Results
A subset of the predicted pairs of biological processes with significant interaction change (blank cells are not preserved in this transcript)

Scheme       | GO Term 1                  | GO Term 2                                     | p-value
1-1, IR 1-e  | Proteolysis                | Positive regulation of apoptosis              |
             | Transcription              | DNA replication initiation                    |
             | DNA repair                 | Regulation of transcription, DNA-dependent    | .033
IR 1-1       | Vesicle-mediated transport | Transcription from RNA polymerase II promoter | .002
IR 1-1       | DNA replication initiation | Phosphoinositide-mediated signaling           |

Breast Cancer Data Set Results Summary
Number of predicted pairs of biological processes with significant interaction change

Table: one row per weighting scheme plus a Total row, with columns Cat. 1, Cat. 2, Cat. 3, and Accuracy (%); the numeric entries are not preserved in this transcript.

- Cat. 1: known interactions, trivial
- Cat. 2: known interactions, non-trivial
- Cat. 3: unknown

Lung cancer data set
- Beer et al. (2002) found differentially expressed genes in lung cancer
  - 5541 reference genes on the microarray
  - 2908 GO biological processes annotated with the reference genes
  - 87 genes were found to be differentially expressed
  - 248 biological processes annotated with the DE genes

Lung Cancer Data Set Results Summary
Number of predicted pairs of biological processes with significant interaction change

Table: one row per weighting scheme plus a Total row, with columns Cat. 1, Cat. 2, Cat. 3, and Accuracy (%); the numeric entries are not preserved in this transcript.

Summary
- Various stimuli cause differential gene expression, which results in the expression of a disease and a disease-specific phenotype
- Biological processes interact, and their interactions change in a given phenotype
- We proposed methods to detect such significantly changed interactions in the observed phenotype
- We used the vector space model, matrix approximation, and statistical hypothesis testing to find changed interactions between biological processes from GO
- Results showed 89% or higher accuracy for our proposed methods

References:
- Ansari, N. A., Bao, R., and Drăghici, S. Detecting phenotype-specific interactions between biological processes from microarray data and annotations. Bioinformatics, under revision.
- Drăghici, S. Data Analysis Tools for DNA Microarrays. Chapman and Hall/CRC Press, 2003 (first print), 2006 (second print).
- Berry, M. W., Drmac, Z., and Jessup, R. E. Matrices, vector spaces, and information retrieval. SIAM Review 41, 2 (1999).
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990).
- Done, B., Khatri, P., Done, A., and Drăghici, S. Predicting novel human Gene Ontology annotations using semantic analysis. IEEE/ACM Transactions on CBB (2009).

Special thanks to Dr. Sorin Drăghici

Thank You