1
Takeda Pharmaceutical Inc.
Integrating In Vitro Drug Sensitivity and Genomics Data for Identification of Novel Drug Pathway Associations
Cong Li and Ray Liu
Yale University and Takeda Pharmaceutical Inc.
May 19, 2015
Presented at MBSW, Muncie, IN, USA
2
Introduction
Interests: indication selection; patient selection.
Experiments: cell line drug response assay; microarray assay.
Data: IC50 and microarray gene expression.
Current analysis practice: stepwise and test-based.
Our goal: develop a method that analyzes the available data jointly and incorporates biological information.
Figure labels: Gene; Change in gene expression; IC50; Drug; Cell line.
3
Microarrays
5
Questions Often Asked
Design issues.
Which genes are differentially expressed between the conditions?
Which genes can be used to classify/predict? How?
Can biological networks be inferred from these data?
What are the biological stories in the data?
6
Drug Pathway Questions
The current drug development framework typically considers the effect of a compound on a single target.
Pathway-based approaches for drug discovery consider the therapeutic effects of compounds in the global physiological environment.
For many compounds, the target pathways and mechanisms of action are still unknown.
How can the target pathways of drugs be inferred?
7
Motivating Data Sets
Gene expression data: Affymetrix U133+2 arrays, mapped to ~19,000 genes across more than 1,000 cancer cell lines; among them, 480 cell lines have available drug response data.
Use genes included in two lists: (1) 766 cancer-related genes (Chen, et al., 2008); (2) 8919 genes from the Integrated Druggable Genome Database (IDGD) Project (Hopkins and Groom, 2002; Russ and Lampel, 2005).
Pathway association information: retrieved from the KEGG MEDICUS database (Kanehisa, et al., 2010). 58 pathways that are either known to be related to cancer or have drug targets. Among the genes selected in step (1), 1863 genes are covered by these 58 pathways and constitute the final list of genes in our real data analysis.
Drug response data: 24 drugs annotated in the CancerResource database (Ahmed, et al., 2011). Response measure: log(Activity Area). 22 drugs have known targets covered by the 58 pathways.
8
Overview of the 22 drugs
9
Activity Area (shaded area)
Activity area is a combined measure of both drug potency and drug efficacy, whereas GI50 only measures drug potency.
10
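Data Format
Drug sensitivity values (e.g. Activity Area or GI50): rows are cell lines (Cell line 1, Cell line 2, Cell line 3, ...); columns are drugs (drug1, drug2, drug3, ...).
Basal gene expression levels (before drug treatment): rows are cell lines (Cell line 1, Cell line 2, Cell line 3, ...); columns are genes (gene1, gene2, gene3, gene4, ...).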
11
Model Description
A spike-and-slab mixture prior (West, 2003) is placed on the factor loading matrices W1 and W2 to impose sparsity and to use prior knowledge of gene-pathway and drug-pathway associations (matrices L1 and L2).
13
Instead of adopting a full Bayesian treatment, we use the following integrative Penalized Matrix Decomposition (iPaD) framework.
Note the notation differences from iFad:
Y(1) is the drug response profile matrix.
Y(2) is the gene expression profile matrix.
X is the pathway activity level matrix.
B(1) and B(2) are the pathway loading matrices for drug responses and gene expressions, respectively.
The indexes of the non-zero elements in B(2) are known and denoted by Γ.
The major interest is to find the non-zero elements in B(1).
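The objective itself did not survive in this transcript. One form consistent with the definitions above is written out here as an assumption, not as the authors' exact formula: the 1/2 scaling, the equal weighting of the two loss terms, and the unit-norm bound on the columns of X are all guesses.

```latex
\min_{X,\,B^{(1)},\,B^{(2)}}\;
  \tfrac{1}{2}\,\bigl\lVert Y^{(1)} - X B^{(1)} \bigr\rVert_F^2
  + \tfrac{1}{2}\,\bigl\lVert Y^{(2)} - X B^{(2)} \bigr\rVert_F^2
  + \lambda\,\bigl\lVert B^{(1)} \bigr\rVert_1
\quad \text{s.t.} \quad
  B^{(2)}_{kj} = 0 \ \text{for } (k,j) \notin \Gamma,
  \qquad \lVert X_{\cdot k} \rVert_2 \le 1 \ \text{for all } k .
```

Under this reading, rows of X, Y(1) and Y(2) are cell lines, each column of B(1) is the pathway loading vector of one drug, and each column of B(2) is the loading vector of one gene.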
14
The algorithm
The optimization problem in iPaD is a bi-convex problem, which motivates the following block-wise optimization strategy:
Step 1. Optimize over B(1) and B(2) while keeping X fixed.
Step 2. Optimize over X while keeping B(1) and B(2) fixed.
Step 3. Iterate between Steps 1 and 2 until convergence.
15
The algorithm
When X is fixed, optimizing each column of B(1) is a LASSO problem.
When X is fixed, optimizing each column of B(2) is an ordinary least squares (OLS) problem.
When B(1) and B(2) are fixed, X can be optimized using an iterative projected gradient descent algorithm.
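The block-wise scheme can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a squared-error loss for both matrices, a unit-norm bound on each column of X (so the projection step has a set to project onto), ISTA as the LASSO solver, and a boolean mask `Gamma` (pathways by genes) for the known support of B(2). All function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    # Elementwise soft-thresholding, the proximal operator of t * |.|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_column(X, y, lam, n_iter=200):
    # ISTA for min_b 0.5 * ||y - X b||^2 + lam * ||b||_1 (assumed LASSO solver)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b - step * X.T @ (X @ b - y), step * lam)
    return b

def ols_on_support(X, y, support):
    # OLS restricted to the pathways known to contain this gene
    b = np.zeros(X.shape[1])
    if support.any():
        b[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return b

def project_columns(X):
    # Project each column onto the unit L2 ball (assumed constraint on X)
    return X / np.maximum(np.linalg.norm(X, axis=0), 1.0)

def update_X(X, Y, B, n_iter=50):
    # Projected gradient descent on 0.5 * ||Y - X B||_F^2
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        X = project_columns(X - step * (X @ B - Y) @ B.T)
    return X

def ipad_sketch(Y1, Y2, Gamma, lam, n_outer=20, seed=0):
    rng = np.random.default_rng(seed)
    n, K = Y1.shape[0], Gamma.shape[0]
    X = project_columns(rng.standard_normal((n, K)))
    for _ in range(n_outer):
        # Step 1: update B(1) and B(2) column by column with X fixed
        B1 = np.column_stack([lasso_column(X, Y1[:, j], lam)
                              for j in range(Y1.shape[1])])
        B2 = np.column_stack([ols_on_support(X, Y2[:, j], Gamma[:, j])
                              for j in range(Y2.shape[1])])
        # Step 2: update X with B(1) and B(2) fixed
        X = update_X(X, np.hstack([Y1, Y2]), np.hstack([B1, B2]))
    return X, B1, B2
```

A fixed number of outer iterations stands in for a convergence check; in practice one would monitor the change in the objective between iterations.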
16
Dealing with missing values
A gene/drug or cell line that is completely missing can be excluded.
However, partially missing genes/drugs or cell lines should be kept in the analysis.
In our block-wise algorithm, B(1) and B(2) can be optimized column by column with the missing values excluded.
However, optimizing X is less straightforward because neither its rows nor its columns can be optimized separately.
17
We use the following soft-impute algorithm to optimize X in the presence of missing values
Ω indexes the observed elements in a matrix, and PΩ(·) is an operator that projects a matrix onto the space of its observed elements.
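The update formula is not preserved in this transcript. A minimal sketch of one common impute-then-update scheme is given below, under stated assumptions: missing entries are encoded as NaN, the completed matrix is PΩ(Y) plus the current fit XB on the unobserved entries, and X is then updated by projected gradient descent with a unit-norm bound on its columns. The function names are hypothetical, and the authors' actual algorithm may differ.

```python
import numpy as np

def project_columns(X):
    # Project each column onto the unit L2 ball (assumed constraint on X)
    return X / np.maximum(np.linalg.norm(X, axis=0), 1.0)

def update_X_with_missing(X, Y, B, n_outer=10, n_inner=20):
    observed = ~np.isnan(Y)                      # Omega: observed entries
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
    for _ in range(n_outer):
        # Impute: keep observed values, fill the rest with the current fit
        Y_filled = np.where(observed, Y, X @ B)
        # Update: projected gradient steps on the completed matrix
        for _ in range(n_inner):
            X = project_columns(X - step * (X @ B - Y_filled) @ B.T)
    return X
```

Each imputation builds a surrogate that upper-bounds the squared error on the observed entries, so starting from a feasible X the observed-entry loss should not increase from one outer iteration to the next.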
18
Parameter tuning
There is a parameter λ that controls the sparsity of B(1).
One way to use the method is to apply a decreasing sequence of λ values to obtain a sequence of solutions for B(1).
We can also perform cross-validation on the drug response profile matrix Y(1).
Figure legend: green, training data; black, testing data.
Significance test
After finding an appropriate λ value, we can perform permutation tests to establish the significance of the identified drug-pathway associations.
Permute the cell lines (rows) in Y(1) while keeping Y(2) unchanged.
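The row-permutation idea can be sketched as below. This is an illustration under assumptions: the fitting routine is passed in as a callable `fit_B1(Y1, Y2)` returning the estimated B(1), the test statistic is the absolute loading, and plain empirical p-values are used. `toy_fit_B1` is a hypothetical stand-in (top singular vectors of Y(2) as pathway activities) so the block runs on its own; it is not the iPaD fit.

```python
import numpy as np

def toy_fit_B1(Y1, Y2, K=3):
    # Hypothetical stand-in for the real fit: use the top-K left singular
    # vectors of Y2 as X, then regress Y1 on X (X has orthonormal columns)
    U, _, _ = np.linalg.svd(Y2, full_matrices=False)
    X = U[:, :K]
    return X.T @ Y1

def permutation_pvalues(Y1, Y2, fit_B1, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.abs(fit_B1(Y1, Y2))
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Shuffle cell lines (rows) of Y1 only; Y2 stays unchanged, which
        # breaks the link between drug response and pathway activity
        perm = rng.permutation(Y1.shape[0])
        null = np.abs(fit_B1(Y1[perm], Y2))
        exceed += null >= observed
    # Add-one correction keeps p-values strictly positive
    return (exceed + 1.0) / (n_perm + 1.0)
```

With the add-one correction, the smallest attainable p-value is 1/(n_perm + 1), so the number of permutations bounds the resolution of the test.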
19
Simulations
We performed the following four sets of simulations (the 58 pathways in the real data were used; the number of drugs d = 22).
Table columns: N, η, SNR1, SNR2.
Sample Size: 120, 0.1, 0.5; 240; 360; 480
Sparsity of B(1): 0.02; 0.05; 0.2
Signal-to-Noise Ratio: 0.25; 1
Unbalanced Signal-to-Noise Ratio
The simulated data sets were analyzed by both iFad and iPaD. Their performance was evaluated by the area under the ROC curve (AUC).
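For reference, AUC can be computed from ranks alone. The sketch below assumes that each drug-pathway pair receives a score (for example, the absolute estimated loading in B(1)) and a binary label marking whether the pair is truly non-zero in the simulation; how the scores were actually formed is not stated in the slides. The function name is hypothetical.

```python
import numpy as np

def auc_from_scores(scores, labels):
    # Rank-based (Mann-Whitney) AUC; tied scores receive their average rank
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Two of the four (positive, negative) pairs are clearly ordered correctly,
# one more is correct, one is inverted
print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```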
20
The performance of the two methods is similar
21
However, iPaD is much faster
1,000 iterations of iFad take 4~5 days.
Solving a sequence of λ values with iPaD takes only ~6 minutes.
22
Real Data Analysis
We analyzed the CCLE data set described earlier with both iFad and iPaD.
iFad: 2,000 MCMC iterations; iPaD: 10-fold CV followed by 2,000 permutations (the null distribution was approximated using a mixture of a normal distribution and a point mass at zero).
We call a drug-pathway association validated if the pathway contains at least one protein targeted by the drug.
Among the 58 x 22 = 1276 drug-pathway pairs, 195 pairs are validated associations (195/1276 = 15.3%).
Considering the randomness in the algorithms, we ran five repeats.
Among the top 50 drug-pathway association pairs identified by iFad, 7.0 pairs (averaged over five repeats) were validated; for iPaD, 16.6.
The top associations identified by iPaD were relatively consistent over the five repeats, but those identified by iFad were not (iFad probably did not converge).
Running time: 2,000 MCMC iterations took ~230 hours on a standard laptop computer (2.4 GHz dual-core CPU with 8 GB memory running Mac OS X 10.9); 2,000 permutations took ~6 hours for iPaD.
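To put the top-50 counts in context, the short calculation below compares them with the number of validated pairs one would expect if 50 pairs were drawn at random. It uses only the figures quoted above; the "expected by chance" baseline is an added reference point, not a number from the slides.

```python
n_pairs = 58 * 22                      # all drug-pathway pairs
validated = 195                        # pairs with a known drug target in the pathway
base_rate = validated / n_pairs        # fraction validated overall

expected_top50 = 50 * base_rate        # validated pairs expected in a random top 50
enrichment_ifad = 7.0 / expected_top50
enrichment_ipad = 16.6 / expected_top50

print(f"pairs: {n_pairs}, base rate: {base_rate:.1%}")
print(f"expected validated in a random top 50: {expected_top50:.2f}")
print(f"fold enrichment, iFad: {enrichment_ifad:.2f}, iPaD: {enrichment_ipad:.2f}")
```

On these numbers, a random top 50 would contain about 7.6 validated pairs, so the iFad average of 7.0 is roughly at chance level while the iPaD average of 16.6 is a little over twice that.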
23
The Chronic Myeloid Leukemia Pathway
24
The ErbB Signaling Pathway
25
Limitations/Future Work
Relatively simple additive models.
Limited and unreliable information on pathways.
Pathway network topology not considered.
Other sources of information.
Tradeoff between model simplicity, computational feasibility, and real biological complexity.
26
Thank you!