Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data Jianlin Jack Cheng, PhD Computer Science Department University of Missouri, Columbia March 7, 2012
Two Challenges Protein Structure Modeling Gene Regulatory Network Modeling
The Genomic Era Collins, Venter, Human Genome, 2000
Sequencing Revolution $1000 Personal Genome in 2010s Transcriptome Proteome
Genome Implications to Information Sciences and Life Sciences Elements and Systems
Growth of Protein Sequences AGCWY…
Growth of Protein Structures in PDB
Computational Protein Structure Folding / Prediction: Structure = f(sequence)? E = mc²
Template-Based Approach: target protein (MWLKKFGINKH…), fold recognition against the Protein Data Bank, template, alignment. Protein structure space is limited (Chothia, Nature, 1992); protein sequence space is astronomical!
Fisher, 2005 Modeller
Template-Free Protein Structure Prediction
Template-Free Approach Sampling: MCMC and Simulated Annealing MWLKKFGINLLIGQSV… …… Select structure with minimum free energy Simulation
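The sampling idea on this slide can be illustrated with a minimal simulated-annealing sketch. It is a toy under stated assumptions, not the sampler used in the talk: the conformation is a list of torsion angles, `toy_energy` is a hypothetical stand-in for a free-energy function, and the geometric cooling schedule and move size are arbitrary choices.

```python
import math
import random


def toy_energy(angles):
    # Hypothetical stand-in for a free-energy function (assumption):
    # the energy is zero when every angle is a multiple of 2*pi.
    return sum(1.0 - math.cos(a) for a in angles)


def simulated_annealing(energy, n_angles=10, steps=5000,
                        t_start=2.0, t_end=0.01, seed=0):
    """Metropolis sampling with a geometrically decreasing temperature."""
    rng = random.Random(seed)
    current = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Propose a small random change to one angle.
        proposal = list(current)
        i = rng.randrange(n_angles)
        proposal[i] += rng.gauss(0.0, 0.5)
        proposal_e = energy(proposal)
        delta = proposal_e - current_e
        # Metropolis criterion: always accept downhill moves,
        # accept uphill moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_e = proposal, proposal_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e
```

Calling `simulated_annealing(toy_energy)` returns the lowest-energy conformation visited and its energy, mirroring the "select structure with minimum free energy" step.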
Major Challenges in Protein Structure Prediction: Select best templates? Generate best alignments? Generate best models? Select best models? Like finding a needle in a haystack!
Major Challenges in Protein Structure Prediction Select best templates? Generate best alignments? Generate best models? Select best models? Wang, Eickholt, Cheng, Bioinformatics, 2010
A Conformation Ensemble Approach P(conformation) P(-energy) Conformation Distribution Maximum Likelihood & Maximum a Posteriori Brooks et al., 2001
New Views on Protein Modeling Protein structure modeling problem is simply a grand computational and statistical sampling problem. Random sampling (template-free) Targeted sampling (template-based)
A Unified Protein Structure Prediction Pipeline (Wang et al., Bioinformatics, 2010). Input: query sequence (MARTCRKE…). 1. Template ranking: query-template alignments. 2. Multiple-template combination: combined query-template alignment. 3. Model generation. 4. Evaluation & refinement. Output.
Sampling in Alignment and Fold Space PSI-BLAST (sequence – profile) SAM (sequence – HMM) HMMer (sequence – HMM) Compass (profile – profile) HHSearch (HMM - HMM) PRC (HMM-HMM) FOLDpro (machine learning) MSACompro (profile-profile) Cheng, Baldi, Bioinformatics, 2006 Deng, Cheng, BMC Bioinformatics, 2011
Multi-Template Combination in Template and Alignment Space (Cheng, BMC Structural Biology, 2008)
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL FGLMGN LSSWVGA ( )
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG ( )
Temp3 QGTARDRAWQLEVERHRAQGTSASFL ( )
Temp4 AANQLDAMRALGYAQERYFEMDLMRRAPAGELSELFGAKAVDLK (10^-5)
Multi-Template Combination in Template and Alignment Space (Cheng, BMC Structural Biology, 2008)
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL------FGLMGN------LSSWVGA----- ( )
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG ( )
Temp3 ARDRAWQLEVERHRAQGTSASFL ( )
Temp4 GAKAVDLK (10^-5)
Advantage: reduce variance of modeling
Multi-Template vs. Single-Top-Template: improved 38 of 45 targets; improvement of 6.8%; P-value < … (Cheng, BMC Structural Biology, 2008)
Combination of Template-Free and Template-Based Sampling. Protein Modeling Spectrum: 100% TBM, 50% TBM + 50% FM, 100% FM
Recursive Protein Modeling: Integrate TBM and FM (Cheng et al., 2011). Steps: initial region decomposition; model aligned / certain regions by TBM; model unaligned / uncertain regions by FM; keep certain regions / core fixed; compose TBM and FM components into larger certain components; if the result is not satisfactory, repeat. Key ideas: divide & conquer; conditional sampling; increase fitness & reduce bias.
Recursive Modeling Mimics Protein Folding Cascade ks.uiuc.edu
Core-Constrained Tail Refinement
Template-Based + Template-Free & Recursive Modeling (CASP9) Cheng et al., 2011
Insights – A Bayesian Approach Incorporate prior information: template-based region Conditional sampling: use certain regions to constrain uncertain regions Reduce uncertainty gradually Iteratively optimize the conformation
Model Selection Single model approach Ensemble approach Wang et al., Proteins, 2009 Cheng et al., Proteins, 2009 Wang et al., Bioinformatics, 2011
Model Quality Evaluation: select top 5 ranked models as references
Model Quality Assessment: compare each model with the top five reference models; average the global quality scores; re-rank models (+10%). Cheng et al., Proteins, 2009; Wang and Cheng, 2011
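The reference-based re-ranking described on this slide can be sketched as follows. This is an illustration under assumptions rather than the published method: `initial_scores` stands for any single-model quality score, `similarity` is a hypothetical precomputed pairwise structural similarity matrix (for example GDT-TS between models), and excluding a model from its own reference set is a choice made here.

```python
def rerank_by_references(initial_scores, similarity, n_refs=5):
    """Re-rank models by average similarity to the top-ranked reference models.

    initial_scores: list of floats, higher is better (assumed input).
    similarity: square matrix, similarity[i][j] between models i and j.
    Returns (ranking, consensus): model indices best-first, and each
    model's average similarity to the references.
    """
    n_models = len(initial_scores)
    # Pick the top n_refs models by initial score as references.
    order = sorted(range(n_models), key=lambda i: initial_scores[i], reverse=True)
    refs = order[:n_refs]
    consensus = []
    for i in range(n_models):
        # Do not compare a reference model with itself (assumption).
        others = [r for r in refs if r != i]
        if not others:
            consensus.append(0.0)
            continue
        consensus.append(sum(similarity[i][r] for r in others) / len(others))
    ranking = sorted(range(n_models), key=lambda i: consensus[i], reverse=True)
    return ranking, consensus
```

A model that scored well initially but disagrees with the other references receives a low consensus score and drops in the new ranking. The returned ranking could be fed back in as the next set of references to iterate.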
Iterative Ranking Randomly selecting five reference models seems to work Wang and Cheng, 2011
Model Refinement by Model Combination Model ranking Select top 5 models as seed models Structure comparison Identify similar models or fragments
Model Combination and Averaging Average Advantage: reduce variance of modeling – maximize likelihood
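A minimal sketch of coordinate averaging, assuming the selected models have already been superimposed and list the same atoms in the same order (both are assumptions; the slide does not give these details). `average_models` is a hypothetical helper name.

```python
def average_models(models):
    """Average 3D coordinates across superimposed models.

    models: list of models, each a list of (x, y, z) tuples with the
    same atoms in the same order (assumed to be pre-superimposed).
    """
    if not models:
        raise ValueError("need at least one model")
    n_atoms = len(models[0])
    if any(len(m) != n_atoms for m in models):
        raise ValueError("all models must have the same number of atoms")
    averaged = []
    for atom_idx in range(n_atoms):
        coords = [m[atom_idx] for m in models]
        # Mean of each coordinate axis across models.
        averaged.append(tuple(sum(c[k] for c in coords) / len(models)
                              for k in range(3)))
    return averaged
```

Under independent Gaussian noise on each model's coordinates, the per-atom mean is the maximum-likelihood estimate of the position, which is one way to read the "reduce variance, maximize likelihood" remark.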
CASP9 Top 20 Servers
CASP9 Top 20 Servers on Ab Initio Targets
Some High-Quality CASP Predictions T0390 GDT=0.90 T0426 GDT=0.97 T0432 GDT=0.92 T0458 GDT=0.97 Orange: structure; Green: model. 50 of 120 CASP8 targets were predicted with high accuracy (RMSD < 2 Å). Wang et al., 2010
Modeling Gene Regulation Process by Mining RNA-Seq Data: tens of thousands of genes; gene expression is regulated; genes tend to function in groups; regulators and targets. Hasty et al., 2001
Gene Regulatory Network Modeling (RNA-Seq, Microarray) Zhu et al., in preparation
RNA-Seq Data Processing Steps: isolate RNA; prepare an RNA library; RNA sequencing by NGS; reads mapping; quantification and analysis. Pepke et al., 2009
RNA-Seq Data Mapping: un-mapped reads; ambiguous reads; biological variance versus technology variance. Tools: TopHat, Bowtie. Haas & Zody, 2010
Construct Gene Expression Profiles: count the number of reads mapped to each gene; normalize counts into quantitative values by gene length and total number of reads. Tools: Cufflinks, HTSeq, MULTICOM. RPKM: reads per kilobase per million mapped reads
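The normalization on this slide follows the standard RPKM definition: reads for a gene, divided by gene length in kilobases and by total mapped reads in millions. A minimal sketch, with `rpkm` as a hypothetical helper name and illustrative numbers:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total reads must be positive")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

For instance, 500 reads on a 2,000 bp gene in a library of 10,000,000 mapped reads gives 500 / 2 kb / 10 million = 25 RPKM.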
Mapping Results of Mouse Transcriptome Perturbed by Drug-like Compounds (Li et al., 2011)
Read file   Unique-site reads   Multi-site reads   Unmapped reads   Filtered-out reads   Total reads   Genes with reads   Min, Ave, Max RPKM
1           10,270,347          1,877,368          3,828,691        23,594               16,000,000    16,…               …, 16, 5,…
2           …,004,067           1,730,566          3,239,988        25,379               16,000,000    16,…               …, 14, 4,…
3           …,493,573           1,830,889          3,656,907        18,631               16,000,000    16,…               …, 15, 4,…
4           …,631,200           1,758,813          3,583,891        26,096               16,000,000    16,…               …, 14, 3,…
5           …,801,196           1,670,915          3,501,804        26,085               16,000,000    16,…               …, 14, 4,…
6           …,855,871           1,626,255          3,492,681        25,193               16,000,000    16,…               …, 14, 2,…
7           …,909,219           1,675,797          3,389,335        25,649               16,000,000    16,…               …, 14, 3,…
8           …,431,985           1,572,627          3,968,976        26,412               16,000,000    16,…               …, 14, 3,052
Identify Differentially Expressed Genes T-test (BioConductor) Negative binomial distribution (edgeR) Poisson distribution (DEGseq)
Differential Expression Analysis Li et al., 2011
Scatter Plot of Expression Values Li et al., 2011
Costa, 2010
Expression Profiles of Genes in Multiple Conditions: a matrix with rows Gene 1, Gene 2, Gene 3, Gene 4, … and columns Con 1, Con 2, Con 3, Con 4, Con 5, Con 6, Con 7, Con 8
Gene Regulatory Network A cluster of genes having similar expression profiles Several regulators whose expression can explain the expression of the cluster of genes Segal et al., Nature Genetics, 2003
Expectation Maximization Approach: generate initial clusters using K-means; recursively select TFs to construct a decision tree that maximizes likelihood; reassign each gene to the tree that maximizes its likelihood; if the likelihood increased, repeat; otherwise stop
Regression Tree Construction Pick a TF Divide conditions into two subsets based on TF states Calculate mean and std Calculate likelihood of each expression value Select TF maximizing likelihood Repeat Zhu et al., in preparation
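One split of the regression tree described above can be sketched as below. This is an illustration under assumptions, not the authors' implementation: TF states are taken to be pre-discretized booleans per condition, each side of the split is modeled with a single Gaussian, and `min_std` is a hypothetical floor that avoids zero variance.

```python
import math


def gaussian_loglik(values, min_std=1e-3):
    """Log-likelihood of values under a Gaussian fitted to them."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = max(math.sqrt(var), min_std)
    return sum(-0.5 * math.log(2 * math.pi * std ** 2)
               - (v - mean) ** 2 / (2 * std ** 2) for v in values)


def best_split(cluster_expr, tf_states):
    """Pick the TF whose on/off split of conditions maximizes likelihood.

    cluster_expr: list of genes, each a list of expression values per condition.
    tf_states: dict mapping TF name -> list of booleans per condition
               (True = TF 'on'; the discretization is an assumption).
    """
    best_tf, best_ll = None, -math.inf
    for tf, states in tf_states.items():
        on = [g[c] for g in cluster_expr for c, s in enumerate(states) if s]
        off = [g[c] for g in cluster_expr for c, s in enumerate(states) if not s]
        if not on or not off:
            continue  # this TF does not divide the conditions
        ll = gaussian_loglik(on) + gaussian_loglik(off)
        if ll > best_ll:
            best_tf, best_ll = tf, ll
    return best_tf, best_ll
```

Applying `best_split` again to each resulting subset of conditions would grow the tree one level at a time, matching the "Repeat" step.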
Gene Re-Clustering Find a path from root to leaf according to gene’s condition Calculate the likelihood of its expression value Assign a gene into a tree maximizing its likelihood
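The re-clustering step can be sketched as follows, assuming a simple hypothetical tree encoding: an internal node is a dict with keys `tf`, `on`, and `off`, a leaf is a dict with `mean` and `std`, and TF states are booleans per condition. None of these names come from the source.

```python
import math


def leaf_for_condition(tree, tf_states, cond):
    """Follow the path from root to leaf for one condition."""
    node = tree
    while "tf" in node:
        node = node["on"] if tf_states[node["tf"]][cond] else node["off"]
    return node


def gene_loglik(expr, tree, tf_states):
    """Sum of Gaussian log-likelihoods of a gene's values at its leaves."""
    total = 0.0
    for cond, value in enumerate(expr):
        leaf = leaf_for_condition(tree, tf_states, cond)
        mean, std = leaf["mean"], leaf["std"]
        total += (-0.5 * math.log(2 * math.pi * std ** 2)
                  - (value - mean) ** 2 / (2 * std ** 2))
    return total


def assign_gene(expr, trees, tf_states):
    """Return the index of the tree under which the gene is most likely."""
    return max(range(len(trees)),
               key=lambda i: gene_loglik(expr, trees[i], tf_states))
```

Running `assign_gene` over every gene and then rebuilding the trees gives one round of the expectation-maximization loop.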
Construction of Gene Regulatory Networks Li et al., 2011 Genes Conditions Transcription factors Function Annotations
Validation of Gene Regulatory Networks Function validation (e.g. oxidation reduction) DNA binding site validation (e.g. TF binds genes) Literature validation (e.g. TF regulates condition) Experimental validation (e.g. to do) BetaBetaAlpha – Zinc Finger Zhu et al., in preparation
Modeling, Information Condensation and Knowledge Discovery Illumina
Acknowledgements Group Members Badri Adhikari Debswapna Bhattacharya Renzhi Cao Xin Deng Jesse Eickholt Jilong Li Zheng Wang Mingzhu Zhu Collaborators MU Botanical Center, Soybean Research Groups