1
Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data. Jianlin Jack Cheng, PhD, Computer Science Department, University of Missouri, Columbia. March 7, 2012
2
Two Challenges: Protein Structure Modeling; Gene Regulatory Network Modeling
3
The Genomic Era Collins, Venter, Human Genome, 2000
4
Sequencing Revolution $1000 Personal Genome in 2010s Transcriptome Proteome
5
Genome Implications to Information Sciences and Life Sciences Elements and Systems
6
Growth of Protein Sequences AGCWY…
7
Growth of Protein Structures in PDB
8
Computational Protein Structure Folding / Prediction: Structure = f(sequence)? E = mc^2
9
Template-Based Approach: target protein (MWLKKFGINKH…); Protein Data Bank; fold recognition; template; alignment. Protein structure space is limited! (Chothia, Nature, 1992) Protein sequence space is astronomical!
10
Fisher, 2005 Modeller
11
Template-Free Protein Structure Prediction http://pubs.acs.org/subscribe/archive/mdd/v03/i09/html/willis.html
12
Template-Free Approach. Sampling: MCMC and Simulated Annealing. Input sequence MWLKKFGINLLIGQSV…; simulation; select structure with minimum free energy
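Illustrative sketch (not from the slides): the sampling idea on this slide, simulated annealing with a Metropolis acceptance rule, can be shown on a toy problem. The one-dimensional `toy_energy` function, the geometric cooling schedule, and all parameter values are assumptions made for illustration; a real template-free predictor samples protein conformations under a far richer energy function.

```python
import math
import random


def toy_energy(x):
    # Hypothetical 1-D "energy landscape" with two wells; it stands in for
    # a protein energy function and is NOT a physical model.
    return (x * x - 1.0) ** 2 + 0.3 * x


def simulated_annealing(energy, x0, steps=5000, t_start=2.0, t_end=0.01,
                        step_size=0.5, seed=0):
    """Return (best_x, best_energy) found by a Metropolis random walk."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        cand = x + rng.uniform(-step_size, step_size)
        e_cand = energy(cand)
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-dE / T).
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    print(simulated_annealing(toy_energy, x0=3.0))
```

The high starting temperature lets the walk cross the barrier between the two wells; as the temperature drops, the walk settles into low-energy states, and the lowest-energy state seen is reported.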
13
Major Challenges in Protein Structure Prediction: Select best templates? Generate best alignments? Generate best models? Select best models? Finding a needle in a haystack!
14
Major Challenges in Protein Structure Prediction Select best templates? Generate best alignments? Generate best models? Select best models? Wang, Eickholt, Cheng, Bioinformatics, 2010
15
A Conformation Ensemble Approach P(conformation) P(-energy) Conformation Distribution Maximum Likelihood & Maximum a Posteriori Brooks et al., 2001
16
New Views on Protein Modeling: The protein structure modeling problem is simply a grand computational and statistical sampling problem. Random sampling (template-free); targeted sampling (template-based)
17
A Unified Protein Structure Prediction Pipeline (Wang et al., Bioinformatics, 2010)
Input: query MARTCRKE…
1. Template Ranking. Alignments:
   Query-Template 1: MARTCRKEGAP-WY… / Y-RMH-RDGM-MWT…
   Query-Template 2: MAR-TCRK-EGAPWY… / TAKMTHK-DEGFGYW…
   ......
2. Multiple-Template Combination:
   MAR-TCRK-EGAP-WY…
   Y-R-MH-R-DGM-MWT…
   TAKMTHK-DEGFG-YW…
   ......
3. Model Generation
4. Evaluation & Refinement
Output
18
Sampling in Alignment and Fold Space: PSI-BLAST (sequence-profile); SAM (sequence-HMM); HMMer (sequence-HMM); Compass (profile-profile); HHSearch (HMM-HMM); PRC (HMM-HMM); FOLDpro (machine learning); MSACompro (profile-profile). Cheng, Baldi, Bioinformatics, 2006; Deng, Cheng, BMC Bioinformatics, 2011
19
Multi-Template Combination in Template and Alignment Space (Cheng, BMC Structural Biology, 2008)
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL FGLMGN LSSWVGA (10^-80)
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG (10^-70)
Temp3 QGTARDRAWQLEVERHRAQGTSASFL (10^-10)
Temp4 AANQLDAMRALGYAQERYFEMDLMRRAPAGELSELFGAKAVDLK (10^-5)
20
Multi-Template Combination in Template and Alignment Space
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL------FGLMGN------LSSWVGA----- (10^-80)
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG--------------------------------- (10^-70)
Temp3 ---------------------------ARDRAWQLEVERHRAQGTSASFL---------- (10^-10)
Temp4 ----------------------------------------------------GAKAVDLK (10^-5)
21
Cheng, BMC Structural Biology, 2008. Advantage: reduce variance of modeling
22
Multi-Template vs. Single-Top-Template: improved 38 of 45 targets; improvement of 6.8%; P-value < 10^-4. Cheng, BMC Structural Biology, 2008
23
Combination of Template-Free and Template-Based Sampling. Protein Modeling Spectrum: 100% TBM; 50% TBM + 50% FM; 100% FM
24
Recursive Protein Modeling: Integrate TBM and FM (Cheng et al., 2011). Initial region decomposition (divide & conquer). Model aligned / certain regions by TBM. Model unaligned / uncertain regions by FM (conditional sampling; keep certain regions / core fixed). Compose TBM, FM components into larger certain components. Satisfactory? No: repeat. Yes: finish. Increase fitness & reduce bias.
25
Recursive Modeling Mimics Protein Folding Cascade ks.uiuc.edu
26
Core-Constrained Tail Refinement
27
Template-Based + Template-Free & Recursive Modeling (CASP9) Cheng et al., 2011
28
Insights: A Bayesian Approach. Incorporate prior information: template-based region. Conditional sampling: use certain regions to constrain uncertain regions. Reduce uncertainty gradually. Iteratively optimize the conformation.
29
Model Selection: single model approach; ensemble approach. Wang et al., Proteins, 2009; Cheng et al., Proteins, 2009; Wang et al., Bioinformatics, 2011
30
Model Quality Evaluation: select top 5 ranked models as references
31
Model Quality Assessment: compare each model with the top five reference models; average global quality; re-rank models (+10%). Cheng et al., Proteins, 2009; Wang and Cheng, 2011
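Illustrative sketch (not from the slides): the re-ranking step can be written down once a pairwise structural similarity between models is available. The function name `rerank_by_references`, the precomputed `similarity` matrix (for example GDT-TS-like scores in [0, 1]), and the choice to exclude a model from its own reference set are assumptions made for illustration.

```python
def rerank_by_references(initial_scores, similarity, n_ref=5):
    """Re-rank models by their average similarity to the top-ranked references.

    initial_scores: one preliminary quality score per model (higher is better).
    similarity: square matrix, similarity[i][j] = structural similarity of
                model i and model j (assumed precomputed by the caller).
    Returns (ranking, new_scores); ranking lists model indices, best first.
    """
    if n_ref < 2:
        raise ValueError("need at least two reference models")
    n = len(initial_scores)
    order = sorted(range(n), key=lambda i: initial_scores[i], reverse=True)
    refs = order[:n_ref]

    new_scores = []
    for i in range(n):
        # A reference model is not compared with itself.
        others = [r for r in refs if r != i]
        new_scores.append(sum(similarity[i][r] for r in others) / len(others))

    ranking = sorted(range(n), key=lambda i: new_scores[i], reverse=True)
    return ranking, new_scores


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.2]
    sim = [
        [1.0, 0.5, 0.9, 0.2],
        [0.5, 1.0, 0.9, 0.3],
        [0.9, 0.9, 1.0, 0.4],
        [0.2, 0.3, 0.4, 1.0],
    ]
    print(rerank_by_references(scores, sim, n_ref=2))
```

In the toy data, model 2 has a poor initial score but resembles both references, so it moves to the top after re-ranking.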
32
Iterative Ranking Randomly selecting five reference models seems to work Wang and Cheng, 2011
33
Model Refinement by Model Combination: model ranking; select top 5 models as seed models; structure comparison; identify similar models or fragments
34
Model Combination and Averaging. Advantage: reduce variance of modeling, maximize likelihood
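Illustrative sketch (not from the slides): the simplest form of model averaging takes the per-atom mean of the coordinates of several models. This assumes the models have already been superimposed and have the same atoms in the same order; `average_models` is a hypothetical helper, and the averaged coordinates would normally need further refinement because a plain mean can distort bond geometry.

```python
def average_models(models):
    """Average the coordinates of pre-superimposed models.

    models: list of models; each model is a list of (x, y, z) tuples with
            the same atoms in the same order (assumed by this sketch).
    Returns one list of (x, y, z) tuples holding the per-atom mean.
    """
    if not models:
        raise ValueError("no models given")
    n_atoms = len(models[0])
    if any(len(m) != n_atoms for m in models):
        raise ValueError("all models must have the same number of atoms")

    n_models = len(models)
    return [
        tuple(sum(m[atom][axis] for m in models) / n_models for axis in range(3))
        for atom in range(n_atoms)
    ]


if __name__ == "__main__":
    model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    model_b = [(0.0, 2.0, 0.0), (3.0, 0.0, 2.0)]
    print(average_models([model_a, model_b]))
```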
35
CASP9 Top 20 Servers http://predictioncenter.org/casp9/
36
CASP9 Top 20 Servers on Ab Initio Targets http://predictioncenter.org/casp9/
37
Some High-Quality CASP Predictions: T0390 GDT=0.90; T0426 GDT=0.97; T0432 GDT=0.92; T0458 GDT=0.97. Orange: structure; Green: model. 50 of 120 CASP8 targets were modeled with high accuracy (RMSD < 2 Å). Wang et al., 2010
38
Modeling Gene Regulation Process by Mining RNA-Seq Data: tens of thousands of genes; gene expression is regulated; genes tend to function in groups; regulators and targets. Hasty et al., 2001
39
Gene Regulatory Network Modeling (RNA-Seq, Microarray) Zhu et al., in preparation
40
RNA-Seq Data Processing Steps: isolate RNA; prepare an RNA library; RNA sequencing by NGS; read mapping; quantification and analysis. Pepke et al., 2009
41
RNA-Seq Data Mapping: unmapped reads; ambiguous reads; biological variance versus technology variance. Tools: TopHat, Bowtie. Haas & Zody, 2010
42
Construct Gene Expression Profiles: count the number of reads mapped to each gene; normalize counts into quantitative values by length of genes and total number of reads. Tools: Cufflinks, HTSeq, MULTICOM. RPKM: reads per kb per million reads
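Illustrative sketch (not from the slides): RPKM divides a gene's read count by the gene length in kilobases and by the total read count in millions, i.e. RPKM = reads × 10^9 / (gene length in bp × total reads). The function names and the example numbers below are assumptions made for illustration and do not reproduce what Cufflinks, HTSeq, or MULTICOM compute internally.

```python
def rpkm(read_count, gene_length_bp, total_reads):
    """Reads per kilobase of gene length per million reads."""
    if gene_length_bp <= 0 or total_reads <= 0:
        raise ValueError("gene length and total reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_reads)


def expression_profile(counts, lengths_bp, total_reads):
    """Convert {gene: read count} into {gene: RPKM}.

    counts and lengths_bp are dictionaries keyed by the same gene names
    (hypothetical inputs for this sketch).
    """
    return {
        gene: rpkm(count, lengths_bp[gene], total_reads)
        for gene, count in counts.items()
    }


if __name__ == "__main__":
    # 500 reads on a 2 kb gene in a library of 10 million reads.
    print(rpkm(500, 2000, 10_000_000))
```

Normalizing by length keeps long genes from looking more highly expressed just because they collect more reads, and normalizing by library size makes samples with different sequencing depth comparable.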
43
Mapping Results of Mouse Transcriptome Perturbed by Drug-like Compounds (Li et al., 2011)
Read file | Num of reads mapped to unique site | Num of reads mapped to multiple sites | Num of unmapped reads | Num of filtered-out reads | Total number of reads | Num of genes having reads | Min, Ave, Max of gene expression value (RPKM)
1 | 10,270,347 | 1,877,368 | 3,828,691 | 23,594 | 16,000,000 | 16,271 | 0.00019, 16, 5,188
2 | 11,004,067 | 1,730,566 | 3,239,988 | 25,379 | 16,000,000 | 16,543 | 0.000082, 14, 4,973
3 | 10,493,573 | 1,830,889 | 3,656,907 | 18,631 | 16,000,000 | 16,439 | 0.00019, 15, 4,104
4 | 10,631,200 | 1,758,813 | 3,583,891 | 26,096 | 16,000,000 | 16,522 | 0.00015, 14, 3,197
5 | 10,801,196 | 1,670,915 | 3,501,804 | 26,085 | 16,000,000 | 16,875 | 0.00045, 14, 4,086
6 | 10,855,871 | 1,626,255 | 3,492,681 | 25,193 | 16,000,000 | 16,862 | 0.00034, 14, 2,040
7 | 10,909,219 | 1,675,797 | 3,389,335 | 25,649 | 16,000,000 | 16,947 | 0.00054, 14, 3,474
8 | 10,431,985 | 1,572,627 | 3,968,976 | 26,412 | 16,000,000 | 16,946 | 0.00046, 14, 3,052
44
Identify Differentially Expressed Genes: t-test (Bioconductor); Poisson distribution (DEGseq); negative binomial distribution (edgeR)
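Illustrative sketch (not from the slides): of the approaches listed, the t-test is the easiest to show in a few lines. The version below is Welch's unequal-variance t-test computed with the standard library; the function name `welch_t` and the idea of applying it to replicate expression values of one gene under two conditions are assumptions made for illustration. It does not reproduce the count-based models used by edgeR or DEGseq.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom for two samples.

    group_a, group_b: replicate expression values of one gene under two
    conditions (at least two values each, not all identical in both groups).
    """
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard errors of the two group means.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


if __name__ == "__main__":
    print(welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

A large absolute t value for a gene suggests that its mean expression differs between the two conditions; a p-value would follow from the t distribution with the returned degrees of freedom.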
45
Differential Expression Analysis Li et al., 2011
46
Scatter Plot of Expression Values Li et al., 2011
47
Costa, 2010
48
Expression Profiles of Genes in Multiple Conditions
       | Con 1 | Con 2 | Con 3 | Con 4 | Con 5 | Con 6 | Con 7 | Con 8
Gene 1 | 10 | 30 | 40 | 35 | 20 | 100 | 5 | 60
Gene 2
Gene 3
Gene 4
….
49
Gene Regulatory Network: a cluster of genes having similar expression profiles; several regulators whose expression can explain the expression of the cluster of genes. Segal et al., Nature Genetics, 2003
50
Expectation Maximization Approach: generate initial clusters using K-means; recursively select TFs to construct a decision tree that maximizes likelihood; reassign each gene to the tree that maximizes its likelihood. Likelihood increased? Yes: repeat. No: stop.
51
Regression Tree Construction: pick a TF; divide conditions into two subsets based on TF states; calculate mean and std; calculate likelihood of each expression value; select the TF maximizing likelihood; repeat. Zhu et al., in preparation
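Illustrative sketch (not from the slides): one split of the regression tree can be written as a search over candidate TFs. Each TF divides the conditions into two subsets by its state, a Gaussian (mean and std) is fitted to the cluster's expression values in each subset, and the TF with the highest total log-likelihood wins. The boolean up/down TF states, the function names, and the `min_std` floor are assumptions made for illustration and are not taken from Zhu et al.

```python
import math
import statistics


def gaussian_loglik(values, min_std=1e-3):
    """Log-likelihood of values under a Gaussian fitted to those values."""
    mu = statistics.mean(values)
    # Floor the std so a constant subset does not give an infinite likelihood.
    sd = max(statistics.pstdev(values), min_std)
    return sum(
        -0.5 * math.log(2 * math.pi * sd * sd) - (v - mu) ** 2 / (2 * sd * sd)
        for v in values
    )


def split_loglik(cluster_expr, tf_state):
    """Total log-likelihood after splitting conditions by one TF's state.

    cluster_expr: rows are genes in the cluster, columns are conditions.
    tf_state: one boolean per condition (True = TF up), an assumed encoding.
    """
    up = [row[c] for row in cluster_expr for c, s in enumerate(tf_state) if s]
    down = [row[c] for row in cluster_expr for c, s in enumerate(tf_state) if not s]
    if not up or not down:
        return float("-inf")  # the TF does not split the conditions
    return gaussian_loglik(up) + gaussian_loglik(down)


def best_tf(cluster_expr, tf_states):
    """Name of the TF whose split maximizes the likelihood."""
    return max(tf_states, key=lambda name: split_loglik(cluster_expr, tf_states[name]))


if __name__ == "__main__":
    expr = [[1.0, 1.2, 5.0, 5.1], [0.9, 1.1, 4.9, 5.2]]
    states = {"TF_A": [False, False, True, True], "TF_B": [True, False, True, False]}
    print(best_tf(expr, states))
```

Repeating the same selection inside each subset of conditions grows the tree one level at a time.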
52
Gene Re-Clustering: find a path from root to leaf according to the gene's condition; calculate the likelihood of its expression value; assign the gene to the tree maximizing its likelihood
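Illustrative sketch (not from the slides): re-clustering can be expressed as scoring a gene against every tree and keeping the best one. For each condition, the TF states route the gene from the root to a leaf, the leaf's Gaussian gives the likelihood of the gene's expression value, and the log-likelihoods are summed over conditions. The nested-dictionary tree layout (`"tf"`, `"up"`, `"down"`, `"mean"`, `"std"`) and the function names are assumptions made for illustration.

```python
import math


def leaf_for_condition(tree, tf_states, cond):
    """Walk from the root to a leaf using the TF states of one condition."""
    node = tree
    while "tf" in node:  # internal nodes name the TF they split on
        node = node["up"] if tf_states[node["tf"]][cond] else node["down"]
    return node


def gene_loglik(tree, gene_expr, tf_states):
    """Sum over conditions of the Gaussian log-likelihood at the reached leaf."""
    total = 0.0
    for cond, value in enumerate(gene_expr):
        leaf = leaf_for_condition(tree, tf_states, cond)
        mu, sd = leaf["mean"], leaf["std"]
        total += -0.5 * math.log(2 * math.pi * sd * sd) - (value - mu) ** 2 / (2 * sd * sd)
    return total


def assign_gene(trees, gene_expr, tf_states):
    """Index of the tree under which the gene's profile is most likely."""
    return max(range(len(trees)), key=lambda t: gene_loglik(trees[t], gene_expr, tf_states))


if __name__ == "__main__":
    tf_states = {"TF_A": [False, False, True, True]}
    trees = [
        {"tf": "TF_A", "up": {"mean": 5.0, "std": 0.5}, "down": {"mean": 1.0, "std": 0.5}},
        {"mean": 3.0, "std": 0.5},  # a tree consisting of a single leaf
    ]
    print(assign_gene(trees, [1.1, 0.9, 5.2, 4.8], tf_states))
```

Alternating this assignment step with tree reconstruction gives the iterative loop described on the Expectation Maximization slide.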
53
Construction of Gene Regulatory Networks Li et al., 2011 Genes Conditions Transcription factors Function Annotations
54
Validation of Gene Regulatory Networks: function validation (e.g. oxidation reduction); DNA binding site validation (e.g. TF binds genes); literature validation (e.g. TF regulates condition); experimental validation (e.g. to do). Beta-Beta-Alpha Zinc Finger. Zhu et al., in preparation
55
Modeling, Information Condensation and Knowledge Discovery. Illumina
56
Acknowledgements. Group Members: Badri Adhikari, Debswapna Bhattacharya, Renzhi Cao, Xin Deng, Jesse Eickholt, Jilong Li, Zheng Wang, Mingzhu Zhu. Collaborators: MU Botanical Center, Soybean Research Groups