Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data Jianlin Jack Cheng, PhD Computer Science Department University.

1 Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data Jianlin Jack Cheng, PhD Computer Science Department University of Missouri, Columbia March 7, 2012

2 Two Challenges  Protein Structure Modeling  Gene Regulatory Network Modeling

3 The Genomic Era Collins, Venter, Human Genome, 2000

4 Sequencing Revolution $1000 Personal Genome in 2010s Transcriptome Proteome

5 Genome Implications to Information Sciences and Life Sciences Elements and Systems

6 Growth of Protein Sequences AGCWY…

7 Growth of Protein Structures in PDB

8 Computational Protein Structure Folding / Prediction Structure = f(sequence)? E = mc^2

9 Template-Based Approach MWLKKFGINKH… Protein Data Bank Fold Recognition Alignment Template Target protein Protein structure space is limited! Chothia, Nature,1992 Protein sequence space is astronomical!

10 Fisher, 2005 Modeller

11 Template-Free Protein Structure Prediction http://pubs.acs.org/subscribe/archive/mdd/v03/i09/html/willis.html

12 Template-Free Approach Sampling: MCMC and Simulated Annealing MWLKKFGINLLIGQSV… …… Select structure with minimum free energy Simulation
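
The sampling loop named on this slide can be sketched as follows. This is a minimal illustration of simulated annealing with the Metropolis acceptance rule, not the method used in the talk: `toy_energy`, the angle representation, the cooling schedule, and all parameter values are assumptions chosen only to keep the example runnable.

```python
import math
import random

def toy_energy(angles):
    """Hypothetical energy: lowest when every angle equals 1.0 (not a real force field)."""
    return sum((a - 1.0) ** 2 for a in angles)

def simulated_annealing(n_angles=10, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a low-energy set of angles; returns (best_angles, best_energy)."""
    rng = random.Random(seed)
    current = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    current_e = toy_energy(current)
    best, best_e = list(current), current_e
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Propose a small random change to one angle.
        candidate = list(current)
        i = rng.randrange(n_angles)
        candidate[i] += rng.gauss(0.0, 0.3)
        cand_e = toy_energy(candidate)
        # Metropolis criterion: always accept downhill moves, sometimes accept uphill ones.
        if cand_e <= current_e or rng.random() < math.exp((current_e - cand_e) / t):
            current, current_e = candidate, cand_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e
```

Accepting occasional uphill moves at high temperature lets the search escape local minima; as the temperature falls, the sampler settles into the lowest-energy region it has found.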

13 Major Challenges in Protein Structure Prediction Select best templates? Generate best alignments? Generate best models? Select best models? Find a needle in a haystack!

14 Major Challenges in Protein Structure Prediction Select best templates? Generate best alignments? Generate best models? Select best models? Wang, Eickholt, Cheng, Bioinformatics, 2010

15 A Conformation Ensemble Approach P(conformation) P(-energy) Conformation Distribution Maximum Likelihood & Maximum a Posteriori Brooks et al., 2001

16 New Views on Protein Modeling Protein structure modeling problem is simply a grand computational and statistical sampling problem.  Random sampling (template-free)  Targeted sampling (template-based)

17 A Unified Protein Structure Prediction Pipeline
Input: Query MARTCRKE…
1. Template Ranking: Alignments
   Query-Template 1: MARTCRKEGAP-WY… / Y-RMH-RDGM-MWT…
   Query-Template 2: MAR-TCRK-EGAPWY… / TAKMTHK-DEGFGYW…
   ......
2. Multiple-Template Combination
   MAR-TCRK-EGAP-WY… / Y-R-MH-R-DGM-MWT… / TAKMTHK-DEGFG-YW… / ......
3. Model Generation
4. Evaluation & Refinement
Output
Wang et al., Bioinformatics, 2010

18 Sampling in Alignment and Fold Space PSI-BLAST (sequence – profile) SAM (sequence – HMM) HMMer (sequence – HMM) Compass (profile – profile) HHSearch (HMM - HMM) PRC (HMM-HMM) FOLDpro (machine learning) MSACompro (profile-profile) Cheng, Baldi, Bioinformatics, 2006 Deng, Cheng, BMC Bioinformatics, 2011

19 Multi-Template Combination in Template and Alignment Space
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL FGLMGN LSSWVGA (10^-80)
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG (10^-70)
Temp3 QGTARDRAWQLEVERHRAQGTSASFL (10^-10)
Temp4 AANQLDAMRALGYAQERYFEMDLMRRAPAGELSELFGAKAVDLK (10^-5)
Cheng, BMC Structural Biology, 2008

20 Multi-Template Combination in Template and Alignment Space
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL------FGLMGN------LSSWVGA----- (10^-80)
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG--------------------------------- (10^-70)
Temp3 ---------------------------ARDRAWQLEVERHRAQGTSASFL---------- (10^-10)
Temp4 ----------------------------------------------------GAKAVDLK (10^-5)

21 Cheng, BMC Structural Biology, 2008 Advantage: reduce variance of modeling

22 Multi-Template vs. Single-Top-Template Improve 38 / 45 targets Improvement by 6.8% P-value < 10^-4 Cheng, BMC Structural Biology, 2008

23 Combination of Template-Free and Template-Based Sampling Protein Modeling Spectrum: 100% TBM | 50% TBM + 50% FM | 100% FM

24 Recursive Protein Modeling – Integrate TBM and FM
Initial Region Decomposition
Model aligned / certain regions by TBM
Model unaligned / uncertain regions by FM
Keep certain regions / core fixed
Compose TBM, FM components into larger certain components
Satisfactory? If no, repeat; if yes, stop.
Divide & Conquer; Conditional Sampling; Increase fitness & reduce bias
Cheng et al., 2011

25 Recursive Modeling Mimics Protein Folding Cascade ks.uiuc.edu

26 Core-Constrained Tail Refinement

27 Template-Based + Template-Free & Recursive Modeling (CASP9) Cheng et al., 2011

28 Insights – A Bayesian Approach Incorporate prior information: template- based region Conditional sampling: use certain regions to constrain uncertain regions Reduce uncertainty gradually Iteratively optimize the conformation

29 Model Selection Single model approach Ensemble approach Wang et al., Proteins, 2009 Cheng et al., Proteins, 2009 Wang et al., Bioinformatics, 2011

30 Model Quality Evaluation Select top 5 ranked models as references......

31 Model Quality Assessment Compare each model with the top five reference models Average global quality Re-rank models (+10%) Cheng et al., Proteins, 2009 Wang and Cheng, 2011
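
The re-ranking step described here can be sketched as below. This is a rough illustration under stated assumptions, not the published procedure: models are assumed to be pre-superimposed CA traces of equal length, `gdt_like` is a simplified GDT-TS-style score that skips the superposition search, and the function names and data layout are hypothetical.

```python
import math

def gdt_like(model_a, model_b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS-style score (0 to 1) for two pre-superimposed CA traces."""
    dists = [math.dist(p, q) for p, q in zip(model_a, model_b)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return sum(fractions) / len(cutoffs)

def rerank_by_references(models, initial_scores, n_refs=5):
    """Re-rank models by their average similarity to the top n_refs initially ranked models.

    models: dict mapping model name -> list of (x, y, z) tuples
    initial_scores: dict mapping model name -> initial quality score (higher is better)
    """
    order = sorted(models, key=lambda name: initial_scores[name], reverse=True)
    refs = order[:n_refs]
    consensus = {}
    for name, coords in models.items():
        # A reference model is not compared with itself.
        others = [r for r in refs if r != name]
        if others:
            consensus[name] = sum(gdt_like(coords, models[r]) for r in others) / len(others)
        else:
            consensus[name] = 0.0
    ranking = sorted(consensus, key=consensus.get, reverse=True)
    return ranking, consensus
```

A model that disagrees with most of the references ends up with a low average score, so an outlier that was ranked highly by the initial scoring tends to drop after re-ranking.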

32 Iterative Ranking Randomly selecting five reference models seems to work Wang and Cheng, 2011

33 Model Refinement by Model Combination Model ranking Select top 5 models as seed models Structure comparison Identify similar models or fragments

34 Model Combination and Averaging Average Advantage: reduce variance of modeling – maximize likelihood
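
A minimal sketch of the averaging step, assuming the selected models have already been superimposed and have the same number of residues. The function name and the coordinate layout are illustrative assumptions, not taken from the talk.

```python
def average_models(models):
    """Average per-residue coordinates across pre-superimposed models of equal length.

    models: list of models, each a list of (x, y, z) tuples
    """
    n_models = len(models)
    n_residues = len(models[0])
    if any(len(m) != n_residues for m in models):
        raise ValueError("all models must have the same number of residues")
    averaged = []
    for i in range(n_residues):
        # Average each of the three coordinates of residue i over all models.
        averaged.append(tuple(sum(m[i][k] for m in models) / n_models for k in range(3)))
    return averaged
```

If the errors of the individual models are roughly independent, averaging n of them shrinks the variance of each coordinate by about a factor of n, which is the variance-reduction advantage the slide refers to.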

35 CASP9 Top 20 Servers http://predictioncenter.org/casp9/

36 CASP9 Top 20 Servers on Ab Initio Targets http://predictioncenter.org/casp9/

37 Some High-Quality CASP Predictions T0390 GDT=0.90 T0426 GDT=0.97 T0432 GDT=0.92 T0458 GDT=0.97 Orange: structure; Green: model 50 of 120 CASP8 targets were predicted with high accuracy (RMSD < 2 Å) Wang et al., 2010

38 Modeling Gene Regulation Process by Mining RNA-Seq Data Tens of thousands of genes Expression of gene is regulated Genes tend to function in groups Regulators and targets Hasty et al., 2001

39 Gene Regulatory Network Modeling (RNA-Seq, Microarray) Zhu et al., in preparation

40 RNA-Seq Data Processing Steps Isolate RNA Prepare an RNA library RNA sequencing by NGS Read mapping Quantification and analysis Pepke et al., 2009

41 RNA-Seq Data Mapping Un-mapped reads Ambiguous reads Biological variance versus technology variance Tools: TopHat, Bowtie Haas & Zody, 2010

42 Construct Gene Expression Profiles Count the number of reads mapped to each gene Normalize counts into quantitative values by length of genes and total number of reads Tools: Cufflinks, HTSeq, MULTICOM RPKM - reads per kilobase per million mapped reads
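
The normalization described on this slide is the standard RPKM formula, RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the gene, N is the total number of mapped reads, and L is the gene length in base pairs. A short sketch follows; the function names and the input layout are illustrative assumptions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def expression_profile(gene_counts, gene_lengths, total_mapped_reads):
    """Convert raw per-gene read counts into RPKM values.

    gene_counts: dict mapping gene name -> number of reads mapped to that gene
    gene_lengths: dict mapping gene name -> gene length in base pairs
    """
    return {
        gene: rpkm(count, gene_lengths[gene], total_mapped_reads)
        for gene, count in gene_counts.items()
    }
```

For example, a 2,000 bp gene with 500 mapped reads in a library of 10 million mapped reads has an RPKM of 500 × 10^9 / (2,000 × 10^7) = 25.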

43 Mapping Results of Mouse Transcriptome Perturbed by Drug-like Compounds
Read file | Reads mapped to unique site | Reads mapped to multiple sites | Unmapped reads | Filtered-out reads | Total reads | Genes having reads | Min, Ave, Max of gene expression value (RPKM)
1 | 10,270,347 | 1,877,368 | 3,828,691 | 23,594 | 16,000,000 | 16,271 | 0.00019, 16, 5,188
2 | 11,004,067 | 1,730,566 | 3,239,988 | 25,379 | 16,000,000 | 16,543 | 0.000082, 14, 4,973
3 | 10,493,573 | 1,830,889 | 3,656,907 | 18,631 | 16,000,000 | 16,439 | 0.00019, 15, 4,104
4 | 10,631,200 | 1,758,813 | 3,583,891 | 26,096 | 16,000,000 | 16,522 | 0.00015, 14, 3,197
5 | 10,801,196 | 1,670,915 | 3,501,804 | 26,085 | 16,000,000 | 16,875 | 0.00045, 14, 4,086
6 | 10,855,871 | 1,626,255 | 3,492,681 | 25,193 | 16,000,000 | 16,862 | 0.00034, 14, 2,040
7 | 10,909,219 | 1,675,797 | 3,389,335 | 25,649 | 16,000,000 | 16,947 | 0.00054, 14, 3,474
8 | 10,431,985 | 1,572,627 | 3,968,976 | 26,412 | 16,000,000 | 16,946 | 0.00046, 14, 3,052
Li et al., 2011

44 Identify Differentially Expressed Genes T-test (BioConductor) Poisson distribution (DEGseq) Negative binomial distribution (edgeR)

45 Differential Expression Analysis Li et al., 2011

46 Scatter Plot of Expression Values Li et al., 2011

47 Costa, 2010

48 Expression Profiles of Genes in Multiple Conditions Columns: Con 1, Con 2, Con 3, Con 4, Con 5, Con 6, Con 7, Con 8 Rows: Gene 1 (1030403520100560), Gene 2, Gene 3, Gene 4, ….

49 Gene Regulatory Network A cluster of genes having similar expression profiles Several regulators whose expression can explain the expression of the cluster of genes Segal et al., Nature Genetics, 2003

50 Expectation Maximization Approach Generate initial clusters using K-means Recursively select TFs to construct a decision tree that maximizes likelihood Reassign each gene to the tree that maximizes its likelihood Likelihood increased? If yes, repeat; if no, stop.

51 Regression Tree Construction Pick a TF Divide conditions into two subsets based on TF states Calculate mean and std Calculate likelihood of each expression value Select TF maximizing likelihood Repeat Zhu et al., in preparation
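
One split of the regression tree can be sketched as follows. This is an illustrative reading of the steps on the slide, not the authors' implementation: TF states are assumed to be binary (on/off) per condition, expression values in each subset are modeled with a single Gaussian, and every name and the `min_std` floor are hypothetical.

```python
import math

def gaussian_loglik(values, min_std=1e-3):
    """Log-likelihood of values under a Gaussian fitted to those same values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = max(math.sqrt(var), min_std)  # floor avoids division by zero
    return sum(
        -0.5 * math.log(2 * math.pi * std ** 2) - (v - mean) ** 2 / (2 * std ** 2)
        for v in values
    )

def split_loglik(expression, tf_states):
    """Likelihood of a cluster's expression values after splitting conditions by TF state.

    expression: list of per-gene lists, one value per condition
    tf_states: list of booleans, one per condition (True = TF is "on")
    """
    on = [row[c] for row in expression for c, s in enumerate(tf_states) if s]
    off = [row[c] for row in expression for c, s in enumerate(tf_states) if not s]
    if not on or not off:
        return float("-inf")  # a TF that does not split the conditions is useless
    return gaussian_loglik(on) + gaussian_loglik(off)

def best_tf(expression, tf_table):
    """Pick the TF whose split maximizes the likelihood. tf_table: TF name -> states."""
    return max(tf_table, key=lambda tf: split_loglik(expression, tf_table[tf]))
```

Applying `best_tf` again to each resulting subset of conditions would grow the tree one level at a time, which corresponds to the "Repeat" step.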

52 Gene Re-Clustering For each condition, find the path from root to leaf according to the TF states in that condition Calculate the likelihood of the gene's expression value at that leaf Assign the gene to the tree that maximizes its likelihood
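
The reassignment step can be sketched as below, under assumptions that are not in the slides: each tree is a nested dictionary whose internal nodes test one TF's binary state and whose leaves store a Gaussian mean and standard deviation, and a gene's likelihood under a tree is the sum of its per-condition log-likelihoods. All names are hypothetical.

```python
import math

def leaf_for_condition(tree, tf_states):
    """Follow the path from root to leaf using the TF states of one condition."""
    node = tree
    while "tf" in node:  # internal nodes carry a "tf" key; leaves do not
        node = node["on"] if tf_states[node["tf"]] else node["off"]
    return node

def gene_loglik(tree, gene_values, conditions):
    """Sum of Gaussian log-likelihoods of a gene's values, one per condition."""
    total = 0.0
    for value, tf_states in zip(gene_values, conditions):
        leaf = leaf_for_condition(tree, tf_states)
        std = leaf["std"]
        total += -0.5 * math.log(2 * math.pi * std ** 2)
        total -= (value - leaf["mean"]) ** 2 / (2 * std ** 2)
    return total

def assign_gene(trees, gene_values, conditions):
    """Return the name of the tree under which the gene is most likely."""
    return max(trees, key=lambda name: gene_loglik(trees[name], gene_values, conditions))
```

Alternating this reassignment with tree reconstruction, and stopping once the total likelihood no longer increases, gives an expectation-maximization-style loop.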

53 Construction of Gene Regulatory Networks Li et al., 2011 Genes Conditions Transcription factors Function Annotations

54 Validation of Gene Regulatory Networks Function validation (e.g. oxidation reduction) DNA binding site validation (e.g. TF binds genes) Literature validation (e.g. TF regulates condition) Experimental validation (e.g. to do) BetaBetaAlpha – Zinc Finger Zhu et al., in preparation

55 Modeling, Information Condensation and Knowledge Discovery Illumina

56 Acknowledgements Group Members Badri Adhikari Debswapna Bhattacharya Renzhi Cao Xin Deng Jesse Eickholt Jilong Li Zheng Wang Mingzhu Zhu Collaborators MU Botanical Center, Soybean Research Groups

