Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data Jianlin Jack Cheng, PhD Computer Science Department University.

Presentation transcript:

Modeling Protein Structures and Gene Regulatory Networks by Mining Protein and RNA-Seq Data Jianlin Jack Cheng, PhD Computer Science Department University of Missouri, Columbia March 7, 2012

Two Challenges: Protein Structure Modeling; Gene Regulatory Network Modeling

The Genomic Era Collins, Venter, Human Genome, 2000

Sequencing Revolution $1000 Personal Genome in 2010s Transcriptome Proteome

Genome Implications to Information Sciences and Life Sciences Elements and Systems

Growth of Protein Sequences AGCWY…

Growth of Protein Structures in PDB

Computational Protein Structure Folding / Prediction Structure = f(sequence)? E = mc^2

Template-Based Approach MWLKKFGINKH… Protein Data Bank Fold Recognition Alignment Template Target protein Protein structure space is limited! Chothia, Nature, 1992 Protein sequence space is astronomical!

Fisher, 2005 Modeller

Template-Free Protein Structure Prediction

Template-Free Approach Sampling: MCMC and Simulated Annealing MWLKKFGINLLIGQSV… …… Select structure with minimum free energy Simulation
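The sampling idea on this slide can be illustrated with a minimal simulated-annealing sketch. It is not the method used in the talk: `toy_energy` is a hypothetical stand-in for a free-energy function, the "conformation" is just a list of angles, and the cooling schedule and step size are arbitrary assumptions.

```python
import math
import random

def toy_energy(angles):
    # Hypothetical energy: zero when every angle is 0, higher elsewhere.
    return sum(1.0 - math.cos(a) for a in angles)

def simulated_annealing(energy, start, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Metropolis sampling with a geometric cooling schedule (illustrative only)."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for step in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a small random change to one angle.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.gauss(0.0, 0.3)
        e_candidate = energy(candidate)
        # Metropolis criterion: always accept downhill moves,
        # accept uphill moves with probability exp(-dE / t).
        if e_candidate <= e_current or rng.random() < math.exp((e_current - e_candidate) / t):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best

if __name__ == "__main__":
    conformation, e = simulated_annealing(toy_energy, [2.0, -2.5, 1.0])
    print(conformation, e)
```

The lowest-energy conformation seen during sampling is returned, which mirrors the "select structure with minimum free energy" step.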

Major Challenges in Protein Structure Prediction Select best templates? Generate best alignments? Generate best models? Select best models? Finding a needle in a haystack!

Major Challenges in Protein Structure Prediction Select best templates? Generate best alignments? Generate best models? Select best models? Wang, Eickholt, Cheng, Bioinformatics, 2010

A Conformation Ensemble Approach P(conformation) P(-energy) Conformation Distribution Maximum Likelihood & Maximum a Posteriori Brooks et al., 2001

New Views on Protein Modeling The protein structure modeling problem is simply a grand computational and statistical sampling problem: random sampling (template-free) and targeted sampling (template-based).

A Unified Protein Structure Prediction Pipeline (Wang et al., Bioinformatics, 2010)
Input: Query (MARTCRKE…)
1. Template Ranking; Query-Template Alignments
2. Multiple-Template Combination; Model Generation
4. Evaluation & Refinement
Output

Sampling in Alignment and Fold Space PSI-BLAST (sequence – profile) SAM (sequence – HMM) HMMer (sequence – HMM) Compass (profile – profile) HHSearch (HMM - HMM) PRC (HMM-HMM) FOLDpro (machine learning) MSACompro (profile-profile) Cheng, Baldi, Bioinformatics, 2006 Deng, Cheng, BMC Bioinformatics, 2011

Multi-Template Combination in Template and Alignment Space
Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR
Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL FGLMGN LSSWVGA ( )
Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG ( )
Temp3 QGTARDRAWQLEVERHRAQGTSASFL ( )
Temp4 AANQLDAMRALGYAQERYFEMDLMRRAPAGELSELFGAKAVDLK (10^-5)
Cheng, BMC Structural Biology, 2008

Query VR-RNNMGMPLIESSSYHDALFTLGYAGDRISQMLGMRYANNLHDLFLAEGYYEASQRKR Temp1 IAHIYANNLHDLFLAEGYYEASQRLFEIEL------FGLMGN------LSSWVGA----- ( ) Temp2 LLAQ-GRLSEMAGADALDVNIYIDSNG ( ) Temp ARDRAWQLEVERHRAQGTSASFL ( ) Temp GAKAVDLK (10 -5 ) Multi-Template Combination in Template and Alignment Space

Cheng, BMC Structural Biology, 2008 Advantage: reduce variance of modeling

Multi-Template vs. Single-Top-Template Improved 38 / 45 targets Improvement by 6.8% P-value < Cheng, BMC Structural Biology, 2008

Combination of Template-Free and Template-Based Sampling Protein Modeling Spectrum: 100% TBM; 50% TBM + 50% FM; 100% FM

Recursive Protein Modeling – Integrate TBM and FM Initial Region Decomposition Model aligned / certain regions by TBM Model unaligned / uncertain regions by FM Keep certain regions / core fixed Compose TBM, FM components into larger certain components Satisfactory? No: Repeat. Yes: done. Divide & Conquer Conditional Sampling Increase fitness & reduce bias Cheng et al., 2011

Recursive Modeling Mimics Protein Folding Cascade ks.uiuc.edu

Core-Constrained Tail Refinement

Template-Based + Template-Free & Recursive Modeling (CASP9) Cheng et al., 2011

Insights – A Bayesian Approach Incorporate prior information: template- based region Conditional sampling: use certain regions to constrain uncertain regions Reduce uncertainty gradually Iteratively optimize the conformation

Model Selection Single model approach Ensemble approach Wang et al., Proteins, 2009 Cheng et al., Proteins, 2009 Wang et al., Bioinformatics, 2011

Model Quality Evaluation Select top 5 ranked models as references......

Model Quality Assessment Compare each model with reference models (top five) Average global quality Re-rank models (+10%) Cheng et al., Proteins, 2009 Wang and Cheng, 2011
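The reference-based re-ranking described on this slide might look roughly like the sketch below. Everything in it is an assumption made for illustration: models are lists of already-superimposed 3-D coordinates, `similarity` is a crude stand-in for a real structural score such as GDT-TS, and a reference model is not compared with itself.

```python
import math

def similarity(model_a, model_b, cutoff=2.0):
    # Hypothetical score: fraction of positions closer than `cutoff` angstroms,
    # assuming both models are already superimposed.
    close = sum(1 for p, q in zip(model_a, model_b) if math.dist(p, q) <= cutoff)
    return close / len(model_a)

def rerank_by_references(models, initial_scores, n_refs=5):
    """Score each model by its average similarity to the top-ranked references."""
    order = sorted(models, key=lambda name: initial_scores[name], reverse=True)
    refs = order[:n_refs]
    quality = {}
    for name, coords in models.items():
        # Skip self-comparison so a reference does not inflate its own score.
        others = [r for r in refs if r != name]
        total = sum(similarity(coords, models[r]) for r in others)
        quality[name] = total / max(len(others), 1)
    ranking = sorted(quality, key=quality.get, reverse=True)
    return ranking, quality

if __name__ == "__main__":
    base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    models = {
        "a": base,
        "b": [(x + 0.1, y, z) for x, y, z in base],
        "c": [(x + 0.2, y, z) for x, y, z in base],
        "d": [(x + 10.0, y + 10.0, z + 10.0) for x, y, z in base],
    }
    initial = {"d": 0.9, "a": 0.8, "b": 0.7, "c": 0.6}
    print(rerank_by_references(models, initial, n_refs=3))
```

In the toy data, model "d" starts at the top of the initial ranking but disagrees with the other references, so it drops to the bottom after re-ranking.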

Iterative Ranking Randomly selecting five reference models seems to work Wang and Cheng, 2011

Model Refinement by Model Combination Model ranking Select top 5 models as seed models Structure comparison Identify similar models or fragments

Model Combination and Averaging Average Advantage: reduce variance of modeling – maximize likelihood
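As a rough illustration of the averaging step, the sketch below averages the coordinates of several models position by position. It assumes the models have equal length and are already superimposed in the same frame; the function name is hypothetical and the talk's actual combination procedure is not shown in the slides.

```python
def average_models(models):
    """Average 3-D coordinates position by position across superimposed models."""
    n_models = len(models)
    n_positions = len(models[0])
    averaged = []
    for i in range(n_positions):
        # Mean of x, y and z separately for position i.
        averaged.append(tuple(
            sum(model[i][axis] for model in models) / n_models
            for axis in range(3)
        ))
    return averaged

if __name__ == "__main__":
    model_1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    model_2 = [(2.0, 2.0, 2.0), (4.0, 0.0, 0.0)]
    print(average_models([model_1, model_2]))
```

Averaging independent, similar models tends to cancel their individual errors, which is the variance reduction the slide refers to.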

CASP9 Top 20 Servers

CASP9 Top 20 Servers on Ab Initio Targets

Some High-Quality CASP Predictions T0390 GDT=0.90 T0426 GDT=0.97 T0432 GDT=0.92 T0458 GDT=0.97 Orange: structure; Green: model 50 of 120 CASP8 targets were predicted with high accuracy (RMSD < 2 Å) Wang et al., 2010

Modeling Gene Regulation Process by Mining RNA-Seq Data Tens of thousands of genes Expression of gene is regulated Genes tend to function in groups Regulators and targets Hasty et al., 2001

Gene Regulatory Network Modeling (RNA-Seq, Microarray) Zhu et al., in preparation

RNA-Seq Data Processing Steps Isolate RNA Prepare an RNA library RNA sequencing by NGS Read mapping Quantification and analysis Pepke et al., 2009

RNA-Seq Data Mapping Un-mapped reads Ambiguous reads Biological variance versus technology variance Tools: TopHat, Bowtie Haas & Zody, 2010

Construct Gene Expression Profiles Count the number of reads mapped to each gene Normalize counts into quantitative values by gene length and total number of reads Tools: Cufflinks, HTSeq, MULTICOM RPKM: reads per kilobase per million reads
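The RPKM normalization named on this slide follows the standard definition: reads for a gene, divided by gene length in kilobases and by total mapped reads in millions. The sketch below applies it to made-up counts; the gene names and numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    # Dividing by (length / 1e3) and by (total / 1e6) equals multiplying by 1e9.
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def expression_profile(counts, lengths_bp, total_mapped_reads):
    # counts and lengths_bp are dicts keyed by gene name.
    return {
        gene: rpkm(counts[gene], lengths_bp[gene], total_mapped_reads)
        for gene in counts
    }

if __name__ == "__main__":
    counts = {"geneA": 500, "geneB": 500}
    lengths = {"geneA": 2000, "geneB": 4000}
    # geneB is twice as long with the same count, so its RPKM is half of geneA's.
    print(expression_profile(counts, lengths, 10_000_000))
```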

Mapping Results of Mouse Transcriptome Perturbed by Drug-like Compounds (Li et al., 2011)
Read file | Reads mapped to unique site | Reads mapped to multiple sites | Unmapped reads | Filtered-out reads | Total reads | Genes having reads | Gene expression value in RPKM (min, ave, max)
1 | 10,270,347 | 1,877,368 | 3,828,691 | 23,594 | 16,000,000 | 16, | , 16, 5,
 | ,004,067 | 1,730,566 | 3,239,988 | 25,379 | 16,000,000 | 16, | , 14, 4,
 | ,493,573 | 1,830,889 | 3,656,907 | 18,631 | 16,000,000 | 16, | , 15, 4,
 | ,631,200 | 1,758,813 | 3,583,891 | 26,096 | 16,000,000 | 16, | , 14, 3,
 | ,801,196 | 1,670,915 | 3,501,804 | 26,085 | 16,000,000 | 16, | , 14, 4,
 | ,855,871 | 1,626,255 | 3,492,681 | 25,193 | 16,000,000 | 16, | , 14, 2,
 | ,909,219 | 1,675,797 | 3,389,335 | 25,649 | 16,000,000 | 16, | , 14, 3,
 | ,431,985 | 1,572,627 | 3,968,976 | 26,412 | 16,000,000 | 16, | , 14, 3,052

Identify Differentially Expressed Genes T-test (BioConductor) Poisson distribution (DEGseq) Negative binomial distribution (edgeR)

Differential Expression Analysis Li et al., 2011

Scatter Plot of Expression Values Li et al., 2011

Costa, 2010

Expression Profiles of Genes in Multiple Conditions
Columns: Con 1, Con 2, Con 3, Con 4, Con 5, Con 6, Con 7, Con 8
Rows: Gene 1, Gene 2, Gene 3, Gene 4, …

Gene Regulatory Network A cluster of genes having similar expression profiles Several regulators whose expression can explain the expression of the cluster of genes Segal et al., Nature Genetics, 2003

Expectation Maximization Approach Generate initial clusters using K-means Recursively select TFs to construct a decision tree that maximizes likelihood Reassign each gene to the tree that maximizes its likelihood Likelihood increased? Yes: repeat. No: stop.

Regression Tree Construction Pick a TF Divide conditions into two subsets based on TF states Calculate mean and std Calculate likelihood of each expression value Select TF maximizing likelihood Repeat Zhu et al., in preparation
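One split of the regression tree could be chosen along the lines of the sketch below. It is an illustration under stated assumptions, not the authors' implementation: TF states are taken to be binary (on/off) per condition, expression values in each subset are modeled with a single Gaussian, and all names and numbers are hypothetical.

```python
import math
from statistics import mean, pstdev

def gaussian_loglik(values, min_std=1e-3):
    # Log-likelihood of values under a Gaussian fitted to those same values.
    mu = mean(values)
    sigma = max(pstdev(values), min_std)  # floor avoids division by zero
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in values
    )

def best_split(cluster_expr, tf_states):
    """Pick the TF whose on/off states split the conditions with maximum likelihood.

    cluster_expr: one list of expression values (genes in the cluster) per condition.
    tf_states: dict mapping TF name to a list of booleans, one per condition.
    """
    best_tf, best_ll = None, -math.inf
    for tf, states in tf_states.items():
        on = [v for vals, s in zip(cluster_expr, states) if s for v in vals]
        off = [v for vals, s in zip(cluster_expr, states) if not s for v in vals]
        if not on or not off:
            continue  # this TF does not divide the conditions
        ll = gaussian_loglik(on) + gaussian_loglik(off)
        if ll > best_ll:
            best_tf, best_ll = tf, ll
    return best_tf, best_ll

if __name__ == "__main__":
    expr = [[5.0, 5.2], [5.1, 4.9], [1.0, 1.1], [0.9, 1.2]]
    states = {
        "TF_A": [True, True, False, False],
        "TF_B": [True, False, True, False],
    }
    print(best_split(expr, states))
```

Applying `best_split` again to the conditions on each side of the chosen split would grow the tree, matching the "Repeat" step.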

Gene Re-Clustering For each condition, find the path from root to leaf according to the TF states in that condition Calculate the likelihood of the gene's expression value at that leaf Assign the gene to the tree that maximizes its likelihood
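The re-clustering step can be sketched as follows, assuming a hypothetical tree encoding: an internal node is a dict with keys `tf`, `on` and `off`, and a leaf is a dict with a Gaussian `mean` and `std`. TF states are again assumed binary per condition.

```python
import math

def leaf_for(tree, tf_states, cond):
    # Walk from the root to a leaf, following each TF's state in this condition.
    node = tree
    while "tf" in node:
        node = node["on"] if tf_states[node["tf"]][cond] else node["off"]
    return node

def gene_loglik(tree, gene_expr, tf_states):
    """Sum over conditions of the Gaussian log-likelihood at the reached leaf."""
    total = 0.0
    for cond, value in enumerate(gene_expr):
        leaf = leaf_for(tree, tf_states, cond)
        mu, sigma = leaf["mean"], leaf["std"]
        total += -0.5 * math.log(2 * math.pi * sigma ** 2) - (value - mu) ** 2 / (2 * sigma ** 2)
    return total

def assign_gene(trees, gene_expr, tf_states):
    # Return the name of the tree under which the gene's profile is most likely.
    return max(trees, key=lambda name: gene_loglik(trees[name], gene_expr, tf_states))

if __name__ == "__main__":
    tf_states = {"TF_A": [True, True, False, False]}
    trees = {
        "up": {"tf": "TF_A",
               "on": {"mean": 5.0, "std": 0.5},
               "off": {"mean": 1.0, "std": 0.5}},
        "flat": {"mean": 3.0, "std": 0.5},
    }
    print(assign_gene(trees, [5.1, 4.8, 1.2, 0.9], tf_states))
    print(assign_gene(trees, [3.0, 3.1, 2.9, 3.0], tf_states))
```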

Construction of Gene Regulatory Networks Li et al., 2011 Genes Conditions Transcription factors Function Annotations

Validation of Gene Regulatory Networks Function validation (e.g. oxidation reduction) DNA binding site validation (e.g. TF binds genes) Literature validation (e.g. TF regulates condition) Experimental validation (e.g. to do) BetaBetaAlpha – Zinc Finger Zhu et al., in preparation

Modeling, Information Condensation and Knowledge Discovery Illumina

Acknowledgements Group Members Badri Adhikari Debswapna Bhattacharya Renzhi Cao Xin Deng Jesse Eickholt Jilong Li Zheng Wang Mingzhu Zhu Collaborators MU Botanical Center, Soybean Research Groups