Model Selection in Machine Learning + Predicting Gene Expression from ChIP-Seq signals https://twitter.com/nature


Review and warm-up questions What makes a good feature? Andrew Ng

Training vs. Cross-Validation Fit the model to example data points; evaluate it on a separate, held-out set of data points. https://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
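
The split the slide describes can be sketched in a few lines of NumPy. This is a minimal sketch, not the lecture's code; the `fit`/`predict` callables and the helper names are mine:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, fit, predict, k=5):
    """For each fold: train on the other k-1 folds, score on the held-out fold.
    Returns the per-fold test mean squared error."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return np.array(errors)
```

The spread of the per-fold errors is itself informative: a large spread suggests the score depends heavily on which points land in the test fold.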

Bias vs. Variance What's the problem? Model too simple. What's the problem? Model too complex. Model = multinomial regression. Bias means underfitting the data: high bias → more training data won't help. Variance means overfitting the data: high variance → more training data WILL help. Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

What’s the problem? – Bias Ask students for feedback. Are we overfitting or underfitting? Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

Learning curve – Bias Bias Model too simple Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

What’s the problem? – Variance Ask students for feedback Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

Learning curve – Variance Model too complex Ask students: Will more data help? Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf
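
The learning curves on these two slides can be generated with a sketch like the following; polynomial fitting stands in for whatever model is being diagnosed, and all names and data are hypothetical:

```python
import numpy as np

def learning_curve(x, y, degree, fractions, seed=0):
    """Train/test MSE of a degree-`degree` polynomial as the training set grows.
    High bias: both curves plateau at a high error (more data won't help).
    High variance: a large train/test gap that shrinks as examples are added."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train_pool, test = idx[:half], idx[half:]
    rows = []
    for f in fractions:
        n = max(degree + 1, int(f * half))   # need >= degree+1 points to fit
        sub = train_pool[:n]
        coef = np.polyfit(x[sub], y[sub], degree)
        rows.append([np.mean((np.polyval(coef, x[sub]) - y[sub]) ** 2),
                     np.mean((np.polyval(coef, x[test]) - y[test]) ** 2)])
    return np.array(rows)   # column 0: train MSE, column 1: test MSE
```

Plotting both columns against the training fraction reproduces the shape of the slides' figures.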

What is the next step? Bias Variance Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

What is the next step? Bias: add more training features; train a more complicated model. Variance: add more training examples; try fewer features (dimension reduction); simplify the model. Adapted from Andrew Ng – https://www.youtube.com/watch?v=DYCv5e0Isow, http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf

Practical Application: Predicting gene expression from ChIP-Seq signals

Where should we start? Gene k is divided into 160 bins: Bin 1-40 (TSS-4kb to TSS), Bin 41-80 (TSS to TSS+4kb), Bin 81-120 (TTS-4kb to TTS), Bin 121-160 (TTS to TTS+4kb). TSS = transcription start site; TTS = transcription termination site. Park et al. Nature Reviews Genetics 2009, Rozowsky et al. Nature Biotech 2009, Cheng et al. Genome Biology 2011, http://images.nigms.nih.gov/imageRepository/2484/RNA_pol_II_medium.jpg

RNA is transcribed by RNA polymerase. RNA pol II ChIP at transcription start sites. RNA polymerase II crystal structure (Roger Kornberg, Nobel Prize). Rozowsky et al. Nature Biotech 2009, http://images.nigms.nih.gov/imageRepository/2484/RNA_pol_II_medium.jpg

Relating Genomic Inputs to Outputs Cell Type 1 Cell Type 2 Mark Gerstein/Mengting Gu

Initial model -4000 TSS +4000 data from K562 cell line, ENCODE consortium Sum pol II ChIP signal across 8000 bp centered around transcription start site. log2(RNA-Seq RPKM) = a + b*log2(RNA Pol II ChIP) Why log scale??
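
As an illustration of this fit, here is a NumPy sketch; the five gene values below are made up for the example, not the actual ENCODE data:

```python
import numpy as np

# Hypothetical Pol II promoter signals and RPKM values for a few genes;
# the real inputs are the K562 ENCODE tracks summed over TSS +/- 4 kb.
polII = np.array([12.0, 55.0, 130.0, 400.0, 900.0])
rpkm  = np.array([0.5, 2.1, 6.0, 20.0, 45.0])

# Why log scale? Both quantities span orders of magnitude with heavy
# right tails; log2 compresses the tails so a straight line is a
# sensible fit and no single highly-expressed gene dominates the loss.
x, y = np.log2(polII), np.log2(rpkm)
b, a = np.polyfit(x, y, 1)           # log2(RPKM) = a + b * log2(Pol II)
r = np.corrcoef(a + b * x, y)[0, 1]  # Pearson's R of fitted vs. observed
```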

Can we do better than this? Pearson’s R = 0.39 HOW? Can we add more genes? Would that help? Can we use a more complex model? What about other ChIP data? log2 scale K562 Cell Line, ENCODE data

Learning curve: What's our problem? Mean squared error of linear regression for pol II ChIP vs. RNA-Seq. What does a single point on this graph mean? This is BIAS, so we need more training features! How can we get more features? Can we add more genes? Would that help? Can we use a more complex model? What about other ChIP data?

Learning curve: What's our problem? Mean squared error of linear regression for pol II ChIP vs. RNA-Seq. This is BIAS: we need more training features and/or a more complicated model. How can we get more features? Can we add more genes? Would that help? Can we use a more complex model? What about other ChIP data?

Polynomial regression: polII ChIP vs. RNA-Seq log2(RNA-Seq RPKM) = a + b*log2(RNA Pol II ChIP) + c*log2(RNA Pol II ChIP)^2 + d*log2(RNA Pol II ChIP)^3 + … + j*log2(RNA Pol II ChIP)^10 (Learning curve: mean squared error vs. proportion of training data, 15,000 genes)
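
A minimal sketch of why the extra polynomial terms always help on the training set but need not help on the test set; the data here is synthetic, not the actual K562 measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)                   # stand-in for log2 Pol II signal
y = 1.5 * x + 0.3 + rng.normal(0, 0.5, 300)   # noisy "log2 expression"
train, test = np.arange(200), np.arange(200, 300)

mse = {}
for degree in (1, 10):
    coef = np.polyfit(x[train], y[train], degree)
    mse[degree] = (np.mean((np.polyval(coef, x[train]) - y[train]) ** 2),
                   np.mean((np.polyval(coef, x[test]) - y[test]) ** 2))
# The degree-10 basis nests the degree-1 basis, so the least-squares
# training error can only fall as terms are added; test error is another matter.
```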

Polynomial regression: polII ChIP vs. RNA-Seq Still BIASED! More training features and/or a more complicated model. (Learning curve: mean squared error vs. proportion of training data, 15,000 genes)

Y = aX1 + bX2 + c Mark Gerstein/Mengting Gu

Adding more signals A gene-by-mark matrix of total signal from the promoter (rows: Gene 1, Gene 2, …, Gene N; columns: H3K27me3, H3K9me3, H3K36me3, H3K4me1, H3K27Ac, RNA polII, H3K4me3; RNA-Seq as the prediction target). Take log2 of each element.
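
The "take log2 of each element" step might look like this in NumPy. The matrix entries below are illustrative, and the pseudocount of 1 is my assumption; the slide does not say how zero-signal promoters are handled:

```python
import numpy as np

# Hypothetical promoter-summed ChIP signals (rows: genes, columns: marks);
# real values would come from the ENCODE K562 tracks on the slide.
marks = ["H3K27me3", "H3K9me3", "H3K36me3", "H3K4me1", "H3K27Ac", "RNA polII"]
signal = np.array([[ 10.0,  12.0, 624.0, 592.0, 125.0, 224.0],
                   [900.0,  29.0, 340.0,  99.0,  22.0,  94.0],
                   [ 29.0,   3.0, 100.0, 135.0, 135.0, 272.0]])

# log2 with a pseudocount of 1 (assumption) so that zero-signal
# promoters map to 0 instead of -infinity.
X = np.log2(signal + 1.0)
```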

Multiple Linear Regression Y = aX1 + bX2 + c + … (Learning curve: mean squared error vs. proportion of training data, 15,000 genes)
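
A least-squares sketch of the slide's Y = aX1 + bX2 + c + …; the helper name is mine:

```python
import numpy as np

def fit_multiple_linear(X, y):
    """Least-squares fit of y = X @ w + c (the slide's Y = aX1 + bX2 + c + ...).
    Returns the weight vector w and the intercept c."""
    A = np.hstack([X, np.ones((len(y), 1))])   # append a constant column for c
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]
```

Applied to the log2 mark matrix above, each entry of `w` becomes one histone mark's regression coefficient.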

Multiple Linear Regression Still BIASED! More training features and/or a more complicated model. (Learning curve: mean squared error vs. proportion of training data, 15,000 genes)

Random Forest Regression Log-transformed: training correlation 0.93, test correlation 0.52. Not log-transformed: training correlation 0.95, test correlation 0.16.
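
The lecture presumably used a full random forest library. As a clearly-labeled stand-in, a bagged ensemble of depth-1 regression trees shows the two key ingredients, bootstrap resampling and prediction averaging; all names are mine:

```python
import numpy as np

def fit_stump(x, y):
    """Best single-threshold regression tree of depth 1 on one feature."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, xs[0], ys.mean(), ys.mean())
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    return best[1:]   # threshold, left mean, right mean

def bagged_stumps_predict(x, y, xq, n_trees=50, seed=0):
    """Bootstrap-resample the training set, fit a stump per resample,
    and average the predictions: the core of random forest regression."""
    rng = np.random.default_rng(seed)
    preds = np.zeros((n_trees, len(xq)))
    for t in range(n_trees):
        idx = rng.integers(0, len(y), len(y))   # sample with replacement
        thr, lo, hi = fit_stump(x[idx], y[idx])
        preds[t] = np.where(xq <= thr, lo, hi)
    return preds.mean(axis=0)
```

Note the slide's lesson still applies: whether the target is log-transformed changes the loss each split minimizes, which is why the log-transformed fit generalizes so much better here.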

Random Forest Regression Candy: What's the problem – Bias or Variance? What should we do now? Training: R = 0.95; Test: R = 0.52. (Scatterplots: predicted vs. observed log2(RPKM+1) for training and test sets.)

What's the best model setup? Candy: Which setup do we expect to perform better? One bin around the TSS vs. 80 bins around the TSS + 80 bins around the TTS. Cheng et al. Genome Biology 2011

Effects of signal depend on location! (Figure: correlation between signal and expression along the gene body, from START to STOP/TTS.) Slide by Chao Cheng. Gerstein*, …, Cheng* et al. 2010, Science

Setting up the model Predictors: chromatin features (histone modifications HM1, HM2, HM3, …) averaged in 160 bins per gene: Bin 1-40 (TSS-4kb to TSS), Bin 41-80 (TSS to TSS+4kb), Bin 81-120 (TTS-4kb to TTS), Bin 121-160 (TTS to TTS+4kb). Prediction target: gene expression level (RNA-Seq data, ~10,000 RefSeq genes). TSS = transcription start site; TTS = transcription termination site. Slide by Chao Cheng
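
The 160-bin feature construction can be sketched as follows. The 100 bp bin size is inferred from 4 kb / 40 bins, and the function name is mine:

```python
import numpy as np

def bin_gene_features(track, tss, tts, bin_size=100, flank_bins=40):
    """Average a per-bp signal track in 40 bins on each side of the TSS
    and 40 bins on each side of the TTS: bins 1-80 around the TSS,
    bins 81-160 around the TTS, matching the slide's layout."""
    values = []
    for anchor in (tss, tts):
        start = anchor - flank_bins * bin_size
        for i in range(2 * flank_bins):
            lo = start + i * bin_size
            values.append(track[lo:lo + bin_size].mean())
    return np.array(values)
```

Stacking one such row per gene and per histone mark yields the predictor matrix; the RNA-Seq expression levels are the target vector.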

Support vector regression to predict gene expression levels Cheng et al. Genome Research 2013 Slide by Chao Cheng
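
Cheng et al. used support vector regression; a full SVR solver is too long to sketch here, so below is kernel ridge regression as a clearly-labeled stand-in. Like SVR it learns a nonlinear function of the binned features through a kernel; the difference is the loss (squared error vs. epsilon-insensitive). All names are mine:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, gamma=1.0, lam=1e-2):
    """Closed-form kernel ridge fit: alpha = (K + lam*I)^-1 y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(alpha, X_train, X_query, gamma=1.0):
    """Predictions are kernel-weighted sums over the training points."""
    return rbf_kernel(X_query, X_train, gamma) @ alpha
```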

Context (TA bias) My implementation: training correlation = 0.95, test correlation = 0.58.

Areas close to the TSS predict expression better. Support vector machine to classify genes with high, medium, and low expression. Candy: Describe a ROC curve! Adapted from Chao Cheng
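
For the "describe a ROC curve" prompt, here is a minimal NumPy computation of the curve and its area, for a binary version of the classification task; the function name is mine:

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep the decision threshold from strict to lenient and record the
    true-positive rate vs. false-positive rate at each step; also return
    the area under the curve (AUC: 1.0 = perfect ranking, 0.5 = random)."""
    order = np.argsort(-scores)          # most confident positives first
    lab = labels[order].astype(float)
    tpr = np.cumsum(lab) / lab.sum()
    fpr = np.cumsum(1.0 - lab) / (1.0 - lab).sum()
    # Trapezoidal area under (fpr, tpr), starting from the origin.
    f = np.concatenate([[0.0], fpr])
    t = np.concatenate([[0.0], tpr])
    auc = float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))
    return fpr, tpr, auc
```

Comparing AUCs of classifiers trained on different bins is exactly how the slide's claim ("areas close to the TSS predict expression better") is quantified.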

Predicting Gene Expression with Transcription Factor ChIP-Seq signals Cheng et al. Genome Research 2012

Modeling Transcription Between Organisms Gerstein et al. Nature 2014

Why do we care? What are the benefits of a quantitative model? Does this model help us understand the mechanism of transcription?

For discussion Will the prediction model perform accurately in cells with a transcription factor knocked out? Gerstein et al. Nature 2014