CS273B: Deep learning for Genomics and Biomedicine

Slides:



Advertisements
Similar presentations
Methods to read out regulatory functions
Advertisements

Regulomics II: Epigenetics and the histone code Jim Noonan GENE760.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
Introduction to molecular networks Sushmita Roy BMI/CS 576 Nov 6 th, 2014.
“An integrated encyclopedia of DNA elements in the human genome” ENCODE Project Consortium. Nature 2012 Sep 6; 489: Michael M. Hoffman University.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
[BejeranoFall13/14] 1 MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos.
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.
* only 17% of SNPs implicated in freshwater adaptation map to coding sequences Many, many mapping studies find prevalent noncoding QTLs.
SIGNAL PROCESSING FOR NEXT-GEN SEQUENCING DATA RNA-seq CHIP-seq DNAse I-seq FAIRE-seq Peaks Transcripts Gene models Binding sites RIP/CLIP-seq.
I519 Introduction to Bioinformatics, Fall, 2012
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.
Journal report: High Resolution Model of Transcription Factor- DNA Affinities Improve In Vitro and In Vivo Binding Predictions Paper by: Phadera Gius,
Algorithms in Bioinformatics: A Practical Introduction
Recombination breakpoints Family Inheritance Me vs. my brother My dad (my Y)Mom’s dad (uncle’s Y) Human ancestry Disease risk Genomics: Regions  mechanisms.
Introduction to biological molecular networks
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
CS173 Lecture 9: Transcriptional regulation III
Intro to Probabilistic Models PSSMs Computational Genomics, Lecture 6b Partially based on slides by Metsada Pasmanik-Chor.
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – the Transcription.
Transcription factor binding motifs (part II) 10/22/07.
Nawanol Theera-Ampornpunt, Seong Gon Kim, Asish Ghoshal, Saurabh Bagchi, Ananth Grama, and Somali Chaterji Fast Training on Large Genomics Data using Distributed.
1 From Bi 150 Lecture 0 October 4, 2012 An introduction to molecular biology... but you will learn the cell biology in this course.
Deep Learning Overview Sources: workshop-tutorial-final.pdf
Enhancers and 3D genomics Noam Bar RESEARCH METHODS IN COMPUTATIONAL BIOLOGY.
Gene Regulation, Part 2 Lecture 15 (cont.) Fall 2008.
Additional high-throughput sequencing techniques (finding all functional elements of genome) June 15, 2017.
Projects
Regulation of Gene Expression
Convolutional Sequence to Sequence Learning
CSCI2950-C Lecture 12 Networks
Demo.
Data Mining, Neural Network and Genetic Programming
Learning Sequence Motif Models Using Expectation Maximization (EM)
De novo Motif Finding using ChIP-Seq
Lecture 5 Smaller Network: CNN
Neural Networks 2 CS446 Machine Learning.
Chapter 15 Controls over Genes.
Structure of proximal and distant regulatory elements in the human genome Ivan Ovcharenko Computational Biology Branch National Center for Biotechnology.
Computer Vision James Hays
Principles of using neural networks for predicting molecular traits from DNA sequence Principles of using neural networks for predicting molecular traits.
Interpreting noncoding variants
Introduction to Neural Networks
Mitosis & Meiosis Lesson 6.
Albert Xue, Binbin Huang, Jianrong Wang
Towards Understanding the Invertibility of Convolutional Neural Networks Anna C. Gilbert1, Yi Zhang1, Kibok Lee1, Yuting Zhang1, Honglak Lee1,2 1University.
Chapter 18: Regulation of Gene Expression
Regulation of Gene Expression
Very Deep Convolutional Networks for Large-Scale Image Recognition
Schedule for the Afternoon
CSC 578 Neural Networks and Deep Learning
Review Warm-Up What is the Central Dogma?
Object Detection Creation from Scratch Samsung R&D Institute Ukraine
Artificial Neural Networks Thomas Nordahl Petersen & Morten Nielsen
Finding regulatory modules
Neural Networks Geoff Hulten.
Visualizing and Understanding Convolutional Networks
Presented by, Jeremy Logue.
Problems with CNNs and recent innovations 2/13/19
CSC 578 Neural Networks and Deep Learning
Presented by, Jeremy Logue.
Deep Learning in Bioinformatics
CS295: Modern Systems: Application Case Study Neural Network Accelerator Sang-Woo Jun Spring 2019 Many slides adapted from Hyoukjun Kwon‘s Gatech “Designing.
Image recognition.
Example of training and deployment of deep convolutional neural networks. Example of training and deployment of deep convolutional neural networks. During.
CSC 578 Neural Networks and Deep Learning
Presentation transcript:

CS273B: Deep learning for Genomics and Biomedicine Lecture 2: Convolutional neural networks and applications to functional genomics 09/28/2016 Anshul Kundaje, James Zou, Serafim Batzoglou

Outline Anatomy of the human genome Introduction to next-gen sequence and protein-DNA binding maps Convolutional neural networks for predicting protein-DNA binding maps from DNA sequence Multi-modal convolutional neural networks for predicting protein- DNA binding maps Convolutional neural networks on images

Anatomy of the human genome

2003 The Human Genome ~ 3 billion nucleotides Tgccaagcagcaaagttttgctgctgtttatttttgtagctcttactatattct acttttaccattgaaaatattgaggaagttatttatatttctattttttatatat tatatattttatgtattttaatattactattacacataattattttttatatatatga agtaccaatgacttccttttccagagcaataatgaaatttcacagtatgaaa atggaagaaatcaataaaattatacgtgacctgtggcgaagtacctatcgtg gacaaggtgagtaccatggtgtatcacaaatgctctttccaaagccctctcc gcagctcttccccttatgacctctcatcatgccagcattacctccctggaccc ctttctaagcatgtctttgagattttctaagaattcttatcttggcaacatctt gtagcaagaaaatgtaaagttttctgttccagagcctaacaggacttacata tttgactgcagtaggcattatatttagctgatgacataataggttctgtcata gtgtagatagggataagccaaaatgcaataagaaaaaccatccagaggaa actcttttttttttctttttcttttttttttttccagatggagtctcgcacttc tctgtcacccgggctggagcgcagtggtgcaatcttggctcactgcaacct ccacctcctgggttcaggtgattctcccacctcagcctcccgagtagtagct ggaattacaggtgcgcgctcccacacctggctaattttttgtattcttagta gagatggggtttcaccatgttggccaggctggtctcaaactcctgccctca ggtgatctgcccaccttggcctcccagtgttgggtttacaggcgtgagcca ccgcgcctggcctggaggaaactcttaacagggaaactaagaaagagttg aggctgaggaactggggcatctgggttgcttctggccagaccaccaggct cttgaatcctcccagccagagaaagagtttccacaccagccattgttttcct ctggtaatgtcagcctcatctgttgttcctaggcttacttgatatgtttgtaa atgacaaaaggctacagagcataggttcctctaaaatattcttcttcctgtgt cagatattgaatacatagaaatacggtctgatgccgatgaaaatgtatcagct tctgataaaaggcggaattataactaccgagtggtgatgctgaagggagac acagccttggatatgcgaggacgatgcagtgctggacaaaaggcaggtat ctcaaaagcctggggagccaactcacccaagtaactgaaagagagaaaca aacatcagtgcagtggaagcacccaaggctacacctgaatggtgggaagc tctttgctgctatataaaatgaatcaggctcagctactattatt ………… 2003 ~ 3 billion nucleotides The Human Genome

DNA: the molecule of heredity Tgccaagca ||||||||||| ACGGTTCGT Forward strand Reverse complement 5’ 3’ Double helix (double stranded)

Chromosomes in humans TgccaAgca ||||||||||| ACGGTTCGT TgccaAgca Humans are diploid (2 copies of each chromosome) 22 pairs of autosomes Sex chromosomes: female (X,X) , male (X,Y) Mitochondrial DNA (circular, many copies per cell) Diploid Human genome = ~3 billion bp X 2

Functional elements in the genome mRNA Protein Transcription factors (Regulatory proteins) Chromatin (epigenetic) modifications Active Gene Nucleosomes Promoter Enhancer Motif Insulator DNA Repressed Gene Ecker et al. 2012

One genome  Many cell types ACCAGTTACGACGGTCAGGGTACTGATACCCCAAACCGTTGACCGCATTTACAGACGGGGTTTGGGTTTTGCCCCACACAGGTACGTTAGCTACTGGTTTAGCAATTTACCGTTACAACGTTTACAGGGTTACGGTTGGGATTTGAAAAAAAGTTTGAGTTGGTTTTTTCACGGTAGAACGTACCGTTACCAGTA

Introduction to functional genomics & next-gen sequencing

What is Functional Genomics? GGCAATACGATATTAGCAAATAAACGATAGTATACAAATCGTATTAC... :-) ~ 3 billion bases Genome Assembly 2003 Genomic sequence => Static What is the context-specific function of different regions (bases) of the genome? How to explain diversity of cell-types? How to explain dynamic cellular repsonse?

Sequencing technologies

Slides from Ben Langmead

Slides from Ben Langmead

Slides from Ben Langmead

Slides from Ben Langmead

Slides from Ben Langmead

Mapping short-reads to reference genome Naïve method Scan whole genome with every read Problem: Too slow Indexing + Alignment approach Create a compressed reference ‘genome index’ a map of where each short subsequence of length ‘k’ hits the genome Map reads using index via smart alignment algorithms and data structures (e.g suffix array) Allow for errors: insertions, deletions, mismatches in alignments Run times for indexing alignment Indexing human genome ~ 3 hours Alignment speed: 2 million 35 bp reads on 1 processor ~20 mins Alignment speed depends on error rate http://www.nature.com/nrg/journal/v14/n5/box/nrg3433_BX2.html

Using sequencing for functional genomics Genome-wide maps of biochemical activity Genome-wide expts. Protein-DNA binding maps chromatin modification maps Nucleosome positioning maps RNA expression Repressed Gene Transcription factors (Regulatory proteins) Enhancer Insulator Promoter mRNA Protein Active Gene Nucleosomes Chromatin modifications Cellular Dynamics Different cell-types/tissues Diseased states (e.g. cancer) Different perturbations (stimuli)

Protein-DNA binding maps Chromatin immunoprecipitation (ChIP-seq) Protein-DNA binding maps Maps of histone modifications Maps of histone variants

Genome-wide ChIP-seq signal maps Transcription factor binding map Chromatin modification maps

DNA sequence determinants of protein-DNA interactions

TRANSCRIPTION FACTOR BINDING Key properties of regulatory sequence TRANSCRIPTION FACTOR BINDING Regulatory proteins called transcription factors (TFs) bind to high affinity sequence patterns (motifs) in regulatory DNA Regulatory DNA sequences Transcription factor Motif Ecker et al. 2012

Sequence motifs Position weight matrix (PWM) PWM logo GGATAA CGATAA 1 0.5 C G T Position weight matrix (PWM) Bits PWM logo https://en.wikipedia.org/wiki/Sequence_logo GGATAA CGATAA CGATAT GGATAT Set of aligned sequences Bound by TF Move scorng part to separate slide Ecker et al. 2012

Sequence motifs Accounting for genomic background nucleotide distribution A -5.7 -3.2 3.7 0.6 C 0.5 G T Position-specific scoring matrix (PSSM) Move scorng part to separate slide PSSM logo Ecker et al. 2012

Scoring a sequence with a motif PSSM PSSM parameters A -5.7 -3.2 3.7 0.6 C 0.5 G T Scoring weights W G C A T Input sequence One-hot encoding (X)

Convolution: Scoring a sequence with a PSSM Motif match Scores sum(W * x) -5.4 A -5.7 -3.2 3.7 0.6 C 0.5 G T Scoring weights W G C A T Input sequence One-hot encoding (X)

Convolution G C A T Motif match Scores sum(W * x) Scoring weights W -5.4 2.0 A -5.7 -3.2 3.7 0.6 C 0.5 G T Scoring weights W G C A T Input sequence One-hot encoding (X)

Convolution G C A T Motif match Scores sum(W * x) Scoring weights W -2.2 -5.4 2.0 -4.3 -24 -17 -18 -11 -12 16 -5.5 -8.5 -5.2 A -5.7 -3.2 3.7 0.6 C 0.5 G T Scoring weights W Move thersholding into separate slide G C A T Input sequence One-hot encoding (X)

Thresholded Motif Scores Thresholding scores Thresholded Motif Scores max(0, W*x) 2.0 16 Motif match Scores W*x -2.2 -5.4 2.0 -4.3 -24 -17 -18 -11 -12 16 -5.5 -8.5 -5.2 A -5.7 -3.2 3.7 0.6 C 0.5 G T Scoring weights W Move thersholding into separate slide G C A T Input sequence One-hot encoding (X)

Convolutional neural networks for learning from DNA sequence

Learning patterns in regulatory DNA sequence Positive class of genomic sequences bound a transcription factor of interest Negative class of genomic sequences not bound by a transcription factor of interest TF Can we learn patterns in the DNA sequence that distinguish these 2 classes of genomic sequences?

HOMOTYPIC MOTIF DENSITY Key properties of regulatory sequence HOMOTYPIC MOTIF DENSITY Regulatory sequences often contain more than one binding instance of a TF resulting in homotypic clusters of motifs of the same TF Ecker et al. 2012

HETEROTYPIC MOTIF COMBINATIONS Key properties of regulatory sequence HETEROTYPIC MOTIF COMBINATIONS Regulatory sequences often bound by combinations of TFs resulting in heterotypic clusters of motifs of different TFs Ecker et al. 2012

SPATIAL GRAMMARS OF HETEROTYPIC MOTIF COMBINATIONS Key properties of regulatory sequence SPATIAL GRAMMARS OF HETEROTYPIC MOTIF COMBINATIONS Regulatory sequences are often bound by combinations of TFs with specific spatial and positional constraints resulting in distinct motif grammars Ecker et al. 2012

(An artificial neuron) A simple classifier (An artificial neuron)   parameters   Linear function Z   Training the neuron means learning the optimal w’s and b

(An artificial neuron) A simple classifier (An artificial neuron) Logistic / Sigmoid Useful for predicting probabilities   parameters     Non-linear function   Y     Training the neuron means learning the optimal w’s and b

(An artificial neuron) A simple classifier (An artificial neuron) ReLu (Rectified Linear Unit) Useful for thresholding   parameters       Non-linear function   Y     Training the neuron means learning the optimal w’s and b

Artificial neuron can represent a motif   parameters -2.2 -5.4 2.0 -4.3 -24 -17     Y  

Biological motivation of DCNN Threshold scores using ReLU Max pool thresholded scores over windows Predict probabilities using logistic neuron Scan sequence using filters Convolutional filters learn motifs (PSSM)

Deep convolutional neural network Sigmoid activations P (TF = bound | X) Typically followed by one or more fully connected layers Maxpooling layers take the max over sets of conv layer outputs Maxpooling layer pool width = 2 stride = 1 max Max = 2 Max = 6 Conv Layer 2 Kernel width = 3 stride = 1 num filters / num channels = 2 total neurons = 6 Later conv layers operate on outputs of previous conv layers 1 2 6 Convolutional layer (same color = shared weights) Conv Layer 1 Kernel width = 4 stride = 2* num filters / num channels = 3 Total neurons = 15 G C A T *for genomics, a stride of 1 for conv layers is recommended

Multi-task CNN G C A T Multi-task output (sigmoid activations here) P (TF1 = bound | X) P (TF2 = bound | X) Typically followed by one or more fully connected layers Maxpooling layers take the max over sets of conv layer outputs Maxpooling layer pool width = 2 stride = 1 max Max = 2 Max = 6 Conv Layer 2 Kernel width = 3 stride = 1 num filters / num channels = 2 total neurons = 6 Later conv layers operate on outputs of previous conv layers 1 2 6 Convolutional layer (same color = shared weights) Conv Layer 1 Kernel width = 4 stride = 2* num filters / num channels = 3 Total neurons = 15 G C A T

http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html

Regulatory DNA sequence simulator + simple CNN models + hands tutorial http://kundajelab.github.io/dragonn/ Add slide directing to static notebook if login fails Many open questions on what are optimal CNN (or other deep learning) architectures for learning from DNA sequence data

http://dreamchallenges.org/

Additional optional readings

In Canvas https://canvas.stanford.edu/courses/51037/files/folder/LectureMaterial/Lecture2