CS273B: Deep learning for Genomics and Biomedicine

Lecture 2: Convolutional neural networks and applications to functional genomics (09/28/2016). Anshul Kundaje, James Zou, Serafim Batzoglou

Outline: Anatomy of the human genome; Introduction to next-gen sequencing and protein-DNA binding maps; Convolutional neural networks for predicting protein-DNA binding maps from DNA sequence; Multi-modal convolutional neural networks for predicting protein-DNA binding maps; Convolutional neural networks on images

Anatomy of the human genome

The Human Genome (2003): ~3 billion nucleotides. Tgccaagcagcaaagttttgctgctgtttatttttgtagctcttactatattct acttttaccattgaaaatattgaggaagttatttatatttctattttttatatat tatatattttatgtattttaatattactattacacataattattttttatatatatga agtaccaatgacttccttttccagagcaataatgaaatttcacagtatgaaa atggaagaaatcaataaaattatacgtgacctgtggcgaagtacctatcgtg gacaaggtgagtaccatggtgtatcacaaatgctctttccaaagccctctcc gcagctcttccccttatgacctctcatcatgccagcattacctccctggaccc ctttctaagcatgtctttgagattttctaagaattcttatcttggcaacatctt gtagcaagaaaatgtaaagttttctgttccagagcctaacaggacttacata tttgactgcagtaggcattatatttagctgatgacataataggttctgtcata gtgtagatagggataagccaaaatgcaataagaaaaaccatccagaggaa actcttttttttttctttttcttttttttttttccagatggagtctcgcacttc tctgtcacccgggctggagcgcagtggtgcaatcttggctcactgcaacct ccacctcctgggttcaggtgattctcccacctcagcctcccgagtagtagct ggaattacaggtgcgcgctcccacacctggctaattttttgtattcttagta gagatggggtttcaccatgttggccaggctggtctcaaactcctgccctca ggtgatctgcccaccttggcctcccagtgttgggtttacaggcgtgagcca ccgcgcctggcctggaggaaactcttaacagggaaactaagaaagagttg aggctgaggaactggggcatctgggttgcttctggccagaccaccaggct cttgaatcctcccagccagagaaagagtttccacaccagccattgttttcct ctggtaatgtcagcctcatctgttgttcctaggcttacttgatatgtttgtaa atgacaaaaggctacagagcataggttcctctaaaatattcttcttcctgtgt cagatattgaatacatagaaatacggtctgatgccgatgaaaatgtatcagct tctgataaaaggcggaattataactaccgagtggtgatgctgaagggagac acagccttggatatgcgaggacgatgcagtgctggacaaaaggcaggtat ctcaaaagcctggggagccaactcacccaagtaactgaaagagagaaaca aacatcagtgcagtggaagcacccaaggctacacctgaatggtgggaagc tctttgctgctatataaaatgaatcaggctcagctactattatt …………

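DNA: the molecule of heredity. Double helix (double stranded): the forward strand 5’-Tgccaagca-3’ is base-paired with the opposite strand, drawn beneath it as ACGGTTCGT; read 5’ to 3’, that opposite strand is the reverse complement.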
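The strand relationship on this slide can be sketched in a few lines of Python. This is only an illustration; the function name `reverse_complement` is ours, not the lecture's.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

forward = "TGCCAAGCA"  # slide example, 5'->3'
# Base-by-base partner, in the order it is drawn under the forward strand (3'->5').
paired = "".join(COMPLEMENT[base] for base in forward)

print(paired)                       # ACGGTTCGT
print(reverse_complement(forward))  # TGCTTGGCA
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check for any implementation.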

Chromosomes in humans. Humans are diploid (2 copies of each chromosome). 22 pairs of autosomes. Sex chromosomes: female (X,X), male (X,Y). Mitochondrial DNA (circular, many copies per cell). Diploid human genome = ~3 billion bp × 2

Functional elements in the genome (figure labels: DNA, nucleosomes, chromatin (epigenetic) modifications, promoter, enhancer, insulator, motif, transcription factors (regulatory proteins), active gene, repressed gene, mRNA, protein). Ecker et al. 2012

One genome → Many cell types ACCAGTTACGACGGTCAGGGTACTGATACCCCAAACCGTTGACCGCATTTACAGACGGGGTTTGGGTTTTGCCCCACACAGGTACGTTAGCTACTGGTTTAGCAATTTACCGTTACAACGTTTACAGGGTTACGGTTGGGATTTGAAAAAAAGTTTGAGTTGGTTTTTTCACGGTAGAACGTACCGTTACCAGTA

Introduction to functional genomics & next-gen sequencing

What is Functional Genomics? GGCAATACGATATTAGCAAATAAACGATAGTATACAAATCGTATTAC... :-) ~3 billion bases. Genome assembly (2003): genomic sequence => static. What is the context-specific function of different regions (bases) of the genome? How to explain the diversity of cell types? How to explain dynamic cellular response?

Sequencing technologies

Slides from Ben Langmead

Mapping short reads to a reference genome. Naïve method: scan the whole genome with every read. Problem: too slow. Indexing + alignment approach: (1) create a compressed reference ‘genome index’, a map of where each short subsequence of length ‘k’ hits the genome; (2) map reads using the index via smart alignment algorithms and data structures (e.g., suffix array); (3) allow for errors in alignments: insertions, deletions, mismatches. Run times for indexing and alignment: indexing the human genome takes ~3 hours; aligning 2 million 35 bp reads on 1 processor takes ~20 mins; alignment speed depends on the error rate. http://www.nature.com/nrg/journal/v14/n5/box/nrg3433_BX2.html
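The indexing + alignment idea can be sketched with a toy example. This is a simplified illustration, not how production aligners work: it uses a plain hash map of k-mers rather than a compressed index or suffix array, seeds only on the first k bases of each read, and tolerates mismatches but not insertions or deletions. The names `build_kmer_index` and `map_read` and the toy reference are made up for the example.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every length-k substring to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def map_read(read, genome, index, k, max_mismatches=1):
    """Look up the read's first k-mer, then verify the whole read at each hit."""
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

genome = "ACCAGTTACGACGGTCAGGGTACTGATACCCCAAACCGTTGACC"  # toy reference
index = build_kmer_index(genome, k=4)

print(map_read("GGTCAGGG", genome, index, k=4))  # [(12, 0)] exact match
print(map_read("GGTCAGCG", genome, index, k=4))  # [(12, 1)] one mismatch
```

The index is built once and reused for every read, which is what makes this faster than scanning the whole genome per read.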

Using sequencing for functional genomics: genome-wide maps of biochemical activity. Genome-wide experiments: protein-DNA binding maps, chromatin modification maps, nucleosome positioning maps, RNA expression. Cellular dynamics: different cell types/tissues, diseased states (e.g. cancer), different perturbations (stimuli). (Figure labels: repressed gene, transcription factors (regulatory proteins), enhancer, insulator, promoter, mRNA, protein, active gene, nucleosomes, chromatin modifications.)

Protein-DNA binding maps. Chromatin immunoprecipitation (ChIP-seq) yields: protein-DNA binding maps; maps of histone modifications; maps of histone variants

Genome-wide ChIP-seq signal maps: transcription factor binding map; chromatin modification maps

DNA sequence determinants of protein-DNA interactions

Key properties of regulatory sequence: TRANSCRIPTION FACTOR BINDING. Regulatory proteins called transcription factors (TFs) bind to high-affinity sequence patterns (motifs) in regulatory DNA. (Figure labels: regulatory DNA sequences, transcription factor, motif.) Ecker et al. 2012

Sequence motifs. Set of aligned sequences bound by TF: GGATAA, CGATAA, CGATAT, GGATAT. Position weight matrix (PWM): per-position frequencies of A, C, G, T (entries visible on the slide: 1 and 0.5). PWM logo (y-axis in bits): https://en.wikipedia.org/wiki/Sequence_logo. Ecker et al. 2012
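As a rough sketch, a PWM can be computed from the four aligned sequences on this slide by counting bases in each column. The helper name `build_pwm` is ours, and storing the matrix as one dict per position is just one convenient choice.

```python
ALPHABET = "ACGT"

def build_pwm(aligned_sites):
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    length = len(aligned_sites[0])
    pwm = []
    for i in range(length):
        column = [site[i] for site in aligned_sites]
        pwm.append({base: column.count(base) / len(column) for base in ALPHABET})
    return pwm

sites = ["GGATAA", "CGATAA", "CGATAT", "GGATAT"]  # aligned sequences from the slide
pwm = build_pwm(sites)

for position, column in enumerate(pwm, start=1):
    print(position, column)
```

Positions 2 to 5 come out fully conserved (frequency 1), while positions 1 and 6 are split 0.5/0.5, consistent with the 1 and 0.5 entries on the slide.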

Sequence motifs: accounting for genomic background nucleotide distribution. Position-specific scoring matrix (PSSM) with rows A, C, G, T (values recoverable from the slide: -5.7, -3.2, 3.7, 0.6, 0.5). PSSM logo. Ecker et al. 2012
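One common way to account for background is to turn each PWM frequency into a log-odds score against the background frequency. The sketch below illustrates that idea only. The slide does not state the log base, pseudocount, or background distribution, so log2, a pseudocount of 0.01, and a uniform background are all assumptions, and the resulting numbers will not match the slide's.

```python
import math

ALPHABET = "ACGT"

def pwm_to_pssm(pwm, background, pseudocount=0.01):
    """Convert per-position frequencies to log2 odds versus the background."""
    total = 1.0 + pseudocount * len(ALPHABET)  # renormalise after adding pseudocounts
    return [
        {
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in ALPHABET
        }
        for column in pwm
    ]

# Frequencies for the aligned sites GGATAA, CGATAA, CGATAT, GGATAT.
pwm = [
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5},
]
uniform = {base: 0.25 for base in ALPHABET}  # assumed background
pssm = pwm_to_pssm(pwm, uniform)

for position, column in enumerate(pssm, start=1):
    print(position, {base: round(score, 2) for base, score in column.items()})
```

Bases that are enriched relative to background get positive scores and depleted bases get negative scores; the pseudocount keeps unseen bases from scoring negative infinity.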

Scoring a sequence with a motif PSSM. The PSSM parameters act as scoring weights W (rows A, C, G, T). The input sequence (G C A T ...) is represented by its one-hot encoding (X).
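A minimal sketch of one-hot encoding and of scoring a single window follows. The weights below are hypothetical, not the slide's PSSM. Note also that the matrix is stored with one row per position and columns A, C, G, T, which is the transpose of the slide's layout.

```python
ALPHABET = "ACGT"  # column order assumed for both X and W

def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector in A, C, G, T order."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]

def window_score(weights, window):
    """sum(W * x): multiply weights and one-hot entries elementwise, then add."""
    return sum(w * x
               for w_row, x_row in zip(weights, window)
               for w, x in zip(w_row, x_row))

# Hypothetical 3-position weights (one row per position, columns A, C, G, T).
W = [
    [-1.0, -1.0, 2.0, -1.0],  # position 1 prefers G
    [2.0, -1.0, -1.0, -1.0],  # position 2 prefers A
    [-1.0, -1.0, -1.0, 2.0],  # position 3 prefers T
]

X = one_hot("GAT")
print(X[0])                           # [0.0, 0.0, 1.0, 0.0]
print(window_score(W, X))             # 6.0
print(window_score(W, one_hot("CCC")))  # -3.0
```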

Convolution: scoring a sequence with a PSSM. Motif match scores are sum(W * x), where W holds the scoring weights (rows A, C, G, T) and x is a window of the one-hot encoded input sequence (X). Score shown so far: -5.4

Convolution. Motif match scores are sum(W * x), where W holds the scoring weights (rows A, C, G, T) and x is a window of the one-hot encoded input sequence (X). Scores shown so far: -5.4, 2.0

Convolution. Motif match scores are sum(W * x), where W holds the scoring weights (rows A, C, G, T) and x is a window of the one-hot encoded input sequence (X). Scores across the sequence: -2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2
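The sliding-window scan can be sketched as below, assuming a stride of 1 and no padding. Because each position of x is one-hot, sum(W * x) reduces to picking one weight per position, which is what the code does. The weights and the example sequence are hypothetical, not the slide's.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def scan(seq, weights, stride=1):
    """Slide the filter along seq; each output is sum(W * x) for one window.

    weights has one row per motif position, with columns in A, C, G, T order.
    """
    width = len(weights)
    return [
        sum(weights[j][BASE_INDEX[seq[start + j]]] for j in range(width))
        for start in range(0, len(seq) - width + 1, stride)
    ]

# Hypothetical filter that prefers the 3-mer GAT.
W = [
    [-1.0, -1.0, 2.0, -1.0],
    [2.0, -1.0, -1.0, -1.0],
    [-1.0, -1.0, -1.0, 2.0],
]

scores = scan("CGATAG", W)
print(scores)  # [-3.0, 6.0, -3.0, 0.0]
```

A sequence of length 6 scanned with a width-3 filter gives 6 - 3 + 1 = 4 scores, and the highest score sits at the window that contains GAT.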

Thresholding scores. Motif match scores W*x: -2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2. Thresholded motif scores max(0, W*x): only 2.0 and 16 remain nonzero. (W: scoring weights with rows A, C, G, T; X: one-hot encoding of the input sequence.)
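Applying max(0, score) to the motif match scores listed on this slide takes one line. The scores are copied from the slide; the variable names are ours.

```python
# Motif match scores W*x from the slide, one per window position.
scores = [-2.2, -5.4, 2.0, -4.3, -24.0, -17.0, -18.0, -11.0, -12.0,
          16.0, -5.5, -8.5, -5.2]

# Thresholding: negative scores are clipped to zero.
thresholded = [max(0.0, s) for s in scores]

# Positions that survive the threshold.
hits = [(i, s) for i, s in enumerate(thresholded) if s > 0]
print(hits)  # [(2, 2.0), (9, 16.0)]
```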

Convolutional neural networks for learning from DNA sequence

Learning patterns in regulatory DNA sequence. Positive class: genomic sequences bound by a transcription factor (TF) of interest. Negative class: genomic sequences not bound by that transcription factor. Can we learn patterns in the DNA sequence that distinguish these 2 classes of genomic sequences?

Key properties of regulatory sequence: HOMOTYPIC MOTIF DENSITY. Regulatory sequences often contain more than one binding instance of a TF, resulting in homotypic clusters of motifs of the same TF. Ecker et al. 2012

Key properties of regulatory sequence: HETEROTYPIC MOTIF COMBINATIONS. Regulatory sequences are often bound by combinations of TFs, resulting in heterotypic clusters of motifs of different TFs. Ecker et al. 2012

Key properties of regulatory sequence: SPATIAL GRAMMARS OF HETEROTYPIC MOTIF COMBINATIONS. Regulatory sequences are often bound by combinations of TFs with specific spatial and positional constraints, resulting in distinct motif grammars. Ecker et al. 2012

A simple classifier (an artificial neuron). Linear function: Z = w1*x1 + ... + wn*xn + b, with parameters w (weights) and b (bias). Training the neuron means learning the optimal w’s and b

A simple classifier (an artificial neuron). Non-linear function, logistic / sigmoid: Y = 1 / (1 + exp(-Z)). Useful for predicting probabilities. Training the neuron means learning the optimal w’s and b

A simple classifier (an artificial neuron). Non-linear function, ReLU (Rectified Linear Unit): Y = max(0, Z). Useful for thresholding. Training the neuron means learning the optimal w’s and b
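The three neuron slides can be summarised in a short sketch: a linear function followed by either a sigmoid or a ReLU. The input, weights, and bias below are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import math

def linear(x, w, b):
    """Z = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """Logistic function: squashes Z into (0, 1), useful for probabilities."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: max(0, Z), useful for thresholding."""
    return max(0.0, z)

x = [1.0, 0.0, 2.0]     # hypothetical input
w = [0.5, -1.0, 0.25]   # hypothetical weights
b = -1.5                # hypothetical bias

z = linear(x, w, b)     # 0.5 + 0.0 + 0.5 - 1.5 = -0.5
print(z, sigmoid(z), relu(z))
```

Training would adjust `w` and `b` to reduce a loss on labelled examples; that step is not shown here.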

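An artificial neuron can represent a motif: its parameters play the role of the motif’s scoring weights, and its output Y is the motif match score at each position (e.g. -2.2, -5.4, 2.0, -4.3, -24, -17)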

Biological motivation of DCNN (in order of computation): convolutional filters learn motifs (PSSMs); scan the sequence using the filters; threshold scores using ReLU; max pool thresholded scores over windows; predict probabilities using a logistic neuron

Deep convolutional neural network (layers listed from input to output). Input: one-hot encoded sequence (G C A T ...). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15; within a convolutional layer, same color = shared weights. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6 (example outputs: 1, 2, 6); later conv layers operate on outputs of previous conv layers. Maxpooling layer: pool width = 2, stride = 1 (Max = 2, Max = 6); maxpooling layers take the max over sets of conv layer outputs. Typically followed by one or more fully connected layers. Output: sigmoid activations, P (TF = bound | X). *For genomics, a stride of 1 for conv layers is recommended
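The neuron counts on this slide follow from the usual output-length formula for an unpadded convolution, (input length - kernel width) // stride + 1. The sketch below checks that arithmetic. The input length of 12 is not stated on the slide; it is inferred here as the length that gives 5 positions × 3 filters = 15 neurons in Conv Layer 1, so treat it as an assumption.

```python
def conv_output_length(input_length, kernel_width, stride):
    """Number of positions a filter visits when no padding is used."""
    return (input_length - kernel_width) // stride + 1

def max_pool(values, pool_width, stride):
    """Take the max over each window of pool_width consecutive values."""
    return [max(values[i:i + pool_width])
            for i in range(0, len(values) - pool_width + 1, stride)]

input_length = 12  # assumption: inferred from the slide's neuron counts

len1 = conv_output_length(input_length, kernel_width=4, stride=2)
len2 = conv_output_length(len1, kernel_width=3, stride=1)

print(len1 * 3)  # Conv Layer 1: 5 positions x 3 filters = 15 neurons
print(len2 * 2)  # Conv Layer 2: 3 positions x 2 filters = 6 neurons

# Max pooling over one channel's outputs from the slide (1, 2, 6).
print(max_pool([1, 2, 6], pool_width=2, stride=1))  # [2, 6]
```

With a stride of 1 in Conv Layer 1, as the slide's footnote recommends for genomics, the same 12-base input would give 9 positions per filter instead of 5.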

Multi-task CNN (layers listed from input to output). Input: one-hot encoded sequence (G C A T ...). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15; within a convolutional layer, same color = shared weights. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6 (example outputs: 1, 2, 6); later conv layers operate on outputs of previous conv layers. Maxpooling layer: pool width = 2, stride = 1 (Max = 2, Max = 6); maxpooling layers take the max over sets of conv layer outputs. Typically followed by one or more fully connected layers. Multi-task output (sigmoid activations here): P (TF1 = bound | X), P (TF2 = bound | X)
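A multi-task output layer can be sketched as one logistic unit per task, with every unit reading the same shared features. The feature values reuse the max-pooled outputs from the slide (2 and 6); the task weights and biases are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multitask_output(features, task_weights, task_biases):
    """One sigmoid unit per task, all reading the same shared features."""
    return {
        task: sigmoid(sum(w * f for w, f in zip(weights, features))
                      + task_biases[task])
        for task, weights in task_weights.items()
    }

features = [2.0, 6.0]  # shared representation, e.g. max-pooled outputs
task_weights = {"TF1": [0.5, -0.2], "TF2": [-0.3, 0.4]}  # hypothetical
task_biases = {"TF1": 0.0, "TF2": -1.0}                  # hypothetical

probs = multitask_output(features, task_weights, task_biases)
for task, p in probs.items():
    print(f"P({task} = bound | X) = {p:.2f}")
```

Each sigmoid is independent, so the per-task probabilities need not sum to 1: a sequence can be predicted as bound by both TFs, by one, or by neither.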

http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html

Regulatory DNA sequence simulator + simple CNN models + hands-on tutorial: http://kundajelab.github.io/dragonn/. Many open questions remain about which CNN (or other deep learning) architectures are optimal for learning from DNA sequence data

http://dreamchallenges.org/

Additional optional readings

In Canvas https://canvas.stanford.edu/courses/51037/files/folder/LectureMaterial/Lecture2