[BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 15:

Presentation transcript:

[BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 15: TF Motifs (Harendra)

Announcements: Project milestones due today.

Review: Transcriptional regulation of genes. Thousands of transcription factor-CRM interactions control gene expression in each cell type. [Figure labels: Transcription Start Site (TSS); Enhancer (CRM).]

Last Time: ChIP-seq, a first glimpse of the regulatory genome in action. [Figure labels: cis-regulatory peak; peak calling.]

Last Time: Infer functions of ChIP-seq binding profile using GREAT. GREAT = Genomic Regions Enrichment of Annotations Tool. [Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. ‘actin cytoskeleton’).] p = 0.33 of the genome is annotated with the term, n = 6 genomic regions, k = 5 genomic regions hit the annotation: $P = \Pr_{\text{binom}}(k \ge 5 \mid n = 6,\ p = 0.33)$.

GREAT gives you a table of functions: top GREAT enrichments of SRF.
Columns: Ontology | Term | # Genes | Binomial P-value | Experimental support *
Gene Ontology | actin cytoskeleton, actin binding | 7x x10 -5 | Miano et al
Pathway Commons | TRAIL signaling, Class I PI3K signaling | 5x x10 -6 | Bertolotto et al, Poser et al
TreeFam | FOS gene family | 1x | Chai & Tarnawski 2002
TF Targets | Targets of SRF, Targets of GABP, Targets of YY1, Targets of EGR1 | 5x x x x10 -4 | Positive control, ChIP-seq support, Natesan & Gilman
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.

Last Time: Infer functions of ChIP-seq binding profile using GREAT. GREAT = Genomic Regions Enrichment of Annotations Tool. [Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. ‘actin binding’).] p = 0.5 of the genome is annotated with the term, n = 6 genomic regions, k = 4 genomic regions hit the annotation: $P = \Pr_{\text{binom}}(k \ge 4 \mid n = 6,\ p = 0.5)$.
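The two binomial examples on the GREAT slides can be checked numerically. The sketch below is a plain upper-tail binomial sum, not GREAT's own code; the function name `binomial_tail` is made up for illustration.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 'actin cytoskeleton' example: 5 of 6 regions hit a term covering 0.33 of the genome.
p_cytoskeleton = binomial_tail(5, 6, 0.33)
# 'actin binding' example: 4 of 6 regions hit a term covering 0.5 of the genome.
p_binding = binomial_tail(4, 6, 0.5)
print(f"{p_cytoskeleton:.4f}  {p_binding:.4f}")
```

The first example comes out near 0.017 and the second near 0.34, so the same six regions are far more surprising for the rarer annotation.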

But doing the experiment is the hard part!
- Hard or impossible to get the required cells
  - Some cells don’t occur in enough quantity to ChIP
  - Others are hard to dissect
  - Certain human tissues are hard to obtain
- Hard to get a good antibody
  - Ex: We have ChIP results for a factor in brain. We have not been able to repeat it since we can’t find the same antibody
- Lots of time and money to do one experiment
- Only information for one context (cell type or time)
Can we computationally predict the binding sites for many contexts and factors?

Recall: TFBS Position Weight Matrix (PWM). Experimentally determined sites: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTCGACG, AGGGCGTT, ATGACATG, ATGGCATG, ACTGGATG. Cons: ATGGCATG. [Figure: alignment (count) matrix and frequency weight matrix, rows A, C, G, T.] Can we use a PWM to predict where the TF will bind in the genome (without doing ChIP-seq)?
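The count and frequency matrices follow directly from the ten sites listed on the slide. A minimal sketch, with helper names (`count_matrix`, `frequency_matrix`, `consensus`) chosen here for illustration:

```python
from collections import Counter

# The ten experimentally determined sites listed on the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTCGACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]
BASES = "ACGT"

def count_matrix(sites):
    """One {base: count} dict per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must all have the same length")
    columns = []
    for i in range(length):
        seen = Counter(s[i] for s in sites)
        columns.append({b: seen.get(b, 0) for b in BASES})
    return columns

def frequency_matrix(counts):
    """Divide each column's counts by the number of sites in that column."""
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def consensus(counts):
    """Most frequent base per column (ties go to the first base in ACGT order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in counts)

counts = count_matrix(SITES)
for base in BASES:
    print(base, [col[base] for col in counts])
print("Cons", consensus(counts))
```

The consensus recovered this way is ATGGCATG, matching the slide.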

Binding Site Prediction using Match. Problem: high number of false positives.
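To make the false-positive problem concrete, here is a generic log-odds PWM scan. It is a sketch only and is not the Match algorithm: the pseudocount, the uniform background, the threshold, and the toy sequence are all assumptions made for illustration, and only the forward strand is scanned.

```python
from math import log2

# The ten sites from the PWM slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTCGACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]
BASES = "ACGT"

def log_odds_matrix(sites, pseudocount=0.5, background=0.25):
    """Per-column log2(frequency / background), with a pseudocount (assumed value)."""
    n = len(sites)
    matrix = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        matrix.append({
            b: log2((column.count(b) + pseudocount) / (n + 4 * pseudocount) / background)
            for b in BASES
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every forward-strand window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other symbols
        score = sum(matrix[i][ch] for i, ch in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

pwm = log_odds_matrix(SITES)
toy_sequence = "CCATGGCATGAATTTGCCACGTT"  # made-up sequence containing two of the sites
hits = scan(toy_sequence, pwm, threshold=5.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

Lowering the threshold admits more windows; over a whole genome a short, degenerate motif matches by chance at a very large number of positions.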

Recall: TFBS Position Weight Matrix (PWM). Experimentally determined sites: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTCGACG, AGGGCGTT, ATGACATG, ATGGCATG, ACTGGATG. Cons: ATGGCATG. [Figure: alignment (count) matrix and frequency weight matrix, rows A, C, G, T, with the information content of each column.] Information content of a motif = sum of all columns = 6.0
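One common way to compute per-column information content is 2 bits minus the Shannon entropy of the column, assuming a uniform background and no small-sample correction. The sketch below uses that convention, which is an assumption and not necessarily the one behind the slide: applied to the ten listed sites it totals roughly 8.4 bits, so the 6.0 quoted on the slide presumably rests on a different convention or matrix.

```python
from math import log2

# The ten sites from the PWM slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTCGACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]

def column_information(column: str) -> float:
    """Information content in bits of one alignment column (uniform background assumed)."""
    n = len(column)
    entropy = 0.0
    for base in "ACGT":
        f = column.count(base) / n
        if f > 0:
            entropy -= f * log2(f)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for a DNA column

def motif_information(sites) -> float:
    """Sum of the per-column information content."""
    columns = ["".join(s[i] for s in sites) for i in range(len(sites[0]))]
    return sum(column_information(c) for c in columns)

per_column = [
    column_information("".join(s[i] for s in SITES)) for i in range(len(SITES[0]))
]
print([round(x, 2) for x in per_column])
print(round(motif_information(SITES), 2))
```

A perfectly conserved column contributes 2 bits and a uniform column contributes 0, which is why longer, more conserved motifs are more specific.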

Information content is a measure of motif specificity. SRF REST SPIB (IC ~ 12) (IC ~ 5) (IC ~ 25). How do these compare to a library of many PWMs?

PWMs have a range of information content. [Figure labels: SRF, REST, SPIB.]

Information content determines how accurately we can predict the binding site. [Axis: measure of motif specificity.] SRF: 2 million matches to the SRF motif, but ChIP-seq and other estimates suggest ≈ 10,000 actual binding sites. Can we do better?
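As a rough upper bound on how many motif matches can be real, assume (optimistically) that every one of the ≈ 10,000 actual SRF sites is among the 2 million matches:

```latex
\[
\text{fraction of matches that are real sites}
\;\le\; \frac{10^{4}}{2 \times 10^{6}} \;=\; 0.005 \;=\; 0.5\%
\]
```

So at least about 99.5% of raw motif matches would be false positives under these numbers.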

Use excess conservation to improve prediction accuracy. Aaron Shoa Wenger et al., PRISM offers a comprehensive genomic approach to transcription factor function prediction. 2013

Use shuffled motifs to calculate confidence of excess conservation binding site prediction. [Diagram: transcription factor motif, genome-wide binding site predictions; 10 shuffled transcription factor motifs, genome-wide binding site predictions. Plot: fraction conserved vs. branch length (subst / site), curves for real and shuffled.] Confidence is the fraction conserved in excess: confidence = excess / total, with excess = 0.12 and total = 0.32.
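The confidence on this slide can be sketched as follows. The function name and the clamping of a negative excess to zero are assumptions made for illustration; the numbers in the example are the ones on the slide (total = 0.32, excess = 0.12, so the shuffled average is 0.20).

```python
from statistics import fmean

def excess_conservation_confidence(real_fraction, shuffled_fractions):
    """confidence = excess / total.

    total  = fraction of real-motif matches conserved at or beyond the threshold
    excess = total minus the average fraction over the shuffled motifs
    """
    total = real_fraction
    if total <= 0:
        return 0.0
    excess = total - fmean(shuffled_fractions)
    return max(excess, 0.0) / total  # assumption: a negative excess counts as zero

# Ten shuffled motifs, each conserved at the slide's implied average of 0.20.
print(excess_conservation_confidence(0.32, [0.20] * 10))
```

With the slide's numbers the confidence is 0.12 / 0.32 = 0.375.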

Probabilistic interpretation. Confidence is the probability that a motif instance is functional given its observed conservation. R: real motif; S: average shuffled motif; C: conservation as branch length (subst / site). [Plot annotation: Pr_R(C ≥ 1.5) = 0.2.]

\begin{align*}
\Pr_R(\text{functional} \mid C \ge c)
  &= 1 - \Pr_R(\text{not functional} \mid C \ge c) \\
  &= 1 - \frac{\Pr_R(C \ge c \mid \text{not } F)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
  &= 1 - \frac{\Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
  &= \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
  &\approx \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)}{\Pr_R(C \ge c)}
   = \frac{\text{excess}}{\text{total}}
\end{align*}

Excess conservation score defined by genomic background. http://cs173.stanford.edu

Excess conservation score also defined by motif.

Perform genome-wide binding site predictions… ARE THE PREDICTIONS ANY GOOD?

Use ChIP-seq overlap as a measure of sensitivity. Compare genome-wide binding site predictions for one factor (Ex: E2F4) with ChIP-seq for the same factor (Ex: E2F4). Sensitivity = Overlapping ChIP-peaks / Total ChIP-peaks. But how do you assess if your overlap is good? Compare to the best tool out there (or all the tools, if there is no “best”).
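The sensitivity formula above can be sketched with a simple interval-overlap check. The (chrom, start, end) half-open interval representation and the toy coordinates are assumptions for illustration; a real analysis would use a proper interval library on actual peak files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def sensitivity(chip_peaks, predictions):
    """Overlapping ChIP-peaks / total ChIP-peaks."""
    if not chip_peaks:
        raise ValueError("need at least one ChIP-seq peak")
    overlapping = sum(
        1 for peak in chip_peaks if any(overlaps(peak, site) for site in predictions)
    )
    return overlapping / len(chip_peaks)

# Made-up toy coordinates: two of the four peaks contain a predicted site.
chip = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr2", 900, 950)]
pred = [("chr1", 150, 160), ("chr2", 195, 205), ("chr3", 100, 110)]
print(sensitivity(chip, pred))
```

The quadratic loop is fine for a toy example; genome-scale peak sets would call for sorted intervals or an interval tree.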

Excess conservation binding site prediction is more accurate than existing methods. [Figure label: (prior state of the art).]

Excess conservation captures binding site profile similar to ChIP-seq. [Figure labels: ChIP-seq, MotifMap, PRISM; axis: conservation (% identity).]

Submit predictions to GREAT. Now we have good genome-wide binding site predictions for many factors. Let's submit them to GREAT and find out what they are doing…

Comparing binding site prediction to ChIP-seq
Transcription factor | Ontology | Top-ranked biological context | GREAT rank for ChIP-seq | Experimental support
GABPA | GO Biological Process | translation | 2 | (Genuario and Perry, 1996)
GABPA | GO Cellular Component | membrane coat | 14 | Novel
GABPA | GO Molecular Function | translation initiation factor activity | 4 | (Genuario and Perry, 1996)
GABPA | Mouse Phenotypes | increased single-positive T cell number | None | (Yu et al., 2010)
GABPA | PANTHER Pathway | general transcription by RNA polymerase I | 1 | (Hauck et al., 2002)
GABPA | Pathway Commons | transcription | 3 | (Hauck et al., 2002)
REST (NRSF) | GO Biological Process | neurotransmitter transport | 1 | (Schoenherr et al., 1996)
REST (NRSF) | GO Cellular Component | neuronal cell body | None | (Schoenherr et al., 1996)
REST (NRSF) | GO Molecular Function | cation channel activity | 1 | (Schoenherr et al., 1996)
REST (NRSF) | Mouse Phenotypes | abnormal synaptic transmission | 1 | (Schoenherr et al., 1996)
REST (NRSF) | PANTHER Pathway | synaptic vesicle trafficking | 2 | (Schoenherr et al., 1996)
REST (NRSF) | Pathway Commons | transmission across chemical synapses | 3 | (Schoenherr et al., 1996)
SRF (in Jurkat) | GO Biological Process | muscle structure development | None | (Miano et al., 2007)
SRF (in Jurkat) | GO Cellular Component | actin cytoskeleton | 1 | (Miano et al., 2007)
SRF (in Jurkat) | GO Molecular Function | structural constituent of muscle | None | (Miano et al., 2007)
SRF (in Jurkat) | Mouse Phenotypes | dilated heart ventricles | None | (Parlakian et al., 2004)
SRF (in Jurkat) | PANTHER Pathway | cytoskeletal regulation by Rho GTPase | None | (Hill et al., 1995)
SRF (in Jurkat) | Pathway Commons | regulation of insulin secretion by acetylcholine | None | Novel
STAT3 (in mESC) | GO Biological Process | negative regulation of signal transduction | None | (Naka et al., 1997)
STAT3 (in mESC) | GO Molecular Function | transforming growth factor beta binding | None | (Kinjyo et al., 2006)
STAT3 (in mESC) | Mouse Phenotypes | abnormal spleen B cell follicle morphology | None | (Schmidlin et al., 2009)
STAT3 (in mESC) | Pathway Commons | Signaling events mediated by TCPTP | None | (Yamamoto et al., 2002)

PRISM re-discovers known functions
TF | function | p-value | target genes
SRF | muscle structure development | 7.43× |
GLI2 | skeletal system development | 7.07× |
CRX | retinal photoreceptor degeneration | 1.30× |
AR | abnormal spermiogenesis | 1.19× |
Is the number of re-discovered known functions impressive?

Evaluate re-discovery of known function using “closed loops”. How can we assess if the functional associations predicted by PRISM for a particular TF are reasonable without reading a lot of papers? One way is to check if the TFs are annotated with the function (form a closed loop). Example: SRF is predicted to regulate genes involved in “muscle structure development”. Is SRF itself annotated with the term “muscle structure development”? YES: a “closed loop”.
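The closed-loop check can be sketched as a set lookup. The data structures and the function name are assumptions for illustration; the two SRF examples come from the slides (SRF is annotated with “muscle structure development”, while SRF with “actin cytoskeleton” is given later as a non-closed loop).

```python
def closed_loop_fraction(predictions, tf_annotations):
    """Fraction of predicted (TF, term) associations where the TF itself carries the term.

    predictions:    iterable of (tf, term) pairs predicted from the TF's binding sites
    tf_annotations: dict mapping each TF to the set of terms annotated to that TF's gene
    """
    predictions = list(predictions)
    if not predictions:
        return 0.0
    closed = sum(
        1 for tf, term in predictions if term in tf_annotations.get(tf, set())
    )
    return closed / len(predictions)

tf_annotations = {"SRF": {"muscle structure development"}}
predictions = [
    ("SRF", "muscle structure development"),  # closed loop
    ("SRF", "actin cytoskeleton"),            # not a closed loop
]
print(closed_loop_fraction(predictions, tf_annotations))
```

The same function applied to predictions made from shuffled PWM libraries would give a null distribution to compare against.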

PRISM predictions are consistent with known transcription factor biology. Null Model: How many closed loops using 50,000 random shuffled PWM libraries?

Many non-closed loops are still true
1. Incomplete annotation
2. “Regulation of” annotation
TF | function | p-value | target genes
GATA6 | abnormal pancreas development | 5.69× |
SRF | actin cytoskeleton | 4.84× |
[Figure label: Nature Genetics, December.] SRF acts in the nucleus, where it regulates actin cytoskeleton genes.

Raw GREAT results need cleaning for conserved TFBS. Now we have good genome-wide binding site predictions for many factors AND we have functional predictions without ChIP-seq. Was it as easy as creating binding sites and submitting the results to GREAT? …not quite…

Shuffled motifs also give GREAT enrichments. Examine closely. [Pipeline diagram: transcription factor motif, genome-wide binding site predictions; 10 shuffled transcription factor motifs, genome-wide binding site predictions; run GREAT and observe biological function; filter; PRISM.]

Shuffled motifs are used to create an “E-value” metric to blacklist enrichments that show up for shuffles.
 | Stage 1: GREAT on binding site prediction (Obtained = GREAT) | Stage 2: Top significant GREAT terms (Kept) | Stage 3: PRISM terms via blacklisting (Kept = PRISM) | PRISM vs. GREAT on b.s. prediction
# TF-term associations | 31,946 | 7,529 | 1,658 | GREAT predictions kept: 5.2%
TF-term FDR (from shuffles) | 50.5% | 49.5% | 16.4% | FDR improvement: 308%
closed loop % | 3.3% | 5.3% | 10.9% | fraction loops improvement: 329%
What are all the terms we are throwing away?
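The summary column of this table can be re-derived from the stage 1 and stage 3 values. The sketch below only redoes the arithmetic on the rounded numbers shown on the slide; the variable names are made up for illustration.

```python
# Values as printed on the slide (stage 1 = raw GREAT, stage 3 = PRISM).
associations_stage1, associations_stage3 = 31_946, 1_658
fdr_stage1, fdr_stage3 = 50.5, 16.4        # TF-term FDR, in percent
loops_stage1, loops_stage3 = 3.3, 10.9     # closed loops, in percent

kept_pct = 100 * associations_stage3 / associations_stage1
fdr_improvement_pct = 100 * fdr_stage1 / fdr_stage3
loop_improvement_pct = 100 * loops_stage3 / loops_stage1

print(f"GREAT predictions kept: {kept_pct:.1f}%")
print(f"FDR improvement: {fdr_improvement_pct:.0f}%")
print(f"fraction loops improvement: {loop_improvement_pct:.0f}%")
```

The first two figures reproduce the slide's 5.2% and 308%; the last comes out at about 330% from the rounded inputs, close to the 329% quoted, which was presumably computed from unrounded percentages.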

GREAT enrichments from shuffles are due to conservation bias. Create 10,000 random sets of random conserved non-coding regions. Run GREAT. How do the enrichments compare to those from shuffled motifs? [Figure labels: Shuffles (2488), CNEs (2279).] Pro: E-value helps us get more accurate predictions by removing false predictions. Con: the conservation bias filter causes us to lose potentially real enrichments in systems that are more often conserved.

So far… “Excess Conservation” advanced the state of the art for binding site prediction. “PRISM pipeline” combined accurate binding site prediction with GREAT. Publicly offered as a web application: bejerano.stanford.edu/prism

The rest of the talk includes pre-publication work.