REGULATORY GENOMICS Saurabh Sinha, Dept. of Computer Science & Institute of Genomic Biology, University of Illinois.

Slides:

Advertisements

Similar presentations

Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome ECS289A.

Advertisements

Methods to read out regulatory functions

PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.

Computational discovery of gene modules and regulatory networks Ziv Bar-Joseph et al (2003) Presented By: Dan Baluta.

Predicting Enhancers in Co-Expressed Genes Harshit Maheshwari Prabhat Pandey.

Computational detection of cis-regulatory modules Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor Katholieke Universiteit Leuven, Belgium.

Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.

Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.

Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.

Finding Transcription Factor Binding Sites BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG.

Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.

Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.

Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.

Identification of a Novel cis-Regulatory Element Involved in the Heat Shock Response in Caenorhabditis elegans Using Microarray Gene Expression and Computational.

Transcription factor binding motifs (part I) 10/17/07.

Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.

Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – The Transcription.

Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.

Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.

Why microarrays in a bioinformatics class? Design of chips Quantitation of signals Integration of the data Extraction of groups of genes with linked expression.

REGULATORY GENOMICS Saurabh Sinha, Dept. of Computer Science & Institute of Genomic Biology, University of Illinois.

Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.

Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois,

Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)

A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss.

Genome of the week - Deinococcus radiodurans Highly resistant to DNA damage –Most radiation resistant organism known Multiple genetic elements –2 chromosomes,

Learning Structure in Bayes Nets (Typically also learn CPTs here) Given the set of random variables (features), the space of all possible networks.

Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.

Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.

Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.

Motif finding with Gibbs sampling CS 466 Saurabh Sinha.

Using Mixed Length Training Sequences in Transcription Factor Binding Site Detection Tools Nathan Snyder Carnegie Mellon University BioGrid REU 2009 University.

CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.

Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.

Copyright OpenHelix. No use or reproduction without express written consent1.

Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.

Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.

Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.

MEME homework: probability of finding GAGTCA at a given position in the yeast genome, based on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2.

Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.

EB3233 Bioinformatics Introduction to Bioinformatics.

Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.

Local Multiple Sequence Alignment Sequence Motifs

. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.

Special Topics in Genomics Motif Analysis. Sequence motif – a pattern of nucleotide or amino acid sequences GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA.

Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – the Transcription.

Finding genes in the genome

Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.

Transcription factor binding motifs (part II) 10/22/07.

Regulation of Gene Expression

Yiming Kang, Hien-haw Liow, Ezekiel Maier, & Michael Brent

Regulatory Genomics Lab

A Very Basic Gibbs Sampler for Motif Detection

Motifs BCH364C/394P - Systems Biology / Bioinformatics

Statistical Testing with Genes

 The human genome contains approximately genes.  At any given moment, each of our cells has some combination of these genes turned on & others.

Learning Sequence Motif Models Using Expectation Maximization (EM)

De novo Motif Finding using ChIP-Seq

Genome Center of Wisconsin, UW-Madison

Introduction to Bioinformatics II

Schedule for the Afternoon

Protein Occupancy Landscape of a Bacterial Genome

Finding regulatory modules

ChIP-seq Robert J. Trumbly

Regulatory Genomics Lab

BIOBASE Training TRANSFAC® ExPlain™

Statistical Testing with Genes

Regulatory Genomics Lab

Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016

Presentation transcript:

REGULATORY GENOMICS Saurabh Sinha, Dept. of Computer Science & Institute of Genomic Biology, University of Illinois.

Introduction …

Genes to proteins: “transcription” RNA Polymerase http://instruct.westvalley.edu/svensson/CellsandGenes/Transcription%5B1%5D.gif

Regulation of gene expression More frequent transcription => more mRNA => more protein.

Cell states defined by gene expression Genes 1&4 Genes 1&2 Genes 1&3

Cell states defined by gene expression Nurse behavior Foraging behavior anterior posterior fertilization gastrulation Disease onset Progressed disease

How is the cell state specified ? In other words, how is a specific set of genes turned on at a precise time and cell ?

Gene regulation by “transcription factors” TF GENE

Transcription factors may activate … or repress TF TF

Gene regulation by transcription factors determines cell state Bcd (TF) Kr (TF) Eve (TF) Hairy Gene Eve is “on” in a cell if “Bcd present AND Kr NOT present”. Regulatory effects can “cascade”

Gene regulatory networks

Goal: discover the gene regulatory network Sub-goal: discover the genes regulated by a transcription factor

Genome-wide assays One experiment per cell type ... tells us where the regulatory signals may lie One experiment per cell type AND PER TF ... tells us which TF might regulate a gene of interest Expensive !

Goal: discover the gene regulatory network Sub-goal: discover the genes regulated by a transcription factor … by DNA sequence analysis

The regulatory network is encoded in the DNA http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=mcb&part=A2574 TF GENE BINDING SITE TCTAATTG A TF bound to a “TAAT” site in DNA. It should be possible to predict where transcription factors bind, by reading the DNA sequence

Motifs and DNA sequence analysis

Finding TF targets Step 1. Determine the binding specificity of a TF Step 2. Find motif matches in DNA Step 3. Designate nearby genes as TF targets

Step 1. Determine the binding specificity of a TF ACCCGTT ACCGGTT ACAGGAT ACCGGTT ACATGAT “MOTIF” 5 2 A 3 1 C G T

How? 1. DNase I footprinting TAACCCGTTC GTACCGGTTG ACACAGGATT http://nationaldiagnostics.com/article_info.php/articles_id/31 TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT

How? 2. SELEX TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT http://altair.sci.hokudai.ac.jp/g6/Projects/Selex-e.html

How? 3. Bacterial 1-hybrid TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT http://upload.wikimedia.org/wikipedia/commons/5/56/FigureB1H.JPG

How? 4. Protein binding microarrays TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT http://bfg.oxfordjournals.org/content/9/5-6/362/F2.large.jpg

Motif finding tools TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA How did this happen? TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT TAACCCGTTC GTACCGGTTG ACACAGGATT AACCGGTTA GGACATGAT Run a motif finding tool (e.g., “MEME”) on the collection of experimentally determined binding sites, which could be of variable lengths, and in either orientation. We’ll come back to motif finding tools later.

Motif Databases JASPAR: http://jaspar.genereg.net/

Motif Databases JASPAR: http://jaspar.genereg.net/

Motif Databases TRANSFAC Hocomoco: http://hocomoco.autosome.ru/ Public version and License version Hocomoco: http://hocomoco.autosome.ru/ Human and mouse motifs UniProbe: http://the_brain.bwh.harvard.edu/uniprobe/ variety of organisms, mostly mouse and human

Motif Databases Fly Factor Survey: http://pgfe.umassmed.edu/TFDBS/ Drosophila specific In analyzing insect genomes, motifs from this database can be used, with some additional checks.

Sample entry in Fly Factor Survey

Step 2. Finding motif matches in DNA Basic idea: To score a single site s for match to a motif W, we use Motif: Match: CAAAAGGGTTA Apprx. Match: CAAAAGGGGTA

What is Pr (s | W)? Now, say s =ACCGGTT (consensus) 5 2 A 3 1 C G T 1 0.4 A 0.6 0.2 C G T Now, say s =ACCGGTT (consensus) Pr(s|W) = 1 x 1 x 0.6 x 0.6 x 1 x 0.6 x 1 = 0.216. Then, say s = ACACGTT (two mismatches from consensus) Pr(s|W) = 1 x 1 x 0.4 x 0.2 x 1 x 0.6 x 1 = 0.048.

Scoring motif matches with “LLR” Pr (s | W) is the key idea. However, some statistical massaging is done on this. Given a motif W, background nucleotide frequencies Wb and a site s, LLR score of s = Good scores > 0. Bad scores < 0.

The Patser program Takes motif W, background Wb and a sequence S. Scans every site s in S, and computes its LLR score. Uses sound statistics to deduce an appropriate threshold on the LLR score. All sites above threshold are predicted as binding sites. Note: another program to do the same thing: FIMO (Grant, Bailey, Noble; Bioinformatics 2011).

Finding TF targets Step 1. Determine the binding specificity of a TF Step 2. Find motif matches in DNA Step 3. Designate nearby genes as TF targets

Step 3: Designating genes as targets Designate this gene as a target of the TF Predicted binding sites for motif of TF called “bcd” Sub-goal: discover the genes regulated by a transcription factor … by DNA sequence analysis

Computational motif discovery

Why? We assumed that we have experimental characterization of a transcription factor’s binding specificity (motif) What if we don’t? There’s a couple of options …

Option 1 Suppose a transcription factor (TF) regulates five different genes Each of the five genes should have binding sites for TF in their promoter region Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Binding sites for TF

Option 1 Now suppose we are given the promoter regions of the five genes G1, G2, … G5 Can we find the binding sites of TF, without knowing about them a priori ? This is the computational motif finding problem To find a motif that represents binding sites of an unknown TF

Option 2 Suppose we have ChIP-chip or ChIP-Seq data on binding locations of a transcription factor. Collect sequences at the peaks Computationally find the motif from these sequences This is another version of the motif finding problem

Motif finding algorithms Version 1: Given promoter regions of co-regulated genes, find the motif Version 2: Given bound sequences (ChIP peaks) of a transcription factor, find the motif Idea: Find a motif with many (surprisingly many) matches in the given sequences

Motif finding algorithms Gibbs sampling (MCMC) : Lawrence et al. 1993 MEME (Expectation-Maximization) : Bailey & Elkan 94 CONSENSUS (Greedy search) : Stormo lab. Priority (Gibbs sampling, but allows for additional prior information to be incorporated): Hartemink lab. Many many others …

Examining one such algorithm

The “CONSENSUS” algorithm Final goal: Find a set of substrings, one in each input sequence Set of substrings define a motif. Goal: This motif should have high “information content”. High information content means that the motif “stands out”.

The “CONSENSUS” algorithm Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings.

The “CONSENSUS” algorithm Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif.

The “CONSENSUS” algorithm Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time ? ? ? ? The current set of substrings. The current motif. Consider every substring in the next sequence, try adding it to current motif and scoring resulting motif

The “CONSENSUS” algorithm Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif. Pick the best one ….

The “CONSENSUS” algorithm Start with a substring in one input sequence Build the set of substrings incrementally, adding one substring at a time The current set of substrings. The current motif. Pick the best one …. … and repeat

The key: Scoring a motif The current motif. Scoring a motif:

The key: Scoring a motif α i=1 ... 8 A 1 9 C 6 8 7 G T The current motif. Scoring a motif: Compute information content of motif: For each column, Compute information content of column Sum over all columns Key: to align the sites of a motif, and score the alignment

Summary so far To find genes regulated by a TF Determine its motif experimentally Scan genome for matches (e.g., with Patser & the LLR score) Motif can also be determined computationally From promoters of co-expressed genes From TF-bound sequences determined by ChIP assays MEME, Priority, CONSENSUS, etc.

Further reading Introduction to theory of motif finding Moses & Sinha: http://www.moseslab.csb.utoronto.ca/Moses_Sinha_Bioi nf_Tools_apps_2009.pdf Das & Dai: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC20994 90/pdf/1471-2105-8-S7-S21.pdf

Motif finding tools MEME: http://meme-suite.org/ Weeder: http://159.149.160.51/modtools/ CisFinder: http://lgsun.grc.nia.nih.gov/CisFinder/ RSAT: http://rsat.ulb.ac.be/rsat//RSAT_home.cgi

Motif scanning

Recap Step 3: Designating genes as targets Designate this gene as a target of the TF Predicted binding sites for motif of TF called “bcd” But there is a problem …

Too many predicted sites ! An idea: look for clusters of sites (motif matches)

Why clusters of sites? Because functional sites often do occur in clusters. Because this makes biophysical sense. Multiple sites increase the chances of the TF binding there.

On looking for “clusters of matches” To score a single site s for match to a motif W, we use To score a sequence S for a cluster of matches to motif W, we use But what is this probability? A popular approach is to use “Hidden Markov Models” (HMM) length ~10 length ~1000

The HMM score profile This is what we had, using LLR score: This is what we have, with the HMM score These are the functional targets of the TF

Step 3: Designating genes as targets Designate this gene as a target of the TF HMM Score for binding site presence In 1000 bp window centered here. Sub-goal: discover the genes regulated by a transcription factor … by DNA sequence analysis

Integrating sequence analysis and expression data

1. Predict regulatory targets of a TF GENE TF ? Motif module: a set of genes predicted to be regulated by a TF (motif)

2. Identify dysregulated genes in phenotype of interest Honeybee whole brain transcriptomic study Genes up-regulated in nurses vs foragers Robinson et al 2005

3. Combine motif analysis and gene expression data All genes (N) Genes expressed in Nurse bees (n) Genes regulated by TF (m) Higher in Nurse bees and regulated by TF(k) Is the intersection (size “k”) significantly large, given N, m, n?

3. Combine motif analysis and gene expression data All genes (N) Genes expressed in Nurse bees (n) Genes regulated by TF (m) Infer: TF may be regulating “Nurse” genes. An “association” between motif and condition

Hypergeometric Distribution Given that n of N genes are labeled “Nurse high”. If we picked a random sample of m genes, how likely is an intersection equal to k ?

Hypergeometric Distribution Given that n of N genes are labeled “Nurse high”. If we picked a random sample of m genes, how likely is an intersection equal to or greater than k ?

Further reading Motif scanning and its applications Kim et al. 2010. PMID: 20126523 http://www.ploscompbiol.org/article/info%3Adoi%2F10.1 371%2Fjournal.pcbi.1000652 Sinha et al. 2008. PMID: 18256240. http://genome.cshlp.org/content/18/3/477.long

Useful tools GREAT: http://bejerano.stanford.edu/great/public/html/ Input a set of genomic segments (e.g., ChIP peaks) Obtain what annotations enriched in nearby genes only for human, mouse and zebrafish MET: http://veda.cs.uiuc.edu/cgi-bin/regSetFinder/interface.pl Input a set of genes Obtain what TF motifs or ChIP peaks enriched near them Supports human, mouse, chimp, macaque, honeybee, mosquito, wasp, beetle, fruitfly, planaria, tomato, Arabidopsis, etc.

Questions ?