Algorithms for Regulatory Motif Discovery

Slides:



Advertisements
Similar presentations
Control of Expression In Bacteria –Part 1
Advertisements

Definitions Gene – sequence of DNA that is expressed as a protein (exon) Genes are coded –DNA →RNA→Protein→Trait Transcription – rewritting DNA into RNA.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
12-5 Gene Regulation.
Four of the many different types of human cells: They all share the same genome. What makes them different?
Comparative Motif Finding
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays Thomas R. Ioerger 1 Ganesh Rajagopalan 1 Debby Siegele 2 1 Department.
Fuzzy K means.
Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.
Finding Regulatory Motifs in DNA Sequences
Chromosomes carry genetic information
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Introduction to gene expression Seema Zargar. Lecture outline Introduction to all terms used in Gene expression.
Gene regulation  Two types of genes: 1)Structural genes – encode specific proteins 2)Regulatory genes – control the level of activity of structural genes.
Part Transcription 1 Transcription 2 Translation.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Genome Organization & Evolution. Chromosomes Genes are always in genomic structures (chromosomes) – never ‘free floating’ Bacterial genomes are circular.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
6D Gene expression the process by which the heritable information in a gene, the sequence of DNA base pairs, is made into a functional gene product, such.
Control of Gene Expression Year 13 Biology. Exceptions to the usual Protein Synthesis Some viruses contain RNA and no DNA. RNA is therefore replicated.
The Lac Operon An operon is a length of DNA, made up of structural genes and control sites. The structural genes code for proteins, such as enzymes.
Section 2 CHAPTER 10. PROTEIN SYNTHESIS IN PROKARYOTES Both prokaryotic and eukaryotic cells are able to regulate which genes are expressed and which.
Gene Regulations and Mutations
A Biology Primer Part III: Transcription, Translation, and Regulation Vasileios Hatzivassiloglou University of Texas at Dallas.
Introduction to Gene Expression
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
Gene Expression. Remember, every cell in your body contains the exact same DNA… …so why does a muscle cell have different structure and function than.
Gene Regulation Packet #46 Chapter #19.
Introduction to Molecular Cell Biology Transcription Regulation Dr. Fridoon Jawad Ahmad HEC Foreign Professor King Edward Medical University Visiting Professor.
Flat clustering approaches
Local Multiple Sequence Alignment Sequence Motifs
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Intro to Probabilistic Models PSSMs Computational Genomics, Lecture 6b Partially based on slides by Metsada Pasmanik-Chor.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
Click to continue How do a few genes build a diversity of body parts? There’s more in the genetic toolkit than just genes! Click your forward cursor to.
Gene Structure and Regulation. Gene Expression The expression of genetic information is one of the fundamental activities of all cells. Instruction stored.
Eukaryotic Gene Regulation
Control of Gene Expression
Regulation of Gene Expression
Chapter 12.5 Gene Regulation.
Recitation 7 2/4/09 PSSMs+Gene finding
1 Department of Engineering, 2 Department of Mathematics,
Copyright Pearson Prentice Hall
1 Department of Engineering, 2 Department of Mathematics,
Lecture 4 By Ms. Shumaila Azam
A Zero-Knowledge Based Introduction to Biology
1 Department of Engineering, 2 Department of Mathematics,
12-5 Gene Regulation.
EXTENDING GENE ANNOTATION WITH GENE EXPRESSION
Control of Gene Expression in Eukaryotic cells
Gene Regulation Packet #22.
Gene Expression Activation of a gene to transcribe DNA into RNA.
Motif finding in groups of related sequences
Mapping Global Histone Acetylation Patterns to Gene Expression
Copyright Pearson Prentice Hall
Presented by, Jeremy Logue.
Copyright Pearson Prentice Hall
Copyright Pearson Prentice Hall
Volume 122, Issue 6, Pages (September 2005)
Prokaryotic (Bacterial) Gene Regulation
Presented by, Jeremy Logue.
Gene Discovery.
Copyright Pearson Prentice Hall
Gene Regulation A gene (DNA) is expressed when it is made into a functional product (protein/enzyme)
Copyright Pearson Prentice Hall
DNA AND RNA 12-5 Gene Regulation.
Gene Discovery.
Presentation transcript:

Algorithms for Regulatory Motif Discovery Xiaohui Xie University of California, Irvine

TTACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATGATTGTATGATAATGTTTTCTCCTTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCTCCTTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATATGCTTCAACTACTTAATAAATGATCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGAATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACACATGCTTCAACTACTTAATAAATGCAGATGCTGTTGGACTTCATGTCCCCAACCTAGCTTGGTGCACAGCATTTATTGTATGAAGAGATTTCGATTATCCTTATAGTTCATACATGCATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGTATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCATTCAGGTTGGTACGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGAAAATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATGCAGGAGAACGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAATAGTGTTTCTTGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAATGCACTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAAATAAAGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTGTATGATTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGATTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTCATACATGCTTCAACTACTGTAAATAATTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAAATAAAATGTAAGAGATTTCGATTATCCTTATAGTTCATACEEEEEEECATGCGTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTTGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTGTATGATTTATATCG

Today’s Goals Enumeration-based method Probabilistic modeling P-value Binomial Hypergeometric Probabilistic modeling Position Weight Matrix EM-algorithm

The Central Dogma Transcription RNA Translation Protein DNA

Readout from the genome

Comparison of genome size Organisms Genomes

Expression of genes at different time points of cell cycle in yeast cells

Transcriptional Regulation DNA binding proteins Non-coding region Activator Repressor Coding region (transcribed) RNA transcript Binding sites (specific sequences)

Regulatory motifs Motifs are fundamental units of gene regulation: What turns genes on (producing a protein) and off? When is a gene turned on or off? Where (in which cells) is a gene turned on? How many copies of the gene product are produced? Specialized proteins (transcription factors) recognize these motifs What we know about regulatory motifs: Motifs are short (6-20 bp), sometimes degenerate Can contain any set of nucleotides (no ATG or other rules) Act at variable distances upstream (or downstream) from target gene (could be 100 Kb upstream or downstream) Human genome contains roughly 2000 motifs

Regulatory motif discovery 1 CAAACTCCTGCACGTGTCTCAAGGAATTTCCCGCCTCTGTCTTCTGAGTT 2 GGCTACAGATGTGTACCACGCACGTGGAACCCAGCTGATTTCCCACCTTT 3 TTATCACGTGGAGCAAACGATTAGGGAGAATTAATTATTCTCTTCCTCTT 4 AGGAAATGATGTTTACCCTAACCCAAAATGTAAGACACGTGATTTATCAG 5 ACTACCTATAAAGAAGACACGTGAATCTGTCCTGCTTGGTGGTGTAAGGA 6 TTGGATTTGGTCCCACGTGATGTCAGAAGATTGCGGACCAAATCCCACTA 7 GCACACGTGGGGTCATTTGGAGAAAGATACTTTGTAACATTGGACCTCTG 8 CATCTGTAAAACACGTGTGGGAATAGTAAGAATAATAATACTTGTCTCAC 9 ATGTGAAGGTAAAATGAGGTCATGCACGTGTGTGCACAGAATCTAGTCCA 10 AGAACATACCTGGCACTCAATTAATATGAGATAATTGTGCCATGCCTTAA 11 GTATAAGATTTGTTATTACCGCACGTGTAAACACTACAGCATGAATTTGC 12 ACTGCCAAACACGTGTGGAGGTTTAAGTTCTGATTCCTGATGATGAAATA 13 CTCTGGCCTGCTACGTTAACACGTGAAACAGCACTGATGGTAAAGGCTAA 14 TTGGATTTGGTCCCACGTGATGTCAGAAGATTGCGGACCAAATCCCACTA 15 ACTACCTATAAAGAAGACACGTGAATCTGTCCTGCTTGGTGGTGTAAGGA Promoter sequences for 15 genes

Method 1: Enumeration List all potential motifs with a given length For instance, 6-mer motifs AAAAAA 0 AAAAAC 1 AAAAAG 2 AAAAAT 1 … CACGTG 15 TTTTTT 1 TTTTTT 0 Total: 46=4096 6-mers

Regulatory motif discovery CAAACTCCTGCACGTGTCTCAAGGAATTTCCCGCCTCTGTCTTCTGAGTT GGCTACAGATGTGTACCACGCACGTGGAACCCAGCTGATTTCCCACCTTT TTATCACGTGGAGCAAACGATTAGGGAGAATTAATTATTCTCTTCCTCTT AGGAAATGATGTTTACCCTAACCCAAAATGTAAGACACGTGATTTATCAG ACTACCTATAAAGAAGACACGTGAATCTGTCCTGCTTGGTGGTGTAAGGA TTGGATTTGGTCCCACGTGATGTCAGAAGATTGCGGACCAAATCCCACTA GCACACGTGGGGTCATTTGGAGAAAGATACTTTGTAACATTGGACCTCTG CATCTGTAAAACACGTGTGGGAATAGTAAGAATAATAATACTTGTCTCAC ATGTGAAGGTAAAATGAGGTCATGCACGTGTGTGCACAGAATCTAGTCCA AGAACATACCTGGCACTCAATTAATATGAGATAATTGTGCCATGCCTTAA GTATAAGATTTGTTATTACCGCACGTGTAAACACTACAGCATGAATTTGC ACTGCCAAACACGTGTGGAGGTTTAAGTTCTGATTCCTGATGATGAAATA CTCTGGCCTGCTACGTTAACACGTGAAACAGCACTGATGGTAAAGGCTAA Promoter sequences for 15 genes

How to measure significance? Definition: Given n sequences {S1, S2, …, Sn}. Use |Si| to denote the length of sequence Si Consider a particular k-mer m (e.g. m=CACGTG, k=6) Assumption: Each of the four letters (A,C,G,T) is equally likely to occur at any position of each sequence, that is P(Sij=A)=P(Sij=C)=P(Sij=G)=P(Sij=T)=1/4 The sequences are of equal length, i.e., |Si| = L for all i Question: What is the chance of observing at least x sequences that contain this particular k-mer?

Binomial distribution Experiment 1: Sampling with replacement A box with 3 red balls and 7 green balls: Q1: Randomly pick one ball. What’s the chance that the ball is red? p = 3/10 Q2: Randomly pick one ball. Place back a ball with the same color. Repeat n times. What’s the chance that the total number of red balls you pick is k? P(k) = (nk) pk (1-p)n-k

Binomial distribution

Binomial distribution When n is large, the binomial distribution can be approximated by the normal distribution with Mean: np Variance: np(1-p)

A different null model Take the collection of all promoters as a control. Obtain the following four numbers: N: the total number of promoters n: the number of promoters in my dataset (believed to have an enriched motif) K: the total number of promoters containing the k-mer k: the total number of promoters in my dataset containing the k-mer Assume that N, n, K are given. If the promoters in my dataset is randomly chosen, what is the distribution of k?

Hypergeometric distribution Experiment 2: Sampling without replacement A box with 30 red balls and 70 green balls: Q1: Randomly pick one ball. What’s the chance that the first ball is red? p = 3/10 Q2: Randomly pick one ball. Repeat n times. What’s the chance that the total number of red balls you pick is k? Total number of red balls: K Total number of balls: N

Hypergeometric The distribution can be approximated by the binomial distribution, if K and N are much larger than n and K/N is not close to 0 or 1.

Hypergeometric

Positional weight matrix representation Lambda cI/cro binding sites A: 9 94 25 1 71 17 32 48 63 C: 86 55 40 G: T: Sequence Logo

Multinomial distribution Experiment 3: Sampling with replacement A box with 1 red balls, 2 black balls, 3 pink balls, and 4 green balls: Q1: Randomly pick one ball. What’s the chance that the ball is red? p1=1/10 p2=2/10 p3=3/10 p4=4/10 Q2: Sampling with replacement, what’s the chance of picking 3 balls with colors in the order of: p1 * p4 * p2

Position Weight Matrix Experiment 4: Sampling with replacement from W=3 boxes Boxes with nucleotides: A, C, G, and T:: 1 2 3 G G C G A T T G T T A T C C T A T C C T A T C C G G A G A G G T A C            Q: Pick one letter from each box in the order of 1,2,3. What’s the chance of picking: ACT?   