1
A Very Basic Gibbs Sampler for Motif Detection
Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute
2
What? What is a motif? What is the biological point? What is the goal?
What is a limitation with dynamic programming? What is Gibbs sampling? What is the program going to do? What is the program missing? What is a much better program?
3
What is a motif? A motif is a sequence pattern that occurs repeatedly in a group of related DNA or protein sequences.
4
What is the biological point?
Protein binding sites (DNA)
  Transcription factor binding sites
  Typically appear in the promoter region of a gene
  Few to ~30 base pairs
  Infer co-regulation of genes
A fingerprint is "a group of conserved motifs used to characterise a protein family"
Encoding the "structural motif" of a protein (DNA in exons)
Prediction of function (protein)
5
What is the goal? Perform local multiple sequence alignment to find consensus sequences (motifs)
6
What is a limitation with dynamic programming?
Memory and time complexity issues
Size of search space = (L-W+1)^N
  L = length of a sequence
  W = width of motif
  N = number of sequences
ex: L=30, W=7, N=10: (30-7+1)^10 = 24^10 ≈ 6.3 × 10^13
Alignment of even four such sequences will take a few hours (~10^4 seconds)
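The search-space formula above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `search_space_size` is made up for this example and is not part of the original program.

```python
def search_space_size(seq_len, motif_width, num_seqs):
    """Number of ways to choose one motif start position per sequence."""
    # Each sequence offers (L - W + 1) possible start positions.
    starts_per_seq = seq_len - motif_width + 1
    # The choices are independent across the N sequences.
    return starts_per_seq ** num_seqs


# The slide's example: L=30, W=7, N=10 gives 24**10 candidate alignments.
print(search_space_size(30, 7, 10))
```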
7
What is Gibbs sampling? Stochastic optimization method
Works well with local multiple alignment without gaps (motif searching) Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space
8
What is the program going to do?
1. Ask user for:
   file containing multiple DNA or protein sequences
   motif width
   how many motifs wanted
2. Calculate the background frequencies of A, C, G, T from all the sequences. [ , , , ]
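The background-frequency step might look roughly like the sketch below. It assumes DNA input containing only A, C, G and T; the function name `background_frequencies` and the toy sequences are invented for illustration.

```python
from collections import Counter


def background_frequencies(sequences, alphabet="ACGT"):
    """Relative frequency of each letter, pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    # Only letters in the alphabet contribute to the total.
    total = sum(counts[ch] for ch in alphabet)
    return {ch: counts[ch] / total for ch in alphabet}


# Toy input (hypothetical): 3 A, 3 C, 1 G, 1 T out of 8 letters.
print(background_frequencies(["ACGT", "AACC"]))
```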
9
What is the program going to do?
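3. Generate random start positions for the motif in each sequence.
ex: 10 sequences, 30 bp in length, motif width of 7
start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22]
>> random.randint(0, ceiling) where ceiling = len(sequence) - width
(random.uniform(0, ceiling) returns a float, so an integer draw such as random.randint, which includes both endpoints, is needed to index into the sequence)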
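A minimal sketch of the random initialization, assuming 0-based start positions and that every sequence is at least as long as the motif. The helper name `random_starts` and the optional `rng` argument are illustrative choices, not taken from the original program.

```python
import random


def random_starts(sequences, width, rng=random):
    """Pick one random motif start position per sequence."""
    # randint includes both endpoints, so the last legal start
    # (len(seq) - width) can be chosen.
    return [rng.randint(0, len(seq) - width) for seq in sequences]


# 10 sequences of 30 bp, motif width 7: starts fall in 0..23.
toy_sequences = ["A" * 30] * 10
print(random_starts(toy_sequences, 7, random.Random(0)))
```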
10
What is the program going to do?
4. Construct position specific score matrix from all sequences except one.
[Table: position specific score matrix, rows A, C, G, T by motif positions 1-6; surviving entries: A 0.6 0.7 0.5 0.1, C 0.9 0.2 0.3]
The sampler:
  starts with a randomly selected gapless multiple alignment
  removes one site from the current multiple alignment
  replaces it with a new site randomly chosen based on the likelihood ratio
  repeats for a fixed number of iterations
  prints the alignment with the highest likelihood ratio (score)
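One way to build the matrix from all sequences except the left-out one is sketched below. The pseudocount is an assumption added here so that no probability is exactly zero; the slides do not say whether the original program used one. The name `build_pssm` is hypothetical.

```python
def build_pssm(sequences, starts, width, leave_out,
               alphabet="ACGT", pseudocount=1.0):
    """Column-wise letter frequencies of the current motif sites,
    skipping the sequence at index `leave_out`."""
    # One count table per motif column (pseudocount is an assumption).
    counts = [{ch: pseudocount for ch in alphabet} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, starts)):
        if i == leave_out:
            continue  # the left-out sequence does not contribute
        for pos, ch in enumerate(seq[start:start + width]):
            counts[pos][ch] += 1
    # Normalize each column so its entries sum to 1.
    pssm = []
    for col in counts:
        total = sum(col.values())
        pssm.append({ch: col[ch] / total for ch in alphabet})
    return pssm


# Toy example: two sites "AC" contribute, the third sequence is left out.
print(build_pssm(["ACGT", "ACGT", "TTTT"], [0, 0, 0], width=2, leave_out=2))
```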
11
What is the program going to do?
5. Score the left-out sequence according to the position specific score matrix:
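The scoring formula from this slide did not survive in the transcript. A common choice, consistent with the "likelihood ratio" mentioned earlier, is the product over motif positions of the matrix probability divided by the background probability. The sketch below assumes that form; the name `likelihood_ratio` and the toy numbers are illustrative only.

```python
def likelihood_ratio(site, pssm, background):
    """Product over positions of P(letter | motif column) / P(letter | background).

    Assumed scoring rule; the slide's own formula is not available.
    """
    score = 1.0
    for pos, ch in enumerate(site):
        score *= pssm[pos][ch] / background[ch]
    return score


# Toy 2-column matrix and uniform background (hypothetical values).
toy_pssm = [
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(likelihood_ratio("AC", toy_pssm, uniform))  # (0.5/0.25) * (1.0/0.25)
```

A score above 1 means the site looks more like the motif than like background sequence.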
12
What is the program going to do?
Example: Use the position specific matrix and background from before: [A: , C: , G: , T: ]
[Table: position specific score matrix, rows A, C, G, T by motif positions 1-6; surviving entries: A 0.6 0.7 0.2 0.5 0.1, C 0.9 0.3]
GATTACA:
13
What is the program going to do?
6. Randomly generate another start position of the motif for that left-out sequence.
7. Score that sequence with its new start position.
8. Compare this new score with its original score.
9. If newscore >= oldscore, then jump to that new start position, else jump to that new start position with probability =
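The acceptance probability is missing from the transcript. The sketch below assumes a Metropolis-style rule in which a worse position is accepted with probability newscore / oldscore; that ratio is an assumption made here, not something the slide confirms. The function name `accept_move` is also hypothetical.

```python
import random


def accept_move(new_score, old_score, rng=random):
    """Decide whether to jump to the proposed start position."""
    if new_score >= old_score:
        return True  # never reject an equal or better position
    # ASSUMPTION: accept a worse position with probability new/old.
    return rng.random() < new_score / old_score


rng = random.Random(42)
print(accept_move(2.0, 1.0, rng))  # better score: always accepted
print(accept_move(0.0, 1.0, rng))  # zero score: never accepted
```

Occasionally accepting a worse position is what lets the sampler leave a poor alignment instead of getting stuck in it.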
14
What is the program going to do?
10. Start all over again with this updated start position, with another sequence left out.
Do this many, many times! ~1000 iterations
Gibbs will converge to a stationary distribution of the start positions => a probable alignment of the multiple sequences
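Putting the steps together, the whole loop could be sketched as follows. Several details are assumptions rather than the original program: sequences are left out in round-robin order, a pseudocount of 1 is added to the matrix, the score is the likelihood ratio against the background, and a worse position is accepted with probability newscore / oldscore. The name `gibbs_motif_sketch` and the toy sequences are invented for illustration. Input is assumed to contain only A, C, G and T.

```python
import random
from collections import Counter


def gibbs_motif_sketch(seqs, width, iterations=1000, seed=0, pseudocount=1.0):
    """Return one motif start position per sequence after sampling."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Background frequencies pooled over all sequences.
    counts = Counter("".join(seqs))
    total = sum(counts[c] for c in alphabet)
    background = {c: counts[c] / total for c in alphabet}
    # Random initial alignment.
    starts = [rng.randint(0, len(s) - width) for s in seqs]

    def score(site, pssm):
        ratio = 1.0
        for pos, ch in enumerate(site):
            ratio *= pssm[pos][ch] / background[ch]
        return ratio

    for it in range(iterations):
        left_out = it % len(seqs)  # ASSUMPTION: round-robin order
        # Build the matrix from every sequence except the left-out one.
        cols = [{c: pseudocount for c in alphabet} for _ in range(width)]
        for i, (s, st) in enumerate(zip(seqs, starts)):
            if i != left_out:
                for pos, ch in enumerate(s[st:st + width]):
                    cols[pos][ch] += 1
        pssm = [{c: col[c] / sum(col.values()) for c in alphabet}
                for col in cols]
        # Compare the current start with a randomly proposed one.
        seq = seqs[left_out]
        old = score(seq[starts[left_out]:starts[left_out] + width], pssm)
        cand = rng.randint(0, len(seq) - width)
        new = score(seq[cand:cand + width], pssm)
        # ASSUMPTION: accept a worse start with probability new/old.
        if new >= old or rng.random() < new / old:
            starts[left_out] = cand
    return starts


toy = ["TTGATTACATT", "GATTACAGGCC", "CCCGATTACAT", "AGGATTACACC"]
print(gibbs_motif_sketch(toy, width=7, iterations=500))
```

This sketch returns the final start positions; tracking and printing the best-scoring alignment seen so far, as the earlier slide describes, would be a small extension.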
15
What is the program missing?
Doesn’t do reinitializations in the middle to get out of local maxima
Doesn’t optimize the width (you have to specify width explicitly)
Doesn’t do the Bayesian approach, just frequentist (easier for me and for you to understand!)
Doesn’t read in FASTA files
Doesn’t do error checking!
And other things that I don’t know are missing yet!
16
What is a much better program?
Gibbs Motif Sampler
  The Gibbs Recursive Sampler has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition.
AlignACE
  AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences.
  It uses a Gibbs sampling strategy which is similar to that described by A. F. Neuwald, J. Liu and C. E. Lawrence in "Gibbs motif sampling: Detection of bacterial outer membrane protein repeats".
  An iterative masking procedure is used to allow multiple distinct motifs to be found within a single data set.
17
That’s it!