A Very Basic Gibbs Sampler for Motif Detection

Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute

What?
What is a motif?
What is the biological point?
What is the goal?
What is a limitation with dynamic programming?
What is Gibbs sampling?
What is the program going to do?
What is the program missing?
What is a much better program?

What is a motif?
A motif is a sequence pattern that occurs repeatedly in a group of related DNA or protein sequences.

What is the biological point?
Protein binding sites (DNA)
Transcription factor binding sites
Typically appear in the promoter region of a gene
Few to ~30 base pairs
Infer co-regulation of genes
A fingerprint is "a group of conserved motifs used to characterise a protein family"
Encoding the “structural motif” of a protein (DNA in exons)
Prediction of function (protein)

What is the goal?
Perform local multiple sequence alignment to find consensus sequences (motifs).

What is a limitation with dynamic programming?
Memory and time complexity issues
Size of search space = (L-W+1)^N
L = length of a sequence
W = width of motif
N = number of sequences
ex: L=30, W=7, N=10: (30-7+1)^10 = 24^10, about 6.3 x 10^13
Alignment of even four such sequences will take a few hours (~10^4 seconds)
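The search-space formula is easy to check numerically. The short sketch below simply evaluates (L-W+1)^N for the slide's example; the function name `search_space_size` is made up for illustration and is not part of the original program.

```python
def search_space_size(seq_len: int, motif_width: int, num_seqs: int) -> int:
    """Number of ways to choose one motif start per sequence: (L - W + 1)^N."""
    starts_per_sequence = seq_len - motif_width + 1
    return starts_per_sequence ** num_seqs


# The example from the slide: L=30, W=7, N=10
size = search_space_size(30, 7, 10)
print(size)  # 63403380965376, i.e. 24^10
```

Adding even one more sequence multiplies the count by another factor of 24, which is why exhaustive search quickly becomes impractical.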

What is Gibbs sampling?
Stochastic optimization method
Works well with local multiple alignment without gaps (motif searching)
Searches for the statistically most probable motifs by sampling random positions instead of going through the entire search space

What is the program going to do?
1. Ask user for:
file containing multiple DNA or protein sequences
motif width
how many motifs wanted
2. Calculate the background frequencies of A, C, G, T from all the sequences. [ , , , ]
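One way the background-frequency step could look in Python is sketched below. It is only an illustration: the function name `background_frequencies`, the dictionary return type, and the toy sequences are assumptions, not taken from the original program.

```python
from collections import Counter


def background_frequencies(sequences, alphabet="ACGT"):
    """Fraction of each base across all sequences pooled together."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    # Only count letters in the alphabet (anything else, such as N, is ignored).
    total = sum(counts[base] for base in alphabet)
    if total == 0:
        raise ValueError("no bases from the alphabet found in the input")
    return {base: counts[base] / total for base in alphabet}


# Toy input, made up for this example
seqs = ["GATTACA", "ACGTACGT", "TTTTGGGG"]
bg = background_frequencies(seqs)
print(bg)
```

The four values always sum to 1, so they can be used directly as the denominator when a motif probability is compared against the background.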

3. Generate random start positions for the motif in each sequence.
ex: 10 sequences, 30 bp in length, motif width of 7
start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22]
>>> random.randint(0, ceiling)
where ceiling = len(sequence) - width
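A minimal sketch of this step follows. It uses `random.randint`, which returns an integer and includes both endpoints, because a start position has to be usable as a string index. The helper name `random_starts` and the optional `rng` argument are assumptions made for the example.

```python
import random


def random_starts(sequences, width, rng=random):
    """Pick one random 0-based motif start per sequence."""
    starts = []
    for seq in sequences:
        ceiling = len(seq) - width  # last start that still fits the motif
        starts.append(rng.randint(0, ceiling))
    return starts


# 10 sequences of 30 bp with a motif width of 7, as in the example above
toy_sequences = ["ACGT" * 7 + "AC"] * 10
starts = random_starts(toy_sequences, 7, random.Random(0))
print(starts)
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when debugging a stochastic method.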

4. Construct position specific score matrix from all sequences except one.
Example matrix (rows A, C, G, T; columns are motif positions 1 to 6; only the entries below are legible, and their columns are not recoverable):
A: 0.6, 0.7, 0.5, 0.1
C: 0.9, 0.2, 0.3
G:
T:

In outline, the sampler:
starts with a randomly selected gapless multiple alignment
removes one site from the current multiple alignment
replaces it with a new site randomly chosen based on the likelihood ratio
repeats for a fixed number of iterations
prints the alignment with the highest likelihood ratio (score)
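The matrix construction can be sketched as follows. The function name `build_pssm` and the list-of-dictionaries layout are assumptions for illustration. The pseudocount of 1 is also an assumption: the slides do not mention one, but without it any base that never appears at a position would get probability zero.

```python
def build_pssm(sequences, starts, width, left_out, alphabet="ACGT", pseudocount=1.0):
    """Per-position base frequencies from every motif site except one.

    Returns a list with one dict per motif position, mapping base -> frequency.
    """
    # Start every cell at the pseudocount (an assumption, see the text above).
    counts = [{base: pseudocount for base in alphabet} for _ in range(width)]
    for index, (seq, start) in enumerate(zip(sequences, starts)):
        if index == left_out:
            continue  # the left-out sequence does not contribute
        site = seq[start:start + width]
        for pos, base in enumerate(site):
            counts[pos][base] += 1
    pssm = []
    for column in counts:
        total = sum(column.values())
        pssm.append({base: column[base] / total for base in alphabet})
    return pssm


# Toy data: the third sequence (index 2) is left out
toy = ["ACGTAC", "ACGTAC", "TTTTTT"]
pssm = build_pssm(toy, starts=[0, 0, 0], width=3, left_out=2)
print(pssm[0])  # A is the most frequent base in the first motif position
```

Each column sums to 1, so an entry can be read as the estimated probability of that base at that motif position.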

5. Score the left-out sequence according to the position specific score matrix:

Example: Use the position specific matrix and background from before: [A: , C: , G: , T: ]
Matrix entries as legible (motif positions 1 to 6): A: 0.6, 0.7, 0.2, 0.5, 0.1; C: 0.9, 0.3; G: ; T:
GATTACA:
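The exact scoring formula did not survive in the transcript. A common choice, and the one assumed in this sketch, is a likelihood ratio: at each motif position, the matrix probability of the observed base is divided by its background probability, and the ratios are multiplied together. The name `score_site` and all numbers below are made up for illustration; they are not the values from the slide.

```python
def score_site(site, pssm, background):
    """Likelihood ratio of a candidate site: motif model versus background."""
    score = 1.0
    for pos, base in enumerate(site):
        score *= pssm[pos][base] / background[base]
    return score


# Hypothetical 3-position matrix and a uniform background
example_pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
example_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

good = score_site("ACG", example_pssm, example_background)
poor = score_site("TTT", example_pssm, example_background)
print(good, poor)
```

A score above 1 means the site looks more like the motif than like background sequence; a score below 1 means the opposite. For wide motifs, summing logarithms of the ratios avoids numerical underflow.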

6. Randomly generate another start position of the motif for that left-out sequence.
7. Score that sequence with its new start position.
8. Compare this new score with its original score.
9. If newscore >= oldscore, then jump to that new start position, else jump to that new start position with probability =
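The probability in step 9 is missing from the transcript. The sketch below assumes the ratio newscore / oldscore, which is the usual Metropolis-style acceptance rule; this is a guess at the intent, not something the source states. The function name `accept_move` is likewise hypothetical.

```python
import random


def accept_move(old_score, new_score, rng=random):
    """Decide whether to jump to the new start position.

    Always accept an equal or better score; otherwise accept with
    probability new_score / old_score (an assumed rule, see above).
    """
    if new_score >= old_score:
        return True
    return rng.random() < new_score / old_score


# With a ratio of 0.5, roughly half of the worse moves should be accepted.
rng = random.Random(0)
accepted = sum(accept_move(1.0, 0.5, rng) for _ in range(10000))
print(accepted)
```

Occasionally accepting a worse position is what lets the sampler move away from a poor alignment instead of getting stuck on it.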

10. Start all over again with this updated start position, with another sequence left out.
Do this many, many times! ~1000 iterations
Gibbs will converge to a stationary distribution of the start positions => a probable alignment of the multiple sequences
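Put together, the steps could look roughly like the compact sketch below. It is not the original program. Several details are assumptions: a pseudocount of 1 in the matrix, a likelihood-ratio score, an acceptance probability of newscore / oldscore for worse moves, sequences that contain only A, C, G and T, and leaving out each sequence in turn. The function name `gibbs_sampler` and the toy sequences are made up.

```python
import random


def gibbs_sampler(sequences, width, iterations=1000, seed=None, alphabet="ACGT"):
    """Return one motif start per sequence after a fixed number of iterations."""
    rng = random.Random(seed)
    total = sum(len(s) for s in sequences)
    background = {b: sum(s.count(b) for s in sequences) / total for b in alphabet}
    starts = [rng.randint(0, len(s) - width) for s in sequences]

    def score(i, start):
        # Likelihood ratio of a site in sequence i against a matrix built
        # from the current sites of all other sequences (pseudocount 1).
        n_others = len(sequences) - 1
        result = 1.0
        for pos in range(width):
            base = sequences[i][start + pos]
            count = 1.0 + sum(
                1 for j, s in enumerate(sequences)
                if j != i and s[starts[j] + pos] == base
            )
            freq = count / (n_others + len(alphabet))
            result *= freq / background[base]
        return result

    for iteration in range(iterations):
        i = iteration % len(sequences)  # leave out each sequence in turn
        old = score(i, starts[i])
        candidate = rng.randint(0, len(sequences[i]) - width)
        new = score(i, candidate)
        # Assumed acceptance rule: always take improvements, otherwise
        # jump with probability new / old.
        if new >= old or rng.random() < new / old:
            starts[i] = candidate
    return starts


# Toy input with the 7-mer TTGACCA planted in every sequence
seqs = [
    "ACGTTGACCATTACGG",
    "GGTTGACCAACGTACG",
    "TACGATTGACCAGGCA",
    "CCATTGACCATGTGAC",
]
found = gibbs_sampler(seqs, width=7, iterations=1000, seed=1)
for seq, start in zip(seqs, found):
    print(start, seq[start:start + 7])
```

Because the method is stochastic, different seeds can end at different alignments. Running it several times and keeping the best-scoring result is a simple safeguard.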

What is the program missing?
Doesn’t do reinitializations in the middle to get out of local maxima
Doesn’t optimize the width (you have to specify width explicitly)
Doesn’t do the Bayesian approach, just frequentist (easier for me and for you to understand!)
Doesn’t read in FASTA files
Doesn’t do error checking!
And other things that I don’t know are missing yet!

What is a much better program?
Gibbs Motif Sampler
AlignACE

The Gibbs Recursive Sampler has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition.

AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences. It uses a Gibbs sampling strategy which is similar to that described by A. F. Neuwald, J. Liu and C. E. Lawrence in "Gibbs motif sampling: Detection of bacterial outer membrane protein repeats". An iterative masking procedure is used to allow multiple distinct motifs to be found within a single data set.

That’s it!
