Motif discovery
  EM algorithm
  Gibbs sampler
  Enumeration
  Regression methods
Phylogenetic trees
  Purpose
  Construction
  Finding significance
Not directly related to today’s topic: EM algorithm tutorial
Finding regulatory motifs - Background
From a known family of sequences that are related only by very small motifs (that contribute to the function), find the real motif among vast amounts of junk.
Sequence length ~2000; motif size ~10
Science, 262,
The aligned segment can still be quite diverse. Probability ratios by position: Science, 262,
Finding regulatory motifs - Background
[Figure: gene, regulatory motifs, transcription apparatus, transcription factor]
ChIP-chip and ChIP-Seq Ren et al. 1999; Iyer et al Thanks to Dr. Steve Qin for the slide.
ChIP-chip and ChIP-Seq Brief Bioinform (2013) 14 (2):
Finding regulatory motifs - Background The key biological question: Knowing a number of TF binding sites from ChIP-seq, or genes regulated by a certain TF by functional studies, how to find the binding motif?
Finding regulatory motifs - Background
Difficulties:
(1) Motif locations vary with regard to the gene
(2) Motifs are small (~10 characters to be found in strings of ~2000 characters)
(3) Motif sequences vary
(4) Motifs may be multi-block (not considered here)
Vavouri et al. Genome Biology :R15
Finding regulatory motifs - Background Summarizing a motif from an aligned sequence block: Brief Bioinform (2013) 14 (2):
Finding regulatory motifs - Conservation One common measure of conservation of a motif: entropy for position i: Or, relative entropy compared to background frequency:
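A minimal sketch of both measures, assuming the standard definitions: entropy H(i) = -Σ_b f(b,i) log2 f(b,i), and relative entropy RE(i) = Σ_b f(b,i) log2[f(b,i) / p(b)], where f(b,i) is the frequency of base b at position i and p(b) is the background frequency. The function names and the toy sites are illustrative only.

```python
import math

ALPHABET = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies from equal-length aligned sites."""
    freqs = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        freqs.append({b: column.count(b) / len(column) for b in ALPHABET})
    return freqs

def entropy(col):
    """H = -sum_b f_b log2 f_b; terms with f_b = 0 contribute 0."""
    return -sum(f * math.log2(f) for f in col.values() if f > 0)

def relative_entropy(col, background):
    """RE = sum_b f_b log2(f_b / p_b) against background frequencies p."""
    return sum(f * math.log2(f / background[b]) for b, f in col.items() if f > 0)

# Toy aligned sites (hypothetical data)
sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
background = {b: 0.25 for b in ALPHABET}
freqs = column_frequencies(sites)
H = [entropy(c) for c in freqs]
RE = [relative_entropy(c, background) for c in freqs]
```

A fully conserved position has entropy 0; with a uniform background the relative entropy equals 2 minus the entropy, so it peaks at 2 bits.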
Finding regulatory motifs - quality
How to judge the quality of an identified motif?
The null distribution is not as simple as we hope. As we discussed before, the genome is not random. No simple profile model summarizes the null, i.e. the non-motif background.
What to consider:
> Position bias: cluster close to the “start codon” ATG
> Orientation bias: tend to be upstream of a gene
> Functional specificity: occurring mostly in the sequences under study, and/or mostly near a functional subcategory of genes.
Finding regulatory motifs - formulation
The formulation of the computational question: align tens/hundreds of sequences, where only a tiny fraction of each aligned sequence actually supports the alignment (the motif).
What’s needed:
(1) A scoring system (likelihood function, loss function, etc.)
(2) A very efficient algorithm to search for the optimum in a space of astronomical size
Example: to align 100 sequences, each 2000 characters long, there are roughly 4000^99 possibilities
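A quick calculation gives a feel for the size of that space. It assumes the 4000^99 figure counts roughly 4000 relative offsets for each of the 99 sequences aligned against the first one; that reading is an interpretation, not something the slide states.

```python
import math

n_sequences = 100
offsets_per_sequence = 4000  # assumed: shifts from about -2000 to +2000

# Work in log10 to express the count as a power of ten.
log10_size = (n_sequences - 1) * math.log10(offsets_per_sequence)
print(f"roughly 10^{log10_size:.0f} candidate alignments")
```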
Finding regulatory motifs – scoring system The motif is generated from a position weight matrix; All other parts of the sequences are generated randomly from a background model. Example: Third order Markov background model: CTTATGTA
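The scoring idea can be sketched as a log-odds score: the log-likelihood of a segment under the position weight matrix minus its log-likelihood under a k-th order Markov background. The function names, the pseudocount, and the uniform fallback for unseen contexts are assumptions made for this illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with pseudocounts."""
    counts = defaultdict(lambda: {b: pseudocount for b in ALPHABET})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {b: c[b] / sum(c.values()) for b in ALPHABET}
            for ctx, c in counts.items()}

def background_loglik(seq, start, width, model, order=3):
    """Log-probability of seq[start:start+width]; requires start >= order."""
    ll = 0.0
    for i in range(start, start + width):
        probs = model.get(seq[i - order:i])
        # Unseen context: fall back to a uniform distribution (assumption).
        ll += math.log(probs[seq[i]] if probs else 1.0 / len(ALPHABET))
    return ll

def pwm_loglik(segment, pwm):
    """Log-probability of a segment under the position weight matrix."""
    return sum(math.log(pwm[i][b]) for i, b in enumerate(segment))

def log_odds(seq, start, pwm, model, order=3):
    """Motif log-likelihood minus background log-likelihood at `start`."""
    w = len(pwm)
    return (pwm_loglik(seq[start:start + w], pwm)
            - background_loglik(seq, start, w, model, order))

# Toy example (hypothetical data): a TATA-like matrix on a GC-rich background
pwm = [{b: 0.85 if b == c else 0.05 for b in ALPHABET} for c in "TATA"]
model = train_markov(["GCGCGCGCGCGCGCGCGCGC"])
seq = "GCGCGCTATAGCGCGC"
at_motif = log_odds(seq, 6, pwm, model)
off_motif = log_odds(seq, 3, pwm, model)
```

Positive scores favor the motif model and negative scores favor the background.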
Finding regulatory motifs - MEME We are simultaneously seeking two things: (1)A motif profile (θ) (2)The location of the motif in each sequence (π) Sounds familiar? If we treat π as missing data, the problem falls into the realm of the EM algorithm. MEME -- Multiple EM for Motif Elicitation Original idea: Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp , AAAI Press, Menlo Park, California, 1994.
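The EM idea can be sketched for a simplified case: exactly one motif occurrence per sequence and a zeroth-order background. This is a toy illustration under those assumptions, not MEME's actual model or code. The function name `em_motif` and its parameters are hypothetical.

```python
ALPHABET = "ACGT"

def em_motif(seqs, width, init_word, n_iter=50, pseudocount=0.1, start_weight=0.7):
    """Toy EM: theta is the motif profile, the posteriors play the role of pi."""
    total = sum(len(s) for s in seqs)
    bg = {b: sum(s.count(b) for s in seqs) / total for b in ALPHABET}
    rest = (1.0 - start_weight) / 3.0
    # Start theta from a word, as in the "try every word" strategy.
    theta = [{b: (start_weight if b == c else rest) for b in ALPHABET}
             for c in init_word]
    posteriors = []
    for _ in range(n_iter):
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        posteriors = []
        for s in seqs:
            # E-step: posterior over the start position in this sequence,
            # proportional to the motif/background likelihood ratio.
            ratios = []
            for j in range(len(s) - width + 1):
                r = 1.0
                for i in range(width):
                    r *= theta[i][s[j + i]] / bg[s[j + i]]
                ratios.append(r)
            z = sum(ratios)
            post = [r / z for r in ratios]
            posteriors.append(post)
            # Accumulate expected base counts for the M-step.
            for j, p in enumerate(post):
                for i in range(width):
                    counts[i][s[j + i]] += p
        # M-step: re-estimate the profile from the expected counts.
        theta = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    return theta, posteriors

# Toy data (hypothetical): TTGACA planted once in each sequence
seqs = ["GCGCGGCTTGACAGCGCCG", "CCGTTGACAGGCGCGGCC",
        "GGCCGCGGCCGTTGACAC", "CGTTGACAGCGGCGCGCG"]
theta, posteriors = em_motif(seqs, 6, init_word="TTGACA")
consensus = "".join(max(col, key=col.get) for col in theta)
```

Running it from many different start words and keeping the best-scoring result is one way to reduce the risk of stopping at a poor local optimum.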
Finding regulatory motifs - MEME D'haeseleer P. Nat Biotechnol Aug;24(8):
Finding regulatory motifs - MEME
The EM algorithm may be trapped at a local optimum. To avoid local optima, MEME tries multiple start points:
(1) Every n-character word present in the sequences is tried;
(2) The best is selected;
(3) Mask the best, find the second best … …
It allows repeated occurrence of a motif in every sequence.
Finding regulatory motifs – Gibbs Sampler
Can be considered a stochastic version of the EM algorithm. Probabilistic updates avoid local optima to some extent.
Lawrence et al. Science 262,
Liu, JS et al. J. Am. Stat. Assoc. 90, 1156–1170
S: observed sequences
θ: motif parameters
θ0: background parameters (not of interest, but important)
π: motif locations
Goal: P(π, θ | θ0, S)
Sample P(π | θ, θ0, S)
Sample P(θ | π, θ0, S)
Finding regulatory motifs – Gibbs Sampler
Randomly initialize θ (a probability matrix)
Randomly select one sequence i, score all positions using θ
Update π(i) based on pattern probability
Iterate until θ converges
Weight for each segment x: A_x = Q_x / P_x
Q_x: likelihood based on the position weight matrix
P_x: likelihood based on the background model
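The update loop can be sketched as follows. This is a minimal site sampler under simplifying assumptions: one site per sequence, a zeroth-order background, and θ rebuilt from the current sites of all other sequences before each draw. The names `build_pwm`, `gibbs_motif`, and `n_updates` are hypothetical, and a fixed number of updates stands in for a convergence check.

```python
import random

ALPHABET = "ACGT"

def build_pwm(seqs, positions, width, skip=None, pseudocount=0.5):
    """Profile from the current sites, leaving out sequence `skip`."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for k, (s, p) in enumerate(zip(seqs, positions)):
        if k == skip:
            continue
        for i in range(width):
            counts[i][s[p + i]] += 1
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

def gibbs_motif(seqs, width, n_updates=500, seed=0):
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    # Smoothed base frequencies as the background model.
    bg = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in ALPHABET}
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_updates):
        k = rng.randrange(len(seqs))           # pick one sequence at random
        theta = build_pwm(seqs, positions, width, skip=k)
        s = seqs[k]
        weights = []
        for x in range(len(s) - width + 1):
            a = 1.0
            for i in range(width):
                a *= theta[i][s[x + i]] / bg[s[x + i]]   # A_x = Q_x / P_x
            weights.append(a)
        # Sample the new site with probability proportional to A_x.
        positions[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions, build_pwm(seqs, positions, width)

# Toy data (hypothetical)
seqs = ["GCGCGGCTTGACAGCGCCG", "CCGTTGACAGGCGCGGCC",
        "GGCCGCGGCCGTTGACAC", "CGTTGACAGCGGCGCGCG"]
positions, pwm = gibbs_motif(seqs, 6, n_updates=200, seed=1)
```

Because the new site is sampled rather than chosen as the maximum, a run can move away from a poor configuration, but results vary with the seed.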
Finding regulatory motifs – Gibbs Sampler Science 262, Quote:
Finding regulatory motifs – Enumeration Enumeration ??!!! Yes. It works very well compared to EM, Gibbs sampler, and other methods. Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144 (2005) The basic idea: Simply count all possible m-mers in all the sequences of interest. Select those with exceptionally high frequencies (by testing). Merge similar ones into a position weight matrix. Using efficient programming (hash), one scan through the sequences gives us all frequencies. The real methods are much more complicated than above.
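The counting step can be sketched with a hash table. The enrichment ratio against a background set is a simple stand-in for the statistical test mentioned above. The names `count_kmers`, `enriched_kmers`, and `min_ratio` are hypothetical, and the merging of similar m-mers into a position weight matrix is not shown.

```python
from collections import Counter

def count_kmers(seqs, k):
    """One pass over the sequences; the Counter acts as the hash table."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enriched_kmers(target, background, k, min_ratio=2.5, pseudocount=1):
    """k-mers whose frequency in `target` exceeds that in `background`."""
    t, b = count_kmers(target, k), count_kmers(background, k)
    t_total, b_total = sum(t.values()), sum(b.values())
    hits = []
    for kmer, c in t.items():
        # Ratio of pseudocount-smoothed frequencies.
        ratio = ((c + pseudocount) * b_total) / ((b[kmer] + pseudocount) * t_total)
        if ratio >= min_ratio:
            hits.append((kmer, ratio))
    return sorted(hits, key=lambda h: -h[1])

# Toy data (hypothetical)
target = ["GGTTGACAGG", "CCTTGACACC"]
background = ["GGCCGGCCGG", "CCGGCCGGCC"]
hits = enriched_kmers(target, background, k=6)
```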
Finding regulatory motifs – MDscan To address the problem: some input sequences may not contain the motif. Liu XS et al. Nature Biotechnology 20, (2002) (1)Find all non-redundant w-mers (“seed”) from the most reliable subset of t sequences; (2)Find all matches (≥m identity) in the t sequences; (3)Establish the position weight matrix using these matches; (4)Evaluate the motif matrix (approximate maximum a posteriori (MAP) score):
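Steps (1) to (3) can be sketched as below. This is a rough illustration, not MDscan's implementation: the names `identity` and `seed_matrices` and the pseudocount are assumptions, and the scoring in step (4) is left out.

```python
ALPHABET = "ACGT"

def identity(a, b):
    """Number of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))

def seed_matrices(top_seqs, w, m, pseudocount=0.5):
    """For each distinct w-mer (seed), collect matches and build a matrix."""
    wmers = [s[i:i + w] for s in top_seqs for i in range(len(s) - w + 1)]
    result = {}
    for seed in sorted(set(wmers)):                               # step (1)
        matches = [x for x in wmers if identity(seed, x) >= m]    # step (2)
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(w)]
        for x in matches:
            for i, b in enumerate(x):
                counts[i][b] += 1
        pwm = [{b: c[b] / sum(c.values()) for b in ALPHABET}      # step (3)
               for c in counts]
        result[seed] = (matches, pwm)
    return result

# Toy "most reliable" sequences (hypothetical data)
top_seqs = ["GGTTGACAGG", "CCTTGACTCC", "GCTTGACAGC"]
seeds = seed_matrices(top_seqs, w=6, m=5)
matches, pwm = seeds["TTGACA"]
```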
Finding regulatory motifs – MDscan
From the top 10~50 seeds from the previous step, iterate:
Forward step: Using the non-top sequences, search for the w-mer seed; if the addition of a “match” increases the score of the w-mer, the sequence is added.
Backward step: Re-examine all matches of each seed; if the removal of a match increases the score, remove the match.
Report the highest scoring motifs.
Finding regulatory motifs – Motif Regressor Conlon et al. PNAS 6:3339. (2003) Goal: link gene expression with motif discovery.
Finding regulatory motifs – Motif Regressor
How well a sequence agrees with a motif: a score computed over all w-mers in this sequence, comparing the motif model against the background HMM model.
For every motif:
Stepwise model selection using all “good” motifs from last step:
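One way to read the two steps, sketched under stated assumptions: the score is taken to be the log2 of the summed motif-to-background likelihood ratios over all w-mers in the sequence, and each motif is screened with a simple linear regression of expression on that score. The zeroth-order background, the function names, and the toy numbers are all illustrative assumptions. The stepwise selection across motifs is not shown.

```python
import math

def motif_score(seq, pwm, bg):
    """log2 of the sum over all w-mers of P(w-mer | motif) / P(w-mer | background)."""
    w = len(pwm)
    total = 0.0
    for j in range(len(seq) - w + 1):
        ratio = 1.0
        for i in range(w):
            ratio *= pwm[i][seq[j + i]] / bg[seq[j + i]]
        total += ratio
    return math.log2(total)

def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Toy example (hypothetical data)
bg = {b: 0.25 for b in "ACGT"}
pwm = [{b: 0.85 if b == c else 0.05 for b in "ACGT"} for c in "TATA"]
upstream = ["GGTATAGG", "GGGGGGGG", "CCTATACC", "GCGCGCGC"]
expression = [2.1, 0.2, 1.8, 0.1]
scores = [motif_score(s, pwm, bg) for s in upstream]
intercept, slope = simple_regression(scores, expression)
```

A clearly nonzero slope suggests that genes whose upstream sequence matches the motif tend to change expression.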
Modeling inter-dependent positions Zhou and Liu Bioinformatics 2005 Barash et al. RECOMB 2003 Thanks to Dr. Steve Qin for the slide
Detect intra-dependent position pairs Hu et al. Nucleic Acids Research, 2009, 1–14 Build intra-dependency into the motif model.
Phylogenetic tree
Phylogenetic tree
It can be seen as a model beyond multiple sequence alignment: sequence similarity, evolutionary relationship.
The biological relevance of phylogenetic trees:
Finding evolutionary relationships among organisms
Helping predict gene functions
Finding important mutations in rapidly changing organisms, e.g. HIV
In terms of computation, sifting out real mutations from all possible pairs at a given location of an alignment
Building phylogenetic trees
Binary tree structure is assumed: at every split, two descendant edges.
Biologically: at any time point, only one new species can be generated.
[Figure: tree with an edge and a node labeled]
The tree can be built for species, or just for a certain protein.
The edge length reflects evolutionary distance.
Different organisms have different evolution speeds.
Different molecules have different evolution speeds.
Rooted and unrooted trees
The real (hidden) evolutionary path is a rooted tree.
Phylogenetic trees are estimates of the real tree based on some distance information.
Some methods estimate where the root is, while some do not.
[Figure: alternative tree topologies for leaves a, b, c, d]
AAAGGGGTTTT a
AAAGGGGCCCC b
AAATTTTAAAA c
AAACCCCAAAA d
Reconstruction of trees
Maximum parsimony: Find the tree that requires the fewest mutations to derive the observed sequences.
Distance-based reconstruction: Similar to clustering methods.
Likelihood-based reconstruction: Needs a model for sequence evolution. Find the tree that gives the highest likelihood of the observed data. (Not discussed here.)
The first two methods (maximum parsimony and distance-based reconstruction) can be seen as likelihood-based under very simple sequence evolution models and other assumptions.
Maximum parsimony Four sequences: AAG AAA GGA AGA Three possible trees: #changes: 3 4 4
Maximum parsimony
Find the tree that can explain the observed data with the minimal number of substitutions.
This is analogous to model selection: if two models explain the data equally well, the simpler is preferred.
Assumptions: equal evolutionary speed; independence between positions.
Can apply weights to account for different probabilities of mutations.
Maximum parsimony
Tasks:
(1) With a given topology and leaf assignments, find the minimal #substitutions. Note: Non-leaf nodes can vary. Needs minimization. Using the independence assumption, can compute one position at a time.
(2) Search through all possible topologies with leaf assignments.
Maximum parsimony
Task (1) example: the simplest: Fitch’s algorithm
Bottom to top:
[Figure: tree with node state sets AG, T, T, A, TAG, AGT, A]
Min(# substitutions) = # union operations = 2
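The bottom-up pass of Fitch's algorithm can be sketched as below, applied to the four sequences AAG, AAA, GGA, AGA from the earlier maximum parsimony slide. Trees are written as nested pairs of leaf names; the leaf names s1 to s4, the tree encoding, and the function names are assumptions made for the illustration.

```python
def fitch(tree, states):
    """Return (state set, substitution count) for one alignment column.

    `tree` is a leaf name or a pair (left, right); `states` maps each
    leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Non-empty intersection: no extra substitution is needed here.
        return common, left_cost + right_cost
    # Empty intersection: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the per-column counts, using independence between positions."""
    length = len(next(iter(seqs.values())))
    return sum(fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
trees = [(("s1", "s2"), ("s3", "s4")),
         (("s1", "s3"), ("s2", "s4")),
         (("s1", "s4"), ("s2", "s3"))]
scores = [parsimony_score(t, seqs) for t in trees]  # [3, 4, 4]
```

The three groupings give 3, 4, and 4 changes, so the tree pairing AAG with AAA and GGA with AGA is the most parsimonious.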
Maximum parsimony
Task (2) is a very hard problem. There are ~34 million possible trees when n=10.
A number of heuristic methods were proposed to deal with the problem. A few names here:
Nearest Neighbour Interchanges
Branch and Bound
Tree Bisection and Reconnection
…
Some of these methods use a good topology found by distance-based methods (below) as a starting point.
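The ~34 million figure can be checked with a short calculation, assuming it counts rooted binary trees with n labeled leaves, for which the number is the double factorial (2n-3)!!. The unrooted count (2n-5)!! is included for comparison; the function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_rooted_trees(n):
    """Rooted binary trees with n labeled leaves (n >= 2)."""
    return double_factorial(2 * n - 3)

def n_unrooted_trees(n):
    """Unrooted binary trees with n labeled leaves (n >= 3)."""
    return double_factorial(2 * n - 5)

print(n_rooted_trees(10))    # 34459425
print(n_unrooted_trees(10))  # 2027025
```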
Distance-based methods UPGMA Same assumptions as parsimony
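A minimal UPGMA sketch: repeatedly merge the two closest clusters and set the distance from the new cluster to every other cluster to the size-weighted average of the two merged clusters' distances. The input format (a dictionary keyed by leaf pairs), the nested-tuple output, and the toy distances are assumptions for the illustration.

```python
def upgma(names, dist):
    """Return (tree, merges); merges lists (cluster, height) in merge order."""
    size = {n: 1 for n in names}                        # cluster -> leaf count
    d = {frozenset(pair): v for pair, v in dist.items()}
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)                        # closest two clusters
        a, b = sorted(pair, key=str)                    # fixed order for output
        new = (a, b)
        merges.append((new, d[pair] / 2))               # node height
        for c in size:
            if c not in (a, b):
                # Size-weighted average of the distances to a and to b.
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
        d = {p: v for p, v in d.items() if a not in p and b not in p}
    return next(iter(size)), merges

# Toy distances (hypothetical)
dist = {("a", "b"): 2, ("a", "c"): 6, ("a", "d"): 10,
        ("b", "c"): 6, ("b", "d"): 10, ("c", "d"): 10}
tree, merges = upgma(["a", "b", "c", "d"], dist)
heights = [h for _, h in merges]
```

UPGMA places every leaf at the same distance from the root, so it is only appropriate when evolutionary speed is roughly equal across lineages.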
Likelihood-based methods Data (aligned sequences): D Evolutionary model: M Tree structure: T Maximize Prob(D | M, T) M: The probability of changing from character b to character a in edge length (time) t. With independence assumption, between two sequences:
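For two sequences, the likelihood can be sketched with the Jukes-Cantor model standing in for M. The choice of model is an assumption made here for simplicity, as are the function names. With independent positions, P(x, y | t) = Π_i q(x_i) P(y_i | x_i, t), with q = 1/4 for every base.

```python
import math

def jc69_prob(a, b, t):
    """P(a | b, t) under Jukes-Cantor; t = expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def pair_loglik(x, y, t):
    """log P(x, y | t), summing over independent positions (t > 0)."""
    return sum(math.log(0.25 * jc69_prob(yi, xi, t)) for xi, yi in zip(x, y))

def jc69_distance(x, y):
    """Closed-form maximum likelihood estimate of t from the mismatch fraction."""
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy pair (hypothetical data): 2 mismatches out of 10 positions
x, y = "ACGTACGTAC", "ACGTACGTGT"
t_hat = jc69_distance(x, y)
best = pair_loglik(x, y, t_hat)
```

On a full tree the same transition probabilities are combined over all branches and summed over the unknown internal states, which is what makes the approach costly.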
Likelihood-based methods The probability of a tree is the product of the sequence mutation rates in each branch of the tree, which is in turn the product of the rate of substitution in each branch times the branch length. It allows mutation rate variations across branches. Computationally costly.
Assessing significance A simple non-parametric bootstrap strategy: Resample columns (positions) from the aligned sequences. The confidence of a branching event is assessed by the fraction of its occurrence in the trees from the resampled sequences.
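The resampling step can be sketched as below, using the four sequences a to d from the rooted and unrooted trees slide. To keep the example self-contained, a four-taxon rule (pick the pairing with the smallest summed within-pair Hamming distance) stands in for a real tree-building method. That rule and all function names are assumptions.

```python
import random

def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping rows together."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(s[c] for c in cols) for name, s in alignment.items()}

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def best_split(aln):
    """Stand-in tree builder for four taxa: choose the best 2+2 grouping."""
    a, b, c, d = sorted(aln)
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(pairings, key=lambda p: hamming(aln[p[0][0]], aln[p[0][1]])
                                       + hamming(aln[p[1][0]], aln[p[1][1]]))

def bootstrap_support(aln, split, n_boot=200, seed=0):
    """Fraction of resampled alignments in which `split` is recovered."""
    rng = random.Random(seed)
    hits = sum(best_split(resample_columns(aln, rng)) == split
               for _ in range(n_boot))
    return hits / n_boot

alignment = {"a": "AAAGGGGTTTT", "b": "AAAGGGGCCCC",
             "c": "AAATTTTAAAA", "d": "AAACCCCAAAA"}
support = bootstrap_support(alignment, (("a", "b"), ("c", "d")))
```

With a real method, each resampled alignment would be passed to the tree builder and every branching event of the original tree scored in the same way.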
Assessing significance Derdeyn et al. Science. 303(5666):