Gil McVean Department of Statistics, Oxford Modelling genomes Gil McVean Department of Statistics, Oxford
Why would we want to model a genome? To identify genes Protein-coding RNA Small RNAs To identify regulatory elements Transcription factor binding sites Enhancers To classify genome content Repeat DNA Unique sequence To understand the processes that shape genomes Mutation Recombination Duplication Rearrangement Natural selection
A rather simple model for a protein-coding gene Non-coding DNA Start codon Codon TERM s t STATES T: 30% C: 20% A: 30% G: 20% ATG: 100% AAA: 1/61% AAC: 1/61% AAG: 1/61% … TTT: 1/61% TAA: 30% TAG: 40% TGA: 30% EMISSIONS
Define model Explore properties Estimate parameters from data Test goodness-of-fit Refine model A ‘genome’ model is like any other statistical model
Hidden Markov Models in bioinformatics The model of a gene just described can be thought of as a hidden Markov model (HMM) The underlying states evolve in a Markov fashion, but we observe features (the DNA sequence) emitted by those states You will remember that there are lots of nice computational properties of hidden Markov models that we can use for inference Finding a most likely sequence of states Calculating posterior probabilities of a given state at a given position There are also various algorithms we can use to estimate parameters of HMMs (e.g. ML estimation by EM) How would you use the model of a gene to find new genes? How well do you think it would do?
Making useful HMMs in bioinformatics To be useful, HMMs for genes have to incorporate many features Regulatory sequences Intron-splicing features Correlations and biases in amino acid and base composition A REALLY important feature to capture is their evolution Important parts of genes and genomes evolve slower due to constraint
Searching for homology If we compare human and chimpanzee sequences they are approximately 98.8% identical at the DNA level. It is ‘easy’ to identify which parts of the genome in humans correspond to which parts in chimps If we compare human with, say mouse, we can see some parts that are similar, and other parts where there is only vague or even no obvious similarity. When measuring evolution, we need to identify regions that are homologous Homology means similarity by descent Traditionally, the problem of identifying homology has been intrinsically linked to the problem of alignment
Alignment of PFEMP1 proteins from P. falciparum
The simplest problem: aligning two sequences Suppose we have just two protein sequences that we want to align In evolution, three types of event can happen Mutation to new amino acids Insertion of new amino acids Deletion of amino acids We want to work out which amino acids in the two sequences are homologous – i.e. related to each other through shared ancestry WAKIS WEEKS W—AKIS WEEK-S What do the ‘-’s really mean?
How can we construct an alignment algorithm? What we want to do is to look at every possible alignment and choose the one that is ‘best’ What we have to do is to find an efficient algorithm that can search every possible alignment and that has an objective measure as to what ‘best’ means A natural approach is to make a model of alignments, parameterise it and find the alignment that maximises the likelihood Although the problem sounds hard we can solve it using a hidden Markov model structure
Xi Yj XiXi+1 Xi- XiXi+1 YjYj+1 YjYj+1 Yj- How does is work? Suppose residues Xi and Yj are aligned to each other Three things could happen next The next two residues in each sequence could also align (A) A gap could be introduced in sequence X (B) A gap could be introduced in sequence Y (C) We can parameterise the probabilities of each event Xi Yj XiXi+1 Xi- XiXi+1 (A) (B) (C) YjYj+1 YjYj+1 Yj-
Xi-a…Xi Xi …- Yj …- Yj-a…Yj The full algorithm We need to consider similar transitions for the cases when residue Xi is aligned to a gap after residue Yj, and when Yj is aligned to a gap after Xi We need to specify various probabilities The probability of inserting a gap The probability of extending a gap The probability of finishing the alignment The probability of observing an aligned pair of residues (20x20) The probability of observing a residue aligned to a gap (20) Once specified we can use the Viterbi and Forward/Backward algorithms to identify ML alignments, sample from the posterior or calculate posterior probabilities Xi-a…Xi Xi …- Yj …- Yj-a…Yj
Xi+1 H H D The forward algorithm Emission probabilities = ek(Xi+1 ) Transition probabilities = qij In alignment the state space is two-dimensional (residue i aligned to residue j)
Xi-1 Xi Xi+1 H H H D D D The Viterbi algorithm A traceback matrix is used to keep track of the best partial alignments
W—AKIS WA-KIS WEEK-S WEEK-S An example Suppose the gap opening and extension parameters are 0.2 and 0.5 respectively. There is a 80% chance of observing a match, a 20/19% chance of observing any given mismatch and a 5% chance of observing each unaligned amino acid (We can ignore termination for the moment) The BEST alignments are given below, each of which has log likelihood of -16.84, or 31% of the total likelihood (lnlk = -15.67). In many real situations, the best alignment represents only a fraction of the total likelihood W—AKIS WEEK-S WA-KIS WEEK-S
Posterior decoding Using the forward-backward algorithm we can calculate the posterior probability that any residue is aligned to any other, or that a given residue is in a gap state X1 X2 X3 X4 X5 X1 X2 X3 X4 X5 Y1 Y1 Y2 Y2 Y3 Y3 Y4 Y4 Y5 Y5 Conditional on X2-Y3
Extending the method Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Gotoh 1982) were not explicitly defined as hidden Markov models Finite-state automata (FSA) There have been many extensions to the original idea Local alignment Repeat alignment Protein family identification Gene finding Multiple alignment The alignment algorithm is very much a workhorse of bioinformatics, as an alignment is needed or almost all subsequent analyses (e.g. phylogenetic tree reconstruction, population genetic inference) However, relying on a single alignment is not always a great idea
Doing away with alignment For most problems, the alignment is not of primary interest The natural thing to do is to integrate over alignments (as in the FB algorithm) to estimate parameters of interest The key problem is that there is no computationally efficient algorithm for statistical multiple alignment. All widely-used methods use heuristic approaches
Gene conversion and var gene diversity in P. falciparum Multiple alignment methods typically assume the sequences are related to each through an evolutionary tree For the case of multi-gene families, this may not be the case, because gene conversion between copies can lead to mosaic structures If we wish to learn about the processes of conversion, a natural approach is to model the mosaicism In the case of var genes, the sequences are so diverged that we also need to consider the problem of alignment
Mosaic alignment We could model the n+1th sequence as a mosaic of the previous n We can calculate the likelihood of observing a given sequence by summing over all possible mosaic structures and their alignment We can also identify the most likely mosaic structure and calculate the expected number of recombination events Repeating the procedure for all sequences provides a way of assessing the importance of mosaicism within the family
Extensive mosaicism within the var family