Proteins dictate function in an organism: What happens as proteins evolve?
Budding yeast: Saccharomyces cerevisiae (sugar fungus)
Fission yeast: Schizosaccharomyces pombe
In our project, we'll be determining if functional homologs of S. cerevisiae Met proteins are present in S. pombe.
This semester: Five genes from S. pombe will be transferred to S. cerevisiae. What organism should the class study after we finish S. pombe genes? A look at the molecular phylogeny should help.
Are there any correlations between the kinds of amino acid substitutions observed over evolution and their chemistry?
How are bioinformatics tools used to analyze the conservation of protein sequences?
How can I identify regions of proteins that are most strongly conserved and most likely to be important for function?
For proteins to maintain their function, they don't tolerate drastic changes to their shapes. Amino acid substitutions that significantly perturb the structure of a protein or alter its chemistry can cause the protein to lose function.
Figure: Met16p from S. cerevisiae complexed with PAP (2OQ2)
Recall that the final folded form of a protein is determined by its primary sequence. R (“reactive”) groups form a variety of bonds important for structure and function.
Cysteine is one of the most evolutionarily constrained amino acids. Cys-254 is in close proximity to the end-product, PAP, suggesting that it plays a role in catalysis.
Figure: Custom view of Met16p highlights Cys (protein: backbone view; PAP: ball-and-stick; cysteine: space-fill)
Amino acids can be grouped according to the chemistry and size of their R groups:
Charged, acidic: Glu (E), Asp (D)
Charged, basic: Arg (R), Lys (K), His (H)
Polar: Asn (N), Gln (Q)
Small neutral: Thr (T), Gly (G), Cys (C), Ser (S), Ala (A)
Aromatic: Tyr (Y)
Hydrophobic: Val (V), Ile (I), Leu (L), Met (M), Pro (P), Trp (W), Phe (F)
Most amino acids are abbreviated by their first letter (abundant, hydrophobic ones get preference):
A Ala alanine
C Cys cysteine
G Gly glycine
H His histidine
I Ile isoleucine
L Leu leucine
M Met methionine
P Pro proline
S Ser serine
T Thr threonine
V Val valine
Phonetic abbreviations:
F Phe phenylalanine
R Arg arginine
Oddballs (charged, aromatic, some polar):
D Asp aspartic acid
E Glu glutamic acid
K Lys lysine
N Asn asparagine
Q Gln glutamine
W Trp tryptophan
Y Tyr tyrosine
The one-letter code needs to be part of a 21st century biologist’s vocabulary.
Studying the evolutionary conservation of amino acids in sequences provides a sense of the importance of each amino acid to protein function.
BLOSUM62 (BLOcks SUbstitution Matrix) is based on substitution statistics from alignments of conserved blocks of related proteins; sequences at least 62% identical were clustered and counted as a single sequence.
The matrix assigns scores for substitutions:
Maximum score for the same amino acid (completely conserved, possibly essential)
Positive scores are awarded for common amino acid substitutions, in decreasing order, based on their occurrence in proteins
Negative scores indicate unlikely substitutions
Note the high score for Cys!
The biochemical connection: Higher scores are frequently correlated with conservative amino acid substitutions, based on amino acid chemistry and size.
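The link between score and chemistry can be checked on a few matrix entries. The sketch below hard-codes a small hand-copied subset of BLOSUM62 values; the names BLOSUM62_SUBSET, score and describe are illustrative and not part of any library.

```python
# A few BLOSUM62 entries, copied by hand for illustration only.
# The matrix is symmetric, so each pair is stored once in sorted order.
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("W", "W"): 11, ("E", "E"): 5, ("L", "L"): 4,
    ("D", "E"): 2, ("I", "L"): 2, ("K", "R"): 2,     # same chemical group
    ("C", "E"): -4, ("G", "L"): -4, ("D", "L"): -4,  # very different chemistry or size
}

def score(a, b):
    """Return the stored score for a pair of one-letter codes, in either order."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def describe(a, b):
    """Label a pair the way the slide interprets its score."""
    s = score(a, b)
    if a == b:
        kind = "identical"
    elif s > 0:
        kind = "common substitution"
    elif s < 0:
        kind = "unlikely substitution"
    else:
        kind = "explained by chance"
    return f"{a}/{b}: {s:+d} ({kind})"

for pair in [("C", "C"), ("E", "D"), ("L", "I"), ("L", "D"), ("E", "C")]:
    print(describe(*pair))
```

Acidic-for-acidic (E/D) and hydrophobic-for-hydrophobic (L/I) swaps score positive, while swaps across groups (L/D, E/C) score negative.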
BLAST is an acronym for Basic Local Alignment Search Tool, a computer algorithm for finding homologous sequences in databases.
BLASTN compares nucleic acid sequences.
BLASTP compares protein sequences.
BLOSUM62 is the default scoring matrix for BLASTP.
BLOSUM62 scores relate the observed frequency of a particular substitution to the probability that it occurs by chance. The frequencies come from aligned blocks of related proteins in which sequences at least 62% identical are clustered and counted as one.
$$\text{Score} = k \log_{10}\left(\frac{P_{ij}}{Q_i \, Q_j}\right)$$
k is a scaling factor used to produce integral values.
P_ij is the observed frequency of two amino acids (i and j) replacing each other in homologous sequences.
Q_i and Q_j are the probabilities of finding i and j randomly in a sequence.
Positive and negative scores suggest amino acid changes have been selected for (positive) or against (negative) during evolution. The magnitude of the score suggests the strength of the selection. A score of zero suggests that a particular substitution can be explained by chance alone.
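The sign behavior of the log-odds score can be seen with a few made-up numbers. This is a minimal sketch of the formula as written on the slide; the function name, the scaling factor k = 10 and the frequencies are all assumptions chosen for illustration, not the values used to build BLOSUM62.

```python
import math

def log_odds_score(p_ij, q_i, q_j, k=10.0):
    """Score = k * log10(P_ij / (Q_i * Q_j)), rounded to an integer.

    k = 10 is an arbitrary illustrative scaling factor.
    """
    return round(k * math.log10(p_ij / (q_i * q_j)))

# Hypothetical background frequencies: both residues make up 5% of all positions,
# so a pairing by chance alone is expected with frequency 0.05 * 0.05 = 0.0025.
q_i = q_j = 0.05

print(log_odds_score(0.0100, q_i, q_j))    # observed 4x more often than chance: positive
print(log_odds_score(0.0025, q_i, q_j))    # observed as often as chance: zero
print(log_odds_score(0.000625, q_i, q_j))  # observed 4x less often than chance: negative
```

The same fourfold departure from chance gives scores of equal magnitude and opposite sign, which is why the magnitude can be read as the strength of selection.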
BLASTP begins with a query sequence (e.g. your MET sequence).
The query sequence is broken into "words" that will act as seeds in alignments.
BLAST searches for matches (or synonyms) in target entries in the database.
If a target entry has two or more matches to "words" from the query, the alignment is extended in both directions, looking for additional similarity.
"Words" are integral to the BLASTP search. BLASTP uses a sliding window to identify words. Consider the sequence:
E A G L E S
BLASTP would break this down into a series of four 3-letter words:
E A G
A G L
G L E
L E S
Tip! Use a monospaced (non-proportional) font such as Courier when working with database entries. The fonts are uglier, but the letters have a constant spacing that generates nice columns!
Next: words are given a numerical score.
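The sliding window is easy to reproduce. A minimal sketch (the function name words is illustrative, and the word size of 3 follows the slide):

```python
def words(sequence, size=3):
    """Slide a window of `size` residues along the sequence, one position at a time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

print(words("EAGLES"))
```

A sequence of length n yields n - 2 overlapping 3-letter words, so the six residues of EAGLES give four words.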
BLASTP uses the BLOSUM62 matrix as its default for assigning values to words:
E A G: 5 + 4 + 6 = 15
A G L: 4 + 6 + 4 = 14
G L E: 6 + 4 + 5 = 15
L E S: 4 + 5 + 4 = 13
BLASTP next checks for word synonyms (1-letter replacements) with a score greater than a default threshold of 10. Examples:
E A G: K A G (11), E S G (12), E C G (11), E T G (11), E V G (11)
A G L: S G L (11), A G I (12)
G L E: G I E (13), G L D (12), G L Q (12)
L E S: I E S (11)
BLASTP will search for all of these words and synonyms in the protein database. Of the 57 possible 1-letter synonyms for each word (19 alternative amino acids at each of 3 positions), only a small handful are statistically likely to appear in homologous proteins.
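Word scores and synonyms can be recomputed directly from the matrix. The sketch below hard-codes only the five BLOSUM62 rows needed for the words of E A G L E S (hand-copied, so worth checking against a published matrix); the helper names are illustrative, and the "score greater than 10" rule follows the slide rather than the internals of any particular BLAST release.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the residues that occur in EAGLES, in AMINO_ACIDS column order.
ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
}

def word_score(query_word, candidate):
    """Sum the BLOSUM62 scores of the aligned residue pairs."""
    return sum(ROWS[q][AMINO_ACIDS.index(c)] for q, c in zip(query_word, candidate))

def synonyms(word, threshold=10):
    """Return 1-letter replacements of `word` that score above the threshold."""
    found = {}
    for pos in range(len(word)):
        for aa in AMINO_ACIDS:
            if aa == word[pos]:
                continue
            candidate = word[:pos] + aa + word[pos + 1:]
            s = word_score(word, candidate)
            if s > threshold:
                found[candidate] = s
    return found

for w in ["EAG", "AGL", "GLE", "LES"]:
    print(w, word_score(w, w), synonyms(w))
```

Each word has 57 candidate replacements, and only a few clear the threshold.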
A target sequence must contain at least two word matches for further consideration. BLASTP uses word matches as a nucleus and extends them in both directions, looking for additional similarity.
Query:   Q A S T L Y E - A G L E S E A T T N - - R R E I
Summary: + A + T + + + G L E S E A + + R + E +
Target:  N A A T Y W D A S G L E S - - - S Q I I R K E L
As BLASTP extends the alignment out from the match, it calculates a running score; extension stops when the score drops below a threshold value. Penalties are assigned for gaps and mismatches. Plus signs in the summary line indicate a positive BLOSUM62 value.
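The extension step can be sketched as a loop that keeps a running score. This is a simplified illustration, not BLASTP itself: it extends in one direction only, allows no gaps, uses toy match and mismatch values in place of BLOSUM62, and assumes a drop-off rule in which extension stops once the running score falls more than drop_off below the best score seen. All names and parameter values are hypothetical.

```python
def extend_right(query, target, q_start, t_start, seed_score,
                 drop_off=5, match=5, mismatch=-2):
    """Extend an ungapped alignment rightward from the end of a seed.

    q_start and t_start index the first residue after the seed in each sequence.
    Returns (best_score, residues_added_to_reach_it).
    """
    running = best = seed_score
    added = 0
    offset = 0
    while q_start + offset < len(query) and t_start + offset < len(target):
        same = query[q_start + offset] == target[t_start + offset]
        running += match if same else mismatch
        offset += 1
        if running > best:
            best, added = running, offset      # new best: keep this extension
        elif best - running > drop_off:
            break                              # score has fallen too far: stop
    return best, added

# Seed word GLE occupies positions 1-3 of both toy sequences; extension starts at 4.
query = "AGLESEATT"
target = "AGLESWWWW"
print(extend_right(query, target, 4, 4, seed_score=15))
```

The matching S raises the score, and the run of mismatches that follows drags it down until extension stops; only the residues up to the best score are kept.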
Highly conserved protein sequences are often essential for function. You will compare sequences of homologous proteins from model organisms:
Caenorhabditis elegans
Escherichia coli K-12 (gram negative)
Arabidopsis thaliana
Mus musculus
Bacillus subtilis str. 168 (gram positive)
Phylogeny.fr provides tools for preparing multiple sequence alignments and phylogenetic trees
Multiple sequence alignments show regions of conservation. Identical amino acids are shown in blue; conservative changes are shown in grey.
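Finding the fully conserved columns of an alignment takes only a few lines. The sketch below uses a made-up three-sequence alignment, not real Met protein data, and the function name is illustrative.

```python
def identical_columns(alignment):
    """Return the 0-based indices of columns where every sequence has the same residue."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if len(set(column)) == 1 and column[0] != "-"   # ignore all-gap columns
    ]

# Hypothetical aligned sequences of equal length ("-" would mark a gap).
toy_alignment = [
    "AGLES",
    "SGLES",
    "AGIES",
]
print(identical_columns(toy_alignment))
```

Columns that are not identical (A/S and L/I here) would then be checked against BLOSUM62 to decide whether the change is conservative.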
TreeDyn generates a phylogenetic tree. Length of branches reflects time since divergence from a node. Bootstrap values predict reliability of nodes in the tree (max = 1.0). Length corresponds to 600 million years.
The WebLogo program provides a graphical depiction of multiple sequence alignments. The size of each letter reflects the frequency with which that amino acid is found at the position. Note the positions of amino acids with high BLOSUM scores.
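The per-position frequencies behind a logo can be tabulated from an alignment. This is a minimal sketch on a made-up four-sequence alignment; it computes frequencies only and does not reproduce how the WebLogo program scales its letter stacks.

```python
from collections import Counter

def column_frequencies(alignment):
    """For each alignment column, return {residue: fraction of sequences with it}."""
    n = len(alignment)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

# Hypothetical aligned sequences, not real Met protein data.
toy_alignment = ["CGL", "CGI", "CAL", "CGL"]
for position, freqs in enumerate(column_frequencies(toy_alignment), start=1):
    print(position, freqs)
```

A column with a single residue at frequency 1.0, like the Cys column here, corresponds to a single large letter in the logo.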