Protein Sequence Alignments Week 6
Learning Objectives
1) Identify conservative and non-conservative amino acid substitutions
2) Know the difference between percent identity and percent similarity
3) Understand the concept of homology and the difference between orthologs and paralogs
4) Identify protein domains in a BLASTp output
5) Use a substitution matrix to determine protein alignment scores
Working with Proteins
Introduction to Proteins:
- Amino acid sequence: primary structure
- Motifs and domains: 3D structure
Amino acids listed with abbreviations

Amino acid                     Three-letter abbreviation   One-letter abbreviation
Alanine                        Ala                         A
Arginine                       Arg                         R
Asparagine                     Asn                         N
Aspartic acid                  Asp                         D
Cysteine                       Cys                         C
Glutamine                      Gln                         Q
Glutamic acid                  Glu                         E
Glycine                        Gly                         G
Histidine                      His                         H
Isoleucine                     Ile                         I
Leucine                        Leu                         L
Lysine                         Lys                         K
Methionine                     Met                         M
Phenylalanine                  Phe                         F
Proline                        Pro                         P
Serine                         Ser                         S
Threonine                      Thr                         T
Tryptophan                     Trp                         W
Tyrosine                       Tyr                         Y
Valine                         Val                         V
Asparagine or aspartic acid    Asx                         B
Glutamine or glutamic acid     Glx                         Z
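The abbreviations listed above can be stored as a lookup for converting between notations. This is a minimal sketch; the names `THREE_TO_ONE` and `to_one_letter` are hypothetical and chosen only for illustration.

```python
# Three-letter to one-letter codes, copied from the abbreviation list above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Asx": "B", "Glx": "Z",  # ambiguity codes
}


def to_one_letter(residues):
    """Convert an iterable of three-letter codes into a one-letter string."""
    # capitalize() lets "ALA", "ala" and "Ala" all resolve to the same key.
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


print(to_one_letter(["Met", "Ala", "Leu", "Trp"]))
```

An unknown code raises a `KeyError`, which is usually preferable to silently dropping a residue.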
The side chain determines the properties of the amino acid Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
Hydrophilic amino acids Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
Hydrophobic amino acids Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
Unique amino acids Lodish, H. et al. Molecular Cell Biology (New York; W.H. Freeman, 2000).
An organism’s evolutionary history is documented in its genome
[Figure 26.2: An unexpected family tree]
Homologous genes found in different species are called orthologs
[Figure (a), Orthologous genes: an ancestral gene in an ancestral species diverges after speciation into orthologous genes in Species A and Species B]
Homologous genes within the same species are called paralogs
[Figure (b), Paralogous genes: gene duplication and divergence produce paralogous genes within Species A after many generations]
Two sequences can diverge over time
[Figure 26.8: Aligning segments of DNA; panels 1 to 4, labels: Deletion, Insertion]
How do we distinguish sequences that are related (homologous) from sequences that are similar due to chance (analogous)?
[Figure 26.9: A molecular homoplasy, an alignment of random sequences]
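One way to get a feel for "similar due to chance" is to shuffle one sequence many times and see how much identity survives. The sketch below uses the two 28-residue insulin sequences that appear later in this deck (labeled here as walrus and human); the shuffle count of 1000 and the fixed seed are arbitrary assumptions, and shuffling is only a rough stand-in for a proper statistical test.

```python
import random


def percent_identity(a, b):
    """Percentage of aligned positions (no gaps) holding the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


walrus = "MALWTHLLPLLALLALWAPAPSRAFVNQ"
human = "MALWMRLLPLLALLALWGPDPAAAFVNQ"

observed = percent_identity(walrus, human)

# Shuffle the human sequence repeatedly: same composition, random order.
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled_scores = []
for _ in range(1000):
    residues = list(human)
    rng.shuffle(residues)
    shuffled_scores.append(percent_identity(walrus, "".join(residues)))

mean_chance = sum(shuffled_scores) / len(shuffled_scores)
print(f"observed identity: {observed:.1f}%")
print(f"mean identity after shuffling: {mean_chance:.1f}%")
print(f"best shuffled identity: {max(shuffled_scores):.1f}%")
```

If the observed identity sits far above anything the shuffled sequences reach, chance alone is an unlikely explanation for the similarity.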
The sequence alignment score tells us how related two sequences are
- As a rule of thumb, >25% shared identity suggests that two proteins are highly related
- Highly related proteins are potential homologs
- Homologs are two proteins that share a common ancestor: they originated from the same sequence but have changed over time (both evolved from that ancestral sequence, not from one another)
- Homologs often share a similar 3D structure and perform similar functions
- Homologs within the same species are called paralogs, while homologs in different species are called orthologs
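The definitions above reduce to two simple decisions, sketched here. The function names are hypothetical, the 25% cutoff is the rule of thumb from the slide rather than a strict test, and `classify_homologs` assumes homology has already been established by other evidence.

```python
def potential_homologs(percent_identity, threshold=25.0):
    """Flag a pair as potential homologs using the >25% identity rule of thumb."""
    return percent_identity > threshold


def classify_homologs(species_a, species_b):
    """Name the relationship between two proteins already known to be homologs."""
    same_species = species_a.strip().lower() == species_b.strip().lower()
    return "paralogs" if same_species else "orthologs"


print(potential_homologs(78.6))
print(classify_homologs("Homo sapiens", "Odobenus rosmarus divergens"))
print(classify_homologs("Homo sapiens", "Homo sapiens"))
```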
Without insulin, humans develop the disease diabetes
- Beta cells of the pancreas secrete the hormone insulin into the blood
- Insulin enhances the transport of glucose into body cells and stimulates the liver to store glucose as glycogen
The structure of human insulin (source: PDB.org)
The primary structure of human insulin
We can use protein BLAST (blastp) to find homologs of human insulin
Conserved domain found within insulin
Odobenus rosmarus divergens is the Pacific walrus
Photo: Captain Budd Christman, NOAA Corps, NOAA's Ark Animals Collection, Image ID anim0022
Alignment score generated by blastp from human insulin and walrus insulin
Protein BLAST score: based on length, identical residues, conservative substitutions, mismatches and gaps
- % Identity: the extent to which two amino acid sequences are invariant (how many residues are exact matches)
- % Similar (percent similar, or positives): all identical matches plus pairs of amino acid residues that are structurally or functionally related; the related pairs are marked with + signs in the alignment
- Conservative substitutions occur when amino acids with similar biochemical properties are substituted for one another
Scoring the alignment of two sequences
Walrus        MALWTHLLPLLALLALWAPAPSRAFVNQ
Homo sapiens  MALWMRLLPLLALLALWGPDPAAAFVNQ
Percent identity uses match/mismatch scoring
Walrus        MALWTHLLPLLALLALWAPAPSRAFVNQ
Homo sapiens  MALWMRLLPLLALLALWGPDPAAAFVNQ
- Matches score as +1
- Mismatches score as -1
- Add the score of each pair of residues: +1+1+1+1-1-1… = 22 matches - 6 mismatches = 16
- Percent identity counts only the matches: 22 of 28 aligned residues, about 79%
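The match/mismatch arithmetic on this slide can be checked with a few lines of code. This is a minimal sketch for an ungapped alignment; the function name `match_mismatch_score` is hypothetical, and the first sequence is assumed to be the walrus one.

```python
def match_mismatch_score(a, b, match=1, mismatch=-1):
    """Return (matches, mismatches, score) for two aligned, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    mismatches = len(a) - matches
    score = matches * match + mismatches * mismatch
    return matches, mismatches, score


walrus = "MALWTHLLPLLALLALWAPAPSRAFVNQ"
human = "MALWMRLLPLLALLALWGPDPAAAFVNQ"

matches, mismatches, score = match_mismatch_score(walrus, human)
identity = 100.0 * matches / len(walrus)

print(matches, mismatches, score)
print(f"{identity:.1f}% identity")
```

Note that the summed score and the percent identity are two different numbers: the score penalizes mismatches, while percent identity simply divides the matches by the alignment length.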
% similarity scoring assigns specific values to every substitution
Percent similarity uses a substitution matrix
Walrus        MALWTHLLPLLALLALWAPAPSRAFVNQ
Homo sapiens  MALWMRLLPLLALLALWGPDPAAAFVNQ
Non-conservative substitution (position 5: T in walrus, M in human)
Biochemical properties of M and T:
- T (threonine): hydrophilic with a polar side group
- M (methionine): hydrophobic
Different biochemical properties mean that such a substitution could disrupt protein function, so it is counted as a negative substitution
The BLOSUM62 Scoring Matrix
Percent similarity uses a substitution matrix
Walrus        MALWTHLLPLLALLALWAPAPSRAFVNQ
Homo sapiens  MALWMRLLPLLALLALWGPDPAAAFVNQ
Conservative substitution (position 6: H in walrus, R in human)
Biochemical properties of H and R:
- H (histidine) and R (arginine) are both hydrophilic amino acids
Similar biochemical properties mean that such a substitution is unlikely to disrupt protein function, so it is counted as a neutral substitution
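The biochemical definition of a conservative substitution can be sketched as a group lookup. The grouping below is an assumption: it follows one common textbook scheme (basic, acidic, polar uncharged, hydrophobic, special), and other references draw the boundaries differently. The names `PROPERTY_GROUPS` and `is_conservative` are hypothetical.

```python
# Assumed side-chain grouping; one common textbook scheme, not the only one.
PROPERTY_GROUPS = {
    "basic": "KRH",
    "acidic": "DE",
    "polar uncharged": "STNQ",
    "hydrophobic": "AVILMFYW",
    "special": "CGP",
}

# Invert the table: one-letter code -> group name.
GROUP_OF = {aa: group for group, members in PROPERTY_GROUPS.items() for aa in members}


def is_conservative(a, b):
    """True when two different residues fall in the same property group."""
    return a != b and GROUP_OF[a] == GROUP_OF[b]


print(is_conservative("H", "R"))  # both basic, hydrophilic
print(is_conservative("T", "M"))  # polar vs. hydrophobic
```

This is the biochemical view only; a substitution matrix reaches its verdict from observed substitution frequencies instead, so the two can disagree for some pairs.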
The BLOSUM62 Scoring Matrix
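To see how the matrix turns an alignment into a score, the sketch below sums BLOSUM62 values over the 28 aligned insulin residues. Only the entries needed for this alignment are included, transcribed by hand from the standard BLOSUM62 table, so treat the dictionary as an illustrative subset rather than a complete matrix. The names are hypothetical, and the sum is a raw ungapped score, not the bit score that blastp reports.

```python
# Subset of BLOSUM62: only the residue pairs that occur in this alignment.
# Keys are alphabetically sorted pairs so that (a, b) and (b, a) share an entry.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("F", "F"): 6, ("L", "L"): 4, ("M", "M"): 5,
    ("N", "N"): 6, ("P", "P"): 7, ("Q", "Q"): 5, ("V", "V"): 4,
    ("W", "W"): 11,
    ("M", "T"): -1, ("H", "R"): 0, ("A", "G"): 0,
    ("A", "D"): -2, ("A", "S"): 1, ("A", "R"): -1,
}


def pair_score(a, b):
    """Look up the substitution score for one aligned pair of residues."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


walrus = "MALWTHLLPLLALLALWAPAPSRAFVNQ"
human = "MALWMRLLPLLALLALWGPDPAAAFVNQ"

scores = [pair_score(x, y) for x, y in zip(walrus, human)]
raw_score = sum(scores)

# BLAST-style "positives": aligned pairs whose matrix score is above zero.
positives = sum(s > 0 for s in scores)
percent_positives = 100.0 * positives / len(scores)

print(raw_score)
print(f"{positives}/{len(scores)} positives ({percent_positives:.1f}%)")
```

The T/M pair scores -1 and the H/R pair scores 0, matching the negative and neutral substitutions discussed on the preceding slides.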
Derivation of the substitution matrix
$$S_{ij} = \log\left(\frac{q_{ij}}{p_i\,p_j}\right)$$
- $S_{ij}$ is the score in the substitution matrix for amino acid $i$ being substituted for $j$
- $q_{ij}$ is the observed frequency of the substitution of amino acid $i$ with $j$
- The observed frequency is found by comparing known sequences: aligning these sequences and calculating the frequency of substitutions
- This is done using different sequences and with different assumptions, leading to different scoring matrices. We will use the BLOSUM62 matrix
- $p_i$ is the frequency of $i$ in the database
- $p_j$ is the frequency of $j$ in the database
- $p_i p_j$ is the probability of randomly pairing/aligning $i$ and $j$
- The substitution matrix is not based on the biochemical properties of the amino acids but on how often substitutions between two amino acids are observed
- If a substitution between two amino acids is seen often, then it is likely to maintain the function of the protein
- If a substitution between two amino acids is a rare event, then it is likely to disrupt the function of the protein
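The log-odds idea can be tried with made-up numbers. In the sketch below the frequencies are hypothetical values chosen for easy arithmetic, not measurements from any database, and the base-2 logarithm is an assumption because the slide does not state the base. Published matrices also scale and round the result, which is omitted here.

```python
import math


def log_odds_score(q_ij, p_i, p_j):
    """Log-odds of observing the pair (i, j) versus pairing them at random."""
    expected = p_i * p_j  # probability of a random pairing of i and j
    return math.log2(q_ij / expected)


# Hypothetical frequencies, for illustration only.
p_i, p_j = 0.1, 0.1

common = log_odds_score(0.02, p_i, p_j)   # seen twice as often as chance
rare = log_odds_score(0.005, p_i, p_j)    # seen half as often as chance
neutral = log_odds_score(0.01, p_i, p_j)  # seen about as often as chance

print(f"common substitution: {common:+.2f}")
print(f"rare substitution:   {rare:+.2f}")
print(f"chance-level:        {neutral:+.2f}")
```

A positive score means the substitution is observed more often than chance predicts, a negative score means less often, and a score near zero means roughly chance level.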
Conclusions
- Amino acid substitutions can be conservative or non-conservative (biochemical definition vs. statistical definition)
- Percent identity counts only exact matches, while percent similarity also includes conservative substitutions
- Homologs are sequences that share common ancestry; orthologs are homologs found in different species and paralogs are homologs found in the same species
- A substitution matrix is used to calculate protein alignment scores
Worksheet