Chap. 5 AA & Scoring Matrix

DNA Bases
[Table: nucleotide ambiguity codes S, R, Y, W, M, K, B, N, H, V, D and the bases A, T, G, C that each code stands for]

Models of nucleotide substitution
[Figure: transitions are purine <=> purine (A <=> G) or pyrimidine <=> pyrimidine (T <=> C); transversions are purine <=> pyrimidine]

Jukes-Cantor (JC) Kimura 2P Kimura

Point Mutation (Substitution)
- Point mutation: the simplest form of mutation; occurs all over DNA sequences
- Transition: mutation within the purines (A, G) or within the pyrimidines (C, T/U)
- Transversion: mutation between the two nucleotide groups (purine <=> pyrimidine)
- Effects depend on where the mutation occurs
  - Non-coding region: no effect on proteins, and neutral, but may have significant effects if it occurs in a control region
  - Coding region
    - Synonymous substitution: the mutation does not change the AA
    - Non-synonymous substitution: the AA is replaced by another, or a stop codon is introduced
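The transition/transversion distinction above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` is made up for this example, and treating U as equivalent to T is an assumption made for convenience.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old: str, new: str) -> str:
    """Classify a single-base change as 'transition', 'transversion' or 'none'."""
    # Assumption for this sketch: RNA's U is treated as DNA's T.
    old = old.upper().replace("U", "T")
    new = new.upper().replace("U", "T")
    valid = PURINES | PYRIMIDINES
    if old not in valid or new not in valid:
        raise ValueError(f"not a nucleotide: {old!r} or {new!r}")
    if old == new:
        return "none"
    # A transition stays inside one chemical class; a transversion crosses classes.
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(pair, classify_substitution(*pair))
```

Of the twelve possible single-base changes, four are transitions and eight are transversions, which is why an excess of observed transitions is informative.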

Other Mutations
- Indel mutation
  - Small indels of a single base or a few bases are frequent
  - Particularly frequent in repeated sequences such as GCGC…: slight slippage causes insertion of an extra GC or a deletion
  - The CAG repeat region in the gene for the huntingtin protein can expand, causing Huntington's disease
  - Indels cause a frame shift if their length is not a multiple of three
- Gene inversion: whole genes are copied to offspring in the reverse direction
- Translocation: whole genes can be deleted from one genome and inserted into another
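The frame-shift rule for indels is plain modular arithmetic. The helper below is a minimal sketch; the name `causes_frameshift` is hypothetical and not taken from the slides.

```python
def causes_frameshift(indel_length: int) -> bool:
    """Return True if an indel of this many bases shifts the reading frame."""
    if indel_length <= 0:
        raise ValueError("indel length must be a positive number of bases")
    # Codons are read three bases at a time, so only multiples of three
    # leave the downstream reading frame unchanged.
    return indel_length % 3 != 0


if __name__ == "__main__":
    print(causes_frameshift(2))  # a GC insertion in a GCGC... repeat
    print(causes_frameshift(3))  # one extra CAG repeat unit
```

An extra CAG unit adds one codon without shifting the frame, which is consistent with the repeat being able to expand inside a coding region.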

Amino Acids
General structure of amino acids:
- an amino group
- a carboxyl group
- an α-carbon bonded to a hydrogen and a side-chain group, R
R determines the identity of the particular amino acid.
Figure color key: R: large white and gray; C: black; nitrogen: blue; oxygen: red; hydrogen: white

AA Groups
Classification of R groups:
- Polar/nonpolar
  - A polar bond shares its bonding electrons unequally
  - The O-H bond is polar: O is more electronegative, so the bonding electrons are closer to O
  - The C-H bond is nonpolar
- Acidic/basic

| Element    | Electronegativity |
|------------|-------------------|
| Oxygen     | 3.5 |
| Nitrogen   | 3.0 |
| Sulfur     | 2.6 |
| Carbon     | 2.5 |
| Phosphorus | 2.2 |
| Hydrogen   | 2.1 |

Group 1: Nonpolar (hydrophobic) Sometimes, Gly (G) is included because C-H bond is nonpolar

Group 2: Polar. Side chains are electrically neutral (uncharged): Ser (S), Thr (T), Cys (C), Asn (N), Gln (Q), Tyr (Y). Asn (N) and Gln (Q) are considered derivatives of the group 3 amino acids Asp (D) and Glu (E).

Group 3: Acidic Side chains have carboxyl group Asp (D) and Glu (E) Side chains are negatively charged

Group 4: Basic. His (H), Lys (K), Arg (R). Side chains are positively charged.

Physico-Chemical Properties
Physico-chemical properties of AAs determine protein structures, so bioinformatics can use them via pattern recognition.
Properties
(1) Size in volume
- The volume occupied by the side group is important (also for molecular evolution): it is difficult to substitute a large AA for a small one
- The van der Waals radius (the volume until atoms are pushed to repulsion) is used to measure the volume of the sphere (in Å^3)
- W has 3.4 times the volume of G

| Amino acid    | 3-letter | 1-letter | Vol. |
|---------------|----------|----------|------|
| Alanine       | Ala | A | 67  |
| Arginine      | Arg | R | 148 |
| Asparagine    | Asn | N | 96  |
| Aspartic acid | Asp | D | 91  |
| Cysteine      | Cys | C | 86  |
| Glutamine     | Gln | Q | 114 |
| Glycine       | Gly | G | 48  |
| Histidine     | His | H | 118 |
| Isoleucine    | Ile | I | 124 |
| Leucine       | Leu | L |     |
| Lysine        | Lys | K | 135 |
| Methionine    | Met | M |     |
| Phenylalanine | Phe | F |     |
| Proline       | Pro | P | 90  |
| Serine        | Ser | S | 73  |
| Threonine     | Thr | T | 93  |
| Tryptophan    | Trp | W | 163 |
| Tyrosine      | Tyr | Y | 141 |
| Valine        | Val | V | 105 |
| Mean          |     |   | 109 |

(2) Partial volume
- Measures the expanded volume in solution when the AA is dissolved
(3) Bulkiness
- The ratio of side-chain volume to its length
- A measure of the average cross-sectional area of the side chain
- Relevant to protein folding
(4) Polarity index
- Electrostatic force acting on the surroundings at a distance of 10 Å
(5) pH of the isoelectric point of the AA (pI)
- Acidic AAs (Asp and Glu) have pI in the range 2-3: the side chain is negatively charged at neutral pH because the COOH group ionizes to COO-, so they must be put in an acid solution to shift the equilibrium and balance this charge
- Basic AAs (Arg, Lys and His) have pI > 7: the side chain is positively charged at neutral pH
- All others have uncharged side chains (pI in the range 5-6)

(6) Hydrophobicity
- When molecules are dissolved in water, the hydrogen-bonded structure of water is disrupted
- Polar AA residues can form hydrogen bonds with water: hydrophilic
- Non-polar residues cannot form these bonds: hydrophobic
- Polar residues disrupt the structure less than non-polar ones
- Polar residues are usually at the exterior of a structure, non-polar residues in the interior
- Hydrophobicity (hydropathy) scale: an estimate of the difference in free energy of an AA between the hydrophobic environment of the interior of a protein and water solution (positive for hydrophobic residues: it costs free energy to take the residue out of the protein and put it in water)
(7) Surface area
- Surface area of the AA that is exposed (accessible) to water in an unfolded peptide chain and becomes buried when the chain folds
- Relevant to protein folding
(8) Fraction of area
- Fraction of the accessible surface area that is buried in the interior in a set of known crystal structures
- Hydrophobic residues have a larger fraction

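Color key: red: acidic; orange: basic; green: polar; yellow: non-polar

| Amino acid    | Code  | Vol. | Bulk | Polarity | pI   | Hydro | Surf. | Frac |
|---------------|-------|------|------|----------|------|-------|-------|------|
| Alanine       | Ala A | 67   | 11.5 | 0.0      | 6.0  | 1.8   | 113   | 0.74 |
| Arginine      | Arg R | 148  | 14.3 | 52.0     | 10.8 | -4.5  | 241   | 0.64 |
| Asparagine    | Asn N | 96   | 12.3 | 3.4      | 5.4  | -3.5  | 158   | 0.63 |
| Aspartic acid | Asp D | 91   | 11.7 | 49.7     | 2.8  |       | 151   | 0.62 |
| Cysteine      | Cys C | 86   | 13.5 | 1.5      | 5.1  | 2.5   | 140   | 0.91 |
| Glutamine     | Gln Q | 114  | 14.5 | 3.5      | 5.7  |       | 189   |      |
| Glutamic acid | Glu E | 109  | 13.6 | 49.9     | 3.2  |       | 183   |      |
| Glycine       | Gly G | 48   |      |          |      | -0.4  | 85    | 0.72 |
| Histidine     | His H | 118  | 13.7 | 51.6     | 7.6  | -3.2  | 194   | 0.78 |
| Isoleucine    | Ile I | 124  | 21.4 | 0.1      |      | 4.5   | 182   | 0.88 |
| Leucine       | Leu L |      |      |          |      | 3.8   | 180   | 0.85 |
| Lysine        | Lys K | 135  |      | 49.5     | 9.7  | -3.9  | 211   | 0.52 |
| Methionine    | Met M |      | 16.3 | 1.4      |      | 1.9   | 204   |      |
| Phenylalanine | Phe F |      |      | 0.4      | 5.5  | 2.9   | 218   |      |
| Proline       | Pro P | 90   | 17.4 | 1.6      | 6.3  | -1.6  | 143   |      |
| Serine        | Ser S | 73   | 9.5  | 1.7      |      | -0.8  | 122   | 0.66 |
| Threonine     | Thr T | 93   | 15.8 |          |      | -0.7  | 146   | 0.70 |
| Tryptophan    | Trp W | 163  | 21.7 | 2.1      | 5.9  | -0.9  | 259   |      |
| Tyrosine      | Tyr Y | 141  | 18.0 |          |      | -1.3  | 229   | 0.76 |
| Valine        | Val V | 105  | 21.6 |          |      | 4.2   | 160   | 0.86 |
| Mean          |       |      | 15.4 |          |      | -0.5  | 175   |      |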

“Universal” Genetic Code

Genetic Code 2

Properties
- Purines (A, G) are heavier than pyrimidines (C, T)
- A transition within a type (purines or pyrimidines) is more likely than a transversion between types
- All AAs have more than one codon, except for Met and Trp
- Codons for an AA are clustered
  - Two codons for an AA: same in the first 2 positions, differing only by a transition at the 3rd position
  - Four codons: differ only in the 3rd position
  - Six codons: form one four-codon box and one two-codon box
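These degeneracy properties can be checked directly against the standard genetic code. The sketch below builds the codon table from the usual compact 64-letter encoding (bases in the order U, C, A, G, stop codons written as `*`); the helper names are made up for this example.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated with the 3rd base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
TRANSITION_PAIRS = ({"U", "C"}, {"A", "G"})


def codons_by_amino_acid():
    """Group the 61 sense codons by the amino acid they encode."""
    groups = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            groups[aa].append(codon)
    return dict(groups)


def two_codon_pairs_are_transitions(groups):
    """Check that every two-codon AA differs only by a 3rd-position transition."""
    for codons in groups.values():
        if len(codons) != 2:
            continue
        first, second = codons
        if first[:2] != second[:2]:
            return False
        if {first[2], second[2]} not in TRANSITION_PAIRS:
            return False
    return True


GROUPS = codons_by_amino_acid()
SINGLE_CODON = sorted(aa for aa, codons in GROUPS.items() if len(codons) == 1)
SIX_CODON = sorted(aa for aa, codons in GROUPS.items() if len(codons) == 6)

if __name__ == "__main__":
    print("single-codon AAs:", SINGLE_CODON)
    print("six-codon AAs:", SIX_CODON)
    print("two-codon rule holds:", two_codon_pairs_are_transitions(GROUPS))
```

Grouping by amino acid rather than by codon box keeps the check short; Ile, with three codons, falls outside the two-, four- and six-codon cases listed above.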

Genetic Code
- Degeneracy is controlled by the GC content of codons: G-C binding is stronger
  - First two bases (doublet) are G and C only: form four-codon boxes (red X in the figure)
  - Doublets are A and U only: split boxes (blue X in the figure)
  - Doublets are mixed: four-codon box if the 2nd base is a pyrimidine, split box otherwise
- A larger purine at the 2nd position reduces binding at the 3rd position
- If a doublet forms a four-codon box, its 'conjugate' forms a split box
  - Conjugate: opposite size and opposite number of hydrogen bonds; A-C and G-U are conjugates

Genetic Code
- Five most hydrophobic AAs (Phe, Leu, Ile, Met, Val): U at the 2nd position
- Three most similar AAs (Leu, Ile, Val): related by a single-base mutation at the 1st position
- Six most hydrophilic AAs (His, Gln, Asn, Lys, Asp, Glu): A at the 2nd position (Tyr is hydrophobic and also has A in the 2nd position)
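The link between the 2nd codon position and hydrophobicity can be read straight off the standard genetic code. The following sketch rebuilds the codon table from its compact 64-letter encoding; `amino_acids_with_second_base` is a hypothetical helper name.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated with the 3rd base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acids_with_second_base(base: str) -> list:
    """Return the sorted one-letter codes of AAs whose codons have `base` in position 2."""
    return sorted(
        {aa for codon, aa in CODON_TABLE.items() if codon[1] == base and aa != "*"}
    )


if __name__ == "__main__":
    print("U at 2nd position:", amino_acids_with_second_base("U"))
    print("A at 2nd position:", amino_acids_with_second_base("A"))
```

The U group is exactly Phe, Leu, Ile, Met and Val, and the A group is the six hydrophilic amino acids plus Tyr, matching the slide.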

Evolution of Genetic Code
From what did the current Genetic Code become stable?
Robin Knight, www.cs.uml.edu/~kim/580/99_knight.pdf

AA Substitutions
1978: Dayhoff, Schwartz, Orcutt
- Which AA substitutions are observed when two homologous protein sequences are aligned?
- From aligned sequences of 71 families of closely related proteins (sharing more than 85% sequence identity), they tabulated 1572 substitutions
- An AA substitution is accepted by natural selection when
  - a gene undergoes a DNA mutation so that it translates to a different AA, without significantly altering the gene function
  - the entire species adopts the change as the predominant form of the protein
- The frequencies can represent the expected mutations over short evolutionary distances
- Called PAM (Point Accepted Mutation)
- One PAM unit corresponds to one AA change per 100 residues (1% divergence)

Dayhoff counting Most freq. subs.: Glu to Asp (both acidic)

Protein Substitution Rates: Example
- Six letters: I, K, L, Q, T, V
- Seven sequences, which form an evolutionary tree

A: T L K K V Q K T
B: T L K K V Q K T
C: T L K K I Q K Q
D: I I T K L Q K Q
E: T I T K L Q K Q
F: T L T K I Q K Q
G: T L T Q I Q K Q

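Protein Substitution Rates
Determine $A_{ij}$: count of AA $j$ being substituted by AA $i$

|   | I | K | L | Q | T | V |
|---|---|---|---|---|---|---|
| I | - | - | 2 | - | 1 | 1 |
| K | - | - | - | 1 | 1 | - |
| L | 2 | - | - | - | - | - |
| Q | - | 1 | - | - | 1 | - |
| T | 1 | 1 | - | 1 | - | - |
| V | 1 | - | - | - | - | - |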

Sub. Frequency to Score Matrix
AA mutation probability $M_{ij}$: probability of the original AA $j$ mutating to AA $i$ in one PAM distance
- PAM distance: unit of evolutionary divergence in which 1% of the AAs have changed between two protein sequences
- $M_{ij} \propto A_{ij} / N_i$ ($N_i$: count of amino acid $i$), i.e. normalized by the probability of AA $i$ occurring
- $P_{ij}(t)$: probability that a site has AA $i$ at time $t$ when it had $j$ at time 0
- $P_{ij}(dt) = M_{ij}$

Mutation Prob. Matrix
- Each entry is scaled by 10^5
- The two most frequent substitutions are highlighted

Sub. Frequency to Score Matrix
2. Mutation Prob. Matrix to Log-Odds Scoring
- $q_{ij}$: probability of aligning $j$ to $i$
- $p_i$: probability of observing AA $i$ by chance
- Odds are related to probability: odds = prob/(1 - prob) and prob = odds/(1 + odds)
  - prob = 0: odds = 0
  - prob = 0.5: odds = 1
  - prob = 0.75: odds = 3 (75:25)
- Log-odds: $s_{ij} = 10 \log(q_{ij} / p_i)$, e.g. $s_{ED} = 10 \log(0.00398 / 0.062)$
- Log-odds scores can be added
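The odds conversions and the log-odds score can be tried out numerically. The sketch below assumes the logarithm is base 10, which the slide does not state explicitly; the function names are hypothetical.

```python
import math


def odds_from_prob(prob: float) -> float:
    """odds = prob / (1 - prob), defined for 0 <= prob < 1."""
    if not 0.0 <= prob < 1.0:
        raise ValueError("prob must satisfy 0 <= prob < 1")
    return prob / (1.0 - prob)


def prob_from_odds(odds: float) -> float:
    """prob = odds / (1 + odds), the inverse of odds_from_prob."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


def log_odds_score(q_ij: float, p_i: float) -> float:
    """s_ij = 10 * log10(q_ij / p_i); base 10 is an assumption of this sketch."""
    return 10.0 * math.log10(q_ij / p_i)


if __name__ == "__main__":
    print(odds_from_prob(0.75))             # the 75:25 case from the slide
    s_ed = log_odds_score(0.00398, 0.062)   # the s_ED example from the slide
    print(round(s_ed, 1))
```

Under the base-10 assumption the example evaluates to roughly -11.9. Scores are additive because the logarithm of a product of independent odds ratios is the sum of their logarithms.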

PAM matrix assumptions
- Most important assumption: each AA replacement is independent of previous mutations at the same position
  - The matrix can be extrapolated to predicted substitution frequencies at longer evolutionary distances
  - PAM1 multiplied by itself 100 times represents what one would expect if there were 100 AA changes per 100 residues: PAM100
- All sites are equally mutable, independent of neighboring residues
- No consideration of conserved blocks or motifs
- The forces responsible for sequence evolution over shorter time spans are identical to those over longer time spans
1992: Jones, Taylor, Thornton (JTT)
- 59,190 substitutions in all sequences in Swiss-Prot

PAM1 matrix
- The Mutation Prob. Matrix has $P_{ij}(t)$
- The PAM1 matrix is for related proteins with 1% mutation, i.e. two sequences that are 99% identical
- For distantly related proteins, other PAM matrices are obtained by successively multiplying PAM1

| PAM        | 0   | 30 | 80 | 110 | 200 | 250 |
|------------|-----|----|----|-----|-----|-----|
| % identity | 100 | 75 | 60 | 50  | 25  | 20  |
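Successive multiplication of PAM1 can be illustrated with a toy matrix. The 3-letter matrix below is entirely hypothetical (the real PAM1 matrix is 20 x 20 and is not reproduced here); it only keeps the defining property that 99% of residues are unchanged after one step. Entry `m[i][j]` is the probability that original letter `j` becomes letter `i`, so each column sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, exponent):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


def mean_identity(m):
    """Average diagonal entry: expected fraction of unchanged sites, assuming equal letter frequencies."""
    return sum(m[i][i] for i in range(len(m))) / len(m)


# Hypothetical 3-letter "PAM1": 1% of sites change per step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

if __name__ == "__main__":
    for n in (1, 30, 100, 250):
        print(n, round(mean_identity(mat_power(TOY_PAM1, n)), 3))
```

Because repeated hits at the same site and back-mutations accumulate, the identity falls more slowly than 1% per step. With only three letters the toy levels off near 1/3, whereas a 20-letter alphabet levels off far lower.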

PAM 250 matrix

Pairwise Alignment vs. PAM Distance
- Two sequences of 100 AAs
- After 80 PAM distances (80 mutations), 50 AAs are different
- After 250 hits, 20 AAs remain the same

PAM matrices
- Closely related: Human vs. Chimpanzee (100% AA identical)
- Distantly related: HBA vs. HBB (43% AA identical)

BLOSUM
S. Henikoff and J.G. Henikoff
- BLOcks SUbstitution Matrix, based on the BLOCKS database of aligned protein sequences
- Devised to perform best in identifying distant relationships
- (# of observed pairs of AAs at any position)/(# of pairs expected from the overall AA frequencies) is computed from regions of closely related proteins that can be aligned without gaps
- To avoid overweighting closely related sequences, groups of proteins with sequence identities higher than a threshold are replaced by either a single representative or a weighted average
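The observed/expected ratio can be sketched on a tiny ungapped block. Everything here is illustrative: the two-column block is invented, the function names are hypothetical, and the identity-threshold clustering step described above is deliberately left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def observed_over_expected(block):
    """Return {pair: observed pair frequency / frequency expected by chance}."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}

    # Overall residue frequencies implied by the pair frequencies:
    # a mixed pair contributes half of its frequency to each residue.
    residue_freq = defaultdict(float)
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    ratios = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        ratios[(a, b)] = freq / expected
    return ratios


# Invented block: four aligned sequences, two columns, no gaps.
TOY_BLOCK = ["AW", "AW", "AW", "SY"]

if __name__ == "__main__":
    for pair, ratio in sorted(observed_over_expected(TOY_BLOCK).items()):
        print(pair, round(ratio, 3))
```

A ratio above 1 means the pair is seen more often than chance predicts; taking a scaled logarithm of this ratio yields integer scores of the kind shown for BLOSUM62.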

BLOSUM 62
- BLOSUM with the threshold set at 62%
- Protein sequences sharing less than 62% identity
- Default matrix in BLAST

BLOSUM62
- Most popular
- Diagonal: score for an exact match
  - W-W: score 11, because alignment of W between two sequences is rare
- Off-diagonal
  - W (tryptophan) - Y (tyrosine): score 2
    - Positive score: the pair occurs more often than by chance, but the replacement is not as good as if W is preserved (2 < 11) or if Y is preserved (2 < 7)
  - W - V (valine): score -3
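Scoring an ungapped alignment amounts to looking up each aligned pair and summing. The sketch below uses only the four BLOSUM62 entries quoted on this slide, so it is a partial table for illustration, and any other residue pair raises a `KeyError`; the names are hypothetical.

```python
# Only the entries quoted on the slide; the full BLOSUM62 matrix is 20 x 20.
PARTIAL_BLOSUM62 = {
    ("W", "W"): 11,
    ("Y", "Y"): 7,
    ("W", "Y"): 2,
    ("V", "W"): -3,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score; keys are stored in sorted order."""
    return PARTIAL_BLOSUM62[tuple(sorted((a, b)))]


def score_ungapped(seq1: str, seq2: str) -> int:
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # W-W (11) + Y-W (2) + W-V (-3)
    print(score_ungapped("WYW", "WWV"))
```

Storing each pair once under its sorted key keeps the table symmetric by construction, so W-Y and Y-W always give the same score.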

PAM vs. BLOSUM