Proteins, Pair HMMs, and Alignment. CS262 Lecture 8, Win06, Batzoglou


1 Proteins, Pair HMMs, and Alignment

2 A state model for alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
States: M (+1, +1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
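The 1-to-1 correspondence between alignments and state sequences can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names are hypothetical, and the convention that state I marks a column with a gap in the top row (and J a gap in the bottom row) is read off the example alignment on this slide.

```python
def alignment_to_states(top: str, bottom: str, gap: str = "-") -> str:
    """Return the M/I/J state string for two equal-length gapped rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    states = []
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("a column cannot consist of two gaps")
        if a == gap:
            states.append("I")   # only the bottom row has a letter
        elif b == gap:
            states.append("J")   # only the top row has a letter
        else:
            states.append("M")   # both rows have a letter
    return "".join(states)


def states_to_alignment(states: str, top_seq: str, bottom_seq: str, gap: str = "-"):
    """Rebuild the two gapped rows from a state string and the ungapped sequences."""
    top, bottom = [], []
    i = j = 0  # next unread position in top_seq / bottom_seq
    for s in states:
        if s == "M":
            top.append(top_seq[i])
            bottom.append(bottom_seq[j])
            i += 1
            j += 1
        elif s == "I":
            top.append(gap)
            bottom.append(bottom_seq[j])
            j += 1
        elif s == "J":
            top.append(top_seq[i])
            bottom.append(gap)
            i += 1
        else:
            raise ValueError(f"unknown state {s!r}")
    if i != len(top_seq) or j != len(bottom_seq):
        raise ValueError("state string does not consume both sequences")
    return "".join(top), "".join(bottom)


top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
states = alignment_to_states(top, bottom)
print(states)
```

Going from the state string back to the two rows with `states_to_alignment` recovers the original alignment, which is what the 1-to-1 claim amounts to.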

3 Let’s score the transitions
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
States: M (+1, +1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
Transition scores: $s(x_i, y_j)$, $-d$, $-e$

4 Alignment with affine gaps – state version
Dynamic Programming:
- $M(i, j)$: optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in M
- $I(i, j)$: optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in I
- $J(i, j)$: optimal alignment of $x_1 \dots x_i$ to $y_1 \dots y_j$ ending in J
The score is additive, therefore we can apply DP recurrence formulas

5 Alignment with affine gaps – state version
Initialization:
$M(0, 0) = 0$; $M(i, 0) = M(0, j) = -\infty$, for $i, j > 0$
$I(i, 0) = d + i \cdot e$; $J(0, j) = d + j \cdot e$
Iteration:
$M(i, j) = s(x_i, y_j) + \max \{ M(i-1, j-1),\ I(i-1, j-1),\ J(i-1, j-1) \}$
$I(i, j) = \max \{ e + I(i-1, j),\ d + M(i-1, j) \}$
$J(i, j) = \max \{ e + J(i, j-1),\ d + M(i, j-1) \}$
Termination:
Optimal alignment given by $\max \{ M(m, n),\ I(m, n),\ J(m, n) \}$
(On this slide d and e are added to the score, so here they stand for the negative gap scores that the other slides write as $-d$ and $-e$.)
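A minimal Python sketch of the three-state dynamic program follows. It is an illustration under stated assumptions, not the lecture's code: `affine_gap_score` is a hypothetical name, `d` and `e` are taken as positive penalties that are subtracted (the $-d$, $-e$ convention), a gap of length k is assumed to cost d + (k - 1)·e, and only the optimal score is returned (no traceback).

```python
import math


def affine_gap_score(x, y, s, d, e):
    """Best global alignment score; d = gap initiation, e = gap extension (both positive)."""
    m, n = len(x), len(y)
    neg_inf = -math.inf
    M = [[neg_inf] * (n + 1) for _ in range(m + 1)]  # alignments ending in a match column
    I = [[neg_inf] * (n + 1) for _ in range(m + 1)]  # ending with a letter of x against a gap
    J = [[neg_inf] * (n + 1) for _ in range(m + 1)]  # ending with a letter of y against a gap
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e   # assumed cost of a leading gap of length i
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = s(x[i - 1], y[j - 1]) + max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1])
            I[i][j] = max(M[i - 1][j] - d, I[i - 1][j] - e)
            J[i][j] = max(M[i][j - 1] - d, J[i][j - 1] - e)
    return max(M[m][n], I[m][n], J[m][n])


def simple_score(a, b):
    """Toy substitution score (an assumption for the example): +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0


print(affine_gap_score("ACGT", "AT", simple_score, d=2.0, e=1.0))
```

Keeping three tables instead of one is what makes the affine penalty work: the I and J tables remember whether a gap is already open, so extending it costs e rather than d.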

6 Brief introduction to the evolution of proteins
- Protein sequence and structure
- Protein classification
- Phylogeny trees
- Substitution matrices

7 Muscle cells and contraction

8 Actin and myosin during muscle movement

9 Actin structure

10 Actin sequence
Actin is ancient and abundant
- Most abundant protein in cells
- 1-2 actin genes in bacteria, yeasts, amoebas
- Humans: 6 actin genes
- α-actin in muscles; β-actin, γ-actin in non-muscle cells
~4 amino acids different between each version
MUSCLE ACTIN Amino Acid Sequence
  1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF

11 A related protein in bacteria

12 Relation between sequence and structure

13 Protein Phylogenies
Proteins evolve by both duplication and species divergence

14 Protein Phylogenies – Example

15 Structure Determines Function
What determines structure?
- Energy
- Kinematics
How can we determine structure?
- Experimental methods
- Computational predictions
The Protein Folding Problem

16 Primary Structure: Sequence
The primary structure of a protein is the amino acid sequence

17 Primary Structure: Sequence
Twenty different amino acids have distinct shapes and properties

18 Primary Structure: Sequence
A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"

19 Secondary Structure: α, β, & loops
α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms

20 Tertiary Structure: A Protein Fold

21 PDB Growth
New PDB structures

22 Only a few folds are found in nature

23 Protein classification
- Number of protein sequences grows exponentially
- Number of solved structures grows exponentially
- Number of new folds identified very small (and close to constant)
Protein classification can
- Generate overview of structure types
- Detect similarities (evolutionary relationships) between protein sequences
- Help predict 3D structure of new protein sequences
(Morten Nielsen, CBS, BioCentrum, DTU)
SCOP release 1.69: classification of 25,973 protein structures in PDB
Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845

24 Protein structure classification
Diagram labels: Protein world, Protein fold, Protein superfamily, Protein family
(Morten Nielsen, CBS, BioCentrum, DTU)

25 Structure Classification Databases
- SCOP: manual classification (A. Murzin), scop.berkeley.edu
- CATH: semi-manual classification (C. Orengo), www.biochem.ucl.ac.uk/bsm/cath
- FSSP: automatic classification (L. Holm), www.ebi.ac.uk/dali/fssp/fssp.html
(Morten Nielsen, CBS, BioCentrum, DTU)

26 Major classes in SCOP
Classes
- All α proteins
- All β proteins
- α and β proteins (α/β)
- α and β proteins (α+β)
- Multi-domain proteins
- Membrane and cell surface proteins
- Small proteins
- Coiled coil proteins
(Morten Nielsen, CBS, BioCentrum, DTU)

27 All α: Hemoglobin (1bab)
(Morten Nielsen, CBS, BioCentrum, DTU)

28 All β: Immunoglobulin (8fab)
(Morten Nielsen, CBS, BioCentrum, DTU)

29 α/β: Triosephosphate isomerase (1hti)
(Morten Nielsen, CBS, BioCentrum, DTU)

30 α+β: Lysozyme (1jsf)
(Morten Nielsen, CBS, BioCentrum, DTU)

31 Families
Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
- The same protein may be found in several species
Hierarchy labels on slide: Fold, Superfamily, Family, Proteins
(Morten Nielsen, CBS, BioCentrum, DTU)

32 Superfamilies
Proteins which are (remotely) evolutionarily related
- Sequence similarity low
- Share function
- Share special structural features
Relationships between members of a superfamily may not be readily recognizable from the sequence alone
Hierarchy labels on slide: Fold, Superfamily, Family, Proteins
(Morten Nielsen, CBS, BioCentrum, DTU)

33 Folds
>~50% of secondary structure elements arranged in the same order in sequence and in 3D
No evolutionary relation
Hierarchy labels on slide: Fold, Superfamily, Family, Proteins
(Morten Nielsen, CBS, BioCentrum, DTU)

34 Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences!

35 Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate $A_{ab}$: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate $P(a) = (\sum_b A_{ab}) / (\sum_{c \le d} A_{cd})$; $P(a, b) = A_{ab} / (\sum_{c \le d} A_{cd})$
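The counting and estimation steps can be sketched in Python as follows. This is a toy illustration, not the actual BLOSUM pipeline: `blosum_like_scores` is a hypothetical name, the clusters are assumed to be given already (step 2 is skipped), and the counts are kept symmetric over ordered pairs so that the marginals P(a) sum to 1, which is a normalization choice of this sketch rather than the slide's sum over c ≤ d. Scores are plain base-2 log-odds without the scaling and rounding of published matrices.

```python
import math
from collections import defaultdict
from itertools import combinations


def blosum_like_scores(clusters):
    """clusters: list of clusters, each a list of equal-length, gap-free sequences."""
    A = defaultdict(float)
    for c1, c2 in combinations(clusters, 2):       # only pairs from distinct clusters
        w = 1.0 / (len(c1) * len(c2))              # the 1/mn correction
        for s1 in c1:
            for s2 in c2:
                for a, b in zip(s1, s2):           # aligned letters, column by column
                    A[(a, b)] += w
                    A[(b, a)] += w                 # keep the counts symmetric
    total = sum(A.values())
    p_pair = {ab: v / total for ab, v in A.items()}
    p_single = defaultdict(float)
    for (a, _b), p in p_pair.items():
        p_single[a] += p                           # marginal: P(a) = sum over b of P(a, b)
    scores = {(a, b): math.log2(p / (p_single[a] * p_single[b]))
              for (a, b), p in p_pair.items()}
    return p_pair, dict(p_single), scores


# Two toy clusters of one sequence each (made-up data).
p_pair, p_single, scores = blosum_like_scores([["AA"], ["AC"]])
print(p_pair, p_single, scores)
```

The 1/mn weight means that a cluster of many near-identical sequences contributes as much as a single sequence, which is the point of clustering before counting.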

36 Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution
Goal:
- Produce the most likely alignment
- Assess the likelihood that the sequences are indeed related

37 A Pair HMM for alignments
Model M (diagram): state M emits $P(x_i, y_j)$; state I emits $P(x_i)$; state J emits $P(y_j)$
Transitions: M→M with $1 - 2\delta$; M→I and M→J with $\delta$; I→I and J→J with $\varepsilon$; I→M and J→M with $1 - \varepsilon$
This model generates two sequences simultaneously
- Match/Mismatch state M: $P(x, y)$ reflects substitution frequencies between pairs of amino acids
- Insertion states I, J: $P(x)$, $P(y)$ reflect frequencies of each amino acid
- $\delta$: set so that $1/(2\delta)$ is avg. length before next gap
- $\varepsilon$: set so that $1/(1 - \varepsilon)$ is avg. length of a gap
(One element of the diagram is labelled "optional".)
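To make "generates two sequences simultaneously" concrete, here is a small Python sampler for the pair HMM. It is a sketch under assumptions the slide does not state: the chain is assumed to start from M, a fixed number of columns is generated because the model has no end state, and the function name and the toy DNA parameters are made up for illustration.

```python
import random


def sample_pair_hmm(n_columns, delta, eps, pair_probs, letter_probs, seed=0):
    """Sample n_columns alignment columns; returns (states, x_row, y_row)."""
    rng = random.Random(seed)
    pairs, pair_w = zip(*pair_probs.items())
    letters, letter_w = zip(*letter_probs.items())
    state = "M"                                    # assumed start state
    states, x_row, y_row = [], [], []
    for _ in range(n_columns):
        # Transition: from M open a gap with probability delta each way;
        # from a gap state, extend it with probability eps or return to M.
        if state == "M":
            state = rng.choices(["M", "I", "J"], weights=[1 - 2 * delta, delta, delta])[0]
        else:
            state = rng.choices(["M", state], weights=[1 - eps, eps])[0]
        # Emission: M emits a pair, I a letter of x only, J a letter of y only.
        if state == "M":
            a, b = rng.choices(pairs, weights=pair_w)[0]
        elif state == "I":
            a, b = rng.choices(letters, weights=letter_w)[0], "-"
        else:
            a, b = "-", rng.choices(letters, weights=letter_w)[0]
        states.append(state)
        x_row.append(a)
        y_row.append(b)
    return "".join(states), "".join(x_row), "".join(y_row)


# Toy DNA parameters (illustrative values only).
bases = "ACGT"
pair_probs = {(a, b): (0.2 if a == b else 0.2 / 12) for a in bases for b in bases}
letter_probs = {a: 0.25 for a in bases}
states, x_row, y_row = sample_pair_hmm(40, delta=0.1, eps=0.5,
                                       pair_probs=pair_probs, letter_probs=letter_probs, seed=1)
print(states)
print(x_row)
print(y_row)
```

Removing the gap characters from the two rows gives the two generated sequences; the state string is the alignment that produced them.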

38 A Pair HMM for unaligned sequences
Model R (diagram): state I emits $P(x_i)$ and state J emits $P(y_j)$, each with transition probability 1
The two sequences are generated independently of one another
$P(x, y \mid R) = P(x_1) \cdots P(x_m)\, P(y_1) \cdots P(y_n) = \prod_i P(x_i) \prod_j P(y_j)$

39 To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
- M: $(1 - 2\delta)\, P(x_i, y_j)$ when matched; $\varepsilon\, P(x_i)\, P(y_j)$ when gapped
- R: $P(x_i)\, P(y_j)$ in random model
Focus on comparison of $P(x_i, y_j)$ vs. $P(x_i)\, P(y_j)$
[Diagram: Model M and Model R shown side by side]

40 To compare ALIGNMENT vs. RANDOM hypothesis
Idea: We will divide alignment score by the random score, and take logarithms
Let
$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i)\, P(y_j)} + \log(1 - 2\delta)$ (defn: substitution score)
$d = -\log \frac{\delta\, (1 - \varepsilon)\, P(x_i)}{(1 - 2\delta)\, P(x_i)}$ (defn: gap initiation penalty)
$e = -\log \frac{\varepsilon\, P(x_i)}{P(x_i)}$ (defn: gap extension penalty)
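These three definitions translate directly into code. The sketch below is illustrative only: `pair_hmm_to_scores` is a hypothetical helper, base-2 logarithms are an assumption (the slide does not fix a base), and the $P(x_i)$ factors in d and e are cancelled since they appear in both numerator and denominator.

```python
import math


def pair_hmm_to_scores(delta, eps, p_pair, p_single, log=math.log2):
    """Turn pair-HMM parameters into substitution scores s, gap initiation d, gap extension e."""
    s = {(a, b): log(p / (p_single[a] * p_single[b])) + log(1 - 2 * delta)
         for (a, b), p in p_pair.items()}
    d = -log(delta * (1 - eps) / (1 - 2 * delta))   # P(x_i) cancels
    e = -log(eps)                                    # P(x_i) cancels
    return s, d, e


# Toy DNA parameters (illustrative values only).
bases = "ACGT"
p_pair = {(a, b): (0.2 if a == b else 0.2 / 12) for a in bases for b in bases}
p_single = {a: 0.25 for a in bases}
s, d, e = pair_hmm_to_scores(delta=0.1, eps=0.5, p_pair=p_pair, p_single=p_single)
print(s[("A", "A")], s[("A", "C")], d, e)
```

With these toy values, d works out to 4 bits and e to 1 bit, so opening a gap is penalized more heavily than extending one.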

41 The meaning of alignment scores
The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps
$V_M(i, j) = \max \{ V_M(i-1, j-1),\ V_I(i-1, j-1),\ V_J(i-1, j-1) \} + s(x_i, y_j)$
$V_I(i, j) = \max \{ V_M(i-1, j) - d,\ V_I(i-1, j) - e \}$
$V_J(i, j) = \max \{ V_M(i, j-1) - d,\ V_J(i, j-1) - e \}$
- $s(\cdot, \cdot)$ ~ how often a pair of letters substitute one another
- $1 - \varepsilon$ ~ 1/mean length of next gap
- $\delta$ ~ 1/mean arrival time of next gap

42 The meaning of alignment scores
Match/mismatch scores: $s(a, b) \approx \log \frac{P(a, b)}{P(a)\, P(b)}$ (ignore $\log(1 - 2\delta)$ for the moment)
Example: DNA regions between human and mouse genes have average conservation of 80% (logarithms below are base 2)
1. What is the substitution score for a match?
   $P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8 \Rightarrow P(x, x) = 0.2$
   $P(a) = P(c) = P(g) = P(t) = 0.25$
   $s(x, x) = \log [0.2 / 0.25^2] = 1.68$
2. What is the substitution score for a mismatch?
   $P(a, c) + \dots + P(t, g) = 0.2 \Rightarrow P(x, y \ne x) = 0.2 / 12 = 0.0167$
   $s(x, y \ne x) = \log [0.0167 / 0.25^2] = -1.91$
3. What ratio matches/(matches + mism.) gives score 0?
   $1.91 / (1.68 + 1.91) = 53\%$ (~halfway between random and conserved model)
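The arithmetic of this example can be rechecked with a few lines of Python. The sketch assumes base-2 logarithms, uniform base frequencies of 0.25, and that the 20% non-conserved mass is spread evenly over the 12 ordered mismatch pairs; the variable names are chosen for this illustration.

```python
import math

identity = 0.80                       # average conservation
p_base = 0.25                         # P(a) = P(c) = P(g) = P(t)
p_match = identity / 4                # P(x, x) for each of the 4 identical pairs
p_mismatch = (1 - identity) / 12      # P(x, y != x) for each of the 12 ordered mismatch pairs

s_match = math.log2(p_match / p_base ** 2)
s_mismatch = math.log2(p_mismatch / p_base ** 2)

# Fraction of matches f at which f * s_match + (1 - f) * s_mismatch = 0.
break_even = -s_mismatch / (s_match - s_mismatch)

print(round(s_match, 2), round(s_mismatch, 2), round(break_even, 2))
# prints: 1.68 -1.91 0.53
```

The break-even identity of about 53% sits roughly midway between the 25% expected for unrelated DNA and the 80% of the conserved model.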

43 Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate $A_{ab}$: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate $P(a) = (\sum_b A_{ab}) / (\sum_{c \le d} A_{cd})$; $P(a, b) = A_{ab} / (\sum_{c \le d} A_{cd})$

44 BLOSUM matrices
BLOSUM 50 and BLOSUM 62 (the two are scaled differently)

