Sequence alignment & Substitution matrices By Thomas Nordahl

Sequence alignment
Sequence alignment is the most important technique used in bioinformatics
Infer properties from one protein to another
  Homologous sequences often have similar biological functions
Most information can be deduced from a sequence if the 3D-structure is known
3D-structure determination is very time consuming (X-ray, NMR)
  Several mg of pure protein is required (> 100 mg)
  Make crystal, solve structure: 1-3 years
  Large facilities are needed to produce X-rays
    Rotating anode or synchrotron
Determining the primary sequence is fast and cheap
Structure is more conserved than sequence

Growth of GenBank and WGS

Structures in PDB Genbank

Car parts – analogy to protein folds
A fold: major structural similarity

Protein class & folds
A fold: major structural similarity

Structures in SCOP database
A fold: major structural similarity
The “world” seems to consist of approx. 1400 protein folds.
Up to 2014, no new folds have been observed.

What can we learn from sequence alignment
Find a similar sequence from another organism
Information from the known sequence can be inherited
Layers of conserved information:
  Structure > function > sequence, where '>' means "more conserved than"
Structure (3D) is the most conserved feature
Proteins with different functions may still share the same structure
Proteins with different sequences may still share the same function
Often same function if 40-50% sequence identity
Often same protein fold if above 30% sequence identity
A fold: major structural similarity

Sequence alignment
seq1: M V S T A
seq2: M A T S A
Number of identical amino acids (aa), % identity?
Alignment score using the identity matrix?
Similar amino acids can be substituted for one another, so other types of substitution matrices are used.
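The two questions on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the slide shows the two sequences MVSTA and MATSA aligned without gaps; the function names are illustrative and not part of the original slides.

```python
def count_identities(a, b):
    """Count positions where two equal-length aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of aligned positions that carry the same residue."""
    return 100.0 * count_identities(a, b) / len(a)


def identity_matrix_score(a, b):
    """Alignment score with an identity matrix: 1 per match, 0 per mismatch."""
    return sum(1 if x == y else 0 for x, y in zip(a, b))


# Assumed reading of the slide: an ungapped alignment of two 5-residue sequences.
seq1 = "MVSTA"
seq2 = "MATSA"

n_identical = count_identities(seq1, seq2)
pid = percent_identity(seq1, seq2)
score = identity_matrix_score(seq1, seq2)
print(n_identical, pid, score)
```

With an identity matrix the score is simply the number of identical positions, which is why the S/T and T/S pairs contribute nothing even though serine and threonine are chemically similar.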

Blosum matrices
Blosum matrices are the most commonly used substitution matrices
  Blosum50, Blosum62, Blosum80
Symmetrical 20 x 20 matrix, where each element is a substitution score
Positive scores: the amino acids are likely to be aligned in a sequence alignment
  They share similar chemical characteristics
Negative scores: less likely substitutions, but they still occur
Zero scores: the pair is aligned about as often as expected by chance
Q) In an alignment, what is the most likely amino acid that Arg will align to besides itself?

Log-odds scores
Log-odds scores are given by
  log(Observed/Expected)
The log-odds score of matching amino acid j with amino acid i in an alignment is
  Sij = log(Pij/(QiQj))
where Pij is the frequency with which i is observed aligned with j, and Qi, Qj are the frequencies of amino acids i and j in the data set.
Expressed with base-2 logarithms (bit units), the log-odds score is
  Sij = 2log2(Pij/(QiQj))
where log2(x) = ln(x)/ln(2)
S has been normalized to half bits, hence the factor 2
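As a small illustration of the half-bit log-odds score, the sketch below evaluates Sij = 2log2(Pij/(QiQj)) for a few made-up frequencies. The function name and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def log_odds_half_bits(p_ij, q_i, q_j):
    """Half-bit log-odds score: 2 * log2(observed / expected)."""
    return 2.0 * math.log2(p_ij / (q_i * q_j))


# Hypothetical background frequencies; the expected pair frequency is 0.125.
q_i, q_j = 0.25, 0.5

more_than_expected = log_odds_half_bits(0.25, q_i, q_j)    # observed twice as often
as_expected = log_odds_half_bits(0.125, q_i, q_j)          # observed as often as expected
less_than_expected = log_odds_half_bits(0.0625, q_i, q_j)  # observed half as often
print(more_than_expected, as_expected, less_than_expected)
```

A pair seen twice as often as expected scores +2 half bits, a pair seen exactly as often as expected scores 0, and a pair seen half as often scores -2.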

Example of a scoring matrix BLOSUM80
Rows and columns (amino acids): A R N D C Q E G H I L K M F P S T W Y V
Log-odds scores have been rounded off to integers

An example
MSA – Multiple Sequence Alignment
      1 2 3 4
seq1: V V A D
seq2: A A A D
seq3: D V A D
seq4: D A A A
Sij = 2log2(Pij/(QiQj))
Pij can be calculated as Nij/(Sum Nij), where
  Nij is the number of times amino acid i is aligned to amino acid j
  Sum Nij is the total number of aligned pairs, summed over all i and j
Qi is the observed frequency of amino acid i in the alignment
How to calculate NAA: count the ordered pairs of sequences that both have A in the same column
  Column 2: 2 pairs, column 3: 12 pairs, columns 1 and 4: 0 pairs
NAA = 14

An example
MSA – Multiple Sequence Alignment
      1234
seq1: VVAD
seq2: AAAD
seq3: DVAD
seq4: DAAA

NAA = 14   PAA = 14/48
NAD = 5    PAD = 5/48
NAV = 5    PAV = 5/48
NDA = 5    PDA = 5/48
NDD = 8    PDD = 8/48
NDV = 2    PDV = 2/48
NVA = 5    PVA = 5/48
NVD = 2    PVD = 2/48
NVV = 2    PVV = 2/48

QA = 8/16  QD = 5/16  QV = 3/16
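The pair counts for this example can be reproduced by counting, column by column, every ordered pair of residues that come from two different sequences. The following is a minimal sketch of that counting scheme; the function and variable names are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import permutations

msa = ["VVAD", "AAAD", "DVAD", "DAAA"]


def pair_counts(alignment):
    """Count ordered residue pairs (i, j) taken from two different rows of a column."""
    counts = Counter()
    for column in zip(*alignment):
        # permutations() pairs every row with every other row, in both orders.
        for a, b in permutations(column, 2):
            counts[(a, b)] += 1
    return counts


def residue_frequencies(alignment):
    """Fraction of all residues in the alignment that are each amino acid."""
    residues = Counter("".join(alignment))
    total = sum(residues.values())
    return {aa: n / total for aa, n in residues.items()}


N = pair_counts(msa)
total_pairs = sum(N.values())
P = {pair: n / total_pairs for pair, n in N.items()}
Q = residue_frequencies(msa)

print(N[("A", "A")], total_pairs, Q["A"])
```

Each of the 4 columns contributes 4 x 3 = 12 ordered pairs, which gives the denominator of 48 used for the Pij values.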

Example continued
MSA
1: VVAD
2: AAAD
3: DVAD
4: DAAA

QA = 0.50  QD = 0.31  QV = 0.19

PAA = 0.29   QAQA = 0.25
PAD = 0.10   QAQD = 0.16
PAV = 0.10   QAQV = 0.09
PDA = 0.10   QDQA = 0.16
PDD = 0.17   QDQD = 0.10
PDV = 0.04   QDQV = 0.06
PVA = 0.10   QVQA = 0.09
PVD = 0.04   QVQD = 0.06
PVV = 0.04   QVQV = 0.04

So what does this mean?
BLOSUM is a log-odds (log-likelihood ratio) matrix: Sij = 2log2(Pij/(QiQj))

PAA = 0.29   QAQA = 0.25   SAA =  0.44
PAD = 0.10   QAQD = 0.16   SAD = -1.17
PAV = 0.10   QAQV = 0.09   SAV =  0.30
PDA = 0.10   QDQA = 0.16   SDA = -1.17
PDD = 0.17   QDQD = 0.10   SDD =  1.54
PDV = 0.04   QDQV = 0.06   SDV = -0.98
PVA = 0.10   QVQA = 0.09   SVA =  0.30
PVD = 0.04   QVQD = 0.06   SVD = -0.98
PVV = 0.04   QVQV = 0.04   SVV =  0.49

The Sij values are computed from the unrounded fractions (for example PAA = 14/48 and QA = 8/16), not from the two-decimal values shown.
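The scores on this slide follow directly from the four-sequence alignment. The sketch below recomputes them from exact counts rather than from the rounded two-decimal frequencies; it follows the simplified formula used in these slides, and the helper name `score` is illustrative.

```python
import math
from collections import Counter
from itertools import permutations

msa = ["VVAD", "AAAD", "DVAD", "DAAA"]

# Ordered residue pairs from different rows of each column (48 in total).
pairs = Counter(p for column in zip(*msa) for p in permutations(column, 2))
n_pairs = sum(pairs.values())

# Residue counts over the whole alignment (16 residues in total).
residues = Counter("".join(msa))
n_residues = sum(residues.values())


def score(i, j):
    """Half-bit log-odds score Sij = 2 * log2(Pij / (Qi * Qj))."""
    p_ij = pairs[(i, j)] / n_pairs
    q_i = residues[i] / n_residues
    q_j = residues[j] / n_residues
    return 2.0 * math.log2(p_ij / (q_i * q_j))


for i, j in [("A", "A"), ("A", "D"), ("A", "V"), ("D", "D"), ("D", "V"), ("V", "V")]:
    print(f"S{i}{j} = {score(i, j):.2f}")
```

Because the pair counts are symmetric, `score(i, j)` equals `score(j, i)`, so only the upper triangle needs to be listed.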

The scoring matrix
MSA
1: VVAD
2: AAAD
3: DVAD
4: DAAA

       A      D      V
A   0.44  -1.17   0.30
D          1.54  -0.98
V                 0.49

And what does the BLOSUMXX mean?
XX is the percent identity threshold used to cluster the sequences from which the matrix is built
High Blosum values mean high similarity between clusters
  Conserved substitutions allowed
Low Blosum values mean low similarity between clusters
  Less conserved substitutions allowed

BLOSUM80
Rows and columns (amino acids): A R N D C Q E G H I L K M F P S T W Y V
<Sii> = 9.4
<Sij> = -2.9

BLOSUM30
Rows and columns (amino acids): A R N D C Q E G H I L K M F P S T W Y V
Blosum30: <Sii> = 8.3, <Sij> = -1.16
Blosum80: <Sii> = 9.4, <Sij> = -2.9
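The slides do not define <Sii> and <Sij>. Assuming they denote the mean of the diagonal scores and the mean of the off-diagonal scores, the sketch below shows how such summary values could be computed. It uses the small three-residue scoring matrix from the worked example, not the real BLOSUM matrices, and the function name is illustrative.

```python
# Upper triangle of the toy scoring matrix from the worked example (A, D, V).
toy_matrix = {
    ("A", "A"): 0.44, ("A", "D"): -1.17, ("A", "V"): 0.30,
    ("D", "D"): 1.54, ("D", "V"): -0.98,
    ("V", "V"): 0.49,
}


def mean_scores(upper_triangle):
    """Return (mean diagonal score, mean off-diagonal score).

    The matrix is symmetric, so averaging the upper triangle gives the same
    off-diagonal mean as averaging the full matrix.
    """
    diagonal = [s for (i, j), s in upper_triangle.items() if i == j]
    off_diagonal = [s for (i, j), s in upper_triangle.items() if i != j]
    return sum(diagonal) / len(diagonal), sum(off_diagonal) / len(off_diagonal)


mean_sii, mean_sij = mean_scores(toy_matrix)
print(f"mean Sii = {mean_sii:.2f}, mean Sij = {mean_sij:.2f}")
```

Under this reading, a larger gap between the two averages indicates a matrix that rewards identities more and penalizes substitutions more, as in the Blosum80 versus Blosum30 comparison.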
