Sequence alignment Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics
Biologically significant alignment
1. Find two truly related sequences (subunits of human hemoglobin) in GenBank: hba_human and hbb_human
2. Save sequences on the Desktop and rename: hba_human.fasta & hbb_human.fasta
Biologically significant alignment
3. Visit a web-based pair-wise alignment program:
4. Upload our two proteins:
Biologically significant alignment 5. Create a pair-wise alignment between the two protein sequences:
Biologically plausible alignment Retrieve another sequence, leghemoglobin: Create a pair-wise alignment with human hemoglobin A:
Biologically plausible alignment
Spurious alignment Retrieve the sequence of a human BRCA1 gene variant, clearly not related to hemoglobin: Make the pair-wise alignment: Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison
Alignment types Examples from: BLAST. Korf, Yandell, Bedell
How do we align the words CRANE and FRAME?
CRANE
 || |
FRAME
3 matches, 2 mismatches
How do we align words that are different in length?
COELACANTH
  || |||
P-ELICAN--
COELACANTH
  || |||
-PELICAN--
5 matches, 2 mismatches, 3 gaps
In this case, if we assign +1 point for matches and -1 for mismatches or gaps, we get 5 x (+1) + 2 x (-1) + 3 x (-1) = 0. This is the alignment score.
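The score arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `alignment_score` and its parameters are illustrative, and it assumes the +1 / -1 / -1 scheme used in this example, with "-" marking a gap.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # a gap in either sequence
        elif a == b:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


print(alignment_score("CRANE", "FRAME"))
print(alignment_score("COELACANTH", "P-ELICAN--"))
```

Changing the three parameters is enough to try other match-mismatch-gap schemes on the same alignments.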
Finding the “best” alignment
COELACANTH
  || |||
P-ELICAN--
S=0
COELACANTH
   | |||
PE-LICAN--
S=-2
COELACANTH
  ||
P-EL-ICAN-
S=-6
COELACANTH
PELICAN---
S=-10
Global vs. local alignment Example from: Higgs and Attwood
Aligning words: SHAKE and SPEARE
1. Global alignment: aligning the two sequences along their entire length (even if it means adding many “gaps”):
SH-AKE
|  | |
SPEARE
-OR-
SHAKE---
|
SP--EARE
2. Local alignment: aligning only a well-matching section between the two sequences (possibly leaving the ends unaligned):
 SHAKE
   | |
SPEARE
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Pair-wise amino-acid scores S(a_i,b_j) (PAM250 scoring scheme) plus a gap penalty g = 6 (each gap changes the score by -6).
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Recursion scheme to calculate scores from already known scores:
H(i,j) = best of:
  H(i-1,j-1) + S(a_i,b_j)   diagonal
  H(i-1,j) - g              vertical
  H(i,j-1) - g              horizontal
Here g is subtracted as a gap cost, so every gap lowers the score.
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Initialization (filling the top row and left column from gap scores): Align the two sequences: AAGATTCAC and CCGCTCAA
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Filling cell (1,1):
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Filling the rest of the cells (i,j):
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Tracing back to read out the alignment. Best global alignment:
S-HAKE
SPEARE
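The initialization, cell-filling and trace-back steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the worked example from Higgs and Attwood: because the PAM250 table is not reproduced in these notes, it uses the simple +1 match / -1 mismatch scheme with a gap cost of 1 from the COELACANTH / PELICAN example. The function name `needleman_wunsch` and its parameters are made up for this sketch. When several alignments tie for the best score, it returns just one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap_cost=1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: first column and first row are filled from gap costs.
    for i in range(1, n + 1):
        H[i][0] = -i * gap_cost
    for j in range(1, m + 1):
        H[0][j] = -j * gap_cost
    # Fill every cell from its diagonal, vertical and horizontal neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,      # diagonal
                          H[i - 1][j] - gap_cost,   # vertical
                          H[i][j - 1] - gap_cost)   # horizontal
    # Trace back from the bottom-right cell to read out one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] - gap_cost:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(top)
print(bottom)
```

To reproduce the SHAKE / SPEARE result, the match/mismatch test would have to be replaced by a PAM250 lookup and the gap cost set to 6.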
Local alignment – Smith-Waterman Example from: Higgs and Attwood Recursion scheme changes:
1. if the best score for a cell is negative, we replace it by 0 (start over)
2. gaps at the boundary are ignored: they get a score of 0
H(i,j) = best of:
  H(i-1,j-1) + S(a_i,b_j)   diagonal
  H(i-1,j) - g              vertical
  H(i,j-1) - g              horizontal
  0                         start over
Local alignment – Smith-Waterman Example from: Higgs and Attwood Initialization Align the two sequences: AAGATTCAC and CCGCTCAA
Local alignment – Smith-Waterman Example from: Higgs and Attwood Filling the cells:
Local alignment – Smith-Waterman Example from: Higgs and Attwood Trace-back. Best local alignment: SHAKE SPEARE
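The two changes to the recursion (a floor of 0, and a zero-filled boundary) can be sketched in Python. This is a minimal sketch under the same assumptions as the simple word examples: +1 match, -1 mismatch, gap cost 1, not PAM250. The function name `smith_waterman` is illustrative. Trace-back starts at the highest-scoring cell and stops at the first cell with score 0.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap_cost=1):
    """Local alignment; returns (score, aligned_part_of_a, aligned_part_of_b)."""
    n, m = len(a), len(b)
    # Boundary cells stay 0: gaps at the ends are not penalized.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                        # start over
                          H[i - 1][j - 1] + s,      # diagonal
                          H[i - 1][j] - gap_cost,   # vertical
                          H[i][j - 1] - gap_cost)   # horizontal
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap_cost:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("COELACANTH", "PELICAN"))
```

Under this scoring scheme the unaligned ends (CO, TH and the leading P) are simply left out of the result instead of being paid for with gaps.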
Visualizing pair-wise alignments Visit a web server running a dot-plotter: Upload hba_human and hbb_human, and create dot-plot:
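A dot-plot can also be produced without a web server. The sketch below is a simplified illustration, not the method of any particular dot-plot server: it marks a "*" wherever a window of identical letters starts in both sequences. The function name `dot_plot` and the `window` parameter are assumptions of this sketch; real dot-plotters usually add scoring matrices and thresholds.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return text rows (one per seq_y position); '*' marks identical windows."""
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = "".join(
            "*" if seq_x[i:i + window] == seq_y[j:j + window] else "."
            for i in range(len(seq_x) - window + 1)
        )
        rows.append(row)
    return rows


for line in dot_plot("COELACANTH", "PELICAN"):
    print(line)
```

Related regions show up as diagonal runs of "*"; a larger `window` removes most of the isolated dots that arise by chance.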
Scoring schemes Match-mismatch-gap penalties, e.g. Match = 1, Mismatch = -5, Gap = -10. Scoring matrices
Multiple alignments Fetch HXK (hexokinase) sequences from NCBI; save as hxk.fasta on the Desktop
Multiple alignments Visit a web-hosted ClustalW site and upload the HXK sequences
Multiple alignments The multiple alignment of 24 hexokinase protein sequences from various species
Anchored multiple alignment
Similarity searching vs. alignment [Figure labels: Alignment; Similarity search (query, database)]
The BLAST algorithms
Program | Database | Query | Typical Uses
BLASTN | Nucleotide | Nucleotide | Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts.
BLASTP | Protein | Protein | Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis.
BLASTX | Protein | Nucleotide | Finding protein-coding genes in genomic DNA.
TBLASTN | Nucleotide | Protein | Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA.
TBLASTX | Nucleotide | Nucleotide | Cross-species gene prediction. Searching for genes missed by traditional methods.
BLAST report gi|
BLAST report
The BLAST algorithm
Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as 2 sequences and sometimes as query x database.
Global alignment vs. local alignment – BLAST is local
Maximum scoring pair (MSP) vs. High-scoring pair (HSP) – BLAST finds HSPs (usually the MSP too)
Gapped vs. ungapped – BLAST can do both
The BLAST algorithm
BLOSUM62 neighborhood of RGD (word and score): RGD 17, KGD 14, QGD 13, RGE 13, EGD 12, HGD 12, NGD 12, RGN 12, AGD 11, MGD 11, RAD 11, RGQ 11, RGS 11, RND 11, RSD 11, SGD 11, TGD 11
T=12: only words scoring at least 12 count as neighborhood words
Speed gained by minimizing search space
Alignments require word hits
Neighborhood words
W and T modulate speed and sensitivity
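The effect of the threshold T can be illustrated with the word scores listed on this slide. The sketch below is only a filter over that list: the names `RGD_WORD_SCORES` and `neighborhood` are made up for illustration, and the code does not compute BLOSUM62 scores itself, which BLAST does for every word in the query.

```python
# Word scores against RGD, copied from the slide (BLOSUM62).
RGD_WORD_SCORES = {
    "RGD": 17, "KGD": 14, "QGD": 13, "RGE": 13,
    "EGD": 12, "HGD": 12, "NGD": 12, "RGN": 12,
    "AGD": 11, "MGD": 11, "RAD": 11, "RGQ": 11, "RGS": 11,
    "RND": 11, "RSD": 11, "SGD": 11, "TGD": 11,
}


def neighborhood(word_scores, threshold):
    """Words scoring at least `threshold`, best first."""
    kept = [w for w, s in word_scores.items() if s >= threshold]
    return sorted(kept, key=lambda w: (-word_scores[w], w))


for t in (13, 12, 11):
    print(t, len(neighborhood(RGD_WORD_SCORES, t)))
```

Lowering T admits more neighborhood words, so more word hits are found (higher sensitivity) at the price of more seeds to examine (lower speed).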
Word length
2-hit seeding
Alignments tend to have multiple word hits.
Isolated word hits are frequently false leads.
Most alignments have large ungapped regions.
Requiring 2 word hits on the same diagonal (within 40 aa of each other, for example) greatly increases speed at a slight cost in sensitivity.
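The two-hit idea can be sketched in Python. This is a simplified illustration, not BLAST's implementation: it uses exact word matches instead of neighborhood words, and the names `word_hits`, `two_hit_seeds`, `w` and `window` are assumptions of this sketch. A seed is reported when two non-overlapping hits fall on the same diagonal within `window` positions of each other.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """All (i, j) with query[i:i+w] == subject[j:j+w], ordered by i."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def two_hit_seeds(query, subject, w=3, window=40):
    """Return (diagonal, first_hit_i, second_hit_i) for each two-hit seed."""
    last = {}   # diagonal -> query position of the last accepted hit
    seeds = []
    for i, j in word_hits(query, subject, w):
        d = i - j                      # hits with equal i - j share a diagonal
        if d in last:
            dist = i - last[d]
            if dist < w:
                continue               # overlaps the previous hit: ignore
            if dist <= window:
                seeds.append((d, last[d], i))
        last[d] = i
    return seeds


print(two_hit_seeds("MKTGGAYI", "MKTCCAYI"))
print(two_hit_seeds("MKTAYI", "QQMKTQQ"))
```

The second call finds a single isolated word hit and therefore reports no seed, which is exactly the kind of false lead the two-hit requirement discards.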
Extension of the seed alignments
Alignments are extended from seeds in each direction.
Extension is terminated when the score drops more than X below the maximum score seen so far.
Text example (match +1, mismatch -1, no gaps):
The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.
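The text example can be run as an ungapped extension with an X-drop rule. This is a minimal sketch with illustrative names (`extend_right`, `x_drop`): it extends in one direction only, from the first character, and it assumes that extension stops once the running score has fallen more than X below the best score seen so far.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped extension from position 0; returns (best_score, best_length)."""
    score = best = best_len = 0
    for k, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best:
            best, best_len = score, k     # new maximum: remember where it ends
        elif best - score > x_drop:
            break                         # dropped too far below the maximum
    return best, best_len


a = "The quick brown fox jumps over the lazy dog."
b = "The quiet brown cat purrs when she sees him."
for x in (1, 5):
    best, length = extend_right(a, b, x)
    print(x, best, repr(a[:length]))
```

With a small X the extension gives up at the first pair of mismatches ("quick" vs. "quiet"); a larger X lets it cross them and reach the matching "brown ".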
BLAST statistics
>gi| |ref|NP_ | (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis]
Length = 253
Score = 38.9 bits (89), Expect = 3e-05
Identities = 17/40 (42%), Positives = 26/40 (64%)
Frame = -1
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49
How significant is this similarity?
Scoring the alignment
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI
S (score)
The Karlin-Altschul equation
E = K m n e^(-λS)
E: expected number of alignments
K: a minor constant
m: length of query
n: length of database
m n: search space
S: raw score
λ: scaling factor
λS: normalized score
The “Expect” or “E-value”: E
The “P-value”: P = 1 - e^(-E)
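The quantities named on this slide can be computed directly. The sketch below uses the standard Karlin-Altschul relations E = K m n e^(-λS), bit score S' = (λS - ln K) / ln 2 and P = 1 - e^(-E). The values λ = 0.267 and K = 0.041 are an assumption (typical published parameters for gapped BLOSUM62 searches), not something stated in these notes, and the function names are illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized (bit) score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)


# Assumed parameters (gapped BLOSUM62); raw score 89 is from the BLAST report.
LAM, K = 0.267, 0.041
print(round(bit_score(89, LAM, K), 1))
print(p_value(3e-05))
```

With these assumed parameters a raw score of 89 comes out at about 38.9 bits, in line with the report shown earlier; for small E the P-value is almost identical to the E-value.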
The sum-statistics Sum statistics increase the significance (decrease the E-value) for groups of consistent alignments.
The sum-statistics The sum score is not reported by BLAST!