Sequence alignment
BI420 – Introduction to Bioinformatics, Fall 2012
Department of Biology, Boston College
Biologically significant alignment
1. Find two evolutionarily related sequences (subunits of human hemoglobin) in GenBank: http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
   hba_human
   hbb_human
2. Save the sequences on the Desktop and rename them: hba_human.fasta & hbb_human.fasta
Biologically significant alignment
3. Visit a web-based pair-wise alignment program: http://artedi.ebc.uu.se/programs/pairwise.html
4. Upload our two proteins:
Biologically significant alignment 5. Create a pair-wise alignment between the two protein sequences:
Biologically plausible alignment
Retrieve another sequence, leghemoglobin.
Create a pair-wise alignment with human hemoglobin A:
Biologically plausible alignment http://en.wikipedia.org/wiki/Leghemoglobin
Spurious alignment
Retrieve the sequence of a human BRCA1 gene variant, clearly not related to hemoglobin: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&cmd=search&term=NP_009225.1
Make the pair-wise alignment:
Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison
How Alignment Works
Alignment types
How do we align the words CRANE and FRAME?
    CRANE
     || |
    FRAME
3 matches, 2 mismatches
How do we align words that differ in length?
    COELACANTH
      || |||
    P-ELICAN--
or, equivalently,
    COELACANTH
      || |||
    -PELICAN--
5 matches, 2 mismatches, 3 gaps
In this case, if we assign +1 point for each match and -1 for each mismatch or gap, we get 5 x 1 + 2 x (-1) + 3 x (-1) = 0. This is the alignment score.
Examples from: BLAST. Korf, Yandell, Bedell
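The scoring rule above can be checked mechanically. The following is a minimal sketch, assuming the +1 / -1 / -1 scheme from this slide; the function name `alignment_score` is made up for illustration and is not part of any library.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score two already-aligned strings column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # one sequence has no letter in this column
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# 3 matches and 2 mismatches
print(alignment_score("CRANE", "FRAME"))
# 5 matches, 2 mismatches and 3 gaps
print(alignment_score("COELACANTH", "P-ELICAN--"))
```

The function only scores an alignment that is already given; finding the best alignment is a separate problem.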
Finding the “best” alignment
    COELACANTH
       | |||
    PE-LICAN--
    S = -2

    COELACANTH
      ||
    P-EL-ICAN-
    S = -6

    COELACANTH
    PELICAN---
    S = -10

    COELACANTH
      || |||
    P-ELICAN--
    S = 0
How do we align JACKALOPE and ANTELOPE?
More mismatches:
    JACKALOPE
    -ANTELOPE
More gaps:
    JACKA---LOPE
    ----ANTELOPE
The choice depends on the score function.
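To make the dependence on the score function concrete, the sketch below scores both JACKALOPE/ANTELOPE alignments under two schemes. Both schemes and the helper name `score_alignment` are illustrative assumptions, not values from the course.

```python
def score_alignment(top, bottom, match, mismatch, gap):
    """Sum per-column scores of two aligned strings ('-' marks a gap)."""
    total = 0
    for x, y in zip(top, bottom):
        if "-" in (x, y):
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


few_gaps = ("JACKALOPE", "-ANTELOPE")         # 5 matches, 3 mismatches, 1 gap
many_gaps = ("JACKA---LOPE", "----ANTELOPE")  # 5 matches, 0 mismatches, 7 gaps

# Hypothetical schemes: same gap score, different mismatch score.
schemes = {
    "mild mismatch": dict(match=1, mismatch=-1, gap=-1),
    "harsh mismatch": dict(match=1, mismatch=-3, gap=-1),
}

for name, params in schemes.items():
    a = score_alignment(*few_gaps, **params)
    b = score_alignment(*many_gaps, **params)
    winner = "fewer gaps" if a > b else "more gaps"
    print(f"{name}: few-gap score {a}, many-gap score {b}, prefers {winner}")
```

With the mild scheme the alignment with mismatches wins; once mismatches cost three times as much as gaps, the gapped alignment wins.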
Global vs. local alignment
Aligning words: SHAKE and SPEARE
1. Global alignment: aligning the two sequences along their entire length (even if it means adding many “gaps”):
    SH-AKE
    |  | |
    SPEARE
  -OR-
    SHAKE---
    |   |
    SP--EARE
2. Local alignment: aligning only a well-matching section between the two sequences (possibly leaving the ends unaligned):
     SHAKE
       | |
    SPEARE
Example from: Higgs and Attwood
MATLAB example – global alignment
MATLAB bioinformatics toolbox sequence analysis demo: Aligning pairs of sequences
>> s1 = 'ACGATT'
>> s2 = 'CCGACTA'
>> [score, ga] = nwalign(s1,s2)
score = 7.3333
ga =
ACGA-TT
 ||| |:
CCGACTA
MATLAB example – local alignment
MATLAB bioinformatics toolbox sequence analysis demo: Aligning pairs of sequences
>> s1 = 'ACGATT'
>> s2 = 'CCGACTA'
>> [score, sa] = swalign(s1,s2)
score = 10
sa =
CGATT
||| |
CGACT
Score function
Pair-wise amino-acid scores S(ai,bj) (PAM250 scoring scheme) plus gap score g = -6.
Example from: Higgs and Attwood
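Global alignment – Needleman-Wunsch
Exact recursion scheme to calculate scores from already known scores:
$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\ H(i-1,j) + g & \text{(vertical)} \\ H(i,j-1) + g & \text{(horizontal)} \end{cases}$$
Here g is the gap score, a negative number, so adding g penalizes each gap.
Example from: Higgs and Attwood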
Global alignment – Needleman-Wunsch Example: Align the two sequences SHAKE and SPEARE Example from: Higgs and Attwood
Global alignment – Needleman-Wunsch Initialization (filling the top row and left column from gap scores): Example from: Higgs and Attwood
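The initialization step can be sketched in a few lines, assuming the gap score g = -6 from the score-function slide. Putting SHAKE on the rows and SPEARE on the columns is an assumption about the lost figure, and `init_global` is a made-up helper name.

```python
def init_global(n_rows, n_cols, gap=-6):
    """Build an (n_rows+1) x (n_cols+1) score matrix for global alignment.

    Cell (i, 0) means the first i letters of the row sequence are aligned
    against gaps only, so it costs i * gap; likewise (0, j) costs j * gap.
    """
    H = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(1, n_rows + 1):
        H[i][0] = i * gap
    for j in range(1, n_cols + 1):
        H[0][j] = j * gap
    return H


H = init_global(len("SHAKE"), len("SPEARE"))
print(H[0])                    # top row
print([row[0] for row in H])   # left column
```

All interior cells stay at 0 here only as placeholders; they are overwritten when the matrix is filled.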
Global alignment – Needleman-Wunsch Filling cell (1,1): Example from: Higgs and Attwood
Global alignment – Needleman-Wunsch Filling the rest of the cells (i,j): Example from: Higgs and Attwood
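A minimal sketch of the fill step follows. To stay self-contained it replaces the PAM250 table with unit scores (+1 match, -1 mismatch, -1 gap), which is an assumption, so the numbers differ from the matrix on the slide. `nw_fill` is an illustrative name.

```python
def nw_fill(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch matrix; H[len(a)][len(b)] is the global score."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # left column: leading gaps
        H[i][0] = i * gap
    for j in range(1, m + 1):          # top row: leading gaps
        H[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + s,   # diagonal: align a[i-1] with b[j-1]
                H[i - 1][j] + gap,     # vertical: a[i-1] against a gap
                H[i][j - 1] + gap,     # horizontal: b[j-1] against a gap
            )
    return H


H = nw_fill("SHAKE", "SPEARE")
for row in H:
    print(" ".join(f"{v:3d}" for v in row))
print("global score:", H[-1][-1])
```

Each cell depends only on its three already-computed neighbors, which is why the matrix can be filled row by row.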
Global alignment – Needleman-Wunsch Tracing back to read out the alignment: Best global alignment: S-HAKE SPEARE Example from: Higgs and Attwood
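The trace-back can be sketched as below, again assuming unit scores (+1 / -1 / -1) rather than PAM250. Ties are broken in the order diagonal, vertical, horizontal, which is an arbitrary choice. With other scores or another tie rule a different, equally good alignment may come out.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)

    # Walk back from the bottom-right corner, at each cell taking a move
    # that could have produced the stored score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:      # diagonal
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and H[i][j] == H[i - 1][j] + gap:  # vertical
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:                                       # horizontal
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("SHAKE", "SPEARE")
print(score)
print(top)
print(bottom)
```

The letters are collected in reverse because the walk starts at the end of both sequences.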
Global alignment – Needleman-Wunsch
The Needleman-Wunsch procedure is exact. The recursion implicitly considers every possible alignment without listing them one by one, so it is guaranteed to find the best global alignment under the chosen score function.
Example from: Higgs and Attwood
Local alignment – Smith-Waterman
The Smith-Waterman algorithm finds the optimal LOCAL alignment. It works similarly to the Needleman-Wunsch GLOBAL alignment algorithm.
Changes to the recursion scheme:
1. if the best score for a cell is negative, we replace it by 0 (start over)
2. gaps at the boundary are ignored: they get a score of 0
$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(a_i,b_j) & \text{(diagonal)} \\ H(i-1,j) + g & \text{(vertical)} \\ H(i,j-1) + g & \text{(horizontal)} \\ 0 & \text{(start over)} \end{cases}$$
Here g is the gap score, a negative number.
Example from: Higgs and Attwood
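The two changes to the recursion can be sketched as follows. Unit scores (+1 / -1 / -1) are assumed instead of PAM250, and `sw_fill` is an illustrative name. The test strings are the ones from the MATLAB example, but the score differs from `swalign` because the scoring scheme differs.

```python
def sw_fill(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix; return (H, best_score, best_cell)."""
    n, m = len(a), len(b)
    # Change 2: the top row and left column stay 0, so leading gaps are free.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                     # change 1: never go below 0 (start over)
                H[i - 1][j - 1] + s,   # diagonal
                H[i - 1][j] + gap,     # vertical
                H[i][j - 1] + gap,     # horizontal
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return H, best, best_cell


H, best, cell = sw_fill("ACGATT", "CCGACTA")
print("best local score:", best, "at cell", cell)
```

Unlike the global case, the answer is the largest entry anywhere in the matrix, not the bottom-right corner.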
Local alignment – Smith-Waterman Initialization Example from: Higgs and Attwood
Local alignment – Smith-Waterman Filling the cells Example from: Higgs and Attwood
Local alignment – Smith-Waterman
Trace-back: find the path that contains the highest score.
Best local alignment:
SHAKE
SPEARE
Example from: Higgs and Attwood
Example: Align the two sequences TTCAC and CTCAA using scores of +1 for a match and -1 for either a gap or a mismatch.
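One way to check a hand-worked answer to this exercise is the sketch below. It uses the +1 / -1 / -1 scores stated in the exercise; the function name `smith_waterman` and the tie-breaking order (diagonal first) are assumptions made for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, (bi, bj) = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, (bi, bj) = H[i][j], (i, j)

    # Start at the highest-scoring cell and walk back until a 0 is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:      # diagonal
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:      # vertical
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:                                   # horizontal
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("TTCAC", "CTCAA")
print(score)
print(top)
print(bottom)
```

Only the aligned section is returned; the unaligned ends of both sequences are dropped.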
Local alignment – Smith-Waterman
The Smith-Waterman procedure is also exact. The recursion implicitly considers every possible local alignment without listing them one by one, so it is guaranteed to find the best local alignment under the chosen score function.
Example from: Higgs and Attwood
Example of a scoring matrix for Amino Acids The scoring matrix describes the scores for amino acid matches/mismatches. Scores are affected by biochemical similarity of amino acids. Note: this is not an alignment matrix!
Similar algorithms can be used for multiple alignment
Figure: the multiple alignment of 24 hexokinase protein sequences from various species.
However, real multiple alignment programs (e.g. ClustalW) are usually heuristic rather than exact.
Applications of Alignment
Alignment is used for mapping sequence reads to the genome
Alignment is used in similarity search
Alignment: determining how sequences have descended from a common ancestor.
Similarity search: determining which sequences are related to one another. Requires scoring of each alignment.
(Figure labels: query, database)
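A toy version of similarity search is sketched below: score the query against every database entry with a local alignment score and rank the hits. The database sequences and their names are invented for illustration, and unit scores (+1 / -1 / -1) are assumed. Real search tools use heuristics and statistics that are not shown here.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman score, keeping only two matrix rows in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def search(query, database):
    """Return (name, score) pairs sorted from best to worst hit."""
    hits = [(name, local_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical database, made up for this example.
database = {
    "seq_a": "TTTACGATTGG",   # contains the query exactly
    "seq_b": "GGGGCCCC",      # shares almost nothing with the query
    "seq_c": "ACGTTT",        # similar to the query, with one difference
}

for name, score in search("ACGATT", database):
    print(name, score)
```

Ranking by raw score is the simplest possible choice; it favors longer matches and says nothing about whether a hit is statistically significant.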
Alignment Exercises
Visualizing pair-wise alignments Visit a web server running a dot-plotter: http://bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html Upload hba_human and hbb_human, and create dot-plot:
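The idea behind a dot-plot can be sketched in plain text: put one sequence on the rows, the other on the columns, and mark every cell where the two letters are identical. This is a simplified illustration with a made-up function name `dot_plot`, not the method of the web server above; dot-plot programs typically compare windows of letters rather than single letters.

```python
def dot_plot(a, b):
    """Return one string per letter of a, with '*' where it equals a letter of b."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


rows = dot_plot("CRANE", "FRAME")
print("  FRAME")
for letter, row in zip("CRANE", rows):
    print(letter, row)
```

Runs of marks along a diagonal correspond to stretches that align without gaps; a shift from one diagonal to another suggests a gap.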
MATLAB example MATLAB bioinformatics toolbox sequence analysis demo: Aligning pairs of sequences