1
Sequence alignment
Gabor T. Marth
Department of Biology, Boston College
marth@bc.edu
BI420 – Introduction to Bioinformatics
2
Biologically significant alignment http://artedi.ebc.uu.se/programs/pairwise.html hba_human hbb_human
3
Biologically plausible alignment
4
Spurious alignment (BRCA1 variant) Examples from: Biological sequence analysis. Durbin, Eddy, Krogh, Mitchison
5
Alignment types
Examples from: BLAST. Korf, Yandell, Bedell

How do we align the words CRANE and FRAME?

CRANE
 || |
FRAME

3 matches, 2 mismatches

How do we align words that differ in length?

COELACANTH
  || |||
P-ELICAN--

COELACANTH
  || |||
-PELICAN--

5 matches, 2 mismatches, 3 gaps

In this case, if we assign +1 point for each match and -1 for each mismatch or gap, we get 5 x 1 + 2 x (-1) + 3 x (-1) = 0. This is the alignment score.
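The scoring rule above can be checked with a few lines of Python. This is a minimal sketch: the function name `score_alignment` and its parameter names are illustrative, and it assumes both strings are already padded with `-` to the same length.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score an existing alignment column by column.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a letter aligned to a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


print(score_alignment("CRANE", "FRAME"))            # 3 matches, 2 mismatches
print(score_alignment("COELACANTH", "P-ELICAN--"))  # 5 matches, 2 mismatches, 3 gaps
```

Passing different `match`, `mismatch`, and `gap` values shows how the preferred alignment changes with the scoring scheme.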
6
Finding the “best” alignment

COELACANTH
  || |||
P-ELICAN--
S=0

COELACANTH
   | |||
PE-LICAN--
S=-2

COELACANTH
  ||
P-EL-ICAN-
S=-6

COELACANTH
PELICAN---
S=-10
7
Global alignment – Needleman-Wunsch Example from: Higgs and Attwood Aligning words: SHAKE and SPEARE
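A minimal sketch of the Needleman-Wunsch dynamic programming recurrence follows. It assumes the simple +1 / -1 / -1 match, mismatch, and gap scores used earlier in these slides, which may differ from the scoring used in the Higgs and Attwood example. The function name is illustrative, and only the optimal score is computed (no traceback).

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best global alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = dp[i - 1][j] + gap      # letter of a against a gap
            left = dp[i][j - 1] + gap    # letter of b against a gap
            dp[i][j] = max(diagonal, up, left)
    return dp[-1][-1]


print(needleman_wunsch_score("SHAKE", "SPEARE"))
print(needleman_wunsch_score("COELACANTH", "PELICAN"))
```

Because the whole of both sequences must be aligned, end gaps are penalized just like internal gaps in this sketch.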
8
Local alignment – Smith-Waterman Example from: Higgs and Attwood
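Smith-Waterman differs from the global recurrence in two ways: no cell may drop below zero, and the answer is the highest-scoring cell anywhere in the matrix. The sketch below assumes the same illustrative +1 / -1 / -1 scores as before (not necessarily those of the Higgs and Attwood example) and returns only the score.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere instead of carrying a negative score.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("COELACANTH", "PELICAN"))
```

Only two rows of the matrix are kept in memory here, which is enough when the alignment itself is not reconstructed.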
9
Visualizing pair-wise alignments
10
Sequence similarity and scoring
Match-mismatch-gap penalties, e.g.:
Match = 1
Mismatch = -5
Gap = -10
Scoring matrices
11
Multiple alignments clustalW
12
Anchored multiple alignment
13
Similarity searching vs. alignment
Alignment
Similarity search: query, database
14
The BLAST algorithms

Program | Database | Query | Typical Uses
BLASTN | Nucleotide | Nucleotide | Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts.
BLASTP | Protein | Protein | Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis.
BLASTX | Protein | Nucleotide | Finding protein-coding genes in genomic DNA.
TBLASTN | Nucleotide | Protein | Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA.
TBLASTX | Nucleotide | Nucleotide | Cross-species gene prediction. Searching for genes missed by traditional methods.
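The program table can be read as a lookup from query and database type to a program name. The helper below is a hypothetical illustration of that lookup, not part of any BLAST distribution. The `translate_both` flag is an assumed name used to separate TBLASTX from BLASTN, which share the same nucleotide input types.

```python
def choose_blast_program(query_type: str, database_type: str,
                         translate_both: bool = False) -> str:
    """Pick a BLAST program name from the sequence types involved.

    query_type and database_type are "nucleotide" or "protein".
    translate_both only matters for nucleotide vs. nucleotide searches.
    """
    key = (query_type, database_type)
    if key == ("protein", "protein"):
        return "BLASTP"
    if key == ("nucleotide", "protein"):
        return "BLASTX"    # nucleotide query searched against a protein database
    if key == ("protein", "nucleotide"):
        return "TBLASTN"   # protein query searched against a nucleotide database
    if key == ("nucleotide", "nucleotide"):
        return "TBLASTX" if translate_both else "BLASTN"
    raise ValueError(f"unknown sequence types: {key}")


print(choose_blast_program("protein", "nucleotide"))
```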
15
BLAST report
16
http://www.ncbi.nih.gov/BLAST/ gi|7428631
17
The BLAST algorithm
Sequence alignment takes place in a 2-dimensional space where diagonal lines represent regions of similarity. Gaps in an alignment appear as broken diagonals. The search space is sometimes considered as 2 sequences and sometimes as query x database.
Global alignment vs. local alignment – BLAST is local
Maximum scoring pair (MSP) vs. high-scoring pair (HSP) – BLAST finds HSPs (usually the MSP too)
Gapped vs. ungapped – BLAST can do both
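The 2-dimensional view can be made concrete with a text dot plot: one sequence runs down the rows, the other across the columns, and a mark is placed wherever the letters are identical. Runs of marks along a diagonal are regions of similarity. This is a minimal sketch with an illustrative function name, using exact letter identity only.

```python
def dot_plot(a: str, b: str) -> str:
    """Return a text dot plot: rows follow a, columns follow b, '*' marks identical letters."""
    lines = ["  " + b]  # header row with the column sequence
    for letter in a:
        marks = "".join("*" if letter == other else "." for other in b)
        lines.append(letter + " " + marks)
    return "\n".join(lines)


print(dot_plot("PELICAN", "COELACANTH"))
```

In the printed grid, the shared letters E-L and C-A-N line up along one diagonal, interrupted where I faces A.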
18
The BLAST algorithm
BLOSUM62 neighborhood of RGD, T=12 (words scoring at least T are kept):
RGD 17, KGD 14, QGD 13, RGE 13, EGD 12, HGD 12, NGD 12, RGN 12, AGD 11, MGD 11, RAD 11, RGQ 11, RGS 11, RND 11, RSD 11, SGD 11, TGD 11
Speed gained by minimizing search space
Alignments require word hits
Neighborhood words
W and T modulate speed and sensitivity
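The neighborhood of a word can be enumerated by scoring every candidate word of the same length against it and keeping those at or above T. The sketch below embeds only the three BLOSUM62 rows needed for RGD. These rows were transcribed by hand for this illustration and should be checked against an authoritative copy of the matrix before any real use. The function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Hand-transcribed BLOSUM62 rows for R, G and D (assumption: verify before reuse).
BLOSUM62_ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}


def neighborhood(word: str, threshold: int) -> list[tuple[str, int]]:
    """List (candidate, score) pairs scoring >= threshold against word, best first."""
    hits = []
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
                    for q, c in zip(word, letters))
        if score >= threshold:
            hits.append(("".join(letters), score))
    return sorted(hits, key=lambda item: (-item[1], item[0]))


for candidate, score in neighborhood("RGD", 12):
    print(candidate, score)
```

Lowering the threshold from 12 to 11 roughly doubles the number of words, which illustrates how T trades speed for sensitivity.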
19
Word length
20
2-hit seeding
Alignments tend to have multiple word hits. Isolated word hits are frequently false leads. Most alignments have large ungapped regions. Requiring 2 word hits on the same diagonal (within 40 aa, for example) greatly increases speed at a slight cost in sensitivity.
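A simplified sketch of two-hit seeding is shown below. It makes several assumptions that are not in the slides: word hits are exact matches (real BLAST protein searches also use neighborhood words), the two hits must not overlap, and the function and parameter names are illustrative. A diagonal is identified by the difference between subject and query positions.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def two_hit_seeds(query: str, subject: str, w: int = 3,
                  window: int = 40) -> list[tuple[int, int, int]]:
    """Return (diagonal, first_query_pos, second_query_pos) for qualifying hit pairs."""
    by_diagonal = defaultdict(list)
    for i, j in word_hits(query, subject, w):
        by_diagonal[j - i].append(i)
    seeds = []
    for diagonal, starts in sorted(by_diagonal.items()):
        last = None
        for start in sorted(starts):
            if last is not None and start - last < w:
                continue  # overlaps the remembered hit, so it is ignored
            if last is not None and start - last <= window:
                seeds.append((diagonal, last, start))
            last = start
    return seeds


print(two_hit_seeds("ABCDEFGHIJ", "ABCXXXXHIJ"))
```

An isolated hit never produces a seed here, which is the source of both the speed gain and the small loss in sensitivity.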
21
Extension of the seed alignments
Alignments are extended from seeds in each direction. Extension is terminated when the score drops by more than X from the maximum score reached so far.

Text example (match +1, mismatch -1, no gaps):
The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.
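The text example can be run through a small ungapped extension routine. This sketch extends in one direction only, from the start of both sentences, and stops once the running score has fallen more than `x_drop` below the best score seen. The function name, the `x_drop` value of 5, and the exact stopping comparison are assumptions made for illustration.

```python
def extend_right(a: str, b: str, start_a: int = 0, start_b: int = 0,
                 x_drop: int = 5, match: int = 1, mismatch: int = -1) -> tuple[int, int]:
    """Ungapped rightward extension; returns (best_score, length_at_best_score)."""
    score = best = 0
    best_length = 0
    i, j, length = start_a, start_b, 0
    while i < len(a) and j < len(b):
        score += match if a[i] == b[j] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length   # new maximum: remember where it ends
        elif best - score > x_drop:
            break                               # fell too far below the maximum
        i += 1
        j += 1
    return best, best_length


first = "The quick brown fox jumps over the lazy dog."
second = "The quiet brown cat purrs when she sees him."
best, length = extend_right(first, second)
print(best, repr(first[:length]))
```

The reported alignment is trimmed back to the position of the maximum score, so the letters examined after that point do not appear in the result.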
22
BLAST statistics

>gi|23098447|ref|NP_691913.1| (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis]
Length = 253
Score = 38.9 bits (89), Expect = 3e-05
Identities = 17/40 (42%), Positives = 26/40 (64%)
Frame = -1

Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

How significant is this similarity?
23
Scoring the alignment

Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI 4027
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

Column scores (4, 4, ...) add up to S (score).
24
The Karlin-Altschul equation

$E = k m n e^{-\lambda S}$

$E$: expected number of alignments (the “Expect” or “E-value”)
$k$: a minor constant
$m$: length of query
$n$: length of database
$mn$: search space
$S$: raw score
$\lambda$: scaling factor
$\lambda S$: normalized score

The “P-value”: $P = 1 - e^{-E}$
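The quantities named on this slide combine as E = k·m·n·e^(-λS), with the bit score defined as (λS - ln k) / ln 2 and the P-value as 1 - e^(-E). The sketch below implements these relations. The values λ = 0.267 and k = 0.041 are assumptions (parameters commonly quoted for gapped BLOSUM62 searches with gap costs 11/1), and the query and database lengths in the usage lines are made up for illustration. None of these numbers come from the slides.

```python
import math


def e_value(raw_score: float, m: int, n: int, k: float, lam: float) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score: float, k: float, lam: float) -> float:
    """Normalized score in bits, independent of the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """Same E-value, computed from the bit score and the search space m * n."""
    return m * n * 2.0 ** (-bits)


def p_value(e: float) -> float:
    """Probability of at least one chance alignment, given its expected count."""
    return -math.expm1(-e)   # equals 1 - exp(-e), accurate for small e


# Assumed parameters (not from the slides); m and n are hypothetical lengths.
LAMBDA, K = 0.267, 0.041
print(round(bit_score(89, K, LAMBDA), 1))
print(e_value(89, m=500, n=1_000_000, k=K, lam=LAMBDA))
```

With these assumed parameters, a raw score of 89 corresponds to about 38.9 bits, in line with the "38.9 bits (89)" shown in the BLAST report earlier in the deck.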
25
The sum-statistics
Sum statistics increase the significance (decrease the E-value) for groups of consistent alignments.
26
The sum-statistics The sum score is not reported by BLAST!