Sequence alignment, Part 2

BI420 – Introduction to Bioinformatics, Spring 2012
Department of Biology, Boston College

Similar algorithms can be used for multiple alignment
Figure: the multiple alignment of 24 hexokinase protein sequences from various species.
However, real multiple alignment programs (e.g. ClustalW) are usually heuristic rather than exact.

Applications of Alignment

Alignment is used for mapping sequence reads to the genome

Alignment is used in similarity search
Alignment: determining how sequences have descended from a common ancestor.
Similarity search: determining which sequences are related to one another. Requires scoring of each alignment.
(Figure: a query compared against a database.)

Alignment Exercises

Visualizing pair-wise alignments
Visit a web server running a dot-plotter:
Upload hba_human and hbb_human, and create a dot-plot:
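The idea behind a dot plot can also be sketched in a few lines of code. This is only a minimal illustration under stated assumptions, not the method of any particular web server: it marks every position pair where a window of k residues matches exactly. The names dot_plot and render, the window size, and the two toy sequences are hypothetical; they are not hba_human or hbb_human.

```python
def dot_plot(seq1, seq2, k=3):
    """Return the set of (i, j) pairs where seq1[i:i+k] equals seq2[j:j+k]."""
    dots = set()
    for i in range(len(seq1) - k + 1):
        for j in range(len(seq2) - k + 1):
            if seq1[i:i + k] == seq2[j:j + k]:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: one row per seq1 position, '*' marks a dot."""
    rows = []
    for i in range(len(seq1)):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(len(seq2))))
    return "\n".join(rows)


if __name__ == "__main__":
    a = "HEAGAWGHEE"  # toy sequence (assumption, for illustration only)
    b = "PAWHEAE"     # toy sequence (assumption, for illustration only)
    print(render(a, b, dot_plot(a, b, k=2)))
```

Runs of dots along a diagonal correspond to ungapped regions of similarity; a larger k removes more of the random background.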

MATLAB example
MATLAB Bioinformatics Toolbox sequence analysis demo: Aligning pairs of sequences

BLAST: Basic Local Alignment Search Tool

Purpose of BLAST
Exact alignment algorithms, such as Needleman-Wunsch (global) and Smith-Waterman (local), are slow: their running time is O(length1 * length2).
Alignment speed can be increased by using statistical properties of sequences to estimate alignment quality. This requires pre-processing of the sequence database.
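To make the O(length1 * length2) cost concrete, here is a minimal sketch of the Smith-Waterman local alignment score. The scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration; real tools use substitution matrices and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, using a simple linear gap cost.

    Every cell of a len(a) x len(b) table is visited once, which is
    where the O(length1 * length2) running time comes from.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Comparing one query against every sequence in a large database this way repeats the full table for each database entry, which is the cost that BLAST's heuristics avoid.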

The BLAST algorithms
Program (database, query): typical uses
BLASTN (nucleotide database, nucleotide query): Mapping oligonucleotides, amplimers, ESTs, and repeats to a genome. Identifying related transcripts.
BLASTP (protein database, protein query): Identifying common regions between proteins. Collecting related proteins for phylogenetic analysis.
BLASTX (protein database, translated nucleotide query): Finding protein-coding genes in genomic DNA.
TBLASTN (translated nucleotide database, protein query): Identifying transcripts similar to a known protein (finding proteins not yet in GenBank). Mapping a protein to genomic DNA.
TBLASTX (translated nucleotide database, translated nucleotide query): Cross-species gene prediction. Searching for genes missed by traditional methods.

BLAST report
BLAST example with hba_Human

The BLAST algorithm
Sequence alignment takes place in a 2D space where diagonal lines represent regions of similarity. Gaps break up the diagonals.
The search space can be considered as seq1 vs seq2, or as seq1 vs a database of sequences.
Global alignment vs. local alignment: BLAST is local.
Maximum scoring pair (MSP) vs. high-scoring pair (HSP): BLAST finds HSPs (usually the MSP too).
Gapped vs. ungapped: BLAST can do both.

The BLAST algorithm
Alignments require word (segment pair) hits.
The database is preprocessed for the word content of each sequence. This speeds up later calculations.
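The preprocessing step can be pictured as building a lookup table from words to positions. The sketch below is an assumption-laden illustration, not BLAST's actual data structure: it indexes the database, as the slide describes, while implementations differ in whether the table is built from the database or from the query. The names build_word_index and word_hits and the toy database are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, w=3):
    """Map each length-w word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def word_hits(query, index, w=3):
    """Yield (query offset, sequence id, database offset) for each exact word match."""
    for qpos in range(len(query) - w + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + w], []):
            yield qpos, seq_id, dpos


if __name__ == "__main__":
    db = {"s1": "MKRGDS", "s2": "RGDRGD"}  # toy database (assumption)
    print(list(word_hits("ARGD", build_word_index(db))))
```

Once the index exists, each query word is resolved with a single dictionary lookup instead of a scan over every database sequence.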

The BLAST algorithm
For a given word, assign a score to neighborhood words based on the scoring matrix. W (word length) and T (threshold for a word match) modulate speed and sensitivity.
BLOSUM62 neighborhood of RGD (word, score):
RGD 17
KGD 14
QGD 13
RGE 13
EGD 12
HGD 12
NGD 12
RGN 12
AGD 11
MGD 11
RAD 11
RGQ 11
RGS 11
RND 11
RSD 11
SGD 11
TGD 11
T=12 (the threshold marked on the slide: the words from RGD through RGN score at least 12).
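The neighborhood list can be reproduced by scoring every possible three-letter word against RGD and keeping those at or above a threshold. The sketch below hard-codes only the three BLOSUM62 rows needed for this word (R, G and D); the function name and the brute-force enumeration are assumptions for illustration, not how BLAST builds its lookup table.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Three rows of BLOSUM62 (R, G, D), listed in AMINO_ACIDS column order.
# Only these rows are needed to score neighbors of the word RGD.
BLOSUM62_ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}


def neighborhood(word, threshold):
    """Return (neighbor, score) pairs scoring >= threshold, best first."""
    scored = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(
            BLOSUM62_ROWS[w][AMINO_ACIDS.index(c)] for w, c in zip(word, candidate)
        )
        if score >= threshold:
            scored.append(("".join(candidate), score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for neighbor, score in neighborhood("RGD", 11):
        print(neighbor, score)
```

With a threshold of 11 this yields the 17 words listed on the slide; raising the threshold to 12 leaves 8 of them, which shows how T trades sensitivity for speed.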

Word length
As the threshold score for a word match is increased, there are fewer matches. This makes the search more specific, but less sensitive.

2-hit seeding
Alignments often have multiple word hits in clusters. Isolated word hits are frequently false leads.
Most alignments have large ungapped regions.
Requiring 2 word hits on the same diagonal greatly increases speed at a slight cost in sensitivity.
This is similar in concept to paired-end read mapping.
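The two-hit rule can be sketched as a filter over word hits. A hit at query position q and subject position s lies on diagonal s - q, so two hits belong to the same diagonal when that difference is equal. The function name, the tuple layout, and the window of 40 residues are assumptions for illustration; the slides do not give a window size.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_len=3, max_distance=40):
    """Keep pairs of non-overlapping word hits that share a diagonal.

    hits: iterable of (query_pos, subject_pos) word hits.
    Returns (diagonal, first_query_pos, second_query_pos) tuples.
    """
    by_diagonal = defaultdict(list)
    for qpos, spos in hits:
        by_diagonal[spos - qpos].append(qpos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            # Hits must not overlap and must be close enough to each other.
            if word_len <= distance <= max_distance:
                seeds.append((diagonal, first, second))
    return seeds
```

An isolated hit, or two hits on different diagonals, produces no seed and is never extended, which is where the speed gain comes from.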

Extension of the seed alignments
Alignments are extended from seeds in each direction. Extension is terminated when the score drops more than X below the maximum score found so far.
Example:
The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.
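The extension step can be sketched on the two example sentences. This is a minimal ungapped version under stated assumptions: characters score +1 for a match and -1 for a mismatch, extension goes to the right only, and it stops once the running score has fallen more than X below the best score seen. The function name and the scoring values are hypothetical.

```python
def extend_right(query, subject, q_start=0, s_start=0, x_drop=5, match=1, mismatch=-1):
    """Ungapped rightward extension with an X-drop stopping rule.

    Returns (best_score, length), where length is the number of aligned
    characters that gives the best score.
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # fallen too far below the best score: stop extending
    return best, best_len


if __name__ == "__main__":
    q = "The quick brown fox jumps over the lazy dog."
    s = "The quiet brown cat purrs when she sees him."
    score, length = extend_right(q, s, x_drop=5)
    print(score, repr(q[:length]))
```

A small X stops at the first short run of mismatches, while a larger X lets the extension cross them and reach a better-scoring region further along.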

BLAST statistics
How significant is this similarity?
>gi| |ref|NP_ | (NC_004193) 3-oxoacyl-(acyl carrier protein) reductase [Oceanobacillus iheyensis]
Length =
Score = 38.9 bits (89), Expect = 3e-05
Identities = 17/40 (42%), Positives = 26/40 (64%)
Frame = -1
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49

Scoring the alignment
Query: 4146 VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI
            VTGA  G+G+AI+   A +G  + V D+N  GA+  V++I
Sbjct: 10   VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI 49
S (score): the sum of the scores of the aligned residue pairs (the slide labels individual columns with scores such as 4, -1, 4).
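The raw score of this ungapped alignment can be recomputed by adding up one substitution score per column. The sketch below assumes BLOSUM62 and, to stay short, hard-codes only the residue pairs that occur in this alignment; the names PAIR_SCORES, pair_score and summarize are hypothetical.

```python
QUERY = "VTGAGHGLGRAISLELAKKGCHIAVVDINVSGAEDTVKQI"
SBJCT = "VTGAASGMGKAIATLYASEGAKVIVADLNEEGAQSVVEEI"

# Subset of BLOSUM62: only the pairs that appear in the alignment above.
PAIR_SCORES = {
    "VV": 4, "TT": 5, "GG": 6, "AA": 4, "II": 4, "DD": 6, "NN": 6,
    "GA": 0, "HS": -1, "LM": 2, "RK": 2, "SA": 1, "LT": -1, "EL": -3,
    "LY": -1, "KS": 0, "KE": 1, "CA": 0, "HK": -1, "IV": 3, "AI": -1,
    "VA": 0, "IL": 2, "VE": -2, "SE": 0, "EQ": 2, "DS": 0, "TV": 0,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    key = a + b if a + b in PAIR_SCORES else b + a
    return PAIR_SCORES[key]


def summarize(query, subject):
    """Return (raw score, identities, positives) for an ungapped alignment."""
    scores = [pair_score(q, s) for q, s in zip(query, subject)]
    identities = sum(q == s for q, s in zip(query, subject))
    positives = sum(x > 0 for x in scores)  # columns with a positive score
    return sum(scores), identities, positives


if __name__ == "__main__":
    print(summarize(QUERY, SBJCT))
```

Under these assumptions the totals agree with the report: a raw score of 89, 17 identities and 26 positives over 40 columns.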

Evaluating an alignment
How many alignments of a given score would be expected by chance, i.e. without common evolutionary history?
We expect more chance hits when the search database is larger or when the query sequence is longer. For a higher score threshold, we expect fewer chance hits.

The Karlin-Altschul equation
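$E = K m n e^{-\lambda S}$
$E$: expected number of alignments (the "Expect" or "E-value")
$K$: a minor constant
$m$: length of query
$n$: length of database
$m n$: search space
$\lambda$: scaling factor
$S$: raw score
Normalized score: $S' = (\lambda S - \ln K) / \ln 2$, so that $E = m n 2^{-S'}$
The "P-value": $P = 1 - e^{-E}$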
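The quantities named on this slide can be computed directly from the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S). The sketch below is illustrative only: the values lambda = 0.267 and K = 0.041 are assumptions (values commonly quoted for gapped BLOSUM62 searches), not parameters stated in these slides, and the query and database lengths in the demo are made up.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance alignment, given the E-value."""
    return 1.0 - math.exp(-e)


if __name__ == "__main__":
    LAM, K = 0.267, 0.041           # assumed parameters, see lead-in
    print(round(bit_score(89, LAM, K), 1))
    e = e_value(89, m=1_000, n=1_000_000, lam=LAM, k=K)  # made-up lengths
    print(e, p_value(e))
```

With these assumed parameters a raw score of 89 converts to roughly 38.9 bits. Doubling the search space m * n doubles E, and for small E the P-value is nearly equal to E.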

The sum-statistics
Multiple aligning regions from a single sequence should increase our belief that the sequence is evolutionarily related to our query. Sum statistics merge the significance (decrease the E-value) for groups of consistent alignments.
The sum score is not reported by BLAST!
