Alignment, Part I
Vasileios Hatzivassiloglou
University of Texas at Dallas
Other databases
NCBI BLAST
– Basic Local Alignment Search Tool
– Multiple programs for sequence searching and comparisons
Gene Expression Omnibus (GEO)
– maintained by NCBI
– contains output of gene expression experiments
Links
GenBank
ExPASy
SwissProt
GO
PubMed
MeSH browser
NCBI BLAST
NCBI GEO
Human Protein Atlas
Assignment
Search the above databases for information on a gene/protein of your choice
Briefly report your findings (90 seconds) next Tuesday, September 30
Examples: interleukin-N (e.g., 3), elastase, thrombin, creatine kinase, myosin-N (e.g., 2)
Sequences
Sequences of symbols are central to bioinformatics
– DNA
– RNA
– proteins
Fixed alphabet (size 4 for DNA/RNA, 20 for proteins)
Sequence similarity
Important for many biological problems
Examples
– Similar primary structure in proteins implies similar form and function
– Similar sequences in genes / proteins imply homologues across organisms
– Similar short sequences lead to motif finding
– Similarities between gene regions can be used for phylogenetic classification
How to measure similarity
Given two sequences S and T, we look into ways to derive T from S using elementary operations
– Substitution (change a letter)
– Deletion (remove a letter)
– Insertion (add a letter)
The process is reversible (S→T and T→S)
There are many ways to do this, some obviously more efficient than others
Edit distance
Each elementary operation is assigned a cost
The overall cost is the sum of the costs of the operations taken (linear model)
The edit distance between two strings S and T is the minimum total cost among all possible sequences of operations that transform S into T
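As a rough illustration of this definition, the Python sketch below computes an edit distance under the assumption that every substitution, insertion, and deletion costs 1 (the slides leave the actual costs unspecified). It fills in a standard dynamic-programming table instead of enumerating operation sequences, which is one common way to do it and not necessarily the method these slides go on to use. The name `edit_distance` is made up for this example.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of unit-cost edits that turn s into t (assumed costs)."""
    n, m = len(s), len(t)
    # dist[i][j] = cheapest way to turn the prefix s[:i] into the prefix t[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i letters of s[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j letters of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substituting costs 1 only when the two letters differ.
            substitution = dist[i - 1][j - 1] + (s[i - 1] != t[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m]


print(edit_distance("kitten", "sitting"))  # 3
```

With these symmetric unit costs, swapping the two arguments gives the same value, which matches the remark that the process is reversible.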
Alignment
An equivalent way of measuring edit distance is to align the two sequences
An alignment extends the sequences S and T into S′ and T′ using the same alphabet plus “-” (the space, or gap, character), and matches S′[i] with T′[i]
Definitions
A string is a finite sequence of characters from a finite alphabet Σ
The length of a string S, denoted |S|, is the number of characters it contains (can be 0)
S[i] is the i-th character of S
A subsequence of a string S is the string formed by omitting zero or more characters from S (the order of the remaining characters does not change)
Defining alignment formally
An alignment is a mapping of two strings S and T over alphabet Σ into strings S′ and T′ where
– The alphabet of S′ and T′ is Σ plus “-”
– S is a subsequence of S′. All characters in S′ not in this subsequence must be “-”.
– T is a subsequence of T′. All characters in T′ not in this subsequence must be “-”.
– |S′| = |T′|
– There is no i for which S′[i] = T′[i] = “-”
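The conditions above can be checked mechanically. The following Python sketch is one possible way to do so, assuming that “-” is not itself a letter of Σ; the names `GAP` and `is_valid_alignment` are made up for this example.

```python
GAP = "-"  # assumed not to be a letter of the alphabet


def is_valid_alignment(s: str, t: str, s_aligned: str, t_aligned: str) -> bool:
    """Check the conditions of the formal definition of an alignment."""
    same_length = len(s_aligned) == len(t_aligned)
    # Removing the gaps must give back the original string: S is then a
    # subsequence of S', and every other character of S' is a gap.
    restores_s = s_aligned.replace(GAP, "") == s
    restores_t = t_aligned.replace(GAP, "") == t
    # No position may pair a gap with a gap.
    no_double_gap = all(
        not (a == GAP and b == GAP) for a, b in zip(s_aligned, t_aligned)
    )
    return same_length and restores_s and restores_t and no_double_gap


print(is_valid_alignment("AC", "AGC", "A-C", "AGC"))  # True
print(is_valid_alignment("AC", "AGC", "A-C-", "AGC-"))  # False: gap over gap
```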
Example alignment
Sequences:
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Alignment operations
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
– Perfect matches
– Mismatches
– Insertions & deletions (indels)
Alignments are not unique
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Measuring alignment quality
For each position i in the alignment, calculate the scoring function σ(S′[i], T′[i])
The scoring function depends only on the symbols S′[i] and T′[i], not on the position i
A very simple scoring function might be
– σ(x, x) = +1 for x a letter
– σ(x, y) = -2 for x, y different letters
– σ(x, “-”) = σ(“-”, x) = -1 for an indel
Overall alignment score
Defined as the sum of the applicable values of the scoring function over all positions of the alignment
As with our definition of edit distance, this is a linear model
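The sum can be written out in a few lines of Python. This is a minimal sketch that assumes the very simple scoring function given earlier (+1 for a match, -2 for a mismatch, -1 for an indel); the names `GAP`, `sigma`, and `alignment_score` are made up for this example.

```python
GAP = "-"


def sigma(x: str, y: str) -> int:
    """The simple scoring function from the slides: +1, -2, or -1."""
    if x == GAP or y == GAP:
        return -1  # indel
    return 1 if x == y else -2  # match or mismatch


def alignment_score(s_aligned: str, t_aligned: str) -> int:
    """Sum sigma over all positions of two aligned, equal-length strings."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    return sum(sigma(a, b) for a, b in zip(s_aligned, t_aligned))


# The example alignment: 13 matches, 2 mismatches, 4 indels.
print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 5
```

Because `sigma` looks only at the two symbols, swapping in a different scoring function requires no change to `alignment_score`.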
Scoring functions
Usually based on how similar the two symbols are
Derived from confusion probabilities
In biology, chemically similar amino acids have lower penalties for substitution
In speech recognition, “p”→“b” costs less than “p”→“r”
The cost of indels depends on the application
Comparing alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
4 indels, 13 matches, 2 mismatches
score: 13(+1) + 2(-2) + 4(-1) = 5
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
12 indels, 5 matches, 6 mismatches
score: 5(+1) + 6(-2) + 12(-1) = -19
Optimal alignment
An alignment that maximizes the overall alignment score is called optimal
Often, there is more than one optimal alignment for two strings
– depends on the sophistication of the scoring function
The optimal alignment score can be used as a similarity value
Finding the optimal alignment
Simple algorithm: Construct all possible alignments, score them, and pick the best
How many alignments are there for two strings of length n and m?
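One way to get a feel for this question is to count the alignments directly. The Python sketch below assumes the formal definition of an alignment (no position may pair two gaps), so the last position of any alignment is a letter over a letter, a letter over a gap, or a gap over a letter. The name `count_alignments` is made up for this example, and alignments that differ only in where gaps are placed are counted as distinct.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of a string of length n with one of length m."""
    if n == 0 or m == 0:
        # Only one option: pair every remaining letter with a gap.
        return 1
    return (
        count_alignments(n - 1, m - 1)  # last position: letter over letter
        + count_alignments(n - 1, m)    # last position: letter of S over gap
        + count_alignments(n, m - 1)    # last position: gap over letter of T
    )


for n in (1, 2, 3, 5, 10, 17):
    print(n, count_alignments(n, n))
```

The counts grow very quickly with n, which hints at why constructing and scoring every alignment becomes impractical even for short sequences.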