Approximate Search. 邹权 (Zou Quan), PhD, Assistant Professor
Outline Global alignment Local alignment BLAST
Why compare sequences? Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ. Goal: algorithms for an efficient solution.
Example: a pairwise alignment of two DNA sequences, shown in three blocks (first sequence above, second below; "." marks a gap).
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
AAGGATC TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
Two notions. Similarity: a measure of how similar two sequences are. Alignment: a basic operation for comparing two sequences; a way of placing one sequence above the other to make clear the correspondence between similar characters or substrings of the sequences.
Comparing two sequences: alignments involving global comparisons (entire sequences) or local comparisons (just substrings of the sequences). Both are computed with dynamic programming (DP).
Global comparison: example. Aligning GACGGATTAG and GATCGGAATAG:
GA-CGGATTAG
GATCGGAATAG
Differences: an extra T; a change from A to T; a space, written as a dash.
Global comparison: the basic algorithm. Definitions. Alignment: spaces are inserted into the sequences (anywhere, including the beginning or end) until both have the same size; the sequences are then placed one over the other, creating a correspondence between columns. A column with a space in both sequences is not allowed. Scoring function: a measure of similarity between the elements of a column: a match (identical characters) scores +1, a mismatch (distinct characters) scores -1, a space scores -2. The scoring system rewards matches and penalizes mismatches and spaces.
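Global comparison: the basic algorithm.
GA-CGGATTAG
GATCGGAATAG
Example: 9 matches, 1 mismatch and 1 space give a total score of 9 - 1 - 2 = 6. Similarity sim(s, t): the maximum alignment score over all alignments of s and t. Many alignments may have this score; a best alignment is any alignment whose score equals the similarity.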
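The scoring system can be checked by hand on the example alignment. The sketch below is one possible way to write it; the function names `column_score` and `alignment_score` are illustrative and do not come from the slides.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2  # scoring system stated on the slide


def column_score(a: str, b: str) -> int:
    """Score one column of an alignment; '-' denotes a space."""
    if a == "-" and b == "-":
        raise ValueError("a column may not contain two spaces")
    if a == "-" or b == "-":
        return SPACE
    return MATCH if a == b else MISMATCH


def alignment_score(top: str, bottom: str) -> int:
    """Sum the column scores of two equal-length aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(top, bottom))


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```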
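Basic DP algorithm for comparison of two sequences. The number of alignments between two sequences is exponential, so an efficient algorithm is needed. DP: compute the similarity of prefixes, from shorter to larger. Idea: an (m+1)×(n+1) array a whose entry (i, j) is the similarity between s[1..i] and t[1..j]. The first row and column hold multiples of the space penalty g = -2; every other entry is max(a[i-1, j] + g, a[i-1, j-1] + p(i, j), a[i, j-1] + g), where p(i, j) = +1 if s[i] = t[j] and -1 if s[i] ≠ t[j]. In the DP table, p(i, j) is shown in the upper left corner of each cell.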
[Figure: DP array for s = AAAC and t = AGC]
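A minimal sketch of the basic DP algorithm with traceback, assuming the +1 / -1 / -2 scoring system of the earlier slides. The name `global_align` and its parameters are illustrative; when several optimal alignments exist, this traceback simply prefers the diagonal move.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (sim(s, t), aligned s, aligned t) using the full DP array."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * gap          # s[1..i] against the empty prefix of t
    for j in range(1, n + 1):
        a[0][j] = j * gap          # empty prefix of s against t[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + gap,      # s[i] against a space
                          a[i - 1][j - 1] + p,    # s[i] against t[j]
                          a[i][j - 1] + gap)      # t[j] against a space

    # Traceback from (m, n) to (0, 0), rebuilding one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("GACGGATTAG", "GATCGGAATAG")
print(score)   # 6
print(top)
print(bottom)
```

The array costs O(mn) time and space; for the figure's pair, `global_align("AAAC", "AGC")` gives a similarity of -1 under this scoring.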
Local comparison. Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t. Algorithm: find the highest-scoring local alignment between the two sequences.
Local comparison. Idea: the data structure is an (m+1)×(n+1) array; entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]. Initialization: the first row and column are filled with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero. For the same reason, every other entry is the maximum of the three usual DP candidates and zero.
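The local variant can be sketched as follows, again assuming the +1 / -1 / -2 scoring system. `local_align` is an illustrative name; the traceback starts at the largest entry and stops at the first zero, and ties between equally good cells are broken by taking the first one found.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best local score, aligned substring of s, aligned substring of t)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                      # empty suffixes
                          a[i - 1][j] + gap,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + gap)
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j

    # Traceback from the maximum entry until a zero entry is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("GGGACGTCCC", "TTACGTAA"))  # (4, 'ACGT', 'ACGT')
```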
Local vs. global alignment (cont'd). Local alignment is the better choice for finding a conserved segment.
[Figure: a global alignment of the two sequences below, with spaces and matches scattered over their whole length.]
Local alignment of the same sequences (aligned segment in upper case):
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local alignment: example. [Figure: a global alignment and a local alignment.] Compute a "mini" global alignment to get the local alignment.
Semiglobal comparison: summary
Forgiving initial spaces: initialize certain positions with zeros.
Forgiving final spaces: look for the maximum along certain positions.

Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
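The four rows of the table map directly onto four switches in the DP. The sketch below computes only the score and assumes the +1 / -1 / -2 scoring system; `semiglobal_score` and its keyword flags are illustrative names, with s as the first sequence (rows) and t as the second (columns).

```python
def semiglobal_score(s, t, *, free_begin_s=False, free_end_s=False,
                     free_begin_t=False, free_end_t=False,
                     match=1, mismatch=-1, gap=-2):
    """Best alignment score when spaces at the chosen ends are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Spaces before the first sequence: first row of zeros.
        a[0][j] = 0 if free_begin_s else j * gap
    for i in range(1, m + 1):
        # Spaces before the second sequence: first column of zeros.
        a[i][0] = 0 if free_begin_t else i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + gap,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + gap)
    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])                         # maximum in last row
    if free_end_t:
        candidates.extend(a[i][n] for i in range(m + 1))  # maximum in last column
    return max(candidates)


print(semiglobal_score("ACGT", "TTACGTTT"))                                      # -4
print(semiglobal_score("ACGT", "TTACGTTT", free_begin_s=True, free_end_s=True))  # 4
```

With all flags off, the function reduces to the ordinary global similarity.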
Saving space: computing sim(s, t) in linear space (g is the space penalty)
Algorithm BestScore
  input: sequences s and t
  output: vector a, where a[j] = sim(s, t[1..j]); in particular a[n] = sim(s, t)
  m ← |s|
  n ← |t|
  for j ← 0 to n do
    a[j] ← j × g
  for i ← 1 to m do
    old ← a[0]
    a[0] ← i × g
    for j ← 1 to n do
      temp ← a[j]
      a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
      old ← temp
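A direct Python transcription of BestScore, as a sketch. The function name `best_score` and the default scores (+1 / -1 / -2, taken from the earlier slides) are assumptions of this example; Python strings are 0-indexed, so s[i] on the slide becomes `s[i - 1]` here.

```python
def best_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the last row of the DP array using a single vector of size n + 1."""
    m, n = len(s), len(t)
    a = [j * gap for j in range(n + 1)]       # row 0
    for i in range(1, m + 1):
        old = a[0]                            # a[i-1][0], the diagonal for j = 1
        a[0] = i * gap
        for j in range(1, n + 1):
            temp = a[j]                       # a[i-1][j], saved before overwrite
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap,            # from above
                       old + p,               # from the diagonal
                       a[j - 1] + gap)        # from the left (already row i)
            old = temp
    return a


print(best_score("AAAC", "AGC"))                   # [-8, -5, -4, -1]
print(best_score("GACGGATTAG", "GATCGGAATAG")[-1])  # 6
```

Only the score survives; the alignment itself cannot be traced back from a single row.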
An optimal alignment in linear space. Idea: divide and conquer. Fix a position i in s and consider what is matched with s[i] in an optimal alignment. There are two possibilities: (1) the symbol t[j] matches s[i], for some j in 1..n (3.6); (2) a space between t[j] and t[j+1] matches s[i], for some j in 0..n (3.7). Recursive method: (1) for the fixed i, find the best j among both cases, using similarities of prefixes and suffixes computed in linear space; (2) recurse on the parts to the left and to the right of s[i], choosing i in each call as close as possible to the middle of the sequence.
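The divide-and-conquer idea can be sketched as below, assuming the +1 / -1 / -2 scoring system. All names (`last_row`, `linear_space_align`, `alignment_score`) are illustrative. For readability the sketch slices strings at each call, which copies characters; an index-based version would be needed for a strictly linear-space bound.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-2):
    """BestScore: a[j] = sim(s, t[:j]) using one vector."""
    a = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * gap
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + gap, old + p, a[j - 1] + gap)
            old = temp
    return a


def linear_space_align(s, t, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment (top, bottom) of s and t."""
    if not s:
        return "-" * len(t), t
    n = len(t)
    mid = len(s) // 2                              # fixed position in s (0-based)
    pref = last_row(s[:mid], t, match, mismatch, gap)            # sim(s[:mid], t[:j])
    rev = last_row(s[mid + 1:][::-1], t[::-1], match, mismatch, gap)
    # rev[k] = sim(s[mid+1:], last k characters of t)
    best = None
    for j in range(n + 1):
        cand = pref[j] + gap + rev[n - j]          # s[mid] against a space
        if best is None or cand > best[0]:
            best = (cand, j, False)
        if j < n:                                  # s[mid] against t[j]
            p = match if s[mid] == t[j] else mismatch
            cand = pref[j] + p + rev[n - j - 1]
            if cand > best[0]:
                best = (cand, j, True)
    _, j, paired = best
    left = linear_space_align(s[:mid], t[:j], match, mismatch, gap)
    rest = t[j + 1:] if paired else t[j:]
    right = linear_space_align(s[mid + 1:], rest, match, mismatch, gap)
    column = t[j] if paired else "-"
    return left[0] + s[mid] + right[0], left[1] + column + right[1]


def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score an alignment column by column ('-' is a space)."""
    return sum(gap if "-" in (a, b) else (match if a == b else mismatch)
               for a, b in zip(top, bottom))


top, bottom = linear_space_align("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
print(alignment_score(top, bottom))  # 6
```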
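BLAST/Lucene. Steps: build an inverted index for the database; query the inverted index; extend and verify the hits. Open issues: choosing the value of K; variable-length k-mers.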
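The three steps can be illustrated with a toy k-mer index. This is a simplified sketch, not BLAST itself: the function names and the tiny database are hypothetical, and the extension step only grows exact matches, whereas BLAST scores the extension and stops at a drop-off threshold.

```python
from collections import defaultdict


def build_index(db, k):
    """Step 1: inverted index mapping each k-mer to its (sequence name, position) list."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(index, query, k):
    """Step 2: look up every k-mer of the query; return (name, query pos, db pos) hits."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((name, qpos, dpos))
    return hits


def extend(query, target, qpos, dpos, k):
    """Step 3 (simplified): extend a seed left and right while characters match exactly."""
    left = 0
    while (qpos - left - 1 >= 0 and dpos - left - 1 >= 0
           and query[qpos - left - 1] == target[dpos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and dpos + right < len(target)
           and query[qpos + right] == target[dpos + right]):
        right += 1
    return qpos - left, dpos - left, left + right   # query start, db start, length


db = {"seq1": "TTGACAGGTACC", "seq2": "AAAAAAA"}    # hypothetical toy database
index = build_index(db, k=3)
seeds = find_seeds(index, "ACAGGT", k=3)
print(seeds[0])                                      # ('seq1', 0, 3)
print(extend("ACAGGT", db["seq1"], 0, 3, k=3))       # (0, 3, 6)
```

The choice of k trades sensitivity against index size and the number of spurious seeds: a small k finds more seeds but many are chance hits, while a large k misses matches interrupted by mismatches.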
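Homework
Build a keyword tree for { apple, please, eat, apply } and draw all of its failure links.
Align the two strings aaac and agc, assuming a match scores 2, a mismatch scores -1 and a space scores -2; draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that path.
Briefly describe the main idea of BLAST.
Compute the sp and sp' values at every position of the string "abababc".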