Approximate Search. Zou Quan, Ph.D., Assistant Professor


1 Approximate Search. Zou Quan, Ph.D., Assistant Professor. http://datamining.xmu.edu.cn

2 Outline
- Global alignment
- Local alignment
- BLAST

3 Why compare sequences?
- Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ.
- Goal: algorithms that solve this problem efficiently.

4 Example: an alignment of two DNA sequences (dots mark gaps, bars mark identical positions)

TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||     ||   ||  |  | ||| |  |||| |||||    |||  |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
|  | |              |  |||||| |   ||||  | || |   |
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||   |   |||||||||  ||   |||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

5 Two notions
- Similarity: a measure of how similar two sequences are.
- Alignment: the basic operation for comparing two sequences. It places one sequence above the other so that the correspondence between similar characters or substrings becomes clear.

6 Comparing two sequences
Alignments involve:
- global comparisons: the entire sequences
- local comparisons: just substrings of the sequences
Both are computed with dynamic programming (DP).

7 Global comparison: example
Aligning GACGGATTAG with GATCGGAATAG:

GA-CGGATTAG
GATCGGAATAG

The second sequence has an extra T, and one column differs by a change between A and T. A space is written as a dash.

8 Global comparison: the basic algorithm
Definitions
- Alignment: insert spaces into the sequences so that they have the same size, then place one over the other to create a correspondence between columns. A column may not contain a space in both sequences. Spaces may also be inserted at the beginning or the end.
- Scoring function: a measure of similarity between elements.
  - match (identical characters): +1
  - mismatch (distinct characters): -1
  - space: -2
- Scoring system: rewards matches and penalizes mismatches and spaces.

9 Global comparison: the basic algorithm

GA-CGGATTAG
GATCGGAATAG

- Example: the total score is 6 (9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6).
- Similarity sim(s, t): the maximum alignment score. Many alignments may reach this score.
- Best alignment: an alignment whose score equals the similarity.
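The scoring of the example alignment can be checked with a few lines of Python. This is a minimal sketch: the function name `alignment_score` and the use of `-` for a space are choices made here for illustration, and the default scores are the +1 / -1 / -2 values from the definitions.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, space=-2):
    """Score two aligned rows column by column; '-' marks a space."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            # The definition of an alignment forbids a space over a space.
            raise ValueError("a column may not contain two spaces")
        if x == "-" or y == "-":
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```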

10 Basic DP algorithm for comparing two sequences
- The number of alignments between two sequences is exponential.
- Efficient algorithm: DP over prefixes, from shorter to longer.
- Idea: an (m+1)×(n+1) array; entry (i, j) is the similarity between s[1..i] and t[1..j].
- p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j]; these values are shown in the upper-left corners of the table cells.
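The array idea above can be sketched in Python as follows. The slide text does not spell out the recurrence, so this sketch assumes the standard one: each entry is the maximum of the upper neighbour plus a space, the upper-left neighbour plus p(i, j), and the left neighbour plus a space. The function name `global_alignment` and the traceback tie-breaking order are illustrative choices.

```python
def global_alignment(s, t, match=1, mismatch=-1, space=-2):
    """Return (sim(s, t), aligned s, aligned t) using the full DP array."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space          # s[1..i] against the empty prefix of t
    for j in range(1, n + 1):
        a[0][j] = j * space          # empty prefix of s against t[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        diag = (i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1]
                + (match if s[i - 1] == t[j - 1] else mismatch))
        if diag:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_alignment("GACGGATTAG", "GATCGGAATAG")
print(score)   # 6
print(top)
print(bottom)
```

The array costs O(mn) time and space; several optimal alignments may exist, and the traceback returns only one of them.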

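12 DP table for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, space -2:

          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1

The bottom-right entry gives sim(s, t) = -1.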

13 Local comparison
- Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.
- Algorithm: find the highest-scoring local alignment between the two sequences.

14 Local comparison
Idea:
- Data structure: an (m+1)×(n+1) array; entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
- Initialization: the first row and column are filled with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
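A minimal Python sketch of this array follows. It assumes the usual local-alignment recurrence, in which each entry takes the maximum of zero (the empty suffixes) and the three global-alignment candidates; the slide text states only the data structure and the initialization. The name `local_similarity` is illustrative.

```python
def local_similarity(s, t, match=1, mismatch=-1, space=-2):
    """Return (best local score, (i, j) where the best local alignment ends)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                      # empty suffixes
                          a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)
            if a[i][j] > best:
                best, best_pos = a[i][j], (i, j)
    return best, best_pos


print(local_similarity("XXHELLOYY", "ZZHELLOZZ"))  # (5, (7, 7))
```

The best local score is the largest entry anywhere in the array, not the bottom-right entry; a traceback from that cell until a zero is reached would recover the aligned substrings.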

18 Global alignment

19 Local vs. Global Alignment (cont'd)
- Global alignment:

--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C

- Local alignment (a better alignment for finding a conserved segment):

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

20 Local Alignment: Example
Compute a “mini” global alignment to get the local alignment. (Figure with two labeled paths: global alignment and local alignment.)

21 Semiglobal comparison: summary
- Forgiving initial spaces: initialize certain positions with zero.
- Forgiving final spaces: look for the maximum along certain positions.

Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
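The four rows of the summary table map directly onto four switches in the global DP. The sketch below assumes the first sequence s indexes the rows and the second sequence t indexes the columns; the function name and the `free_*` parameter names are made up for this illustration.

```python
def semiglobal_similarity(s, t,
                          free_start_s=False, free_end_s=False,
                          free_start_t=False, free_end_t=False,
                          match=1, mismatch=-1, space=-2):
    """Global DP in which selected end spaces are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Beginning of first sequence: first row of zeros.
        a[0][j] = 0 if free_start_s else j * space
    for i in range(1, m + 1):
        # Beginning of second sequence: first column of zeros.
        a[i][0] = 0 if free_start_t else i * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + space)
    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])                  # maximum in last row
    if free_end_t:
        candidates.extend(row[n] for row in a)   # maximum in last column
    return max(candidates)


# Plain global comparison charges the four end spaces around TTA.
print(semiglobal_similarity("GATTACA", "TTA"))                    # -5
# Forgiving spaces at both ends of the second sequence.
print(semiglobal_similarity("GATTACA", "TTA",
                            free_start_t=True, free_end_t=True))  # 3
```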

23 Saving space: computing sim(s, t)

Algorithm BestScore
  input: sequences s and t
  output: vector a
  m ← |s|
  n ← |t|
  for j ← 0 to n do
      a[j] ← j × g
  for i ← 1 to m do
      old ← a[0]
      a[0] ← i × g
      for j ← 1 to n do
          temp ← a[j]
          a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
          old ← temp
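A direct Python transcription of BestScore is given below as a sketch. The parameter `g` is the space penalty as in the pseudocode; the `match` and `mismatch` parameters stand in for p(i, j) and default to the +1 / -1 values used earlier in the deck.

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the DP array using a single vector of size n+1."""
    m, n = len(s), len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]            # value of entry (i-1, 0), the upper-left neighbour
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]       # entry (i-1, j), needed as the next upper-left
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp
    return a


row = best_score("AAAC", "AGC")
print(row)       # [-8, -5, -4, -1]
print(row[-1])   # -1, which is sim(s, t)
```

Only the similarity is available this way; the vector no longer holds enough information to trace back an alignment.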

24 An optimal alignment in linear space
Idea: a divide-and-conquer strategy.
Fix a position i in s and consider what matches s[i] in an optimal alignment. There are two possibilities:
1. The symbol t[j] matches s[i], for some j in 1..n. (3.6)
2. A space between t[j] and t[j+1] matches s[i], for some j in 0..n. (3.7)
Recursive method:
1. For the fixed i, determine which possibility and which j give the best total score, then solve the parts before and after s[i] recursively.
2. To decide which value of i to use in each recursive call, pick i as close as possible to the middle of the sequence.
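The divide-and-conquer idea can be sketched in Python as below. Note the assumption: this is the common Hirschberg-style formulation, which cuts s between its two halves and searches for the best cut point j in t, rather than enumerating cases (3.6) and (3.7) for a single s[i] explicitly. The helper names `last_row`, `align_linear`, and `score` are illustrative.

```python
def last_row(s, t, match=1, mismatch=-1, space=-2):
    """Similarity of s with every prefix of t, keeping only one row."""
    a = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old = a[0]
        a[0] = i * space
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + space, old + p, a[j - 1] + space)
            old = temp
    return a


def align_linear(s, t, match=1, mismatch=-1, space=-2):
    """Return one optimal global alignment (two rows) in linear space."""
    if len(s) == 0:
        return "-" * len(t), t
    if len(t) == 0:
        return s, "-" * len(s)
    if len(s) == 1:
        j = t.find(s)                       # a matching position, or -1
        p = match if j >= 0 else mismatch
        if p >= 2 * space:                  # pairing s[0] beats two spaces
            j = max(j, 0)
            return "-" * j + s + "-" * (len(t) - j - 1), t
        return s + "-" * len(t), "-" + t
    mid = len(s) // 2                       # cut s as close to the middle as possible
    left = last_row(s[:mid], t, match, mismatch, space)
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, space)
    n = len(t)
    # Best place to cut t: prefix score plus suffix score is maximal.
    j = max(range(n + 1), key=lambda k: left[k] + right[n - k])
    a1, b1 = align_linear(s[:mid], t[:j], match, mismatch, space)
    a2, b2 = align_linear(s[mid:], t[j:], match, mismatch, space)
    return a1 + a2, b1 + b2


def score(top, bottom, match=1, mismatch=-1, space=-2):
    """Column-by-column score of two aligned rows."""
    return sum(space if "-" in (x, y) else match if x == y else mismatch
               for x, y in zip(top, bottom))


top, bottom = align_linear("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
print(score(top, bottom))  # 6
```

Each level of the recursion does work proportional to the area of the subproblems, which halves from level to level, so the total time stays O(mn) while the extra space is O(m + n).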

25 Saving space

26 BLAST/Lucene
- Steps:
  - Build an inverted index for the database
  - Query the inverted index
  - Extend and verify the hits
- Open issues:
  - Choosing the value of K
  - Variable-length k-mers
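The three steps can be illustrated with a toy seed-and-extend sketch in Python. This is a simplification made for illustration, not how BLAST or Lucene is actually implemented: it indexes exact k-mers only and extends hits without gaps while characters agree, whereas real BLAST scores neighbouring words and stops extension with a drop-off threshold. All function names here are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Inverted index: k-mer -> list of (sequence id, position)."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k):
    """Look up every k-mer of the query; return (sequence id, query pos, db pos)."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((seq_id, qpos, dpos))
    return seeds


def extend_seed(query, target, qpos, dpos, k):
    """Ungapped extension of an exact seed; return (query start, db start, length)."""
    left = 0
    while (qpos - left > 0 and dpos - left > 0
           and query[qpos - left - 1] == target[dpos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and dpos + right < len(target)
           and query[qpos + right] == target[dpos + right]):
        right += 1
    return qpos - left, dpos - left, left + right


database = ["AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"]
query = "TCCCAGTTATGTCAGGGGACACG"
k = 5
index = build_kmer_index(database, k)
hits = {extend_seed(query, database[sid], q, d, k)
        for sid, q, d in find_seeds(query, index, k)}
print(hits)  # includes (3, 21, 12), the shared segment CAGTTATGTCAG
```

A small K yields many spurious seeds, while a large K misses short or diverged matches, which is the trade-off behind the question of how to choose K.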

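27 Homework
- Build a keyword tree for {apple, please, eat, apply} and draw all the failure links.
- Align the two strings aaac and agc, assuming a match scores 2, a mismatch -1, and a space -2. Draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that path.
- Briefly describe the main idea of BLAST.
- Compute the sp and sp' values for every position of the string "abababc".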

