1
Approximate Search. Zou Quan (邹权), Ph.D., Assistant Professor. http://datamining.xmu.edu.cn
2
Outline: global alignment; local alignment; BLAST
3
Why compare sequences? Sequence comparison is the operation of finding which parts of the sequences are alike and which parts differ. The goal is algorithms that solve this efficiently.
4
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||     ||   ||  |  | ||| |  |||| |||||    |||  |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
|  | |              |  |||||| |   ||||  | || |   |
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||   |   |||||||||  ||   |||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
5
Two notions
Similarity: a measure of how similar two sequences are.
Alignment: the basic operation for comparing two sequences. It places one sequence above the other to make clear the correspondence between similar characters or substrings of the sequences.
6
Comparing two sequences
Alignments can involve global comparisons (entire sequences) or local comparisons (just substrings of the sequences). Both are computed with dynamic programming (DP).
7
Global comparison: example
Aligning GACGGATTAG and GATCGGAATAG:
GA-CGGATTAG
GATCGGAATAG
The sequences differ by an extra T in the second sequence and by a change from A to T. A space is written as a dash.
8
Global comparison: the basic algorithm
Definitions
Alignment: spaces are inserted into the sequences so that both have the same size, and the sequences are then placed one over the other to create a correspondence between their characters. A column may not contain two spaces. Spaces may be inserted at the beginning or end of a sequence.
Scoring function: a measure of similarity between elements.
a match (identical characters): +1
a mismatch (distinct characters): -1
a space: -2
The scoring system rewards matches and penalizes mismatches and spaces.
9
Global comparison: the basic algorithm
Example:
GA-CGGATTAG
GATCGGAATAG
The total score is 6 (9 matches, 1 mismatch, 1 space).
Similarity: sim(s, t) is the maximum alignment score over all alignments of s and t. Many alignments may attain the similarity; a best alignment is any alignment whose score equals the similarity.
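The scoring rule above can be checked with a few lines of Python. This is a minimal sketch: the function name `score_alignment` and the use of `-` for a space are illustrative choices, not part of the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, space=-2):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += space      # one character against a space
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # distinct characters
    return total


# The alignment from the slide: 9 matches, 1 mismatch, 1 space.
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```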
10
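Basic DP algorithm for comparison of two sequences
The number of alignments between two sequences is exponential, so an efficient algorithm is needed.
DP works on prefixes, from shorter to longer.
Idea: build an (m+1)×(n+1) array a in which entry (i, j) is the similarity between s[1..i] and t[1..j]. The first row and column hold multiples of the space penalty g = -2. Every other entry is the maximum of a[i-1, j] + g, a[i-1, j-1] + p(i, j), and a[i, j-1] + g.
p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j]. In the figure, these values appear in the upper left corners of the cells.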
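The array-filling idea can be sketched in Python as follows. The function name `global_alignment`, the traceback order (diagonal first, then up, then left) and the `-` space symbol are assumptions made for illustration; other tie-breaking orders give other, equally good alignments.

```python
def global_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m+1) x (n+1) similarity array and trace back one best alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * g              # s[1..i] against spaces only
    for j in range(1, n + 1):
        a[0][j] = j * g              # t[1..j] against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)

    # Walk back from (m, n) to (0, 0), re-deriving which choice produced each entry.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return a, "".join(reversed(top)), "".join(reversed(bottom))


table, top, bottom = global_alignment("GACGGATTAG", "GATCGGAATAG")
print(table[-1][-1])  # similarity, bottom-right entry
print(top)
print(bottom)
```

The array costs O(mn) time and space; the bottom-right entry is sim(s, t).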
12
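DP array for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, space -2. The bottom-right entry gives sim(s, t) = -1.
          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1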
13
Local comparison
Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.
Algorithm: find the highest-scoring local alignment between the two sequences.
14
Local comparison
Idea
Data structure: an (m+1)×(n+1) array. Each entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
Initialization: the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
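A minimal Python sketch of this data structure follows. The slide does not spell out the recurrence, so the code assumes the usual one for this formulation: the three choices of the global algorithm plus a fourth option, zero, for the empty alignment. The function name `local_alignment` is hypothetical.

```python
def local_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return (score, top, bottom) for one highest-scoring local alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            # Zero stands for the alignment of two empty suffixes.
            a[i][j] = max(0, a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j

    # Trace back from the maximum entry until an entry with value zero is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif a[i][j] == a[i - 1][j] + g:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = local_alignment("TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC",
                                     "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC")
print(score, top, bottom)
```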
18
Global alignment
19
Local vs. Global Alignment (cont'd)
Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C
Local alignment (the better alignment for finding a conserved segment):
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
20
Local Alignment: Example
Compute a "mini" global alignment to get the local alignment. (Figure panels: Global alignment; Local alignment.)
21
Semiglobal comparison: summary
Forgiving initial spaces: initialize certain positions with zero.
Forgiving final spaces: look for the maximum along certain positions.
Place where spaces are not charged for    Action
Beginning of first sequence               Initialize first row with zeros
End of first sequence                     Look for maximum in last row
Beginning of second sequence              Initialize first column with zeros
End of second sequence                    Look for maximum in last column
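The four rows of the summary map directly onto four switches in the basic DP. The sketch below assumes rows of the array are indexed by the first sequence s and columns by the second sequence t; the function and flag names are hypothetical.

```python
def semiglobal_score(s, t, free_begin_s=False, free_end_s=False,
                     free_begin_t=False, free_end_t=False,
                     match=1, mismatch=-1, g=-2):
    """Best alignment score when selected end spaces are not charged for."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Spaces before the start of s: first row of zeros.
        a[0][j] = 0 if free_begin_s else j * g
    for i in range(1, m + 1):
        # Spaces before the start of t: first column of zeros.
        a[i][0] = 0 if free_begin_t else i * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)

    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])                     # maximum in last row
    if free_end_t:
        candidates.extend(row[n] for row in a)      # maximum in last column
    return max(candidates)


# A short sequence found inside a longer one: end spaces in t are forgiven.
print(semiglobal_score("TTTACGTTT", "ACG"))                                       # -9
print(semiglobal_score("TTTACGTTT", "ACG", free_begin_t=True, free_end_t=True))   # 3
```

With all flags off the function reduces to the global similarity.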
22
http://datamining.xmu.edu.cn
23
Saving space: computing sim(s, t)
Algorithm BestScore
  input: sequences s and t
  output: vector a
  m ← |s|
  n ← |t|
  for j ← 0 to n do
    a[j] ← j×g
  for i ← 1 to m do
    old ← a[0]
    a[0] ← i×g
    for j ← 1 to n do
      temp ← a[j]
      a[j] ← max(a[j]+g, old+p(i,j), a[j-1]+g)
      old ← temp
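A direct Python transcription of BestScore is given below as a sketch. It assumes the scoring values used earlier in the deck (match +1, mismatch -1, g = -2) and 0-based Python strings, so p(i, j) compares s[i-1] with t[j-1].

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the DP array using a single vector of length n+1.

    After the loops finish, a[j] = sim(s, t[:j]); in particular a[-1] = sim(s, t).
    """
    n = len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, len(s) + 1):
        old = a[0]                 # value of entry (i-1, 0), needed as the diagonal
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]            # entry (i-1, j), about to be overwritten
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp             # becomes the diagonal for column j+1
    return a


print(best_score("AAAC", "AGC"))                      # [-8, -5, -4, -1]
print(best_score("GACGGATTAG", "GATCGGAATAG")[-1])    # 6
```

Only O(n) space is used, at the cost of losing the information needed for a traceback.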
24
An optimal alignment in linear space
Idea: a divide-and-conquer strategy.
Fix a position i in s and consider what s[i] is matched with in an optimal alignment. There are two possibilities:
1, the symbol t[j] matches s[i], for some j in 1..n (3.6)
2, a space between t[j] and t[j+1] matches s[i], for some j in 0..n (3.7)
Recursive method:
1, for the fixed i, find the best j and which of the two cases applies, then recurse on the parts of s and t to the left and to the right of that column
2, to decide which value of i to use in each recursive call, pick i as close as possible to the middle of the sequence
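The divide-and-conquer strategy can be sketched in Python as below. This is one possible reading of the slide, not necessarily the lecturer's exact procedure: the helper `best_scores` is the linear-space score vector, run once on prefixes and once on reversed suffixes, and the names `align`, `MATCH`, `MISMATCH`, `G` are illustrative.

```python
MATCH, MISMATCH, G = 1, -1, -2   # scoring values from the earlier slides


def p(x, y):
    return MATCH if x == y else MISMATCH


def best_scores(s, t):
    """Linear-space score vector: result[j] = sim(s, t[:j])."""
    a = [j * G for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * G
        for j in range(1, len(t) + 1):
            temp = a[j]
            a[j] = max(a[j] + G, old + p(s[i - 1], t[j - 1]), a[j - 1] + G)
            old = temp
    return a


def align(s, t):
    """Return one optimal global alignment (top, bottom) without a full DP array."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    n = len(t)
    i = len(s) // 2                                   # middle symbol of s (0-based)
    pref = best_scores(s[:i], t)                      # pref[j] = sim(s[:i], t[:j])
    suff = best_scores(s[i + 1:][::-1], t[::-1])      # suff[k] = sim(s[i+1:], t[n-k:])
    best = None
    for j in range(n):                                # case 1: s[i] matched with t[j]
        score = pref[j] + p(s[i], t[j]) + suff[n - j - 1]
        if best is None or score > best[0]:
            best = (score, j, True)
    for j in range(n + 1):                            # case 2: s[i] matched with a space
        score = pref[j] + G + suff[n - j]
        if score > best[0]:
            best = (score, j, False)
    _, j, matched = best
    left_top, left_bottom = align(s[:i], t[:j])
    if matched:
        right_top, right_bottom = align(s[i + 1:], t[j + 1:])
        return left_top + s[i] + right_top, left_bottom + t[j] + right_bottom
    right_top, right_bottom = align(s[i + 1:], t[j:])
    return left_top + s[i] + right_top, left_bottom + "-" + right_bottom


top, bottom = align("GACGGATTAG", "GATCGGAATAG")
print(top)
print(bottom)
```

Picking i in the middle halves s at every level, so the recursion depth is logarithmic in |s| and the total work stays proportional to mn.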
25
Saving space
26
BLAST/Lucene
Steps: build an inverted index for the database; query the inverted index; extend and verify the hits.
Open issues: choosing the value of K; variable-length k-mers.
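The three steps can be illustrated with a toy seed-and-extend search in Python. This is a simplified sketch, not BLAST itself: seeds are exact k-mer matches and the extension stops at the first mismatch, whereas real BLAST scores the extension and tolerates mismatches. The names `build_index`, `extend`, `search` and the `min_len` threshold are hypothetical.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: inverted index mapping each k-mer to its (sequence id, offset) list."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def extend(query, seq, qpos, spos, k):
    """Step 3: grow an exact seed left and right while the characters keep matching."""
    left = 0
    while qpos - left > 0 and spos - left > 0 and query[qpos - left - 1] == seq[spos - left - 1]:
        left += 1
    right = k
    while qpos + right < len(query) and spos + right < len(seq) and query[qpos + right] == seq[spos + right]:
        right += 1
    return qpos - left, spos - left, left + right


def search(query, database, index, k, min_len):
    """Step 2 and 3: look up every query k-mer, extend each seed, keep long hits."""
    hits = set()
    for qpos in range(len(query) - k + 1):
        for seq_id, spos in index.get(query[qpos:qpos + k], ()):
            q_start, s_start, length = extend(query, database[seq_id], qpos, spos, k)
            if length >= min_len:
                hits.add((seq_id, q_start, s_start, length))
    return sorted(hits)


database = ["AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC", "GGGGGGGG"]
index = build_index(database, k=4)
print(search("TCCCAGTTATGTCAGGGGACACG", database, index, k=4, min_len=8))
# [(0, 3, 21, 12)]
```

The choice of K shows up directly here: a small k produces many spurious seeds to extend, while a large k misses matches shorter than k.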
27
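Homework
Build a keyword tree for {apple, please, eat, apply} and draw all of its failure links.
Align the two strings aaac and agc, assuming a match scores 2, a mismatch scores -1, and a space scores -2. Draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that path.
Briefly describe the main idea of BLAST.
Compute the sp and sp' values at every position of the string "abababc".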