Approximate Search  Quan Zou (邹权), Ph.D., Assistant Professor



Outline
- Global alignment
- Local alignment
- BLAST

Why compare sequences?
- Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ.
- Goal: algorithms that solve this problem efficiently.


Two notions
- Similarity: a measure of how similar two sequences are.
- Alignment: the basic operation for comparing two sequences. One sequence is placed above the other so that the correspondence between similar characters or substrings becomes clear.

Comparing two sequences
Alignments involve:
- global comparisons: entire sequences
- local comparisons: just substrings of the sequences
Technique: dynamic programming (DP)

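Global comparison: example
Sequences to align:
    GACGGATTAG
    GATCGGAATAG
One alignment:
    GA-CGGATTAG
    GATCGGAATAG
The second sequence has an extra T, which is aligned with a space (written as a dash). One other column differs: a T in the first sequence is aligned with an A in the second.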

Global comparison: the basic algorithm
Definitions
- Alignment: spaces are inserted into the sequences so that both have the same size, and the sequences are then placed one over the other to create a correspondence between their characters. A column may not contain a space in both sequences. Spaces may be inserted at the beginning or end.
- Scoring function: a measure of similarity between the two elements of a column:
  - match (identical characters): +1
  - mismatch (distinct characters): -1
  - space: -2
- Scoring system: rewards matches and penalizes mismatches and spaces.

Global comparison: the basic algorithm
    GA-CGGATTAG
    GATCGGAATAG
- Example: the total score is 6 (9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6).
- Similarity: sim(s, t) is the maximum alignment score. Several alignments may attain it.
- Best alignment: an alignment whose score equals the similarity.
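The scoring system can be checked mechanically. The sketch below is only an illustration: the function name `alignment_score` and its parameter names are made up here, and the default scores are the +1 / -1 / -2 values from the definitions slide.

```python
def alignment_score(row_s, row_t, match=1, mismatch=-1, space=-2):
    """Score an alignment given as two equal-length rows, with '-' as the space."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            # A column with a space in both sequences is not allowed.
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```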

Basic DP algorithm for comparison of two sequences
- The number of alignments between two sequences is exponential.
- Efficient algorithm (DP): work on prefixes, from shorter to longer.
- Idea: an (m+1)×(n+1) array in which entry (i, j) is the similarity between s[1..i] and t[1..j].
- p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j] (the values in the upper-left corners of the cells).
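A minimal sketch of the array-filling idea, assuming the +1 / -1 / -2 scoring system from the definitions slide. Each entry is taken as the maximum of three candidates (space in t, match or mismatch, space in s), which is the same update used in BestScore. The function name `global_alignment` and the traceback order are choices made for this illustration, not part of the slides.

```python
def global_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m+1) x (n+1) array and trace back one optimal alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * g          # s[1..i] against the empty prefix of t
    for j in range(1, n + 1):
        a[0][j] = j * g          # empty prefix of s against t[1..j]

    def p(i, j):
        # Score of pairing s[i] with t[j] (1-based, as on the slide).
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i - 1][j] + g,
                          a[i - 1][j - 1] + p(i, j),
                          a[i][j - 1] + g)

    # Trace back from (m, n) to (0, 0), rebuilding the two rows in reverse.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and a[i][j] == a[i - 1][j] + g:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


score, row_s, row_t = global_alignment("GACGGATTAG", "GATCGGAATAG")
print(score)  # 6
print(row_s)
print(row_t)
```

When several optimal alignments exist, the order of the tests in the traceback decides which one is returned.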


Local comparison
- Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.
- Algorithm: find the highest-scoring local alignment between the two sequences.

Local comparison
Idea:
- Data structure: an (m+1)×(n+1) array. Entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
- Initialization: the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
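The following sketch illustrates the local variant under the same assumed +1 / -1 / -2 scores. Because the empty alignment is always available, every entry is the maximum of zero and the three usual candidates, and the answer is the largest entry anywhere in the array. The name `local_alignment` and the returned tuple layout are assumptions of this example.

```python
def local_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return (best score, substring of s, substring of t) for one best local alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best, best_pos = 0, (0, 0)

    def p(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(0,
                          a[i - 1][j] + g,
                          a[i - 1][j - 1] + p(i, j),
                          a[i][j - 1] + g)
            if a[i][j] > best:
                best, best_pos = a[i][j], (i, j)

    # Walk back from the best entry until a zero entry is reached.
    end_i, end_j = best_pos
    i, j = end_i, end_j
    while a[i][j] > 0:
        if a[i][j] == a[i - 1][j - 1] + p(i, j):
            i, j = i - 1, j - 1
        elif a[i][j] == a[i - 1][j] + g:
            i -= 1
        else:
            j -= 1
    return best, s[i:end_i], t[j:end_j]


print(local_alignment("abcxyz", "qqxyzqq"))  # (3, 'xyz', 'xyz')
```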

Global alignment

Local vs. Global Alignment (cont'd)
- Global alignment:
    --T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
      |  || |  ||  | | | |||    || |  | |  | ||||   |
    AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
- Local alignment (better for finding a conserved segment):
                      tccCAGTTATGTCAGgggacacgagcatgcagagac
                         ||||||||||||
    aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Example
[Figure: two panels, "Global alignment" and "Local alignment"]
Compute a "mini" global alignment to get the local alignment.

Semiglobal comparison: summary
- Forgiving initial spaces: initialize certain positions with zero.
- Forgiving final spaces: look for the maximum along certain positions.

Place where spaces are not charged for | Action
Beginning of first sequence            | Initialize first row with zeros
End of first sequence                  | Look for maximum in last row
Beginning of second sequence           | Initialize first column with zeros
End of second sequence                 | Look for maximum in last column
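The four rows of the table can be read as four independent switches on the global algorithm. The sketch below assumes s indexes the rows and t the columns of the array, and uses the +1 / -1 / -2 scores from the definitions slide. The function name `semiglobal_score` and the four flag names are invented for this illustration.

```python
def semiglobal_score(s, t, free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False,
                     match=1, mismatch=-1, g=-2):
    """Global DP in which spaces at selected sequence ends are not charged."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        # Free spaces at the beginning of s: first row of zeros.
        a[0][j] = 0 if free_start_s else j * g
    for i in range(1, m + 1):
        # Free spaces at the beginning of t: first column of zeros.
        a[i][0] = 0 if free_start_t else i * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)

    candidates = [a[m][n]]
    if free_end_s:
        candidates.extend(a[m])                    # maximum in the last row
    if free_end_t:
        candidates.extend(row[n] for row in a)     # maximum in the last column
    return max(candidates)


# A short sequence placed inside a longer one, with spaces around s not charged.
print(semiglobal_score("CGG", "AACGGTT"))                                       # -5
print(semiglobal_score("CGG", "AACGGTT", free_start_s=True, free_end_s=True))   # 3
```

With all four flags left off, the function reduces to the ordinary global similarity.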

Saving space
Computing sim(s, t)

Algorithm BestScore
  input: sequences s and t
  output: vector a
  m ← |s|
  n ← |t|
  for j ← 0 to n do
    a[j] ← j×g
  for i ← 1 to m do
    old ← a[0]
    a[0] ← i×g
    for j ← 1 to n do
      temp ← a[j]
      a[j] ← max(a[j]+g, old+p(i,j), a[j-1]+g)
      old ← temp
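A direct Python transcription of BestScore, offered as a sketch. It assumes g is the space penalty (-2) and p(i, j) is the +1 / -1 match score defined earlier. The name `best_score` and the keyword parameters are choices made for this example.

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the DP array using a single vector of length n+1."""
    m, n = len(s), len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]            # value of entry (i-1, 0), needed for the diagonal
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]       # entry (i-1, j), saved before it is overwritten
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp        # becomes the diagonal for column j+1
    return a


row = best_score("GACGGATTAG", "GATCGGAATAG")
print(row[-1])  # 6
```

The last element of the returned vector is sim(s, t). The other elements are the similarities between s and each prefix of t.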

An optimal alignment in linear space
Idea: divide-and-conquer strategy
Fix a position i in s and consider what matches s[i] in the alignment. There are two possibilities:
1. The symbol t[j] matches s[i], for some j in 1..n. (3.6)
2. A space between t[j] and t[j+1] matches s[i], for some j in 0..n, where j = 0 puts the space before t[1] and j = n puts it after t[n]. (3.7)
Recursive method:
1. For a fixed i, find which possibility and which j give the best total score, then align the parts to the left and to the right of s[i] recursively.
2. To decide which value of i to use in each recursive call, pick i as close as possible to the middle of the sequence.
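A compact sketch of the divide-and-conquer idea, under the same assumed +1 / -1 / -2 scores. For the middle character of s it evaluates both possibilities for every j, using one-vector score computations on the left part and on the reversed right part, and then recurses. The function names `last_row` and `linear_space_alignment` are hypothetical. String slicing is used for brevity, so this version copies substrings, and an index-based version would be needed to keep the working space strictly linear.

```python
def last_row(s, t, match, mismatch, g):
    """Similarities between s and every prefix of t, using one vector."""
    a = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * g
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp
    return a


def linear_space_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return the two rows of one optimal global alignment of s and t."""
    if not s:
        return "-" * len(t), t
    n = len(t)
    mid = len(s) // 2                       # position i, close to the middle of s
    left, right = s[:mid], s[mid + 1:]
    pref = last_row(left, t, match, mismatch, g)
    # suff[k] is the similarity between `right` and the last k characters of t.
    suff = last_row(right[::-1], t[::-1], match, mismatch, g)

    best = None                             # (score, j, s[mid] paired with a symbol?)
    for j in range(n + 1):
        # Possibility 2: s[mid] faces a space placed after the first j characters of t.
        cand = (pref[j] + g + suff[n - j], j, False)
        if best is None or cand[0] > best[0]:
            best = cand
        if j >= 1:
            # Possibility 1: s[mid] faces the j-th character of t.
            p = match if s[mid] == t[j - 1] else mismatch
            cand = (pref[j - 1] + p + suff[n - j], j, True)
            if cand[0] > best[0]:
                best = cand

    _, j, with_symbol = best
    left_end = j - 1 if with_symbol else j
    ls, lt = linear_space_alignment(left, t[:left_end], match, mismatch, g)
    rs, rt = linear_space_alignment(right, t[j:], match, mismatch, g)
    middle_t = t[j - 1] if with_symbol else "-"
    return ls + s[mid] + rs, lt + middle_t + rt


row_s, row_t = linear_space_alignment("GACGGATTAG", "GATCGGAATAG")
print(row_s)
print(row_t)
```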

saving space

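BLAST/Lucene
- Steps
  - Build an inverted index for the database
  - Query the inverted index
  - Extend and verify the hits
- Issues
  - Choosing the value of k
  - Variable-length k-mers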
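The three steps can be illustrated with a toy k-mer index. This is a heavily simplified sketch and not how BLAST itself is implemented: seeds here are exact k-mer matches, and extension simply continues while characters are identical, whereas BLAST scores its extensions. All function names (`build_kmer_index`, `find_seeds`, `extend_seed`) are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Step 1: inverted index mapping each k-mer to (sequence id, position) pairs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k):
    """Step 2: look up every k-mer of the query in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((seq_id, qpos, dpos))
    return hits


def extend_seed(query, seq, qpos, dpos, k):
    """Step 3: ungapped extension in both directions while characters are equal."""
    left = 0
    while (qpos - left > 0 and dpos - left > 0
           and query[qpos - left - 1] == seq[dpos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and dpos + right < len(seq)
           and query[qpos + right] == seq[dpos + right]):
        right += 1
    # Start in the query, start in the database sequence, length of the match.
    return qpos - left, dpos - left, left + right


database = ["AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"]
query = "TCCCAGTTATGTCAGGGG"
k = 4
index = build_kmer_index(database, k)
for seq_id, qpos, dpos in find_seeds(query, index, k):
    print(seq_id, extend_seed(query, database[seq_id], qpos, dpos, k))
```

The choice of k is the trade-off listed under "Issues": a small k produces many chance seeds to extend, while a large k can miss regions that contain a mismatch in every window.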

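Homework
- Build the keyword tree for {apple, please, eat, apply} and draw all of its failure links.
- Align the two strings aaac and agc, assuming a match scores 2, a mismatch scores -1, and a space scores -2. Draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that path.
- Briefly describe the main idea of BLAST.
- Compute the sp and sp' values at every position of the string "abababc".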