2016/1/27 Summer Course
Pattern Search Problems
Part I: Fundamental Concepts
FASTA:
FastP and FastA
FastA is a heuristic algorithm that speeds up sequence matching compared with computing a full optimal alignment. The FastA algorithm is implemented in the following 6 stages:
–Locate hot spots
–Find the 10 best regions in the matrix
–Score using a substitution matrix
–Combine initial regions from different diagonals
–Optimal alignment
–Presentation
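The first two stages can be illustrated with a small sketch. It assumes that a hot spot is an exact k-tuple match between the two sequences and that regions are ranked simply by the number of hot spots on each diagonal; the function names and the default k = 2 are illustrative choices, not part of the FastA program itself.

```python
from collections import defaultdict


def hot_spot_diagonals(query, target, k=2):
    """Count k-tuple matches (hot spots) on each diagonal.

    A hot spot is a pair (i, j) with query[i:i+k] == target[j:j+k];
    its diagonal is j - i.
    """
    # Lookup table: k-tuple -> list of start positions in the query.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Scan the target once and record the diagonal of every hot spot.
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


def best_diagonals(query, target, k=2, top=10):
    """Return up to `top` (diagonal, hot spot count) pairs, densest first."""
    counts = hot_spot_diagonals(query, target, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]
```

For example, `best_diagonals("ACGT", "TTACGT")` returns `[(2, 3)]`: all three 2-tuples of the query match on diagonal 2. The later stages (rescoring with a substitution matrix, joining diagonals, banded optimal alignment) are not covered by this sketch.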
BLAST:
BLAST
The BLAST database consists of three files for every FASTA input file.
–The first contains all of the sequence headers: textual information about the amino acid or nucleotide sequence.
–The second contains the compressed sequences (2 bits for each nucleotide, 5 bits for each amino acid).
–The third contains an index of the compressed sequences so that they can be matched with the corresponding headers.
The program runs in 3 rounds.
–Database scanning (table search or finite state machine)
–Seed growing
–Combining alignments
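The seed growing round can be sketched as an ungapped extension of an exact seed in both directions. This is a simplified illustration: it assumes a flat match/mismatch score instead of a substitution matrix and stops once the running score falls more than `xdrop` below the best score seen, a drop-off rule in the spirit of BLAST rather than its actual implementation. The function name and parameter defaults are hypothetical.

```python
def extend_seed(query, target, qpos, tpos, k, match=1, mismatch=-1, xdrop=2):
    """Extend the exact seed query[qpos:qpos+k] == target[tpos:tpos+k].

    Returns (score, query_start, query_end) of the best ungapped segment,
    with query_end exclusive.
    """

    def grow(qi, ti, step):
        # Walk away from the seed, remembering the best-scoring prefix.
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            score += match if query[qi] == target[ti] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > xdrop:
                break  # score dropped too far below the best; give up
            qi += step
            ti += step
        return best, best_len

    right_score, right_len = grow(qpos + k, tpos + k, 1)
    left_score, left_len = grow(qpos - 1, tpos - 1, -1)
    score = k * match + left_score + right_score
    return score, qpos - left_len, qpos + k + right_len
```

With `extend_seed("GGACGTACC", "TTACGTAGG", 2, 2, 4)` the seed `ACGT` grows by one matching base to the right and not at all to the left, giving `(5, 2, 7)`, i.e. the segment `ACGTA`.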
Pattern matching
(Character to Character Comparison)
(Under a preprocessing, path)
Sliding Window Comparison
Sliding Windows
Coding the sequence
–DNA/RNA: A: 00, T: 01, G: 10, C: 11
–Protein: 20 amino acids
K-tuple overlapping sliding windows
Sorting
–Bucket Sort
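A minimal sketch of the coding and sorting steps for DNA, using the 2-bit code listed above. Each overlapping k-tuple is packed into an integer of 2k bits, and the bucket sort places every window position into the bucket numbered by that integer. The function names are illustrative, and the sketch assumes the sequence contains only A, T, G and C (any other character raises a `KeyError`).

```python
# 2-bit code from the slide: A: 00, T: 01, G: 10, C: 11
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}


def encode_windows(seq, k):
    """Return (position, code) for every overlapping k-tuple of seq."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases (2k bits)
    value = 0
    windows = []
    for i, base in enumerate(seq):
        # Shift in the new base; the mask drops the base that left the window.
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            windows.append((i - k + 1, value))
    return windows


def bucket_sort_windows(seq, k):
    """Bucket sort: bucket number = k-tuple code, content = start positions."""
    buckets = [[] for _ in range(4 ** k)]
    for pos, code in encode_windows(seq, k):
        buckets[code].append(pos)
    return buckets
```

For example, `encode_windows("ATGC", 2)` returns `[(0, 1), (1, 6), (2, 11)]`, because AT = 0001, TG = 0110 and GC = 1011 in binary. The table has 4^k buckets, so this direct layout is only practical for small k.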
Table Search
Table Search
Indexing table
–overlapping or non-overlapping
–Indexing for the text or patterns
How to reduce the table size?
How to do the search?
How to do the filtration?
Approximate string matching? (It is still very hard to do…)
Bio-Problems
SNP finding?
Aligning ESTs to a whole genome?
Genome assembly?
Consensus and signature pattern finding?
Motif finding?
Part II: Advanced Concepts
Indexing Methods for Pattern Search and Motif Finding Problems
BLAT:
BLAT
Non-overlapping indexing table
Exact and approximate match (by statistical method)
Order concept
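The non-overlapping indexing table can be sketched as follows. The genome is cut into consecutive, non-overlapping k-tuples (tiles) that are stored in the table, while the query is scanned with overlapping k-tuples so that no alignment offset is missed. Function names are illustrative, and the sketch covers only exact seed hits, not the statistical handling of approximate matches.

```python
from collections import defaultdict


def build_tile_index(genome, k):
    """Index the non-overlapping k-tuples (tiles) of the genome."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index


def find_seed_hits(index, query, k):
    """Look up every overlapping k-tuple of the query in the tile index.

    Returns (query_pos, genome_pos, diagonal) triples; hits that share a
    diagonal and appear in increasing order support the same alignment.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, gpos, gpos - qpos))
    return hits
```

For example, with `index = build_tile_index("ACGTTGCAGGAT", 4)`, the query `GTTGCAGG` gives `find_seed_hits(index, "GTTGCAGG", 4) == [(2, 4, 2)]`: the tile `TGCA` is found, and diagonal 2 is where the query occurs in the genome. The table holds about 1/k as many entries as an overlapping index of the same genome.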
Using single unique markers (UMs) for the indexing table
Multiple-Unique Marker
Sandwich DP
MEME:
(not the traditional motif definition)
Degenerate motif discovery problem
Given a set of sequences S = {S_1, S_2, …, S_m | S_i ∈ {A, G, C, T}* for all i} and three nonnegative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S.
A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. (A degenerate position is a position occupied by a character other than A, G, C or T.)
e.g. ARATTYT is a degenerate (7,2)-motif (see the supplementary material)
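The definition can be checked for a single candidate pattern with a short sketch. It uses the standard IUPAC nucleotide codes and tests whether a given pattern is a degenerate (l, d)-motif that occurs in at least k of the sequences. It only verifies one candidate and does not enumerate all motifs, which is the hard part of the discovery problem; the function names are illustrative.

```python
# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_positions(motif):
    """Number of positions occupied by a character other than A, G, C or T."""
    return sum(1 for ch in motif if ch not in "ACGT")


def occurs_in(motif, seq):
    """True if some window of seq matches the motif position by position."""
    l = len(motif)
    return any(
        all(seq[i + j] in IUPAC[motif[j]] for j in range(l))
        for i in range(len(seq) - l + 1)
    )


def is_supported_motif(motif, sequences, l, d, k):
    """Check the (l, d) constraints and occurrence in at least k sequences."""
    if len(motif) != l or degenerate_positions(motif) > d:
        return False
    return sum(occurs_in(motif, s) for s in sequences) >= k
```

For the example pattern, `degenerate_positions("ARATTYT")` is 2 (R and Y). Given the made-up sequences `["GGAGATTCTCC", "AAATTTTGG", "CCCCCCCC"]`, the pattern occurs in the first two, so `is_supported_motif("ARATTYT", seqs, 7, 2, 2)` is `True` and the same call with k = 3 is `False`.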
New Challenge
Solexa and 454 short reads
New hardware support