Pattern Search Problems Part I: Fundamental Concepts (Summer Course, 2016/1/27)


1 Pattern Search Problems Part I: Fundamental Concepts

3 FASTA: http://www.ebi.ac.uk/Tools/fasta33/index.html

4 FastP and FastA
FastA is an algorithm that attempts to speed up string matching over the standard optimal alignment. The FastA algorithm is implemented in the following 6 stages:
– Locate hot spots
– Find the 10 best regions in the matrix
– Score using a substitution matrix
– Combine initial regions from different diagonals
– Optimal alignment
– Presentation
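
The first two stages can be illustrated with a minimal Python sketch. It assumes that a "hot spot" is an exact k-tuple shared by the two sequences and that a region is summarized by its diagonal (query position minus target position). The function names and the default k are illustrative choices, not part of the FastA program.

```python
from collections import Counter, defaultdict

def hot_spot_diagonals(query, target, k=2):
    """Count exact k-tuple matches (hot spots) on each diagonal i - j."""
    # Lookup table: k-tuple -> list of start positions in the query.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    diagonals = Counter()
    for j in range(len(target) - k + 1):
        # Every query position with the same k-tuple is one hot spot.
        for i in positions.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals

def best_diagonals(query, target, k=2, top=10):
    """Return up to `top` diagonals ranked by their number of hot spots."""
    return hot_spot_diagonals(query, target, k).most_common(top)
```

Diagonals with many hot spots are the candidates that the later stages would rescore with a substitution matrix; that rescoring is not shown here.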

6 BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi

7 BLAST
The BLAST database consists of three files for every FastA file input.
– The first contains all of the sequence headers, textual information about the amino acid or nucleotide sequence.
– The second contains the compressed sequences (2 bits for each nucleotide, 5 bits for each amino acid).
– The third contains an index of the compressed sequences so that they can be matched with the corresponding headers.
The program runs in 3 rounds.
– Database scanning (table search or finite state machine)
– Seed growing
– Combining alignments
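
The seed growing round can be sketched as an ungapped extension in both directions that stops once the running score falls too far below the best score seen. This is a simplified illustration only: the match/mismatch scores, the `x_drop` threshold, and the function name `extend_seed` are assumptions, and the real program scores with substitution matrices.

```python
def extend_seed(query, subject, q_pos, s_pos, k,
                match=1, mismatch=-1, x_drop=2):
    """Grow an exact k-letter seed left and right without gaps.

    Returns (score, query_start, subject_start, length) of the extended
    segment. Scoring values are illustrative assumptions.
    """
    def extend(step, qi, si):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # fell too far below the best score: stop growing
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(1, q_pos + k, s_pos + k)
    left_score, left_len = extend(-1, q_pos - 1, s_pos - 1)
    score = k * match + left_score + right_score
    return score, q_pos - left_len, s_pos - left_len, left_len + k + right_len
```

For example, a 4-letter seed at position 3 of both `"AAACGTACGAAA"` and `"CCACGTACGTTT"` grows by one letter to the left and two to the right before the mismatches stop it.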

9 Pattern matching

10 (Character to Character Comparison)

25 (Under a preprocessing, path)

34 Sliding Window Comparison

35 Sliding Windows
Coding the sequence
– DNA/RNA: A: 00, T: 01, G: 10, C: 11
– Protein: 20 amino acids
K-tuple overlapping sliding windows
Sorting
– Bucket sort
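
As a rough illustration of these ideas, the sketch below packs each overlapping DNA k-tuple into an integer using the 2-bit coding from the slide and then distributes the window positions into buckets keyed by that integer. The function names are hypothetical, and only the DNA alphabet is handled.

```python
# 2-bit coding taken from the slide: A: 00, T: 01, G: 10, C: 11
CODE = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}

def ktuple_codes(seq, k):
    """Yield (start position, integer code) for every overlapping k-tuple."""
    mask = (1 << (2 * k)) - 1  # keep only the last k letters (2 bits each)
    code = 0
    for i, base in enumerate(seq):
        # Slide the window: shift in the new letter, drop the oldest one.
        code = ((code << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield i - k + 1, code

def bucket_sort_ktuples(seq, k):
    """Place window start positions into one bucket per possible k-tuple."""
    buckets = [[] for _ in range(4 ** k)]
    for pos, code in ktuple_codes(seq, k):
        buckets[code].append(pos)
    return buckets
```

Because each code is an integer in the range 0 to 4^k - 1, reading the buckets in order lists the windows sorted by k-tuple without any comparisons.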

36 Table Search

37 Table Search
Indexing table
– Overlapping or non-overlapping
– Indexing for the text or patterns
How to reduce the table size?
How to do the search?
How to do the filtration?
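
One possible answer to "how to do the search" and "how to do the filtration" is sketched below, assuming the table indexes overlapping k-tuples of the text and the pattern is at least k letters long. The table lookup filters the text down to a few candidate positions, and each candidate is then verified by direct comparison. `build_index` and `search` are illustrative names.

```python
from collections import defaultdict

def build_index(text, k):
    """Index table: k-tuple -> sorted list of start positions in the text."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

def search(text, index, pattern, k):
    """Find all exact occurrences of `pattern` using the k-tuple table."""
    if len(pattern) < k:
        raise ValueError("pattern must be at least k letters long")
    # Filtration: only positions whose first k letters match are candidates.
    candidates = index.get(pattern[:k], [])
    # Verification: compare the full pattern at each candidate position.
    return [i for i in candidates if text.startswith(pattern, i)]
```

A larger k gives fewer candidates per lookup but a larger table, which is the trade-off behind the table-size question on the slide.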

38 Approximate string matching? (It is still very hard to do…)

39 Bio-Problems
SNP finding?
Aligning ESTs to a whole genome?
Genome assembly?
Consensus and signature pattern finding?
Motif finding?

40 Part II: Advanced Concepts. Indexing Methods for Pattern Search and Motif Finding Problems

42 BLAT: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

43 BLAT
Non-overlapping indexing table
Exact and approximate match (by statistical method)
Order concept
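
The non-overlapping indexing table can be sketched as follows. The sketch assumes that the database is cut into consecutive, non-overlapping k-tuples, while the query is scanned with overlapping windows so that a match is found whatever its offset. Function names and parameters are illustrative and do not reproduce the BLAT program itself.

```python
from collections import defaultdict

def build_nonoverlapping_index(database, k):
    """Index only the k-tuples starting at positions 0, k, 2k, ..."""
    index = defaultdict(list)
    for i in range(0, len(database) - k + 1, k):
        index[database[i:i + k]].append(i)
    return index

def find_hits(query, index, k):
    """Return (query position, database position) for each exact k-tuple hit."""
    hits = []
    # Overlapping windows on the query compensate for the sparse index.
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            hits.append((q, d))
    return hits
```

Indexing every k-th position shrinks the table by roughly a factor of k compared with an overlapping index, at the cost of needing a longer shared region to guarantee a hit.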

47 Using Single UMs for indexing table

49 Multiple-Unique Marker

51 Sandwich DP

54 MEME: http://meme.nbcr.net/meme/intro.html

55 (not the traditional motif definition)

56 Degenerate motif discovery problem
Given a set of sequences S = {S_1, S_2, …, S_m | S_i ∈ {A, G, C, T}* for all i} and three nonnegative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S.
A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. (A degenerate position is a position occupied by a character other than A, G, C or T.)
E.g., ARATTYT is a degenerate (7,2)-motif (see the supplementary material).
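
The definition can be checked mechanically. The sketch below uses the standard IUPAC nucleotide codes; the function names are hypothetical, and the brute-force scan only tests a given candidate motif rather than discovering motifs.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def is_degenerate_motif(pattern, l, d):
    """True if `pattern` has length l and at most d degenerate positions."""
    if len(pattern) != l or any(c not in IUPAC for c in pattern):
        return False
    degenerate = sum(1 for c in pattern if c not in "ACGT")
    return degenerate <= d

def occurs_in(pattern, seq):
    """True if some window of `seq` is matched by the IUPAC pattern."""
    n = len(pattern)
    return any(
        all(seq[i + j] in IUPAC[c] for j, c in enumerate(pattern))
        for i in range(len(seq) - n + 1)
    )

def support(pattern, sequences):
    """Number of sequences containing at least one occurrence."""
    return sum(occurs_in(pattern, s) for s in sequences)
```

A candidate is a solution for a given k when `is_degenerate_motif(pattern, l, d)` holds and `support(pattern, S)` is at least k. For the slide's example, ARATTYT has two degenerate positions (R and Y), so it qualifies as a (7,2)-motif but not as a (7,1)-motif.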

57 New Challenge
Solexa and 454 short reads
New hardware support

