Hugh E. Williams and Justin Zobel, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February 2002. Presented by Jitimon Keinduangjun.


1 Hugh E. Williams and Justin Zobel, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February 2002. Presented by Jitimon Keinduangjun

2 Agenda 1. Introduction 2. What are the problems? 3. What are other people doing? 4. Indexed Genomic Retrieval with CAFÉ 5. Experimental Results 6. Conclusion

3 1. Introduction
Biological sequence databases contain many DNA and protein sequences.
DNA (deoxyribonucleic acid) is the primary genetic material in all living organisms
–A molecule composed of two complementary nucleotide strands connected by base pairs, in which each base pairs with only one other base:
adenine (A) pairs with thymine (T)
guanine (G) pairs with cytosine (C)

4 1. Introduction (1)
A DNA sequence consists of
–4 letters: A G C T
–1 extra letter: N for unknown bases
DNA sequence database:
> gi|1786692|gb|AE000155|ECAE000155 Escherichia coli, tesA, ybbA genes from bases 510705 to 522297 (section 45 of 400) of the complete genome
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAAT
AGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTTCTACGA
AATAGACTAGAAATAGTCTAGTCTACG
> gb|L02373|ECORHSCA Escherichia coli Rhs core genes, complete cds
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAAT
AGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAAATAGACTAGAAATAGTCTAGT
CTACGAAATAGACTAGAAATAGCCTAGTTCTGTT
The character '>' separates the sequences; the line it begins identifies the sequence that follows.
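The record layout on this slide can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the decision to return `(header, sequence)` tuples are assumptions made for illustration, not part of the presented system.

```python
def parse_fasta(text):
    """Split '>'-delimited text into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """> seq1 first toy record
TAGAATAG
ATGNGA
> seq2 second toy record
ACGT
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

The toy records above are invented; real database entries follow the same pattern with much longer sequence lines.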

5 2. What are the problems?
2.1 Databases and query sequences contain low-quality sequences, so every technique must also improve the accuracy of query results
2.2 All techniques require long computation times

6 2.1 Low-quality DNA sequences
Substitutions, insertions, deletions
–Exact matching is not very effective
–Similarity search is required
Algorithms must find the segment pairs whose scores can be improved by allowing insertions and deletions
Query:      3 LTRYCA--GFTSLLKCNDADTIYDG 28
              ||||||  ||  |||||||||  ||
Subject: 3325 LTRYCAPAGFXALLKCNDADT--DG 3350

7 2.2 Long computation time required
Databases are very large and varied in content
A database contains many different sequences of variable length, so database search requires local similarity

8 3. What are other people doing?
3.1 SSEARCH Algorithm
–Uses Dynamic Programming (DP): very slow, very sensitive
3.2 BLAST Algorithm
–BLAST 1.4 (old version): ungapped alignment; fast, sensitive
–BLAST 2.0 (new version): gapped alignment; very fast, less sensitive
3.3 FASTA Algorithm
–Uses DP-based techniques: gapped alignment; slow, more sensitive

9 3.1 SSEARCH Algorithm: Edit distance and Dynamic Programming
Assume that the two given sequences are a and b
–n and m are the lengths of sequence a and sequence b, respectively
–$s(a_i, b_j)$: similarity score of the aligned pair of symbols $a_i$ and $b_j$
–Identical aligned pairs have a positive score of 1 and non-identical pairs have a score of 0
–Distance matrix D: $D_{i,0} = D_{0,j} = 0$ for i = 0,1,…,n and j = 0,1,…,m
$$D_{i,j} = \max \{\, D_{i-1,j},\; D_{i,j-1},\; D_{i-1,j-1} + s(a_i, b_j) \,\}$$
–Time complexity is O(n*m)

10 3.1 SSEARCH Algorithm (1)
Example: Pairwise alignment via DP
–Sequence a : ACGACA
–Sequence b : AGCAC
Matrix D (rows: sequence a; columns: sequence b):
      -  A  G  C  A  C
   -  0  0  0  0  0  0
   A  0  1  1  1  1  1
   C  0  1  1  2  2  2
   G  0  1  2  2  2  2
   A  0  1  2  2  3  3
   C  0  1  2  3  3  4
   A  0  1  2  3  4  4
Each cell $D_{i,j}$ is derived from $D_{i-1,j-1}$ (match) or from $D_{i-1,j}$ or $D_{i,j-1}$ (insert or delete).
Three possible alignments:
(1) a: ACGACA-
    b: A-G-CAC
(2) a: ACG-ACA
    b: A-GCAC-
(3) a: A-CGACA
    b: AGC-AC-
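The recurrence and the worked example can be checked with a short Python sketch. It assumes the scoring stated on the slides (1 for an identical pair, 0 otherwise, no gap penalty); the function name `score_matrix` is chosen here for illustration and does not come from SSEARCH itself.

```python
def score_matrix(a, b):
    """Fill D with D[i][j] = max(D[i-1][j], D[i][j-1], D[i-1][j-1] + s)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at 0, matching the boundary condition.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0  # identical pair scores 1
            D[i][j] = max(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1] + s)
    return D


D = score_matrix("ACGACA", "AGCAC")
for row in D:
    print(" ".join(str(value) for value in row))
print("best score:", D[-1][-1])  # bottom-right cell of the matrix
```

The two nested loops visit every cell once, which is where the O(n*m) running time comes from.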

11 3.2 BLAST Algorithm for DNA
Sequence A : Length N and Sequence B : Length M
Similarity scores for DNA:
–Match = 5, Mismatch = -4 (WU-BLAST)
–Match = 1, Mismatch = -3 (NCBI)
[Figure: generating the keyword tree (W=12); scanning for exact matches; the list of word hits; extending the hits]
Note: Extension consumes > 90% of all processing time.
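The scan-then-extend idea on this slide can be illustrated with a much simplified Python sketch. It is not BLAST's implementation: a dictionary stands in for the keyword tree, extension runs to the right only, and the names `word_hits`, `extend_right`, and the `x_drop` cutoff value are assumptions made for this example. The default scores follow the NCBI values quoted above.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = defaultdict(list)  # word -> positions in the query
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


def extend_right(query, subject, i, j, w, match=1, mismatch=-3, x_drop=5):
    """Ungapped rightward extension of a hit.

    Stops once the running score falls more than x_drop below the best
    score seen; returns (best_score, length_of_best_segment).
    """
    score = best = w * match
    best_len = k = w
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


query, subject = "ACGTACGTTT", "GGACGTACGAAA"
hits = word_hits(query, subject, w=4)
print(hits)
print(extend_right(query, subject, 0, 2, w=4))
```

Every hit triggers an extension loop over the surrounding bases, which is consistent with the note that extension dominates processing time.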

12 3.3 FASTA Algorithm for DNA
Sequence A : Length N and Sequence B : Length M
[Figure: generating the keyword tree (W=12); scanning for exact matches; the list of word hits; aligning the subsequences]

13 4. Indexed Genomic Retrieval with CAFÉ
4.1 Indexing with CAFÉ
4.2 Coarse Searching with CAFÉ (Filtering)
4.3 Fine Searching with CAFÉ, using the same method as FASTA

14 4.1 Indexing with CAFÉ
Inverted indexes consist of two components:
–A search structure
–Posting lists
Example of an inverted index entry:
ACCC 12,(3:144,154,962), 38,(2:47,1045)
The pattern ACCC occurs
–3 times in the 12th sequence, at offsets 144, 154, and 962
–2 times in the 38th sequence, at offsets 47 and 1045
The indices are compressed to reduce space; the compression is described in detail elsewhere.
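A posting list of this shape can be built with the following Python sketch. It is an uncompressed illustration only: the function names, the 1-based sequence numbers, and the 0-based offsets are assumptions, and the compression used by CAFÉ is not reproduced.

```python
from collections import defaultdict


def build_index(sequences, n):
    """Map each length-n substring to {sequence_number: [offsets]}."""
    index = defaultdict(dict)
    for seq_id, seq in enumerate(sequences, start=1):  # 1-based sequence numbers
        for offset in range(len(seq) - n + 1):         # 0-based offsets (assumed)
            index[seq[offset:offset + n]].setdefault(seq_id, []).append(offset)
    return index


def format_postings(postings):
    """Render postings as 'seq,(count:offset,offset,...)' entries."""
    return ", ".join(
        f"{seq_id},({len(offsets)}:{','.join(map(str, offsets))})"
        for seq_id, offsets in sorted(postings.items())
    )


index = build_index(["ACCCACCC", "GGACCC"], n=4)
print("ACCC", format_postings(index["ACCC"]))
```

Here the dictionary plays the role of the search structure and each inner dictionary is one posting list.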

15 4.2 Coarse Searching with CAFÉ
A novel ranking technique using the index structure
Score for ranking: COMBINED = COVERAGE - k*(LENGTH - COVERAGE)
Examples:
–COVERAGE = 9, LENGTH = 9
–COVERAGE = 21, LENGTH = 55
–COVERAGE = 6, LENGTH = 55
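The ranking score can be evaluated for the three COVERAGE/LENGTH pairs on this slide with a small Python sketch. The slide does not give a value for k, so k = 0.5 below is an arbitrary choice for illustration only, and the function name `combined` is likewise an assumption.

```python
def combined(coverage, length, k):
    """COMBINED = COVERAGE - k * (LENGTH - COVERAGE)."""
    return coverage - k * (length - coverage)


# (COVERAGE, LENGTH) pairs taken from the slide; k is an assumed value.
examples = [(9, 9), (21, 55), (6, 55)]
k = 0.5
ranked = sorted(examples, key=lambda pair: combined(*pair, k), reverse=True)
for coverage, length in ranked:
    print(coverage, length, combined(coverage, length, k))
```

With this k, the short fully covered match outranks the longer but sparsely covered ones; a smaller k would weaken that penalty on uncovered length.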

16 Example: Ranking by CAFÉ Homologous -chain hemoglobin Human - Chimpanzee Human - Rat Human - Potato

17 5. Experimental Results 5.1 Test Data 5.2 Space 5.3 Retrieval Effectiveness 5.4 Speed

18 5.1 Test Data
PIR database: used to assess the accuracy of the search system.
GenBank database: used to assess speed and index space requirements.

19 5.2 Space
Uncompressed index size: ~9.7 times the collection size
Compressed index size (CAFÉ index): ~2.2 times the collection size
The retrieval of uncompressed nucleotide data reduces the speed of the CAFÉ system

20 5.3 Retrieval Effectiveness

21 5.4 Speed

22 6. Conclusion
The CAFÉ system affords much faster query evaluation than exhaustive searching.
It gives better accuracy than the most widely used search tool, BLAST 2.
CAFÉ indices are smaller than the annotated source databases and than the indices of previous indexed systems.
