What we use to find similar sequences: BLAST
BLAST (PSI-BLAST) is optimized for finding homologs: sequences that share a common evolutionary origin.
Measuring similarity in proteins
Log-odds similarity table (the values match BLOSUM62)

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0  -2  -1   0  -4
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3  -1   0  -1  -4
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3   3   0  -1  -4
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3   4   1  -1  -4
C   0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1  -3  -3  -2  -4
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2   0   3  -1  -4
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3  -1  -2  -1  -4
H  -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3   0   0  -1  -4
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3  -3  -3  -1  -4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1  -4  -3  -1  -4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2   0   1  -1  -4
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1  -3  -1  -1  -4
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1  -3  -3  -1  -4
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2  -2  -1  -2  -4
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2   0   0   0  -4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0  -1  -1   0  -4
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3  -4  -3  -2  -4
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1  -3  -2  -1  -4
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4  -3  -2  -1  -4
B  -2  -1   3   4  -3   0   1  -1   0  -3  -4   0  -3  -3  -2   0  -1  -4  -3  -3   4   1  -1  -4
Z  -1   0   0   1  -3   3   4  -2   0  -3  -3   1  -1  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1
Measuring similarity in nucleic acids

Identity matrix:
     A   C   G   T
A    1   0   0   0
C    0   1   0   0
G    0   0   1   0
T    0   0   0   1

Transition/transversion matrix:
     A   C   G   T
A    3  -2  -1  -2
C   -2   3  -2  -1
G   -1  -2   3  -2
T   -2  -1  -2   3

BLAST DNA matrix:
     A   C   G   T
A    1  -3  -3  -3
C   -3   1  -3  -3
G   -3  -3   1  -3
T   -3  -3  -3   1
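To make the difference between the three nucleotide matrices concrete, here is a minimal Python sketch that scores one ungapped alignment under each of them. The function name `ungapped_score` and the two example sequences are illustrative assumptions, not part of the slides; only the matrix values are taken from the slide.

```python
BASES = "ACGT"  # row/column order used on the slide

IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
TRANSITION_TRANSVERSION = [[3, -2, -1, -2], [-2, 3, -2, -1],
                           [-1, -2, 3, -2], [-2, -1, -2, 3]]
BLAST_DNA = [[1, -3, -3, -3], [-3, 1, -3, -3],
             [-3, -3, 1, -3], [-3, -3, -3, 1]]


def ungapped_score(a, b, matrix):
    """Sum the per-position scores of two equal-length DNA strings (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[BASES.index(x)][BASES.index(y)] for x, y in zip(a, b))


# Hypothetical example: one G/A mismatch (a transition) in six positions.
seq1, seq2 = "ACGTAC", "ACATAC"
print(ungapped_score(seq1, seq2, IDENTITY))                 # 5
print(ungapped_score(seq1, seq2, TRANSITION_TRANSVERSION))  # 14
print(ungapped_score(seq1, seq2, BLAST_DNA))                # 2
```

The same mismatch costs nothing under the identity matrix, 1 point under the transition/transversion matrix, and 3 points under the BLAST DNA matrix.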
2. Needs have changed
- Genome assembly
- Localizing sequences in the genome
- Comparing mRNA with the genome sequence
These problems need neither such a complex similarity matrix nor an affine gap-penalty model.
3. Localizing SNPs
2 000 000 SNPs <-> human genome
50-500 bp <-> 3 000 000 000 bp

SNP      chr   location
SNP1     chr   location
SNP2     chr   location
SNP3     chr   location
...
SNP10    chr   location
...
4. Which tools to use
- BLAST - very slow
- MEGABLAST - slow
- SSAHA - ?
- BLAT - ?
- UNIMARKER (UM) - ?
5. SSAHA
Sequence Search and Alignment by Hashing Algorithm
Builds a table (index) of all the "words" in the genome and records where each one occurs in the genome.
Typical word length: 10 nt.
[Figure labels: step 1, step 2, step 10]
Applications of Bioinformatics
SSAHA
Genome fragment: TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTGCTGAAACAAAAAGTCATCTG
Query: GGGGGCATAACTGGAGAAGGAGAA
[Figure: index entries, as extracted: 16345 8817780 3322, 4448624 3323,1188375 3324 3325, 443565 3326]
Total number of words: 3*10^9, 4 bytes each = 12 GB
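The word-index idea on the SSAHA slides can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SSAHA implementation: the function names `build_index` and `candidate_offsets` are hypothetical, the index is a plain in-memory dictionary, and hits are combined by simple voting on the implied start position of the query.

```python
from collections import defaultdict

GENOME = ("TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTG"
          "CTGAAACAAAAAGTCATCTG")          # fragment shown on the slide
QUERY = "GGGGGCATAACTGGAGAAGGAGAA"         # query shown on the slide


def build_index(genome, k=10, step=1):
    """Map every k-letter word taken at the given step to its genome positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, step):
        index[genome[pos:pos + k]].append(pos)
    return index


def candidate_offsets(query, index, k=10):
    """For each implied genome start of the query, count the query words that support it."""
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            votes[gpos - qpos] += 1
    return dict(votes)


index = build_index(GENOME, k=10, step=1)
for start, support in sorted(candidate_offsets(QUERY, index).items()):
    print(f"query may start at genome position {start} ({support} matching words)")
```

Words from the part of the query that matches the genome all vote for the same start position, while words that overlap the mismatching tail find no entry in the index. A larger `step` shrinks the index at the cost of fewer supporting words per hit.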
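UM
Genome fragment: TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTGCTGAAACAAAAAGTCATCTG
Index construction:
- word length: 15
- number of combinations: 4^15 ≈ 10^9, so a word ID fits in 4 bytes
- number of unique variants: 162 000
Table columns: 15-mer ID, location
162 000 x (4 + 4) bytes = 1.3 GB
(Note: 162 000 x 8 bytes is about 1.3 MB; a total of 1.3 GB would correspond to about 162 000 000 entries.)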
Using this binary representation, we can process the DNA sequence using some of the bit operations used in computer science. For example, a left-shift operation adding 1 or 0, depending on the new nucleotide read in, will give the next N-mer DNA (Fig. 5A). Other operations to facilitate rapid searches of, say, complementary sequences should also be possible, although this was not explored in the present study.

The N-mers were then placed in a binary tree (Fig. 5B) as tree nodes, along with their chromosome ID, contig ID, sequence position on the contig, and their occurrence count and links to left child and right child, respectively, for subsequently encountered N-mers with the same row value but a larger or smaller column value (Fig. 5B). By traversing every tree node after the genome scanning was completed, all UMs of length N along with their genomic location were identified; they are the nodes with the occurrence count equal to one.
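The shift-based update and the "occurrence count equal to one" criterion can be illustrated with a short Python sketch. Several things here are assumptions rather than details from the quoted text: the 2-bit code (A=0, C=1, G=2, T=3), the function names, and the use of a dictionary in place of the binary tree described above.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code, not from the source


def rolling_kmers(seq, n):
    """Yield (start, code) for every N-mer, updating the code with one shift per base."""
    mask = (1 << (2 * n)) - 1          # keeps only the lowest 2*n bits
    code = 0
    for i, base in enumerate(seq):
        code = ((code << 2) | ENCODE[base]) & mask
        if i >= n - 1:
            yield i - n + 1, code


def unique_kmers(seq, n):
    """Return {code: position} for N-mers that occur exactly once in seq."""
    count, first_pos = {}, {}          # a dict stands in for the binary tree
    for pos, code in rolling_kmers(seq, n):
        count[code] = count.get(code, 0) + 1
        first_pos.setdefault(code, pos)
    return {code: first_pos[code] for code, c in count.items() if c == 1}


def decode(code, n):
    """Turn a 2-bit-packed code back into its N-letter string."""
    return "".join("ACGT"[(code >> (2 * (n - 1 - j))) & 3] for j in range(n))


# Hypothetical toy input: ACG occurs twice, so it is not reported as unique.
for code, pos in unique_kmers("ACGTACGA", 3).items():
    print(decode(code, 3), pos)
```

The sketch raises `KeyError` on any letter other than A, C, G or T; how such positions were handled in the original study is not stated in the quoted text.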
GenomeTester
Genome fragment: TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTGCTGAAACAAAAAGTCATCTG
Index construction:
- word length: 16
- number of combinations: 4^16 ≈ 4*10^9
- total number of words: 3*10^9, 4 + 4 bytes each = 24 GB
Table columns: 16-mer ID, location
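The index sizes quoted for SSAHA and GenomeTester are simple products of word count and bytes per entry. The short Python check below reproduces them, assuming decimal gigabytes (10^9 bytes); the helper name `index_size_gb` is illustrative.

```python
GB = 10**9  # decimal gigabyte; an assumption about the unit used on the slides


def index_size_gb(n_entries, bytes_per_entry):
    """Size of a flat index table in gigabytes."""
    return n_entries * bytes_per_entry / GB


genome_words = 3 * 10**9                   # one word per genome position
print(index_size_gb(genome_words, 4))      # location only: 12.0
print(index_size_gb(genome_words, 4 + 4))  # word ID + location: 24.0
print(4**16 == 2**32)                      # a 16-mer ID fits exactly in 4 bytes: True
```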
Figure 2. GenomeTester is significantly faster at the 'genome test' than any other program. The 'genome test' here means finding the locations of all primers (16 nt from the 3' end) in the human genome and calculating the possible PCR products. Tests were performed on a PC-Linux based server (Pentium III, 2 GB RAM, SCSI-RAID0 hard drives). BLAST and MEGABLAST were used without the DUST filter; the word length was 12 for MEGABLAST and 10 for SSAHA.
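As a rough illustration of what the 'genome test' involves, the Python sketch below locates the 3' tail of each primer and pairs the hits into candidate products. It is not GenomeTester's method: it uses a plain string scan instead of a precomputed index, considers only a forward primer on the given strand and a reverse primer on the opposite strand, and all names (`find_all`, `pcr_products`, the `tail` and `max_len` parameters) and the toy sequences are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_all(genome, word):
    """All start positions of word in genome, including overlapping ones."""
    hits, start = [], genome.find(word)
    while start != -1:
        hits.append(start)
        start = genome.find(word, start + 1)
    return hits


def pcr_products(genome, forward, reverse, tail=16, max_len=1000):
    """Return (start, end) of candidate products implied by the primers' 3' tails."""
    fwd_tail = forward[-tail:]
    rev_site = reverse_complement(reverse[-tail:])   # reverse tail as seen on this strand
    products = []
    for f in find_all(genome, fwd_tail):
        start = f + len(fwd_tail) - len(forward)     # where the full forward primer begins
        for r in find_all(genome, rev_site):
            end = r + len(reverse)                   # where the full reverse primer ends
            if start >= 0 and r >= f and end <= len(genome) and end - start <= max_len:
                products.append((start, end))
    return products


# Hypothetical toy example with short primers and a 4-nt tail.
toy_genome = "AAACGATTACATTTTTCCGGAATGGG"
print(pcr_products(toy_genome, "GATTACA", "ATTCCGG", tail=4, max_len=100))
```

With many primers and a genome of 3*10^9 bp, repeated scans like `find_all` become far too slow, which is the motivation for looking the tails up in a word index instead.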