Effective Parallel Multicore-optimized K-mers Counting Algorithm
Tomáš Farkaš, Peter Kubán, Mária Lucká
Harrachov, Czech Republic
Slovak University of Technology, Faculty of Informatics and Information Technologies
Content
- Bioinformatics & k-mers
- K-mers in bioinformatics
- K-mers counting methods/algorithms/software
- Algorithm preview
- Preprocessing
- Sorting
- Counting
- Experiments and results
- Next (ongoing) work
Bioinformatics & k-mers
- alphabet {A, C, G, T}: adenine (A), cytosine (C), guanine (G), thymine (T)
- reads, sequences, ...
- k-mer: substring of length k
- counting k-mers: k-mer frequencies
- a "trivial task"

Example reads:
ATGGAAATGGAATAATC
GAATCACGTAAACTTCG
GGGGGTAAACGTTCTTA
TTGGAAGTCGCGGAATC
AATCATGGAAGGTTCTT
CCGGAAGTCGTTAAACG
ATGGAAGTCGCGGAATC
Bioinformatics & k-mers
Read of length 17:
ATGGAAATGGAAAGGTC

Its 7-mers:
ATGGAAA
TGGAAAT
GGAAATG
GAAATGG
AAATGGA
AATGGAA
ATGGAAA
TGGAAAG
GGAAAGG
GAAAGGT
AAAGGTC

- 11 7-mers: k-mers in read = read length - k + 1
- ATGGAAA occurs 2x, every other 7-mer 1x
- a "trivial task" => but there are 4^k (4 to the power of k) possible k-mers (Palpanas, in his presentation "Big Sequence Management")
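The 7-mer example can be reproduced with a few lines of Python. This is only a minimal sketch of naive k-mer counting with a hash map, not the algorithm presented in this talk; the helper name `kmers` is made up for illustration.

```python
from collections import Counter

def kmers(read, k):
    """Yield every substring of length k, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

read = "ATGGAAATGGAAAGGTC"   # the 17-base read from the slide
k = 7
counts = Counter(kmers(read, k))

print(sum(counts.values()))   # 11 k-mers in total (17 - 7 + 1)
print(counts["ATGGAAA"])      # 2, the only 7-mer that repeats
print(len(counts))            # 10 different 7-mers
print(4 ** k)                 # 16384 possible 7-mers over {A, C, G, T}
```

A hash map like this is fine for one read, but the 4^k key space is what makes the task non-trivial once k and the number of reads grow.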
K-mers in bioinformatics
- de novo assembly (De Bruijn graph)
- error correction (reads, sequences)
- repeat sequences detection
- finding mutations
- multiple sequence alignment
- ...
K-mers counting methods/algorithms/software
- different methods: memory-based, disk-based, ...
- using Bloom filter, sorting, ...
- Tallymer (2008)
- Jellyfish (2011)
- BFCounter (2011)
- DSK (2012)
- MSPKmerCounter (2013)
- KMC (2013), KMC 2 (2014)
- KAnalyze (2014)
- Khmer (2014)
- Turtle (2014)
- Our algorithm (2015)
- ...
Algorithm preview
The algorithm can be split into 3 stages (phases):
- Preprocessing
  - k-mers generation
  - partitions identification
- Sorting
  - k-mers distribution
  - nested Bucket sort algorithm
- Counting

Parallelism: a combination of OpenMPI and POSIX threads
Preprocessing
- k-mers generation: reads => k-mers
- encoding k-mers
  - {A = 00, C = 01, G = 10, T = 11}
  - 64-bit number (k <= 32)
- partitions identification
  - N highest bits
  - cumulative count
- k-mers distribution*

Example:
AAAAAACCCAAACGGC
0000000000000101010000000110100100000000000000000000000000000000
1478194599297024
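The 2-bit encoding can be sketched as follows. The sketch assumes T = 11 (four bases need four distinct codes) and that the k-mer is left-aligned in the 64-bit word, which is what the example value on the slide implies; `encode` and `decode` are illustrative names, not the authors' code.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # T = 11 is assumed here

def encode(kmer):
    """Pack a k-mer (k <= 32) into a 64-bit integer, first base in the top bits."""
    k = len(kmer)
    if k > 32:
        raise ValueError("a 64-bit word holds at most 32 bases")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value << (64 - 2 * k)   # left-align: unused low bits stay zero

def decode(value, k):
    """Recover the first k bases from a left-aligned 64-bit value."""
    bases = "ACGT"
    return "".join(bases[(value >> (62 - 2 * i)) & 0b11] for i in range(k))

x = encode("AAAAAACCCAAACGGC")
print(format(x, "064b"))   # 64-bit binary form, padded with zeros on the right
print(x)                   # 1478194599297024
print(decode(x, 16))       # AAAAAACCCAAACGGC
```

Because the first base lands in the most significant bits, comparing the integers gives the same order as comparing the k-mers alphabetically (A < C < G < T), so sorting integers sorts k-mers.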
Preprocessing
- partitions identification
  - N highest bits
  - cumulative count
- k-mers distribution*

AAAAAACCCAAACGGC
AAAAACCTCAAAGGGC
AAAAAGTGCACACCAC
.
.
AACGAAGGCCCTTACA
AACGAAGGCCCTTACA
AACGAAGGCCGGGTTT
.
TTTTAAACGCCTAACTA
TTTTAACCGTCTTCTATT
TTTTATCGCTATTCTACT
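One way to read "N highest bits" and "cumulative count" is sketched below: the top N bits of each encoded k-mer select a partition, and a prefix sum over the partition sizes gives each partition's starting offset. The function names, the choice N = 6 and the T = 11 code are assumptions made for illustration; how the presented algorithm chooses N is not stated on the slide.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2-bit codes, T = 11 assumed

def encode(kmer):
    """Left-aligned 64-bit encoding of a k-mer (k <= 32)."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value << (64 - 2 * len(kmer))

def partition_id(encoded, n_bits):
    """Partition index = the n_bits most significant bits of the 64-bit value."""
    return encoded >> (64 - n_bits)

def partition_starts(encoded_kmers, n_bits):
    """Count k-mers per partition, then prefix-sum the counts into start offsets."""
    counts = [0] * (1 << n_bits)
    for x in encoded_kmers:
        counts[partition_id(x, n_bits)] += 1
    starts, running = [], 0
    for c in counts:
        starts.append(running)
        running += c
    return counts, starts

sample = ["AAAAAACCCAAACGGC", "AAAAACCTCAAAGGGC", "AAAAAGTGCACACCAC",
          "AACGAAGGCCCTTACA", "AACGAAGGCCCTTACA", "AACGAAGGCCGGGTTT"]
counts, starts = partition_starts([encode(s) for s in sample], n_bits=6)
print(counts[:3])   # [3, 3, 0]  partitions for prefixes AAA, AAC, AAG
print(starts[:3])   # [0, 3, 6]
```

With N = 6 the partition is determined by the first three bases, so the three AAA... k-mers and the three AAC... k-mers from the slide land in partitions 0 and 1.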
Sorting
- masters, workers
- k-mers distribution*
- Bucket sort => nested Bucket sort
Sorting
Time measurements for 100M elements:
- 1st pass: counting the elements in the groups to be created; the cumulative counts form the group starting indices.
- 2nd pass: the time needed for actually shifting the elements into the groups.
- sorting: the time required for sorting the newly created groups.

Farkaš, T.: Parallel Bucket sort algorithm for ordering short DNA sequences. In: IIT.SRC 2015: Student Research Conference, 11th Student Research Conference in Informatics and Information Technologies, Bratislava, pp. 77-82. ISBN 978-80-227-4342-6 (2015)
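The two passes described above can be sketched sequentially in Python. This is a minimal single-threaded illustration under stated assumptions, not the parallel implementation from the cited paper: the number of bits per level (`bits`), the cutoff for small groups (`small`) and the fallback to the built-in sort are all choices made here for the example.

```python
def bucket_pass(keys, shift, bits):
    """Group keys by the bit field [shift, shift + bits) in two passes."""
    n_buckets = 1 << bits
    mask = n_buckets - 1

    # 1st pass: count the elements that fall into each group.
    counts = [0] * n_buckets
    for x in keys:
        counts[(x >> shift) & mask] += 1

    # Cumulative counts give each group's starting index.
    starts = [0] * (n_buckets + 1)
    for b in range(n_buckets):
        starts[b + 1] = starts[b] + counts[b]

    # 2nd pass: move every element into its group.
    out = [0] * len(keys)
    cursor = starts[:-1]          # slice copy: next free slot per group
    for x in keys:
        b = (x >> shift) & mask
        out[cursor[b]] = x
        cursor[b] += 1
    return out, starts


def nested_bucket_sort(keys, top_bit=64, bits=8, small=32):
    """Bucket by the highest bits, then recurse into each group on the next bits."""
    if len(keys) <= small or top_bit <= 0:
        return sorted(keys)       # small group: fall back to the built-in sort
    bits = min(bits, top_bit)
    shift = top_bit - bits
    out, starts = bucket_pass(keys, shift, bits)
    result = []
    for b in range(1 << bits):
        group = out[starts[b]:starts[b + 1]]
        if group:
            result.extend(nested_bucket_sort(group, shift, bits, small))
    return result


print(nested_bucket_sort([5, 1 << 63, 3, 1 << 40, 3], small=2))
```

Since every group covers a disjoint key range, groups can be sorted independently of each other, which is what makes the scheme a natural fit for distributing work among workers.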
Counting
- sorted k-mers
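Once the k-mers are sorted, equal k-mers sit next to each other, so counting reduces to one linear scan over runs of equal neighbours. The sketch below shows that idea on strings for readability; `count_sorted` is an illustrative name, and the definitions of "distinct" (different k-mers) and "unique" (k-mers seen exactly once) follow the usual convention, which is assumed here rather than taken from the slides.

```python
def count_sorted(sorted_kmers):
    """Collapse runs of equal neighbours into (k-mer, count) pairs."""
    result = []
    run_value, run_length = None, 0
    for x in sorted_kmers:
        if run_length and x == run_value:
            run_length += 1
        else:
            if run_length:
                result.append((run_value, run_length))
            run_value, run_length = x, 1
    if run_length:
        result.append((run_value, run_length))
    return result

pairs = count_sorted(sorted(["AAC", "ACG", "AAC", "TTT", "ACG", "AAC"]))
print(pairs)                               # [('AAC', 3), ('ACG', 2), ('TTT', 1)]

total = sum(c for _, c in pairs)           # 6 k-mers in total
distinct = len(pairs)                      # 3 different k-mers
unique = sum(1 for _, c in pairs if c == 1)  # 1 k-mer seen exactly once
print(total, distinct, unique)             # 6 3 1
```

The same scan works unchanged on 64-bit encoded k-mers, since it only relies on equality of neighbouring elements.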
Experiments and results
- Drosophila melanogaster (common fruit fly)
- Illumina Genome Analyzer II (SRX040485)
- genome size: 139.5 MB
- reads: 48 432 878 (30-fold), length 76

Dataset      FASTQ size (GB)   # reads       # unique k-mers   # distinct k-mers   total # k-mers
subset 1     2.1               10 000 000    140 948 976       199 840 105         458 926 804
subset 2     4.1               20 000 000    187 804 272       272 601 565         906 605 778
subset 3     6.1               30 000 000    223 636 699       321 865 881         1 364 337 258
subset 4     8.1               40 000 000    259 588 501       365 877 941         1 821 845 026
whole data   9.8               48 432 878    289 202 942       394 953 130         2 207 175 063
Experiments and results
- how to compare?
- 1 node
- limited by available memory

Time [s] by number of reads:

# reads       Jellyfish   KMC 2   BFCounter   Our algorithm
10 000 000    17.1        7.4     257         8.1
20 000 000    32.4        14.3    443         12.4
30 000 000    48.9        21.4    688         17.2
40 000 000    54.5        28.9    887         22.3
48 432 878    61.2        36.8    980         26.5
Experiments and results
- IBM iDataPlex dx360 M3
- 52 computing nodes connected by a high-speed network: 2x 10 Gb/s Ethernet (RoCE)
- CPU: 2x 6-core Intel Xeon X5670, 2.93 GHz
- memory: 48 GB per node (24 GB per processor/socket, NUMA architecture)
- disk: 1x 2 TB 7200 rpm SATA per node
- Scientific Linux 6.4 (kernel 2.6.32-358.el6)
- OpenMPI (version 1.6.5)
Next (ongoing) work
- more nodes
- not limited by available memory
- improve partitioning
- test on larger datasets
- ...
- available to download
Thank you
Questions?
Tomáš Farkaš, Peter Kubán, Mária Lucká