Published by Antony Lynch. Modified over 9 years ago.
1
Effective Parallel Multicore-optimized K-mers Counting Algorithm
Tomáš Farkaš, Peter Kubán, Mária Lucká
Slovak University of Technology, Faculty of Informatics and Information Technologies
Harrachov, Czech Republic
2
Content
Bioinformatics & k-mers
K-mers in bioinformatics
K-mers counting methods/algorithms/software
Algorithm preview: Preprocessing, Sorting, Counting
Experiments and results
Next (ongoing) work
3
Bioinformatics & k-mers
alphabet {A, C, G, T}: adenine (A), cytosine (C), guanine (G), thymine (T)
reads, sequences, ...
k-mer: substring of length k
counting k-mers: k-mer frequencies
“trivial task”
example reads:
ATGGAAATGGAATAATC
GAATCACGTAAACTTCG
GGGGGTAAACGTTCTTA
TTGGAAGTCGCGGAATC
AATCATGGAAGGTTCTT
CCGGAAGTCGTTAAACG
ATGGAAGTCGCGGAATC
Palpanas
4
Bioinformatics & k-mers
example read (length 17): ATGGAAATGGAAAGGTC
5
Bioinformatics & k-mers
example read (length 17): ATGGAAATGGAAAGGTC
its 11 7-mers: ATGGAAA TGGAAAT GGAAATG GAAATGG AAATGGA AATGGAA ATGGAAA TGGAAAG GGAAAGG GAAAGGT AAAGGTC
k-mers in read = read length - k + 1 (here 17 - 7 + 1 = 11)
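The sliding-window rule on this slide can be sketched in a few lines of Python. This is only an illustration of the definition, not the authors' implementation; the helper name `kmers` is hypothetical.

```python
def kmers(read, k):
    """Return every length-k substring of `read`, left to right.

    A read of length n yields n - k + 1 k-mers (none if k > n).
    """
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ATGGAAATGGAAAGGTC"      # the 17-base read from the slide
sevenmers = kmers(read, 7)

print(len(sevenmers))            # number of 7-mers: 17 - 7 + 1
print(sevenmers)
```

The window moves one base at a time, so consecutive k-mers overlap in k - 1 bases.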
6
Bioinformatics & k-mers
“trivial task” => but there are 4^k (4 to the power of k) possible k-mers
example read (length 17): ATGGAAATGGAAAGGTC
counts of its 7-mers: ATGGAAA 2x; TGGAAAT, GGAAATG, GAAATGG, AAATGGA, AATGGAA, TGGAAAG, GGAAAGG, GAAAGGT, AAAGGTC 1x each
k-mers in read = read length - k + 1
Palpanas (in his presentation "Big Sequence Management")
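A naive in-memory count of the example read, together with the 4^k growth of the k-mer space, can be sketched as follows. It uses Python's standard `collections.Counter`; the function name `count_kmers` is hypothetical, and a hash-based count like this is exactly the approach that stops scaling once the data no longer fits in memory.

```python
from collections import Counter


def count_kmers(read, k):
    """Count how often each k-mer occurs in a single read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


counts = count_kmers("ATGGAAATGGAAAGGTC", 7)
for kmer, n in counts.most_common():
    print(kmer, n)

# Size of the k-mer space over the alphabet {A, C, G, T}.
for k in (7, 16, 32):
    print(k, 4 ** k)
```

For k = 32 the space has 4^32 = 2^64 possible k-mers, which is why a 64-bit integer is just enough to hold one encoded k-mer.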
7
K-mers in bioinformatics
de novo assembly (De Bruijn graph)
error correction (reads, sequences)
repeat sequences detection
finding mutations
multiple sequence alignment
...
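To make the link between k-mers and de novo assembly concrete, here is a minimal sketch of a De Bruijn graph built from a list of k-mers: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name `de_bruijn_edges` is hypothetical and the sketch ignores reverse complements and sequencing errors, which real assemblers must handle.

```python
from collections import defaultdict


def de_bruijn_edges(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn_edges(["ATG", "TGG", "GGA", "GAA"])
for node, successors in graph.items():
    print(node, "->", successors)
```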
8
K-mers counting methods/algorithms/software
different methods: memory-based, disk-based, ...
using Bloom filter, sorting, ...
Tallymer (2008)
Jellyfish (2011)
BFCounter (2011)
DSK (2012)
MSPKmerCounter (2013)
KMC (2013), KMC 2 (2014)
KAnalyze (2014)
Khmer (2014)
Turtle (2014)
Our algorithm (2015)
...
9
Algorithm preview
our algorithm can be split into 3 stages (phases):
Preprocessing: k-mers generation, partitions identification
Sorting: k-mers distribution, nested Bucket sort algorithm
Counting
Parallelism: combination of OpenMPI and POSIX threads
10
Preprocessing
example: AAAAAACCCAAACGGC
k-mers generation: reads => k-mers
encoding k-mers: {A = 00, C = 01, G = 10, T = 11}, 64-bit number (k <= 32)
partitions identification: N highest bits, cumulative count
k-mers distribution*
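The 2-bit encoding can be sketched as below. The sketch assumes the usual mapping A = 00, C = 01, G = 10, T = 11, since every base needs its own code for the encoding to be reversible. The names `encode` and `decode` are hypothetical; Python integers are unbounded, so the k <= 32 check stands in for the 64-bit limit.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}   # assumed 2-bit codes
BASES = "ACGT"                                        # index = 2-bit code


def encode(kmer):
    """Pack a k-mer (k <= 32) into an integer, first base in the highest bits."""
    if len(kmer) > 32:
        raise ValueError("k-mer does not fit into 64 bits")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def decode(value, k):
    """Unpack an integer produced by encode() back into a k-base string."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


code = encode("AAAAAACCCAAACGGC")   # the 16-base example from the slide
print(code, decode(code, 16))
```

With the first base in the highest bits, comparing two encoded k-mers of the same length as integers gives the same order as comparing the strings alphabetically, which is what makes integer sorting usable in the next stage.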
11
Preprocessing
AAAAAACCCAAACGGC
AAAAACCTCAAAGGGC
AAAAAGTGCACACCAC
...
AACGAAGGCCCTTACA
AACGAAGGCCCTTACA
AACGAAGGCCGGGTTT
...
TTTTAAACGCCTAACTA
TTTTAACCGTCTTCTATT
TTTTATCGCTATTCTACT
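One way to read the "partitions identification" step (N highest bits, cumulative count) is sketched here. It assumes that the partition id is taken from the top N bits of the 2k-bit encoded k-mer and that the cumulative count is an exclusive prefix sum giving each partition's starting offset; the slides do not spell out these details, and the function names are hypothetical.

```python
from itertools import accumulate


def partition_id(code, k, n_bits):
    """Top n_bits of a 2k-bit encoded k-mer (requires n_bits <= 2 * k)."""
    return code >> (2 * k - n_bits)


def partition_offsets(codes, k, n_bits):
    """Return (counts, offsets): k-mers per partition and each partition's start."""
    counts = [0] * (1 << n_bits)
    for code in codes:
        counts[partition_id(code, k, n_bits)] += 1
    # Exclusive prefix sum: partition p starts where partitions 0..p-1 end.
    offsets = [0] + list(accumulate(counts))[:-1]
    return counts, offsets


# Encoded 2-mers AA, AC, CA, TT under A=00, C=01, G=10, T=11.
codes = [0b0000, 0b0001, 0b0100, 0b1111]
counts, offsets = partition_offsets(codes, k=2, n_bits=2)
print(counts, offsets)
```

With n_bits = 2 the partitions correspond to the first base of each k-mer, so every k-mer starting with A lands in partition 0, and so on.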
12
Sorting
masters, workers
k-mers distribution*
Bucket sort => nested Bucket sort
14
Sorting 100M elements (time breakdown)
1st pass: counts the elements in each group to be created; the cumulative counts give the group starting indices.
2nd pass: time needed to actually move the elements into their groups.
sorting: time required to sort the newly created groups.
Farkaš T.: Parallel Bucket sort algorithm for ordering short DNA sequences. In IIT.SRC 2015: Student research conference. 11th Students research conference in informatics and information technologies Bratislava, pp ISBN (2015)
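The two passes described above, followed by sorting inside each group, can be sketched as a sequential, recursive bucket sort on encoded k-mers. This is only a sketch under stated assumptions: buckets are chosen by the highest remaining bits, "nested" is interpreted as applying the same procedure inside each group, and the parameters `bucket_bits` and `small` are hypothetical. The parallel master/worker distribution from the slides is not modelled.

```python
def bucket_sort(codes, total_bits, bucket_bits=4, small=16):
    """Sort non-negative integers that fit into total_bits bits."""
    if len(codes) <= small or total_bits <= 0:
        return sorted(codes)            # small group: fall back to built-in sort
    bits = min(bucket_bits, total_bits)
    shift = total_bits - bits           # bucket key = highest `bits` bits
    mask = (1 << bits) - 1

    # 1st pass: count elements per group; cumulative counts give start indices.
    counts = [0] * (1 << bits)
    for c in codes:
        counts[(c >> shift) & mask] += 1
    starts = [0] * (1 << bits)
    for b in range(1, 1 << bits):
        starts[b] = starts[b - 1] + counts[b - 1]

    # 2nd pass: move every element into its group.
    out = [0] * len(codes)
    cursor = starts[:]
    for c in codes:
        b = (c >> shift) & mask
        out[cursor[b]] = c
        cursor[b] += 1

    # Sorting: order each group on its remaining lower bits (nested step).
    for b in range(1 << bits):
        lo, hi = starts[b], starts[b] + counts[b]
        if hi - lo > 1:
            out[lo:hi] = bucket_sort(out[lo:hi], shift, bucket_bits, small)
    return out


data = [(i * 7919) % 65536 for i in range(2000)] + [5, 5, 5]
result = bucket_sort(data, total_bits=16, bucket_bits=4, small=4)
print(result[:5], result[-5:])
```

Because the groups are independent after the second pass, they can be sorted concurrently, which is presumably what makes the scheme attractive for a multicore implementation.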
15
Counting sorted k-mers
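Once the encoded k-mers are sorted, equal k-mers sit next to each other, so counting reduces to a single linear scan. A minimal sketch using the standard `itertools.groupby` (the function name `count_sorted` is hypothetical):

```python
from itertools import groupby


def count_sorted(sorted_codes):
    """Return (k-mer code, count) pairs from an already sorted sequence."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(sorted_codes)]


pairs = count_sorted([1, 1, 2, 5, 5, 5])
print(pairs)
```

The scan needs no hash table and only constant extra memory per run of equal values.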
16
Experiments and results
Drosophila melanogaster (common fruit fly)
Illumina Genome Analyzer II (SRX040485)
genome size: MB
reads: # (30-fold), length 76

Dataset | FASTQ size (GB) | # reads | # unique k-mers | # distinct k-mers | total # k-mers
subset 1 | 2.1 | | | |
subset 2 | 4.1 | | | |
subset 3 | 6.1 | | | |
subset 4 | 8.1 | | | |
whole data | 9.8 | | | |
17
Experiments and results
how to compare?
1 node
limited by available memory

# reads \ time [s] | Jellyfish | KMC 2 | BFCounter | Our algorithm
 | 17.1 | 7.4 | 257 | 8.1
 | 32.4 | 14.3 | 443 | 12.4
 | 48.9 | 21.4 | 688 | 17.2
 | 54.5 | 28.9 | 887 | 22.3
 | 61.2 | 36.8 | 980 | 26.5
18
Experiments and results
19
Experiments and results
IBM iDataPlex dx360 M3
52 computing nodes connected by a high-speed network: 2x 10 Gb/s Ethernet (RoCE)
CPU: 2x 6 cores Intel Xeon X GHz
operating memory: 48 GB per node (24 GB per processor/socket, NUMA architecture)
disks: 1x 2 TB 7200 rpm SATA per node
Scientific Linux 6.4 (kernel el6)
OpenMPI (version )
20
Next (ongoing) work
more nodes
not limited by available memory
improve partitioning
test on larger datasets
...
available to download
21
Thank you
Questions?
Effective Parallel Multicore-optimized K-mers Counting Algorithm
Tomáš Farkaš, Peter Kubán, Mária Lucká