Effective Parallel Multicore-optimized K-mers Counting Algorithm

Name: Effective Parallel Multicore-optimized K-mers Counting Algorithm
Uploaded: 2017-10-08T01:53:58+00:00
Duration: PTM12S50
Channel: Antony Lynch
Description: Effective Parallel Multicore-optimized K-mers Counting Algorithm

Effective Parallel Multicore-optimized K-mers Counting Algorithm
Tomáš Farkaš, Peter Kubán, Mária Lucká
Slovak University of Technology, Faculty of Informatics and Information Technologies
Harrachov, Czech Republic

Content
Bioinformatics & k-mers
K-mers in bioinformatics
K-mers counting methods/algorithms/software
Algorithm preview
  Preprocessing
  Sorting
  Counting
Experiments and results
Next (ongoing) work

Bioinformatics & k-mers
Alphabet {A, C, G, T}: adenine (A), cytosine (C), guanine (G), thymine (T)
Input: reads, sequences, ...
k-mer: substring of length k
Counting k-mers: computing k-mers frequencies

Example reads:
ATGGAAATGGAATAATC
GAATCACGTAAACTTCG
GGGGGTAAACGTTCTTA
TTGGAAGTCGCGGAATC
AATCATGGAAGGTTCTT
CCGGAAGTCGTTAAACG
ATGGAAGTCGCGGAATC
Source: Palpanas (in his presentation "Big Sequence Management")

Example: the read ATGGAAATGGAAAGGTC (length 17) contains 11 7-mers:
ATGGAAA TGGAAAT GGAAATG GAAATGG AAATGGA AATGGAA ATGGAAA TGGAAAG GGAAAGG GAAAGGT AAAGGTC
ATGGAAA occurs 2x, every other 7-mer occurs 1x.
k-mers in read = read length – k + 1 (here 17 – 7 + 1 = 11)

Counting looks like a “trivial task”, but the number of possible k-mers is 4^k (4 to the power of k).
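The sliding-window definition above can be sketched in a few lines of Python. This is only an illustrative hash-map baseline, not the authors' parallel algorithm; the function names `kmers` and `count_kmers` are hypothetical.

```python
from collections import Counter

def kmers(read, k):
    """Yield every substring of length k in the read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def count_kmers(reads, k):
    """Count k-mer frequencies over a collection of reads with a hash map."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

read = "ATGGAAATGGAAAGGTC"           # the 17-base read from the slide
counts = count_kmers([read], 7)
print(sum(counts.values()))          # 11 = 17 - 7 + 1
print(counts["ATGGAAA"])             # 2, the only repeated 7-mer
```

A hash map like this works for small inputs, but with up to 4^k possible keys its memory use grows quickly, which is the motivation for the sorting-based approach in the following slides.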

K-mers in bioinformatics
de novo assembly (De Bruijn graph)
error correction (reads, sequences)
repeat sequences detection
finding mutations
multiple sequence alignment
...

K-mers counting methods/algorithms/software
Different methods: memory-based, disk-based, ...; using Bloom filter, sorting, ...
Tallymer (2008)
Jellyfish (2011)
BFCounter (2011)
DSK (2012)
MSPKmerCounter (2013)
KMC (2013), KMC 2 (2014)
KAnalyze (2014)
Khmer (2014)
Turtle (2014)
Our algorithm (2015)
...

Algorithm preview
Our algorithm can be split into 3 stages (phases):
Preprocessing: k-mers generation, partitions identification
Sorting: k-mers distribution, nested Bucket sort algorithm
Counting
Parallelism: a combination of OpenMPI and POSIX threads

Preprocessing
k-mers generation: reads => k-mers
Encoding k-mers: {A = 00, C = 01, G = 10, T = 11}, stored as a 64-bit number (k <= 32)
Partitions identification: N highest bits, cumulative count
k-mers distribution*

Example k-mers:
AAAAAACCCAAACGGC
AAAAACCTCAAAGGGC
AAAAAGTGCACACCAC
.
.
AACGAAGGCCCTTACA
AACGAAGGCCCTTACA
AACGAAGGCCGGGTTT
.
TTTTAAACGCCTAACTA
TTTTAACCGTCTTCTATT
TTTTATCGCTATTCTACT
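A minimal sketch of the encoding and partition identification steps, assuming the four bases get distinct 2-bit codes (A = 00, C = 01, G = 10, T = 11) and that the "N highest bits" are taken from the 2k bits actually used by the k-mer. The helper names `encode`, `partition_id`, and `partition_starts` are hypothetical, and Python integers stand in for the 64-bit numbers.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(kmer):
    """Pack a k-mer (k <= 32) into an integer, 2 bits per base, first base highest."""
    if len(kmer) > 32:
        raise ValueError("k must be at most 32 to fit in 64 bits")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def partition_id(value, k, n_bits):
    """Partition index = the n_bits highest bits of the 2k-bit code (assumption)."""
    return value >> (2 * k - n_bits)

def partition_starts(values, k, n_bits):
    """Count k-mers per partition, then derive cumulative start offsets."""
    counts = [0] * (1 << n_bits)
    for v in values:
        counts[partition_id(v, k, n_bits)] += 1
    starts, running = [], 0
    for c in counts:
        starts.append(running)
        running += c
    return counts, starts

codes = [encode(s) for s in ["ACGT", "AAAA", "TTTT", "CGTA"]]
print(encode("ACGT"))                    # 27 == 0b00011011
print(partition_starts(codes, 4, 2))     # ([2, 1, 0, 1], [0, 2, 3, 3])
```

With this bit layout, numeric order of the codes matches lexicographic order of the k-mers, so a partition defined by the highest bits holds all k-mers sharing the same leading bases.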

Sorting
masters, workers
k-mers distribution*
Bucket sort => nested Bucket sort

Sorting 100M elements (timing chart)
1st pass: counting the elements in the groups to be created; the cumulative counts form the group starting indices.
2nd pass: the time needed for actually moving the elements into their groups.
Sorting: the time required for sorting the newly created groups.

Farkaš T.: Parallel Bucket sort algorithm for ordering short DNA sequences. In: IIT.SRC 2015: Student Research Conference, 11th Student Research Conference in Informatics and Information Technologies, Bratislava, pp ISBN (2015)
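The two passes described above can be sketched as follows. This is a sequential, single-level illustration under stated assumptions: groups are chosen by the highest `prefix_bits` bits of each value, and Python's built-in `sorted` stands in for whatever is used to order each group (the nested variant would presumably bucket each group again). The function name `bucket_sort` and its parameters are hypothetical.

```python
def bucket_sort(values, total_bits, prefix_bits):
    """Sort non-negative integers of at most total_bits bits via prefix buckets."""
    shift = total_bits - prefix_bits
    n_groups = 1 << prefix_bits

    # 1st pass: count the elements of each group to be created.
    counts = [0] * n_groups
    for v in values:
        counts[v >> shift] += 1

    # Cumulative counts form the group starting indices.
    starts = [0] * n_groups
    for g in range(1, n_groups):
        starts[g] = starts[g - 1] + counts[g - 1]

    # 2nd pass: move every element into its group.
    out = [0] * len(values)
    cursor = list(starts)
    for v in values:
        g = v >> shift
        out[cursor[g]] = v
        cursor[g] += 1

    # Sort each newly created group; groups do not overlap,
    # so they could be handled independently of one another.
    for g in range(n_groups):
        lo, hi = starts[g], starts[g] + counts[g]
        out[lo:hi] = sorted(out[lo:hi])
    return out

data = [27, 0, 255, 108, 27, 3]
print(bucket_sort(data, total_bits=8, prefix_bits=2))
```

Because the groups occupy disjoint index ranges once the starting indices are known, each one is a natural unit of work for a separate thread or process.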

Counting sorted k-mers
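Once the encoded k-mers are sorted, equal k-mers sit next to each other, so counting reduces to a single linear scan over runs of equal values. A minimal sketch (the function name `count_sorted` is hypothetical):

```python
def count_sorted(sorted_values):
    """Return (value, count) pairs by scanning runs of equal neighbours."""
    runs = []
    for v in sorted_values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # same k-mer as the previous element
        else:
            runs.append([v, 1])       # a new distinct k-mer starts here
    return [(v, c) for v, c in runs]

print(count_sorted([0, 3, 27, 27, 108, 255]))
```

The scan needs no hash table, and its extra memory is proportional to the number of distinct k-mers only.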

Experiments and results
Drosophila melanogaster (common fruit fly)
Illumina Genome Analyzer II (SRX040485)
genome size: MB
reads: # (30-fold), length 76

Dataset      FASTQ size (GB)   # reads   # unique k-mers   # distinct k-mers   total # k-mers
subset 1     2,1
subset 2     4,1
subset 3     6,1
subset 4     8,1
whole data   9,8

how to compare?
1 node, limited by available memory

# reads \ time [s]   Jellyfish   KMC 2   BFCounter   Our algorithm
                     17,1        7,4     257         8,1
                     32,4        14,3    443         12,4
                     48,9        21,4    688         17,2
                     54,5        28,9    887         22,3
                     61,2        36,8    980         26,5

IBM iDataPlex dx360 M3
52 computing nodes connected by a high-speed network: 2x 10Gb/s Ethernet (RoCE)
CPU: 2x 6 cores Intel Xeon X GHz
Operating memory: 48GB per node (24GB per processor/socket, NUMA architecture)
Disks: 1x 2TB 7200rpm SATA per node
Scientific Linux 6.4 (kernel el6)
OpenMPI (version )

Next (ongoing) work
more nodes, not limited by available memory
improve partitioning
test on larger datasets
...
available to download

Thank you
Questions?
Tomáš Farkaš, Peter Kubán, Mária Lucká
