Effective Parallel Multicore-optimized K-mers Counting Algorithm

Presentation transcript:

Effective Parallel Multicore-optimized K-mers Counting Algorithm
Tomáš Farkaš, Peter Kubán, Mária Lucká
Slovak University of Technology, Faculty of Informatics and Information Technologies
Harrachov, Czech Republic

Content
- Bioinformatics & k-mers
- K-mers in bioinformatics
- K-mers counting methods/algorithms/software
- Algorithm preview
  - Preprocessing
  - Sorting
  - Counting
- Experiments and results
- Next (ongoing) work

Bioinformatics & k-mers
- alphabet {A, C, G, T}: adenine (A), cytosine (C), guanine (G), thymine (T)
- reads, sequences, ...
- k-mer: substring of length k
- counting k-mers: k-mer frequencies
- “trivial task”
- example reads (Palpanas):
  ATGGAAATGGAATAATC
  GAATCACGTAAACTTCG
  GGGGGTAAACGTTCTTA
  TTGGAAGTCGCGGAATC
  AATCATGGAAGGTTCTT
  CCGGAAGTCGTTAAACG
  ATGGAAGTCGCGGAATC

Bioinformatics & k-mers
- example read (length 17): ATGGAAATGGAAAGGTC
- its 11 7-mers, in order: ATGGAAA, TGGAAAT, GGAAATG, GAAATGG, AAATGGA, AATGGAA, ATGGAAA, TGGAAAG, GGAAAGG, GAAAGGT, AAAGGTC
- ATGGAAA occurs 2x; every other 7-mer occurs 1x
- k-mers in read = read length - k + 1
- “trivial task” => but there are 4^k possible k-mers (4 to the power of k)
- Palpanas (in his presentation "Big Sequence Management")
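The worked example on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch using a hash-based counter, not the sorting-based method the talk goes on to present; the function name `kmers` is an assumption made for the example.

```python
from collections import Counter


def kmers(read: str, k: int):
    """Yield every substring of length k of the read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


read = "ATGGAAATGGAAAGGTC"  # the 17-base read from the slide
k = 7
counts = Counter(kmers(read, k))

print(len(read) - k + 1)   # number of k-mers in the read: 11
print(counts["ATGGAAA"])   # the only 7-mer that occurs twice: 2
print(len(counts))         # distinct 7-mers in this read: 10
print(4 ** k)              # number of possible 7-mers: 16384
```

The last line shows why the task stops being trivial: the space of possible k-mers grows as 4^k, so a dense table indexed by k-mer is only feasible for small k.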

K-mers in bioinformatics
- de novo assembly (De Bruijn graph)
- error correction (reads, sequences)
- repeat sequences detection
- finding mutations
- multiple sequence alignment
- ...

K-mers counting methods/algorithms/software
- different methods: memory-based, disk-based, ...; using Bloom filter, sorting, ...
- Tallymer (2008)
- Jellyfish (2011)
- BFCounter (2011)
- DSK (2012)
- MSPKmerCounter (2013)
- KMC (2013), KMC 2 (2014)
- KAnalyze (2014)
- Khmer (2014)
- Turtle (2014)
- Our algorithm (2015)
- ...

Algorithm preview
- Parallelism: a combination of OpenMPI and POSIX threads
- The algorithm can be split into 3 stages (phases):
  - Preprocessing: k-mers generation, partitions identification
  - Sorting: k-mers distribution, nested Bucket sort algorithm
  - Counting

Preprocessing
- k-mers generation: reads => k-mers
- encoding k-mers: {A = 00, C = 01, G = 10, T = 11}, stored as a 64-bit number (k <= 32)
- partitions identification: N highest bits, cumulative count
- k-mers distribution*
- example: AAAAAACCCAAACGGC => 0000000000000101010000000110100100000000000000000000000000000000 (binary) = 1478194599297024
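The encoding example can be checked with a short sketch. It assumes T is coded as 11 (so that all four codes are distinct) and that the 2k code bits are left-aligned in the 64-bit word, which is what the binary string on the slide shows. The names `CODE` and `encode` are illustrative and are not taken from the authors' implementation.

```python
# 2-bit code per base; T = 11 is assumed so that the four codes are distinct.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into a 64-bit integer, left-aligned."""
    k = len(kmer)
    if not 1 <= k <= 32:
        raise ValueError("k must be between 1 and 32")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    # Shift the 2k used bits up to the most significant end of the word.
    return value << (64 - 2 * k)


encoded = encode("AAAAAACCCAAACGGC")
print(format(encoded, "064b"))
print(encoded)  # 1478194599297024, as on the slide
```

With this layout, comparing two encoded k-mers of the same length as unsigned integers gives the same order as comparing the strings alphabetically, which is what makes integer sorting applicable in the next stage.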

Preprocessing
- k-mers generation: reads => k-mers
- encoding k-mers: {A = 00, C = 01, G = 10, T = 11}, stored as a 64-bit number (k <= 32)
- partitions identification: N highest bits, cumulative count
- k-mers distribution*
- example k-mers:
  AAAAAACCCAAACGGC
  AAAAACCTCAAAGGGC
  AAAAAGTGCACACCAC
  ...
  AACGAAGGCCCTTACA
  AACGAAGGCCCTTACA
  AACGAAGGCCGGGTTT
  ...
  TTTTAAACGCCTAACTA
  TTTTAACCGTCTTCTATT
  TTTTATCGCTATTCTACT
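One possible reading of "partitions identification: N highest bits, cumulative count" is sketched below: the N most significant bits of each encoded k-mer select a partition, and a running sum of the partition sizes gives each partition's starting offset. How the authors actually use the cumulative counts (for example, to balance partitions across workers) is not stated on the slide, so the helper names and the sample values here are assumptions.

```python
from itertools import accumulate


def partition_id(encoded: int, n_bits: int) -> int:
    """Partition index = the n_bits most significant bits of a 64-bit k-mer."""
    return encoded >> (64 - n_bits)


def partition_layout(encoded_kmers, n_bits: int):
    """Return (counts, starts): size and starting offset of every partition."""
    counts = [0] * (1 << n_bits)
    for e in encoded_kmers:
        counts[partition_id(e, n_bits)] += 1
    # Exclusive prefix sum: starts[p] = number of k-mers in partitions < p.
    starts = [0] + list(accumulate(counts))[:-1]
    return counts, starts


# Four hypothetical encoded k-mers; with n_bits = 4 two fall into
# partition 0 and two into partition 15.
sample = [
    0x0005406900000000,
    0x0600000000000000,
    0xFC00000000000000,
    0xFF00000000000000,
]
counts, starts = partition_layout(sample, n_bits=4)
print(counts[0], counts[15])   # 2 2
print(starts[0], starts[15])   # 0 2
```

Because the encoding is left-aligned, all k-mers in partition p are smaller than all k-mers in partition p + 1, so partitions can be sorted independently and concatenated.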

Sorting
- masters, workers
- k-mers distribution*
- Bucket sort => nested Bucket sort

Sorting 100M elements
- 1st pass: counts the elements in the groups to be created; the cumulative counts form the group starting indices.
- 2nd pass: the time needed to actually move the elements into their groups.
- sorting: the time required to sort the newly created groups.
Farkaš, T.: Parallel Bucket sort algorithm for ordering short DNA sequences. In: IIT.SRC 2015: Student Research Conference, 11th Student Research Conference in Informatics and Information Technologies, Bratislava, pp. 77-82. ISBN 978-80-227-4342-6 (2015)
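The two passes described above can be sketched sequentially as follows. This is a single-threaded illustration of the idea (count, prefix-sum, move, then sort each group by recursing on the next lower bits), not the authors' parallel implementation; the function name and the parameters `bits` and `small` are assumptions chosen for the example.

```python
def nested_bucket_sort(values, shift=56, bits=8, small=32):
    """Sort unsigned 64-bit integers by bucketing on successive bit groups."""
    if len(values) <= small or shift < 0:
        return sorted(values)  # small group: fall back to the built-in sort
    mask = (1 << bits) - 1

    # 1st pass: count how many elements fall into each group.
    counts = [0] * (1 << bits)
    for v in values:
        counts[(v >> shift) & mask] += 1

    # Cumulative counts form the starting index of each group.
    starts = [0] * (1 << bits)
    total = 0
    for b, c in enumerate(counts):
        starts[b] = total
        total += c

    # 2nd pass: move every element into its group.
    out = [0] * len(values)
    cursor = starts[:]
    for v in values:
        b = (v >> shift) & mask
        out[cursor[b]] = v
        cursor[b] += 1

    # Sorting: each newly created group is sorted on the next lower bits.
    for b, c in enumerate(counts):
        if c > 1:
            lo = starts[b]
            out[lo:lo + c] = nested_bucket_sort(out[lo:lo + c], shift - bits, bits, small)
    return out


print(nested_bucket_sort([3 << 60, 1 << 60, 2 << 60, 1 << 60], small=1))
```

The groups produced by the 2nd pass are independent of each other, which is what makes this scheme attractive for distributing work among threads or nodes.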

Counting sorted k-mers
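Once the k-mers are sorted, equal k-mers sit next to each other, so counting reduces to a single linear scan over runs of equal values. The sketch below illustrates that idea with the standard library; it is an assumption about how this stage works in principle, not the authors' code, and `count_sorted` is a name made up for the example.

```python
from itertools import groupby


def count_sorted(sorted_kmers):
    """Return (k-mer, count) pairs from an already sorted sequence."""
    return [(kmer, sum(1 for _ in run)) for kmer, run in groupby(sorted_kmers)]


# The 7-mers of the read ATGGAAATGGAAAGGTC, sorted alphabetically.
read = "ATGGAAATGGAAAGGTC"
sorted_7mers = sorted(read[i:i + 7] for i in range(len(read) - 6))
for kmer, count in count_sorted(sorted_7mers):
    print(kmer, count)
```

The same function works unchanged on the 64-bit integer encoding, since it only relies on equal items being adjacent.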

Experiments and results
- Drosophila melanogaster (common fruit fly)
- Illumina Genome Analyzer II (SRX040485)
- genome size: 139.5 MB
- reads: 48 432 878 (30-fold), length 76
Dataset | FASTQ size (GB) | # reads | # unique k-mers | # distinct k-mers | total # k-mers
subset 1 | 2.1 | 10 000 000 | 140 948 976 | 199 840 105 | 458 926 804
subset 2 | 4.1 | 20 000 000 | 187 804 272 | 272 601 565 | 906 605 778
subset 3 | 6.1 | 30 000 000 | 223 636 699 | 321 865 881 | 1 364 337 258
subset 4 | 8.1 | 40 000 000 | 259 588 501 | 365 877 941 | 1 821 845 026
whole data | 9.8 | 48 432 878 | 289 202 942 | 394 953 130 | 2 207 175 063

Experiments and results
- how to compare?
- 1 node, limited by available memory
# reads \ time [s] | Jellyfish | KMC 2 | BFCounter | Our algorithm
10 000 000 | 17.1 | 7.4 | 257 | 8.1
20 000 000 | 32.4 | 14.3 | 443 | 12.4
30 000 000 | 48.9 | 21.4 | 688 | 17.2
40 000 000 | 54.5 | 28.9 | 887 | 22.3
48 432 878 | 61.2 | 36.8 | 980 | 26.5

Experiments and results
- IBM iDataPlex dx360 M3
- 52 computing nodes connected by a high-speed network: 2x 10 Gb/s Ethernet (RoCE)
- CPU: 2x 6-core Intel Xeon X5670, 2.93 GHz
- operating memory: 48 GB per node (24 GB per processor/socket, NUMA architecture)
- disk: 1x 2 TB 7200 rpm SATA per node
- Scientific Linux 6.4 (kernel 2.6.32-358.el6)
- OpenMPI (version 1.6.5)

Next (ongoing) work
- more nodes
- not limited by available memory
- improve partitioning
- test on larger datasets
- ...
- available to download

Thank you
Questions?
Tomáš Farkaš, Peter Kubán, Mária Lucká