Computing challenges in working with genomics-scale data

1 Computing challenges in working with genomics-scale data
Le-Shin Wu, Ph.D.
Carrie Ganote
National Center for Genome Analysis Support
Genomics in July, July 23, 2014

Summary
  Computing Challenges
  Data Pre-Processing
  Software Solutions
  Hardware Solutions

3 Computing Challenges
  Data is big
  Resources are limited
We see more and more people working on these genomics data.

DNA Sequencing Costs
  Moore's Law: a long-term trend in the computer hardware industry that involves the doubling of 'compute power' every two years
  Cost per Mb: the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality

5 Users and Data Growth
July 22, 2014

6 Data Pre-Processing
  Trimming and Quality Filtering
    FastQC
    Trimmomatic
  Normalization
    Digital Normalization
  Clustering
    CD-HIT
  Change file format
    SAM to BAM, FASTQ to FASTA
The purpose of data pre-processing is to reduce the overall size of the input.
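
As one illustration of the file-format step, here is a minimal sketch of a FASTQ to FASTA conversion, which drops the quality lines and so shrinks the input. The function name `fastq_to_fasta` is hypothetical, and the sketch assumes simple four-line FASTQ records (no wrapped sequence lines); it is not the tool used in the talk.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA and return the record count.

    Assumption: every record is exactly four lines
    (@header, sequence, '+', quality), with no wrapped sequences.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: %r" % header)
        sequence = fastq_handle.readline()
        fastq_handle.readline()  # '+' separator line, discarded
        fastq_handle.readline()  # quality line, discarded
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence.rstrip("\n") + "\n")
        count += 1
    return count


# Small in-memory demonstration with made-up reads.
demo_in = io.StringIO("@read1\nACGT\n+\nIIII\n")
demo_out = io.StringIO()
fastq_to_fasta(demo_in, demo_out)
```

The same function works on real files by passing handles from `open(...)` instead of `io.StringIO` objects.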

7 Software Solution
Choose the Correct Parameters
  Reduce the size of outputs
    e-value, identity percentage, …
  Reduce the number of output files
    No log file, no summary file
  Match the hardware
    Number of CPUs (threads), memory size, …
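
The parameter choices above can be sketched as follows, assuming NCBI BLAST+ `blastp` with tabular output (`-outfmt 6`). The helper names, file names, and threshold values are hypothetical and only illustrate the idea: set an e-value cutoff and a thread count on the command line, then filter hits by identity percentage afterwards.

```python
def build_blastp_command(query, db, out, evalue=1e-5, threads=8):
    """Build (but do not run) a blastp command line as a list of arguments.

    The thresholds are placeholders; choose them for your own data and
    match the thread count to the cores actually allocated to the job.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),        # stricter cutoff means fewer hits
        "-outfmt", "6",                # compact tab-separated output
        "-num_threads", str(threads),  # match the hardware
    ]


def filter_hits(lines, max_evalue=1e-5, min_identity=90.0):
    """Keep tabular hits passing both thresholds.

    Assumes the default -outfmt 6 column order, where column 3 is the
    identity percentage and column 11 is the e-value.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue and float(fields[2]) >= min_identity:
            kept.append(line)
    return kept
```

The list returned by `build_blastp_command` can be passed to `subprocess.run` on a machine where BLAST+ is installed.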

8 Software Solutions
Parallelization
  Split a single job into many small jobs
  Use multi-threading
  MPI wrapper
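
Splitting a single job into many small jobs could look like the sketch below, which divides the records of a FASTA query into chunks that can each be searched as a separate job. The function names are hypothetical, and the round-robin assignment balances the number of sequences per chunk, not their total length.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, seq_parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_parts)))
            header, seq_parts = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data before the first header")
            seq_parts.append(line)
    if header is not None:
        records.append((header, "".join(seq_parts)))
    return records


def split_records(records, n_chunks):
    """Distribute records round-robin over at most n_chunks non-empty chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]
```

Each chunk can then be written to its own file and submitted as an independent job; the per-chunk outputs are concatenated at the end.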

9 Blastp on BigRed 2
July 22, 2014

10 Software Solution
Use Different Tools
  Choose the right tools
    Avoid using blastx, tblastx
  Alternative tools
    Assembler: Trinity, SOAPdenovo-Trans, Trans-ABySS, Velvet-Oases, …
    Alignment: bowtie, bwa, novoalign, …
    Search: blast, blat, Smith-Waterman

Different Assemblers
[Figure: memory (kb) and time (sec.) for different assemblers]

12 Hardware Solution
Choose the right computing resources
  CPU bound, IO bound, memory, GPU, OSG, …
Make the best use of computing resources
  /dev/shm
  Machine-specific compiler
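
One way to use /dev/shm for an IO-bound job is to stage a frequently read file (for example a database) into it before the run. The sketch below assumes a Linux system where /dev/shm is a memory-backed file system; the function name `stage_in_memory` is hypothetical, and it falls back to the original path when the directory is missing or too small.

```python
import os
import shutil
import tempfile


def stage_in_memory(src_path, shm_root="/dev/shm"):
    """Copy src_path under shm_root if possible and return the path to use.

    Falls back to src_path when shm_root does not exist or does not have
    enough free space for the file.
    """
    if not os.path.isdir(shm_root):
        return src_path
    size = os.path.getsize(src_path)
    free = shutil.disk_usage(shm_root).free
    if size >= free:
        return src_path
    # A private subdirectory avoids clashing with other users' files.
    target_dir = tempfile.mkdtemp(prefix="staged_", dir=shm_root)
    target = os.path.join(target_dir, os.path.basename(src_path))
    return shutil.copy2(src_path, target)
```

Files in /dev/shm count against the node's memory, so the staged copy should be deleted when the job finishes.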

13 Blastp on Different Clusters
Blastp on TAIR10
  Total length of sequence: ,729,993 bp
  Total number of sequences: 39,687

14 Miscellaneous
  Implement checkpoints
  Code optimization
  Job arrays
  Reuse the outputs
    When you want to try multiple k-mer sizes
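
A simple form of checkpointing and output reuse can be sketched as below: a pipeline step runs only if its output file does not exist yet, so a restarted job (or a second run with a different k-mer size that shares earlier steps) skips work already done. The function name `run_with_checkpoint` and the `.partial` suffix are hypothetical choices for this illustration.

```python
import os


def run_with_checkpoint(output_path, step):
    """Run step(tmp_path) unless output_path already exists.

    The step writes to a temporary path that is renamed only after it
    finishes, so an interrupted run never leaves a file that looks
    complete. Returns True if the step ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    tmp_path = output_path + ".partial"
    step(tmp_path)
    os.replace(tmp_path, output_path)
    return True
```

Wrapping each stage of a pipeline this way lets a resubmitted job continue from the last finished stage instead of starting over.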

Thank You
Le-Shin Wu
Carrie Ganote
NCGAS

