Computing challenges in working with genomics-scale data


1 Computing challenges in working with genomics-scale data
Le-Shin Wu, Ph.D., and Carrie Ganote
National Center for Genome Analysis Support
Genomics in July, July 23, 2014

2 Summary
Computing Challenges
Data Pre-Processing
Software Solutions
Hardware Solutions
July 22, 2014

3 Computing Challenges
Data is big
Resources are limited
Deadline is approaching
More and more people are working on these genomics data.

4 DNA Sequencing Costs
Moore's Law: a long-term trend in the computer hardware industry that involves the doubling of 'compute power' every two years
Cost per Mb: the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality

5 Users and Data Growth

6 Data Pre-Processing
Trimming and Quality Filtering
  FastQC
  Trimmomatic
Normalization
  Digital Normalization
Clustering
  CD-HIT
Change file format
  SAM to BAM, FASTQ to FASTA
The purpose of data pre-processing is to reduce the overall size of the input.
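As one illustration of how a format change reduces input size, the sketch below converts FASTQ to FASTA by dropping the quality lines. It is a minimal example, not the presenters' tool: the function name `fastq_to_fasta` is hypothetical, and it assumes every record is exactly four lines (no wrapped sequences).

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line FASTQ records to FASTA and return the record count.

    Assumption: each record is exactly four lines
    (header, sequence, '+' separator, quality string).
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:          # end of file
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header.strip())
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, discarded
        fastq_handle.readline()  # quality line, discarded (this is the size saving)
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open file handles instead.
    src = io.StringIO("@read1\nACGT\n+\nIIII\n")
    dst = io.StringIO()
    fastq_to_fasta(src, dst)
    print(dst.getvalue(), end="")
```

Because the quality string is as long as the sequence itself, dropping it roughly halves the file, which only makes sense when the downstream tool does not need base qualities.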

7 Software Solution: Choose the Correct Parameters
Reduce the size of outputs
  e-value, identity percentage, …
Reduce the number of output files
  No log file, no summary file
Match the hardware
  Number of CPUs (threads), memory size, …
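The parameter choices above can be sketched as a small helper that assembles a BLAST+ `blastp` command line. This is only an illustration: the function name `build_blastp_command`, the default e-value, and the default hit limit are assumptions, not values from the talk. The flags used (`-evalue`, `-max_target_seqs`, `-outfmt`, `-num_threads`) are standard BLAST+ options.

```python
import os


def build_blastp_command(query, db, out, evalue=1e-5, max_targets=5, threads=None):
    """Return a blastp argument list tuned for smaller output.

    The defaults are illustrative assumptions; pick values for your data.
    """
    if threads is None:
        # Assumption: the job owns the whole node. Under a batch scheduler,
        # pass the number of cores actually allocated to the job instead.
        threads = os.cpu_count() or 1
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),                # stricter threshold, fewer hits reported
        "-max_target_seqs", str(max_targets),  # cap the hits kept per query
        "-outfmt", "6",                        # compact tabular output
        "-num_threads", str(threads),          # match the available CPUs
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, printed rather than executed.
    print(" ".join(build_blastp_command("query.fasta", "mydb", "hits.tsv", threads=8)))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting problems with unusual file names.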

8 Software Solutions: Parallelization
Split single job into many small jobs
Use multi-threading
MPI wrapper
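Splitting one job into many small jobs usually starts with splitting the input. The sketch below distributes FASTA records round-robin into a fixed number of chunks, each of which could then be submitted as its own job. The helper names `read_fasta` and `split_records` are hypothetical, and round-robin is just one simple balancing choice.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    demo = [">a\n", "ACGT\n", ">b\n", "TT\n", ">c\n", "G\n"]
    for number, chunk in enumerate(split_records(read_fasta(demo), 2)):
        # Each chunk would be written to its own file and searched separately.
        print(number, [header for header, _ in chunk])
```

Round-robin keeps the record counts even, but if sequence lengths vary widely, balancing by total bases per chunk may give more uniform run times.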

9 Blastp on BigRed 2

10 Software Solution: Use Different Tools
Choose the right tools
  Avoid using blastx, tblastx
Alternative tools
  Assembler: Trinity, Trans-Soap, Trans-ABySS, Velvet-Oases, …
  Alignment: bowtie, bwa, novoalign, …
  Search: blast, blat, Smith-Waterman

11 Different Assemblers
[Figure: memory (kb) and time (sec.) for different assemblers]

12 Hardware Solution
Choose the right computing resources
  CPU bound, IO bound, memory, GPU, OSG, …
Make the best use of computing resources
  /dev/shm
  Machine-specific compiler
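For IO-bound jobs, one way to use `/dev/shm` is to copy a frequently read input into that RAM-backed directory before the run. The sketch below is an assumption about how this might be done, not the presenters' procedure: `stage_to_fast_storage` is a hypothetical helper, and it falls back to the ordinary temporary directory when `/dev/shm` is missing or not writable.

```python
import os
import shutil
import tempfile


def stage_to_fast_storage(path, preferred_dir="/dev/shm"):
    """Copy a file into a private directory on RAM-backed storage.

    Returns the path of the copy. Falls back to the system temp
    directory if preferred_dir does not exist or is not writable.
    """
    if os.path.isdir(preferred_dir) and os.access(preferred_dir, os.W_OK):
        base = preferred_dir
    else:
        base = tempfile.gettempdir()
    workdir = tempfile.mkdtemp(prefix="stage_", dir=base)
    dest = os.path.join(workdir, os.path.basename(path))
    shutil.copy2(path, dest)
    return dest


if __name__ == "__main__":
    # Demo with a throwaway file; real use would stage a database or index.
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as fh:
        fh.write(">a\nACGT\n")
    staged = stage_to_fast_storage(fh.name)
    print(staged)
    shutil.rmtree(os.path.dirname(staged))  # free the memory when finished
    os.remove(fh.name)
```

Files in `/dev/shm` count against the node's memory, so the staged copy should fit comfortably alongside the job's own memory use and be removed when the job ends.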

13 Blastp on Different Clusters
Blastp on TAIR10
  Total length of sequence: ,729,993 bp
  Total number of sequences: 39,687

14 Miscellaneous
Implement check points
Code optimization
Job arrays
Reuse the outputs
  When you want to try multiple k-mer sizes
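A simple form of check pointing and output reuse is to skip a step whenever its output file already exists. The sketch below illustrates that idea under stated assumptions: `run_step` is a hypothetical helper, each step produces a single text file, and the demo's "k-mer counting" step is a placeholder rather than a real tool.

```python
import os
import tempfile


def run_step(output_path, compute):
    """Return (result, was_computed), running compute() only if needed.

    If output_path already exists, its contents are reused and
    compute() is not called.
    """
    if os.path.exists(output_path):
        with open(output_path) as fh:
            return fh.read(), False
    result = compute()
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as fh:
        fh.write(result)
    # Rename only after a complete write, so an interrupted run never
    # leaves a partial file that looks like a finished check point.
    os.replace(tmp_path, output_path)
    return result, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for k in (21, 25, 21):  # the second k=21 request reuses the first output
            out = os.path.join(workdir, "kmers_k%d.txt" % k)
            _, computed = run_step(out, lambda: "placeholder result for k=%d\n" % k)
            print(k, "computed" if computed else "reused")
```

Keying the output file name on the parameter (here the k-mer size) is what lets several parameter sweeps share results without recomputing them.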

15 Thank You
Le-Shin Wu
Carrie Ganote
NCGAS

