Published by Patricia Carter. Modified over 7 years ago.
1
Computing challenges in working with genomics-scale data
Le-Shin Wu, Ph.D.
Carrie Ganote
National Center for Genome Analysis Support
Genomics in July, July 23, 2014
2
Summary
Computing Challenges
Data Pre-Processing
Software Solutions
Hardware Solutions
July 22, 2014
3
Computing Challenges
Data is big
Resources are limited
Deadline is approaching
These challenges grow as more and more people work with genomics data.
4
DNA Sequencing Costs
Moore's Law: a long-term trend in the computer hardware industry that involves the doubling of 'compute power' every two years.
Cost per Mb: the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality.
5
Users and Data Growth
6
Data Pre-Processing
Trimming and Quality Filtering: FastQC, Trimmomatic
Normalization: Digital Normalization
Clustering: CD-HIT
Change file format: SAM to BAM, FASTQ to FASTA
The purpose of data pre-processing is to reduce the overall size of the input.
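The two size-reducing steps named on this slide (quality filtering and FASTQ to FASTA conversion) can be sketched in a few lines of Python. This is only an illustration, not a replacement for FastQC or Trimmomatic: the function name `fastq_to_fasta`, the mean-quality threshold, and the assumption of strict 4-line FASTQ records with Phred+33 encoding are all choices made for this sketch.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle, min_mean_quality=0.0, phred_offset=33):
    """Convert 4-line FASTQ records to FASTA, dropping low-quality reads.

    Assumes strict 4-line records and Phred+33 quality encoding (hypothetical
    defaults for this sketch). Returns a (kept, dropped) tuple of read counts.
    """
    kept = dropped = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, not needed for FASTA
        quality = fastq_handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        scores = [ord(ch) - phred_offset for ch in quality]
        mean_quality = sum(scores) / len(scores) if scores else 0.0
        if mean_quality < min_mean_quality:
            dropped += 1
            continue
        # FASTA keeps only the identifier and the bases, so the output is smaller.
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")
        kept += 1
    return kept, dropped


# Small in-memory demo: read1 has high quality ('I'), read2 has the lowest ('!').
demo_input = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
demo_output = io.StringIO()
demo_counts = fastq_to_fasta(demo_input, demo_output, min_mean_quality=20)
```

Dropping the quality lines alone roughly halves the file, which is the point the slide makes about reducing input size before the expensive steps.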
7
Software Solution
Choose the Correct Parameters
Reduce the size of outputs: e-value, identity percentage, …
Reduce the number of output files: no log file, no summary file
Match the hardware: number of CPUs (threads), memory size, …
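As a rough illustration of these parameter choices, the sketch below assembles a `blastp` command line with a stricter e-value, a cap on reported hits, compact tabular output, and a thread count matched to the machine. The helper name `build_blastp_command` and the default values are assumptions for this example; the flags themselves are standard BLAST+ options. The command is only built, not executed.

```python
import os


def build_blastp_command(query, database, output, evalue=1e-5, max_hits=5, threads=None):
    """Return a blastp argument list tuned to keep output small.

    Defaults are illustrative only. On a shared cluster, pass the core count
    granted by the scheduler as `threads` rather than relying on
    os.cpu_count(), which reports every core on the node.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        "blastp",
        "-query", query,
        "-db", database,
        "-out", output,
        "-evalue", str(evalue),             # stricter cutoff means fewer reported hits
        "-max_target_seqs", str(max_hits),  # cap on hits kept per query
        "-outfmt", "6",                     # tabular output, much smaller than the default
        "-num_threads", str(threads),       # match the CPUs actually available
    ]


demo_command = build_blastp_command("query.fasta", "tair10_db", "hits.tsv", threads=4)
```

The resulting list can be handed to `subprocess.run` once the query and database paths (hypothetical here) point at real files.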
8
Software Solutions
Parallelization
Split single job into many small jobs
Use multi-threading
MPI wrapper
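Splitting a single job into many small jobs usually starts with splitting the input. A minimal sketch, assuming a FASTA query file and a round-robin distribution so that chunks hold similar numbers of sequences; `read_fasta` and `split_records` are names invented for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return chunks


demo_lines = [">a", "AC", "GT", ">b", "TT", ">c", "GG"]
demo_chunks = split_records(read_fasta(demo_lines), 2)
```

Each chunk can then be written to its own file and submitted as an independent job. Round-robin balances sequence counts but not sequence lengths, so very uneven inputs may need a length-aware split instead.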
9
Blastp on BigRed 2
10
Software Solution
Use Different Tools
Choose the right tools
Avoid using blastx, tblastx
Alternative tools
Assembler: Trinity, SOAPdenovo-Trans, Trans-ABySS, Velvet-Oases, …
Alignment: Bowtie, BWA, Novoalign, …
Search: BLAST, BLAT, Smith-Waterman
11
Different Assemblers
[Chart: memory (kb) and time (sec.) for each assembler]
12
Hardware Solution
Choose the right computing resources: CPU bound, IO bound, memory, GPU, OSG, …
Make the best use of computing resources: /dev/shm, machine-specific compiler
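For IO-bound jobs, one way to use /dev/shm is to copy a frequently read file (for example a search database) into that memory-backed directory before the run. The sketch below is an assumption-laden illustration: `stage_to_fast_storage` is a hypothetical helper, it handles a single file only, and it falls back to the original path when the fast directory is missing or too small. The demo uses temporary directories in place of the real /dev/shm.

```python
import os
import shutil
import tempfile


def stage_to_fast_storage(source_path, fast_dir="/dev/shm"):
    """Copy source_path into fast_dir if possible and return the path to use.

    Falls back to source_path when fast_dir does not exist or lacks free space.
    Files in /dev/shm consume RAM, so they should be deleted after the job.
    """
    if not os.path.isdir(fast_dir):
        return source_path
    needed = os.path.getsize(source_path)
    if needed > shutil.disk_usage(fast_dir).free:
        return source_path
    target = os.path.join(fast_dir, os.path.basename(source_path))
    shutil.copyfile(source_path, target)
    return target


# Demo with temporary directories standing in for slow and fast storage.
with tempfile.TemporaryDirectory() as slow_dir, tempfile.TemporaryDirectory() as fast_dir:
    db_path = os.path.join(slow_dir, "proteins.fasta")
    with open(db_path, "w") as handle:
        handle.write(">p1\nMKV\n")
    staged_path = stage_to_fast_storage(db_path, fast_dir=fast_dir)
    staged_in_fast = os.path.dirname(staged_path) == fast_dir
    with open(staged_path) as handle:
        staged_ok = handle.read() == ">p1\nMKV\n"
    missing_dir = os.path.join(slow_dir, "missing")
    fallback_same = stage_to_fast_storage(db_path, fast_dir=missing_dir) == db_path
```

Because staged files count against the node's memory, this only helps when the job itself leaves enough RAM free.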
13
Blastp on Different Clusters
Blastp on TAIR10
Total length of sequence: ,729,993 bp
Total number of sequences: 39,687
14
Miscellaneous
Implement checkpoints
Code optimization
Job arrays
Reuse the outputs, for example when you want to try multiple k-mer sizes
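A very simple form of checkpointing is to leave a marker file after each finished task and skip tasks whose marker already exists, so a restarted job does not redo completed work. The sketch below assumes one task per k-mer size; `run_with_checkpoints`, the `.done` marker convention, and the task names are hypothetical.

```python
import os
import tempfile


def run_with_checkpoints(task_names, checkpoint_dir, run_task):
    """Run each task once, recording completion in checkpoint_dir.

    A task is skipped if its '<name>.done' marker exists. The marker is
    written only after run_task returns, so a crashed task is retried.
    Returns the list of tasks actually executed in this call.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    executed = []
    for name in task_names:
        marker = os.path.join(checkpoint_dir, name + ".done")
        if os.path.exists(marker):
            continue  # finished in an earlier run, reuse its output
        run_task(name)
        with open(marker, "w") as handle:
            handle.write("ok\n")
        executed.append(name)
    return executed


# Demo: the second call only runs the newly added k-mer size.
demo_log = []
with tempfile.TemporaryDirectory() as checkpoint_dir:
    first_run = run_with_checkpoints(["k21", "k25"], checkpoint_dir, demo_log.append)
    second_run = run_with_checkpoints(["k21", "k25", "k31"], checkpoint_dir, demo_log.append)
```

The same pattern fits job arrays: each array element can check for its own marker before starting, which makes resubmitting the whole array safe.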
15
Thank You
Le-Shin Wu
Carrie Ganote
NCGAS