Integrating Data for Analysis, Anonymization, and SHaring Supported by the NIH Grant U54HL108460 to the University of California, San Diego HUGO: Hierarchical.


integrating Data for Analysis, Anonymization, and SHaring
Supported by the NIH Grant U54HL108460 to the University of California, San Diego

HUGO: Hierarchical mUlti-reference Genome cOmpression tool for aligned short reads

Pinghao Li (1), Xiaoqian Jiang (2), Shuang Wang (2), Jihoon Kim (2), Hongkai Xiong (1), and Lucila Ohno-Machado (2)
(1) EE Department, Shanghai Jiaotong University, Shanghai, China
(2) Division of Biomedical Informatics, University of California–San Diego, La Jolla, California, USA

Summary of Conclusions

Storage and transmission are important challenges in the use of large sequencing ‘Big Data’. We developed a novel compression technique, the HUGO framework, for compressing aligned reads. Our method also presents an innovative way of hierarchically matching gradually shortened reads in order to make full use of available reference genomes. Our experiments compared the performance of our algorithm with other state-of-the-art compression algorithms, such as CRAM, to which ours was superior, and Samcomp, which had similar compression performance.

Introduction

Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, transfer, archiving, and storage of sequence data.

Methodology

We developed Hierarchical mUlti-reference Genome cOmpression (HUGO) [1], a novel compression algorithm for aligned reads in the Sequence Alignment/Map (SAM) format. We first aligned short reads against a reference genome and stored exactly mapped reads for compression. For the inexact mapped or unmapped reads, we realigned them against different reference genomes using an adaptive scheme by gradually shortening the read length. Regarding the base quality value, we offer lossy and lossless compression mechanisms.
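The realignment step described under Methodology (retrying reads that did not map exactly against several reference genomes while gradually shortening them) can be illustrated with a toy sketch. This is not the HUGO implementation: it replaces a real aligner with exact substring search, and the function names, the `min_len` and `step` parameters, and the segment tuples are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def find_exact(fragment: str, references: List[str]) -> Optional[Tuple[int, int]]:
    """Return (reference index, position) of the first exact hit, or None."""
    for ref_id, ref in enumerate(references):
        pos = ref.find(fragment)
        if pos != -1:
            return ref_id, pos
    return None


def hierarchical_match(read: str, references: List[str],
                       min_len: int = 4, step: int = 1) -> list:
    """Toy stand-in for multi-reference matching with gradual shortening.

    Tries the whole remaining read against every reference, then shortens
    the candidate by `step` bases until it matches or drops below `min_len`.
    Whatever cannot be matched is kept as a literal segment.
    """
    segments = []
    start = 0
    while start < len(read):
        length = len(read) - start
        hit = None
        while length >= min_len:
            hit = find_exact(read[start:start + length], references)
            if hit is not None:
                break
            length -= step
        if hit is None:
            # Assumption: unmatched remainder is stored verbatim.
            segments.append(("literal", read[start:]))
            break
        segments.append(("match", hit[0], hit[1], length))
        start += length
    return segments


def reconstruct(segments: list, references: List[str]) -> str:
    """Rebuild the read from match and literal segments."""
    parts = []
    for seg in segments:
        if seg[0] == "match":
            _, ref_id, pos, length = seg
            parts.append(references[ref_id][pos:pos + length])
        else:
            parts.append(seg[1])
    return "".join(parts)


if __name__ == "__main__":
    refs = ["ACGTACGTTTGACCA", "GGGTTTCCCAAATTT"]  # made-up references
    read = "GTTTGACCCCCAAATT"
    segs = hierarchical_match(read, refs)
    print(segs)
    print(reconstruct(segs, refs) == read)
```

Each match segment stores only a reference index, a position, and a length, which is where a reference-based scheme would save space; a real pipeline would also have to encode mismatches and the other SAM fields.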
Methodology (continued)

The lossy compression mechanism for the base quality values uses k-means clustering, which lets the user adjust the balance between decompression quality and compression rate. Lossless compression is obtained by setting k (the number of clusters) to the number of distinct quality values.

HUGO framework

[Figure: HUGO framework. Labels: Child; Reference from Mother; Reference from Father.]

EMR: exactly mapped reads
IMR: inexactly mapped reads (fewer than 4 mismatches)
UMR: unmapped reads (more than 4 mismatches)

Experimental Results

[Figure: compression using k-means clustering followed by bzip2; the quantization error is measured by Mean Absolute Percentage Error (MAPE).]

ID  Sequence name    BAM size  SAM size
1   NA12878 chrom20  356 MB    1.58 GB
2   HG00096 chrom11  661 MB    2.65 GB
3   HG00103 chrom11  717 MB    2.91 GB
4   HG01028 chrom11  964 MB    3.95 GB
5   NA06984 chrom11  1.19 GB   5.16 GB
6   NA06985 chrom11  2.33 GB   9.41 GB

[Figure panels: Encoding memory usage; Decoding memory usage; HUGO lossless with multi-reference.]

ID  Program name  Reference    BAM size  Compressed size
3   bzip2         hg19         717 MB    720 MB
    CRAM [2]      hg19                   453.6 MB
    Samcomp [3]   hg19                   349 MB
    HUGO          hg19                   392.5 MB
    HUGO          hg19, HuRef            390.2 MB
4   bzip2         hg19         964 MB    967 MB
    CRAM          hg19                   585 MB
    Samcomp       hg19                   480 MB
    HUGO          hg19                   548.1 MB
    HUGO          hg19, HuRef            545.3 MB
5   bzip2         hg19         1.19 GB   1.192 GB
    CRAM          hg19                   736.8 MB
    Samcomp       hg19                   538 MB
    HUGO          hg19                   653.1 MB
    HUGO          hg19, HuRef            650.2 MB
6   bzip2         hg19         2.33 GB   2.34 GB
    CRAM          hg19                   1570 MB
    Samcomp       hg19                   1247 MB
    HUGO          hg19                   1496 MB
    HUGO          hg19, HuRef            1491 MB

References

[1] Li P, Jiang X, Wang S, Kim J, Xiong H, Ohno-Machado L. HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads. Journal of the American Medical Informatics Association 2013;0:1–11. doi:10.1136/amiajnl-2013-002147
[2] Fritz MH-Y, Leinonen R, Cochrane G, et al. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 2011;21:734–40.
[3] Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS ONE 2013;8:e59190.

Image source: http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png
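The quality-value scheme described under Methodology (k-means clustering followed by bzip2, with error measured by MAPE) can be sketched as follows. The poster gives no implementation details, so everything here is an assumption: a plain Lloyd's iteration on scalars, evenly spread deterministic starting centroids, rounding centroids back to integers, one byte per quality value, and MAPE computed as the mean of |q - q'| / q in percent (which requires strictly positive quality values). All function names are hypothetical.

```python
import bz2
from typing import List


def kmeans_1d(values: List[int], k: int, iterations: int = 50) -> List[float]:
    """Lloyd's algorithm on scalar values; returns the final centroids."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))
    # Assumed initialization: k distinct values spread evenly over the range.
    if k == 1:
        centroids = [float(distinct[len(distinct) // 2])]
    else:
        centroids = [float(distinct[i * (len(distinct) - 1) // (k - 1)])
                     for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def quantize(values: List[int], centroids: List[float]) -> List[int]:
    """Replace each value by its nearest centroid, rounded to an integer."""
    return [round(min(centroids, key=lambda c: abs(v - c))) for v in values]


def mape(original: List[int], approx: List[int]) -> float:
    """Mean Absolute Percentage Error; assumes every original value is > 0."""
    total = sum(abs(o - a) / o for o, a in zip(original, approx))
    return 100.0 * total / len(original)


def compressed_size(values: List[int]) -> int:
    """Size in bytes after bzip2, storing one byte per quality value (0..255)."""
    return len(bz2.compress(bytes(values)))


if __name__ == "__main__":
    quals = [30, 31, 32, 33, 2, 3, 4, 40, 40, 41] * 20  # made-up quality values
    for k in (3, len(set(quals))):
        q = quantize(quals, kmeans_1d(quals, k))
        print(k, compressed_size(q), mape(quals, q))
```

With k equal to the number of distinct quality values, every value becomes its own centroid and the quantization is the identity, which matches the lossless setting described in the text; smaller k trades a nonzero MAPE for fewer distinct symbols to compress.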

