EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin
Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.
The need for compression: red alert.
Compression, what is it? LOSSLESS: BMP, 190 kb; PNG, 100 kb. LOSSY: JPG, 21 kb; JPG, 4 kb.
Compression, when we know what to expect. LOSSLESS: BMP, 145 kb; PNG, 2 kb. LOSSY: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!
Compression at its best. IMAGE, 145 kb → compress → "Five little ducks went swimming one day", TEXT, 40 b → uncompress → IMAGE, 145 kb. ~3500 times more efficient.
What are we talking about? Bug → sample → sequencing machines → bunch of huge files. The bug’s DNA is hidden somewhere.
Looking closer at the data. Bunch of huge files: read 1, read 2, read 3, ….. read bazillion. It boils down to a long list of reads. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.
What is a read? An excerpt from a FASTQ file, whose records list the read name, the read bases, a ‘+’ separator and the read quality scores:
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
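The record layout described on this slide can be sketched as a small parser. This is a minimal illustration, assuming every record spans exactly four lines; the names `FastqRecord` and `parse_fastq` and the read name `hypothetical_read_1` are made up for the example, and the bases and quality strings are the visible part of the excerpt with the trailing ellipsis dropped.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: str


def parse_fastq(text: str) -> List[FastqRecord]:
    """Parse FASTQ text in which every record spans exactly four lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, bases, separator, qualities = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(qualities):
            raise ValueError("bases and quality scores differ in length")
        records.append(FastqRecord(header[1:], bases, qualities))
    return records


# The read name is hypothetical; the slide excerpt does not preserve it.
EXAMPLE = """@hypothetical_read_1
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI
"""

records = parse_fastq(EXAMPLE)
print(records[0].name, len(records[0].bases))
```

Each base has exactly one quality symbol, which is why the sketch rejects records where the two strings differ in length.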
What is a quality score? The quality score is a Phred quality score encoded as an ASCII symbol. Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
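The encoding can be illustrated with a few lines of code. This sketch assumes the offset of 33 implied by the ‘!’ (ASCII 33) lower bound, and uses the standard Phred relation Q = -10 · log10(p) between a score Q and the estimated error probability p; the function names are illustrative only.

```python
def phred_score(symbol: str, offset: int = 33) -> int:
    """Decode one ASCII quality symbol into its Phred score."""
    return ord(symbol) - offset


def error_probability(score: int) -> float:
    """Estimated probability that the base call is wrong."""
    return 10 ** (-score / 10)


for symbol in "!I":
    score = phred_score(symbol)
    print(symbol, score, error_probability(score))
```

Here ‘!’ decodes to score 0 (the base is as good as a guess) and ‘I’ decodes to score 40 (about one error in ten thousand calls).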
Reference based encoding. Reads aligned to the reference sequence, each with a read start position and a read end position:
Reference sequence  TGAGCTCTAAGTAGCCGCGGTCTGTCCG
read 1              TGAGCTCTTAGTAGC
read 2                 GCTCTAAGTAGCCGC
read 3                  CTCTAAGTAGCCGCG
read 4                        GTAGCCGCGGACTGT
read 5                               CGGTCTGTCCG
Reference based encoding. Only the mismatching bases are kept; bases that match the reference are shown as dots:
Reference sequence  TGAGCTCTAAGTAGCCGCGGTCTGTCCG
read 1              ........T......
read 2                 ...............
read 3                  ...............
read 4                        ..........A....
read 5                               ...........
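The idea on these slides can be sketched in code: a read is stored as its start position, its length and its mismatching bases, and is restored by copying the reference and patching the mismatches. This is a minimal illustration, not the CRAM format itself. The function names and the 0-based start positions are assumptions (the positions were found by lining the reads up against the reference), and the reference is taken with G at 0-based position 13, the base that reads 1 to 4 all carry there, so that only the two highlighted mismatches remain.

```python
# Reference from the slide, taken with G at 0-based index 13 (assumption,
# see the note above).
REFERENCE = "TGAGCTCTAAGTAGCCGCGGTCTGTCCG"

# (0-based start position, read bases); the positions are inferred.
ALIGNED_READS = [
    (0, "TGAGCTCTTAGTAGC"),
    (3, "GCTCTAAGTAGCCGC"),
    (4, "CTCTAAGTAGCCGCG"),
    (10, "GTAGCCGCGGACTGT"),
    (17, "CGGTCTGTCCG"),
]


def encode_read(reference, start, read):
    """Return (start, length, [(reference_position, base), ...])."""
    mismatches = [
        (start + offset, base)
        for offset, base in enumerate(read)
        if reference[start + offset] != base
    ]
    return start, len(read), mismatches


def decode_read(reference, start, length, mismatches):
    """Rebuild the read from the reference and its mismatching bases."""
    bases = list(reference[start:start + length])
    for position, base in mismatches:
        bases[position - start] = base
    return "".join(bases)


encoded = [encode_read(REFERENCE, start, read) for start, read in ALIGNED_READS]
decoded = [decode_read(REFERENCE, *record) for record in encoded]

for record in encoded:
    print(record)
```

Reads that match the reference exactly end up with an empty mismatch list, which is where the saving comes from.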
Lossy quality scores. Approach 1: quality scores are usually values from 0 to 39; let’s shrink them so that they range from 0 to 7. Approach 2: let’s treat quality scores using alignment information, for example by preserving quality scores only for mismatching bases. Figure labels: horizontal, vertical.
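Both approaches can be sketched briefly. The slide does not say how the 0 to 39 range is mapped onto 0 to 7, so the uniform bins below are an assumption, and the function names are illustrative only.

```python
def bin_quality(score, max_score=39, levels=8):
    """Approach 1: map a score in 0..max_score onto levels uniform bins."""
    if not 0 <= score <= max_score:
        raise ValueError("score out of range")
    return score * levels // (max_score + 1)


def keep_mismatch_qualities(reference_segment, read, qualities):
    """Approach 2: keep quality scores only where the read differs
    from the reference; returns {offset_in_read: score}."""
    return {
        offset: quality
        for offset, (ref_base, base, quality)
        in enumerate(zip(reference_segment, read, qualities))
        if ref_base != base
    }


print([bin_quality(q) for q in (0, 4, 5, 20, 39)])
print(keep_mismatch_qualities("ACGT", "ACCT", [30, 31, 12, 33]))
```

With uniform bins, eight levels need 3 bits per score. In the second approach a read that matches the reference keeps no quality scores at all.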
Comparison study: 1K Genomes exomes. BAM → compress → CRAM → uncompress → BAM. Some analysis pipeline is run on the data before and after the round trip, giving original SNPs and restored SNPs to compare.
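One simple way to compare the two SNP sets is to count the calls they share and the calls unique to each. This is only a sketch of such a comparison, not the method used in the study; the function name and the example SNP tuples (chromosome, position, alternative base) are hypothetical.

```python
def compare_snps(original, restored):
    """Count SNP calls shared by both sets and unique to each one."""
    original, restored = set(original), set(restored)
    return {
        "shared": len(original & restored),
        "only_original": len(original - restored),
        "only_restored": len(restored - original),
    }


# Hypothetical calls, for illustration only.
original_snps = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 75, "G")}
restored_snps = {("chr1", 100, "A"), ("chr2", 75, "G"), ("chr2", 90, "C")}

summary = compare_snps(original_snps, restored_snps)
print(summary)
```

For a lossless round trip the two sets should be identical. For lossy settings, the calls unique to either set show what the loss of information changed.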
Comparison study: 1K Genomes exomes
CRAM NGS data compression. [Figure: bits/base, from (bad) to (good), on a scale from lossless to lossy. Labels: do nothing; untreated; CRAM lossless; CRAM lossy; CRAM very lossy.]
Progressive application of compression. [Figure: sample value (high to low) against sample accessibility (hard to easy). Labels: 200-fold; lossless; 2-fold; 20-fold.]
References. More information: Mailing list: Publications: Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5). Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. GigaScience 1.