Published by Emily Pitts. Modified over 9 years ago.
1
EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin
2
Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.
3
The need for compression Red alert
4
Compression, what is it? Lossless: BMP, 190 kb; PNG, 100 kb. Lossy: JPG, 21 kb; JPG, 4 kb.
5
Compression, when we know what to expect. Lossless: BMP, 145 kb; PNG, 2 kb. Lossy: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!
6
Compression at its best. IMAGE, 145 kb, compresses to TEXT, 40 b ("Five little ducks went swimming one day"), which uncompresses back to IMAGE, 145 kb: ~3500 times more efficient.
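The "~3500 times" figure can be checked from the two sizes given on the slide. A minimal sketch, assuming "kb" means 1000 bytes and "b" means bytes (the slide does not say which convention it uses):

```python
# Sizes as stated on the slide; the unit interpretation is an assumption.
IMAGE_BYTES = 145 * 1000   # "IMAGE, 145 kb", read as 145,000 bytes
TEXT_BYTES = 40            # "TEXT, 40 b", read as 40 bytes

# Compression ratio: how many times smaller the text is than the image.
ratio = IMAGE_BYTES / TEXT_BYTES
print(ratio)  # 3625.0
```

With 1024-byte kilobytes the ratio would be slightly higher, so "~3500" is a rounded figure either way.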
7
What are we talking about? Labels: sample, sequencing machines, bug, bunch of huge files. The bug’s DNA is hidden somewhere.
8
Looking closer at the data. Bunch of huge files: read 1, read 2, read 3, …, read bazillion. It boils down to a long list of reads. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.
9
What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
An excerpt from a FASTQ file.
10
What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
An excerpt from a FASTQ file. Line 1: read name.
11
What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
An excerpt from a FASTQ file. Line 1: read name. Line 2: read bases. Bases: ACGTN.
12
What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
An excerpt from a FASTQ file. Line 1: read name. Line 2: read bases. Line 4: read quality scores. Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
13
What is a quality score? The quality score is a Phred quality score, encoded as an ASCII symbol in the range 33-126. Basically: higher scores are better, so ‘!’ (score 0) is bad and ‘I’ (score 40) is good.
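The read layout and quality encoding described above can be sketched in a few lines. This is an illustrative sketch, not code from the CRAM toolkit: the names `FastqRecord`, `parse_fastq_record` and `error_probability` are made up here, and the example record is shortened to the first 10 bases and quality symbols of the excerpt. It uses the standard Phred relation Q = -10 * log10(P), where P is the estimated probability that the base call is wrong.

```python
from typing import List, NamedTuple

PHRED_OFFSET = 33  # '!' (ASCII 33) encodes score 0


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: List[int]


def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record: @name, bases, '+', quality string."""
    header, bases, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("bases and quality scores differ in length")
    # Each quality symbol is the Phred score plus 33, stored as one ASCII char.
    scores = [ord(symbol) - PHRED_OFFSET for symbol in quality]
    return FastqRecord(header[1:], bases, scores)


def error_probability(score):
    """Invert Q = -10 * log10(P) to get the estimated error probability."""
    return 10 ** (-score / 10)


# First 10 bases and quality symbols of the excerpt (shortened for the example).
record = parse_fastq_record([
    "@SRR081241.20758946\n",
    "CCAGATCCTG\n",
    "+\n",
    "IDCEFFGGHH\n",
])
print(record.name)       # SRR081241.20758946
print(record.qualities)  # [40, 35, 34, 36, 37, 37, 38, 38, 39, 39]
```

A score of 40 (symbol ‘I’) corresponds to an estimated error probability of 1 in 10,000, while a score of 0 (symbol ‘!’) corresponds to a probability of 1.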
14
Reference based encoding
Reference sequence  TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1              TGAGCTCTTAGTAGC
read 2                 GCTCTAAGTAGCCGC
read 3                  CTCTAAGTAGCCGCG
read 4                        GTAGCCGCGGACTGT
read 5                               CGGTCTGTCCG
Labels: Read start position, Read end position.
15
Reference based encoding
Reference sequence  TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1              ........T......
read 2                 ...............
read 3                  ...............
read 4                        ..........A....
read 5                               ...........
16
Reference based encoding
Reference sequence  TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1              ........T......
read 2                 ...............
read 3                  ...............
read 4                        ..........A....
read 5                               ...........
Label: Mismatching bases.
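The idea on these slides, storing a read as its position on the reference plus only the bases that differ, can be sketched as follows. This is a simplified illustration that handles substitutions only (no insertions, deletions or clipping) and is not the actual CRAM encoding. The function names are hypothetical, and the 0-based start positions are an assumption: they were inferred by lining the read text up against the reference, since the slides do not state them.

```python
# Reference and reads copied as printed in the slides.
REFERENCE = "TGAGCTCTAAGTACCCGCGGTCTGTCCG"

# (assumed 0-based start position, read bases)
READS = [
    (0, "TGAGCTCTTAGTAGC"),
    (3, "GCTCTAAGTAGCCGC"),
    (4, "CTCTAAGTAGCCGCG"),
    (10, "GTAGCCGCGGACTGT"),
    (17, "CGGTCTGTCCG"),
]


def encode_read(reference, start, read):
    """Encode a read as (start, length, [(offset, base), ...])."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return start, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the read from the reference and the stored mismatches."""
    start, length, mismatches = encoded
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


encoded_reads = [encode_read(REFERENCE, start, read) for start, read in READS]
for encoded in encoded_reads:
    print(encoded)

# The encoding is lossless for the bases: every read is restored exactly.
restored = [decode_read(REFERENCE, encoded) for encoded in encoded_reads]
```

With the sequences exactly as transcribed here, reads 1 to 4 also carry a G where the reference has a C at 0-based position 13, so the sketch stores that substitution as well, although the dot diagram does not mark it. Read 5 matches the reference completely and is stored as a start position and a length only.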
17
Lossy quality scores. Approach 1: quality scores are usually values from 0 to 39; shrink them so that they run from 0 to 7. Approach 2: treat quality scores using alignment information, for example by preserving quality scores only for mismatching bases. Labels: horizontal, vertical.
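Both approaches can be illustrated with a short sketch. The slide does not say how the 0-39 range is shrunk to 0-7, so the equal-width binning below is an assumption, and the function names, the short reference window and the quality values in the example are all hypothetical.

```python
def bin_quality(score, max_in=39, levels=8):
    """Approach 1: map a score in 0..max_in to one of `levels` equal-width bins."""
    if not 0 <= score <= max_in:
        raise ValueError("score out of range")
    return score * levels // (max_in + 1)


def keep_mismatch_qualities(reference_window, read, qualities):
    """Approach 2: keep quality scores only where the read differs from the reference."""
    if not (len(reference_window) == len(read) == len(qualities)):
        raise ValueError("inputs must have the same length")
    return {
        offset: score
        for offset, (ref_base, base, score)
        in enumerate(zip(reference_window, read, qualities))
        if ref_base != base
    }


# Approach 1: 40 possible values collapse to 8, i.e. 3 bits instead of 6.
binned = [bin_quality(score) for score in (0, 4, 5, 20, 39)]
print(binned)  # [0, 0, 1, 4, 7]

# Approach 2: made-up 8-base example with one mismatch at offset 3.
kept = keep_mismatch_qualities(
    "ACGTACGT",
    "ACGAACGT",
    [30, 31, 32, 12, 33, 34, 35, 36],
)
print(kept)  # {3: 12}
```

Both are lossy: after binning, the original score can only be recovered to within its bin, and in the second approach the scores at matching positions are discarded entirely.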
18
Comparison study: 1K Genomes exomes. Labels: compress, uncompress, BAM, CRAM.
19
Comparison study: 1K Genomes exomes. Labels: compress, uncompress, BAM, CRAM, Some analysis pipeline.
20
Comparison study: 1K Genomes exomes. Labels: compress, uncompress, BAM, CRAM, Some analysis pipeline, Original SNPs, Restored SNPs.
21
Comparison study: 1K Genomes exomes
22
CRAM NGS data compression. [Chart of bits/base, from (bad) to (good). Labels: Do nothing, Untreated, CRAM lossless, CRAM lossy, CRAM very lossy. Groups: Lossless, Lossy.]
23
Progressive application of compression. [Chart of sample value against sample accessibility. Labels: Lossless, 2-fold, 20-fold, 200-fold; Hard, High, Easy, Low.]
24
References
More information: http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40.
Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. Gigascience 1.