EBI is an Outstation of the European Molecular Biology Laboratory. CRAM: reference-based compression format developed by Vadim Zalunin.


Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.

The need for compression: red alert.

Compression, what is it? LOSSLESS: BMP, 190 kB; PNG, 100 kB. LOSSY: JPG, 21 kB; JPG, 4 kB.

Compression, when we know what to expect. LOSSLESS: BMP, 145 kB; PNG, 2 kB. LOSSY: JPG, 6 kB; JPG, 3 kB. But the actual message is only 40 characters (bytes) long!

Compression at its best. IMAGE, 145 kB -> compress -> TEXT, 40 B ("Five little ducks went swimming one day") -> uncompress -> IMAGE, 145 kB. ~3500 times more efficient.

What are we talking about? Bug -> sample -> sequencing machines -> bunch of huge files. The bug's DNA is hidden somewhere.

Looking closer at the data. Bunch of huge files: read 1, read 2, read 3, ..., read bazillion. It boils down to a long list of reads. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.

What is a read? An excerpt from a FASTQ file, annotated with read name, read bases and read quality scores:
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).

What is a quality score? The quality score is a Phred quality score encoded as an ASCII symbol. Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
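As a rough illustration of this encoding, the sketch below decodes quality symbols. It assumes the common Phred+33 offset (consistent with ‘!’ being ASCII 33) and the standard Phred definition Q = -10 log10(P); the function names are illustrative and are not part of CRAM or any particular library.

```python
def phred_score(symbol: str, offset: int = 33) -> int:
    """Convert one ASCII quality symbol to its Phred score (Phred+33 assumed)."""
    return ord(symbol) - offset


def error_probability(score: int) -> float:
    """Estimated probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


# First ten symbols of the quality string in the FASTQ excerpt.
quality = "IDCEFFGGHH"
scores = [phred_score(symbol) for symbol in quality]
print(scores)

# '!' is the lowest possible score, 'I' is a high one.
print(phred_score("!"), phred_score("I"))
print(error_probability(phred_score("I")))
```

Under these assumptions ‘!’ decodes to a score of 0 and ‘I’ to 40, which corresponds to an estimated error rate of about 1 in 10,000.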

Reference based encoding. Each read has a read start position and a read end position on the reference.
Reference sequence: TGAGCTCTAAGTACCCGCGGTCTGTCCG
read 1: TGAGCTCTTAGTAGC
read 2: GCTCTAAGTAGCCGC
read 3: CTCTAAGTAGCCGCG
read 4: GTAGCCGCGGACTGT
read 5: CGGTCTGTCCG

Reference based encoding (continued). Reference sequence: TGAGCTCTAAGTACCCGCGGTCTGTCCG. Once a read is placed on the reference, only its mismatching bases need to be kept, for example the T in read 1 and the A in read 4; bases that match the reference are not stored.
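The idea can be sketched in a few lines of Python. This is a minimal illustration and not the CRAM implementation: the start positions are an assumption, obtained by sliding each read from the slide along the reference, and the record layout (start, length, mismatches) is hypothetical.

```python
REFERENCE = "TGAGCTCTAAGTACCCGCGGTCTGTCCG"

# (1-based start position, read bases). The start positions are assumed:
# they were found by sliding each read along the reference.
READS = [
    (1, "TGAGCTCTTAGTAGC"),
    (4, "GCTCTAAGTAGCCGC"),
    (5, "CTCTAAGTAGCCGCG"),
    (11, "GTAGCCGCGGACTGT"),
    (18, "CGGTCTGTCCG"),
]


def encode(start, bases, reference=REFERENCE):
    """Keep the start, the length and only the bases that differ from the reference."""
    window = reference[start - 1:start - 1 + len(bases)]
    mismatches = [(i, b) for i, (b, r) in enumerate(zip(bases, window)) if b != r]
    return start, len(bases), mismatches


def decode(start, length, mismatches, reference=REFERENCE):
    """Rebuild the read from the reference plus the stored mismatches."""
    bases = list(reference[start - 1:start - 1 + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


for start, bases in READS:
    record = encode(start, bases)
    assert decode(*record) == bases  # the bases survive the round trip
    print(record)
```

A read that matches the reference exactly, such as read 5, is reduced to a position and a length, which is where most of the saving comes from.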

Lossy quality scores. Approach 1: quality scores are usually values from 0 to 39. Let’s shrink them, so that they are from 0 to 7 now. Approach 2: let’s treat quality scores using alignment information. For example: preserve only quality scores for mismatching bases. (The slide labels the two approaches "horizontal" and "vertical".)
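Both approaches can be sketched as follows. This is only an illustration under stated assumptions: the even-width binning rule and both function names are hypothetical, and the slides do not say how CRAM actually maps scores to bins.

```python
def bin_quality(score: int, max_score: int = 39, levels: int = 8) -> int:
    """Approach 1: shrink a score in 0..max_score to one of `levels` bins (0..levels-1)."""
    clipped = min(max(score, 0), max_score)
    return clipped * levels // (max_score + 1)


def keep_mismatch_qualities(read, window, scores):
    """Approach 2: keep a score only where the read differs from the reference."""
    return {i: q for i, (b, r, q) in enumerate(zip(read, window, scores)) if b != r}


# Approach 1 on a few scores.
print([bin_quality(q) for q in (0, 10, 20, 39)])

# Approach 2 on read 1 from the reference-based encoding slide;
# the scores themselves are made up for the example.
read = "TGAGCTCTTAGTAGC"
window = "TGAGCTCTAAGTACC"  # the reference bases under the read
scores = list(range(25, 40))
print(keep_mismatch_qualities(read, window, scores))
```

With eight bins each score needs 3 bits instead of the 6 needed for 40 distinct values, and the second approach stores nothing at all for bases that agree with the reference.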

Comparison study: 1K Genomes exomes. BAM -> compress -> CRAM -> uncompress -> BAM. Both the original and the restored files go through some analysis pipeline, and the original SNPs are compared with the restored SNPs.

CRAM NGS data compression. [Chart: bits/base, on a scale from bad to good, for untreated data (do nothing), CRAM lossless, CRAM lossy and CRAM very lossy.]

Progressive application of compression. [Chart: compression level (lossless, 2-fold, 20-fold, 200-fold) against sample value (high to low) and sample accessibility (hard to easy).]

References. Publications:
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5).
Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. Gigascience 1.