Canadian Bioinformatics Workshops

1 Canadian Bioinformatics Workshops

3 Module 2: File Formats

4 File Formats: Great, I got my data. Now what? Data and information management in genomics is slowly moving out of infancy and is now at the toddler stage. The good news: some data formats are being widely accepted. The bad news: there are still many competing standards in some areas; interoperability between data standards is almost non-existent; governance is questionable.

5 File Formats: Data Format Types. Raw sequence data, e.g. FASTA. Aligned data, e.g. BAM. Processed data, e.g. BED.

6 File Formats: Raw Sequence Data Formats. fasta, csfasta, fastq, csfastq, SRF …. and about 30 other file formats ormats.html

7 File Formats: (cs)FASTA/(cs)FASTQ. FASTA: a header line beginning with ">", followed by the sequence. FASTQ: adds quality values (QVs) encoded as single-byte ASCII codes. Most aligners accept FASTA/FASTQ as input. Issue: the data is voluminous (2 bytes per base for FASTQ). Do Phred-scaled values provide the most information?
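
To make the FASTQ layout concrete, here is a minimal Python sketch that reads four-line FASTQ records and decodes the quality string into Phred scores. It assumes the Phred+33 (Sanger) ASCII offset and unwrapped records; the function name `read_fastq` and the example read are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so '!' (ASCII 33) means QV 0.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


def error_probability(qv):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10)


# Hypothetical one-read example: one quality character per base.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)  # [('read1', 'ACGT', [40, 40, 20, 0])]
```

Each base costs one byte for the call plus one byte for its QV, which is where the "2 bytes per base" figure on the slide comes from.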

8 File Formats: SRF. More compact than FASTQ, but harder to use. Allows the user to submit additional data, e.g. additional QVs and intensity values. The community appears to be converging on 1 base value and 1 QV per base. Possible to compress to 1 byte per base. Should you have to care about input data formats?
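
One way to see why 1 byte per base is achievable: a base call needs 2 bits (A, C, G, T) and a QV in the range 0 to 63 needs 6 bits. The Python sketch below packs both into a single byte. This is only an illustration of the arithmetic, not the actual SRF encoding; the bit layout and the function names `pack` and `unpack` are assumptions, and ambiguous calls such as N are not representable in this scheme.

```python
BASES = "ACGT"  # 4 symbols fit in 2 bits


def pack(base, qv):
    """Pack one base call and one QV (0..63) into a single byte value."""
    if not 0 <= qv <= 63:
        raise ValueError("QV must fit in 6 bits (0..63)")
    return (BASES.index(base) << 6) | qv


def unpack(value):
    """Recover (base, qv) from a byte value produced by pack()."""
    return BASES[value >> 6], value & 0x3F


sequence = "ACGT"
qualities = [40, 40, 20, 0]
packed = bytes(pack(b, q) for b, q in zip(sequence, qualities))
restored = [unpack(value) for value in packed]
print(len(packed))  # 4 bytes for 4 bases
```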

9 File Formats: Aligned Data - BAM. BAM is the binary version of SAM (Sequence Alignment/Map); the binary encoding makes the format more compact. Contains information about the alignment of a read to a genome. Includes mate pair / paired end information joining distinct reads. Quality of alignment is denoted by mapping/pairing QV. Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
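
Because BAM is binary, the fields are easier to inspect in its text counterpart, SAM. The Python sketch below splits one SAM alignment line into its mandatory columns and decodes a few bits of the FLAG field. The example record and the helper name `parse_sam_line` are made up for illustration; reading real BAM files would normally go through a dedicated library such as pysam or the samtools command line.

```python
def parse_sam_line(line):
    """Parse the leading mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "qname": fields[0],       # read name
        "flag": flag,             # bitwise flags
        "rname": fields[2],       # reference sequence name
        "pos": int(fields[3]),    # 1-based leftmost mapping position
        "mapq": int(fields[4]),   # mapping quality
        "cigar": fields[5],       # alignment description
        "rnext": fields[6],       # reference of the mate ('=' means same)
        "pnext": int(fields[7]),  # position of the mate
        "tlen": int(fields[8]),   # observed template length
        "is_paired": bool(flag & 0x1),
        "is_proper_pair": bool(flag & 0x2),
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
    }


# Hypothetical record: flag 99 = paired, proper pair, mate reverse, first in pair.
record = parse_sam_line("read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII")
print(record["rname"], record["pos"], record["mapq"])  # chr1 100 60
```

The `rnext`/`pnext` columns and the pairing bits in the flag are what carry the mate pair / paired end information mentioned on the slide.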

10 File Formats: More BAM. Many tertiary analysis tools use BAM. BAM makes machine-specific issues "transparent", e.g. colour space. Examples: SAVANT (Fiume et al.), IGV.

11 File Formats: Processed Data Formats. GFF, (big)BED, (big)WIG.

12 File Formats: GFF. A tab-delimited, column-based file format that describes features located at chromosomal positions. Not a compact format.
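
As a rough illustration of the GFF layout, the Python sketch below parses one nine-column, tab-delimited GFF line into a named record. The example line and the names `GffFeature` and `parse_gff_line` are made up for illustration; GFF coordinates are 1-based and inclusive, which the length calculation reflects.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: str


def parse_gff_line(line):
    """Parse one tab-delimited GFF feature line (9 columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], cols[8])


# Hypothetical feature line.
feature = parse_gff_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1")
length = feature.end - feature.start + 1  # inclusive coordinates
print(feature.seqid, feature.feature_type, length)  # chr1 gene 1001
```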

13 File Formats: BED. Created by the UCSC genome team. Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser. bigBed: a newer format optimized for next-gen data, essentially a binary version of BED. Kent et al. PMID: 20639541
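
A practical difference from GFF is the coordinate convention: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. The Python sketch below parses a minimal BED line and converts its interval to 1-based inclusive coordinates. The example line and the function names are made up for illustration, and only the first four BED columns are handled.

```python
def parse_bed_line(line):
    """Parse chrom, start, end and an optional name from one BED line."""
    cols = line.rstrip("\n").split("\t")
    chrom, start, end = cols[0], int(cols[1]), int(cols[2])
    name = cols[3] if len(cols) > 3 else None
    return chrom, start, end, name


def bed_to_one_based(start, end):
    """Convert a 0-based half-open interval to 1-based inclusive."""
    return start + 1, end


# Hypothetical BED line covering the same bases as a 1-based 1000..2000 feature.
chrom, start, end, name = parse_bed_line("chr1\t999\t2000\tgene1")
length = end - start  # half-open interval, so no +1
one_based = bed_to_one_based(start, end)
print(chrom, length, one_based)  # chr1 1001 (1000, 2000)
```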

14 File Formats: WIG. Also created by the UCSC team. WIG is optimized for storing "levels", i.e. a numeric value per position. Useful for displaying transcriptome, ChIP-seq, etc. bigWig: the binary WIG format.
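
To show what storing "levels" looks like, the Python sketch below reads the fixedStep flavour of WIG, where a declaration line gives the chromosome, start position and step, and each following line holds one value. The example data and the function name `parse_fixed_step` are made up for illustration; variableStep blocks and the optional span parameter are not handled.

```python
def parse_fixed_step(lines):
    """Yield (chrom, position, value) from fixedStep WIG lines."""
    chrom = None
    position = None
    step = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "#")):
            continue  # skip blank lines, track lines and comments
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])  # WIG positions are 1-based
            step = int(params["step"])
            continue
        yield chrom, position, float(line)
        position += step


# Hypothetical coverage-style data.
wig_lines = [
    "fixedStep chrom=chr1 start=100 step=10",
    "1.5",
    "2.0",
    "0.0",
]
levels = list(parse_fixed_step(wig_lines))
print(levels)  # [('chr1', 100, 1.5), ('chr1', 110, 2.0), ('chr1', 120, 0.0)]
```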

15 File Formats: Galaxy. Galaxy has a number of file format conversion tools; more on Galaxy tomorrow.

16 We are on a Coffee Break & Networking Session
