Canadian Bioinformatics Workshops www.bioinformatics.ca.


1 Canadian Bioinformatics Workshops www.bioinformatics.ca

3 Module 2: File Formats

4 File Formats bioinformatics.ca Great… I got my data, now what? Data and information management in genomics science is slowly moving out of infancy and is now at the toddler stage. The good news: some data formats are being widely accepted. The bad news: there are still many competing standards in some areas; interoperability of data standards is almost non-existent; governance is questionable.

5 Data Format Types. Raw sequence data, e.g. FASTA. Aligned data, e.g. BAM. Processed data, e.g. BED.

6 Raw Sequence Data Formats: fasta, csfasta, fastq, csfastq, SRF … and about 30 other file formats: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

7 (cs)FASTA/(cs)FASTQ. FASTA: a header line beginning with ">", followed by the sequence. FASTQ: adds quality values (QVs) encoded as single-byte ASCII codes. Most aligners accept FASTA/FASTQ as input. Issue: the data is voluminous (2 bytes per base for FASTQ). Do Phred-scaled values provide the most information?
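To make the FASTQ layout concrete, here is a minimal Python sketch of reading four-line FASTQ records and decoding the quality string. It assumes the common Phred+33 ASCII offset and unwrapped (single-line) sequences; neither is stated on the slide, and the names `FastqRecord` and `parse_fastq` are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    Assumption: qualities use the Phred+33 ASCII offset and each record
    occupies exactly four lines (header, sequence, '+', qualities).
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # One ASCII character per base: subtract the offset to get the QV.
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

Each base costs one character for the call and one for its quality, which is where the "2 bytes per base" figure comes from.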

8 SRF. More compact than FASTQ, but harder to use. Allows the user to submit additional data, e.g. additional QVs and intensity values. The community appears to be converging on 1 base value and 1 QV per position. Possible to compress to 1 byte per base. Should you have to care about input data formats?
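The "1 byte per base" figure can be illustrated with a simple bit-packing scheme: 2 bits for the base and 6 bits for the QV. This is only a sketch of the idea and is not the encoding SRF actually uses; the functions `pack_reads` and `unpack_reads` are hypothetical, and the scheme assumes bases limited to A, C, G, T with QVs in the range 0 to 63.

```python
from typing import List, Sequence, Tuple

BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_reads(sequence: str, qualities: Sequence[int]) -> bytes:
    """Pack each (base, QV) pair into one byte: top 2 bits base, low 6 bits QV.

    Assumption: illustrative layout only; ambiguous bases such as N and
    QVs above 63 are rejected rather than encoded.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lengths differ")
    packed = bytearray()
    for base, qv in zip(sequence, qualities):
        if base not in BASE_TO_CODE:
            raise ValueError(f"cannot encode base {base!r}")
        if not 0 <= qv <= 63:
            raise ValueError(f"QV {qv} does not fit in 6 bits")
        packed.append((BASE_TO_CODE[base] << 6) | qv)
    return bytes(packed)


def unpack_reads(data: bytes) -> Tuple[str, List[int]]:
    """Reverse pack_reads: recover the sequence and the list of QVs."""
    sequence = "".join(CODE_TO_BASE[byte >> 6] for byte in data)
    qualities = [byte & 0x3F for byte in data]
    return sequence, qualities


if __name__ == "__main__":
    packed = pack_reads("ACGT", [40, 40, 20, 0])
    print(len(packed), unpack_reads(packed))
```

Compared with FASTQ text, this halves the storage per base before any general-purpose compression is applied.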

9 Aligned Data: BAM. BAM is the binary version of SAM (Sequence Alignment/Map); the binary encoding makes the format more compact. Contains information about the alignment of a read to a genome. Includes mate pair / paired end information joining distinct reads. Quality of alignment is denoted by a mapping/pairing QV. Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
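Because BAM is binary, the alignment, pairing and mapping-quality information is easiest to show on the equivalent SAM text line. The sketch below splits one SAM alignment line into its 11 mandatory tab-separated fields and decodes a few FLAG bits. The class `SamRecord`, the function `parse_sam_line` and the example read are hypothetical; reading real BAM files would normally be done with a dedicated library rather than code like this.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flags (pairing, strand, mapped or not)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality (Phred-scaled)
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities

    @property
    def is_paired(self) -> bool:
        return bool(self.flag & 0x1)

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)

    @property
    def is_first_in_pair(self) -> bool:
        return bool(self.flag & 0x40)


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line (optional tags ignored)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


if __name__ == "__main__":
    # Made-up example read, for illustration only.
    record = parse_sam_line("read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII")
    print(record.rname, record.pos, record.mapq, record.is_paired, record.is_reverse)
```

The RNEXT and PNEXT fields are what join the two reads of a pair, and MAPQ carries the alignment quality mentioned on the slide.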

10 More BAM. Many tertiary analysis tools use BAM. BAM makes machine-specific issues (e.g. colour space) "transparent". Examples: SAVANT (Fiume et al.), IGV (http://www.broadinstitute.org/igv).

11 Processed Data Formats: GFF, (big)BED, (big)WIG

12 GFF. A tab-separated, column-based file format that describes features located at chromosomal locations. Not a compact format.
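As an illustration of the column layout, the following sketch parses one GFF feature line into its nine columns. It assumes GFF3-style `key=value` attributes separated by semicolons (older GFF versions write the last column differently), and the names `GffFeature` and `parse_gff_line`, as well as the example feature, are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int             # 1-based, inclusive
    end: int               # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> GffFeature:
    """Parse one nine-column GFF line, assuming GFF3-style attributes."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF feature line has exactly 9 tab-separated columns")
    attributes: Dict[str, str] = {}
    for pair in fields[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    # A "." in the score column means "no score".
    score = None if fields[5] == "." else float(fields[5])
    return GffFeature(
        fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
        score, fields[6], fields[7], attributes,
    )


if __name__ == "__main__":
    # Made-up example feature, for illustration only.
    feature = parse_gff_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo")
    print(feature.seqid, feature.start, feature.end, feature.attributes)
```

Every feature repeats the sequence name, source and type as text, which is one reason the format is not compact.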

13 BED. Created by the UCSC genome team. Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser. bigBed: a new format optimized for next-gen data, essentially a binary version of BED. Kent et al. PMID: 20639541
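One practical difference from GFF is the coordinate convention: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. The sketch below parses the first six BED columns and converts a GFF-style interval to BED coordinates. The names `BedInterval`, `parse_bed_line` and `gff_interval_to_bed`, and the example interval, are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int              # 0-based, inclusive
    end: int                # 0-based, exclusive
    name: Optional[str]
    score: Optional[str]
    strand: Optional[str]

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Parse the three mandatory BED columns plus optional name, score and strand."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    optional = fields[3:6] + [None] * (3 - len(fields[3:6]))
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), *optional)


def gff_interval_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive interval to 0-based half-open coordinates."""
    return start - 1, end


if __name__ == "__main__":
    # Made-up example interval, for illustration only.
    interval = parse_bed_line("chr1\t999\t2000\tgene1\t0\t+")
    print(interval, interval.length)
    print(gff_interval_to_bed(1000, 2000))
```

With half-open coordinates the feature length is simply `end - start`, which is convenient when many intervals are processed.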

14 WIG. Also created by the UCSC team. WIG is optimized for storing "levels", i.e. numeric values along the genome. Useful for displaying transcriptome, ChIP-seq, etc. bigWig: binary WIG format.
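To show how WIG stores "levels", here is a minimal sketch that reads text WIG data in both `fixedStep` and `variableStep` layouts and yields one value per position. It skips `track`, `browser` and comment lines and does no validation beyond the basics; the function name `parse_wig` and the example values are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_wig(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start, span, value) tuples; start is 1-based as in WIG."""
    mode: Optional[str] = None
    chrom = ""
    position = 0
    step = 0
    span = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: key=value parameters describe the block that follows.
            mode = fields[0]
            params = dict(field.split("=", 1) for field in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "fixedStep":
            # One value per line; positions advance by a constant step.
            yield chrom, position, span, float(fields[0])
            position += step
        elif mode == "variableStep":
            # Each line carries its own position and value.
            yield chrom, int(fields[0]), span, float(fields[1])
        else:
            raise ValueError("data line found before a step declaration")


if __name__ == "__main__":
    example = [
        "track type=wiggle_0",
        "fixedStep chrom=chr1 start=100 step=10 span=5",
        "1.5",
        "2.5",
        "variableStep chrom=chr2",
        "7 3.0",
    ]
    for entry in parse_wig(example):
        print(entry)
```

In the `fixedStep` layout only the values are written out, which is why WIG suits dense signal such as coverage tracks.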

15 Galaxy. Galaxy has a number of file format conversion tools… more on Galaxy tomorrow.

16 We are on a Coffee Break & Networking Session

