Published by Eustacia Owen. Modified over 8 years ago.
1
Canadian Bioinformatics Workshops www.bioinformatics.ca
3
Module 2
File Formats
4
File Formats bioinformatics.ca
Great… I got my data, now what?
Data and information management is slowly moving out of infancy in genomics science… it is at the toddler stage.
The good news
– Some data formats are being widely accepted
The bad news
– Still many competing standards in some areas
– Interoperability of data standards is almost non-existent
– Governance is questionable
5
Data Format Types
Raw sequence data, e.g. FASTA
Aligned data, e.g. BAM
Processed data, e.g. BED
6
Raw Sequence Data Formats
fasta
csfasta
fastq
csfastq
SRF
… and about 30 other file formats
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
7
(cs)Fasta/(cs)Fastq
FASTA
– Header line starting with “>”
– Sequence
FASTQ
– Adds QVs (quality values) encoded as single-byte ASCII codes
Most aligners accept FASTA/Q as input
Issue: data is voluminous (2 bytes per base for FASTQ)
Do PHRED-scaled values provide the most information?
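The FASTQ layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: it assumes four-line records (no wrapped sequence lines) and the common Phred+33 ASCII offset, and the names `FastqRecord`, `read_fastq` and `error_probability` are invented for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one decoded QV per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumption: QVs are stored as ASCII characters with the given offset
    (33 is common; other offsets have been used by older pipelines).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence,
                          [ord(char) - offset for char in quality])


def error_probability(qv: int) -> float:
    """Convert a PHRED-scaled QV to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-qv / 10)


# One base plus one quality character per position: 2 bytes per base.
example = io.StringIO("@read1\nACGT\n+\nI5!#\n")
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, record.qualities)
```

Each base costs one byte for the call and one byte for its quality character, which is where the "2 bytes per base" figure comes from.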
8
SRF
More compact than FASTQ, but harder to use
Allows the user to submit additional data, e.g. additional QVs and intensity values
Community appears to be converging on 1 base value and 1 QV per base
Possible to compress to 1 byte / base
Should you have to care about input data formats?
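To see why 1 byte per base is plausible, note that four bases need 2 bits and a QV in the range 0 to 63 needs 6 bits. The sketch below packs both into one byte. It is purely illustrative arithmetic and is not the SRF encoding; it assumes only A, C, G and T occur (no N calls), and `pack_base`, `unpack_base` and `pack_read` are hypothetical names.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumption: no ambiguous calls such as N


def pack_base(base: str, qv: int) -> int:
    """Pack one base (2 bits) and one QV (6 bits) into a single byte value."""
    if base not in BASES:
        raise ValueError(f"unsupported base: {base!r}")
    if not 0 <= qv <= 63:
        raise ValueError(f"QV out of range for 6 bits: {qv}")
    return (BASES.index(base) << 6) | qv


def unpack_base(byte: int) -> Tuple[str, int]:
    """Recover the (base, QV) pair from a packed byte value."""
    return BASES[byte >> 6], byte & 0x3F


def pack_read(sequence: str, qualities: List[int]) -> bytes:
    """Pack a whole read: one output byte per base."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")
    return bytes(pack_base(b, q) for b, q in zip(sequence, qualities))


packed = pack_read("ACGT", [40, 20, 0, 2])
print(len(packed), [unpack_base(value) for value in packed])
```

The trade-off is the one the slide points at: the packed form is half the size of FASTQ but is no longer readable as text.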
9
Aligned Data - BAM
BAM
– Binary version of SAM (Sequence Alignment/Map)
– Binary encoding makes this format more compact
Contains information about the alignment of a read to a genome
Includes mate pair / paired end information joining distinct reads
Quality of alignment denoted by mapping/pairing QV
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
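BAM itself is binary and is normally read through a library or SAMtools, but the same fields appear as tab-separated text in SAM. As a rough sketch, the code below splits one SAM alignment line into its mandatory columns and decodes a few FLAG bits that carry the pairing information. The example read is made up, and `SamAlignment` and `parse_sam_line` are names invented here.

```python
from typing import NamedTuple

# A subset of the SAM FLAG bits
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_FIRST_IN_PAIR = 0x40


class SamAlignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int     # mapping quality
    cigar: str
    rnext: str    # reference name of the mate ('=' means same as rname)
    pnext: int    # position of the mate
    tlen: int
    seq: str
    qual: str

    def has_flag(self, bit: int) -> bool:
        return bool(self.flag & bit)


def parse_sam_line(line: str) -> SamAlignment:
    """Parse the 11 mandatory columns of a SAM alignment line.

    Optional TAG:TYPE:VALUE columns after column 11 are ignored here.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamAlignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                        int(fields[4]), fields[5], fields[6], int(fields[7]),
                        int(fields[8]), fields[9], fields[10])


# Hypothetical paired-end read; FLAG 99 = 1 + 2 + 32 + 64
aln = parse_sam_line("read1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII")
print(aln.rname, aln.pos, aln.mapq, aln.has_flag(FLAG_PROPER_PAIR))
```

Header lines (those starting with "@") are not alignment records and would need to be skipped before calling this function.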
10
More BAM
Many tertiary analysis tools use BAM
– BAM makes machine-specific issues “transparent”, e.g. colour space
SAVANT (Fiume et al.)
IGV http://www.broadinstitute.org/igv
11
Processed Data Formats
GFF
(BIG) BED
(BIG) WIG
12
GFF
Column-based (tab-delimited) file format
Contains features located at chromosomal locations
Not a compact format
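As an illustration of the column layout, here is a small sketch that parses one feature line. It assumes the GFF3 flavour (nine tab-separated columns, with `key=value` attributes separated by semicolons); older GFF versions format the last column differently. The feature shown is invented, and `GffFeature` and `parse_gff3_line` are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    score: Optional[float]   # '.' in the file means "no score"
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Parse one GFF3 feature line (assumes GFF3-style attributes)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF feature line has exactly 9 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes: Dict[str, str] = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(seqid, source, ftype, int(start), int(end),
                      None if score == "." else float(score),
                      strand, phase, attributes)


feature = parse_gff3_line(
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo")
print(feature.feature_type, feature.start, feature.end, feature.attributes)
```

Every feature repeats its sequence name, source and type as text, which is one reason the format is not compact.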
13
BED
Created by the UCSC genome team
Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser
BIG BED
– New format optimized for next-gen data
– Essentially a binary version
Kent et al. PMID: 20639541
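One practical difference from GFF is the coordinate convention: BED starts are 0-based and ends are exclusive, while GFF uses 1-based inclusive coordinates. The sketch below parses the first columns of a tab-delimited BED line and converts a GFF interval to BED coordinates. It is a minimal example with an invented feature; `BedInterval`, `parse_bed_line` and `gff_to_bed_interval` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int            # 0-based
    end: int              # exclusive
    name: Optional[str]   # optional fourth column

    @property
    def length(self) -> int:
        # With half-open coordinates, length is simply end - start
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Parse the first three (optionally four) columns of a BED line.

    Assumption: columns are tab-delimited; further optional columns
    (score, strand, ...) are ignored in this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    name = fields[3] if len(fields) > 3 else None
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), name)


def gff_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive GFF interval to 0-based half-open BED."""
    return start - 1, end


interval = parse_bed_line("chr1\t999\t2000\tgene1")
print(interval, interval.length, gff_to_bed_interval(1000, 2000))
```

Forgetting this off-by-one shift is a common source of errors when converting between the two formats.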
14
WIG
Also created by the UCSC team
WIG is optimized for storing “levels”
Useful for displaying transcriptome, ChIP-seq, etc.
bigWIG
– Binary WIG format
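The "levels" idea can be seen in how little text WIG needs per data point. The following sketch reads the two WIG declaration styles, `fixedStep` and `variableStep`, and yields one (chromosome, position, value) tuple per data line. It is a simplified reader: it ignores the optional `span` parameter and track settings, the data are invented, and `parse_wig` is a hypothetical name.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_wig(lines: Iterable[str]) -> Iterator[Tuple[str, int, float]]:
    """Yield (chrom, 1-based position, value) from WIG text lines.

    Simplification: 'span' is ignored, so each value is reported only
    at the first position it covers.
    """
    mode: Optional[str] = None
    chrom = ""
    position = 0
    step = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip comments and display settings
        if line.startswith(("fixedStep", "variableStep")):
            parts = line.split()
            mode = parts[0]
            params = dict(part.split("=", 1) for part in parts[1:])
            chrom = params["chrom"]
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "fixedStep":
            # Data lines hold only the value; positions are implied
            yield chrom, position, float(line)
            position += step
        elif mode == "variableStep":
            # Data lines hold an explicit position and a value
            pos_text, value_text = line.split()
            yield chrom, int(pos_text), float(value_text)
        else:
            raise ValueError("data line found before any declaration line")


wig_text = [
    "track type=wiggle_0",
    "fixedStep chrom=chr1 start=100 step=10",
    "1.5",
    "2.0",
    "variableStep chrom=chr2",
    "7 3.0",
]
levels = list(parse_wig(wig_text))
print(levels)
```

In `fixedStep` mode the positions are never written out, which is what makes the format economical for dense signals such as coverage.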
15
Galaxy
Galaxy has a number of file format conversion tools… more on Galaxy tomorrow.
16
We are on a Coffee Break & Networking Session