IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis.

Figure: Throughput (gigabases) and Cost (cost per Kb). Source: Lucinda Fulton, The Genome Center at Washington University.

Sequencing Technologies http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”
Roche 454 – Flow space
– Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
– Flow space describes the sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc
AB SOLiD – Color space
– Sequencing by DNA ligation using synthetic DNA molecules that contain two nested known bases and a fluorescent dye
– Each base is sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
Illumina/Solexa – Base space
– Single-base extensions of fluorescently labeled nucleotides with protected 3' OH groups
– Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
Reference: http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

Flexible
– Good: with rapidly changing data/tech
– Poor: validation
Human readable
– Convenient for debugging
– Computer doesn’t care!

Sequences: FASTA, FASTQ, SAM/BAM
Alignments: SAM/BAM, MAF
Annotations: BED, GTF, GFF3, GVF, VCF
http://genome.ucsc.edu/FAQ/FAQformat.html
http://www.sequenceontology.org/

FASTQ FASTA

FASTQ: Data Format (sequence data format)
FASTQ
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
  Line 1: begins with “@”, followed by a sequence identifier and an optional description
  Line 2: the sequence
  Line 3: begins with “+”, optionally followed by the sequence identifier and description
  Line 4: encoding of the quality scores for the sequence in line 2
References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
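The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes strictly four lines per read (no wrapped sequence lines), and the names `FastqRecord` and `read_fastq`, as well as the demo read, are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after "@" on line 1 (identifier plus optional description)
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly 4 lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    # Hypothetical read, not taken from the workshop data.
    demo = "@read1 demo\nGATTACA\n+\nIIIIIII\n"
    for record in read_fastq(io.StringIO(demo)):
        print(record.identifier, record.sequence, record.quality)
```

The length check on lines 2 and 4 is a cheap way to catch truncated files before they reach an aligner.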

FASTQ Example FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771. For analysis, it may be necessary to convert to the Sanger form of FASTQ.

Quality scores
Q = Phred quality score
P = base-calling error probability
Q = -10 log10(P)
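The relationship between Q and P can be checked numerically. The sketch below assumes the standard Phred definition; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability, so Q20 is a 1 in 100 chance of a wrong base call and Q30 is 1 in 1000.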

Quality score encodings differ among the platforms
Most analysis tools require Sanger FASTQ quality score encoding
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
  33                        59   64       73                             104                   126
S - Sanger: Phred+33, raw reads typically (0, 40)
X - Solexa: Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+: Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+: Phred+64, raw reads typically (3, 40), with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
L - Illumina 1.8+: Phred+33, raw reads typically (0, 41)
  Format/Platform   Quality score type   ASCII encoding
  Sanger            Phred: 0 to 93       33-126
  Solexa            Solexa: -5 to 62     59-126
  Illumina 1.3      Phred: 0 to 62       64-126
  Illumina 1.5      Phred: 0 to 62       64-126
  Illumina 1.8      Phred: 0 to 93       33-126   *** Sanger format!
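Converting a Phred+64 quality string (Illumina 1.3+/1.5+) to the Sanger Phred+33 form amounts to shifting every character by 31 code points. The following is a minimal sketch with illustrative function names. It does not cover Solexa+64 data, whose scores are on a different scale and need more than a shift.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn an ASCII-encoded quality string into integer scores."""
    return [ord(char) - offset for char in quality]


def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Sanger Phred+33."""
    scores = decode_quality(quality, offset=64)
    if any(score < 0 for score in scores):
        # Characters below '@' (ASCII 64) cannot be Phred+64 scores.
        raise ValueError("input does not look like Phred+64")
    return "".join(chr(score + 33) for score in scores)


if __name__ == "__main__":
    print(phred64_to_phred33("hhhhgggg"))
```

In practice a dedicated tool would normally do this conversion; the sketch only makes the offset arithmetic explicit.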

http://main.g2.bx.psu.edu/

SAM (Sequence Alignment/Map) (alignment data format)
SAM is the output of aligners that map reads to a reference genome
– Tab delimited, with a header section and an alignment section
  Header lines begin with “@” (the header section is optional)
  The alignment section has 11 mandatory fields
– BAM is the binary format of SAM
http://samtools.sourceforge.net/
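Because SAM is tab delimited, one alignment line can be split into its 11 mandatory fields directly. The sketch below uses the field names from version 1.4 of the SAM specification (older versions call RNEXT, PNEXT and TLEN by the names MRNM, MPOS and ISIZE). The function name and the demo line are illustrative, not output from a real aligner.

```python
from typing import Dict, List, Optional, Union

MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

SamRecord = Dict[str, Union[int, str, List[str]]]


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines (those starting with '@')."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("an alignment line needs at least 11 tab-separated fields")
    record: SamRecord = {}
    for name, value in zip(MANDATORY_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    return record


if __name__ == "__main__":
    # Hypothetical alignment line for illustration.
    demo = "\t".join(
        ["r001", "163", "ref", "7", "30", "8M2I4M1D3M", "=", "37", "39",
         "TTAGATAAAGGATACTG", "*"]
    )
    print(parse_sam_line(demo))
```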

http://samtools.sourceforge.net/SAM1.pdf Mandatory Alignment Fields

http://samtools.sourceforge.net/SAM1.pdf Alignment Examples Alignments in SAM format CIGAR string -> 8M2I4M1D3M
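A CIGAR string such as 8M2I4M1D3M is a run-length encoding of alignment operations. A small sketch, assuming the operation letters defined in the SAM specification (function names are illustrative), shows how many read bases and reference bases the example covers.

```python
import re
from typing import List, Tuple

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


if __name__ == "__main__":
    cigar = "8M2I4M1D3M"
    print(parse_cigar(cigar))
    print(query_length(cigar), reference_span(cigar))
```

For 8M2I4M1D3M the read contributes 8 + 2 + 4 + 3 = 17 bases, while the alignment spans 8 + 4 + 1 + 3 = 16 reference positions: the insertion adds read bases only and the deletion adds reference bases only.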

Annotation Formats
Mostly tab-delimited files that describe the location of genome features (e.g., genes)
Also used for displaying annotations on standard genome browsers
Important for associating alignments with specific genome feature descriptions
Knowing format details can be important for translating results!
– BED is zero based
– GTF/GFF are one based

GTF http://useast.ensembl.org/info/website/upload/gff.html Annotation data format

BED format (annotation data format)
  chr1   86114265  86116346  nsv433165
  chr2   1841774   1846089   nsv433166
  chr16  2950446   2955264   nsv433167
  chr17  14350387  14351933  nsv433168
  chr17  32831694  32832761  nsv433169
  chr17  32831694  32832761  nsv433170
  chr18  61880550  61881930  nsv433171
  chr1   16759829  16778548  chr1:21667704   270866  -
  chr1   16763194  16784844  chr1:146691804  407277  +
  chr1   16763194  16784844  chr1:144004664  408925  -
  chr1   16763194  16779513  chr1:142857141  291416  -
  chr1   16763194  16779513  chr1:143522082  293473  -
  chr1   16763194  16778548  chr1:146844175  284555  -
  chr1   16763194  16778548  chr1:147006260  284948  -
  chr1   16763411  16784844  chr1:144747517  405362  +

BED: zero based, start inclusive, stop exclusive. Length = stop - start
GTF/GFF: one based, inclusive. Length = stop - start + 1
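The two coordinate conventions differ only in the start position, which is where off-by-one errors usually creep in. A minimal sketch (illustrative function names) of converting between them:

```python
from typing import Tuple


def bed_to_gtf(start: int, end: int) -> Tuple[int, int]:
    """Zero-based, half-open BED interval -> one-based, inclusive GTF/GFF interval."""
    return start + 1, end


def gtf_to_bed(start: int, end: int) -> Tuple[int, int]:
    """One-based, inclusive GTF/GFF interval -> zero-based, half-open BED interval."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    return end - start


def gtf_length(start: int, end: int) -> int:
    return end - start + 1


if __name__ == "__main__":
    bed = (0, 100)  # first 100 bases of a chromosome in BED terms
    gtf = bed_to_gtf(*bed)
    print(gtf, bed_length(*bed), gtf_length(*gtf))
```

The stop coordinate is numerically the same in both systems; only the start shifts by one, and the feature length is unchanged by the conversion.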

GRCh37 NCBI36
