Intro to Next Generation Sequencing
DM Church. Last Updated: 7 May 2012
Nick Loman and James Hadfield
Koboldt et al., 2010 (Figure 3)
– Bench work to build libraries and sequence
– Clean up and QA reads
– Alignments to Genome or Transcriptome
– Analysis of Alignments
Koboldt et al., 2010
– Sample Contamination
– Library chimeras
– Sample mix-ups
– Tumor-normal switches
– Run quality
Koboldt et al., 2010 (Fig 4A)
Chor et al., 2009
CCL Bio
GCTACGGCATTCAGGCATCAGGCATTAGCAG
GGCATTCAGGGATCAGGCATTAGC->
<-CATGGCATTCAGGGATCAGGCATT
<-GCCATGGCATTCAGGGATCAGGC
CATTCAGGGATCAGGCATTAGCAG->
GGCATTCAGGGATCAGGCATTAGC->
CATTCAGGGATCAGGCATTAGCAG->
GGCATTCAGGGATCAGGCATT->
<-GGATCAGGCATTAGCAG
<-GATCAGGCATTAGCAG
<-GGATCAGGCATTAGCAG
High Coverage: qualities may not be needed
Low Coverage: qualities are important
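The contrast between the two coverage regimes can be illustrated with a toy consensus caller for a single pileup column. This is a minimal sketch, not a real variant caller: the function names `majority_call` and `quality_weighted_call`, the example columns, and the choice to weight each base by its summed Phred quality are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def majority_call(bases):
    """Pick the most frequent base in a pileup column, ignoring qualities."""
    return Counter(bases).most_common(1)[0][0]

def quality_weighted_call(bases, quals):
    """Pick the base whose observations have the largest summed Phred quality."""
    totals = defaultdict(int)
    for base, q in zip(bases, quals):
        totals[base] += q
    return max(totals, key=totals.get)

# Hypothetical high-coverage column: one low-quality error among many good reads.
high_bases = "G" * 19 + "C"
high_quals = [30] * 19 + [10]

# Hypothetical low-coverage column: two low-quality C calls, one high-quality G call.
low_bases = "CCG"
low_quals = [5, 5, 40]

print(majority_call(high_bases), quality_weighted_call(high_bases, high_quals))
print(majority_call(low_bases), quality_weighted_call(low_bases, low_quals))
```

At high coverage both callers agree, so dropping qualities costs little. At low coverage the two low-quality reads outvote the single confident one unless qualities are taken into account.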
Custodia-Lora et al., 2003
FASTQ Example
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38.
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62, while Sanger quality scores span a different range. Solexa quality scores have to be converted to PHRED quality scores.
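The conversions mentioned here can be sketched in a few lines. The details below are not on the slide; they are assumptions taken from the FASTQ variants described by Cock et al.: Sanger FASTQ encodes PHRED scores with an ASCII offset of 33, Illumina 1.3+ encodes PHRED scores with an offset of 64, and Solexa encodes Solexa-scaled scores with an offset of 64, which map to PHRED via Q_PHRED = 10 * log10(10^(Q_Solexa / 10) + 1). The function names are hypothetical.

```python
import math

SANGER_OFFSET = 33    # assumed ASCII offset for Sanger FASTQ
ILLUMINA_OFFSET = 64  # assumed ASCII offset for Solexa and Illumina 1.3+ FASTQ

def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled score to the nearest integer PHRED score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))

def illumina13_to_sanger(qual):
    """Re-encode an Illumina 1.3+ quality string (PHRED+64) as Sanger (PHRED+33)."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in qual)

def solexa_to_sanger(qual):
    """Re-encode a Solexa quality string (Solexa+64) as Sanger (PHRED+33)."""
    return "".join(
        chr(solexa_to_phred(ord(c) - ILLUMINA_OFFSET) + SANGER_OFFSET) for c in qual
    )

print(illumina13_to_sanger("hhh@"))  # PHRED 40, 40, 40, 0 re-encoded with offset 33
print(solexa_to_phred(-5), solexa_to_phred(0), solexa_to_phred(40))
```

The two scales nearly coincide for high scores and diverge only at the low end, which is why the conversion matters most for poor-quality bases.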
SAM (Sequence Alignment/Map)
It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format.
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited, with a header section and an alignment section
– Header lines begin with '@' (the header section is optional)
– Alignment section has 11 mandatory fields
– BAM is the binary format of SAM
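The layout described above can be sketched as a small parser. This is an illustrative sketch only: the field names follow the current SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), header lines are assumed to start with `@`, and the `SamRecord` type, the `parse_sam` function, and the example lines are hypothetical. For real work a dedicated library is the safer choice.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple   # optional fields, kept as raw strings

def parse_sam(lines):
    """Split SAM text lines into header lines and parsed alignment records."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        f = line.split("\t")
        if len(f) < 11:
            raise ValueError(f"expected at least 11 fields, got {len(f)}")
        records.append(SamRecord(
            f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
            f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]),
        ))
    return header, records

# Hypothetical example input: two header lines and one mapped read.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t7\t30\t8M\t*\t0\t0\tGGCATTCA\tIIIIIIII\tNM:i:0",
]
header, records = parse_sam(example)
print(len(header), records[0].rname, records[0].pos, records[0].cigar)
```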
Mandatory Alignment Fields
Alignment Examples: Alignments in SAM format
Valid BED files (two examples: rows named by nsv accession, and rows named by chr1: coordinates)
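Since the coordinates of the BED examples did not survive, the sketch below uses invented values to show the convention that most often causes trouble: BED starts are 0-based and ends are exclusive, so the length of a feature is simply end minus start. The function names and the example row are hypothetical, and tab-separated input is assumed.

```python
def parse_bed_line(line):
    """Parse the three required BED columns plus the optional name column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart, chromEnd")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start > end:
        raise ValueError("chromStart must not exceed chromEnd")
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name

def to_one_based(start, end):
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start + 1, end

# Hypothetical row: coordinates and name are placeholders, not from the slide.
chrom, start, end, name = parse_bed_line("chr1\t99\t200\tfeatureA")
print(chrom, start, end, name, end - start, to_one_based(start, end))
```

The same interval written in a 1-based format such as GFF or SAM would start at 100 rather than 99.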
GTF
GVF format
##gff-version 3
##gvf-version 1.02
##species
##genome-build NCBI MGSCv36
##assembly-name MGSCv36
##assembly-accession GCF_
##file-date
# Study_accession: Combined studies on MGSCv36
# Display_name: Combined studies on MGSCv36
# Study_description: Combined studies on MGSCv36
chr1 dbVar copy_number_variation ID=nsv433533;Name=nsv433533;Start_range=., ;End_range= ,.
chr4 dbVar copy_number_variation ID=nsv433534;Name=nsv433534;Start_range=., ;End_range= ,.
chr9 dbVar copy_number_variation ID=nsv433535;Name=nsv433535;Start_range=., ;End_range= ,.
chr17 dbVar copy_number_variation ID=nsv433536;Name=nsv433536;Start_range=., ;End_range= ,.
chr17 dbVar copy_number_variation ID=nsv433537;Name=nsv433537;Start_range=., ;End_range= ,.
chr17 dbVar copy_number_variation ID=nsv433538;Name=nsv433538;Start_range=., ;End_range= ,.
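The last column of each GVF record is a GFF3-style attribute string: semicolon-separated `tag=value` pairs, where a value may itself be a comma-separated list (as in `Start_range` and `End_range`). A minimal sketch of reading it follows. The function name `parse_attributes` is hypothetical, the numeric values 1000 and 2000 are placeholders because the real coordinates are missing from the slide, and URL-escaped characters are not decoded.

```python
def parse_attributes(column9):
    """Parse a GFF3/GVF attribute column into a dict of tag -> list of values."""
    attrs = {}
    for item in column9.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        attrs[key.strip()] = [v.strip() for v in value.split(",")]
    return attrs

# Placeholder coordinates (1000, 2000) stand in for the values lost from the slide.
attrs = parse_attributes("ID=nsv433533;Name=nsv433533;Start_range=.,1000;End_range=2000,.")
print(attrs["ID"], attrs["Start_range"], attrs["End_range"])
```

A `.` in a range marks an unknown bound, so a start range of `.,1000` says only that the variant begins at or before position 1000.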
Derived data
Derived data
Actual data
Getting exponential growth under control
Trace Organization
– seq1: FASTA, Quality, Chromatogram, Experimental info, Sample
– seq2: FASTA, Quality, Chromatogram, Experimental info, Sample
SRA Organization
– Experiments
– Samples
– Sequences and Qualities
– Era of NGS Explosion
– FASTQ Era
– Bits/Base Era
As of April 10, 2012, SRA contains fewer bytes than bases.
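Storing fewer bytes than bases is possible because a base drawn from A, C, G, T carries only 2 bits, so four bases fit in one byte before any further compression. The sketch below shows that arithmetic with simple 2-bit packing. It is an illustration only and not the encoding SRA actually uses: it ignores N calls and quality scores, and the names `pack` and `unpack` are hypothetical.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack an ACGT string into bytes, four bases per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial chunk
        out.append(byte)
    return bytes(out)

def unpack(data, n):
    """Recover the first n bases from 2-bit packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])

seq = "GCTACGGCATTCAGGCATCAGGCATTAGCAG"
packed = pack(seq)
print(len(seq), "bases in", len(packed), "bytes")
```

The original length has to be stored alongside the packed bytes, since the padding in the last byte is otherwise indistinguishable from real A calls.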
New Cycle
Decision Circle
– What data series to store
– Redundancy removal
– Normalization
– Lossy vs Lossless
– Compression tuning
Practical Application
– BAM and similar formats containing both raw reads and alignments become the primary output of raw sequencing
– Increases the number of data series
– Compression By Reference reduces sizes of other data series
– New sets of tradeoffs
– New compression algorithms
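The idea behind compression by reference can be sketched as follows: once a read is aligned, only its position, its length, and the bases that differ from the reference need to be stored. This is a simplified illustration that handles substitutions only (no indels or clipping) and is not the actual SRA or BAM encoding. The names `encode` and `decode` are hypothetical, and the offset of 5 for the example read is an assumption inferred by matching the read against the reference string shown earlier in the deck.

```python
def encode(read, reference, pos):
    """Represent an aligned read as (position, length, list of mismatches)."""
    ref_segment = reference[pos:pos + len(read)]
    if len(ref_segment) != len(read):
        raise ValueError("read extends beyond the reference")
    diffs = [(i, b) for i, (b, r) in enumerate(zip(read, ref_segment)) if b != r]
    return pos, len(read), diffs

def decode(record, reference):
    """Rebuild the read from the reference and the stored mismatches."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)

reference = "GCTACGGCATTCAGGCATCAGGCATTAGCAG"
read = "GGCATTCAGGGATCAGGCATTAGC"

record = encode(read, reference, 5)  # 0-based offset 5 is an assumed placement
print(record)
print(decode(record, reference) == read)
```

A 24-base read collapses to a position, a length, and a single mismatch, which is why the sizes of the other data series, such as qualities and names, come to dominate storage.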
Analyzing New Compression Method
BAMs from 1000 Genomes Project
– Data from 1000 Genomes Project
– All available combinations of samples, platforms, and aligners
– 3114 files
– 27 Tb of disk space after compression
BAM treatment
– Names are dropped after restoring mates
– Only the sequencing quality score is saved
– None of the non-redundant optional tags are preserved
Correction of BAM inconsistencies
– Occasional alignments to stretches of Ns on the reference, or beyond the reference, were converted to unaligned
– Different PCR duplicate flags for mates
Changes To SRA Run Browser
Science, 1 July 2011, Vol. 333
Li et al., 2011, Fig. 1
Li et al., 2011, Fig. 2
Kleinman et al., 2012, Fig. 1
Kleinman et al., 2012, Table 1
Lin et al., 2012, Fig. 1
Lin et al., 2012, Fig. 2
Pickrell et al., 2012, Fig. 1
Li et al., 2012, Fig. 1
Li et al., 2012, Fig. 2
Li et al., 2012, Fig. 3
Li et al., 2012, Fig. 4