SOLiD Sequencing & Data
Overview
- Uses for the SOLiD system
- Starting Material -> Final Library Material
- Bead Preparation & Deposition (Slide Overview)
- Sequencing Process (‘Colorspace’ vs Basecalls)
- Data Formats & Derivative Data Overview
- Future Topics
Uses for SOLiD: Anything where a reference is available
- Resequencing
  - SNP or Indel studies
  - Exome or other Capture
  - Whole Genome
- Abundance studies
  - Transcriptome RNAseq
  - Ribosomal Profiling
  - Microbiome
  - Small RNAs (miR or other)
  - ChIP-Seq / RIP-Seq
- NOT suitable for de novo sequencing (assembly or unknowns)
  - Technically it is possible, but other platforms would likely give FAR better results
Regardless of starting material, we sequence DNA fragments
- Regardless of starting material, we (or you) prepare a short DNA fragment library derived from it.
- Longer polynucleotides are generally sheared to a smaller size
  - Covaris or enzymatic digestion
  - May depend on application! Getting specific ends is important to some applications (ChIP, Protections, etc.)
  - Mate-pair libraries may also be prepared where we want to sequence the ends of very large fragments
- RNA gets reverse transcribed to DNA
- Adapter sequences are added on in the process
  - As extendable, ligated, stranded RT primers for RNA, or by ligation after shearing and cleanup for DNA fragments.
- CRITICAL: Adapter cleanup post ligation! Leftover adapter is a very common major contaminant in poorer library preparations
The Generic Derived Library
- Libraries have two end sequences used for both PCR and sequencing priming.
  - “P1” is the universal forward primer sequence.
  - The second adapter, “P2”, may have an embedded barcode sequence where applicable.
- Between the two adapter ends is the DNA insert, which can be sequenced as any combination of forward, reverse, and/or barcode reads (green arrows).
- Note: SOLiD adapter sequences DIFFER from Illumina's, so preparations designed for Illumina need their adapters changed before they can be adapted to this platform.
Bead Preparation from Libraries
- A library or pool of libraries is subjected to emulsion PCR to populate beads
  - The oil micro-reactors are titrated such that each bead is populated by a single template.
  - Unpopulated beads are removed in subsequent cleanup.
Slide Deposition of enriched beads
- Beads are prepared, then flowed into and adhered in the flowcell lanes.
- Low loading: little data
- Overloading: unable to resolve single beads
Instrument Run
- Identifies single spots in each lane to track for signal.
- Camera images 708 “panels” on each lane
Colorspace
- “Colorspace” refers to the two-nucleotide encoding used by SOLiD.
- Tiled 5-bp steps with resets.
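The two-nucleotide encoding can be sketched in a few lines. This is a minimal illustration, not instrument code: it assumes the commonly described SOLiD scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), and that the read string starts with the last base of the primer (`T` in the F3 reads shown in this deck). The function name `encode_colorspace` is made up for this example.

```python
# 2-bit base codes; the color of a dinucleotide is the XOR of its two codes.
# Result: AA/CC/GG/TT -> 0, AC/CA/GT/TG -> 1, AG/GA/CT/TC -> 2, AT/TA/CG/GC -> 3
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colorspace(bases: str, primer_base: str = "T") -> str:
    """Encode a base sequence as a colorspace string such as 'T3131'.

    primer_base is the known base preceding the insert (assumed to be 'T').
    """
    prev = primer_base
    colors = []
    for base in bases:
        # One color per adjacent pair, so every base contributes to two colors.
        colors.append(str(BASE_CODE[prev] ^ BASE_CODE[base]))
        prev = base
    return primer_base + "".join(colors)


print(encode_colorspace("ACGT"))  # T3131
```

Because each base is covered by two overlapping color calls, a color alone never identifies a base; it only says how one base relates to its neighbor.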
Colorspace
- 5-bp steps with resets.
- Di-nucleotide reads result in redundancy in calls
  - In practice this translates to a slightly higher accuracy in mutation calls
- Resets in extensions mean that a mis/non-incorporation or a bad cycle does not kill a read.
  - Resets also allow targeted cycles to be repeated without rerunning everything.
- Drawback: the resulting sequence is encoded in “colorspace” dinucleotide calls.
  - Must use colorspace aligners for the data as-is (Lifescope)
  - Possible to use an additional 3-bp tiled reading cycle set to disambiguate and produce base-calls (ECC).
  - Possible to use the known first base to walk a base sequence out, but any poor call anywhere will then cause a cascade of subsequent errors. Better to use colorspace algorithms.
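The error cascade mentioned above is easy to see in a small sketch. It assumes the usual XOR color code (A=0, C=1, G=2, T=3) and reads that contain only the colors 0-3 after the leading primer base; `decode_colorspace` and the three reads are invented for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def decode_colorspace(read: str) -> str:
    """Walk a base sequence out of a colorspace read, starting from the known first base."""
    prev = BASE_CODE[read[0]]
    bases = []
    for color in read[1:]:
        prev ^= int(color)  # next base = previous base XOR color
        bases.append(CODE_BASE[prev])
    return "".join(bases)


reference_read = "T3131000"  # error-free read
miscalled_read = "T3101000"  # ONE wrong color call (third color)
true_snp_read = "T3102000"   # a real G->C change alters TWO adjacent colors

print(decode_colorspace(reference_read))  # ACGTTTT
print(decode_colorspace(miscalled_read))  # ACCAAAA  (every base after the error is wrong)
print(decode_colorspace(true_snp_read))   # ACCTTTT  (only the variant base differs)
```

This is why colorspace-aware aligners compare colors rather than decoded bases: an isolated color mismatch looks like a measurement error, while two adjacent mismatches that are consistent with a base change look like a real variant.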
Data we get
- Data is by default in “XSQ” format
  - A binary file, not human readable.
  - Possible to export to ‘CSFASTA’ & ‘CSQUAL’ files, which in combination are similar to FASTQ from Illumina.
  - Some additional meta information is lost when doing so.
  - Lifescope is the only existing aligner for XSQ data.
- CSFASTA: Read ID, then color calls (0-3 for the 4 dyes). CSQUAL similarly has quality scores for each read.
    >600_50_31_F3
    T2222002113300322132112231
    >600_50_63_F3
    T2330133212130133221033110
    >600_50_100_F3
    T0130001131012310201000101
- FASTQ: Read ID, then sequence, then a "+" separator line (which may repeat the sequence ID), then quality scores for the read.
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
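As a rough sketch of how the CSFASTA and CSQUAL exports fit together, the snippet below pairs each read's color calls with its quality values by read ID. It assumes the common layout in which both files use `>read_id` header lines, optional `#` comment lines, and (in the CSQUAL file) space-separated integer qualities. The function names are hypothetical and the quality values are invented placeholders, not real data.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, payload) pairs from CSFASTA- or CSQUAL-style text lines."""
    read_id = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            yield read_id, line
            read_id = None


def pair_reads(
    csfasta_lines: Iterable[str], csqual_lines: Iterable[str]
) -> Iterator[Tuple[str, str, Optional[List[int]]]]:
    """Join color calls with their quality values by read ID."""
    quals: Dict[str, List[int]] = {
        rid: [int(q) for q in payload.split()]
        for rid, payload in parse_records(csqual_lines)
    }
    for rid, colors in parse_records(csfasta_lines):
        yield rid, colors, quals.get(rid)  # None if the read has no quality entry


csfasta = [">600_50_31_F3", "T2222002113300322132112231"]
# Placeholder qualities (one per color call); these numbers are made up.
csqual = ["# example header", ">600_50_31_F3", " ".join(["30"] * 25)]

paired = list(pair_reads(csfasta, csqual))
print(paired[0][0], len(paired[0][1]) - 1, len(paired[0][2]))  # 600_50_31_F3 25 25
```

Note that the color string is one character longer than the quality list because of the leading primer base.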
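Call Quality scores
- Call qualities are stored as ASCII characters and represent Phred-scale scores.
- The encoding has historically varied by platform (for example, ASCII offsets of 33 or 64).
- Basically, a log-scale error probability: Q = -10 * log10(P), so Q20 means a 1 in 100 chance that the call is wrong.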
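A short sketch of how an ASCII quality string maps to Phred scores and error probabilities. It assumes the Phred+33 offset by default; the `offset` parameter is there because the offset is the part that has varied between platforms. The helper names are made up for this example.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer Phred scores (default Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


scores = phred_scores("!5I")
print(scores)  # [0, 20, 40]
print([error_probability(q) for q in scores])
```

Q0 means the call carries no information, Q20 is a 1% error rate, and Q40 is 0.01%.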
Aligned data format
- BAM is the most common format for aligned sequencing data.
  - This is a binary version of a SAM file. SAM is text/human readable, BAM is not.
  - BAM files are compressed and indexable, optimized for rapid access to reads anywhere within the file.
  - You don’t have to read the whole file if you want to look for reads at a gene in the middle of chromosome 7, for example.
- BAM files are supported by most genomic viewers.
  - I suggest using IGV to visualize your BAM files.
IGV screenshot
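To make the SAM/BAM record layout concrete, here is a minimal sketch that parses the 11 mandatory tab-separated SAM fields and filters reads by start position. The example records are invented, the function names are hypothetical, and the region test deliberately ignores the CIGAR string (it only checks where each alignment starts). A plain-text scan like this has to read every line, which is exactly the work an indexed BAM file avoids; in practice a library or `samtools` would be used instead.

```python
from typing import Iterable, Iterator, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags (strand, pairing, ...)
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # mate reference name
    pnext: int   # mate position
    tlen: int    # template length
    seq: str     # read sequence
    qual: str    # ASCII base qualities


def parse_sam_line(line: str) -> SamRecord:
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def reads_starting_in(lines: Iterable[str], chrom: str, start: int, end: int) -> Iterator[SamRecord]:
    """Linear scan for alignments whose start lies in [start, end] on chrom."""
    for line in lines:
        if line.startswith("@"):  # header lines
            continue
        rec = parse_sam_line(line)
        if rec.rname == chrom and start <= rec.pos <= end:
            yield rec


# Invented example records.
sam_lines = [
    "@HD\tVN:1.6",
    "read1\t0\tchr7\t1000\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t16\tchr8\t1005\t60\t10M\t*\t0\t0\tTTTTACGTAC\tIIIIIIIIII",
]
hits = [r.qname for r in reads_starting_in(sam_lines, "chr7", 900, 1100)]
print(hits)  # ['read1']
```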
Variant Call Format (VCF)
- Mutations are typically reported in VCF format.
  - This is a tab-delimited text format (human readable).
  - Many programs interpret this format. VarSifter will crunch the data for you into a filterable format.
- One line per mutation location:
  - Position (chromosome, nt position), reference base identity, observed mutation identity, and quality data regarding that call for each sample in the VCF file.
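The one-line-per-location layout can be sketched as follows. This assumes the standard fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one column per sample); the record itself is invented, and `parse_vcf_line` is a hypothetical helper rather than part of any tool named in this deck.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Split one VCF data line into its fixed columns and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record: Dict[str, Any] = {
        "chrom": chrom,
        "pos": int(pos),                    # 1-based position
        "id": vid,
        "ref": ref,                         # reference base(s)
        "alt": alt.split(","),              # one or more observed alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info,
        "samples": [],
    }
    if len(fields) > 9:
        keys = fields[8].split(":")         # FORMAT column, e.g. GT:DP
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


# Invented example record with a single sample column.
example = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT:DP\t0/1:20"
variant = parse_vcf_line(example)
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"], variant["samples"])
```

Header lines (those starting with `#`) are not handled here and would need to be skipped before calling the function.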