High-throughput sequencing analyses

Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
Slides courtesy of: Oliver (HSPH Bioinformatics Core), Michelle (UMD IGS), Istvan (PSU Bioinformatics)

Biological samples, Sequence reads, Quality control, Mapping, Assembly
Metagenomes, Mapping, Assembly, Contact maps ...

1. Generation of DNA library (ChIP-seq/DNAseq = 1 day; DGE/small RNA = 3-4 days)
2. Amplification of library on flow cell; clonal cluster generation (~8 hours)
3. Data generation via sequencing-by-synthesis (48 hours for 36-cycle single read; 96 hours for paired-end)
4. Alignment to reference genome (~8 hours depending on references)
5. Data analysis using different bioinformatics tools (third-party software): peak calling, variant detection, annotation

Quality Control
A typical pipeline:
Duplicate removal
Frequency checks
  Reads: primers
  K-mers: barcodes
Quality scores and trimming
Length filtering
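The trimming and length-filtering steps can be sketched roughly as follows. This is a minimal illustration, assuming Phred+33 encoded quality strings (older Illumina pipelines used other offsets); the function names and the thresholds `min_q=20` and `min_len=30` are illustrative choices, not values from the slides.

```python
def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to integer Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, quality, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(quality, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_length_filter(seq, min_len=30):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len


# Hypothetical read: five high-quality bases followed by three low-quality ones.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
keep = passes_length_filter(trimmed_seq, min_len=4)
```

Tools such as FASTX-Toolkit implement these steps with more options; the sketch only shows the order of operations (trim first, then filter on the remaining length).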

What’s wrong with this picture?
PCR duplicates: bias during emulsion PCR
Optical duplicates: one cluster detected more than once
Bainbridge et al. Genome Biology 2010 11:R62
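One simple way to look at duplication is to count identical read sequences and tabulate how often each copy number occurs. This is a simplified sketch: tools that work on mapped reads usually identify duplicates by alignment coordinates rather than by raw sequence, and all names below are illustrative.

```python
from collections import Counter


def read_frequencies(reads):
    """Count how many times each exact read sequence occurs."""
    return Counter(reads)


def frequency_distribution(reads):
    """Map copy number -> number of distinct sequences seen that many times."""
    return Counter(read_frequencies(reads).values())


def collapse_duplicates(reads):
    """Keep the first occurrence of each sequence, preserving input order."""
    return list(dict.fromkeys(reads))


# Hypothetical reads: one sequence is present three times.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
distribution = frequency_distribution(reads)
unique_reads = collapse_duplicates(reads)
```

A distribution with a long tail of very high copy numbers is the kind of signal that would prompt a closer look at primers, adapters, or amplification bias.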

Read Frequency Distribution
QA: filtering

Read Frequency Distribution
VecBase Screen

> gnl|uv|NGB :1-219 pCR4-TOPO multiple cloning site
Length=219
Score = 100 bits (50), Expect = 9e-19
Identities = 50/50 (100%), Gaps = 0/50 (0%)
Strand=Plus/Plus

Query  1   ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC  50
           ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  43  ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC  92

QA: filtering

K-mer spectra for QC
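A k-mer spectrum is a histogram of k-mer multiplicities: how many distinct k-mers occur once, twice, and so on. A minimal sketch, with illustrative function names and a toy value of `k`:

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(kmer_counts(reads, k).values())


# Hypothetical overlapping reads; real data would use a larger k.
spectrum = kmer_spectrum(["ACGTAC", "CGTACG"], k=3)
```

In real data, k-mers seen only once or twice are often sequencing errors, while a peak at higher multiplicity reflects true coverage; barcodes or adapters tend to show up as a few k-mers with extreme counts.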

Error profiles
De-phasing
Crosstalk
Degradation

(1) A strong correlation of the A and C intensities, as well as of the G and T intensities, due to similar emission spectra of the fluorophores and limited separation by the filters used.
(2) Dependence of the signal of a specific cycle on the signal of the cycles before and after, called phasing and pre-phasing respectively. Phasing and pre-phasing are caused by incomplete removal of the 3' terminators and fluorophores, sequences in the cluster missing an incorporation cycle, as well as by the incorporation of nucleotides without effective 3' terminators.

The Illumina base caller (Bustard) uses a so-called crosstalk matrix, estimated from the first and second imaging cycle, to orthogonalize the correlated channels. Further, this matrix is used to scale the different intensities measured for each of the fluorophores. The estimation of the crosstalk matrix is based on the assumption that the four nucleotides are almost equally frequent in the library being sequenced. If the sample does not fulfill this assumption, this estimate can be inaccurate and lead to incorrect base calling.

Bustard estimates the phasing and pre-phasing as two channel-independent parameters from the increasing correlation of intensities in the first few cycles of the sequencing run. Using the crosstalk matrix and the two phasing parameters, it creates corrected intensity values and calls the base with the highest corrected intensity for each cluster and cycle. In the case of equal intensity values or small intensity differences, an N is called.
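The crosstalk-correction idea can be illustrated with a toy model in which the observed intensities are a crosstalk matrix applied to the true per-base signal. This is a sketch under stated assumptions, not Bustard's actual implementation: the matrix values, the `min_gap` threshold, and the function names are all hypothetical, and phasing correction is left out entirely.

```python
BASES = "ACGT"


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def call_base(observed, crosstalk, min_gap=0.1):
    """Undo crosstalk, then call the strongest channel, or N if it is not clearly ahead."""
    corrected = solve_linear(crosstalk, observed)
    ranked = sorted(zip(corrected, BASES), reverse=True)
    (best, base), (second, _) = ranked[0], ranked[1]
    if best - second <= min_gap * abs(best):
        return "N"
    return base


# Hypothetical crosstalk matrix: A/C channels bleed into each other, as do G/T.
CROSSTALK = [
    [1.0, 0.4, 0.0, 0.0],
    [0.3, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.2, 1.0],
]

clear_call = call_base([0.8, 2.0, 0.0, 0.0], CROSSTALK)      # pure C signal
ambiguous_call = call_base([1.4, 1.3, 0.0, 0.0], CROSSTALK)  # equal A and C signal
```

The second call shows why an N can appear: once the channels are orthogonalized, two bases have essentially the same corrected intensity.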

FastQC

FASTX-Toolkit

Alignment

SAM files
Sequence Alignment/Map format
A concise file format that contains information about how sequence reads map to a reference genome.
Can be further compressed into BAM, a binary form of SAM.
Can also be sorted and indexed to provide fast random access, using SAMtools (more on this in a minute).
Requires ~1 byte per input base to store sequences, qualities and meta information.
Supports paired-end reads and color space.
Originally produced by bowtie and bwa, now a de facto standard.
SAM/BAM can also be converted to pileup format for SNP calling.
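Each SAM alignment line is tab-separated, with eleven mandatory fields followed by optional tags. A minimal parsing sketch, using the column names of the early SAM specification (QNAME through QUAL); the example line is hypothetical, and only three of the FLAG bits are decoded.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE")


def parse_sam_line(line):
    """Split one alignment line into named mandatory fields plus optional tags."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, parts))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = parts[len(SAM_FIELDS):]
    return record


def decode_flag(flag):
    """Decode a few commonly used bits of the bitwise FLAG field."""
    return {
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
    }


# Hypothetical alignment line, not taken from the slides.
line = "read1\t16\tref\t7\t30\t8M2I4M1D3M\t*\t0\t0\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
flags = decode_flag(record["FLAG"])
```

For real work, SAMtools or a dedicated library is the safer route; the sketch is only meant to show what the columns contain.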

SAM/BAM files
BAM = Binary SAM
100% equivalent, just smaller and faster.
Never save files as SAMs unless necessary! Save as BAM and use SAMtools to manipulate.

SAM Format

coor
ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
r TTAGATAAAGGATA*CTG
r aaaAGATAA*GGATA
r gcctaAGATAA
r ATAGCT TCAGC
r ttagctTAGGC
r CAGCGCCAT

Extended CIGAR
The standard CIGAR description of pairwise alignment defines three operations: ‘M’ for match/mismatch, ‘I’ for insertion compared with the reference and ‘D’ for deletion. The extended CIGAR proposed in SAM added four more operations: ‘N’ for skipped bases on the reference, ‘S’ for soft clipping, ‘H’ for hard clipping and ‘P’ for padding. These support splicing, clipping, multi-part and padded alignments.

QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL
@SQ SN:ref LN:45
r ref M2I4M1D3M = TTAGATAAAGGATACTG *
r ref S6M1P1I4M * AAAAGATAAGGATA *
r ref H6M * AGATAA *
r ref M14N5M * ATAGCTTCAGC *
r ref H5M * TAGGC *
r ref M * CAGCGCCAT
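The seven extended CIGAR operations can be parsed with a short regular expression. The sketch below also computes how many reference and read bases an alignment covers: M, D and N consume the reference; M, I and S consume the read. The CIGAR string used is an illustrative one, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs, rejecting malformed input."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(ops):
    """Number of reference bases covered: M, D and N consume the reference."""
    return sum(length for length, op in ops if op in "MDN")


def query_length(ops):
    """Number of read bases in SEQ: M, I and S consume the read."""
    return sum(length for length, op in ops if op in "MIS")


# Illustrative CIGAR: 8 aligned, 2 inserted, 4 aligned, 1 deleted, 3 aligned.
ops = parse_cigar("8M2I4M1D3M")
span = reference_span(ops)
read_bases = query_length(ops)
```

Hard-clipped (H) and padded (P) positions consume neither the read nor the reference, which is why they are absent from both sums.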

Pileup
Standard format for mapped data, position summaries

Columns: Seq., Pos., Ref., Len., Alignment, Quality

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G T,,.,.,...,,,., ;+<<7=7<<7<&<<1;<<6<
seq1 277 T ,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G ,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,., ;75&<<<<<<<<<=<<<9<<:<<
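The alignment column of a pileup line can be summarized by walking through its characters: `.` and `,` are matches to the reference on the forward and reverse strand, letters are mismatches, `^` plus one mapping-quality character marks a read start, `$` marks a read end, and `+n`/`-n` introduce indels of length n. A rough sketch with an illustrative function name:

```python
import re
from collections import Counter


def summarize_pileup_bases(bases):
    """Count reference matches, mismatched bases, deletions and skips in a pileup column."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":            # read start: the next character is a mapping quality
            i += 2
        elif ch == "$":          # read end marker carries no base
            i += 1
        elif ch in "+-":         # indel: sign, length digits, then that many bases
            digits = re.match(r"\d+", bases[i + 1:])
            if digits is None:
                raise ValueError("indel marker without a length")
            i += 1 + len(digits.group()) + int(digits.group())
        else:
            if ch in ".,":
                counts["ref"] += 1
            elif ch == "*":
                counts["del"] += 1
            elif ch in "<>":
                counts["skip"] += 1
            else:
                counts[ch.upper()] += 1
            i += 1
    return counts


# Alignment column of the seq1 273 row shown above.
summary = summarize_pileup_bases(",.....,,.,.,...,,,.,..A")
```

For that row the counts add up to the reported depth of 23: 22 reads agree with the reference T and one read shows an A.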

http://saaientist. blogspot
Chimeras, rearrangements, the problem of reference genomes
Second level of QA
Mismatched paired-end reads

Search
Mapping: short reads, no (few) mismatches; extremely speedy!
BLAST: any sequences, configurable sensitivity; can accurately reach homology twilight zone; less speedy
Accelerated BLASTs: any sequences, heuristic sensitivity; speedier

Accelerated BLASTs
USEARCH
map/mapx
mblast/mblastx
All up to 1000s of times faster
Configurable options:
Is there a “good enough” hit, yes/no?
Just retrieve first N “good enough” hits
Trade sensitivity for specificity

Assembly Iverson et al. Science 3 February 2012: Vol. 335 no. 6068 pp

Assembly
No such thing as an automated assembly
Many alternatives: Velvet, Newbler, SOAPdenovo, ABySS, ALLPATHS…
Each has a fistful of tuning parameters

Annotation

Genescript: a canonical annotation pipeline
Hudek et al Bioinformatics (2003) 19 (9):

Genescript Hudek et al Bioinformatics (2003) 19 (9):

HMMs for gene calling and annotation

Center for Biological Sequence Analysis tools

Some annotation resources
End-to-end annotation pipelines: Genescript, Manatee, Apollo
HMMs: HMMER, Pfam/Rfam/TIGRfam, TMHMM/SignalP/etc.
ORF callers: FragGeneScan, (Meta)GeneMark

Variant calling
When do you believe differences with respect to a reference genome?
  More reads = more support
  Errors are more likely at read ends
  Some sequences can be error hotspots
  Sometimes the reference genome’s the variant!
When do you believe differences within your own sequences?
  Error hotspots
  Misalignment of near-repetitive regions
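The "more reads = more support" idea can be made concrete with a naive depth and allele-fraction check. This is only an illustration of the reasoning, not how SAMtools or VarScan decide: the thresholds `min_depth` and `min_fraction` and the function names are hypothetical, and base quality, read position and strand are ignored.

```python
def alt_allele_fraction(ref_count, alt_count):
    """Fraction of reads at a position that support the alternate allele."""
    depth = ref_count + alt_count
    return alt_count / depth if depth else 0.0


def believe_variant(ref_count, alt_count, min_depth=10, min_fraction=0.2):
    """Accept a difference only with enough total reads and enough alternate support."""
    depth = ref_count + alt_count
    if depth < min_depth:
        return False
    return alt_allele_fraction(ref_count, alt_count) >= min_fraction


# Hypothetical counts: a single discordant read among 23 is more likely an error.
single_read = believe_variant(ref_count=22, alt_count=1)
balanced = believe_variant(ref_count=5, alt_count=5)
```

A real caller would also downweight bases near read ends and in known error hotspots, which is exactly what the questions above are probing.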

Variant detection: SAMtools, VarScan
Tablet Second-Gen Visualizer (

Variant Call Format
##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA NA00002
rs G A NS=58;DP=258;AF=0.786;DB;H GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
T A q NS=55;DP=202;AF= GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
rs A G,T NS=55;DP=276;AF=0.421,0.579;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2
T NS=57;DP=257;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51
microsat1 G D4,IGA NS=55;DP=250;AA=G GT:GQ:DP 0/1:35: /2:17:2
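The INFO column is a semicolon-separated list of `key=value` pairs and bare flags, and each sample column is colon-separated in the order given by FORMAT. A small parsing sketch, using strings that appear in the example above; the function names are illustrative and values are left as strings.

```python
def parse_info(info):
    """Turn an INFO string into a dict; bare keys such as DB become True flags."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value.split(",") if "," in value else value
        else:
            fields[item] = True
    return fields


def parse_sample(format_keys, sample):
    """Pair the FORMAT keys with the colon-separated values of one sample column."""
    return dict(zip(format_keys.split(":"), sample.split(":")))


info = parse_info("NS=55;DP=276;AF=0.421,0.579;AA=T;DB")
genotype = parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51")
```

In the genotype field, `|` separates phased alleles and `/` unphased ones, with 0 denoting the reference allele and 1, 2, ... the alternates in order.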

ChIP-seq

RNA-seq Wang et al. Nat Rev Genet. 2009 January; 10(1): 57–63

Transcript discovery Further complicates the mapping story. Reference of all known (and putative/predicted!) transcripts

RNA-seq without a ref. genome
Contigs → Genes → Isoforms
Grabherr et al. Nature Biotechnology 29, 644–652 (2011)

Quantification
Sample comparison
RPKM: Reads Per Kilobase per Million reads
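RPKM normalizes a raw read count by both transcript length (in kilobases) and sequencing depth (in millions of mapped reads), so values can be compared across genes and samples. A minimal worked example; the function name and the numbers are illustrative.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count / (transcript_length_bp / 1e3) / (total_mapped_reads / 1e6)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    kilobases = transcript_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / kilobases / millions


# Hypothetical gene: 500 reads on a 2 kb transcript in a 10-million-read library.
value = rpkm(500, 2000, 10_000_000)
```

Doubling either the transcript length or the library size halves the RPKM for the same raw count, which is the point of the normalization.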

Creative uses
Contact maps
TFBS determination

The extended selection
222 apps
220 applications and counting

The future: making sense of a genome
