Next generation sequencing Xusheng Wang 4/29/2010
Outline Background Technologies Data analysis and Applications
Why sequence DNA?
De novo genome sequencing Homo sapiens Mus musculus Rattus norvegicus Pan troglodytes Macaca mulatta Drosophila melanogaster Danio rerio Takifugu rubripes Arabidopsis thaliana Oryza sativa Caenorhabditis elegans
Individual human genome sequencing
Interests to me…
[Figure: derivation of the BXD recombinant inbred (RI) strain set. Fully inbred C57BL/6J (B) and DBA/2J (D) parents (female, male) are crossed to give an isogenic F1 and a heterogeneous F2; 20 generations of brother-sister matings then yield the inbred, isogenic sibling strains BXD1, BXD2, … BXD80. A chromosome pair is shown at each stage.]
Recombined genomes are needed for mapping
Cancer genome sequencing
Map-and-count experiments RNA-seq ChIP-seq
History of DNA sequencing Messing & Llaca, PNAS (1998) Sanger Sequencer
Next generation sequencing technologies ABI SOLiD Illumina GA2 Roche 454
Single-molecule sequencing technologies Helicos Single Molecule Real Time (SMRT) DNA sequencing
Comparison of NGS platforms Michael L. Metzker Nature Reviews Genetics 11, (Jan 2010)
Roche (454) GS FLX sequencer
HiSeq 2000
Prepare library Illumina Genome Analyzer
Prepare clusters Illumina Genome Analyzer
Sequencing Illumina Genome Analyzer A T C G
Sequencing Illumina Genome Analyzer A
Sequencing Illumina Genome Analyzer A T C G A
Sequencing Illumina Genome Analyzer A T
Data from Illumina GA2
xwang TGGAAGAATATAGAGCCTGTCACAATCCTCCCTTTGAGCAGCATTAGTCTACAAAGGAAAAGAAAGTTCTCATGACTCTAGTGCCACCCTCACATACTTAC `_ab_ZaabbbaY`\_\a[[_a`aaa]`_aa\`aa[a\\aW]``a`VW\aa`aZ__Y]Z_aWZV_a][]][a`Y^X[\\[FKT``F\[^W`^TVZTVXODD
xwang GGGCACTCTTGTGCGGCAACGGCTGGGTGAGGACTCAACGGGGCCCCGTCCTGTCTAGCCTCGCCCTCGCTTGCGGGACCAGACCGGACACTGGCGAAGTA X\\V_aa_aaa_a_R[[aa^U^`aV^_HXT[NMYPU_\PU]VTRZU[PK`_HIV[GG]IHSDG\_XYGPDW_LFHIOJT`ROJGTDDTZPIGVKJIJDGMD
xwang CTTGCAGCAGATGTCTGGACTCCTCCAAAATACATGCCTAGGCGTCAACGCAGTTACCACCTGCTTTCCGCCAGTGATGCGTCCTCCTGGG.TCGCGTCTC VFHMMZOMZKMMZRGFFTHDZRFYMFHFDOTFJDDMEWKYHMMOHDDJZIDIGDDV]QIPNFGDODHHMFDDFDIFKDP_DHWDFDRHXHFDRGGGGHYDZ
xwang ATGCTGACCAATCCGGAAC.CTCGGCTAGAAAACGCCAGGGGTCGAGAGAAGAATAATCTACAATCCGAAACAGCCAGGGAGTGAAACAGTATACGTGTAT [MRD[WKIPGSMJDQUVRDDJJPSDDMGPPYY_GGMDRFNFDDMHHDJJHLPZMKDMOMJJDJDWIDRMNHIDHHDHLDNMDDRMMMDSKDDKHHNOFHDR 0
Field labels: Machine name; Lane; Tile; X; Y; Index; Read (1: single; 2: paired end); Filter (0: No; 1: Yes)
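The quality strings in these reads can be turned into per-base scores. The sketch below assumes the Phred+64 ASCII encoding used by Illumina pipelines of this period; the slide itself does not state the encoding, and the function names are illustrative only.

```python
def decode_quality(quality_string, offset=64):
    """Convert an ASCII quality string to integer quality scores.

    Assumption: scores are stored as chr(score + offset) with offset 64.
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def mean_quality(quality_string, offset=64):
    """Average quality score over a read (0.0 for an empty string)."""
    scores = decode_quality(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0
```

Under that assumption, the first six characters of the first read decode to 32, 31, 33, 34, 31, 26, and the trailing `D` characters decode to 4, which fits the usual drop in quality toward the end of a read.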
SOLiD system
SOLiD sequencing
Ligation-based sequencing
Decoding color space Raw error rate = ~3% Corrected error rate = ~0.1%
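As a minimal sketch of color-space decoding, assume the standard two-base code in which each color (0 to 3) is the XOR of the 2-bit codes of two neighboring bases (A=0, C=1, G=2, T=3). The leading letter of a .csfasta read is the last adapter base and is not part of the read. Function names here are illustrative.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of the base


def encode_colorspace(sequence, primer_base="T"):
    """Encode a base sequence as primer base plus one color per base."""
    colors = []
    previous = primer_base
    for base in sequence:
        colors.append(str(BASES.index(previous) ^ BASES.index(base)))
        previous = base
    return primer_base + "".join(colors)


def decode_colorspace(read):
    """Decode a color-space read (e.g. 'T0123') into bases.

    A '.' marks a missing color call; every base from that point on
    cannot be decoded and is reported as 'N'.
    """
    previous = read[0]
    decoded = []
    for i, color in enumerate(read[1:]):
        if color == ".":
            decoded.append("N" * (len(read) - 1 - i))
            break
        previous = BASES[BASES.index(previous) ^ int(color)]
        decoded.append(previous)
    return "".join(decoded)
```

Because each base depends on the one before it, a single wrong color changes every downstream base when a read is decoded on its own. This is one reason reads are usually compared with the reference in color space rather than decoded first.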
Single SNP detection
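In two-base encoding, a true single-base change alters two adjacent colors, while an isolated color mismatch is more likely a measurement error. The sketch below applies that rule to a read and a reference that are both already in color space (digits only, no primer base); the labels and function name are hypothetical, and a real caller would also weigh quality values and coverage.

```python
def classify_color_mismatches(ref_colors, read_colors):
    """Label color mismatches between a reference and an aligned read.

    Returns (position, label) tuples. Two adjacent mismatches are a
    candidate SNP only if the XOR of the two colors is unchanged, which
    means the bases on both sides of the changed base still agree.
    """
    mismatches = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors))
                  if r != c]
    calls = []
    i = 0
    while i < len(mismatches):
        pos = mismatches[i]
        adjacent = i + 1 < len(mismatches) and mismatches[i + 1] == pos + 1
        if adjacent:
            ref_xor = int(ref_colors[pos]) ^ int(ref_colors[pos + 1])
            read_xor = int(read_colors[pos]) ^ int(read_colors[pos + 1])
            label = "candidate_snp" if ref_xor == read_xor else "invalid_pair"
            calls.append((pos, label))
            i += 2
        else:
            calls.append((pos, "likely_error"))
            i += 1
    return calls
```

For example, the base sequences ACGTA and ACCTA (one substitution) give the colors 1313 and 1023, which differ at two adjacent positions with the same XOR, so the pair is reported as a candidate SNP.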
Data from SOLiD (.csfasta)
>4_27_99_F3
T
>4_27_1062_F3
T
>4_27_1570_F3
T
>4_28_935_F3
T
>4_28_1306_F3
T
>4_29_429_F3
T
>4_29_506_F3
T
>4_29_636_F3
T
>4_29_940_F3
T
>4_29_1957_F3
T
>4_31_522_F3
T
>4_31_1523_F3
T
Quality value (_QV.qual)
>4_27_99_F
>4_27_1062_F
>4_27_1570_F
>4_28_935_F
>4_28_1306_F
>4_29_429_F
Comparing Sequencers [Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp) for the Illumina and AB/SOLiD short-read sequencers, the 454 pyrosequencer, and the ABI capillary sequencer]
Comparing Sequencers
                          Roche (454)      Illumina GA             SOLiD
Chemistry                 Pyrosequencing   Reversible terminator   Ligation-based
Amplification             Emulsion PCR     Bridge amplification    Emulsion PCR
Paired ends / separation  Yes / 3 kb       Yes / 200 bp            Yes / 3 kb
Output per run            100 Mb           20 Gb                   100 Gb
Time per run              7 h              4 days / 8 days         7 days / 14 days
Read length               400 bp           bp                      50 bp
Data volume
Read mapping Read mapping is like doing a jigsaw puzzle… you get the pieces… and they give you the picture on the box. Problem is, some pieces are easier to place than others…
Alignment of reads Reads generated from sequencing are mapped to a reference genome. Conventional tools like BLAST or BLAT do not work well with short sequence reads. Existing alignment algorithms are modified to handle short reads.
Alignment Tools ELAND MAQ Bowtie SOAP BioScope/Corona Lite pipeline TopHat BFAST
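To give a feel for how a short-read mapper differs from a general-purpose aligner, here is a toy seed-and-verify sketch: index every k-mer of the reference, look up the first k bases of the read, and accept positions with few mismatches. It is an illustration under simplifying assumptions (exact seed, no indels, one strand), not the algorithm of any specific tool, and all names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) for each acceptable placement.

    The first k bases of the read act as an exact-match seed; the full
    read is then compared with the reference at each seed hit.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

A read with several equally good hits corresponds to one of the hard-to-place puzzle pieces: it cannot be assigned to a single location from its sequence alone.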
SAM/BAM format File format version; Sequence name; Sequence length; Read group
Sequence variations SNPs Insertion Deletion Medium Insertion
SNPs between C57BL/6J and DBA/2J 4,553,000 SOLiD Illumina
Evaluating SNP calls 5% adjacent SNPs 20% adjacent SNPs
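The header fields named on this slide correspond to SAM header records: `@HD` carries the format version (`VN`), `@SQ` the sequence name and length (`SN`, `LN`), and `@RG` the read group. A minimal parsing sketch follows; the function name and the example values in the usage note are illustrative, not taken from the talk's data.

```python
def parse_sam_header(lines):
    """Collect SAM header records into {record_type: [ {tag: value}, ... ]}.

    Header lines start with '@' and hold tab-separated TAG:VALUE fields.
    '@CO' comment lines are free text and are stored as plain strings.
    Parsing stops at the first non-header (alignment) line.
    """
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0][1:]
        if record_type == "CO":
            header.setdefault("CO", []).append("\t".join(fields[1:]))
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(record_type, []).append(tags)
    return header
```

For instance, passing the lines `@HD\tVN:1.0`, `@SQ\tSN:chr1\tLN:1000` and `@RG\tID:lane1\tSM:sample1` yields one entry each under `HD`, `SQ` and `RG`, with `LN` kept as the string `"1000"`.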
Medium indels detection (base - 1 million bases)
D 321 ChrID Supports:
TAAGAATGAGTTGGCAAATAAAGAGTTTGGTGAGTTTATAGAAATATAGGggccg ataggACAAGGTACAAGGAATGGCTGAAGGAGAGAGGTTG
GAGTTTATAGAAATATAGG ACAAGGTACAAGGAATG
GTGAGTTTATAGAAATATAGG ACAAGGTACAAGGAA
GAGTTTATAGAAATATAGG ACAAGGTACAAGGAATG
TGGTGAGTTTATAGAAATATAGG ACAAGGTACAAGG
GAGTTTATAGAAATATAGG ACAAGGTACAAGGAATG
AGTTTGGTGAGTTTATAGAAATATAGG ACAAGGTACAAGGA
GTGAGTTTATAGAAATATAGG ACAAGGTACAAGGAA
AGTTTATAGAAATATAGG ACAAGGTACAAGGAATGG
TTTGGTGAGTTTATAGAAATATAGG ACAAGGTACAA
TGAGTTTATAGAAATATAGG ACAAGGTACAAGGAATG
TGAGTTTATAGAAATATAGG ACAAGGTACAAGGAAT
GTTTATAGAAATATAGG ACAAGGTACAAGGAATGGC
GAGTTTATAGAAATATAGG ACAAGGTACAAGGAATG
AGTTTATAGAAATATAGG ACAAGGTACAAGGAATGG
Large indels detection K. Chen et al., Nature Methods 6: (2009) Concordance Insertion Deletion Clone insert size
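The paired-end idea on this slide can be sketched as a comparison between the distance at which a read pair maps on the reference and the expected clone insert size. The cutoff of three standard deviations and all names below are assumptions for illustration; this is not the implementation from the cited paper.

```python
def classify_pair(mapped_span, mean_insert, sd_insert, n_sd=3.0):
    """Classify a read pair by its mapped span on the reference.

    A span much larger than the library insert size suggests sequence
    present in the reference but missing from the sample (deletion);
    a much smaller span suggests extra sequence in the sample (insertion).
    """
    upper = mean_insert + n_sd * sd_insert
    lower = mean_insert - n_sd * sd_insert
    if mapped_span > upper:
        return "deletion"
    if mapped_span < lower:
        return "insertion"
    return "concordant"


def summarize_pairs(spans, mean_insert, sd_insert, n_sd=3.0):
    """Count how many pairs fall into each class."""
    summary = {"concordant": 0, "insertion": 0, "deletion": 0}
    for span in spans:
        summary[classify_pair(span, mean_insert, sd_insert, n_sd)] += 1
    return summary
```

In practice a call would need several discordant pairs supporting the same breakpoints, since a single outlier is often just a chimeric clone or a mismapped read.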
InDels between the C57BL/6J and DBA/2J
Inversion detected by paired-end data Total Inversions Span exon(s) or gene(s) Introns Intergenic
Copy Number Variations (CNVs) Total CNVs: 21,739; Gains: 7,182; Losses: 14,557. Graubert, et al., PLoS Genetics; Anderson, et al., Genes & Immunity. Several gene members of the Klra family were deleted in DBA/2J
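One way to sketch inversion detection from paired-end data is to look at read orientation. The example assumes a library in which concordant pairs map to opposite strands, so pairs with both reads on the same strand are candidates for spanning an inversion breakpoint. The tuple layout, window size, and support threshold are hypothetical choices.

```python
def inversion_candidates(pairs, window=1000, min_support=2):
    """Cluster same-strand read pairs into candidate inversion regions.

    pairs: iterable of (pos1, strand1, pos2, strand2) on one chromosome.
    Returns (left, right, supporting_pairs) for each cluster of pairs
    whose left ends lie within `window` bases of the previous pair.
    """
    same_strand = sorted(
        (min(p1, p2), max(p1, p2))
        for p1, s1, p2, s2 in pairs
        if s1 == s2
    )
    clusters = []
    for left, right in same_strand:
        if clusters and left - clusters[-1][-1][0] <= window:
            clusters[-1].append((left, right))
        else:
            clusters.append([(left, right)])
    return [
        (cluster[0][0], max(right for _, right in cluster), len(cluster))
        for cluster in clusters
        if len(cluster) >= min_support
    ]
```

Some libraries, such as certain mate-pair protocols, place concordant reads on the same strand, in which case the orientation test has to be flipped.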
De novo assembly ABySS ALLPATHS Euler-SR SHRAP SSAKE Velvet SOAP
Variation viewed at a genome scale
RNA sequencing
RNA sequencing analysis pipeline: .csfasta → Filtering (ribosomal RNA, tRNA, TTTT / AAAA, adapters) → Alignments (reference sequences) → Merging and sorting → Counting reads → Novel transcripts
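The counting step of such a pipeline can be sketched as assigning each aligned read to the annotated gene that contains its start position. The data layout and names below are assumptions for illustration; production pipelines work from sorted alignment files and handle strand, exons, and multi-mapped reads.

```python
def count_reads_per_gene(read_positions, genes):
    """Count reads per gene from alignment start positions.

    read_positions: iterable of (chromosome, position).
    genes: {gene_name: (chromosome, start, end)} with half-open intervals.
    Reads that hit no gene, or more than one gene, are left unassigned.
    """
    counts = {name: 0 for name in genes}
    unassigned = 0
    for chrom, pos in read_positions:
        hits = [
            name
            for name, (gene_chrom, start, end) in genes.items()
            if gene_chrom == chrom and start <= pos < end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            unassigned += 1
    return counts, unassigned
```

Reads left unassigned because they fall outside every annotated gene are the raw material for the novel-transcript step at the end of the pipeline.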
Alignment methods Transcriptome reads that cross splice junctions Anchor Extend method
Alternative splicing Novel Transcribed Region (NTR) Definition: a segment of genomic sequence that is transcribed but is not currently annotated as an exon in a database
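Following the definition above, a simple way to look for NTR candidates is to discard reads that overlap annotated exons and merge the remaining reads into contiguous regions. The sketch assumes reads and exons on a single chromosome given as half-open (start, end) intervals; the minimum-read threshold and the function name are hypothetical.

```python
def find_ntr_candidates(reads, exons, min_reads=3):
    """Merge unannotated reads into candidate novel transcribed regions.

    Returns (start, end, read_count) for each merged region supported
    by at least `min_reads` reads that overlap no annotated exon.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    unannotated = sorted(
        read for read in reads
        if not any(overlaps(read, exon) for exon in exons)
    )
    regions = []
    for start, end in unannotated:
        if regions and start <= regions[-1][1]:
            # Read overlaps or touches the current region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
            regions[-1][2] += 1
        else:
            regions.append([start, end, 1])
    return [(s, e, n) for s, e, n in regions if n >= min_reads]
```

The linear scan over exons is fine for a sketch; a genome-scale version would use an interval index instead.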
Our RNA-seq data on the UCSC Genome Browser bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Williamslab&hgS_otherUserSessionName=eye_RNAseq
Finding the new SNP data
Finding the indel data
Using the new sequence data