Topics
- DNA sequencing in the last century
- Current technologies (Illumina, Ion Torrent)
- New developments (PacBio, Nanopore)
Sanger sequencing
- Developed by Fred Sanger in the 1970s (two-time Nobel laureate: 1958 – protein structure of insulin, 1980 – sequencing of nucleic acids)
- Sequencing by synthesis: DNA polymerase synthesizes a complementary strand by adding single nucleotides
  growing strand: TTGCACTGAGTCG
  template:       AACGTGACTCAGCATAGGCTCAGATAGAT
- Random incorporation of blocked nucleotides (ddNTPs) at any position; the reaction stops in a small fraction of the strands
  growing strand: TTGCACTTGAGTCGT
  template:       AACGTGAACTCAGCATAGGCTCAGATAGAT
- A-Reaction: add dATP (elongation) and ddATP (block)
- Analogous: C-, G-, T-Reaction
Sanger sequencing: ddNTP reactions
  growing strand: TTGCACTTGAGTCG
  template:       AACGTGAACTCAGCATAGGCTCAGATAGAT
- A-Reaction: TTGCA, TTGCACTTGA
- C-Reaction: TTGC, TTGCAC, TTGCACTTGAGTC
- G-Reaction: TTG, TTGCACTTG, TTGCACTTGAG, TTGCACTTGAGTCG
- T-Reaction: TT, TTGCACT, TTGCACTT, TTGCACTTGAGT
- The ladder of DNA fragments is separated by electrophoresis (lanes T, G, C, A) and the sequence is read from the fragment lengths
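The ladder logic can be illustrated with a minimal sketch. It assumes an idealised experiment in which every possible terminated fragment is present in its reaction; the function names `sanger_fragments` and `read_gel` are made up for this illustration and do not come from any library.

```python
def sanger_fragments(synthesized):
    """Return, per ddNTP reaction, every prefix of the strand ending in that base."""
    reactions = {base: [] for base in "ACGT"}
    for end in range(1, len(synthesized) + 1):
        fragment = synthesized[:end]
        # A fragment ending in base X can only come from the X-reaction (ddXTP block).
        reactions[fragment[-1]].append(fragment)
    return reactions


def read_gel(reactions):
    """Order all fragments by length (electrophoresis) and read the terminating bases."""
    bands = [(len(frag), base) for base, frags in reactions.items() for frag in frags]
    return "".join(base for _, base in sorted(bands))


strand = "TTGCACTTGAGTCG"          # growing strand from the slide
ladder = sanger_fragments(strand)
print(ladder["A"])                  # fragments terminated by ddATP
print(read_gel(ladder))             # sequence recovered from the ladder
```

Because every fragment length occurs exactly once across the four lanes, sorting by length is enough to recover the sequence.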
Sanger sequencing: labeled ddNTPs, capillary sequencing
  growing strand: GATTGATAGTTGC
  template:       CTAACTATCAACGTATAGGCTCAGATAGAT
- Fragments: G, GA, GAT, GATT, GATTG, GATTGA, GATTGAT, GATTGATA, GATTGATAG, GATTGATAGT, GATTGATAGTT, GATTGATAGTTG, GATTGATAGTTGC
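454 technology: Pyrosequencing
- immobilize DNA on beads, pyrosequencing in microreactors
- incorporation of a nucleotide (e.g. dTTP) releases PPi, which is converted to ATP; luciferase then produces oxyluciferin + light
  growing strand: TTGCACTGAGTCGT
  template:       AACGTGACTCAGCATAGGCTCAGATAGAT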
454 technology: flowgram
- DNA-loaded beads + primer + polymerase + sulfurylase + luciferase
  growing strand: TTGCACTGAGTCGT
  template:       AACGTGACTCAGCAAGTCTATTCACCCAC
- Problem: homopolymers difficult to detect
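A small sketch of how a flowgram arises may help here. It assumes a cyclic flow order T, A, C, G and an idealised, noise-free light signal equal to the number of bases incorporated per flow; both are assumptions made for illustration, and the names `flowgram` and `call_bases` are hypothetical.

```python
def flowgram(synthesized, flow_order="TACG"):
    """Return (nucleotide, signal) per flow; signal = bases incorporated in that flow."""
    unknown = set(synthesized) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # The polymerase adds every consecutive matching base within a single flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Translate idealised signals back into a sequence."""
    return "".join(base * run for base, run in signals)


signals = flowgram("TTGCACTGAGTCGT")   # growing strand from the slide
print(signals[:4])                      # the first flow (T) gives a double-height signal
print(call_bases(signals))
```

The homopolymer problem is visible in the data structure: a run of n identical bases is reported as one signal of height n, so with a noisy real signal a run of 7 is hard to tell from a run of 8.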
Parallelisation & Miniaturisation
Increase throughput:
- DNA gel electrophoresis: single genes in a few days
- capillary electrophoresis: 96 capillaries per machine, human genome in a few years
- sequencing on microbeads: 454 technology
Illumina technology
Illumina sequencing:
- sequencing by synthesis
- massive parallelisation and miniaturisation by self-organising DNA microarrays on a glass surface
- several hundred Gb, >10^9 reads per run
Illumina technology
- generate libraries
- grow clusters on a flowcell
- sequence by addition and imaging of blocked & fluorescence-labeled nucleotides
Illumina technology: library preparation
- DNA fragments
- Blunting by fill-in and exonuclease
- Phosphorylation
- Addition of A-overhang
- Ligation to adapters
Illumina technology: cluster generation
1. flowcell
2. hybridize template
3. immobilize template
4. bridge amplification
5. linearisation
6. cleave reverse strand
7. block 3'-ends
8. hybridize primer
Illumina technology: Imaging & Sequencing
- Nucleotide + fluorescent dye + terminator
- reversible terminators
- fluorescently labelled clusters
Sequencing Applications
What can we do with short reads?
- RNA-seq: identify transcripts, count the number of reads per transcript
- assessment of differential expression
- Problem: reads are too short to establish the connectivity of all exons, so it is difficult or impossible to quantify multiple isoforms of a gene
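Counting reads per transcript and comparing two samples can be sketched as follows. This is a toy illustration, not a differential-expression method: it assumes each read has already been assigned to exactly one transcript, normalises to counts per million (CPM), and reports a log2 fold change with a pseudocount. The transcript names and helper functions are invented for the example; a real analysis needs replicates and a statistical model.

```python
import math
from collections import Counter


def counts_per_million(assignments):
    """Count reads per transcript and scale to counts per million mapped reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {transcript: n * 1e6 / total for transcript, n in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B/A) per transcript; the pseudocount avoids division by zero."""
    transcripts = set(cpm_a) | set(cpm_b)
    return {
        t: math.log2((cpm_b.get(t, 0.0) + pseudocount) / (cpm_a.get(t, 0.0) + pseudocount))
        for t in transcripts
    }


# Hypothetical read-to-transcript assignments for two samples.
control = ["geneA"] * 50 + ["geneB"] * 50
treated = ["geneA"] * 25 + ["geneB"] * 75

lfc = log2_fold_change(counts_per_million(control), counts_per_million(treated))
for transcript in sorted(lfc):
    print(transcript, round(lfc[transcript], 2))
```

Counting at the transcript level presupposes that each read can be assigned to one isoform, which is exactly what short reads often cannot guarantee.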
Improvements: Paired-end reads
- Single end: ambiguous mapping
- Paired-end sequencing: read each fragment from both ends -> resolve ambiguities
Stefan Krebs
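How the second read resolves an ambiguous mapping can be shown with a minimal sketch. It assumes that an aligner has already produced candidate positions for each read and that the library has a known insert-size range (the 200 to 500 bp default is an arbitrary assumption). Strand orientation is ignored for brevity, and `resolve_pair` is a hypothetical helper, not part of any aligner.

```python
def resolve_pair(hits_read1, hits_read2, min_insert=200, max_insert=500):
    """Keep only hit combinations on the same chromosome at a plausible distance."""
    consistent = []
    for chrom1, pos1 in hits_read1:
        for chrom2, pos2 in hits_read2:
            if chrom1 == chrom2 and min_insert <= abs(pos2 - pos1) <= max_insert:
                consistent.append(((chrom1, pos1), (chrom2, pos2)))
    return consistent


# Read 1 falls into a repeat and maps to three places; read 2 maps uniquely.
read1_hits = [("chr1", 1000), ("chr1", 50000), ("chr2", 7000)]
read2_hits = [("chr1", 50300)]

print(resolve_pair(read1_hits, read2_hits))
```

Only the candidate at chr1:50000 lies within the expected distance of the unique mate, so the ambiguity of read 1 is resolved.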
Improvements: Circularization
Further improvements:
- long jumping mate-pair libraries: circularize large fragments (2-10 kb) and read across the junctions
- resolve large repeats in genome assembly
Third generation Sequencing
Pacific Biosciences
- single molecule detection
- several kilobases read length
- moderate output ( wells)
- expensive instrument and high cost per base
Read length distribution
Pacific Biosciences
Applications
Everything that can be converted to a DNA strand can be sequenced; even long-term data storage by encoding in synthetic DNA is possible.
- BIOLOGICAL APPLICATIONS: sequencing of genomes, transcriptomes, population diversity, composition of microbial communities, ChIP-seq, methyl-Seq, translating RNA from ribosomes, ...
- MEDICAL APPLICATIONS: whole genome sequencing, exome sequencing, tumor diagnostics, sequencing of T-cell receptor diversity, identification of pathogens, ...
- FORENSICS, FOOD SAFETY, ARCHEOLOGY, ...
Chromatin Immunoprecipitation (ChIP)
Motivation: Regulation of gene expression
[Figure: DNA, mRNA, protein, Pol II, 3’UTR. Transcriptional: Activation, Repression. Post-transcriptional: Translation, Localization, Stability.]
Motivation: Regulation of gene expression
- At which loci does a protein bind the DNA?
- Are there cell-type or environment-specific variations of binding affinity?
- Which histone modifications determine chromatin structure?
- To which motifs does a transcription factor bind?
- What is the “cis-regulatory code” of a gene?
[Figure: DNA with Enhancer and Promoter; Activation, Repression]
Chromatin Immunoprecipitation (ChIP)
[Figure: DNA-binding protein of interest, Antibody, Sequencing]
Chromatin Immunoprecipitation (ChIP)
[Figure: Control: input DNA, Sequencing]
Chromatin Immunoprecipitation (ChIP): ChIP-Seq Analysis Workflow
- Alignment: ELAND, Bowtie, SOAP, SeqMap, …
- Peak Detection: SISSRs, QuEST, MACS, CisGenome, …
- Annotation: STAN, chromHMM, …
- Visualization: IGV, Ensembl GB, UCSC GB, …
- Motif Analysis: cERMIT, HMMer, XXmotif, …
Chromatin Immunoprecipitation (ChIP)
[Figure: Protein of interest on the DNA (ACCAATAATCAGCTAAGCCGTTAGCCACAGATGGAA); crosslink site; Sonication]
Read Alignment
Read Alignment
Expected read count = total number of reads * extended fragment length / chr length
[Figure: read count along the genome compared with the expected read count]
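The expected read count formula can be turned into a short worked example. The formula itself is taken from the slide; using a Poisson background model to judge how surprising an observed count is goes beyond the slide and is an assumption made for this sketch, as are the numbers and the function names.

```python
import math


def expected_read_count(total_reads, fragment_length, chromosome_length):
    """Expected number of extended fragments covering a position under uniform coverage."""
    return total_reads * fragment_length / chromosome_length


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected); assumed background model."""
    term = math.exp(-expected)      # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term                 # accumulate P(X = i)
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


# Hypothetical numbers: 1 million reads, 200 bp extended fragments, 100 Mb chromosome.
lam = expected_read_count(1_000_000, 200, 100_000_000)
print(lam)                          # expected coverage per position
print(poisson_tail(10, lam))        # chance of seeing 10 or more reads by accident
```

A position covered far more often than expected is a candidate binding site; comparing against the input DNA control is still needed, because real background coverage is not uniform.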
Read Alignment
Read direction provides extra information
Hongkai Ji et al. Nature Biotechnology 26:
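One way read direction helps: forward-strand reads start on one side of a binding site and reverse-strand reads on the other, so the offset between the two pile-ups indicates the fragment length and the site lies between them. The sketch below estimates that offset by brute force. It is an illustration under simplified assumptions (exact start positions, a single site), not the method of the cited paper, and `estimate_strand_shift` is a hypothetical name.

```python
from collections import Counter


def estimate_strand_shift(forward_starts, reverse_starts, max_shift=300):
    """Return the shift that best aligns forward-strand starts with reverse-strand starts."""
    reverse_counts = Counter(reverse_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Number of forward reads that land on a reverse read after shifting.
        score = sum(reverse_counts[pos + shift] for pos in forward_starts)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Hypothetical 5' start positions around one binding site.
forward = [900, 905, 910]
reverse = [1100, 1105, 1110]

shift = estimate_strand_shift(forward, reverse)
site = sum(forward) / len(forward) + shift / 2
print(shift, site)
```

Shifting each strand by half the estimated offset towards the other moves both pile-ups onto the putative binding site, which sharpens the peak.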
The ENCODE Project
Goal: Define all functional elements in the human genome
How:
- Lots of groups
- Lots of assays
- Lots of cell lines
- Lots of communication/consortium analysis
- Standardization of methods, reagents, analysis
- Genome-wide
- A lot of money
The ENCODE Project
- 2 Tier 1 cell lines: GM12878 (B cell), K562 (CML cells)
- 5 Tier 2 cells: HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC
- Many Tier 3 cells
- RNA profiling (Scott Tenenbaum): Inter-cell line differences are greater than inter-lab differences
Lots of data and data types generated by The ENCODE Project
- RNA-seq
- RNA-array
- TF ChIP-seq
- Histone modif ChIP-seq
- DNase-seq
- Bisulfite-seq
- 1M SNP genotyping
Integrative Data Analysis
[Figure: Open Chromatin, Trans. Factor ChIP-seq, Histone Mod. ChIP-seq, RNA, …; Std. Peaks, Region calls, Active regions, …; Dynamic Bayesian Networks, HMM segmentation, PCA analysis; Biological interpretation]
Integrative Data Analysis: 25-state HMM
- Input: 12 Histone modifications, 2 Transcription factors; GM12878, K562
- “Standard” EM Training
- Posterior Probability Decoding
- Viterbi Path along the genome (State F, State I, State A, State C, State E)
- Data: Entire ENCODE Consortium
- Analysis: Jason Ernst/Manolis Kellis
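Viterbi decoding of an HMM segmentation can be illustrated on a toy model. The sketch below uses two hypothetical states ("active", "quiet") and a single binary observation per genomic bin, with hand-picked probabilities instead of EM-trained ones; it is not the 25-state model from the slide, only the same decoding principle at small scale.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log-space Viterbi)."""
    if not observations:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log(trans_p[p][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model: all names and numbers are assumptions for illustration.
states = ["active", "quiet"]
start_p = {"active": 0.5, "quiet": 0.5}
trans_p = {"active": {"active": 0.9, "quiet": 0.1},
           "quiet": {"active": 0.1, "quiet": 0.9}}
emit_p = {"active": {"mark": 0.8, "none": 0.2},
          "quiet": {"mark": 0.1, "none": 0.9}}

bins = ["none", "none", "mark", "mark", "mark", "none", "none"]
print(viterbi(bins, states, start_p, trans_p, emit_p))
```

The decoded path segments the bins into a quiet, an active, and again a quiet stretch; posterior decoding would instead report a per-bin state probability.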