BF nd (Next) Generation Sequencing

Slides:



Advertisements
Similar presentations
The Past, Present, and Future of DNA Sequencing
Advertisements

An Introduction to Studying Expression Data Through RNA-seq
IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis.
V Improvements to 3kb Long Insert Size Paired-End Library Preparation Naomi Park, Lesley Shirley, Michael Quail, Harold Swerdlow Wellcome Trust Sanger.
Next–generation DNA sequencing technologies – theory & practice
Processing of miRNA samples and primary data analysis
SOLiD Sequencing & Data
Next-generation sequencing
High Throughput Sequencing
Next generation sequencing Why? What? How? Marcel Dinger Developmental Biology Divisional Seminar 7 October 2010.
CS 6293 Advanced Topics: Current Bioinformatics
Next Generation DNA Sequencing Platforms: Evolving Tools for
NGS Data Generation Dr Laura Emery. Overview The NGS data explosion Sequencing technologies An example of a sequencing workflow Bioinformatics challenges.
Update on Next-Generation Sequencing
Molecular Biology Dr. Chaim Wachtel April 4, 2013.
Sequencing Data Quality Saulo Aflitos. Read (≈100bp) Contig (≈2Kbp) Scaffold (≈ 2Mbp) Pseudo Molecule (Super Scaffold) Paired-End Mate-Pair LowComplexityRegion.
Sequencing Technologies and Applications at JGI
Introduction to next generation sequencing Rolf Sommer Kaas.
MES Genome Informatics I - Lecture IV. NGS basics Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute, Yonsei University.
Ion Torrent and Minion Relatively low cost ‘next generation’ sequencing Wendy Smith School of Computing Science, Alan Ward Newcastle University, UK.
Next Generation DNA Sequencing
Chromatin Immunoprecipitation DNA Sequencing (ChIP-seq)
Quick introduction to genomic file types Preliminary quality control (lab)
De Novo Genome Assembly - Introduction Henrik Lantz - BILS/SciLife/Uppsala University.
Stratton Nature 45: 719, 2009 Evolution of DNA sequencing technologies to present day DNA SEQUENCING & ASSEMBLY.
How will new sequencing technologies enable the HMP? Elaine Mardis, Ph.D. Associate Professor of Genetics Co-Director, Genome Sequencing Center Washington.
HaloPlexHS Get to Know Your DNA. Every Single Fragment.
Molecular Biology Dr. Chaim Wachtel May 28, 2015.
Genomics.
Genomics Core Facility at UNH: High-Throughput Sequencing on the Illumina HiSeq 2500 Platform Project Consultation Sample Submission Library Creation Illumina.
De Novo Genome Assembly - Introduction
Third Generation Sequencing. Today Illumina – Solexa sequencing technology 454 Life sciences – 454 sequencer Applied Biosystem – SOLiD system Tomorrow.
Canadian Bioinformatics Workshops
Library QA & QC Day 1, Video 3
Statistical Genomics Zhiwu Zhang Washington State University Lecture 6: Genotype.
Introduction to Illumina Sequencing
An Overview of Applications for the MiSeq and HiSeq 2500 April 4, 2016 Kevin Shianna, Ph.D. Sequencing Specialist - Illumina, Inc. MGC USERS GROUP.
Next-generation sequencing technology
Research Techniques Made Simple: Next-Generation Sequencing:
DNA Sequencing Second generation techniques
Short Read Sequencing Analysis Workshop
Next generation sequencing
The NGS Era is Now Eric T. Weimer, PhD, D(ABMLI)
Lecture 6: Genotype by sequencing
Sequencing technologies
DNA Sequencing -sayed Mohammad Amin Nourion -A’Kia Buford
Next-generation sequencing technology
Introduction to RAD Acropora millepora.
Sequencing technology and assembly
The FASTQ format and quality control
Next Generation Sequencing
Sequencing Technologies
Lecture 6: Genotype by sequencing
Small RNA Sample Preparation
CHAPTER 12 DNA Technology and the Human Genome
2nd (Next) Generation Sequencing
The characterisation of mtDNA deletions using long-read sequencing
Quality control for Sequencing Experiments
ULTRASEQUENCING. Next Generation Sequencing: methods and applications.
Molecular Diagnosis of Autosomal Dominant Polycystic Kidney Disease Using Next- Generation Sequencing  Adrian Y. Tan, Alber Michaeel, Genyan Liu, Olivier.
High-Throughput Sequencing Technologies
High-Throughput Sequencing Technologies
Canadian Bioinformatics Workshops
(Top) Construction of synthetic long read clouds with 10× Genomics technology. (Top) Construction of synthetic long read clouds with 10× Genomics technology.
Standard (Sanger) sequencing
BF528 - Sequence Analysis Fundamentals
Genomic DNA Sample Preparation
Toward Accurate and Quantitative Comparative Metagenomics
Global Next Generation Sequencing (NGS) Market (By Products - Consumables, Platforms, Services, Sequencing Services, Bioinformatics, Technology, Applications, End Users, Regions), Key Company Profiles - Forecast to 2025
Presentation transcript:

BF528 - 2nd (Next) Generation Sequencing

Introduction Methods for obtaining nucleotide sequence information: Sanger sequencing (i.e. 1st generation) sensitive, slow, low-throughput, expensive Microarray (not sequencing technology) cheap, high-throughput, fast, known sequences only We want to obtain information on many sequences quickly, cheaply, with high confidence, and without knowing those sequences beforehand

High-throughput Sequencing High-throughput: many nucleotides very quickly Sequencing: determine the linear sequence of nucleotides Use biochemical or biophysical approaches Current technologies vary by: Number of individual sequences generated Length of sequences Confidence in sequences

Evolution of NGS Surya Saha, Boyce Thompson Institute, Ithaca, NY (BTI plant bioinformatics course)

Sequencing machines Expensive to purchase (hundreds of thousands $USD) Expensive to operate (e.g. reagents, flow cells) You can sequence your genome at 30X depth for <$1000 USD. Roche 454 Ion Torrent Illumina HiSeq 2500 Illumina NovaSeq 6000

Experiment Alex Sanchez, Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona

Comparison Quail, Michael A., et al. "A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers." BMC genomics13.1 (2012): 341.

Illumina Sequencing Most common sequencing technology today Sequences any DNA Sequencing by synthesis method Sequences (reads) are short (<300bp) 2 gigabases - 6 terabases per run Hours to days to complete one run

Illumina Sequencing Process https://www.youtube.com/watch?v=fCd6B5HRaZ8

Illumina Sequencing Properties Sequencing occurs on a flow cell Each flow cell has 1 to 8 lanes Number of reads for overall flow cell varies Length of reads is fixed (e.g. 250 bp) Read format: Single end - one read per molecule Paired end - two reads per molecule Multiplexing: sequence many samples at once using molecular barcodes NovaSeq Flowcell Courtesy of Illumina, Inc.

Sequencing Library Generation Workflow Sequencing Library: DNA prepared for sequencing Workflow: Extract RNA/DNA from sample If RNA, reverse transcribe to cDNA Size select using gel cut or random shearing PCR amplify DNA if concentration is low Add sequencing adapters If multiplexing, use barcoded adapters Pool samples, load across flow lanes for sequencing Typically only perform 1, sequencing cores do the rest

Design Choice: Fragment Length Illumina sequencers can only sequence DNA fragments up to ~300nt long DNA must be size-selected, usually by gel cut ~200-300nt band cut, purified, prepared for sequencing Fragment length follows a normal distribution around target cut size

Design Choice: Single End vs Paired End

Design Choice: Number of Reads Each sequencing run generates a # of total reads # of reads/sample ~= # total reads/number of samples # of reads for one sample: library size Choose target library size based on: Desired depth Desired coverage For more see https://genohub.com/recommended-sequencing-coverage-by-application/

Critical Concept: Read Mapping Question: “Given a read and a reference sequence, where, if anywhere, in the reference does the read sequence occur?” E.g. chr3:2,358,092-2,358,193 More on this next lecture

Mapped Read Terminology Genome Locus Depth: number of sequenced bases that map to a given location Mapped or Aligned reads Coverage: fraction of genomic locus covered by at least one read

Coverage - Whole Genome Sequenceing Good coverage Bad coverage

Dark matter example chr22:11M-12M RepeatMasker Gap

Sequence Data Format: fastq Sequence reads provided in fastq format For each read there are 4 lines: @read_header comment read_sequence +[quality_header] phred_quality_scores Phred scores estimate the probability that a base is called incorrectly

fastq format start new read @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format unique read header @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format comments separated by space, could be anything @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format Sequence of the read @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format start quality line @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format repeat read header and comment, not required, often blank @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format Quality sequence of the read, in ASCII @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF######

fastq format Next read @SRR1997412.1 1 length=125 NTTGTAGCTGAGGAAACTGAGGCTCAGGAGGACAAGTGGCCTGCCAAAGGTACCAGCACTCAGATGGAATGGTTTTGAACTCAGTCCA +SRR1997412.1 1 length=125 #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @SRR1997412.2 2 length=125 GTATTTAGTCATGTAAGACTCCTTAACCAGCTAACTTAAGAAAGACTTCTAGGACAGAATAGGTTACACTAGTTATAATTTTNNNNNN +SRR1997412.2 2 length=125 BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFBBFBBBFF###### Next read

Sequence Quality Score - Phred Each read base has a corresponding quality score Score indicates probability of base being wrong Scores quantized to integers, e.g. [-3,41] e.g. e = 0.0123, Q = 19.1 = 19 Q then mapped to ASCII space by adding offset e.g. 19+64 = 83, 83 == ‘S’ in ASCII For more info: https://en.wikipedia.org/wiki/FASTQ_format

Public data and platforms NCBI (https://www.ncbi.nlm.nih.gov/sra) Illumina basespace (https://basespace.illumina.com/home/index) Google genomics cloud (https://console.cloud.google.com/genomics/) Genome In A Bottle (GIAB) (http://jimb.stanford.edu/giab/) REPOSITIVE (https://discover.repositive.io/datasets/) GDC (https://portal.gdc.cancer.gov/) Seven Bridges (https://igor.sbgenomics.com/)

NCBI SRA portal

Illumina BaseSpace

3rd Generation Sequencing: PacBio Pacific Biosciences (PacBio) “Long read” “Single molecule”

3rd Gen Sequencing: Oxford Nanopore “Real time single molecule” https://www.youtube.com/watch?v=GUb1TZvMWsw