Download presentation
Presentation is loading. Please wait.
1
High Throughput Sequencing
2
Agenda Introduction to sequencing Applications
Bioinformatics analysis pipelines What should you ask yourself before planning the experiment
3
Introduction to sequencing
4
What is sequencing? Finding the sequence of a DNA/ RNA molecule
What can we sequence?
5
Sanger sequencing Up to 1,000 bases molecule One molecule at a time
Widely used from First human genome draft was based on Sanger sequencing Still in use for single molecules
6
High Throughput Sequencing
Next Generation Sequencing (NGS) / Massively parallel sequencing Sequencing millions of molecules in parallel Do not need prior knowledge of what you’re sequencing Platform Read length No. of reads per run 454 sequencing Up to 1,000 bp ~1 M (1 million) SOLiD 50-75 bp ~1 G (1 billion) HiSeq bp ~0.5 G We will discuss Illumina’s platform only
7
Sequencing Workflow Extract tissue cells Extract DNA/RNA from cells
Sample preparation for sequencing Sequencing Bioinformatics analysis Why is it important to understand the “wet lab” part?
8
Sample Prep Random shearing of the DNA Adding adaptors and barcodes
Size selection Amplification Sequencing
9
Sequencing process
10
Leave only sequences from one direction
Sequencing process Leave only sequences from one direction
11
Sequencing process
12
Sequencing process
13
Sequencing process
14
Applications
15
DNA sequencing Resequencing – sequencing the genome of an organism with a known genome Exome sequencing / Targeted sequencing – sequencing only selected regions from the genome De-novo sequencing– sequencing the genome of an organism with a unknown genome
16
RNA-Seq Sequencing of mRNA extracted form the cell to get an estimate of expression levels of genes.
17
RNA-Seq vs. DNA-sequencing
Counting vs. Reading RNA-Seq vs. DNA-sequencing
18
ChIP-Seq Sequencing the regions in the genome to which a protein binds to.
19
Basic concepts Insert – the DNA fragment that is used for sequencing.
Read – the part of the insert that is sequenced. Single Read (SR) – a sequencing procedure by which the insert is sequenced from one end only. Paired End (PE) – a sequencing procedure by which the insert is sequenced from both ends.
20
Bioinformatics Analysis Pipelines
21
Demultiplexing Lane Unknown inserts
22
Mapping parameters affect the rest of the analysis
Demultiplexing Sample Mapping Reference Genome Example of mapping parameters: Number of mismatches per read Scores for mismatch or gaps Mapping parameters affect the rest of the analysis
23
Removing duplicates and non-unique mappings
Demultiplexing Mapping Reference Genome Removing duplicates and non-unique mappings Reference Genome ? 𝒂𝒗𝒆𝒓𝒂𝒈𝒆 𝒄𝒐𝒗𝒆𝒓𝒂𝒈𝒆= 𝑟𝑒𝑎𝑑 𝑙𝑒𝑛𝑔𝑡ℎ ⋅ 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑎𝑑𝑠 ⋅ % 𝑢𝑛𝑖𝑞𝑢𝑒𝑙𝑦 𝑚𝑎𝑝𝑝𝑒𝑑 𝑟𝑒𝑎𝑑𝑠 𝑔𝑒𝑛𝑜𝑚𝑒 𝑠𝑖𝑧𝑒
24
Resequencing/ Exome Pipeline
25
Removing duplicates and non-unique mappings
Demultiplexing Mapping Removing duplicates and non-unique mappings Coverage profile and variant calling Reference Genome …ACTTCGTCGAAAGG… G
26
Removing duplicates and non-unique mappings
Demultiplexing Frequency >= 20% Mapping Reference Genome …ACTTCGTCGAAAGG… Removing duplicates and non-unique mappings Coverage profile and variant calling Coverage >= 5 Variant filtering Reference Genome …ACTTCGTCGAAAGG…
27
Removing duplicates and non-unique mappings Genes and known variants
Demultiplexing Mapping Removing duplicates and non-unique mappings Variant calling Reference Genome …ACTTCGTCGAAATG… …GTCCCGTGATACTCCGT… G A Variant filtering Genes and known variants rs230985 Gene X
28
Resequencing results
29
Example for further analysis
Recessive disease: Variant not in known databases Homozygous variant shared by all affected individuals Same variant appears in healthy parents at heterozygous state Healthy brothers can be heterozygous to the same variant Demultiplexing Mapping Removing duplicates and non-unique mappings Coverage profile and variant calling Dominant disease: Variant not in known databases Heterozygous variant shared by all affected individuals The variant doesn’t appear in healthy individuals Variant filtering Genes and known variants 1 Table per sample Finding suspicious variants 1 Table per project
30
Quality control steps in the pipeline
Demultiplexing QC Mapping QC Removing duplicates and non-unique mappings QC Coverage profile and variant calling QC Variant filtering QC Genes and known variants Finding suspicious variants
31
How is de-novo assembly different from resequencing analysis ?
32
RNA-Seq Pipeline
33
Removing duplicates and non-unique mappings Gene expression levels
FPKM Normalization: 𝑟𝑎𝑤 𝑐𝑜𝑢𝑛𝑡 𝑔𝑒𝑛𝑒 𝑙𝑒𝑛𝑔𝑡ℎ⋅𝑠𝑎𝑚𝑝𝑙 𝑒 ′ 𝑠 𝑚𝑎𝑝𝑝𝑒𝑑 𝑟𝑒𝑎𝑑𝑠 𝑖𝑛 𝑚𝑖𝑙𝑙𝑖𝑜𝑛𝑠 Demultiplexing Mapping Removing duplicates and non-unique mappings Gene expression levels Unannotated genes
34
Differential expression parameters:
Demultiplexing Mapping Differential expression parameters: Threshold - Minimum number of reads for pair testing Normalization Replicates Differential expression parameters affect the results Removing duplicates and non-unique mappings Gene expression levels Differential gene expression
35
RNA-Seq results
36
Coverage Coverage – resequencing “Coverage” – RNA-Seq
Number of bases that cover each base in the genome in average. 𝒂𝒗𝒆𝒓𝒂𝒈𝒆 𝒄𝒐𝒗𝒆𝒓𝒂𝒈𝒆= 𝑟𝑒𝑎𝑑 𝑙𝑒𝑛𝑔𝑡ℎ ⋅ 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑟𝑒𝑎𝑑𝑠 ⋅ % 𝑢𝑛𝑖𝑞𝑢𝑒𝑙𝑦 𝑚𝑎𝑝𝑝𝑒𝑑 𝑟𝑒𝑎𝑑𝑠 𝑔𝑒𝑛𝑜𝑚𝑒 𝑠𝑖𝑧𝑒 “Coverage” – RNA-Seq Depends on the expression profile of each sample. Highly expressed genes will be detected with less “coverage” than lowly expressed genes.
37
What should you ask yourself before sequencing when planning the experiment
Reference genome: What is my reference genome? Does it have updated annotations? What annotations are known? Are my samples closely related to the reference genome? Do I expect to have contaminations in my sample? Do I have validations from other technologies? (RT-PCR, SNPchip…) Do I have controls and replicates? RNA-Seq: am I interested in alternative splicing? Resequencing: What kind of mutations do I expect to find?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.