Henrik Lantz - NBIS/SciLifeLab/Uppsala University

Presentation transcript:

Genome properties

Organisms are different, and so are assembly projects

Genome properties
- Genome size
- Heterozygosity levels
- Repeat content
- GC content
- Secondary structure
- Ploidy level

Genome size
- Genome sizes range from 100 kbp to 150 Gbp.
- The larger the genome, the more data is needed to assemble it (usually >50x coverage).
- Compute needs (running time and memory) grow with the amount of data.
- Larger genomes are not necessarily harder to assemble, although empirically this is often the case.
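As a rough planning aid, the relationship between genome size, target coverage and the amount of sequence data can be sketched in Python. The function names and the 500 Mbp example genome are illustrative assumptions; the 50x default is the rule of thumb stated above.

```python
def required_bases(genome_size_bp, coverage=50):
    """Total number of sequenced bases needed for the target average coverage."""
    return genome_size_bp * coverage


def required_read_pairs(genome_size_bp, read_length_bp, coverage=50):
    """Read pairs needed for the target coverage (two reads per pair), rounded up."""
    total = required_bases(genome_size_bp, coverage)
    bases_per_pair = 2 * read_length_bp
    return -(-total // bases_per_pair)  # integer ceiling division


if __name__ == "__main__":
    genome = 500_000_000  # hypothetical 500 Mbp genome
    print(required_bases(genome))            # bases for 50x
    print(required_read_pairs(genome, 150))  # 2x150 bp read pairs for 50x
```

This only gives average coverage; actual coverage varies along the genome, so treat the numbers as a lower bound when planning.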

Heterozygosity (Slide by Torsten Seemann, Victorian Life Sciences Computation Initiative)

Highly heterozygous fungus (Zheng et al. (2013), Nature Communications)

Heterozygosity
- Highly heterozygous regions tend to be assembled separately.
- Homologous regions then exist in multiple copies in the assembly.
- This causes downstream problems in determining orthology for gene-based analyses, comparative genomics, etc.

Effect of heterozygosity on assembly size (Pryszcz and Gabaldón (2016), Nucleic Acids Res.)

Repeats
Identical, or near-identical, regions occurring in multiple copies in a genome (Istvan et al. (2011), PLoS ONE)

Repeats
- Low-complexity regions: regions where some nucleotides are overrepresented, such as homopolymers (e.g., AAAAAAAAAA) or slightly more complex sequences (e.g., AAATAAAAAGAAAA)
- Tandem repeats: a pattern of one or more nucleotides repeated directly adjacent to each other, e.g., AGAGAGAGAGAGAGAGAGAG
  - 2-5 nucleotides: microsatellites (e.g., GATAGATAGATA)
  - 10-60 nucleotides: minisatellites
- Complex repeats (transposons, retroviruses, segmental duplications, rDNA, etc.)
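The tandem-repeat categories above can be illustrated with a small regular-expression scan. This is a minimal sketch, not a replacement for a dedicated repeat finder: the function names, the minimum of three copies, and the 5-nucleotide search limit are assumptions chosen for the example, and only exact (not near-identical) copies are detected.

```python
import re


def classify_unit(unit_len):
    """Name a tandem repeat by unit length, using the size ranges given above."""
    if unit_len == 1:
        return "homopolymer"
    if 2 <= unit_len <= 5:
        return "microsatellite"
    if 10 <= unit_len <= 60:
        return "minisatellite"
    return "other tandem repeat"


def find_tandem_repeats(seq, max_unit=5, min_copies=3):
    """Return (start, end, unit, copies) for exact tandem repeats in seq."""
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit of the given length followed by at least min_copies - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves built from a shorter unit (e.g. AGAG = AG x 2).
            if any(unit == unit[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            hits.append((m.start(), m.end(), unit, (m.end() - m.start()) // unit_len))
    return hits


if __name__ == "__main__":
    example = "CCGT" + "AG" * 10 + "TTC" + "GATA" * 3 + "C"
    for start, end, unit, copies in find_tandem_repeats(example):
        print(start, end, unit, copies, classify_unit(len(unit)))
```

Searching for minisatellites would require raising `max_unit`, at the cost of a slower scan.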

How repeats can cause assembly errors Mathematically best result: C R A B

Repeat errors
- Collapsed repeats and chimeras (overlapping non-identical reads)
- Wrong contig order
- Inversions

When can I expect repeats to cause a problem?
- Always…
- Repeats are much more common in eukaryotes, in particular plants and many animals.
- Several conifers have a repeat content of ~75%, mostly simple repeats, which leads to huge genomes.

How to deal with repeats
Use long-range information, e.g., long reads or paired reads with long insert sizes.
(Figure: short reads versus long reads across the repeat copies R1 and R2.)

Genome properties
- GC-content: regions of low or high GC-content have a lower coverage (Illumina, not PacBio).
- Secondary structure: regions that are tightly bound get less coverage.
- Ploidy level: at higher ploidy levels you potentially have more alleles present.
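To see where GC-related coverage drops might be expected, GC-content can be computed in windows along a sequence. The sketch below is illustrative only: the function names, the window size, and the 30% and 65% cut-offs are assumptions, not values from the slides.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window=1000):
    """GC-content of consecutive, non-overlapping windows as (start, gc) pairs."""
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, window)]


def extreme_windows(seq, window=1000, low=0.30, high=0.65):
    """Windows whose GC-content falls outside the (assumed) low/high cut-offs."""
    return [(start, gc) for start, gc in windowed_gc(seq, window)
            if gc < low or gc > high]


if __name__ == "__main__":
    print(extreme_windows("AAAAGGGGATGC", window=4))
```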

Additional complexity
- Size of organism: it is hard to extract enough DNA from small organisms.
- Pooled individuals: increases the variability of the DNA (more alleles).
- Inhibiting compounds: lower coverage and shorter fragments.
- Presence of additional genomes/contamination: lower coverage of what you are actually interested in, potentially chimeric assemblies.

Project planning

Sequencing technology comparison

| Sequencing system | Read length | Yield |
|---|---|---|
| Illumina HiSeq 2500 | 2x125 bp | 180 M read pairs/lane, 28 Gbp/lane |
| Illumina HiSeqX | 2x150 bp | 350 M read pairs/lane, 78 Gbp/lane |
| Illumina NovaSeq 6000 | | 2000 M read pairs/lane |
| Illumina MiSeq | Up to 2x500 bp | 18 M read pairs/lane, 7.4 Gbp/run |
| PacBio RSII | 250-70000 bp | 1 Gb/SMRT cell |
| PacBio Sequel | 250-300000 bp | ~10 Gb/SMRT cell |
| Oxford Nanopore | 500-1000000 bp | 148 Gb/flow cell |
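The yields in this comparison can be turned into a quick estimate of how many lanes or SMRT cells a project needs. This is a back-of-the-envelope sketch: the dictionary and function names are made up for the example, the yields are copied from the comparison above, and real runs often deliver less than the nominal yield.

```python
import math

# Nominal yield per unit in Gbp, copied from the comparison above.
YIELD_GBP = {
    "Illumina HiSeq 2500 (lane)": 28,
    "Illumina HiSeqX (lane)": 78,
    "PacBio RSII (SMRT cell)": 1,
    "PacBio Sequel (SMRT cell)": 10,
}


def units_needed(genome_size_gbp, coverage, yield_gbp):
    """Lanes or cells needed to reach the target coverage, rounded up."""
    return math.ceil(genome_size_gbp * coverage / yield_gbp)


if __name__ == "__main__":
    genome_gbp = 1.0  # hypothetical 1 Gbp genome
    for system, yield_gbp in YIELD_GBP.items():
        print(system, units_needed(genome_gbp, 50, yield_gbp))
```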

Error rates and types

| Sequencing system | Error type | Error rate |
|---|---|---|
| Illumina | Substitutions | 0.1% |
| PacBio | Insertions | 0.001-12% depending on read length |
| Oxford Nanopore | Substitutions, indels | 15% |
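To get a feel for what these rates mean per read, the expected number of errors and the chance of an error-free read can be estimated. The sketch assumes errors are independent and uniformly distributed along the read, which is a simplification (real error profiles vary by position and context); the function names and read lengths are illustrative.

```python
def expected_errors(read_length_bp, error_rate):
    """Expected number of erroneous bases in one read."""
    return read_length_bp * error_rate


def prob_error_free(read_length_bp, error_rate):
    """Probability that a read has no errors, assuming independent per-base errors."""
    return (1.0 - error_rate) ** read_length_bp


if __name__ == "__main__":
    # 150 bp read at a 0.1% error rate
    print(expected_errors(150, 0.001), prob_error_free(150, 0.001))
    # hypothetical 10 kbp read at a 15% error rate
    print(expected_errors(10_000, 0.15), prob_error_free(10_000, 0.15))
```

Under these assumptions most short, low-error reads are error-free, while essentially every long read at a 15% error rate contains errors, which is why long-read assemblies are usually polished.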

De novo genome project workflow
- Plan your project!
- Extract DNA (and RNA)
- Choose the best sequencing technology for the project
- Sequencing
- Quality assessment and other pre-assembly investigations
- Assembly
- Assembly validation
- Assembly comparisons
- Repeat masking?
- Annotation

De novo genome project workflow Plan your project!

What do you want to achieve?
Quality, from highest to lowest, traded against effort and cost:
- Fully assembled and phased genome, full gene space
- Draft genome, split in longer repeats, complete genes, almost full gene space
- Draft genome, highly fragmented, split genes, partial gene space

Pilot project?
- One lane of Illumina data is cheap and can be used to investigate genome size, presence of contaminants, and more.
- Long-read technologies are sensitive to DNA quality issues. Trying several extractions before deciding which one to use can be a good idea.
- An extraction that gives good QC values (fragment sizes, absorption ratios, etc.) can still fail in sequencing!

Estimate computing resources (Dominguez del Angel et al., 2018)

Estimate computing resources
- What tools do you want to run?
- Assembly can be memory-intensive, and polishing can also require a lot of memory. A normal Rackham node might not be enough.
- Do you have the necessary storage space?
- Can you run your tools on several nodes over MPI?

DNA extraction

Example

Causes of DNA degradation
- Mechanical damage during tissue homogenization.
- Wrong pH and ionic strength of the extraction buffer.
- Incomplete removal of, or contamination with, nucleases.
- Phenol: too old, or inappropriately buffered (should be pH 7.8 – 8.0); incomplete removal.
- Wrong pH of the DNA solvent (acidic water). Recommended: 1:10 TE for short-term storage, or 1xTE for long-term storage.
- Vigorous pipetting (use wide-bore pipette tips).
- Vortexing of DNA in high concentrations.
- Too many freeze-thaw cycles (we tested 5, still OK).
- Debatable: sequence-dependent

What are the main contaminants?
- Polysaccharides
- Lipopolysaccharides
- Growth media residuals
- Chitin
- Proteins
- Fats
- Secondary metabolites
- Pigments
- Polyphenols
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab

What do absorption ratios tell us?
Pure DNA 260/280: 1.8 – 2.0
- < 1.8: too little DNA compared to other components of the solution; presence of organic contaminants that absorb at 280 nm (proteins, phenol, glycogen).
- > 2.0: high share of RNA.
Pure DNA 260/230: 2.0 – 2.2
- < 2.0: salt contamination, humic acids, peptides, aromatic compounds, polyphenols, urea, guanidine, thiocyanates (the latter three are common kit components); these absorb at 230 nm.
- > 2.2: high share of RNA, very high share of phenol, high turbidity, dirty instrument, wrong blank.
Photometrically active contaminants: phenol, polyphenols, EDTA, thiocyanate, protein, RNA, nucleotides (fragments below 5 bp)
By Olga Vinnere Pettersson, Uppsala Genome Center, SciLifeLab
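The ranges above lend themselves to a simple automated check of spectrophotometer readings. The following is a minimal sketch using exactly those ranges; the function name and the wording of the messages are assumptions made for the example, and a flagged ratio is a hint to investigate, not a diagnosis.

```python
def interpret_ratios(a260_280, a260_230):
    """Return warnings for absorbance ratios outside the pure-DNA ranges.

    An empty list means both ratios are within range
    (260/280: 1.8-2.0, 260/230: 2.0-2.2).
    """
    notes = []
    if a260_280 < 1.8:
        notes.append("260/280 low: possible protein, phenol or other organic contamination")
    elif a260_280 > 2.0:
        notes.append("260/280 high: possibly a high share of RNA")
    if a260_230 < 2.0:
        notes.append("260/230 low: possible salt, polyphenol or kit-component contamination")
    elif a260_230 > 2.2:
        notes.append("260/230 high: possibly RNA, phenol, turbidity, dirty instrument or wrong blank")
    return notes


if __name__ == "__main__":
    print(interpret_ratios(1.9, 2.1))  # within both ranges
    print(interpret_ratios(1.5, 1.2))  # both ratios low
```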

DNA quality requirements
- On a gel: some DNA left in the well; sharp band of 20+ kb; no sign of proteins; no smear of degraded DNA; no sign of RNA
- NanoDrop: 260/280 = 1.8 – 2.0; 260/230 = 2.0 – 2.2
- Qubit or PicoGreen:
  - 10 kb insert libraries: 3-5 µg
  - 20 kb insert libraries: 10-20 µg

De novo genome project workflow
- Plan your project!
- Extract DNA (and RNA)
  - Extract much more DNA than you think you need.
  - Also remember to extract RNA for the annotation.
  - Use a single individual and haploid tissue if possible.
  - In particular for Illumina mate-pair data and PacBio, a lot of high molecular weight DNA is critical!
  - Extracting DNA for de novo assembly is very different from extractions intended for PCR.
  - Do several extractions if possible, and run them on a gel to get an idea of how fragmented the DNA is.
  - Try to remove contaminants from the extractions.

Effect of insert size on scaffold length (Treangen and Salzberg, 2013)