Henrik Lantz - NBIS/SciLife/Uppsala University


Genome properties

Organisms are different, and so are assembly projects

Genome properties
Genome size
Heterozygosity levels
Repeat-content
GC-content
Secondary structure
Ploidy level

Genome sizes range from 100 kbp to 150 Gbp.
The larger the genome, the more data is needed to assemble it (usually >50x coverage).
Compute needs (running time and memory) grow with the amount of data.
Larger genomes are not necessarily harder to assemble, although empirically this is often the case.
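As a rough illustration of how data requirements scale with genome size, the sketch below computes how many reads are needed to reach a target coverage. The function name, the 150 bp read length and the 1 Gbp example genome are assumptions chosen for illustration, not values from the slides.

```python
import math

def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Reads required so that total sequenced bases = coverage * genome size."""
    total_bases = genome_size_bp * coverage
    return math.ceil(total_bases / read_length_bp)

# Hypothetical example: 1 Gbp genome, 50x coverage, 150 bp reads.
print(reads_needed(1_000_000_000, 50, 150))
```

The required data grows linearly with genome size, which is why running time and memory become the limiting factors for large genomes.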

Heterozygosity (Slide by Torsten Seemann, Victorian Life Sciences Computation Initiative)

Highly heterozygous fungus (Zheng et al. (2013) Nature Com.)

Heterozygosity
Highly heterozygous regions tend to be assembled separately.
Homologous regions then exist in multiple copies in the assembly.
This causes downstream problems in determining orthology for gene-based analyses, comparative genomics, etc.
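A deliberately simplified model of this effect, assuming every highly heterozygous region is assembled as two separate copies and everything else as one. The function name and the example numbers are hypothetical.

```python
def expected_assembly_size(haploid_size_bp, split_fraction):
    """Haploid size plus one extra copy of each region assembled as two haplotypes.

    split_fraction is the (assumed) fraction of the haploid genome lying in
    regions heterozygous enough to be assembled separately.
    """
    if not 0.0 <= split_fraction <= 1.0:
        raise ValueError("split_fraction must be between 0 and 1")
    return round(haploid_size_bp * (1.0 + split_fraction))

# Hypothetical example: 40 Mbp haploid genome, 25% assembled as separate haplotypes.
print(expected_assembly_size(40_000_000, 0.25))
```

An assembly that is clearly larger than the expected haploid genome size can therefore be a hint that haplotypes have been assembled separately.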

Effect of heterozygosity on assembly size (Pryszcz and Gabaldon (2016) Nucl. Acids Res.)

Repeats
Identical, or near-identical, regions occurring in multiple copies in a genome (Istvan et al. (2011), PLoS ONE)

Repeats
Low complexity regions: regions where some nucleotides are overrepresented, such as homopolymers (e.g., AAAAAAAAAA) or slightly more complex sequences (e.g., AAATAAAAAGAAAA)
Tandem repeats: a pattern of one or more nucleotides repeated directly adjacent to each other, e.g., AGAGAGAGAGAGAGAGAGAG
  Microsatellites: repeat unit of 2-5 nucleotides (e.g., GATAGATAGATA)
  Minisatellites: repeat unit of 10-60 nucleotides
Complex repeats: transposons, retroviruses, segmental duplications, rDNA, etc.
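To make the tandem repeat definition concrete, here is a minimal sketch that finds perfect tandem repeats with a regular expression. It is not how dedicated repeat finders work: it ignores mismatches and indels, and the function name and default thresholds are assumptions for illustration.

```python
import re

def find_tandem_repeats(seq, unit_min=2, unit_max=5, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    The lazy quantifier makes the regex prefer the shortest repeat unit,
    so a run of 'A' is reported with unit 'AA' rather than 'AAAA'.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# The microsatellite example from the slide, with two flanking bases on each side.
print(find_tandem_repeats("TTGATAGATAGATACC"))
```

With the defaults the unit length range matches the microsatellite definition above; setting unit_min=10 and unit_max=60 would target minisatellites instead.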

How repeats can cause assembly errors
[Figure: assembly around repeat R; mathematically best result: C R A B]

Repeat errors
Collapsed repeats and chimeras
Overlapping non-identical reads
Wrong contig order
Inversions
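The toy example below illustrates why wrong contig order can occur. Two sequences that differ only in the order of the unique segments between repeat copies contain exactly the same k-mers as long as k does not exceed the repeat length plus one. The sequences are made up for illustration, and comparing k-mer content is only a stand-in for what a real assembler sees.

```python
from collections import Counter

def kmer_spectrum(seq, k):
    """Multiset of all substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical toy genome: a 7 bp repeat R occurs three times between unique segments.
R = "GATTACA"
a, b, c, d = "CCGT", "TTGC", "AAGG", "CGCA"
true_genome = a + R + b + R + c + R + d
misordered = a + R + c + R + b + R + d  # segments b and c swapped

for k in (8, 9):
    same = kmer_spectrum(true_genome, k) == kmer_spectrum(misordered, k)
    print(k, same)
# prints "8 True" then "9 False"
```

Only at k = 9 does a single k-mer reach across a whole repeat copy into unique sequence on both sides, and only then do the two orderings become distinguishable.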

When can I expect repeats to cause a problem?
Always…
Repeats are much more common in eukaryotes, in particular plants and many animals.
Several conifers have a repeat content of ~75%, mostly simple repeats, resulting in huge genomes.

How to deal with repeats
Use long-range information, e.g., long reads or paired reads with long insert sizes.
[Figure: short reads versus long reads across repeat copies R1 and R2]
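A small sketch of the idea behind long-range information: a read can only place a repeat copy unambiguously if it covers the whole repeat and reaches into unique sequence on both sides. The function name, the coordinates and the 20 bp anchor length are assumptions for illustration.

```python
def read_spans_repeat(read_start, read_length, repeat_start, repeat_length, min_anchor=20):
    """True if the read covers the whole repeat plus min_anchor unique bases on each side.

    Coordinates are 0-based; intervals are half-open [start, end).
    """
    read_end = read_start + read_length
    repeat_end = repeat_start + repeat_length
    return (read_start <= repeat_start - min_anchor
            and read_end >= repeat_end + min_anchor)

# Hypothetical 6 kbp repeat starting at position 1000.
print(read_spans_repeat(950, 150, 1000, 6000))     # short read: False
print(read_spans_repeat(900, 10_000, 1000, 6000))  # long read: True
```

Paired reads with long insert sizes achieve something similar indirectly: the two mates anchor in unique sequence on either side of the repeat even though neither read covers it.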

Effect of insert size on scaffold length

Repeat identification
These tools allow you to find repeats de novo:
RepeatExplorer
RepeatModeler
REPET

RepeatMasker
file name: FILTERED_4_111227_AD07GTACXX_B31_index7_1.sub500k.fa
sequences:        500000
total length:   47417491 bp  (47417491 bp excl N/X-runs)
GC level:         45.49 %
bases masked:   18112773 bp ( 38.20 %)
==================================================
                     number of        length   percentage
                     elements*      occupied  of sequence
--------------------------------------------------
SINEs:                       0          0 bp       0.00 %
  ALUs                       0          0 bp       0.00 %
  MIRs                       0          0 bp       0.00 %
LINEs:                       0          0 bp       0.00 %
  LINE1                      0          0 bp       0.00 %
  LINE2                      0          0 bp       0.00 %
  L3/CR1                     0          0 bp       0.00 %
LTR elements:                0          0 bp       0.00 %
  ERVL                       0          0 bp       0.00 %
  ERVL-MaLRs                 0          0 bp       0.00 %
  ERV_classI                 0          0 bp       0.00 %
  ERV_classII                0          0 bp       0.00 %
DNA elements:                0          0 bp       0.00 %
  hAT-Charlie                0          0 bp       0.00 %
  TcMar-Tigger               0          0 bp       0.00 %
Unclassified:           218285   17781419 bp      37.50 %
Total interspersed repeats:      17781419 bp      37.50 %
Small RNA:                   0          0 bp       0.00 %
Satellites:                  0          0 bp       0.00 %
Simple repeats:          13539     656791 bp       1.39 %
Low complexity:              0          0 bp       0.00 %
==================================================
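The percentages in the report are simply the masked base counts divided by the total sequence length. The snippet below recomputes them from the numbers shown above; the variable and function names are my own.

```python
TOTAL_LENGTH_BP = 47_417_491  # "total length" in the report

masked_bp = {
    "bases masked": 18_112_773,
    "Unclassified": 17_781_419,
    "Simple repeats": 656_791,
}

def percent_of_sequence(bp, total_bp=TOTAL_LENGTH_BP):
    """Share of the input sequence occupied, in percent, rounded to two decimals."""
    return round(100.0 * bp / total_bp, 2)

for label, bp in masked_bp.items():
    print(f"{label}: {percent_of_sequence(bp):.2f} %")
```

In this report nearly all masked bases fall in the "Unclassified" category, i.e. repeats from the library that were not assigned to a known repeat class.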

Genome properties
GC-content: regions of low or high GC-content get lower coverage (Illumina, not PacBio)
Secondary structure: regions that are tightly bound get less coverage
Ploidy level: at higher ploidy levels you potentially have more alleles present
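To spot regions where GC-related coverage drops might be expected, GC-content can be computed in windows along a sequence. This is a minimal sketch; the function names, window size and step are arbitrary illustrative choices.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def windowed_gc(seq, window, step):
    """List of (window start, GC fraction) for windows of fixed size along seq."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy sequence: an AT-rich half followed by a GC-rich half.
print(windowed_gc("AAAATTTTGGGGCCCC", window=8, step=8))
```

Comparing such a GC profile against per-window read depth is one way to check whether low-coverage regions coincide with extreme GC-content.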

Additional complexity
Size of organism: hard to extract enough DNA from small organisms
Pooled individuals: increases the variability of the DNA (more alleles)
Inhibiting compounds: lower coverage and shorter fragments
Presence of additional genomes/contamination: lower coverage of what you are actually interested in, potentially chimeric assemblies