Pre-assembly analyses


Pre-assembly analyses
bjorn.nystedt@scilifelab.se
Facility manager, SciLifeLab Bioinformatics Long-term Support (a.k.a. WABI)
Credits to Doug Scofield, Nat Street, Francesco Vezzi, Amaryllis Vidali, Andrea Zuccolo and others!

Two types of assemblies
Case 1: Flycatcher (1.2 Gbp), Herring (800 Mbp), Malassezia (7 Mbp)
Case 2: Spruce (20 Gbp), Barnacle (1.4 Gbp), Wolbachia (4 Mbp)

Just a word on N50…
N50 typically refers to a contig (or scaffold) length.
But… the original definition is the number of contigs needed to reach half of the genome size (L50 is the length).
Many programs use the total assembly size as a proxy for the genome size; this is sometimes completely misleading. Use NG50!
PIs don’t understand N50 anyway; use something more intuitive:
- contigs larger than 1 kbp sum to 93% of the genome size
- contigs larger than 10 kbp sum to 48% of the genome size
- contigs larger than 100 kbp sum to 19% of the genome size
Figure: N50 is measured against the assembly size, NG50 against the genome size (labels: 3 contigs, 100 kbp; 5 contigs, 30 kbp).
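The difference between N50 and NG50 can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an assembly-statistics tool: it uses the length-based reading of N50, the function names are hypothetical, and the 700 kbp genome size in the usage note is an assumed value chosen only to make the two statistics differ.

```python
def nx_length(contig_lengths, reference_size, fraction=0.5):
    """Contig length at which the running sum (longest contig first)
    reaches fraction * reference_size; None if it is never reached."""
    target = fraction * reference_size
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


def n50(contig_lengths):
    # N50 uses the total assembly size as the reference.
    return nx_length(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    # NG50 uses the (estimated) genome size as the reference.
    return nx_length(contig_lengths, genome_size)


def fraction_in_contigs_over(contig_lengths, min_length, genome_size):
    """Share of the genome size covered by contigs longer than min_length."""
    return sum(l for l in contig_lengths if l > min_length) / genome_size
```

With three 100 kbp contigs and five 30 kbp contigs (450 kbp assembled) and an assumed genome size of 700 kbp, `n50` reports 100 kbp while `ng50` reports 30 kbp; `fraction_in_contigs_over` gives the more intuitive "contigs larger than X sum to Y% of the genome size" summary.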

Why is it hard?

The devil is in the repeats Mathematically best result: C R A B

Repeat errors
- Collapsed repeats and chimeras
- Overlapping non-identical reads
- Wrong contig order
- Inversions

It’s getting worse
A: ATCGGGTATATAG-CCTA
   ||||||| || || ||||
B: ATCGGGTGTACAGCCCTA
Possible outcomes: A? A & B? B?

…and humans are easy.
- Bacteria, archaea, fungi, some plants
- Most animals, some plants
- Many plants
Also: heterozygosity is generally very low in mammals; most other species are much harder.

Pre-assembly
- Quality trimming
- (Error correction)
- Kmer analysis
- De novo repeat library

Quality trimming
De Bruijn graph assemblers are in principle sensitive to errors, since they do not take base quality values into account.
- Trim adapters (e.g. Cutadapt)
- Filter on quality, at both the 5’ and the 3’ end! (e.g. Trimmomatic)
- Consider hard-trimming of the 5’ end
- Error correction (e.g. Quake)
- Inspect (e.g. FastQC)
Plots by Olof Karlberg
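To make "filter on quality, both 5’ and 3’ end" concrete, here is a minimal Python sketch that clips low-quality bases from both ends of a single read. It is an illustration only and does not reproduce the algorithm of Trimmomatic, Cutadapt or any other tool; the function names, the Phred+33 encoding and the default threshold of 20 are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_both_ends(sequence, quality_string, min_quality=20, offset=33):
    """Remove bases below min_quality from the 5' end and from the 3' end.

    Low-quality bases in the middle of the read are left untouched.
    Returns the trimmed (sequence, quality_string) pair.
    """
    scores = phred_scores(quality_string, offset)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1  # clip from the 5' end
    while end > start and scores[end - 1] < min_quality:
        end -= 1  # clip from the 3' end
    return sequence[start:end], quality_string[start:end]
```

A read whose bases are all below the threshold comes back empty, so a real pipeline would also drop reads that end up shorter than some minimum length (and keep read pairs in sync).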

Reads vs kmers
1 read: L = 100 bp
Kmers: k = 21 bp
Number of kmers per read: N = L - k + 1 = 100 - 21 + 1 = 80
Kmer coverage = Base coverage * (L-k+1)/L
Ex: 50X * (100-21+1)/100 = 40X (i.e. kmer coverage is 80% of base coverage)
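The relation between base coverage and kmer coverage is simple enough to check in code. The sketch below restates the slide's formula; the function names are hypothetical, and all reads are assumed to have the same length L.

```python
def kmers_per_read(read_length, k):
    """Number of kmers in one read: N = L - k + 1 (0 if the read is shorter than k)."""
    return max(read_length - k + 1, 0)


def kmer_coverage(base_coverage, read_length, k):
    """Kmer coverage = base coverage * (L - k + 1) / L."""
    return base_coverage * kmers_per_read(read_length, k) / read_length


def base_coverage_from_kmer_coverage(kmer_cov, read_length, k):
    """Inverse of kmer_coverage: base coverage = kmer coverage * L / (L - k + 1)."""
    n = kmers_per_read(read_length, k)
    if n == 0:
        raise ValueError("read_length must be at least k")
    return kmer_cov * read_length / n
```

For L = 100 and k = 21 this gives 80 kmers per read, so 50X base coverage corresponds to 40X kmer coverage, and the inverse function maps 40X back to 50X.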

Kmer analyses
Compute the frequency of each kmer in the dataset (e.g. Jellyfish --both-strands)
Note: RAM-intensive!
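What a kmer counter computes can be shown on a toy scale in Python. This is only a sketch of the idea: it holds every kmer in an in-memory dictionary, which is exactly why real datasets need a dedicated tool. Counting both strands is modelled here by collapsing each kmer with its reverse complement; the function names are hypothetical, and skipping kmers that contain non-ACGT characters is an assumption.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = frozenset("ACGT")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a kmer and its reverse complement by the same string."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(reads, k):
    """Count canonical kmers over all reads; kmers containing N etc. are skipped."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if _BASES.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct kmers seen that many times."""
    return Counter(counts.values())
```

The histogram returned by `kmer_histogram` is the quantity plotted in kmer-frequency plots: the x axis is how many times a kmer occurs, the y axis is how many distinct kmers occur that often.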

Digging into the kmers
Genome size
- Remove low-copy kmers
- Identify the coverage peak (Cpeak)
- Divide the total number of kmers (Ktot) by the peak
Figure annotation at Cpeak: “20 million distinct kmers occur 55 times in all reads combined”
Genome size = Ktot/Cpeak
Here: 1.4 Gbp = 80 G / 55
Note: Ktot = Nb reads * (L-k+1)
Base coverage = Cpeak * L/(L-k+1)
Here: 69X = 55 * 100/(100 - 21 + 1)
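The genome-size recipe can be sketched in Python on a kmer histogram (multiplicity mapped to the number of distinct kmers with that multiplicity). This is a rough illustration under stated assumptions: the function names are hypothetical, the cutoff for "low-copy" kmers has to be chosen by eye from the histogram, and the peak is simply taken as the most populated multiplicity above that cutoff.

```python
def coverage_peak(histogram, min_multiplicity):
    """Multiplicity with the most distinct kmers, ignoring low-copy (error) kmers."""
    candidates = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not candidates:
        raise ValueError("no kmers at or above min_multiplicity")
    return max(candidates, key=candidates.get)


def genome_size(total_kmers, peak):
    """Genome size = Ktot / Cpeak."""
    return total_kmers / peak


def estimate_genome_size(histogram, min_multiplicity):
    """Drop low-copy kmers, find the peak, divide the remaining kmer total by it."""
    peak = coverage_peak(histogram, min_multiplicity)
    total_kmers = sum(m * n for m, n in histogram.items() if m >= min_multiplicity)
    return genome_size(total_kmers, peak)
```

With the slide's numbers, `genome_size(80e9, 55)` comes out at roughly 1.45 Gbp. Real histograms are noisy, so in practice the peak position is usually checked against a plot rather than trusted blindly.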

Repeats: first shot
The number of distinct kmers in the single-copy peak corresponds roughly to the single-copy genome size.
Example, beetle (k=27): 0.75 Gbp is single-copy, so almost 40% of the 1.2 Gbp genome is repeated.
Figure labels: Single-copy, Repeats
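The same histogram gives a first repeat estimate. The Python sketch below sums the distinct kmers around the single-copy peak and converts that into a repeated fraction; the function names are hypothetical, and defining the peak as a symmetric window of `half_width` multiplicities around the peak is an assumption made for illustration.

```python
def single_copy_size(histogram, peak, half_width):
    """Distinct kmers with multiplicity within half_width of the peak.

    histogram maps multiplicity -> number of distinct kmers.
    """
    return sum(n for m, n in histogram.items() if abs(m - peak) <= half_width)


def repeated_fraction(single_copy_bp, genome_size_bp):
    """Fraction of the genome that is not single-copy."""
    return 1.0 - single_copy_bp / genome_size_bp
```

For the beetle example, `repeated_fraction(0.75e9, 1.2e9)` is 0.375, i.e. almost 40% of the genome is repeated.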

Heterozygosity
A double peak in the kmer histogram is a clear indication of heterozygosity.
Not entirely easy to quantify (although attempts have been made).

A word on quality filtering… Light QC filter Hard QC filter

Repeat library and repeat quantification
Create a de novo repeat library:
- Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
- Filter contaminants and mitochondrial/chloroplast sequences
- [ Make non-redundant (e.g. CD-HIT) ]
Quantify the (high) repeat content with an independent subset of reads:
- Mapping (e.g. bwa), or
- Mask with RepeatMasker

Repeat library from low coverage data
- Sparse seq data
- Overlaps?
- Assembled contigs
Warning! Beware of contaminations, plastids etc.

Quantify your repeat seqs
- Independent set of sparse data
- Screen reads with repeat seqs
33% of all bases in the reads are covered by repeat seqs => 33% of the genome is “repeated”
Warning! The quantification depends heavily on the size of the original read set
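The "share of read bases covered by repeat seqs" calculation can be sketched in Python once the hits are available. This is a minimal illustration: the input format (half-open start/end coordinates per read ID) and the function names are assumptions, and parsing the actual output of a mapper or of RepeatMasker is left out.

```python
def covered_bases(intervals):
    """Bases covered by the union of half-open [start, end) intervals."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def repeat_fraction(read_lengths, hits):
    """Fraction of all read bases covered by at least one repeat hit.

    read_lengths: read ID -> read length
    hits: read ID -> list of (start, end) repeat matches on that read
    """
    total = sum(read_lengths.values())
    masked = sum(covered_bases(hits.get(read_id, [])) for read_id in read_lengths)
    return masked / total
```

Merging overlapping hits before summing matters, because a base matched by two library sequences should still count only once.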

Classifying repeats
Repeat classes: LTR (Gypsy/Copia), LINE/SINE, DNA elements, … (getting tricky…)
Classifying the repeat library directly:
- RepeatMasker
- Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)
Problems:
- No close homologs in databases
- Rapid evolution of repeats (like transposable elements)
- Non-autonomous TEs do not contain proteins
Solutions:
- Fetch intact ORFs from hits in the assembly
- Extend assembly matches and get more complete elements
- Check match alignment profiles in the assembly (LINEs are conserved at the 3’ end but not at the 5’ end..)
=> Often slow, manual, species-specific solutions

Take home
- Genome assembly is sometimes reasonably easy, if you are lucky and not too picky. There are tools to indicate which one you are up against.
- Adapter and quality trimming is a pain in the neck. But you should probably do it. Unless you use ALLPATHS-LG.
- Genome size and repeat content can (often better!) be estimated without an assembly.

Thanks