QC and pre-assembly analyses

QC and pre-assembly analyses Henrik Lantz, Mahesh Panchal - BILS/SciLife/Uppsala University

Important organism-specific properties
- Genome size: large genomes require more data, which takes more time and is more complex to analyse.
- Repeat content: reads from different copies of a repeat are identical and confound the algorithms.
- Heterozygosity: assemblers usually try to create a haploid consensus assembly, but will create double assemblies of heterozygous regions.

The devil is in the repeats Mathematically best result: C R A B

Repeat errors Collapsed repeats Overlapping non-identical reads and chimeras Overlapping non-identical reads Wrong contig order Inversions

Preparing reads for assembly
- Integrity and format validation
- Adapter removal
- (Error correction)
- Kmer analysis
- Contamination removal

Data integrity and format
Many tools cannot tell whether the data is complete. Transferred data should come with checksums, e.g. MD5.
A checksum file (here md5.txt) holds one line per file:
    823fc8b0ca72c6e9bd8c5dcb0a66ce9b  file1.fastq.gz
Verify the files after transfer:
    $ md5sum -c md5.txt
    file1.fastq.gz: OK
    file2.fastq.gz: OK
    file3.fastq.gz: FAILED
    md5sum: WARNING: 1 of 3 computed checksums did NOT match
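The checksum round trip can be sketched as two small shell functions. This is only an illustration: the function names `write_checksums` and `verify_transfer` are made up here, and the sketch assumes GNU `md5sum` and `gzip` are available and that all files of interest match `*.fastq.gz` in a single directory.

```shell
#!/usr/bin/env bash
# Sketch: record checksums before a transfer and verify them afterwards.
# Function names and the md5.txt file name are illustrative assumptions.

# Run on the source side: write one MD5 line per compressed FASTQ file.
write_checksums() {
    local dir=$1
    ( cd "$dir" && md5sum -- *.fastq.gz > md5.txt )
}

# Run on the destination side: succeeds only if every checksum matches
# and every file is a complete, readable gzip stream.
verify_transfer() {
    local dir=$1
    ( cd "$dir" && md5sum -c md5.txt && gzip -t -- *.fastq.gz )
}
```

The `gzip -t` step is an extra check beyond the checksum: it catches files that were already truncated before the checksums were computed.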

Data integrity and format
Inspect your fastq files:
    $ zcat file1.fastq.gz | head
    @HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 1:N:0:ATTCCT
    CTTATCGGATCGATCCCAGTTTGGGCTTGTAAACGGTGAATCCTCAAAGACCACCAATGTTG
    +
    CCCFFFFFHHHHHJJJJJJHIJIIJGGJGFEGIGHIBFGHJIJIICHIIIDHGGIGIGHEFG
    @HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 2:N:0:ATTCCT
    TAACCGAGCAAACAAAAGTTGGTTGTCACAAATTGTAATGACCTGATTAAACTTGATTTTTT
    +
    CCCFFFFFHHHHHJIIIJHIJJHIJJJJJJJJJJJIJJJIJJJJJIIIJJIJJJJGIJJJJH
zcat lets you look at gzip-compressed files, and bzcat at bzip2-compressed files.
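Beyond looking at the first records, a quick structural check can catch truncated or mismatched files. The sketch below is an assumption-laden illustration, not a full validator: the function names are hypothetical, and it assumes the common 4-line FASTQ layout (no wrapped sequence lines) in gzip-compressed files.

```shell
#!/usr/bin/env bash
# Sketch: count reads in a gzip-compressed FASTQ file and fail if the file
# is not made of complete 4-line records (header "@", separator "+").
fastq_read_count() {
    gzip -dc -- "$1" | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        END {
            if (bad || NR % 4 != 0) exit 1
            printf "%d\n", NR / 4
        }'
}

# Paired files (read 1 and read 2) should hold the same number of reads.
same_read_count() {
    local n1 n2
    n1=$(fastq_read_count "$1") || return 1
    n2=$(fastq_read_count "$2") || return 1
    [ "$n1" -eq "$n2" ]
}
```

For example, `same_read_count Sample034_Lane1_R1.fastq.gz Sample034_Lane1_R2.fastq.gz` returns a non-zero status if either file is malformed or the read counts differ.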

Basic inspection
FastQC is a first step to diagnose major errors.
    $ module load bioinfo-tools FastQC/0.11.2
    $ fastqc -t 6 *.fastq.gz
Zhou and Rokas, 2014: Mol. Ecol.

Trimming adapters
Adapter read-through is common.
    $ module load bioinfo-tools trimmomatic/0.32
    $ TRIMAPP=/sw/apps/bioinfo/trimmomatic/0.32/milou/trimmomatic.jar
    $ ADAPTERFILE=adapters.fasta
    $ java -jar $TRIMAPP PE -threads 16 \
        Sample034_Lane1_R1.fastq.gz \
        Sample034_Lane1_R2.fastq.gz \
        Sample034_Lane1_R1.clean.fastq.gz \
        Sample034_Lane1_R1.unpaired.clean.fastq.gz \
        Sample034_Lane1_R2.clean.fastq.gz \
        Sample034_Lane1_R2.unpaired.clean.fastq.gz \
        ILLUMINACLIP:$ADAPTERFILE:2:30:10 \
        LEADING:3 TRAILING:3 MINLEN:50

Detecting biases
Do your fastq files contain the same information?
Biases come from many sources:
- Library preparation
- Contamination
- Machine error

Kmer analyses
Compute the frequency of each kmer in the dataset.
Note: RAM-intensive!

Kmer analyses
    module load bioinfo-tools KAT/2.0.6 gnuplot/4.6.5
    OUTPUTDIR=$SNIC_TMP/kat_qc
    PROJDIR=$(pwd)
    mkdir -p "$OUTPUTDIR"
    cd "$OUTPUTDIR"
    # Decompress every fastq.gz file into the scratch directory
    for FASTQ in $(find "$PROJDIR" -name "*.fastq.gz"); do
        gzip -dc "$FASTQ" > "$(basename "${FASTQ%.gz}")"
    done
    # Kmer histogram over all reads
    kat hist -t 32 -C -o all_data_hist *.fastq
    rm *.fastq
    cd "$PROJDIR"
    # Copy the results back to the project directory
    rsync -av "$OUTPUTDIR" .

Reads vs kmers
1 read: L = 100 bp
Kmers: k = 21 bp
Number of kmers per read: N = L - k + 1 = 100 - 21 + 1 = 80
Kmer coverage = Base coverage * (L - k + 1) / L
Ex: 50X * (100 - 21 + 1) / 100 = 40X (i.e. kmer coverage is 80% of base coverage)
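The arithmetic on this slide can be reproduced with two small helpers. This is a minimal sketch; the function names `kmers_per_read` and `kmer_coverage` are illustrative and not part of any tool mentioned in the slides.

```shell
#!/usr/bin/env bash
# Number of kmers in one read of length L: L - k + 1.
# Arguments: read length L, kmer size k.
kmers_per_read() {
    echo $(( $1 - $2 + 1 ))
}

# Kmer coverage = base coverage * (L - k + 1) / L.
# Arguments: base coverage, read length L, kmer size k.
kmer_coverage() {
    awk -v c="$1" -v L="$2" -v k="$3" \
        'BEGIN { printf "%.1f\n", c * (L - k + 1) / L }'
}

kmers_per_read 100 21      # prints 80
kmer_coverage 50 100 21    # prints 40.0
```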

Digging into the kmers: genome size
- Remove low-copy kmers
- Identify the coverage peak (Cpeak)
- Divide the total number of kmers (Ktot) by the peak
Example peak: "20 million distinct kmers occur 55 times in all reads combined", so Cpeak = 55
Genome size = Ktot / Cpeak
Here: 1.4 Gbp ≈ 80 G / 55
Note: Ktot = Nb reads * (L - k + 1)
Base coverage = Cpeak * L / (L - k + 1)
Here: 69X ≈ 55 * 100 / (100 - 21 + 1)
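The three steps (drop low-copy kmers, find the peak, divide) can be sketched with awk on a kmer histogram. The input format is an assumption: a text file with two whitespace-separated columns (kmer multiplicity, number of distinct kmers) and optional header lines starting with `#`. Check the histogram your kmer counter actually writes before relying on it. The function names and the cutoff argument are illustrative, and taking the tallest bin above the cutoff as the peak is a simplification.

```shell
#!/usr/bin/env bash
# Arguments: histogram file, lowest multiplicity to keep (cutoff that drops
# low-copy, error-dominated kmers).
# Prints: Cpeak, Ktot, and the estimated genome size Ktot / Cpeak.
estimate_genome_size() {
    awk -v min="$2" '
        /^#/ || NF < 2 { next }        # skip header and malformed lines
        $1 >= min {
            ktot += $1 * $2            # kmer occurrences at this multiplicity
            if ($2 > best) { best = $2; cpeak = $1 }
        }
        END {
            if (cpeak == 0) exit 1     # no usable data above the cutoff
            printf "%.0f %.0f %.0f\n", cpeak, ktot, ktot / cpeak
        }' "$1"
}

# Convert the kmer coverage peak back to base coverage by rearranging
# kmer coverage = base coverage * (L - k + 1) / L.
# Arguments: Cpeak, read length L, kmer size k.
base_coverage() {
    awk -v c="$1" -v L="$2" -v k="$3" \
        'BEGIN { printf "%.2f\n", c * L / (L - k + 1) }'
}
```

With the slide's numbers, `base_coverage 55 100 21` prints 68.75, which rounds to the 69X quoted above.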

Repeats: first shot
The number of distinct kmers in the single-copy peak corresponds roughly to the size of the single-copy part of the genome.
Example (beetle, k=27): 0.75 Gbp is single-copy, so almost 40% of the 1.2 Gbp genome is repeated.
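A rough version of this estimate can be scripted as follows. As a sketch only: the histogram is assumed to be a two-column text file (multiplicity, number of distinct kmers) with optional `#` header lines, the bounds of the single-copy peak are chosen by eye from the plot, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sum the distinct kmers whose multiplicity lies inside the single-copy peak.
# Arguments: histogram file, lower bound, upper bound (both inclusive).
single_copy_size() {
    awk -v lo="$2" -v hi="$3" '
        /^#/ || NF < 2 { next }
        $1 >= lo && $1 <= hi { n += $2 }
        END { printf "%.0f\n", n }' "$1"
}

# Percentage of the genome outside the single-copy fraction.
# Arguments: single-copy size, total genome size (same unit).
repeat_percentage() {
    awk -v s="$1" -v g="$2" 'BEGIN { printf "%.1f\n", 100 * (1 - s / g) }'
}

repeat_percentage 0.75 1.2    # prints 37.5, the "almost 40%" of the beetle example
```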

Heterozygosity
- A double peak in the kmer histogram is a clear indication of heterozygosity.
- It is not entirely easy to quantify (although attempts have been made).

Back to biases
Do read 1 and read 2 have the same bias?
Kmer Analysis Toolkit (KAT): a short walkthrough

Bias detection and kmer analyses Do read 1 and read 2 have the same content?

Bias detection and kmer analyses Are all your runs/libraries affected in the same way?

Bias detection and kmer analyses Do your runs/libraries contain the same data?

Kmer analyses
    # Compare read 1 vs read 2, or lib A vs lib B
    # Density plot
    kat comp -p -t 16 -C -D -o $OUTPUT $FWDREAD $REVREAD
    # Spectra plot (must run density computation first)
    kat plot spectra-mx -n -o ${OUTPUT}_s.png $OUTPUT-main.mx
    # Compare GC content
    kat gcp -t 16 -C -o $GCOUT $ALLREADS

Error correction and digital normalization
- Digital normalization removes high-frequency reads
- Error correction removes low-frequency reads

Estimating repeat content
- Create a de novo repeat library
  - Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
  - Filter contaminants and mitochondrial/chloroplast sequences
  - [ Make non-redundant (e.g. CD-HIT) ]
- Quantify the (high) repeat content with an independent subset of reads
  - Mapping (e.g. bwa), or
  - Mask with RepeatMasker
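To prepare the low-coverage input, it helps to know how many reads a target coverage corresponds to. The sketch below computes that number and takes that many reads from the start of a file. The function names are hypothetical, the genome size has to come from a prior estimate, and the first reads in a file are not a random sample, so a proper random subsampler is preferable for real analyses.

```shell
#!/usr/bin/env bash
# Reads needed = coverage * genome size / read length.
# Arguments: target coverage (e.g. 0.1), genome size in bp, read length in bp.
reads_for_coverage() {
    awk -v c="$1" -v g="$2" -v L="$3" 'BEGIN { printf "%.0f\n", c * g / L }'
}

# Write the first n reads (4 lines each) of a gzip-compressed FASTQ file
# to standard output, uncompressed.
# Arguments: fastq.gz file, number of reads n.
take_first_reads() {
    gzip -dc -- "$1" | head -n "$(( 4 * $2 ))"
}

reads_for_coverage 0.1 1200000000 100    # prints 1200000
```

A second, non-overlapping slice of the data can then serve as the independent read set for quantification.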

Repeat library from low coverage data
- Sparse seq data
- Overlaps?
- Assembled contigs
Warning! Beware of contamination, plastids etc.

Quantify your repeat seqs
- Use an independent set of sparse data
- Screen the reads with the repeat seqs
33% of all bases in the reads are covered by repeat seqs => 33% of the genome is "repeated"
Warning! The quantification depends heavily on the size of the original read set
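If the reads were screened with RepeatMasker, the covered fraction can be read off the masked output. The following is a sketch under stated assumptions: the reads were converted to FASTA first, masked positions appear as `N` (RepeatMasker's default) or as lowercase bases (soft masking with `-xsmall`), and the unmasked input was uppercase with few `N`s of its own, since those would be counted as masked. The function name is illustrative.

```shell
#!/usr/bin/env bash
# Percentage of bases in a masked FASTA file that are N or lowercase.
# Argument: masked FASTA file.
masked_percentage() {
    awk '
        /^>/ { next }                        # skip FASTA header lines
        {
            total += length($0)
            masked += gsub(/[Nnacgt]/, "")   # gsub returns the match count
        }
        END {
            if (total == 0) exit 1           # no sequence found
            printf "%.1f\n", 100 * masked / total
        }' "$1"
}
```

A result of 33.0 would correspond to the 33% example on this slide.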

Classifying repeats
- LTR Gypsy/Copia
- LINE/SINE: getting tricky…
- DNA elements …: getting tricky…
Classifying the repeat library directly
- RepeatMasker
- Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)
Problems
- No close homologs in databases
- Rapid evolution of repeats (like transposable elements)
- Non-autonomous TEs do not encode proteins
Solutions
- Fetch intact ORFs from hits in assembly
- Extend assembly matches and get more complete elements
- Check match alignment profiles in assembly (LINEs conserved at 3' end but not at 5'...)
=> Often slow, manual, species-specific solutions

Take home
- Genome assembly is sometimes reasonably easy, if you are lucky and not too picky, and sometimes hard. There are tools to indicate which one you are up against.
- Filtering data is generally a necessity, but the steps depend highly on the input. Unless you use ALLPATHS-LG, filter your data.
- Genome size and repeat content can be estimated (often better!) without an assembly.

Thanks