Sequencing Data Quality Saulo Aflitos. Read (≈100bp) Contig (≈2Kbp) Scaffold (≈ 2Mbp) Pseudo Molecule (Super Scaffold) Paired-End Mate-Pair LowComplexityRegion.

Presentation transcript:

Sequencing Data Quality Saulo Aflitos

Assembly - Concepts: Read (≈100 bp), Contig (≈2 Kbp), Scaffold (≈2 Mbp), Pseudo Molecule (Super Scaffold); Paired-End, Mate-Pair; Low-Complexity Region

Scaffolding: Scaffold (≈2 Mbp), Pseudo Molecule (Super Scaffold); Paired-End, Mate-Pair; Low-Complexity Region

Assembly

Scaffolding: Repeats?!

Depth of Coverage (Goldberg SMD et al.) [Figure: reads stacked into a contig with its consensus; depth labels between 1x and 3x along the contig; panel label "Reality"]
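Average depth of coverage is commonly estimated as C = N × L / G (number of reads times read length, divided by genome size), while the per-position depth is simply the number of reads overlapping each base. The following is a minimal sketch of both ideas; the function names and the example numbers are hypothetical and do not come from the slides.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Expected average depth of coverage: C = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def per_base_depth(read_starts, read_length, region_length):
    """Count how many reads cover each position of a region.

    read_starts are 0-based start positions; all reads are assumed
    to have the same length (a simplification).
    """
    depth = [0] * region_length
    for start in read_starts:
        first = max(start, 0)
        last = min(start + read_length, region_length)
        for pos in range(first, last):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # Made-up values: 1 million reads of 100 bp on a 5 Mbp genome.
    print(mean_coverage(1_000_000, 100, 5_000_000))
    # Three 4 bp reads tiled over an 8 bp region.
    print(per_base_depth([0, 2, 4], 4, 8))
```

Real data rarely matches the average: repeats and low-complexity regions tend to pile up reads, while other regions can stay thinly covered.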

Heterozygosity [Figure: example sequences AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA and NAAACGTACGTAAAANAAACGTACGTAAAA; allele labels A/C, A, C; 95% ±5, 50% ±10]

Consequences of Data Cleaning: Raw vs. Filtered

Sequencing: Shotgun, RNAseq

Sequencing: Paired End, Mate Pair

Sample Preparation (Genome)
– Shred: Ultrasound, Physical, RE
– Size Selection: Gel, Beads
– Adapter: ID, Binding to Surface, Circularization
– Sequencing: Illumina, 454, PacBio

Shredding

Size Selection

Sequencing - Illumina PE: Read Length 100 bp; Insert Size 150 bp-2 Kbp

Sequencing - 454 MP: Insert Size 2-20 Kbp; Read Length 500 bp, 150 bp

Data

FastQ: Machine Name; Read ID (unique); Encoded Quality (0-40), the chance of a base call being wrong
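The encoded quality is a Phred score Q, related to the probability P that a base call is wrong by Q = -10 · log10(P). The sketch below decodes a quality string and converts scores to error probabilities. It assumes the Phred+33 (Sanger / Illumina 1.8+) ASCII encoding; older Illumina files used an offset of 64, so check the encoding of your own data. Function names are illustrative only.

```python
import math


def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return [ord(char) - offset for char in qual]


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in decode_quality("!5I"):
        print(q, phred_to_error_prob(q))
```

On this scale Q20 corresponds to a 1 in 100 chance of error, Q30 to 1 in 1,000, and Q40 to 1 in 10,000.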

FastQ Format
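A FASTQ record normally spans four lines: a header starting with "@" (the read ID), the sequence, a separator line starting with "+", and a quality string of the same length as the sequence. The reader below is a minimal sketch under that four-line assumption (it does not handle sequences wrapped over several lines); the example record is made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record.
    """
    while True:
        raw = handle.readline()
        if raw == "":
            return  # end of file
        header = raw.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield header[1:], seq, qual


if __name__ == "__main__":
    # Hypothetical record, not taken from the exercise files.
    example = "@machine1:1:1101:1000:2000\nACGTN\n+\nIIII!\n"
    for read_id, seq, qual in read_fastq(io.StringIO(example)):
        print(read_id, seq, qual)
```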

FastQ Statistics (%)

Cleaning

FastQC - Quality Checking Tool
– Per base sequence quality
– Per sequence quality
– Per base sequence content
– Per base GC content
– Per sequence GC content
– Per base N-content
– Sequence length distribution
– Sequence duplication
– Contamination screen (fastq screen)
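Several of the checks listed above are per-position summaries over all reads. To make the idea concrete, here is a minimal sketch that computes mean quality, GC fraction, and N fraction for each read position. It is not FastQC's implementation; it assumes Phred+33 qualities, and the function name and input layout are hypothetical.

```python
def per_base_summary(records, offset=33):
    """Summarize reads position by position.

    records: iterable of (sequence, quality_string) pairs.
    Returns one dict per read position with the mean quality,
    the GC fraction, and the N fraction at that position.
    """
    totals = []  # per position: [reads, quality sum, GC count, N count]
    for seq, qual in records:
        for i, (base, q) in enumerate(zip(seq.upper(), qual)):
            if i == len(totals):
                totals.append([0, 0, 0, 0])
            entry = totals[i]
            entry[0] += 1
            entry[1] += ord(q) - offset
            entry[2] += base in ("G", "C")
            entry[3] += base == "N"
    return [
        {
            "mean_quality": qsum / reads,
            "gc_fraction": gc / reads,
            "n_fraction": n / reads,
        }
        for reads, qsum, gc, n in totals
    ]


if __name__ == "__main__":
    # Two made-up reads for illustration.
    for row in per_base_summary([("GCAN", "IIII"), ("GANN", "II!!")]):
        print(row)
```

A drop in mean quality or a rise in N fraction towards the end of the reads is the kind of pattern that motivates trimming.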

SolexaQA - Cleaning Tool

Exercise
– Create “cleaning” folder: mkdir cleaning; cd cleaning
– Inside it, run: wget -O saulo.bash
– Run it with: bash saulo.bash
– This will download FastQC and SolexaQA
  – FASTQC HELP :
  – FASTQC TUTORIAL:
  – FASTQC MANUAL :
  – SolexaQA Help :
– Run FastQC: ./FastQC/fastqc &
– File > open [Files of Type = FastQ files]

Exercise
– Verify the two .fq files (you can use less):
  – bad_MiSeq_dataset.fq
  – good_MiSeq_dataset.fq
– Clean the bad dataset with SolexaQA’s DynamicTrim.pl script:
  – perl SolexaQA_v.2.1/DynamicTrim.pl bad_MiSeq_dataset.fq -h 25
– Verify the improvement (or not) by opening:
  – bad_MiSeq_dataset.fq.trimmed
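To see what quality trimming with a Phred cutoff of 25 does to a read, the sketch below keeps the longest contiguous stretch of bases whose quality meets the cutoff. This follows the general idea described for DynamicTrim but is not its code; the Phred+33 offset and the use of "greater than or equal to" at the cutoff are assumptions, and the function names are hypothetical.

```python
def longest_good_segment(qualities, cutoff):
    """Return (start, end) of the longest run with quality >= cutoff.

    end is exclusive; (0, 0) means no base passed the cutoff.
    """
    best_start = best_end = 0
    run_start = None
    for i, q in enumerate(qualities):
        if q >= cutoff:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(seq, qual, cutoff=25, offset=33):
    """Trim a read and its quality string to the best segment."""
    scores = [ord(char) - offset for char in qual]
    start, end = longest_good_segment(scores, cutoff)
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # Made-up read: '?' is Q30; '+', '&' and '#' are below Q25.
    print(trim_read("ACGTACGT", "+??&???#"))
```

Reads that end up very short after trimming are usually discarded in a separate length-filtering step.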

?