Transcriptomics II De novo assembly


Sequencing Read Processing
Trimmomatic
- ILLUMINACLIP – removes specified adapter sequences
- SLIDINGWINDOW – cuts the read once the average quality within a window falls below the threshold
- LEADING/TRAILING – removes bases from the start/end of the read while they fall below the quality threshold
- MINLEN – drops reads shorter than the given length
- TOPHRED33/TOPHRED64 – converts quality scores to Phred+33 or Phred+64 encoding

    java -jar Trimmomatic-0.32/trimmomatic-0.32.jar SE/PE -threads 16 -phred33 -trimlog SP**trimlog SP**.fq SP**trimmed.fq ILLUMINACLIP:/TruSeq2-SE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:75
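
To make the quality steps concrete, here is a minimal Python sketch of what LEADING, TRAILING, SLIDINGWINDOW and MINLEN do to a single read, applied in the order given on the command line. It is a simplified illustration, not Trimmomatic's actual implementation; the function name `trim_read` and its parameter names are hypothetical, and adapter clipping is left out.

```python
def trim_read(quals, leading=20, trailing=20, window=4, min_avg=20, minlen=75):
    """Return (start, end) of the kept slice of a read, or None if the read is dropped.

    quals: list of integer Phred scores, one per base.
    Simplified sketch; Trimmomatic's real logic differs in details.
    """
    start, end = 0, len(quals)

    # LEADING: remove bases from the start while they are below the threshold.
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: remove bases from the end while they are below the threshold.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: scan from the start and cut the read at the first window
    # whose mean quality falls below min_avg.
    for i in range(start, max(start, end - window + 1)):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break

    # MINLEN: drop the read if what is left is too short.
    if end - start < minlen:
        return None
    return start, end


if __name__ == "__main__":
    read = [10, 10] + [30] * 80 + [5] * 10
    print(trim_read(read))
```

With the thresholds from the command above (20, 20, 4:20, 75), a read keeps only its high-quality core, and is discarded entirely if that core is shorter than 75 bases.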

De novo assembly
De novo assembly using Trinity software
- Memory and time intensive
- Roughly 1 GB RAM per 1M sequences
- Roughly 1 hour per 1M sequences (more processors help)
- Consider in silico normalization

    perl trinityrnaseq-2.0.3/Trinity --max_memory 240G --CPU 20 --left All_1_trimmed.fq --right All_2_trimmed.fq --SS_lib_type FR --seqType fq --normalize_reads --min_contig_length 200 --full_cleanup --output Trinity_Pb_Normalized &> Trinity_Pb_Normalized.log
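
The rule of thumb on this slide can be turned into a quick back-of-the-envelope estimate. The sketch below only does that arithmetic; the function name `estimate_trinity_resources` is hypothetical, the two rates are the slide's own rough figures, and real requirements depend on the data and on how many processors are used.

```python
def estimate_trinity_resources(n_reads, gb_per_million=1.0, hours_per_million=1.0):
    """Rough (RAM in GB, wall time in hours) estimate from the slide's rule of thumb.

    n_reads: number of input sequences.
    The default rates are the slide's approximate figures, not measured values.
    """
    millions = n_reads / 1_000_000
    return millions * gb_per_million, millions * hours_per_million


if __name__ == "__main__":
    ram_gb, hours = estimate_trinity_resources(240_000_000)
    print(f"~{ram_gb:.0f} GB RAM, ~{hours:.0f} hours")
```

Under this rule, an estimate that exceeds the available memory is a signal to consider in silico normalization before assembling.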

Why Trinity?
- Paired reads
- Isoform differentiation
- Full-length transcripts

Assembly Results
- Total Trinity 'genes': 421044
- Total Trinity transcripts: 537064
- Percent GC: 44.20
- Contig N50: 2160
- Median contig length: 444
- Average contig length: 1011.94
Too many assembled transcripts: use CD-HIT/RSEM to prune.
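
For readers unfamiliar with these summary statistics, the following sketch shows how N50, median, mean and percent GC can be computed from a set of contig sequences. It is an illustration under the usual definition of N50 (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly), not the script that produced the numbers above; the function names are hypothetical.

```python
from statistics import mean, median


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def gc_percent(seqs):
    """Percentage of G and C bases over all sequences."""
    total = sum(len(s) for s in seqs)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in seqs)
    return 100.0 * gc / total


def assembly_stats(seqs):
    """Summary statistics for a list of contig sequences."""
    lengths = [len(s) for s in seqs]
    return {
        "n50": n50(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "gc_percent": gc_percent(seqs),
    }


if __name__ == "__main__":
    print(assembly_stats(["GGCC", "AATT", "ACGTACGTAC"]))
```

A median (444) far below the N50 (2160), as reported above, indicates many short contigs alongside a smaller number of long ones.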

CD-HIT-EST and RSEM
RSEM – RNA-Seq by Expectation Maximization
- Estimates gene- and transcript-level abundance
- Prune transcripts below 0.001 FPKM (1 fragment per kilobase per billion mapped fragments)

    perl trinityrnaseq_r20140413/util/filter_fasta_by_rsem_values.pl --rsem_output RSEM.isoforms.results --fasta Trinity.fasta --output FPKM_0.001.fasta --fpkm_cutoff 0.001

- Result: total Trinity genes = 393972, total Trinity transcripts = 480964
CD-HIT
- Clusters sequences by similarity and keeps one representative per cluster
- Run here at 100% identity (-c 1.0)

    cd-hit-v4.6.1-2012-08-27/cd-hit-est -i TB_Manuscript2\ FINAL\ DATA/Trinity.fasta -c 1.0 -n 8 -o cd-hit-v4.6.1-2012-08-27/Trinity_CDHIT100 -T 20 -M 100000

- Result: total Trinity genes = 246333, total Trinity transcripts = 314638
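
The FPKM filtering step can be sketched in a few lines of Python. This is an illustration of the idea, not the Trinity utility script itself: it assumes the tab-separated RSEM isoform table has `transcript_id` and `FPKM` columns, that FASTA identifiers are the first word of each header line, and that transcripts at or above the cutoff are kept. The function names are hypothetical.

```python
import csv
import io


def transcripts_passing_fpkm(rsem_text, cutoff=0.001):
    """Return the set of transcript IDs whose FPKM is at or above the cutoff."""
    reader = csv.DictReader(io.StringIO(rsem_text), delimiter="\t")
    return {row["transcript_id"] for row in reader if float(row["FPKM"]) >= cutoff}


def filter_fasta(fasta_text, keep_ids):
    """Keep only FASTA records whose identifier is in keep_ids."""
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines)


if __name__ == "__main__":
    rsem = (
        "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
        "t1\tg1\t500\t400\t10\t5.0\t3.2\t100\n"
        "t2\tg2\t300\t200\t0\t0\t0.0\t100\n"
    )
    fasta = ">t1 len=500\nACGT\nAC\n>t2 len=300\nGGG\n"
    print(filter_fasta(fasta, transcripts_passing_fpkm(rsem)))
```

Because the cutoff is so low, this step mainly removes transcripts with essentially no read support rather than lowly expressed ones.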

Annotation
BLAST – Sequence homology
- Use a computing cluster; run as an array job if parallel implementations are not available
B2G4PIPE/BLAST2GO – GO terms
- Command line version/graphical version
KEGG – Pathway analysis
- Available as stand-alone or within BLAST2GO

Time to relax... not quite.

Map - Bowtie vs. Sailfish
- Bowtie – little upfront investment, but millions/billions of reads must be aligned
- Sailfish – large upfront investment in building a k-mer index, but no need to align billions of reads
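
The trade-off described here can be illustrated with a toy k-mer index: the transcripts are indexed once up front, and each read is then handled with plain dictionary lookups instead of alignment. This is only a sketch of the general alignment-free idea, not Sailfish's actual algorithm (which estimates abundances statistically from k-mer counts); the function names and the choice to count only k-mers unique to one transcript are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_kmer_index(transcripts, k):
    """Upfront step: map every k-mer to the set of transcripts containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def count_unique_kmer_hits(reads, index, k):
    """Per-read step: count k-mers that occur in exactly one transcript."""
    hits = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            owners = index.get(read[i:i + k])
            if owners and len(owners) == 1:
                hits[next(iter(owners))] += 1
    return hits


if __name__ == "__main__":
    transcripts = {"tA": "ACGTAC", "tB": "TTGCAA"}
    index = build_kmer_index(transcripts, k=4)
    print(count_unique_kmer_hits(["ACGTA", "GCAA", "CCCC"], index, k=4))
```

The index is built once per transcriptome and reused for every sample, which is where the upfront cost pays off.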

Quantify
- Run RSEM on each individual sample
- Use the Trinity pipeline to combine samples into a single expression table
  - Gene level
  - Transcript level
- Use edgeR (Empirical analysis of digital gene expression data in R) within the Trinity pipeline
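
Combining per-sample results into one expression table amounts to a join on transcript (or gene) ID. The sketch below shows that step in plain Python, assuming each sample has already been reduced to a mapping from transcript ID to expected count; it is not the Trinity pipeline's own utility, and the function names and the transcript-to-gene mapping are hypothetical.

```python
def build_count_matrix(per_sample):
    """Combine {sample: {feature: count}} into a header row and data rows.

    Features missing from a sample are filled with 0.0.
    """
    samples = sorted(per_sample)
    features = sorted({f for counts in per_sample.values() for f in counts})
    header = [""] + samples
    rows = [[f] + [per_sample[s].get(f, 0.0) for s in samples] for f in features]
    return header, rows


def to_gene_level(transcript_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts."""
    gene_counts = {}
    for tx, count in transcript_counts.items():
        gene = tx2gene[tx]
        gene_counts[gene] = gene_counts.get(gene, 0.0) + count
    return gene_counts


if __name__ == "__main__":
    per_sample = {"S1": {"t1": 5.0, "t2": 1.0}, "S2": {"t1": 3.0}}
    header, rows = build_count_matrix(per_sample)
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(v) for v in row))
```

The same function serves both levels: pass transcript-level counts directly, or run each sample through `to_gene_level` first to obtain the gene-level table.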

Trinity DGE Pipeline/Post Analysis
Data generated
- Sample-to-sample DGE comparisons
- Consensus DGE comparison
Visualizations possible
- Volcano plots
- Heatmaps
- Cluster dendrograms
Utilize data
- Use clusters and up- and down-regulated subgroups to identify genes, GO terms and pathways that experience changes in regulation.
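
Splitting results into up- and down-regulated subgroups is the same thresholding that a volcano plot visualizes. The sketch below assumes each gene comes with a log2 fold change and an FDR-adjusted p-value, as differential expression result tables typically provide; the cutoffs (absolute log2 fold change of at least 1, FDR of at most 0.05) and the function name are hypothetical choices for illustration, not values from the slides.

```python
def classify_de(results, min_abs_log2fc=1.0, max_fdr=0.05):
    """Split genes into up- and down-regulated lists.

    results: {gene: (log2_fold_change, fdr)}
    Thresholds are illustrative defaults, not values from the source.
    """
    up, down = [], []
    for gene, (log2fc, fdr) in results.items():
        if fdr > max_fdr or abs(log2fc) < min_abs_log2fc:
            continue  # not significant or effect too small
        if log2fc > 0:
            up.append(gene)
        else:
            down.append(gene)
    return up, down


if __name__ == "__main__":
    results = {
        "geneA": (2.5, 0.001),
        "geneB": (-1.8, 0.01),
        "geneC": (0.3, 0.0001),
        "geneD": (3.0, 0.2),
    }
    up, down = classify_de(results)
    print("up:", up)
    print("down:", down)
```

The resulting gene lists are what would then be checked against the GO term and pathway annotations.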