RNA-Seq Green Line Overview

RNA-Seq data analysis via the Green Line of DNA Subway
Data analysis goals
Use a comprehensive RNA-seq data analysis pipeline
Runs in a 'reasonable' amount of time with modest computer resources
Self-contained, self-explanatory, adaptable

Running the Green Line: Raw RNA-seq data to differential gene expression
FastQC: Quality check raw reads
FastX toolkit: Trim/Filter
TopHat: Splice align to reference genome
CuffLinks: Transcript assembly and quantification
CuffDiff: Transcript differential abundance
cummeRbund: Visualize results

RNA-Seq data analysis Part 1
Overview of the RNA-Seq data analysis pipeline
Data quality control - QC of raw sequencing data run with fastQC
  Raw sequencing quality metrics
  fastA and fastQ file formats
  fastQC metrics and output files
    Basic statistics
    Per base sequence quality
    Per sequence quality scores
FastX toolkit
  Importance of running FastX tools on raw data before aligning
  Quality Trimmer - basic and advanced parameters
  Quality Filter - basic and advanced parameters
  Output files: compare to those run on the raw data
File formats used by TopHat
  Reference genome file (FastA)
  Gene annotation file (GTF/GFF)

RNA-Seq data analysis Molly Hammell - Recorded video Q&A June 10 2p-3p
Part 2
RNA-Seq data analysis using the Tuxedo protocol
a) TopHat
b) CuffLinks
c) Cuffmerge
d) CuffDiff
Determining differential gene expression

Stop 1: Manage data
Create a Project
Upload data from Data Store
Examine quality of raw sequencing reads
Diagram labels: Illumina sequencer; Your Data (.fastQ); iPlant Data Store

Why look at the quality of raw sequencing reads?
Bad quality data could result in erroneous data analyses
The first step of analyzing data is aligning to a reference genome
If the data are of poor quality, reads will be difficult to align and/or can align improperly
How can bad quality data be generated?
Sequencing library prep and technology: both can lead to issues during data acquisition (to be covered in detail by Illumina rep)
PCR amplification of library: some sequences "over-amplified"
Sequencing adapters: read-through
Density of the cluster formation on the flow cell: too dense; signals overlap
Incorporation of the fluorescently labeled nucleotides becomes asynchronous
Imaging of the clusters via microscope: too close to an edge
To gauge the quality, each cycle of dNTP incorporation is given a quality score = Phred quality score
Duplicates: same exact start and end position
Duplicates may correspond to biased PCR amplification of particular fragments
But for highly expressed, short genes, duplicates are expected even if there is no amplification bias; removing them may reduce the dynamic range of expression estimates
Assess library complexity and decide… If you do remove them, assess duplicates at the level of paired-end reads (fragments), not single-end reads

Phred quality scores
Phred is a base-calling program for automated sequencer traces
Assigns a base together with the probability of that base call being correct
Quality score   Probability of incorrect base call   Base call accuracy
10              1 in 10                              90%
20              1 in 100                             99%
30              1 in 1000                            99.9%
40              1 in 10,000                          99.99%
Ewing and Green: Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 1998; 8:186–194.
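The table above follows the standard Phred relationship Q = -10 * log10(P), where P is the probability of an incorrect base call. A minimal sketch of that conversion (the function names are illustrative, not part of any Green Line tool):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Reproduce the rows of the table for Q10 to Q40.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,} incorrect, accuracy {100 * (1 - p):.2f}%")
```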

NGS File formats
FastA – text-based format for any alphabet (nucleotides or amino acids)
Top line (header) is always: >Sequence_ID (space) other info… (hard line break)
Followed by: Sequence
>Seq1D [organism=Carpodacus mexicanus] C.mexicanus clone 6b actin (act) mRNA, partial cds
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCCCCTCAGCCTCCTCATCCGTGCAGAACTTGGACAACCTGGAACTCTTCTAGGAGACGACCAAATTTACAATGTAATCGTCACTGCCCACGCCTTCGTAATAATTTTCTTTATAGTAATACCAATCATGATCGGTGGTTTCGGAAACTGACTAGTCCCACTCATAAT
FastQ = FastA with Phred quality scores
The FastQ file is generated by the Illumina machine; the base-calling software version varies with the type of machine and with software updates
Note: Encoding option when running FastQC
Each read has 4 lines:
Header: Read ID @SEQUENCING_MACHINE_ID#######_ReadPair#_Barcode
Sequence of the read: A T G C
Quality score identifier: +
Quality score of each nucleotide: ! " * ( ) % (one ASCII character per base; scores range from 0 to 40)
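To make the four-line FastQ layout concrete, here is a small sketch that reads records and decodes the quality line into Phred scores. It assumes the Sanger / Illumina 1.9 encoding (ASCII code minus 33); the read name and quality string in the example are made up for illustration.

```python
def read_fastq(lines, offset=33):
    """Yield (read_id, sequence, phred_scores) from FASTQ text lines.

    Assumes 4 lines per read and Phred+33 quality encoding.
    Loads all lines into memory, so it is for illustration only.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(rows), 4):
        header, seq, sep, qual = rows[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lines differ in length")
        yield header[1:], seq, [ord(ch) - offset for ch in qual]

# Hypothetical single-read example.
example = ["@read1\n", "GATTACA\n", "+\n", "II5!+#I\n"]
for read_id, seq, scores in read_fastq(example):
    print(read_id, seq, scores)
```

Older Illumina pipelines used an offset of 64 instead of 33, which is why FastQC reports the encoding it detected.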

In every sequencing run, millions of sequence reads are produced
The reads are stored in fastq files
For single-end: one fastq file
For paired-end: two fastq files, designated "_R1.fastq" and "_R2.fastq"

Assess the data quality and clean it up
Evaluate the quality of the raw sequencing data with FastQC
Quality trim and quality filter with FastX toolkit (Stop 2 of Green Line)

Evaluate the quality of the raw sequencing data with FastQC
View the QC reports

FastQC reports on raw sequencing data Summary
10 parameters are checked
Green check means passed
Red X means failed
Yellow exclamation means warning
Not all are important; must evaluate relative to the context of what you expect from your library
"Normal" is random and diverse
Most important:
Basic stats
Per base sequence quality
Per sequence quality scores
Per base N content
Sequence Length Distribution
Overrepresented sequences
The names of the modules are preceded by an icon that reflects the quality of the data. The icon indicates whether the results of the module seem entirely normal (green tick), slightly abnormal (orange triangle) or very unusual (red cross). However, these evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should treat the icons as pointers to where you should concentrate your attention, and understand why your library may not look normal.

FastQC reports on raw sequencing data: Basic statistics
31 million reads in this one fastq file
Filename: The original filename of the fastq file analyzed.
File type: Conventional base calls (Illumina) vs. color space data (ABI's SOLiD)
Encoding: the ASCII encoding of the quality values found in the file, which depends on the specific sequencing machine. For Illumina HiSeq and NextSeq the encoding is Illumina 1.9 (same as Sanger).
Total Sequences: total number of sequences in the .fastq file
Sequence Length: Reports the length of the shortest and longest sequence in the set. For Illumina, before any trimming/filtering, always one length depending on the # of cycles run.
%GC: The overall %GC of all bases in all sequences; will vary with organism
Basic Statistics never raises a warning.

FastQC reports on raw sequencing data: Per base sequence quality
Range of quality scores of each nucleotide, at each nucleotide position
Green: very good quality >Q28
Orange: reasonable quality Q20-Q28
Red: poor quality <Q20
Plot axes: quality score vs. position in read; the example shown is (very) good data
For each nucleotide position, a box-and-whisker plot is drawn
The central red line is the median value
The yellow box represents the inter-quartile range (25-75%)
The lower and upper whiskers represent the 10% and 90% points
The blue line represents the mean quality
For base #1, for all the clusters/reads combined, the Phred scores were between 30 and 34. After base #3, the quality dropped off some.
A tall yellow box appears if there is a wide range of quality scores at a specific nucleotide position; it shows the range from 25% to 75%. If this box crosses below Q20, the sequence at that nucleotide position is of low quality for too many reads (>25%). However, data with <Q20 may still be useful: using the FastX toolkit, the next stop on the Green Line, one can "trim" off any poor-quality nucleotides from the 3' end (where the yellow box drops below Q20), and nucleotides subsequent to those.
N.B. The quality of base calls often degrades as the run progresses, so it is common to see the quality falling towards the end of a read.
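The box-and-whisker values described above can be sketched per read position as follows. This is a simplified illustration, not FastQC's own code: the percentile helper uses a simple nearest-rank rule, and the input is assumed to be a list of per-read Phred score lists.

```python
from statistics import mean, median

def percentile(sorted_vals, frac):
    """Nearest-rank style percentile of an already sorted, non-empty list."""
    idx = min(len(sorted_vals) - 1, int(frac * len(sorted_vals)))
    return sorted_vals[idx]

def zone(q):
    """Colour zone used on the slide: green >Q28, orange Q20-Q28, red <Q20."""
    if q > 28:
        return "green"
    if q >= 20:
        return "orange"
    return "red"

def per_base_summary(reads_scores):
    """Summarise Phred scores at each read position across all reads."""
    n_pos = max(len(s) for s in reads_scores)
    summary = []
    for pos in range(n_pos):
        col = sorted(s[pos] for s in reads_scores if len(s) > pos)
        q25 = percentile(col, 0.25)
        summary.append({
            "position": pos + 1,
            "mean": mean(col),
            "median": median(col),
            "q25": q25,
            "q75": percentile(col, 0.75),
            "p10": percentile(col, 0.10),
            "p90": percentile(col, 0.90),
            "zone": zone(median(col)),
            # True when the inter-quartile box dips below Q20.
            "box_below_q20": q25 < 20,
        })
    return summary

# Hypothetical scores for four 3-base reads.
reads = [[40, 30, 10], [38, 30, 12], [36, 28, 30], [40, 32, 8]]
for row in per_base_summary(reads):
    print(row["position"], row["median"], row["zone"], row["box_below_q20"])
```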

FastQC reports on raw sequencing data: Per base sequence quality
Good
Consistent quality across the sequence
Mean quality above Q30
Not so good
Need to trim the poor-quality bases

FastQC reports on raw sequencing data: Per sequence quality scores
Overall quality of each of the sequencing reads
Asking: Is there a subset of sequence reads with universally low quality values?
Plot axes: mean sequence quality score vs. # sequencing reads
Good: majority of reads >Q34 (plot annotations: 10 x 10^6 at Q35; 2 x 10^6 at Q34)
Here, >10,000,000 sequences (reads) had an average quality per read of 35 and 2,000,000 had average quality of Q32
Not so good: 10,000 out of 60,000 reads (about 16.7%) are below Q20
This scores as failure if the most frequently observed mean quality is below Q20; this equates to a 1% error rate.
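The failure rule above (most frequently observed mean quality below Q20) can be sketched like this. It is an illustration under simple assumptions: per-read means are rounded to whole Phred scores before counting, and the function name is hypothetical.

```python
from collections import Counter

def per_sequence_quality(reads_scores, fail_below=20):
    """Histogram of per-read mean quality plus the fail rule from the slide."""
    means = [sum(s) / len(s) for s in reads_scores]
    histogram = Counter(round(m) for m in means)
    most_common_mean = histogram.most_common(1)[0][0]
    fraction_low = sum(m < fail_below for m in means) / len(means)
    return {
        "histogram": dict(histogram),
        "most_common_mean": most_common_mean,
        "fraction_below_cutoff": fraction_low,
        "status": "fail" if most_common_mean < fail_below else "pass",
    }

# Hypothetical data: three good reads and one poor read.
print(per_sequence_quality([[40, 40], [40, 40], [40, 40], [10, 10]]))
```

Note that a run can pass this rule and still contain a sizeable subset of poor reads, which is why the fraction below the cut-off is reported separately.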

FastQC reports on raw sequencing data: Per base N content
Percentage of base calls at each position for which an N was called
Ideal: want every base to be "called" unambiguously A, G, C, or T
Reality: some bases cannot be called and are assigned "N"
Plot axes: position in read (bp) vs. % base call; examples shown: great, and not so great
Fail: if any position shows an N content of >20%.
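A minimal sketch of the per-position N percentage and the >20% fail rule stated above. The function names and example reads are illustrative only.

```python
def per_base_n_percent(sequences):
    """Percentage of reads with an N at each position (1st position first)."""
    n_pos = max(len(s) for s in sequences)
    result = []
    for pos in range(n_pos):
        bases = [s[pos] for s in sequences if len(s) > pos]
        result.append(100.0 * sum(b in ("N", "n") for b in bases) / len(bases))
    return result

def n_content_fails(sequences, fail_above=20.0):
    """True if any position exceeds the fail threshold for N content."""
    return any(p > fail_above for p in per_base_n_percent(sequences))

# Hypothetical reads: position 2 is N in half of them.
reads = ["ACGT", "ANGT", "ANGT", "ACGN"]
print(per_base_n_percent(reads), n_content_fails(reads))
```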

FastQC reports on raw sequencing data: Sequence length distribution
Will fail if any of the sequences have zero length
Good (expected)
Illumina: a fixed number of cycles (nucleotide additions), so expect only one length

FastQC reports on raw sequencing data: Overrepresented sequences
Due to random fragmentation and poly A+ selection, your sequencing libraries should contain a highly diverse set of sequences (if stranded mRNA).
Sequences that make up >0.1% of the total are called overrepresented.
These could be due to real, significant biological signal, or due to commonly found contaminants, such as rRNA or Illumina primer (adapter) sequences
If there are >10.0% Illumina primer (adapter) sequences in the file, it is recommended to try to clip these sequences using the FastX toolkit. If these are not clipped, aligning may be slow and/or erroneous.
"No Hits" are usually Illumina primers or barcodes.
Bowtie and BWA do not identify rRNAs. If one needs to do this, create a separate filtering set of rRNA sequences for your species (you might also include mitochondrial DNA in this filtering data set), then align all reads to this filter set. The important statistic is the percentage of reads in each sample which align to the rRNA filter. The total number of reads in the filtered data set (not the original FASTQ file) should be used for per-sample normalization by RPKM or other methods.
Duplication detection requires an exact sequence match over the whole length of the sequence; any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Due to random RNA fragmentation, there should not be two fragments with exactly the same start and stop ends.
Warning: the module will issue a warning if any sequence is found to represent more than 0.1% of the total
Fail: the module will issue an error if any sequence is found to represent more than 1% of the total
(Only sequences seen in the first 100,000 reads are tracked, to conserve memory)
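The overrepresentation check can be sketched as exact-match counting with the 75bp/50bp truncation mentioned above. This is a simplification: unlike FastQC it counts every read rather than tracking only sequences seen in the first 100,000, and the function name is illustrative.

```python
from collections import Counter

def overrepresented(sequences, threshold_pct=0.1, truncate_over=75, truncate_to=50):
    """Return {sequence: percent of total} for sequences above threshold_pct.

    Reads longer than truncate_over are cut to truncate_to bases first.
    """
    counts = Counter(
        s[:truncate_to] if len(s) > truncate_over else s for s in sequences
    )
    total = len(sequences)
    return {
        seq: 100.0 * n / total
        for seq, n in counts.items()
        if 100.0 * n / total > threshold_pct
    }
```

A sequence flagged here still needs interpretation: it may be an adapter or rRNA contaminant, or a genuinely abundant transcript.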

Stop 2: FastX Toolkit

Stop 2: Cleaning up the data with FastX toolkit: Quality trimmer
Scans each read from the 3' (right) end for the first nucleotide that meets the specified minimum quality score
Green Line Basic default is Q20
Trims all nucleotides after this position
After trimming, sequences that are shorter than the specified minimum length are discarded
Green Line Basic default is 20 bases
Input FASTQ file (only 3 reads shown):
@1 TATGGTCATGGCATGTAAAC
@2 CAGCGAGGCTTTAATGCCAT
@3 CAGCCGAGGCTTTAATCGCG
Trimming with a cut-off of quality 20, we get the following FASTQ file:
@1 TATGGTCATGGC (12 bases left)
@2 CAGCGAGGCTTT (12 bases left)
@3 CAGCCGAGG (9 bases left)
Note: this example uses a minimum length of 12, although the Green Line default is 20
Trimming with a cut-off of 20 and a minimum length of 12, we get the following FASTQ file:
@1 TATGGTCATGGC (12 bases)
@2 CAGCGAGGCTTT (12 bases)
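The trimming rule can be sketched as below. This is an illustration of the behaviour described on the slide, not the FastX toolkit's own code; the quality scores in the example are invented, since the slide shows only the sequences.

```python
def quality_trim(seq, scores, min_quality=20, min_length=20):
    """Trim low-quality bases from the 3' end of a read.

    Returns (trimmed_seq, trimmed_scores), or None if the trimmed read
    is shorter than min_length and should be discarded.
    """
    keep = len(seq)
    # Walk back from the 3' end until a base meets the quality cut-off.
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    if keep < min_length:
        return None
    return seq[:keep], scores[:keep]

# Read @1 from the slide, with hypothetical scores: 12 good bases, 8 poor ones.
seq = "TATGGTCATGGCATGTAAAC"
scores = [30] * 12 + [10] * 8
print(quality_trim(seq, scores, min_quality=20, min_length=12))
print(quality_trim(seq, scores, min_quality=20, min_length=20))
```

With the Green Line default minimum length of 20, the same read would be discarded, which is the trade-off the slide points out.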

Cleaning up the data with FastX toolkit: Quality filter
Filter based on quality scores
100 percent = requires all cycles (nucleotide incorporations) of a read to be at least the user-specified quality cut-off value
Green Line Basic default cut-off is Q20
50 percent = requires the median quality of the cycles (in each read) to be at least the user-specified quality cut-off value
Green Line Basic default is 50%
Example: @1 GACAATAAAC (10 bases)
If quality filter set at 100% and cut-off of Q20: this read will be discarded; not all the cycles have quality >= 20
Using 50% and cut-off of Q20: this read will not be discarded; the median quality is higher than 20
But since trimming is performed before filtering, this read would already have been removed, as it does not meet the minimum length of 12
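A sketch of the filter rule: a read is kept when at least the given percentage of its bases meet the quality cut-off. The function name and the example scores are illustrative; the slide does not show the quality line for read @1.

```python
def passes_quality_filter(scores, min_quality=20, min_percent=50):
    """True if at least min_percent of bases have quality >= min_quality."""
    good = sum(q >= min_quality for q in scores)
    return 100.0 * good / len(scores) >= min_percent

# Hypothetical scores for a 10-base read: 7 good bases, 3 poor ones.
scores = [30] * 7 + [10] * 3
print(passes_quality_filter(scores, min_quality=20, min_percent=100))
print(passes_quality_filter(scores, min_quality=20, min_percent=50))
```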

In the Green Line, paired-end reads are "re-synchronized" after trimming/filtering, before aligning with TopHat
Figure: paired-end read 1 and paired-end read 2 files (reads 1 2 3 4 5); read #4 has bad quality and is filtered out; the files are then re-synchronized
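One way to picture re-synchronization is to keep only the reads whose mate survived in the other file. The sketch below assumes both mates share the same read ID and that each file is a list of (read_id, record) tuples; how the Green Line implements this step is not described in the slides.

```python
def resynchronize(reads_1, reads_2):
    """Keep only pairs present in both files, in the order of reads_1.

    reads_1 and reads_2 are lists of (read_id, record) tuples.
    """
    by_id_2 = {read_id: record for read_id, record in reads_2}
    pairs = []
    for read_id, record in reads_1:
        mate = by_id_2.get(read_id)
        if mate is not None:
            pairs.append((read_id, record, mate))
    return pairs

# Read 4 was filtered out of the second file, so the pair is dropped.
r1 = [(i, f"R1_read{i}") for i in (1, 2, 3, 4, 5)]
r2 = [(i, f"R2_read{i}") for i in (1, 2, 3, 5)]
print([read_id for read_id, _, _ in resynchronize(r1, r2)])
```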

Basic default settings in the Green Line
FastX Toolkit basic default parameters:
FastQ Quality Trimmer
Quality Threshold: 20
Minimum Length: 20
FastQ Quality Filter
Minimum Quality: 20
Minimum Percent: 50
If these settings are too stringent (based on looking at the FastQC report) and you want to be able to analyze more data, even if of lesser quality, reduce the stringency.
If you don't get many mapped reads, you may want to reduce the stringency, OR if the quality is really bad towards the end of the reads, you may want to trim off more bases and drop the minimum trimmed length down to 19
Run in Advanced mode to change these parameters
Why/What would you change?

FastQC reports after FastX toolkit Basic Statistics
Trimmed some reads down to minimum length of 20
Filtered out ~25,000 sequences (~0.8%)

GTF/GFF file formats
GFF: General Feature Format, Gene-Finding Format, Generic Feature Format
Describes features of DNA, RNA or protein sequences
seqname - name of the chromosome or scaffold
source - name of the program that generated this feature
feature - feature type name, e.g. Gene, Variation, Similarity
start - start position of the feature, with sequence numbering starting at 1
end - end position of the feature, with sequence numbering starting at 1
score - a floating point value
strand - defined as + (forward) or - (reverse)
frame - one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, etc.
attribute - a semicolon-separated list of tag-value pairs, providing additional information about each feature
GTF: Gene Transfer Format is a refinement of GFF; adds transcript information
gene_id value - a globally unique identifier for the genomic source of the transcript
transcript_id value - a globally unique identifier for the predicted transcript
Transcript features such as chromosome positions for exons, CDS, start and stop codons
The GTF file links the gene with all the transcripts for that gene.
What is Twinscan? A gene-structure prediction program.
AB Twinscan CDS 700 707 . + 2 gene_id "AB "; transcript_id "AB ";
AB Twinscan exon 900 1000 . + . gene_id "AB "; transcript_id "AB ";
AB Twinscan start_codon 380 382 . + 0 gene_id "AB "; transcript_id "AB ";
AB Twinscan stop_codon 708 710 . + 0 gene_id "AB "; transcript_id "AB ";
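The nine columns listed above can be read with a few lines of code. This sketch assumes tab-separated fields and simple attributes of the form tag "value"; the chromosome name and IDs in the example line are hypothetical, and attribute values containing semicolons are not handled.

```python
def parse_gtf_line(line):
    """Split one GTF line into its 9 fields and an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line must have 9 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attrs,
    }

# Hypothetical example line modelled on the Twinscan records above.
line = 'chr1\tTwinscan\tCDS\t700\t707\t.\t+\t2\tgene_id "gene1"; transcript_id "gene1.1";'
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"], record["attributes"])
```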

Extra information

Fastq format
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Lines 2 and 4 MUST have the same number of characters
The first line contains Illumina instrument information
@EAS139:136:FC706VJ:2:2104:15343:197393
EAS139: Instrument ID
136: Run ID
FC706VJ: Flowcell ID
2: Flowcell lane
2104: Tile number within that flowcell lane
15343 and 197393: X and Y coordinates of the cluster within the tile
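The colon-separated Illumina header shown above can be split into named fields as follows. The field names are descriptive labels chosen for this sketch, and it assumes the seven-field layout of the example header.

```python
def parse_illumina_header(header):
    """Split an Illumina read header into its seven colon-separated fields."""
    fields = header.lstrip("@").split()[0].split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields")
    instrument, run_id, flowcell, lane, tile, x, y = fields
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell_id": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }

print(parse_illumina_header("@EAS139:136:FC706VJ:2:2104:15343:197393"))
```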

Duplication level (slide 1 of 2)
This module will issue a warning if non-unique sequences make up more than 20% of the total.
During sequencing library preparation, there is a PCR step, and PCR bias could result in duplication of the same PCR fragment.
This matters especially for differential gene expression, since the mapped reads are "counted": we don't want to "over-count" reads that are PCR artefacts.
However, for some very highly expressed genes, there could be identical reads that are real. See next slide for explanation
Adapted from :
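As a rough illustration of the 20% warning rule, the sketch below counts the share of reads whose exact sequence occurs more than once. This is one simple reading of "non-unique sequences"; FastQC's own duplication estimate is more involved, and the function names are hypothetical.

```python
from collections import Counter

def percent_non_unique(sequences):
    """Percentage of reads whose exact sequence occurs more than once."""
    counts = Counter(sequences)
    repeated = sum(n for n in counts.values() if n > 1)
    return 100.0 * repeated / len(sequences)

def duplication_warning(sequences, warn_above=20.0):
    """True if non-unique reads exceed the warning threshold."""
    return percent_non_unique(sequences) > warn_above

# Hypothetical reads: 2 of 5 share the same sequence.
reads = ["ACGT", "ACGT", "TTTT", "GGGG", "CCCC"]
print(percent_non_unique(reads), duplication_warning(reads))
```

Sequence counts alone cannot tell PCR artefacts from genuinely abundant transcripts, so a warning here is a prompt to look closer rather than a verdict.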

Duplication level (slide 2 of 2)
From Common reasons for warnings
The underlying assumption of this module is of a diverse unenriched library. Any deviation from this assumption will naturally generate duplicates and can lead to warnings or errors from this module. In general there are two potential types of duplicate in a library: technical duplicates arising from PCR artefacts, or biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected. At the sequence level there is no way to distinguish between these two types, and both will be reported as duplicates here. A warning or error in this module is simply a statement that you have exhausted the diversity in at least part of your library and are re-sequencing the same sequences. In a supposedly diverse library this would suggest that the diversity has been partially or completely exhausted and that you are therefore wasting sequencing capacity. However, in some library types you will naturally tend to over-sequence parts of the library and therefore generate duplication, and will therefore expect to see warnings or errors from this module. In RNA-Seq libraries, sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence highly expressed transcripts, and this will potentially create a large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries, although the duplication there is less pronounced.
Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites, for example, or an unfragmented small RNA library), then the constrained start sites will generate huge duplication levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates.
