Published by Maude Fleming. Modified over 9 years ago.
1
Quality Control and Assessment of RNA-Seq Data
Carrie Ganote, Ram Podicheti, Le-Shin Wu, Tom Doak
National Center for Genome Analysis Support: http://ncgas.org
2
What do the data look like?
@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
GDGACGFFGF7EDDAECBEDFEGFGECGEDGFGE:=BDD@FD59B67>:=9>:8>>;;<;=CD@9+=???######
FASTQ is a common format for storing next-generation sequencing data.
Text based
Stores both the sequence and the quality information
Originally developed at the Wellcome Trust Sanger Institute and later adopted by Solexa (Bennett, 2004)
The information for each read comprises 4 lines
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433
3
FASTQ Format
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
Sequence Identifier
Begins with a @ symbol
Comprises: instrument name, flowcell, lane, tile, X and Y coordinates of the cluster on the tile, member of a pair (1 or 2), index
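As a rough illustration of the identifier layout, the sketch below splits an Illumina-style header into named fields. The function name and dictionary keys are made up for this example. The slide lists only some of the fields; the labels for the run number (135 here), the filter flag (N), and the control number (0) follow the usual Illumina convention for headers of this shape and are an assumption about this particular file.

```python
def parse_illumina_header(header):
    """Split an Illumina-style FASTQ identifier line into named fields."""
    if not header.startswith("@"):
        raise ValueError("identifier line must begin with '@'")
    # The part before the space locates the cluster; the part after describes the read.
    location, description = header[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    pair_member, filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),              # assumed meaning, not listed on the slide
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair_member": int(pair_member),
        "filtered": filtered,         # assumed meaning, not listed on the slide
        "control": int(control),      # assumed meaning, not listed on the slide
        "index": index,
    }


fields = parse_illumina_header("@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG")
print(fields["instrument"], fields["lane"], fields["tile"], fields["index"])
```

Headers from other sources (for example the SRR-style identifier shown earlier) do not follow this layout, so a parser like this would need to fail gracefully on them.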
4
FASTQ Format
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
Read Sequence (A, G, T, C, N)
5
FASTQ Format
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
'+' character
Can be followed by the same sequence identifier (from line 1)
6
FASTQ Format
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
Base quality scores (Phred+33) for the sequence in line 2
Must contain the same number of characters as the sequence
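The four-line layout described on these slides can be sketched as a small reader that also checks the rules stated above: line 1 starts with @, line 3 starts with +, and the quality line is as long as the sequence. This is a minimal sketch with a hypothetical function name; it assumes each record occupies exactly four lines (no wrapped sequences) and stops at the first empty line.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@': " + header)
        if not plus.startswith("+"):
            raise ValueError("line 3 must begin with '+': " + plus)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for " + header)
        yield header[1:], sequence, quality


example = io.StringIO(
    "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG\n"
    "CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA\n"
    "+\n"
    "@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI\n"
)
records = list(read_fastq(example))
print(len(records), len(records[0][1]))
```

For real data, an established parser is usually a safer choice than hand-written code like this.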
7
Quality Scores
Sequencers can assign a "confidence" value to each base call based on how ambiguous the call is.
The sequencer estimates the probability that a given base call is NOT correct (Ewing 1998).
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
8
Quality Scores
The PHRED score is defined as q = -10 x log10(p) (Ewing 1998), where p = probability that the call is not correct.
p         q = -10 x log10(p)    Est. accuracy = 1 - p
0.1       10                    0.9
0.01      20                    0.99
0.001     30                    0.999
0.0001    40                    0.9999
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
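The definition q = -10 x log10(p) and its inverse can be checked with a few lines of Python. The function names below are chosen for this example only.

```python
import math


def phred_from_error(p):
    """Phred score q for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Error probability p for a Phred score q."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001, 0.0001):
    q = phred_from_error(p)
    print(p, round(q), 1 - p)
```

Read the other way, a commonly used trimming threshold of q = 20 corresponds to an estimated 1 in 100 chance that the base call is wrong.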
9
Quality Score Encodings
Why not just have numbers?
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGT…
+
3131303537373739…
10
Quality Score Encodings
Why not just have numbers?
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGT…
+
3131303537373739…
Written as bare digits, multi-digit scores run together and no longer match the sequence one character per base.
Quality symbols to the rescue
11
Quality Score Encodings
Deep down, the computer represents letters as numbers.
The quality score plus a constant (usually 33 or 64) gives a number, and that number is converted to the quality symbol using the ASCII table.
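In Python the conversion described here is a one-liner in each direction, using the built-in ord and chr functions. The sketch below assumes an offset of 33 by default; the function names are made up for the example.

```python
def decode_qualities(quality_string, offset=33):
    """Convert quality symbols to numeric scores: ASCII code minus the offset."""
    return [ord(symbol) - offset for symbol in quality_string]


def encode_qualities(scores, offset=33):
    """Convert numeric scores to quality symbols: score plus the offset, as ASCII."""
    return "".join(chr(score + offset) for score in scores)


scores = decode_qualities("@@?DFFFF")
print(scores)
print(encode_qualities(scores))
```

With offset 33, the symbol # decodes to a score of 2 and I decodes to 40. Decoding with the wrong offset shifts every score by 31, so the encoding has to be known (or guessed from the range of symbols) before any quality-based trimming.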
12
ASCII Table
13
Quality Scores
FastQC is an excellent program for visualizing the overall quality of all reads in a FASTQ file.
FastQC is developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk
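To give a feel for the kind of summary behind an overall quality view, the sketch below computes the mean quality score at each read position across a set of reads. It is not FastQC's code and covers only one simple statistic; the function name is hypothetical and Phred+33 encoding is assumed.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Mean decoded quality at each read position, over reads of possibly different lengths."""
    totals = []
    counts = []
    for quality_string in quality_strings:
        for position, symbol in enumerate(quality_string):
            if position == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[position] += ord(symbol) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


means = mean_quality_by_position(["II", "5"])
print(means)
```

A drop in these means toward the end of the reads is the typical pattern that motivates trimming.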
14
Trimming Based on Quality
Tactics for increasing overall quality
We want to cut away the low-quality bases!
15
Trimming Based on Quality
Wholesale cutting by base position
16
Trimming Based on Quality
Start from the ends of the read and cut away bases until the quality is above a specified threshold (usually 20)
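This tactic can be sketched as follows. The function name is made up, the qualities are assumed to be already decoded to integers, and "above the threshold" is taken to mean greater than or equal to it, which is a choice made for this example.

```python
def trim_ends(sequence, qualities, threshold=20):
    """Cut low-quality bases from both ends, stopping at the first base that meets the threshold."""
    start = 0
    end = len(qualities)
    while start < end and qualities[start] < threshold:
        start += 1
    while end > start and qualities[end - 1] < threshold:
        end -= 1
    return sequence[start:end], qualities[start:end]


trimmed_seq, trimmed_quals = trim_ends("ACGTACGT", [2, 2, 30, 36, 36, 25, 2, 2])
print(trimmed_seq, trimmed_quals)
```

Low-quality bases in the middle of the read are left alone by this approach; only the ends are inspected.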
17
Trimming Based on Quality
Start from one end and keep bases until they fall below a specified threshold
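A minimal sketch of this variant, assuming the scan starts at the beginning of the read and the qualities are already decoded to integers (the function name and the default threshold of 20 are choices made for the example):

```python
def keep_until_low(sequence, qualities, threshold=20):
    """Keep bases from the start of the read up to, but not including, the first low-quality base."""
    for position, quality in enumerate(qualities):
        if quality < threshold:
            return sequence[:position], qualities[:position]
    return sequence, qualities


kept_seq, kept_quals = keep_until_low("ACGTACGT", [36, 36, 30, 2, 36, 36, 2, 2])
print(kept_seq, kept_quals)
```

A single poor base in an otherwise good read discards everything after it, which is why window-based approaches are often preferred.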
18
Trimming Based on Quality
Sliding windows and minimum vs. average quality scores
[Figure: read with labeled base qualities 25, 36, and 2; window average, min, and max]
Target: Average below 20
19
Trimming Based on Quality
Sliding windows and minimum vs. average quality scores
[Figure: read with labeled base qualities 25, 36, and 2]
Window Size = 6, Step Size = 5
Average: 34.2, Min: 25, Max: 36
Target: Average below 20
20
Trimming Based on Quality
Sliding windows and minimum vs. average quality scores
[Figure: read with labeled base qualities 25, 36, and 2]
Window Size = 6, Step Size = 5
Average: 13.3, Min: 2, Max: 36
Target: Average below 20
21
Trimming Based on Quality
Sliding windows and minimum vs. average quality scores
[Figure: read with labeled base qualities 25, 36, and 2; window average, min, and max]
Window Size = 6, Step Size = 5
Target: Average below 20
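The sliding-window idea can be sketched as below, using the window size (6), step size (5), and target (average below 20) from the slides. The function name is hypothetical, and the example qualities are invented so that the first two windows reproduce the averages shown on the slides (34.2 and 13.3). Real trimmers differ in details such as step size and where exactly they cut.

```python
def sliding_window_cut(qualities, window=6, step=5, min_average=20):
    """Return how many leading bases to keep: cut at the start of the first window whose mean is too low."""
    for start in range(0, len(qualities), step):
        window_scores = qualities[start:start + window]  # last window may be shorter
        if sum(window_scores) / len(window_scores) < min_average:
            return start
    return len(qualities)


qualities = [25, 36, 36, 36, 36, 36, 36, 2, 2, 2, 2]
first = qualities[0:6]
second = qualities[5:11]
print(round(sum(first) / 6, 1), min(first), max(first))
print(round(sum(second) / 6, 1), min(second), max(second))
print(sliding_window_cut(qualities))
```

Using the window minimum instead of the average would be stricter: the second window here fails either way, but a window with a single very poor base among good ones would pass on average and fail on minimum.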
22
Trimming Based on Quality
Mate pairs, orphans and minimum sequence length
Left read survives trimming
Right read too short to keep
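The bookkeeping for pairs after trimming can be sketched as a small classifier. The function name, the category labels, and the minimum length of 36 are all assumptions for illustration; the slides do not specify a value.

```python
def classify_pair(left_read, right_read, min_length=36):
    """Decide what happens to a read pair after trimming, based on a minimum length."""
    keep_left = len(left_read) >= min_length
    keep_right = len(right_read) >= min_length
    if keep_left and keep_right:
        return "paired"
    if keep_left:
        return "left_orphan"   # right read too short to keep
    if keep_right:
        return "right_orphan"  # left read too short to keep
    return "dropped"


print(classify_pair("A" * 80, "C" * 12))
```

Orphans are usually written to a separate file so that the paired files stay in the same order, which downstream tools that expect matched pairs rely on.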
23
Trimming Software
Trimmomatic: Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Trim Galore!: developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk
FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit
Galaxy trimming tools:
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.
24
Kmers
What's a Kmer?
For a given sequence and a number, K, how many subsequences of length K are there?
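A sequence of length L contains L - K + 1 overlapping subsequences of length K (when L is at least K). The sketch below enumerates and counts them; the function name is chosen for the example, and K = 5 is just an illustrative value.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping subsequence of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


kmers = count_kmers("CGTTCAGTCA", 5)
print(sum(kmers.values()))
print(count_kmers("AAAAAA", 5))
```

K-mers that occur far more often than expected across many reads can point to adapters, primers, or other over-represented sequence.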
25
Kmers
Why?
K = 5
26
Primers and Adapters
When fragments are shorter than the total length of the read, adapters will be sequenced on both mates of a paired-end read.
For example, consider a technology that can sequence up to 100 bp.
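The arithmetic behind adapter read-through, plus a deliberately naive way to remove it, can be sketched as below. The fragment length of 80 bp is a made-up example value, the function names are hypothetical, and the adapter string is the start of a commonly used Illumina adapter, included here only as an assumption for the demonstration. The trimming function looks for exact matches only, whereas dedicated tools tolerate sequencing errors.

```python
def adapter_bases_sequenced(fragment_length, read_length=100):
    """Number of adapter bases at the 3' end of a read when the fragment is shorter than the read."""
    return max(0, read_length - fragment_length)


def trim_adapter(read, adapter, min_overlap=5):
    """Remove an exact adapter match, or an adapter prefix hanging off the 3' end of the read."""
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # No full copy found: check whether the read ends with the first few adapter bases.
    for overlap in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, for illustration only
print(adapter_bases_sequenced(80))
print(trim_adapter("ACGTACGTAC" + ADAPTER + "TTT", ADAPTER))
print(trim_adapter("ACGTACGTAC" + "AGATCGG", ADAPTER))
```

The min_overlap parameter guards against trimming a few bases that match the adapter start purely by chance.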
27
Primers and Adapters
When to suspect this: patterns toward the ends of reads
28
Primers and Adapters
Software for removing adapters
Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. doi: 10.14806/ej.17.1.200
FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit
Scythe: https://github.com/ucdavis-bioinformatics/scythe
29
Poly-A Tails and Other Artifacts
Library prep: retained and sequenced poly-As/poly-Ts
When to suspect this:
30
Poly-A Tails and Other Artifacts
PRINSEQ (Schmieder 2011) for trimming poly-Ts: it computes the percentage of each read that is T and filters out reads above a cutoff.
Conservatively, 60% of a read is T? Kick it out.
Filter on % base, sequence complexity, duplicates
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]
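The percentage-based filter described here can be sketched in plain Python. This is not PRINSEQ's code or interface; the function names are made up, and the 60% cutoff is the conservative value mentioned on the slide.

```python
def base_fraction(sequence, base="T"):
    """Fraction of the read made up of the given base (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    return sequence.upper().count(base) / len(sequence)


def passes_base_filter(sequence, base="T", max_fraction=0.6):
    """Keep a read only if less than max_fraction of it is the given base."""
    return base_fraction(sequence, base) < max_fraction


print(passes_base_filter("TTTTTTTTAC"))
print(passes_base_filter("ACGTACGTAC"))
```

Running the same check with base="A" covers retained poly-A tails.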
31
Conservative QC vs Aggressive QC: factors
How much sequence one can afford to cut out depends on the following:
Coverage: If your sample was sequenced at very low coverage, you may not want to cut aggressively.
Sequence length: You can afford to cut 20 bp out of a 150 bp read, but not out of a 30 bp read.
Goals: Depending on your end goal, cut more or less aggressively.
32
References
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. doi: 10.14806/ej.17.1.200
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]
33
Fin
Thanks for watching!
Questions and comments: Email help@ncgas.org