National Center for Genome Analysis Support: Carrie Ganote Ram Podicheti Le-Shin Wu Tom Doak Quality Control and Assessment.


1 Quality Control and Assessment of RNA-Seq Data
National Center for Genome Analysis Support: http://ncgas.org
Carrie Ganote, Ram Podicheti, Le-Shin Wu, Tom Doak

2 What do the data look like?
    @SRR638895.6046 6046 length=76
    GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
    +SRR638895.6046 6046 length=76
    GDGACGFFGF7EDDAECBEDFEGFGECGEDGFGE:=BDD@FD59B67>:=9>:8>>;;<;=CD@9+=???######
FASTQ is a common format for storing next-generation sequencing data. It is text based and stores both the sequence and its quality information. It was originally developed at the Wellcome Trust Sanger Institute and later adopted by Solexa (Bennett, 2004). Each read is described by 4 lines.
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433

3 FASTQ Format
    @CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
    CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
    +
    @@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
Sequence identifier. Begins with a @ symbol. Comprises the instrument name, flowcell, lane, tile, X and Y coordinates of the cluster on the tile, member of a pair (1 or 2), and index.
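The identifier fields can be pulled apart with a few string splits. The sketch below assumes the colon-separated layout visible in the example identifier: seven fields, a space, then four more. The slide does not name the second field or the two middle fields after the space; treating them as a run number, a filter flag and a control number is an assumption, and all field names in the code are illustrative.

```python
from typing import NamedTuple


class ReadId(NamedTuple):
    instrument: str
    run: int          # assumption: not named on the slide
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int  # 1 or 2
    filtered: str     # assumption: not named on the slide
    control: int      # assumption: not named on the slide
    index: str


def parse_identifier(line: str) -> ReadId:
    """Split one FASTQ identifier line into its fields."""
    if not line.startswith("@"):
        raise ValueError("identifier must begin with '@'")
    left, right = line[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    member, filtered, control, index = right.split(":")
    return ReadId(instrument, int(run), flowcell, int(lane), int(tile),
                  int(x), int(y), int(member), filtered, int(control), index)


rid = parse_identifier("@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG")
print(rid.instrument, rid.flowcell, rid.lane, rid.tile, rid.pair_member, rid.index)
```

Identifiers written by other instruments or older pipelines may not follow this layout, in which case the unpacking raises a ValueError rather than guessing.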

4 Read Sequence (A, G, T, C, N)
    @CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
    CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
    +
    @@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI

5 FASTQ Format
    @CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
    CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
    +
    @@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
The '+' character. Can be followed by the same sequence identifier (from line 1).

6 FASTQ Format
    @CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
    CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
    +
    @@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
Base quality scores (Phred33) for the sequence in line 2. Must contain the same number of characters as the sequence.
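The four-line layout and its rules (a leading @, a + separator, equal sequence and quality lengths) can be checked while reading. This is a minimal sketch, assuming unwrapped records of exactly four lines with no blank lines between them; the function name read_fastq is illustrative.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            # End of input (a blank line is also treated as the end here).
            return
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


example = io.StringIO(
    "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG\n"
    "CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA\n"
    "+\n"
    "@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI\n"
)
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, len(seq), len(qual))
```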

7 Quality Scores
Sequencers can assign a "confidence" value per call based on how ambiguous the base call is. The sequencer will estimate the probability that a given base call is NOT correct (Ewing 1998).
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.

8 Quality Scores
The PHRED score is defined as q = -10 x log10(p) (Ewing 1998), where p is the probability that the call is not correct.
    p         q = -10 x log10(p)    Est. accuracy = 1 - p
    0.1       10                    0.9
    0.01      20                    0.99
    0.001     30                    0.999
    0.0001    40                    0.9999
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
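The relationship between error probability and score is easy to check numerically. The sketch below implements the definition q = -10 x log10(p) and its inverse; the function names are illustrative.

```python
import math


def phred_from_error(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Error probability for a Phred score q."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001, 0.0001):
    q = phred_from_error(p)
    print(f"p={p}  q={q:.0f}  accuracy={1 - p:.4f}")
```

Each extra 10 points of score corresponds to a tenfold drop in the chance that the base call is wrong.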

9 Quality Score Encodings
Why not just have numbers?
    @CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
    CGTTCAGT…
    +
    3131303537373739…

10 Quality Score Encodings
Why not just have numbers?
    @CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
    CGTTCAGT…
    +
    3131303537373739…
Quality symbols to the rescue.

11 Quality Score Encodings
Letters are represented deep down in the computer as numbers. The quality score plus a constant (usually 33 or 64) gives a number, and that number is converted to the quality symbol using ASCII.
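The conversion between symbols and scores is a single addition or subtraction on the character code. A minimal sketch, with the offset defaulting to 33 and illustrative function names:

```python
from typing import Iterable, List


def decode_quality(symbols: str, offset: int = 33) -> List[int]:
    """Turn quality symbols into scores: score = ASCII code - offset."""
    return [ord(ch) - offset for ch in symbols]


def encode_quality(scores: Iterable[int], offset: int = 33) -> str:
    """Turn scores into quality symbols: symbol = chr(score + offset)."""
    return "".join(chr(q + offset) for q in scores)


# First ten quality symbols of the example read from the FASTQ Format slides.
scores = decode_quality("@@?DFFFFHG")
print(scores)
print(encode_quality(scores))
```

Decoding with the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before trimming on quality.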

12 ASCII Table

13 Quality Scores
FastQC is an excellent program for visualizing the overall quality of all reads in a FASTQ file.
FastQC is developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk

14 Trimming Based on Quality
Tactics for increasing overall quality. We want to cut away the low quality bases!

15 Trimming Based on Quality
Wholesale cutting by base position.

16 Trimming Based on Quality
Start from the ends of the read and cut away until quality is above a specified threshold (usually 20).
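End trimming of this kind can be sketched as two scans, one from each end of the read. The code assumes that a base is kept once its score is at or above the threshold (whether "above" is strict is not stated on the slide), and the example scores are made up for illustration.

```python
from typing import List, Tuple


def trim_ends(seq: str, quals: List[int], threshold: int = 20) -> Tuple[str, List[int]]:
    """Cut low-quality bases from both ends; interior bases are not examined."""
    start, end = 0, len(quals)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]


# Made-up read: poor quality at both ends, good in the middle.
trimmed_seq, trimmed_quals = trim_ends("ACGTACGT", [2, 15, 36, 22, 36, 30, 10, 2])
print(trimmed_seq, trimmed_quals)
```

A read whose bases are all below the threshold comes back empty, so a minimum-length check is usually applied afterwards.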

17 Trimming Based on Quality
Start from one end and keep bases until they fall below a specified threshold.
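Keeping bases from one end until the first low score can be sketched as a single forward scan. The default threshold of 20 and the example scores are assumptions for illustration.

```python
from typing import List, Tuple


def keep_until_low(seq: str, quals: List[int], threshold: int = 20) -> Tuple[str, List[int]]:
    """Keep bases from the start of the read up to the first score below threshold."""
    keep = 0
    for q in quals:
        if q < threshold:
            break
        keep += 1
    return seq[:keep], quals[:keep]


# Made-up read: one poor base in the middle ends the kept region.
kept_seq, kept_quals = keep_until_low("ACGTAC", [36, 36, 2, 36, 36, 36])
print(kept_seq, kept_quals)
```

With this rule a single poor base discards everything after it, even when the following bases are good.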

18 Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Target: average below 20.
[Figure: read with example base qualities 25, 36 and 2, summarized by the window's average, min and max.]

19 Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Window size = 6, step size = 5. Target: average below 20.
Current window: average 34.2, min 25, max 36.
[Figure: read with example base qualities 25, 36 and 2.]

20 Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Window size = 6, step size = 5. Target: average below 20.
Current window: average 13.3, min 2, max 36.
[Figure: read with example base qualities 25, 36 and 2.]

21 Trimming Based on Quality
Sliding windows and minimum vs. average quality scores.
Window size = 6, step size = 5. Target: average below 20.
[Figure: read with example base qualities 25, 36 and 2; the window's average, min and max are left blank.]
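A sliding-window trim with a window of 6 and a step of 5 can be sketched as follows. Cutting the read at the start of the first failing window is one possible rule and is an assumption here; real trimmers differ in where exactly they cut. The example scores are made up so that the first two windows average about 34.2 and 13.3.

```python
from typing import List


def sliding_window_cut(quals: List[int], window: int = 6, step: int = 5,
                       min_avg: float = 20.0) -> int:
    """Return how many leading bases to keep.

    Windows start at 0, step, 2*step, ...; the last one may be shorter
    than `window`. The read is cut where the first window whose average
    falls below `min_avg` begins.
    """
    for start in range(0, len(quals), step):
        chunk = quals[start:start + window]
        if sum(chunk) / len(chunk) < min_avg:
            return start
    return len(quals)


quals = [36, 36, 25, 36, 36, 36, 36, 2, 2, 2, 2, 2]
keep = sliding_window_cut(quals)
print(keep, quals[:keep])
```

Using the window average tolerates an isolated poor base, whereas using the window minimum would cut at the first poor base it sees.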

22 Trimming Based on Quality
Mate pairs, orphans and minimum sequence length.
Example: the left read survives trimming, but the right read is too short to keep.
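The bookkeeping for mates after trimming can be sketched as a small classifier. The minimum length of 30 and the category names are hypothetical; the slide does not give a cutoff.

```python
def classify_pair(left: str, right: str, min_len: int = 30) -> str:
    """Decide what happens to a trimmed pair of mates.

    min_len is a hypothetical cutoff chosen for illustration.
    """
    left_ok = len(left) >= min_len
    right_ok = len(right) >= min_len
    if left_ok and right_ok:
        return "paired"
    if left_ok:
        return "orphan_left"
    if right_ok:
        return "orphan_right"
    return "discard"


# Left read survives trimming, right read is too short to keep.
print(classify_pair("A" * 48, "C" * 12))
```

Orphans are typically written to a separate file so that the paired files stay in the same read order.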

23 Trimming Software
Trimmomatic: Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Trim Galore!: developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk
FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit
Galaxy trimming tools:
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.

24 Kmers
What's a Kmer? For a given sequence and a number, K, how many subsequences of length K are there?

25 Kmers
Why? K = 5.
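Counting Kmers takes only a few lines. A sequence of length L contains L - K + 1 overlapping subsequences of length K; the sketch below counts them for K = 5 on the first ten bases of the example read from the FASTQ Format slides. The function name is illustrative.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping subsequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("CGTTCAGTCA", 5)
print(sum(counts.values()), dict(counts))
```

Summed over all reads in a file, counts like these show which Kmers occur far more often than expected.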

26 Primers and Adapters
When fragments are shorter than the total length of the read, adapters will be sequenced on both mates of a paired-end read. For example, consider a technology that can sequence up to 100 bp.
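Read-through can be illustrated with a naive trimmer: a 60 bp fragment sequenced to 100 bp yields 40 bp of adapter at the 3' end of the read. The sketch below removes an adapter by exact matching only, including a partial adapter at the very end of the read. The adapter string is a placeholder chosen for the example, and the minimum overlap of 3 is an assumption; dedicated tools also tolerate mismatches.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter by exact match (no mismatches allowed)."""
    # Case 1: the whole adapter is present somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter for this example
fragment = "ACGT" * 5      # made-up 20 bp fragment

print(trim_adapter(fragment + ADAPTER, ADAPTER))      # full adapter read-through
print(trim_adapter(fragment + ADAPTER[:8], ADAPTER))  # read ends inside the adapter
```

A small min_overlap removes more true adapter remnants but also clips a few genuine bases that match the adapter start by chance.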

27 Primers and Adapters
When to suspect this: patterns toward the ends of reads.

28 Primers and Adapters
Software for removing adapters:
Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. doi: 10.14806/ej.17.1.200
FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit
Scythe: https://github.com/ucdavis-bioinformatics/scythe

29 Poly-A Tails and Other Artifacts
Library prep: retained and sequenced poly-As/poly-Ts.
When to suspect this:

30 Poly-A Tails and Other Artifacts
PRINSEQ (Schmieder 2011) for trimming poly-Ts: it takes the percentage of the read that consists of Ts and sorts such reads out. Conservatively, if 60% of a read is T, kick it out. It can also filter on % base, sequence complexity and duplicates.
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]
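The percentage rule can be sketched as a simple base-fraction filter. This is not PRINSEQ's implementation, only an illustration of the 60% idea; treating exactly 60% as a rejection is an assumption, and the function names are illustrative.

```python
def base_fraction(seq: str, base: str = "T") -> float:
    """Fraction of the read made up of the given base."""
    if not seq:
        return 0.0
    return seq.upper().count(base) / len(seq)


def keep_read(seq: str, base: str = "T", max_fraction: float = 0.6) -> bool:
    """Reject reads in which `base` makes up max_fraction or more of the read."""
    return base_fraction(seq, base) < max_fraction


for read in ("TTTTTTTTACGT", "ACGTACGTACGT"):
    print(read, round(base_fraction(read), 2), keep_read(read))
```

The same check with base="A" covers retained poly-A tails.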

31 Conservative QC vs Aggressive QC: factors
How much sequence one can afford to cut out depends on the following:
Coverage: if your sequence was run with very low coverage, you may not want to cut aggressively.
Sequence length: you can afford to cut 20 bp out of a 150 bp read, but not out of a 30 bp read.
Goals: depending on your end goal, cut more or less aggressively.

32 References
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433
Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.
Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. doi: 10.14806/ej.17.1.200
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]

33 Fin
Thanks for watching! Questions and comments: email help@ncgas.org

