Published by Percival Wilkerson. Modified over 9 years ago.
1
Data Formats & QC Analysis for NGS
Rosana O. Babu
8/19/2015
2
Sequence Formats
Most sequence formats are ASCII text (binary formats such as SFF and BAM are exceptions) containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence
Formats are designed to hold sequence data together with other information about the sequence
3
Why so many formats?
Each step of the analysis needs its own set of information
Efficient data management: moving data across the file system takes time
Data formats vary in the information they contain
Five types of sequence file formats: raw sequence files, coordinate files, parameter files, annotation files, metadata files
4
Sequencers & Sequence Analysis Packages
5
Read output formats: 454, Solexa/Illumina, SOLiD
6
454 output formats: .sff, .fna, .qual
7
Illumina output formats
.seq.txt
.prb.txt
Illumina FASTQ (ASCII code minus 64 gives the Illumina quality score)
Qseq (ASCII code minus 64 gives the Phred score)
Illumina single-line format; SCARF
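The "ASCII minus 64" rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, the scores are treated as Phred values, and the offset of 33 is the one used by Sanger ("standard") FASTQ.

```python
def decode_quality(quality_string, offset):
    """Convert an ASCII-encoded quality string to integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(phred):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# The same five scores written in the two encodings.
illumina_scores = decode_quality("hhhhB", offset=64)  # offset-64 encoding
sanger_scores = decode_quality("IIII#", offset=33)    # offset-33 encoding

print(illumina_scores, sanger_scores)
print(error_probability(20))
```

Decoding a file with the wrong offset shifts every score by 31, which is why the encoding has to be known before any quality filtering.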
8
SOLiD output format(s): CSFASTA
9
If reads should be deposited in a public repository:
SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
ENA at EMBL-EBI
10
Alignment/Assembly Format
Common ("standard") format for read alignments:
SAM
BAM (= binary SAM)
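To make the SAM layout concrete, here is a small sketch that splits one alignment line into its eleven mandatory tab-separated fields. The example record and the function name are made up for illustration; real pipelines would normally use an existing SAM/BAM library instead.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse a single SAM alignment line (not a '@' header line)."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical record, for illustration only.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
rec = parse_sam_line(example)
is_unmapped = bool(rec["FLAG"] & 0x4)  # FLAG bit 0x4 marks an unmapped read
print(rec["RNAME"], rec["POS"], rec["CIGAR"], is_unmapped)
```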
11
Formats for Genome/Gene annotation
BED format (genome-browser tracks)
GFF format (gene/genome features)
BioXSD (XML) (any annotation; under development)
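One practical difference between the two main annotation formats is the coordinate system: BED intervals are 0-based with an exclusive end, while GFF intervals are 1-based with an inclusive end. The helper names below are hypothetical; the sketch only shows the conversion.

```python
def bed_to_gff(start, end):
    """0-based, end-exclusive (BED) to 1-based, end-inclusive (GFF)."""
    return start + 1, end


def gff_to_bed(start, end):
    """1-based, end-inclusive (GFF) to 0-based, end-exclusive (BED)."""
    return start - 1, end


# The first 100 bases of a chromosome in both conventions.
print(bed_to_gff(0, 100))
print(gff_to_bed(1, 100))
```

The feature length is `end - start` in BED and `end - start + 1` in GFF, so mixing the conventions silently produces off-by-one errors.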
12
Deposit genome/metagenome in a public repository: INSDC (International Nucleotide Sequence Database Collaboration) databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata: MIGS/MIMS standard by the GSC (Genomic Standards Consortium)
13
MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample
14
Points to remember on Data Formats
Use the raw sequencing data format when possible
For base-call data, use "standard" FASTQ (Sanger, Phred)
For read alignments, use SAM/BAM format
For annotation results, use a standard format (e.g. GFF or BED)
15
QC analysis
16
Need for QC & Preprocessing
QC analysis of sequence data is essential for meaningful downstream analysis
To detect problems in the quality scores and statistics of the sequencing data
To check whether further analysis of the sequences is possible
To remove redundancy (filtering)
To remove low-quality reads from the analysis
Highly efficient, fast processing tools are required to handle large volumes of data
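Removing low-quality reads can be sketched as follows. This is a minimal illustration under stated assumptions: four-line FASTQ records, offset-33 quality encoding, and an arbitrary mean-quality threshold of 20. The function names are hypothetical and do not correspond to any particular tool.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_quality(quality, offset=33):
    """Average quality score of one read."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20, offset=33):
    """Keep reads whose mean quality reaches the (illustrative) threshold."""
    return [r for r in records if mean_quality(r[2], offset) >= min_mean_quality]


fastq_text = "@good\nACGT\n+\nIIII\n@bad\nACGT\n+\n####\n"
kept = filter_reads(read_fastq(io.StringIO(fastq_text)))
print([name for name, _, _ in kept])
```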
17
FastQC and FastX Toolkit
Use FastQC for the preliminary analysis
Use the FASTX-Toolkit to preprocess each dataset, then check the results again with FastQC
18
FastQC output
Basic statistics
Quality per base position
Per sequence quality distribution
Nucleotide content per position
Per sequence GC distribution
Per base GC distribution
Per base N content
Length distribution
Overrepresented/duplicated sequences
K-mer content
19
FastQC (Box-Whisker plot)
Y axis: quality score
X axis: base position
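The box in such a plot summarizes the distribution of quality scores at each base position. The sketch below computes the quartiles per position for a handful of made-up reads; it assumes reads of equal length and offset-33 encoding, omits the whiskers, and is not FastQC's own implementation.

```python
import statistics


def per_position_summary(quality_strings, offset=33):
    """Return the quartiles of the quality scores at each base position."""
    decoded = [[ord(ch) - offset for ch in q] for q in quality_strings]
    summary = []
    # zip(*decoded) walks the reads column by column (position by position).
    for scores in zip(*decoded):
        q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
        summary.append({"q1": q1, "median": median, "q3": q3})
    return summary


# Hypothetical quality strings for four reads of length 3.
reads = ["II#", "I5#", "II5", "I##"]
summary = per_position_summary(reads)
for position, box in enumerate(summary, start=1):
    print(position, box)
```

A falling median toward the end of the read is the typical pattern that motivates trimming.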
20
Basic Statistics
Contains information about:
File type
ASCII encoding of the quality values
Total sequences, filtered sequences
Sequence length
Percentage GC content
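Most of these values are simple counts over the reads. A minimal sketch, using a hypothetical function name and three made-up sequences:

```python
def basic_statistics(sequences):
    """Count reads, report the length range, and compute overall %GC."""
    lengths = [len(s) for s in sequences]
    total_bases = sum(lengths)
    gc_bases = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "total_sequences": len(sequences),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "percent_gc": 100 * gc_bases / total_bases,
    }


stats = basic_statistics(["ACGT", "GGCC", "ATAT"])
print(stats)
```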
21
2. Quality per base position
22
2. Quality per base position
23
3. Per Sequence Quality Distribution
24
3. Per Sequence Quality Distribution
25
4. Nucleotide content per position
26
4. Nucleotide content per position
27
5. Per sequence GC distribution
28
5. Per sequence GC distribution
29
6. Per base GC distribution
30
6. Per base GC distribution
31
7. Per base N content
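Per base N content is the percentage of reads with an uncalled base (N) at each position. A minimal sketch with made-up reads and a hypothetical function name:

```python
def n_content_per_position(sequences):
    """Percentage of N calls at each position across all reads."""
    longest = max(len(s) for s in sequences)
    percentages = []
    for i in range(longest):
        # Only reads that are long enough contribute to position i.
        bases = [s[i] for s in sequences if i < len(s)]
        n_calls = sum(base in "Nn" for base in bases)
        percentages.append(100 * n_calls / len(bases))
    return percentages


n_percent = n_content_per_position(["ACGN", "ANGT", "ACGN", "ACGT"])
print(n_percent)
```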
32
8. Length Distribution
33
9. K-mer content
34
10. Overrepresented/duplicate sequences
A high proportion of duplicated sequences usually points to a library or sequencing problem, for example PCR over-amplification
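Duplicate and overrepresented sequences can be found by counting identical reads. The sketch below is only an illustration: the function name is hypothetical, the reads are made up, and the 25% cutoff is chosen to suit the tiny example rather than taken from any tool.

```python
from collections import Counter


def duplication_summary(sequences, min_fraction=0.25):
    """Return the fraction of distinct reads and the overrepresented ones."""
    counts = Counter(sequences)
    total = len(sequences)
    distinct_fraction = len(counts) / total
    overrepresented = [
        (seq, n) for seq, n in counts.most_common() if n / total >= min_fraction
    ]
    return distinct_fraction, overrepresented


reads = ["ACGT", "ACGT", "ACGT", "GGGG", "TTTT", "CCCC"]
distinct_fraction, overrepresented = duplication_summary(reads)
print(distinct_fraction, overrepresented)
```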
35
FASTX Toolkit
fastx_quality_stats.txt
fastq_quality_boxplot_graph.png
fastx_nucleotide_distribution.png
QC report.txt
36
QC Report
Sequence Statistics
Total No. of Sequences: 6970943
Avg. Sequence Length: 54
Max Sequence Length: 54
Min Sequence Length: 54
Total Sequence Length: 376430922
Total N bases: 14254521
% N bases: 3.78676
No. of Sequences with Ns: 278635
% Sequences with Ns: 3.99709
Quality Statistics
Total HQ bases: 334195496
% HQ bases: 88.78
Total HQ reads: 6350256
% HQ reads: 91.0961
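The percentages in this report follow directly from the counts, which makes a quick consistency check possible. The sketch below uses only the numbers shown on the slide; the dictionary keys and the helper name are hypothetical, and the threshold that defines a "high quality" (HQ) base or read is not stated in the slides.

```python
report = {
    "total_sequences": 6970943,
    "sequence_length": 54,
    "total_length": 376430922,
    "n_bases": 14254521,
    "sequences_with_n": 278635,
    "hq_bases": 334195496,
    "hq_reads": 6350256,
}


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole


# All reads have the same length, so total length = reads * length.
expected_length = report["total_sequences"] * report["sequence_length"]

pct_n_bases = percent(report["n_bases"], report["total_length"])
pct_seqs_with_n = percent(report["sequences_with_n"], report["total_sequences"])
pct_hq_bases = percent(report["hq_bases"], report["total_length"])
pct_hq_reads = percent(report["hq_reads"], report["total_sequences"])

print(expected_length == report["total_length"])
print(pct_n_bases, pct_seqs_with_n, pct_hq_bases, pct_hq_reads)
```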
37
quality_boxplot_graph & nucleotide_distribution
38
Thank you