Published by Paulina Griffith. Modified over 9 years ago.
1
Introduction To Next Generation Sequencing (NGS) Data Analysis
Jenny Wu, UCI Genomics High Throughput Facility
2
Outline
Goals: Practical guide to NGS data processing
Bioinformatics in NGS data analysis
Basics: terminology, data file formats, general workflow
Data Analysis Pipeline
  Sequence QC and preprocessing
  Obtaining and preparing reference
  Sequence mapping
  Downstream analysis workflow and software
Example: RNA-Seq analysis with Tuxedo protocol
Summary and future plan
3
Why Next Generation Sequencing
One can sequence hundreds of millions of short sequences (35bp-120bp) in a single run, in a short period of time and at a low per-base cost.
Illumina/Solexa: GA II, HiSeq 2000, HiSeq 2500
Life Technologies/Applied Biosystems: SOLiD
Roche/454: FLX, Titanium
Reviews:
  Michael Metzker (2010) Nature Reviews Genetics 11:31
  Quail et al. (2012) BMC Genomics 13:341
4
Why Bioinformatics
Informatics (wall.hms.harvard.edu)
5
Bioinformatics Challenges in NGS Data Analysis
VERY large text files (tens of millions of lines long)
  Can’t do ‘business as usual’ with familiar tools
  Impossible memory usage and execution time
Manage, analyze, store, transfer and archive huge files
Need for powerful computers and expertise
  Informatics groups must manage compute clusters
New algorithms and software are required; they are often open source and Unix/Linux based.
Collaboration of IT, bioinformaticians and biologists
6
Basic NGS Workflow
7
NGS Data Analysis Overview
Olson et al.
9
Terminology
Coverage (depth): The number of read nucleotides mapped to a given position in the reference.
Quality score: Each called base comes with a quality score that measures the probability of a base-call error.
Mapping: Aligning reads to a reference to identify their origin.
Assembly: Merging fragments of DNA in order to reconstruct the original sequence.
Duplicate reads: Reads that are identical.
Multi-reads: Reads that can be mapped to multiple locations equally well.
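To make the coverage definition concrete, here is a minimal Python sketch that counts how many mapped reads overlap each reference position. The alignments are made-up values given as (0-based start, read length); they are not data from this presentation, and real pipelines would read them from a SAM/BAM file instead.

```python
def per_base_depth(alignments, ref_length):
    """Return a list where depth[i] is the number of reads covering position i."""
    depth = [0] * ref_length
    for start, length in alignments:
        # Clip each read to the reference boundaries before counting.
        for pos in range(max(start, 0), min(start + length, ref_length)):
            depth[pos] += 1
    return depth


# Hypothetical alignments: (0-based start, read length).
toy_alignments = [(0, 5), (2, 5), (4, 5)]
depth = per_base_depth(toy_alignments, ref_length=10)
mean_depth = sum(depth) / len(depth)

print(depth)       # per-position coverage
print(mean_depth)  # average coverage over the toy reference
```

Position 4 is covered by all three toy reads, so it has the highest depth; the last position is covered by none.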
10
What does the data look like? Common NGS Data Formats
11
FASTA Format (Reference Seq)
12
FASTQ Format (reads)
13
FASTQ Format (Illumina Example)
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
Lines of each record: Read Record Header; Read Bases; Separator (with optional repeated header); Read Quality Scores
Header fields labeled: Flow Cell ID, Lane, Tile, Tile Coordinates, Barcode
NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads.
(Passarelli, 2012)
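The four-line record layout and the header fields can be sketched in Python as follows. This is an illustration under stated assumptions, not a production parser: it assumes the Illumina CASAVA 1.8+ header convention (instrument:run:flowcell:lane:tile:x:y, then read:filter:control:barcode) and Phred+33 quality encoding. The function names are hypothetical, and the toy record reuses the first header from the slide with the read shortened to four bases for brevity.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    header: str
    bases: str
    qualities: str


def read_fastq(lines):
    """Yield FastqRecord objects from FASTQ text lines (four lines per record)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(rows), 4):
        header, bases, separator, qualities = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(bases) != len(qualities):
            raise ValueError("bases and quality string differ in length")
        yield FastqRecord(header[1:], bases, qualities)


def parse_illumina_header(header):
    """Split a CASAVA 1.8+ style header into named fields (assumed layout)."""
    location, _, flags = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, barcode = flags.split(":")
    return {
        "instrument": instrument, "run": int(run), "flowcell": flowcell,
        "lane": int(lane), "tile": int(tile), "x": int(x), "y": int(y),
        "read": int(read), "filtered": is_filtered, "barcode": barcode,
    }


def phred_scores(qualities, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qualities]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Toy record: header taken from the slide, read truncated to four bases.
toy_lines = [
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA\n",
    "CAGG\n",
    "+\n",
    "BCCF\n",
]
record = next(read_fastq(toy_lines))
fields = parse_illumina_header(record.header)
scores = phred_scores(record.qualities)
print(fields["flowcell"], fields["lane"], fields["tile"], fields["barcode"])
print(scores, [error_probability(q) for q in scores])
```

For paired-end data, the same parser could be run over both files in lockstep, relying on the one-to-one header correspondence noted above.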
15
Data Analysis Pipeline
(Diagram legend: Data, Task, File Format, Software)
Data: raw reads; mapped reads; analysis-ready reads
Tasks: collecting reference sequences and annotation; read QC and preprocessing; read mapping; local realignment, base quality recalibration; visualization
File formats: FASTQ (reads); FASTA (reference sequences); GTF/GFF (annotation); SAM/BAM (mapped reads)
Software: FASTQC, FASTX-toolkit, PRINSEQ (QC and preprocessing); Bowtie, BWA, MAQ (mapping); IGV, UCSC GB (visualization)
Downstream analysis by application:
  Whole Genome Sequencing: Variant calling, annotation
  RNA-Seq: Transcript assembly, quantification
  ChIP-Seq: Peak calling
  Methyl-Seq: Methylation calling
  ……
16
Why QC?
Sequencing runs cost money
  Consequences of not assessing the data: sequencing a poor library on multiple runs is throwing money away!
Data analysis costs money and time
  Cost of analyzing data, CPU time $$
  Cost of storing raw sequence data $$$
  Hours of analysis could be wasted $$$$
Downstream analysis can be incorrect.
17
How to QC?
$ fastqc s_1_1.fastq
available on HPC
Tutorial :
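One of the checks FastQC reports is per-base sequence quality. The idea behind that check can be sketched in a few lines of Python. This is not FastQC's implementation; it assumes Phred+33 encoding, and the quality strings and the threshold of 35 are made up for illustration.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across all reads."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First time a read reaches this position: grow the tallies.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings ('I' decodes to 40, '5' decodes to 20).
toy_qualities = ["IIII", "II5", "I5"]
means = mean_quality_per_position(toy_qualities)

# Flag positions whose mean quality falls below an illustrative threshold.
low_positions = [i for i, m in enumerate(means) if m < 35]
print(means)
print(low_positions)
```

Positions flagged this way are candidates for trimming during preprocessing.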
19
The UCSC Genome Browser Homepage
General information
Get genome annotation here!
Get reference sequences here!
Specific information: new features, current status, etc.
20
Getting reference sequences
21
Getting Reference Annotation
23
Sequence Mapping Challenges
Alignment (mapping) is the first step once read sequences are obtained.
The task: to align sequencing reads against a known reference.
Difficulties: high volume of data, size of reference genome, computation time, read length constraints, ambiguity caused by repeats and sequencing errors.
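The ambiguity caused by repeats can be shown with a deliberately naive exact-match search in Python. This is only a toy: real aligners such as Bowtie and BWA use indexed search and tolerate mismatches, and the reference and reads below are made up.

```python
def find_exact_hits(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def classify(read, reference):
    """Label a read as unmapped, uniquely mapped, or a multi-read."""
    n = len(find_exact_hits(read, reference))
    if n == 0:
        return "unmapped"
    return "unique" if n == 1 else "multi-read"


# Hypothetical reference containing the repeat GATTACA twice.
toy_reference = "GATTACAGATTACACCGT"
for toy_read in ("GATTACA", "CACCGT", "TTTT"):
    print(toy_read, find_exact_hits(toy_read, toy_reference),
          classify(toy_read, toy_reference))
```

The read that falls inside the repeat cannot be placed unambiguously, which is exactly the multi-read situation; a single sequencing error in any read would make this exact-match search miss it entirely.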
24
Short Read Alignment Olson et al.
25
Short Read Alignment Software
26
Short Reads Mapping Software
27
How to choose an aligner?
There are many aligners, and they vary a lot in performance (accuracy, memory usage, speed, etc.).
Factors to consider: application, platform, read length, downstream analysis, etc.
Constant trade-off between speed and sensitivity (e.g. MAQ vs. Bowtie)
Guaranteed high accuracy will take longer.
29
NGS Applications and Analysis Strategy
Name / Nucleic acid population / Brief analysis strategy

RNA-Seq
  Nucleic acid population: RNA (may be poly-A mRNA or total RNA)
  Analysis strategy: Alignment of reads to “genes”; variations for detecting splice junctions and quantifying abundance
Small RNA sequencing
  Nucleic acid population: Small RNA (often miRNA)
  Analysis strategy: Alignment of reads to small RNA references (e.g. miRbase), then to the genome; quantify abundance
ChIP-Seq
  Nucleic acid population: DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation)
  Analysis strategy: Align reads to reference genome, identify peaks & motifs
RIP-Seq
  Nucleic acid population: RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation)
  Analysis strategy: Align reads to reference genome and/or “genes”, identify peaks and motifs
Methylation Analysis
  Nucleic acid population: Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms
  Analysis strategy: Align reads to reference and either identify peaks or regions of methylation
SNP calling/discovery
  Nucleic acid population: All or some genomic DNA or RNA
  Analysis strategy: Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
Structural Variation Analysis
  Nucleic acid population: Genomic DNA, with two reads (mate-pair reads) per DNA template
  Analysis strategy: Align mate-pairs to reference sequence and interpret structural variants
de novo Sequencing
  Nucleic acid population: Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.)
  Analysis strategy: Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
Metagenomics
  Nucleic acid population: Entire RNA or DNA from a (usually microbial) community
  Analysis strategy: Phylogenetic analysis of sequences
(Hunicke-Smith et al, 2010)
30
Application Specific Software
Mapped reads feed application-specific analysis:
Whole Genome Sequencing, Exome Sequencing
  Task: Variant calling (SNPs, InDels)
  Software: ssahaSNP, Samtools, PyroBayes
RNA-Seq: Transcriptome analysis
  Tasks: 1. Transcriptome assembly 2. Abundance quantification 3. Differential expression and regulation
  Software: Tophat, STAR, Cufflinks, edgeR
ChIP-Seq: Protein-DNA binding sites
  Task: Peak identification
  Software: MACS, AREM, PeakSeq
Methyl-Seq: Methylation pattern analysis
  Task: Methylation calling
  Software: Bismark, BS Seeker
……
32
RNA-seq (Tuxedo Protocol)
1. Read mapping (SAM/BAM)
2. Transcript assembly and quantification (GTF/GFF)
3. Merge assembled transcripts from multiple samples
4. Differential expression analysis
33
1. Spliced Alignment: Tophat
Tophat: a spliced short read aligner for RNA-seq.
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
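The four Tophat invocations differ only in the sample name, so they can be generated rather than typed by hand. The Python sketch below only builds and prints the command lines; it does not run Tophat. The helper name and the naming scheme (condition C1/C2, replicate R1/R2, paired files ending in _1.fq and _2.fq) are assumptions taken from the commands on this slide.

```python
def tophat_command(sample, threads=8, annotation="genes.gtf", index="genome"):
    """Build the argument list for one paired-end Tophat run (not executed here)."""
    return [
        "tophat", "-p", str(threads), "-G", annotation,
        "-o", f"{sample}_thout", index,
        f"{sample}_1.fq", f"{sample}_2.fq",
    ]


# Two conditions with two replicates each, named as on the slide.
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2)]
commands = [tophat_command(sample) for sample in samples]

for argv in commands:
    print(" ".join(argv))
```

If Tophat is installed, each argument list could be passed to `subprocess.run(argv, check=True)`; keeping the commands as lists avoids shell-quoting problems.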
34
2.Transcript assembly and abundance quantification: Cufflinks
Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
35
3. Final Transcriptome assembly: Cuffmerge
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
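The manifest file passed to cuffmerge is just a list of the per-sample transcripts.gtf paths, one per line. A minimal Python sketch for writing it is shown below. The function name is hypothetical, the sample names follow the C1/C2, R1/R2 scheme used in this protocol, and the example writes into a temporary directory so nothing in the working directory is overwritten.

```python
import tempfile
from pathlib import Path


def write_assembly_manifest(samples, manifest_path):
    """Write one Cufflinks transcripts.gtf path per line; return the lines written."""
    lines = [f"./{sample}_clout/transcripts.gtf" for sample in samples]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines


sample_names = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]

# Written to a temporary directory for illustration only.
with tempfile.TemporaryDirectory() as tmp_dir:
    manifest = Path(tmp_dir) / "assemblies.txt"
    manifest_lines = write_assembly_manifest(sample_names, manifest)
    manifest_text = manifest.read_text()

print(manifest_text, end="")
```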
36
4.Differential Expression: Cuffdiff
Cuffdiff: a program that compares transcript abundance between samples.
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
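Cuffdiff expects replicates of the same condition joined by commas and different conditions given as separate arguments, with the labels passed to -L in the same order. The Python sketch below assembles the argument list under that convention; it builds the command but does not run it. The helper name and the dictionary layout are hypothetical, while the paths and options are the ones used in this protocol.

```python
def cuffdiff_command(conditions, merged_gtf="merged_asm/merged.gtf",
                     genome="genome.fa", out_dir="diff_out", threads=8):
    """Build a cuffdiff argument list from {condition label: [sample names]}."""
    labels = ",".join(conditions)
    # One argument per condition; replicates within a condition are comma-joined.
    bam_groups = [
        ",".join(f"./{sample}_thout/accepted_hits.bam" for sample in replicates)
        for replicates in conditions.values()
    ]
    return [
        "cuffdiff", "-o", out_dir, "-b", genome, "-p", str(threads),
        "-L", labels, "-u", merged_gtf, *bam_groups,
    ]


# Two conditions with two replicates each, as in the protocol.
design = {"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]}
cuffdiff_argv = cuffdiff_command(design)
print(" ".join(cuffdiff_argv))
```

Building the BAM groups from the experimental design keeps the label order and the file order in sync, which is easy to get wrong when the command is typed by hand.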
37
Integrative Genomics Viewer (IGV)
38
Visualizing RNA-seq mapping with IGV
Specify range or term in search box
Click on ruler
Click and drag
Use scroll bar
Use keyboard: Arrow keys, Page up, Page down, Home, End
Nielsen, C.B., et al. Visualizing genomes: techniques and challenges. Nature Methods 7:S5-S15 (2010)
39
Summary
NGS technologies are transforming molecular biology.
Bioinformatics analysis is a crucial part of NGS applications
  Data formats, terminology, general workflow
  Analysis pipeline
  Software for various NGS applications
  RNA-seq with Tuxedo suite
Thank you!