Next-generation sequencing data analysis using open source software


Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
ICGEB – Practical Course "Bioinformatics: Computer Methods in Molecular Biology"
June 26-30, 2017

NCBI SRA Portal

NCBI SRA Query

NCBI SRA 454 data

Getting data out of the NCBI SRA

File structure for the Demo
user0@head:~$ ls
igv  marino-data  ncbi  public_html  R  workspace
user0@head:~$

The FASTQ format http://en.wikipedia.org/wiki/FASTQ_format

Converting SRA to FASTQ
user0@head:~$ cd marino-data/sra/
user0@head:~/marino-data/sra$ ls
H_pylori_sequence.fasta  SRR023794.sra
user0@head:~/marino-data/sra$ fastq-dump --split-spot --skip-technical --clip SRR023794.sra
Written 231208 spots for SRR023794.sra
Written 231208 spots total
user0@head:~/marino-data/sra$ head -n 4 SRR023794.fastq
@SRR023794.1 FMQS1PV02F25Z5 length=86
GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC
+SRR023794.1 FMQS1PV02F25Z5 length=86
7779@EEEFFFFFFFFFFFFFFFFEE55000@8E88::EEFFFFFFEDAAAEEFFFFFFFFFFFFFFF;==;=ADFFFFFFFFFFF
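The four-line record printed by head above (header, bases, separator, qualities) can be read programmatically. The following is a minimal sketch, not part of the original course material. It assumes unwrapped four-line records and a Phred+33 quality encoding. The names read_fastq and phred_scores are hypothetical helpers, and the example record is a shortened, made-up one that reuses only the first eight bases and quality characters of the read shown above.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ in: " + header)
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Shortened illustrative record, not the full 86-base read from the slide.
example = io.StringIO("@example.1 length=8\nGGGTAGGC\n+\n7779@EEE\n")
for record in read_fastq(example):
    print(record.name, record.sequence, phred_scores(record.quality))
```

A score Q corresponds to an estimated error probability of 10^(-Q/10), so the low values at the start of this read mark its least reliable bases.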

Quality control assessment with FASTQC http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Running FASTQC
user0@head:~/marino-data/sra$ fastqc SRR023794.fastq
Started analysis of SRR023794.fastq
Approx 5% complete for SRR023794.fastq
Approx 10% complete for SRR023794.fastq
Approx 15% complete for SRR023794.fastq
Approx 20% complete for SRR023794.fastq
Approx 25% complete for SRR023794.fastq
Approx 30% complete for SRR023794.fastq
Approx 35% complete for SRR023794.fastq
Approx 40% complete for SRR023794.fastq
Approx 45% complete for SRR023794.fastq
Approx 50% complete for SRR023794.fastq
Approx 55% complete for SRR023794.fastq
Approx 60% complete for SRR023794.fastq
Approx 65% complete for SRR023794.fastq
Approx 70% complete for SRR023794.fastq
Approx 75% complete for SRR023794.fastq
Approx 80% complete for SRR023794.fastq
Approx 85% complete for SRR023794.fastq
Approx 90% complete for SRR023794.fastq
Approx 95% complete for SRR023794.fastq
Analysis complete for SRR023794.fastq
user0@head:~/marino-data/sra$

Running FASTQC
Open a web browser and point it to the address below (remember to use your own user name instead of user0):
http://23.251.138.125/~user0/

Break…

Converting SRA to SFF (native format)
user0@head:~/marino-data/sra$ sff-dump SRR023794.sra
Written 231208 spots for SRR023794.sra
Written 231208 spots total
user0@head:~/marino-data/sra$

How many clones should we sequence?
According to the work of Lander and Waterman (1988), the number of "islands" or contigs formed from randomly collected sequences depends on:
G = Genome Length
L = Sequence Read Length
N = Number of Sequences Collected
T = Number of Basepairs of Overlap Needed
$$\#\,\text{Islands} = N e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)}$$

5 Mbp Genome, 500 bp reads, 25 bp overlap
# reads   coverage   % sequenced   # contigs
2500      0.25       22.12         1971
5000      0.5        39.35         3109
10000     1          63.21         3867
20000     2          86.47         2991
30000     3          95.02         1735
40000     4          98.17         895
50000     5          99.33         433
60000     6          99.75         201
70000     7          99.91         91
80000     8          99.97         40
90000     9          99.99         17
100000    10         100.00        7
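The contig counts in this table follow from the Lander and Waterman expression with G = 5,000,000, L = 500 and T = 25. The sketch below recomputes them. The function names are hypothetical, and the "% sequenced" column is assumed to be 1 - e^(-LN/G), which is the standard Poisson coverage estimate and agrees with the values shown, although the slide does not state that formula.

```python
import math


def expected_islands(n_reads: int, read_len: int, genome_len: int, min_overlap: int) -> float:
    """Expected number of islands: N * exp(-(L*N/G) * (1 - T/L))."""
    coverage = read_len * n_reads / genome_len
    effective_fraction = 1 - min_overlap / read_len
    return n_reads * math.exp(-coverage * effective_fraction)


def fraction_sequenced(n_reads: int, read_len: int, genome_len: int) -> float:
    """Assumed formula for the '% sequenced' column: 1 - exp(-coverage)."""
    return 1 - math.exp(-read_len * n_reads / genome_len)


GENOME_LEN = 5_000_000  # 5 Mbp genome
READ_LEN = 500          # 500 bp reads
MIN_OVERLAP = 25        # 25 bp overlap

for n_reads in (2500, 5000, 10000, 20000, 50000, 100000):
    coverage = READ_LEN * n_reads / GENOME_LEN
    percent = 100 * fraction_sequenced(n_reads, READ_LEN, GENOME_LEN)
    islands = expected_islands(n_reads, READ_LEN, GENOME_LEN, MIN_OVERLAP)
    print(f"{n_reads:>7} {coverage:>6.2f} {percent:>7.2f} {islands:>6.0f}")
```

The number of islands first rises, because each new read mostly starts a new contig, and then falls once reads begin to bridge existing contigs. That is why the count peaks near 1x coverage.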

De novo sequence assembly
Traditional WGS assemblers (Sanger and 454):
  Amos (UMD)
  Arachne (Broad)
  Celera assembler (JCVI/UMD)
  Newbler (454)
Short-read assemblers (Illumina and SOLiD):
  SOAPdenovo (BGI)
  SSAKE/ABySS (GSC)
  VCAKE (UNC)
  Velvet (EBI)
  ALLPATHS (Broad)
  Euler-SR (UCSD)

De novo assembly with Newbler
user0@head:~/marino-data/sra$ runAssembly -o /home/user0/marino-data/H_pylori_De_novo -cpu 8 SRR023794.sff

Comparative assembly with Newbler
user0@head:~/marino-data/sra$ runMapping -o /home/user0/marino-data/H_pylori_comparative -cpu 8 H_pylori_sequence.fasta SRR023794.sff

De Novo and Comparative assemblies with Newbler
user0@head:~/marino-data$ cat H_pylori_De_novo/454NewblerMetrics.txt
largeContigMetrics {
  numberOfContigs = 30;
  numberOfBases = 1577109;
  avgContigSize = 52570;
  N50ContigSize = 125596;
  largestContigSize = 327954;
  Q40PlusBases = 1576070, 99.93%;
  Q39MinusBases = 1039, 0.07%;
user0@head:~/marino-data$ cat H_pylori_comparative/454NewblerMetrics.txt
  numberOfContigs = 18;
  numberOfBases = 1560544;
  avgContigSize = 86696;
  N50ContigSize = 126162;
  largestContigSize = 329068;
  Q40PlusBases = 1558823, 99.89%;
  Q39MinusBases = 1721, 0.11%;
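Both reports include an N50ContigSize value. The sketch below shows how an N50 is commonly computed, assuming the usual definition: the contig length at which the contigs of that size or larger together hold at least half of all assembled bases. Whether Newbler applies exactly this definition is an assumption. The function name n50 and the contig lengths are hypothetical and are not taken from the H. pylori assemblies.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (common definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the loop always returns for non-empty input")


# Made-up contig lengths for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

Because N50 weights long contigs more heavily than the mean does, it can stay almost unchanged between two assemblies (125596 versus 126162 above) while the average contig size differs considerably.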

Celera Assembler Hybrid Assembly exercise
user0@head:~/marino-data/sra$ cd ../practice/
user0@head:~/marino-data/practice$ ll
total 194988
-rw-r--r--  1 user0 user0 48489062 Apr 1 18:03 454.frg
drwxr-xr-x  3 user0 user0     4096 Apr 1 18:03 ExomeAnalysis
drwxr-xr-x  2 user0 user0     4096 Apr 1 18:18 alignment
-rw-r--r--  1 user0 user0  2951079 Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user0     4096 Apr 1 18:05 hybrid_CA
-rw-r--r--  1 user0 user0 74104444 Apr 1 18:05 illumina_pe.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:03 read1_illumina.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:06 read2_illumina.fq
user0@head:~/marino-data/practice$
user0@head:~/marino-data/practice$ runCA -p ca -d CA 454.frg

Celera Assembler Hybrid Assembly exercise
user0@head:~/marino-data/practice$ ll
total 195024
-rw-r--r--  1 user0 user0 48489062 Apr 1 18:03 454.frg
drwxr-xr-x 25 user0 user0     4096 Apr 3 17:55 CA
drwxr-xr-x  3 user0 user0     4096 Apr 1 18:03 ExomeAnalysis
drwxr-xr-x  2 user0 user0     4096 Apr 1 18:18 alignment
-rw-r--r--  1 user0 user0  2951079 Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user0     4096 Apr 1 18:05 hybrid_CA
-rw-r--r--  1 user0 user0 74104444 Apr 1 18:05 illumina_pe.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:03 read1_illumina.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:06 read2_illumina.fq
-rw-------  1 user0 user0    29194 Apr 3 17:55 runCA_user0.o207
user0@head:~/marino-data/practice$ cd CA/
user0@head:~/marino-data/practice/CA$ ls
0-mercounts            4-unitigger               7-0-CGW  8-consensus           ca.ovlStore
0-mertrim              5-consensus               7-1-ECR  9-terminator          ca.ovlStore.err
0-overlaptrim          5-consensus-coverage-stat 7-2-CGW  ca.asm                ca.ovlStore.list
0-overlaptrim-overlap  5-consensus-insert-sizes  7-3-ECR  ca.gkpStore           ca.qc
1-overlapper           5-consensus-split         7-4-CGW  ca.gkpStore.err       ca.tigStore
3-overlapcorrection    6-clonesize               7-CGW    ca.gkpStore.errorLog  runCA-logs
user0@head:~/marino-data/practice/CA$ gatekeeper -dumpfrg ca.gkpStore > ca.frg
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 93153 fragments from library IID 1
user0@head:~/marino-data/practice/CA$ toAmos -a ca.asm -f ca.frg -o ca.afg
Max ID: 227679
user0@head:~/marino-data/practice/CA$ mkdir ca.bnk

Celera Assembler Hybrid Assembly exercise
user0@head:~/marino-data/practice/CA$ bank-transact -b ca.bnk -m ca.afg
START DATE: Thu Apr 3 18:05:20 2015
Bank is: ca.bnk
0% 100%
AFG ..................................................
Messages read: 161335
Objects added: 161335
Objects deleted: 0
Objects replaced: 0
END DATE: Thu Apr 3 18:05:21 2015
user0@head:~/marino-data/practice/CA$ hawkeye ca.bnk &
[1] 8469
user0@head:~/marino-data/practice/CA$
Opening ca.bnk... [0.06s]
Indexing Contigs .......... [0.05s] 89659 reads in 239 contigs
Indexing Scaffolds .......... [0.00s] 20 contigs in 5 scaffolds
Indexing Libraries .......... [0.00s] 2 libraries
Indexing Mates .......... [0.03s] 51784 mated reads in 67261 fragments
Indexing Reads .......... [0.04s] 93154 reads
Features not available
Initialize Display
.Loading AssemblyStats...[0.05s]
.Loading Features... [0.00s]
.Loading Libraries... [0.00s]
.Loading Scaffolds... [0.00s]
.Loading Contigs... [0.06s]
....Loading NCharts... [0.00s]
. [0.11s]
Loading Contig 1... [0.00s] 2 reads
Loading reads... [0.00s]
Total Load Time: [0.33s]

Break…

Velvet Assembly exercise
user0@head:~/marino-data/practice/CA$ cd ..
user0@head:~/marino-data/practice$ velveth velvet_asm 31 -fastq -shortPaired illumina_pe.fq
[0.000000] Reading FastQ file illumina_pe.fq;
[2.942968] 304046 sequences found
[2.942990] Done
[2.943043] Reading read set file velvet_asm/Sequences;
[2.985566] 304046 sequences found
[3.212025] Done
[3.212076] 304046 sequences in total.
[3.212138] Writing into roadmap file velvet_asm/Roadmaps...
[3.380384] Inputting sequences...
[3.380423] Inputting sequence 0 / 304046
[7.236115] === Sequences loaded in 3.855738 s
[7.236234] Done inputting sequences
[7.236244] Destroying splay table
[7.246505] Splay table destroyed
user0@head:~/marino-data/practice$ velvetg velvet_asm
[0.000001] Reading roadmap file velvet_asm/Roadmaps
[0.451835] 304046 roadmaps read
[0.452168] Creating insertion markers
[0.513922] Ordering insertion markers
[0.739075] Counting preNodes
[0.783270] 420316 preNodes counted, creating them now
[1.586141] Adjusting marker info...
[12.627305] Concatenation over!
[12.628529] Writing contigs into velvet_asm/contigs.fa...
[12.668053] Writing into stats file velvet_asm/stats.txt...
[12.669974] Writing into graph file velvet_asm/LastGraph...
Final graph has 883 nodes and n50 of 6491, max 21013, total 692956, using 0/304046 reads
user0@head:~/marino-data/practice$
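The value 31 passed to velveth is the hash (k-mer) length. Velvet is a de Bruijn graph assembler, and the idea behind that graph can be shown with a toy example. The sketch below is a textbook simplification, not Velvet's actual data structure: it ignores reverse complements, coverage, and error correction. The function names and the two short reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def kmers(read: str, k: int) -> List[str]:
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a simplified de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer contributes one edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two made-up overlapping reads and a tiny k; real runs use values such as k = 31.
toy_reads = ["ACGTC", "CGTCA"]
for node, successors in sorted(de_bruijn_graph(toy_reads, 3).items()):
    print(node, "->", sorted(successors))
```

In this toy case the edges form a single path (AC, CG, GT, TC, CA), which spells the merged sequence ACGTCA. An assembler reports such unbranched paths as contigs.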

Velvet Assembly exercise
user0@head:~/marino-data/practice$ cd alignment/
user0@head:~/marino-data/practice/alignment$ cp ../hybrid_CA/9-terminator/hybrid.scf.fasta .
user0@head:~/marino-data/practice/alignment$ cp ../velvet_asm/contigs.fa .
user0@head:~/marino-data/practice/alignment$ nucmer -p hybrid_vs_illumina hybrid.scf.fasta contigs.fa
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "hybrid_vs_illumina.ntref" of length 697317
# construct suffix tree for sequence of length 697317
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 6973 characters per dot
#....................................................................................................
# CONSTRUCTIONTIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.14
# reading input file "/home/user0/marino-data/practice/alignment/contigs.fa" of length 702931
# matching query-file "/home/user0/marino-data/practice/alignment/contigs.fa"
# against subject-file "hybrid_vs_illumina.ntref"
# COMPLETETIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.40
# SPACE /usr/bin/mummer hybrid_vs_illumina.ntref 1.37
4: FINISHING DATA
user0@head:~/marino-data/practice/alignment$

Velvet Assembly exercise
user0@head:~/marino-data/practice/alignment$ mummerplot -layout hybrid_vs_illumina.delta

Exome capture alignment & variant calling exercise
user0@head:~/marino-data/practice/alignment$ cd ../ExomeAnalysis/
user0@head:~/marino-data/practice/ExomeAnalysis$
1. BWA alignment
user0@head:~/marino-data/practice/ExomeAnalysis$ bwa aln chr22 Example.1.fastq > Example.1.sai ; bwa aln chr22 Example.2.fastq > Example.2.sai ; bwa sampe chr22 Example.1.sai Example.2.sai Example.1.fastq Example.2.fastq > Example.sam

Exome capture alignment & variant calling exercise
2. BAM conversion, sorting and indexing of files for variant detection
user0@head:~/marino-data/practice/ExomeAnalysis$ samtools view -S -b -o Example.bam Example.sam ; samtools sort Example.bam Example.sorted ; samtools index Example.sorted.bam

Exome capture alignment & variant calling exercise
3. Variant detection using samtools multi-way pileup
user0@head:~/marino-data/practice/ExomeAnalysis$ samtools mpileup -uf chr22.fasta Example.sorted.bam | bcftools view -bvcg - > Example.bcf ; bcftools view Example.bcf | vcfutils.pl varFilter -D100 > Example.vcf
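The -D100 option of vcfutils.pl varFilter sets a maximum read depth of 100, which removes calls in regions with suspiciously high coverage. The sketch below imitates only that one cutoff on VCF text, assuming the depth is stored as DP in the INFO column (column 8). varFilter applies several other default filters that are not reproduced here. The function names and the VCF lines are hypothetical, not output from the exercise.

```python
from typing import Dict, Iterable, Iterator


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column such as 'DP=35;MQ=60' into a dictionary."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag entries without '=' get an empty value
    return fields


def filter_by_max_depth(vcf_lines: Iterable[str], max_depth: int = 100) -> Iterator[str]:
    """Yield header lines and records whose INFO DP does not exceed max_depth."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        columns = line.rstrip("\n").split("\t")
        depth = parse_info(columns[7]).get("DP")
        # Records without a DP entry are kept, since there is nothing to test.
        if depth is None or int(depth) <= max_depth:
            yield line


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr22\t100\t.\tA\tG\t50\t.\tDP=35;MQ=60",
    "chr22\t200\t.\tC\tT\t60\t.\tDP=250;MQ=60",
]
for kept in filter_by_max_depth(example_vcf, max_depth=100):
    print(kept)
```

For exome capture data, on-target depth can legitimately be high, so the choice of cutoff is worth checking against the observed coverage of the data set.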

Exome capture alignment & variant calling exercise
4. Indexing VCF file for IGV display
user0@head:~/marino-data/practice/ExomeAnalysis$ bgzip -c Example.vcf > Example.vcf.gz ; tabix -p vcf Example.vcf.gz
5. IGV display
user0@head:~/marino-data/practice/ExomeAnalysis$ igv.sh &
File > Load Genome…
  Select chr22.genome and click OK
File > Load from File…
  Select Example.sorted.bam and click OK
  Select Example.vcf.gz and click OK
  Select nimbleGen_SeqCapEZ_exome_chr22.bed and click OK
