Next-generation sequencing data analysis using open source software
Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
ICGEB Practical Course "Bioinformatics: Computer Methods in Molecular Biology"
June 26-30, 2017
NCBI SRA Portal
NCBI SRA Query
NCBI SRA 454 data
Getting data out of the NCBI SRA
File structure for the Demo
user0@head:~$ ls
igv  marino-data  ncbi  public_html  R  workspace
user0@head:~$
The FASTQ format http://en.wikipedia.org/wiki/FASTQ_format
Converting SRA to FASTQ
user0@head:~$ cd marino-data/sra/
user0@head:~/marino-data/sra$ ls
H_pylori_sequence.fasta  SRR023794.sra
user0@head:~/marino-data/sra$ fastq-dump --split-spot --skip-technical --clip SRR023794.sra
Written 231208 spots for SRR023794.sra
Written 231208 spots total
user0@head:~/marino-data/sra$ head -n 4 SRR023794.fastq
@SRR023794.1 FMQS1PV02F25Z5 length=86
GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC
+SRR023794.1 FMQS1PV02F25Z5 length=86
7779@EEEFFFFFFFFFFFFFFFFEE55000@8E88::EEFFFFFFEDAAAEEFFFFFFFFFFFFFFF;==;=ADFFFFFFFFFFF
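The four-line FASTQ record shown above (header, bases, separator, qualities) can be read with a few lines of Python. This is a minimal sketch, assuming the Phred+33 quality encoding that fastq-dump writes by default. The function name `parse_fastq` and the shortened record `example` are illustrative assumptions, built from the first ten bases and quality characters of the read above.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


# Hypothetical record: first 10 bases and quality characters of the read above.
example = [
    "@read1 length=10",
    "GGGTAGGCAC",
    "+read1 length=10",
    "7779@EEEFF",
]

for read_id, seq, scores in parse_fastq(example):
    print(read_id, seq, scores)
```

Decoding the qualities this way is what FASTQC does internally before it draws the per-base quality plots, so the sketch is a convenient way to inspect a single read by hand.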
Quality control assessment with FASTQC http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Running FASTQC
user0@head:~/marino-data/sra$ fastqc SRR023794.fastq
Started analysis of SRR023794.fastq
Approx 5% complete for SRR023794.fastq
Approx 10% complete for SRR023794.fastq
Approx 15% complete for SRR023794.fastq
Approx 20% complete for SRR023794.fastq
Approx 25% complete for SRR023794.fastq
Approx 30% complete for SRR023794.fastq
Approx 35% complete for SRR023794.fastq
Approx 40% complete for SRR023794.fastq
Approx 45% complete for SRR023794.fastq
Approx 50% complete for SRR023794.fastq
Approx 55% complete for SRR023794.fastq
Approx 60% complete for SRR023794.fastq
Approx 65% complete for SRR023794.fastq
Approx 70% complete for SRR023794.fastq
Approx 75% complete for SRR023794.fastq
Approx 80% complete for SRR023794.fastq
Approx 85% complete for SRR023794.fastq
Approx 90% complete for SRR023794.fastq
Approx 95% complete for SRR023794.fastq
Analysis complete for SRR023794.fastq
user0@head:~/marino-data/sra$
Running FASTQC
Open a web browser and go to the following address (remember to use your own user name instead of user0):
http://23.251.138.125/~user0/
Break…
Converting SRA to SFF (native format)
user0@head:~/marino-data/sra$ sff-dump SRR023794.sra
Written 231208 spots for SRR023794.sra
Written 231208 spots total
user0@head:~/marino-data/sra$
How many clones should we sequence?
According to the work of Lander and Waterman (1988), the number of "islands" or contigs formed from randomly collected sequences depends on:
G = Genome Length
L = Sequence Read Length
N = Number of Sequences Collected
T = Number of Basepairs of Overlap Needed
$$\#\,\text{Islands} = N e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)}$$
5 Mbp Genome, 500 bp reads, 25 bp overlap
# reads   coverage   % sequenced   # contigs
2500      0.25       22.12         1971
5000      0.5        39.35         3109
10000     1          63.21         3867
20000     2          86.47         2991
30000     3          95.02         1735
40000     4          98.17         895
50000     5          99.33         433
60000     6          99.75         201
70000     7          99.91         91
80000     8          99.97         40
90000     9          99.99         17
100000    10         100.00        7
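The contig counts in this table follow directly from the Lander and Waterman expression, with coverage defined as LN/G. The sketch below recomputes them. The function names are illustrative, and the "% sequenced" column is assumed here to be 1 - e^(-coverage), which is the standard Poisson result and agrees with the tabulated values.

```python
import math


def expected_islands(n_reads: int, genome_len: float, read_len: float,
                     min_overlap: float) -> float:
    """Expected number of islands: N * exp(-(L*N/G) * (1 - T/L))."""
    coverage = read_len * n_reads / genome_len
    return n_reads * math.exp(-coverage * (1 - min_overlap / read_len))


def fraction_sequenced(n_reads: int, genome_len: float, read_len: float) -> float:
    """Assumed formula for the '% sequenced' column: 1 - exp(-coverage)."""
    coverage = read_len * n_reads / genome_len
    return 1 - math.exp(-coverage)


# Parameters from the slide: 5 Mbp genome, 500 bp reads, 25 bp overlap.
G, L, T = 5_000_000, 500, 25

for n in (2500, 5000, 10000, 20000, 50000, 100000):
    cov = L * n / G
    pct = 100 * fraction_sequenced(n, G, L)
    islands = expected_islands(n, G, L, T)
    print(f"{n:>7} {cov:>6.2f} {pct:>7.2f} {islands:>8.0f}")
```

The number of contigs peaks near 1x coverage and then falls as additional reads join existing islands, which is why the table rises to 3867 before dropping toward a handful of contigs at 10x.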
De novo sequence assembly
Traditional WGS assemblers (Sanger and 454):
Amos (UMD)
Arachne (Broad)
Celera assembler (JCVI/UMD)
Newbler (454)
Short-read assemblers (Illumina and SOLiD):
SOAPdenovo (BGI)
SSAKE/ABySS (GSC)
VCAKE (UNC)
Velvet (EBI)
ALLPATHS (Broad)
Euler-SR (UCSD)
De novo assembly with Newbler
user0@head:~/marino-data/sra$ runAssembly -o /home/user0/marino-data/H_pylori_De_novo -cpu 8 SRR023794.sff
Comparative assembly with Newbler
user0@head:~/marino-data/sra$ runMapping -o /home/user0/marino-data/H_pylori_comparative -cpu 8 H_pylori_sequence.fasta SRR023794.sff
De Novo and Comparative assemblies with Newbler
user0@head:~/marino-data$ cat H_pylori_De_novo/454NewblerMetrics.txt
largeContigMetrics {
    numberOfContigs = 30;
    numberOfBases = 1577109;
    avgContigSize = 52570;
    N50ContigSize = 125596;
    largestContigSize = 327954;
    Q40PlusBases = 1576070, 99.93%;
    Q39MinusBases = 1039, 0.07%;
user0@head:~/marino-data$ cat H_pylori_comparative/454NewblerMetrics.txt
    numberOfContigs = 18;
    numberOfBases = 1560544;
    avgContigSize = 86696;
    N50ContigSize = 126162;
    largestContigSize = 329068;
    Q40PlusBases = 1558823, 99.89%;
    Q39MinusBases = 1721, 0.11%;
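The N50ContigSize values above summarize contiguity: N50 is the contig length at which contigs of that size or longer contain at least half of the assembled bases. The sketch below uses that usual definition. Whether Newbler computes its metric in exactly this way is an assumption, and the contig lengths in the example are hypothetical.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (usual definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return lengths[-1]  # not reached for positive lengths


# Hypothetical contig lengths, for illustration only.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70: contigs of 80 and 70 cover 150 of 290 bases
```

A larger N50 with fewer contigs, as in the comparative assembly above, indicates a more contiguous result for a similar number of assembled bases.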
Celera Assembler Hybrid Assembly exercise
user0@head:~/marino-data/sra$ cd ../practice/
user0@head:~/marino-data/practice$ ll
total 194988
-rw-r--r--  1 user0 user0 48489062 Apr 1 18:03 454.frg
drwxr-xr-x  3 user0 user0     4096 Apr 1 18:03 ExomeAnalysis
drwxr-xr-x  2 user0 user0     4096 Apr 1 18:18 alignment
-rw-r--r--  1 user0 user0  2951079 Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user0     4096 Apr 1 18:05 hybrid_CA
-rw-r--r--  1 user0 user0 74104444 Apr 1 18:05 illumina_pe.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:03 read1_illumina.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:06 read2_illumina.fq
user0@head:~/marino-data/practice$
user0@head:~/marino-data/practice$ runCA -p ca -d CA 454.frg
Celera Assembler Hybrid Assembly exercise
user0@head:~/marino-data/practice$ ll
total 195024
-rw-r--r--  1 user0 user0 48489062 Apr 1 18:03 454.frg
drwxr-xr-x 25 user0 user0     4096 Apr 3 17:55 CA
drwxr-xr-x  3 user0 user0     4096 Apr 1 18:03 ExomeAnalysis
drwxr-xr-x  2 user0 user0     4096 Apr 1 18:18 alignment
-rw-r--r--  1 user0 user0  2951079 Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user0     4096 Apr 1 18:05 hybrid_CA
-rw-r--r--  1 user0 user0 74104444 Apr 1 18:05 illumina_pe.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:03 read1_illumina.fq
-rw-r--r--  1 user0 user0 37052222 Apr 1 18:06 read2_illumina.fq
-rw-------  1 user0 user0    29194 Apr 3 17:55 runCA_user0.o207
user0@head:~/marino-data/practice$ cd CA/
user0@head:~/marino-data/practice/CA$ ls
0-mercounts            4-unitigger               7-0-CGW  8-consensus           ca.ovlStore
0-mertrim              5-consensus               7-1-ECR  9-terminator          ca.ovlStore.err
0-overlaptrim          5-consensus-coverage-stat 7-2-CGW  ca.asm                ca.ovlStore.list
0-overlaptrim-overlap  5-consensus-insert-sizes  7-3-ECR  ca.gkpStore           ca.qc
1-overlapper           5-consensus-split         7-4-CGW  ca.gkpStore.err       ca.tigStore
3-overlapcorrection    6-clonesize               7-CGW    ca.gkpStore.errorLog  runCA-logs
user0@head:~/marino-data/practice/CA$ gatekeeper -dumpfrg ca.gkpStore > ca.frg
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 93153 fragments from library IID 1
user0@head:~/marino-data/practice/CA$ toAmos -a ca.asm -f ca.frg -o ca.afg
Max ID: 227679
user0@head:~/marino-data/practice/CA$ mkdir ca.bnk
Celera Assembler Hybrid Assembly exercise
user0@head:~/marino-data/practice/CA$ bank-transact -b ca.bnk -m ca.afg
START DATE: Thu Apr 3 18:05:20 2015
Bank is: ca.bnk
0% 100%
AFG ..................................................
Messages read: 161335
Objects added: 161335
Objects deleted: 0
Objects replaced: 0
END DATE: Thu Apr 3 18:05:21 2015
user0@head:~/marino-data/practice/CA$ hawkeye ca.bnk &
[1] 8469
user0@head:~/marino-data/practice/CA$ Opening ca.bnk... [0.06s]
Indexing Contigs .......... [0.05s] 89659 reads in 239 contigs
Indexing Scaffolds .......... [0.00s] 20 contigs in 5 scaffolds
Indexing Libraries .......... [0.00s] 2 libraries
Indexing Mates .......... [0.03s] 51784 mated reads in 67261 fragments
Indexing Reads .......... [0.04s] 93154 reads
Features not available
Initialize Display
.Loading AssemblyStats...[0.05s]
.Loading Features... [0.00s]
.Loading Libraries... [0.00s]
.Loading Scaffolds... [0.00s]
.Loading Contigs... [0.06s]
....Loading NCharts... [0.00s]
. [0.11s]
Loading Contig 1... [0.00s] 2 reads
Loading reads... [0.00s]
Total Load Time: [0.33s]
Break…
Velvet Assembly exercise
user0@head:~/marino-data/practice/CA$ cd ..
user0@head:~/marino-data/practice$ velveth velvet_asm 31 -fastq -shortPaired illumina_pe.fq
[0.000000] Reading FastQ file illumina_pe.fq;
[2.942968] 304046 sequences found
[2.942990] Done
[2.943043] Reading read set file velvet_asm/Sequences;
[2.985566] 304046 sequences found
[3.212025] Done
[3.212076] 304046 sequences in total.
[3.212138] Writing into roadmap file velvet_asm/Roadmaps...
[3.380384] Inputting sequences...
[3.380423] Inputting sequence 0 / 304046
[7.236115] === Sequences loaded in 3.855738 s
[7.236234] Done inputting sequences
[7.236244] Destroying splay table
[7.246505] Splay table destroyed
user0@head:~/marino-data/practice$ velvetg velvet_asm
[0.000001] Reading roadmap file velvet_asm/Roadmaps
[0.451835] 304046 roadmaps read
[0.452168] Creating insertion markers
[0.513922] Ordering insertion markers
[0.739075] Counting preNodes
[0.783270] 420316 preNodes counted, creating them now
[1.586141] Adjusting marker info...
[12.627305] Concatenation over!
[12.628529] Writing contigs into velvet_asm/contigs.fa...
[12.668053] Writing into stats file velvet_asm/stats.txt...
[12.669974] Writing into graph file velvet_asm/LastGraph...
Final graph has 883 nodes and n50 of 6491, max 21013, total 692956, using 0/304046 reads
user0@head:~/marino-data/practice$
Velvet Assembly exercise
user0@head:~/marino-data/practice$ cd alignment/
user0@head:~/marino-data/practice/alignment$ cp ../hybrid_CA/9-terminator/hybrid.scf.fasta .
user0@head:~/marino-data/practice/alignment$ cp ../velvet_asm/contigs.fa .
user0@head:~/marino-data/practice/alignment$ nucmer -p hybrid_vs_illumina hybrid.scf.fasta contigs.fa
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "hybrid_vs_illumina.ntref" of length 697317
# construct suffix tree for sequence of length 697317
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 6973 characters per dot
#....................................................................................................
# CONSTRUCTIONTIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.14
# reading input file "/home/user0/marino-data/practice/alignment/contigs.fa" of length 702931
# matching query-file "/home/user0/marino-data/practice/alignment/contigs.fa"
# against subject-file "hybrid_vs_illumina.ntref"
# COMPLETETIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.40
# SPACE /usr/bin/mummer hybrid_vs_illumina.ntref 1.37
4: FINISHING DATA
user0@head:~/marino-data/practice/alignment$
Velvet Assembly exercise
user0@head:~/marino-data/practice/alignment$ mummerplot -layout hybrid_vs_illumina.delta
Exome capture alignment & variant calling exercise
user0@head:~/marino-data/practice/alignment$ cd ../ExomeAnalysis/
user0@head:~/marino-data/practice/ExomeAnalysis$
1. BWA alignment
user0@head:~/marino-data/practice/ExomeAnalysis$ bwa aln chr22 Example.1.fastq > Example.1.sai ; bwa aln chr22 Example.2.fastq > Example.2.sai ; bwa sampe chr22 Example.1.sai Example.2.sai Example.1.fastq Example.2.fastq > Example.sam
Exome capture alignment & variant calling exercise
2. BAM conversion, sorting and indexing of files for variant detection
user0@head:~/marino-data/practice/ExomeAnalysis$ samtools view -S -b -o Example.bam Example.sam ; samtools sort Example.bam Example.sorted ; samtools index Example.sorted.bam
Exome capture alignment & variant calling exercise
3. Variant detection using samtools multi-way pileup
user0@head:~/marino-data/practice/ExomeAnalysis$ samtools mpileup -uf chr22.fasta Example.sorted.bam | bcftools view -bvcg - > Example.bcf ; bcftools view Example.bcf | vcfutils.pl varFilter -D100 > Example.vcf
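The -D100 option passed to varFilter sets a maximum read depth, so sites covered by more than 100 reads are discarded as likely artifacts. The sketch below illustrates only that one criterion. It is a simplified stand-in rather than a reimplementation of vcfutils.pl, which applies several other filters. It assumes the depth is stored under the DP key of the INFO column, and the function names and VCF lines are hypothetical.

```python
from typing import List, Optional


def info_depth(vcf_line: str) -> Optional[int]:
    """Return the DP value from the INFO column (8th field), if present."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return None
    for entry in fields[7].split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return None


def filter_max_depth(lines: List[str], max_depth: int = 100) -> List[str]:
    """Keep header lines and records whose depth does not exceed max_depth."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        depth = info_depth(line)
        if depth is None or depth <= max_depth:
            kept.append(line)
    return kept


# Hypothetical VCF content, for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr22\t100\t.\tA\tG\t50\t.\tDP=35;AF1=1",
    "chr22\t200\t.\tC\tT\t60\t.\tDP=250;AF1=0.5",
]

for line in filter_max_depth(vcf_lines, max_depth=100):
    print(line)
```

In exome capture data, depth varies widely between targets, so a cap like this is usually chosen relative to the observed coverage rather than fixed in advance.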
Exome capture alignment & variant calling exercise
4. Indexing VCF file for IGV display
user0@head:~/marino-data/practice/ExomeAnalysis$ bgzip -c Example.vcf > Example.vcf.gz ; tabix -p vcf Example.vcf.gz
5. IGV display
user0@head:~/marino-data/practice/ExomeAnalysis$ igv.sh &
File > Load Genome…
Select chr22.genome and click Ok
File > Load from File…
Select Example.sorted.bam and click Ok
Select Example.vcf.gz and click Ok
Select nimbleGen_SeqCapEZ_exome_chr22.bed and click Ok