Next-generation sequencing data analysis using open source software

Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
ICGEB – Practical Course "Bioinformatics: Computer Methods in Molecular Biology"
June 2017

NCBI SRA Portal

NCBI SRA Query

NCBI SRA 454 data

Getting data out of the NCBI SRA

File structure for the Demo
$ ls
igv  marino-data  ncbi  public_html  R  workspace

The FASTQ format

Converting SRA to FASTQ
$ cd marino-data/sra/
$ ls
H_pylori_sequence.fasta  SRR sra
$ fastq-dump --split-spot --skip-technical --clip SRR sra
Written spots for SRR sra
Written spots total
$ head -n 4 SRR fastq
@SRR FMQS1PV02F25Z5 length=86
GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC
+SRR FMQS1PV02F25Z5 length=86
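The four-line record printed by `head -n 4` can be read programmatically. The following is a minimal sketch, assuming strictly four lines per record and Phred+33 quality encoding; the record name, sequence, and quality string in the example are made up for illustration and are not taken from the demo data.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical record, used only to exercise the parser.
example = io.StringIO("@read1 length=8\nGGGTAGGC\n+read1 length=8\nIIIIHH5#\n")
records = list(read_fastq(example))
print(records[0].name, records[0].sequence, records[0].qualities)
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.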

Quality control assessment with FASTQC

Running FASTQC
$ fastqc SRR fastq
Started analysis of SRR fastq
Approx 5% complete for SRR fastq
Approx 10% complete for SRR fastq
Approx 15% complete for SRR fastq
Approx 20% complete for SRR fastq
Approx 25% complete for SRR fastq
Approx 30% complete for SRR fastq
Approx 35% complete for SRR fastq
Approx 40% complete for SRR fastq
Approx 45% complete for SRR fastq
Approx 50% complete for SRR fastq
Approx 55% complete for SRR fastq
Approx 60% complete for SRR fastq
Approx 65% complete for SRR fastq
Approx 70% complete for SRR fastq
Approx 75% complete for SRR fastq
Approx 80% complete for SRR fastq
Approx 85% complete for SRR fastq
Approx 90% complete for SRR fastq
Approx 95% complete for SRR fastq
Analysis complete for SRR fastq
:~/marino-data/sra$

Running FASTQC
Open a web browser and point to (remember to use ***your*** user instead of user0):

Running FASTQC
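One of the plots in a FastQC report is the per-base sequence quality. As a rough idea of what that plot summarises, the sketch below computes the mean quality score at each read position. It assumes Phred+33 encoded quality strings and uses made-up input; FastQC itself reports medians and quartiles as well, so this is an illustration rather than a reimplementation.

```python
from typing import Iterable, List


def mean_quality_by_position(quality_strings: Iterable[str], offset: int = 33) -> List[float]:
    """Return the mean Phred score at each read position (reads may differ in length)."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for position, ch in enumerate(quality):
            if position == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[position] += ord(ch) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings of different lengths.
example_qualities = ["II5", "I5", "I"]
means = mean_quality_by_position(example_qualities)
print(means)
```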

Break…

Converting SRA to SFF (native format)
$ sff-dump SRR sra
Written spots for SRR sra
Written spots total

How many clones should we sequence?
…according to the work of Lander and Waterman (1988), the number of “islands” or contigs formed from randomly collected sequences depends on:
G = Genome Length
L = Sequence Read Length
N = Number of Sequences Collected
T = Number of Basepairs of Overlap Needed

\[ \#\,\text{Islands} = N e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)} \]

5 Mbp Genome, 500 bp reads, 25 bp overlap
# reads coverage % sequenced # contigs
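The columns above can be recomputed from the Lander and Waterman expression. The sketch below is an illustration under stated assumptions: coverage is taken as LN/G, "% sequenced" is assumed to be the Poisson estimate 1 - e^(-coverage), and the read counts in the loop are arbitrary choices, not the values from the original slide.

```python
import math
from typing import Tuple


def lander_waterman(genome_len: float, read_len: float, n_reads: float,
                    min_overlap: float) -> Tuple[float, float, float]:
    """Return (coverage, fraction of genome sequenced, expected number of islands)."""
    coverage = read_len * n_reads / genome_len        # c = LN / G
    sigma = 1 - min_overlap / read_len                # 1 - T / L
    islands = n_reads * math.exp(-coverage * sigma)   # N * e^(-c * sigma)
    # Assumption: "% sequenced" is the Poisson estimate of covered bases.
    fraction_sequenced = 1 - math.exp(-coverage)
    return coverage, fraction_sequenced, islands


# Parameters from the slide: 5 Mbp genome, 500 bp reads, 25 bp overlap.
G, L, T = 5_000_000, 500, 25

print(f"{'# reads':>10} {'coverage':>9} {'% sequenced':>12} {'# contigs':>10}")
for n in (5_000, 10_000, 25_000, 50_000, 100_000):  # arbitrary example read counts
    c, frac, islands = lander_waterman(G, L, n, T)
    print(f"{n:>10} {c:>9.1f} {100 * frac:>12.2f} {islands:>10.0f}")
```

The number of contigs first rises with the number of reads and then falls as islands merge, which is the reason for asking how many clones to sequence.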

De novo sequence assembly
Traditional WGS assemblers (Sanger and 454): Amos (UMD), Arachne (Broad), Celera assembler (JCVI/UMD), Newbler (454)
Short-read assemblers (Illumina and SOLiD): SOAPdenovo (BGI), SSAKE/ABySS (GSC), VCAKE (UNC), Velvet (EBI), ALLPATHS (Broad), Euler-SR (UCSD)

De novo assembly with Newbler
$ runAssembly -o /home/user0/marino-data/H_pylori_De_novo -cpu 8 SRR sff

Comparative assembly with Newbler
$ runMapping -o /home/user0/marino-data/H_pylori_comparative -cpu 8 H_pylori_sequence.fasta SRR sff

De Novo and Comparative assemblies with Newbler
$ cat H_pylori_De_novo/454NewblerMetrics.txt
largeContigMetrics {
numberOfContigs = 30;
numberOfBases = ;
avgContigSize = 52570;
N50ContigSize = ;
largestContigSize = ;
Q40PlusBases = , 99.93%;
Q39MinusBases = 1039, 0.07%;

$ cat H_pylori_comparative/454NewblerMetrics.txt
numberOfContigs = 18;
numberOfBases = ;
avgContigSize = 86696;
N50ContigSize = ;
largestContigSize = ;
Q40PlusBases = , 99.89%;
Q39MinusBases = 1721, 0.11%;
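Both reports include an N50ContigSize entry. As a reminder of what that statistic means, here is a minimal sketch using the common definition: the length of the contig at which the running total, counted from the largest contig down, first reaches half of all assembled bases. The contig lengths in the example are invented, and Newbler's own calculation may differ in detail (for example, in which contigs it counts as "large").

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths: total 300, so half is 150.
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))
```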

Celera Assembler Hybrid Assembly exercise
$ cd ../practice/
$ ll
total
-rw-r--r-- 1 user0 user Apr 1 18: frg
drwxr-xr-x 3 user0 user Apr 1 18:03 ExomeAnalysis
drwxr-xr-x 2 user0 user Apr 1 18:18 alignment
-rw-r--r-- 1 user0 user Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user Apr 1 18:05 hybrid_CA
-rw-r--r-- 1 user0 user Apr 1 18:05 illumina_pe.fq
-rw-r--r-- 1 user0 user Apr 1 18:03 read1_illumina.fq
-rw-r--r-- 1 user0 user Apr 1 18:06 read2_illumina.fq
$ runCA -p ca -d CA 454.frg

$ ll
total
-rw-r--r-- 1 user0 user Apr 1 18: frg
drwxr-xr-x 25 user0 user Apr 3 17:55 CA
drwxr-xr-x 3 user0 user Apr 1 18:03 ExomeAnalysis
drwxr-xr-x 2 user0 user Apr 1 18:18 alignment
-rw-r--r-- 1 user0 user Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user Apr 1 18:05 hybrid_CA
-rw-r--r-- 1 user0 user Apr 1 18:05 illumina_pe.fq
-rw-r--r-- 1 user0 user Apr 1 18:03 read1_illumina.fq
-rw-r--r-- 1 user0 user Apr 1 18:06 read2_illumina.fq
-rw user0 user Apr 3 17:55 runCA_user0.o207
$ cd CA/
$ ls
0-mercounts            unitigger                 CGW      8-consensus           ca.ovlStore
0-mertrim              consensus                 ECR      9-terminator          ca.ovlStore.err
0-overlaptrim          consensus-coverage-stat   7-2-CGW  ca.asm                ca.ovlStore.list
0-overlaptrim-overlap  5-consensus-insert-sizes  ECR      ca.gkpStore           ca.qc
1-overlapper           consensus-split           CGW      ca.gkpStore.err       ca.tigStore
3-overlapcorrection    6-clonesize               CGW      ca.gkpStore.errorLog  runCA-logs
$ gatekeeper -dumpfrg ca.gkpStore > ca.frg
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping fragments from library IID 1
$ toAmos -a ca.asm -f ca.frg -o ca.afg
Max ID:
$ mkdir ca.bnk

$ bank-transact -b ca.bnk -m ca.afg
START DATE: Thu Apr 3 18:05:
Bank is: ca.bnk
0% %
AFG
Messages read:
Objects added:
Objects deleted: 0
Objects replaced: 0
END DATE: Thu Apr 3 18:05:
$ hawkeye ca.bnk &
[1] 8469
Opening ca.bnk... [0.06s]
Indexing Contigs [0.05s] reads in 239 contigs
Indexing Scaffolds [0.00s] 20 contigs in 5 scaffolds
Indexing Libraries [0.00s] 2 libraries
Indexing Mates [0.03s] mated reads in fragments
Indexing Reads [0.04s] reads
Features not available
Initialize Display
.Loading AssemblyStats...[0.05s]
.Loading Features [0.00s]
.Loading Libraries [0.00s]
.Loading Scaffolds [0.00s]
.Loading Contigs [0.06s]
....Loading NCharts [0.00s]
. [0.11s]
Loading Contig 1... [0.00s] 2 reads
Loading reads [0.00s]
Total Load Time: [0.33s]

Break…

Velvet Assembly exercise
$ cd ..
$ velveth velvet_asm 31 -fastq -shortPaired illumina_pe.fq
[ ] Reading FastQ file illumina_pe.fq;
[ ] sequences found
[ ] Done
[ ] Reading read set file velvet_asm/Sequences;
[ ] sequences found
[ ] Done
[ ] sequences in total.
[ ] Writing into roadmap file velvet_asm/Roadmaps...
[ ] Inputting sequences...
[ ] Inputting sequence 0 /
[ ] === Sequences loaded in s
[ ] Done inputting sequences
[ ] Destroying splay table
[ ] Splay table destroyed
$ velvetg velvet_asm
[ ] Reading roadmap file velvet_asm/Roadmaps
[ ] roadmaps read
[ ] Creating insertion markers
[ ] Ordering insertion markers
[ ] Counting preNodes
[ ] preNodes counted, creating them now
[ ] Adjusting marker info...
[ ] Concatenation over!
[ ] Writing contigs into velvet_asm/contigs.fa...
[ ] Writing into stats file velvet_asm/stats.txt...
[ ] Writing into graph file velvet_asm/LastGraph...
Final graph has 883 nodes and n50 of 6491, max 21013, total , using 0/ reads
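The `31` passed to `velveth` is the k-mer (hash) length. To give a feel for what a k-mer based assembler works with, the toy sketch below splits reads into k-mers and links each k-mer's prefix to its suffix, which is the textbook de Bruijn graph construction. It is only an illustration: the reads and k = 4 are made up, and Velvet's real graph additionally handles reverse complements, coverage, and error removal.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def kmers(sequence: str, k: int) -> List[str]:
    """Return every overlapping substring of length k (len(sequence) - k + 1 of them)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_edges(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads and a small k, chosen for readability.
toy_reads = ["ACGTAC", "GTACGG"]
toy_graph = de_bruijn_edges(toy_reads, k=4)
for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

A node with more than one successor (here `ACG`) marks a branch that an assembler has to resolve, for example with paired-end information.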

$ cd alignment/
$ cp ../hybrid_CA/9-terminator/hybrid.scf.fasta .
$ cp ../velvet_asm/contigs.fa .
$ nucmer -p hybrid_vs_illumina hybrid.scf.fasta contigs.fa
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "hybrid_vs_illumina.ntref" of length
# construct suffix tree for sequence of length
# (maximum reference length is )
# (maximum query length is )
# process 6973 characters per dot
#
# CONSTRUCTIONTIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.14
# reading input file "/home/user0/marino-data/practice/alignment/contigs.fa" of length
# matching query-file "/home/user0/marino-data/practice/alignment/contigs.fa"
# against subject-file "hybrid_vs_illumina.ntref"
# COMPLETETIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.40
# SPACE /usr/bin/mummer hybrid_vs_illumina.ntref 1.37
4: FINISHING DATA

$ mummerplot -layout hybrid_vs_illumina.delta

Exome capture alignment & variant calling exercise
$ cd ../ExomeAnalysis/
1. BWA alignment
$ bwa aln chr22 Example.1.fastq > Example.1.sai
$ bwa aln chr22 Example.2.fastq > Example.2.sai
$ bwa sampe chr22 Example.1.sai Example.2.sai Example.1.fastq Example.2.fastq > Example.sam

2. BAM conversion, sorting and indexing of files for variant detection
$ samtools view -S -b -o Example.bam Example.sam
$ samtools sort Example.bam Example.sorted
$ samtools index Example.sorted.bam

3. Variant detection using samtools multi-way pileup
$ samtools mpileup -uf chr22.fasta Example.sorted.bam | bcftools view -bvcg - > Example.bcf
$ bcftools view Example.bcf | vcfutils.pl varFilter -D100 > Example.vcf

4. Indexing VCF file for IGV display
$ bgzip -c Example.vcf > Example.vcf.gz
$ tabix -p vcf Example.vcf.gz

5. IGV display
$ igv.sh &
File → Load Genome…
Select chr22.genome and click Ok
File → Load from File…
Select Example.sorted.bam and click Ok
Select Example.vcf.gz and click Ok
Select nimbleGen_SeqCapEZ_exome_chr22.bed and click Ok
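In step 3, `varFilter -D100` sets a maximum read depth of 100, so calls in regions with unusually deep (often repetitive or mis-mapped) coverage are discarded. The sketch below mimics only that single criterion on VCF text lines, reading depth from the `DP=` key of the INFO column. The helper names and the example records are hypothetical, and `vcfutils.pl` applies several other filters that are not reproduced here.

```python
from typing import Iterable, List, Optional


def info_depth(info: str) -> Optional[int]:
    """Extract the DP value from a VCF INFO string, or None if it is absent."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return None


def filter_max_depth(vcf_lines: Iterable[str], max_depth: int = 100) -> List[str]:
    """Keep header lines and records whose INFO depth does not exceed max_depth."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        columns = line.rstrip("\n").split("\t")
        depth = info_depth(columns[7])  # INFO is the 8th VCF column
        if depth is None or depth <= max_depth:
            kept.append(line)
    return kept


# Hypothetical records: one with moderate depth, one with excessive depth.
example_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr22\t100\t.\tA\tG\t50\t.\tDP=30;AF1=1",
    "chr22\t200\t.\tC\tT\t60\t.\tDP=250;AF1=0.5",
]
filtered = filter_max_depth(example_vcf, max_depth=100)
print("\n".join(filtered))
```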
