Next-generation sequencing data analysis using open source software


1 Next-generation sequencing data analysis using open source software
Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
ICGEB – Practical Course "Bioinformatics: Computer Methods in Molecular Biology"
June 2017

2 NCBI SRA Portal

3 NCBI SRA Query

4 NCBI SRA 454 data

5 NCBI SRA 454 data

6 Getting data out of the NCBI SRA

7 File structure for the Demo
ls igv marino-data ncbi public_html R workspace

8 The FASTQ format

9 Converting SRA to FASTQ
cd marino-data/sra/
ls
H_pylori_sequence.fasta SRR sra
fastq-dump --split-spot --skip-technical --clip SRR sra
Written spots for SRR sra
Written spots total
head -n 4 SRR fastq
@SRR FMQS1PV02F25Z5 length=86
GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC
+SRR FMQS1PV02F25Z5 length=86
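The four-line record layout shown by `head -n 4` can be read programmatically. The following is a minimal sketch, not part of the original slides: it assumes strictly four lines per record and Phred+33 quality encoding, and the record in `example` is a hypothetical one made up for illustration.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a quality string into integer Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical record, not taken from the SRA run used in the demo.
example = "@read1 length=8\nACGTACGT\n+read1 length=8\nIIII5555\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].name, phred_scores(records[0].quality))
```

To try it on the file produced by `fastq-dump`, pass an open file handle instead of the `io.StringIO` object.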

10 Quality control assessment with FASTQC

11 Running FASTQC
fastqc SRR fastq
Started analysis of SRR fastq
Approx 5% complete for SRR fastq
Approx 10% complete for SRR fastq
Approx 15% complete for SRR fastq
Approx 20% complete for SRR fastq
Approx 25% complete for SRR fastq
Approx 30% complete for SRR fastq
Approx 35% complete for SRR fastq
Approx 40% complete for SRR fastq
Approx 45% complete for SRR fastq
Approx 50% complete for SRR fastq
Approx 55% complete for SRR fastq
Approx 60% complete for SRR fastq
Approx 65% complete for SRR fastq
Approx 70% complete for SRR fastq
Approx 75% complete for SRR fastq
Approx 80% complete for SRR fastq
Approx 85% complete for SRR fastq
Approx 90% complete for SRR fastq
Approx 95% complete for SRR fastq
Analysis complete for SRR fastq
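To give a feel for the kind of per-position summary a quality-control report is built on, here is a rough sketch that averages Phred scores at each read position. It is not FastQC's implementation; the Phred+33 offset and the three quality strings are assumptions chosen for illustration.

```python
from collections import defaultdict


def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position across all reads.

    Reads may have different lengths; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for quality in quality_strings:
        for position, symbol in enumerate(quality):
            totals[position] += ord(symbol) - offset
            counts[position] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Hypothetical quality strings ('I' decodes to 40 and '5' to 20 with offset 33).
quals = ["IIII", "II55", "I5"]
means = mean_quality_per_position(quals)
print([round(m, 2) for m in means])
```

A drop in the mean toward the end of the reads is the pattern one typically looks for before deciding whether to trim.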

12 Running FASTQC
Open a web browser and point it to the following address (remember to use ***your*** user instead of user0):

13 Running FASTQC

14 Running FASTQC

15 Running FASTQC

16 Running FASTQC

17 Running FASTQC

18 Running FASTQC

19 Running FASTQC

20 Running FASTQC

21 Running FASTQC

22 Running FASTQC

23 Break…

24 Converting SRA to SFF (native format)
sff-dump SRR sra
Written spots for SRR sra
Written spots total

25 How many clones should we sequence?
…according to the work of Lander and Waterman (1988), the number of “islands” or contigs formed from randomly collected sequences depends on:
G = Genome Length
L = Sequence Read Length
N = Number of Sequences Collected
T = Number of Basepairs of Overlap Needed
$$\#\,\text{Islands} = N e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)}$$

26 5 Mbp Genome, 500 bp reads, 25 bp overlap
# reads coverage % sequenced # contigs
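The columns of this table can be derived from the Lander and Waterman expression with G = 5 Mbp, L = 500 bp and T = 25 bp. The sketch below is an illustration rather than the original slide data: the read counts are arbitrary, and the "% sequenced" column uses the usual Poisson approximation 1 - e^(-LN/G), which is an assumption not stated on the slide.

```python
import math


def lander_waterman(n_reads, genome_len=5_000_000, read_len=500, min_overlap=25):
    """Return (coverage, fraction of genome sequenced, expected number of islands)."""
    coverage = n_reads * read_len / genome_len          # c = LN / G
    sigma = 1 - min_overlap / read_len                  # 1 - T / L
    islands = n_reads * math.exp(-coverage * sigma)     # N * e^(-c * (1 - T/L))
    fraction_sequenced = 1 - math.exp(-coverage)        # Poisson approximation (assumed)
    return coverage, fraction_sequenced, islands


# Arbitrary read counts chosen for illustration only.
print(f"{'# reads':>8} {'coverage':>9} {'% sequenced':>12} {'# contigs':>10}")
for n in (5_000, 10_000, 50_000, 100_000):
    c, f, i = lander_waterman(n)
    print(f"{n:>8} {c:>8.1f}x {100 * f:>11.1f}% {i:>10.0f}")
```

The expected number of islands first rises with the number of reads and then falls as overlapping reads merge, which is why deeper coverage yields fewer contigs.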

27 De novo sequence assembly
Traditional WGS assemblers (Sanger and 454): Amos (UMD), Arachne (Broad), Celera assembler (JCVI/UMD), Newbler (454)
Short-read assemblers (Illumina and SOLiD): SOAPdenovo (BGI), SSAKE/ABySS (GSC), VCAKE (UNC), Velvet (EBI), ALLPATHS (Broad), Euler-SR (UCSD)

28 De novo assembly with Newbler
runAssembly -o /home/user0/marino-data/H_pylori_De_novo -cpu 8 SRR sff

29 Comparative assembly with Newbler
runMapping -o /home/user0/marino-data/H_pylori_comparative -cpu 8 H_pylori_sequence.fasta SRR sff

30 De Novo and Comparative assemblies with Newbler
cat H_pylori_De_novo/454NewblerMetrics.txt
largeContigMetrics {
numberOfContigs = 30;
numberOfBases = ;
avgContigSize = 52570;
N50ContigSize = ;
largestContigSize = ;
Q40PlusBases = , 99.93%;
Q39MinusBases = 1039, 0.07%;
cat H_pylori_comparative/454NewblerMetrics.txt
numberOfContigs = 18;
numberOfBases = ;
avgContigSize = 86696;
N50ContigSize = ;
largestContigSize = ;
Q40PlusBases = , 99.89%;
Q39MinusBases = 1721, 0.11%;
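The N50ContigSize metric reported above summarizes contig contiguity. A minimal sketch of the common definition follows (the length of the contig at which the running total, taken from longest to shortest, first reaches half of all assembled bases). Newbler's exact convention may differ in edge cases, and the contig lengths used here are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (common definition)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, not taken from the H. pylori assemblies.
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))
```

Comparing N50 together with the number of contigs, as the two metric files do, gives a quick sense of which assembly is less fragmented.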

31 Celera Assembler Hybrid Assembly exercise
cd ../practice/
ll
total
-rw-r--r-- 1 user0 user Apr 1 18: frg
drwxr-xr-x 3 user0 user Apr 1 18:03 ExomeAnalysis
drwxr-xr-x 2 user0 user Apr 1 18:18 alignment
-rw-r--r-- 1 user0 user Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user Apr 1 18:05 hybrid_CA
-rw-r--r-- 1 user0 user Apr 1 18:05 illumina_pe.fq
-rw-r--r-- 1 user0 user Apr 1 18:03 read1_illumina.fq
-rw-r--r-- 1 user0 user Apr 1 18:06 read2_illumina.fq
runCA -p ca -d CA 454.frg

32 Celera Assembler Hybrid Assembly exercise
ll
total
-rw-r--r-- 1 user0 user Apr 1 18: frg
drwxr-xr-x 25 user0 user Apr 3 17:55 CA
drwxr-xr-x 3 user0 user Apr 1 18:03 ExomeAnalysis
drwxr-xr-x 2 user0 user Apr 1 18:18 alignment
-rw-r--r-- 1 user0 user Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user Apr 1 18:05 hybrid_CA
-rw-r--r-- 1 user0 user Apr 1 18:05 illumina_pe.fq
-rw-r--r-- 1 user0 user Apr 1 18:03 read1_illumina.fq
-rw-r--r-- 1 user0 user Apr 1 18:06 read2_illumina.fq
-rw user0 user Apr 3 17:55 runCA_user0.o207
cd CA/
ls
0-mercounts unitigger CGW 8-consensus ca.ovlStore
0-mertrim consensus ECR 9-terminator ca.ovlStore.err
0-overlaptrim consensus-coverage-stat 7-2-CGW ca.asm ca.ovlStore.list
0-overlaptrim-overlap 5-consensus-insert-sizes ECR ca.gkpStore ca.qc
1-overlapper consensus-split CGW ca.gkpStore.err ca.tigStore
3-overlapcorrection 6-clonesize CGW ca.gkpStore.errorLog runCA-logs
gatekeeper -dumpfrg ca.gkpStore > ca.frg
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping fragments from library IID 1
toAmos -a ca.asm -f ca.frg -o ca.afg
Max ID:
mkdir ca.bnk

33 Celera Assembler Hybrid Assembly exercise
bank-transact -b ca.bnk -m ca.afg
START DATE: Thu Apr 3 18:05:
Bank is: ca.bnk
0% %
AFG
Messages read:
Objects added:
Objects deleted: 0
Objects replaced: 0
END DATE: Thu Apr 3 18:05:
hawkeye ca.bnk &
[1] 8469
Opening ca.bnk... [0.06s]
Indexing Contigs [0.05s] reads in 239 contigs
Indexing Scaffolds [0.00s] 20 contigs in 5 scaffolds
Indexing Libraries [0.00s] 2 libraries
Indexing Mates [0.03s] mated reads in fragments
Indexing Reads [0.04s] reads
Features not available
Initialize Display
.Loading AssemblyStats...[0.05s]
.Loading Features [0.00s]
.Loading Libraries [0.00s]
.Loading Scaffolds [0.00s]
.Loading Contigs [0.06s]
....Loading NCharts [0.00s]
. [0.11s]
Loading Contig 1... [0.00s] 2 reads
Loading reads [0.00s]
Total Load Time: [0.33s]

34 Celera Assembler Hybrid Assembly exercise

35 Break…

36 Velvet Assembly exercise
cd ..
velveth velvet_asm 31 -fastq -shortPaired illumina_pe.fq
[ ] Reading FastQ file illumina_pe.fq;
[ ] sequences found
[ ] Done
[ ] Reading read set file velvet_asm/Sequences;
[ ] sequences found
[ ] Done
[ ] sequences in total.
[ ] Writing into roadmap file velvet_asm/Roadmaps...
[ ] Inputting sequences...
[ ] Inputting sequence 0 /
[ ] === Sequences loaded in s
[ ] Done inputting sequences
[ ] Destroying splay table
[ ] Splay table destroyed
velvetg velvet_asm
[ ] Reading roadmap file velvet_asm/Roadmaps
[ ] roadmaps read
[ ] Creating insertion markers
[ ] Ordering insertion markers
[ ] Counting preNodes
[ ] preNodes counted, creating them now
[ ] Adjusting marker info...
[ ] Concatenation over!
[ ] Writing contigs into velvet_asm/contigs.fa...
[ ] Writing into stats file velvet_asm/stats.txt...
[ ] Writing into graph file velvet_asm/LastGraph...
Final graph has 883 nodes and n50 of 6491, max 21013, total , using 0/ reads
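The value 31 passed to `velveth` is the hash length, that is, the k-mer size used to build the de Bruijn graph. The sketch below only illustrates how a read is broken into overlapping k-mers; it is not Velvet's code, and the read and the small k used in the demonstration are hypothetical, picked so the output stays readable.

```python
def kmers(sequence, k=31):
    """Return every overlapping k-mer of a sequence, in order.

    A read shorter than k contributes no k-mers.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 10-base read with k = 4 for readability (the exercise uses k = 31).
read = "ACGTACGTAC"
words = kmers(read, k=4)
print(len(words), words)

# Consecutive k-mers overlap by k - 1 bases, which is what links them in the graph.
for left, right in zip(words, words[1:]):
    assert left[1:] == right[:-1]
```

A read of length L yields L - k + 1 k-mers, so larger hash lengths give fewer, more specific k-mers per read, while smaller ones give more overlap between reads at the cost of more ambiguity in repeats.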

37 Velvet Assembly exercise
cd alignment/
cp ../hybrid_CA/9-terminator/hybrid.scf.fasta .
cp ../velvet_asm/contigs.fa .
nucmer -p hybrid_vs_illumina hybrid.scf.fasta contigs.fa
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "hybrid_vs_illumina.ntref" of length
# construct suffix tree for sequence of length
# (maximum reference length is )
# (maximum query length is )
# process 6973 characters per dot
#
# CONSTRUCTIONTIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.14
# reading input file "/home/user0/marino-data/practice/alignment/contigs.fa" of length
# matching query-file "/home/user0/marino-data/practice/alignment/contigs.fa"
# against subject-file "hybrid_vs_illumina.ntref"
# COMPLETETIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.40
# SPACE /usr/bin/mummer hybrid_vs_illumina.ntref 1.37
4: FINISHING DATA

38 Velvet Assembly exercise
mummerplot -layout hybrid_vs_illumina.delta

39 Exome capture alignment & variant calling exercise
cd ../ExomeAnalysis/
1. BWA alignment
bwa aln chr22 Example.1.fastq > Example.1.sai
bwa aln chr22 Example.2.fastq > Example.2.sai
bwa sampe chr22 Example.1.sai Example.2.sai Example.1.fastq Example.2.fastq > Example.sam

40 Exome capture alignment & variant calling exercise
2. BAM conversion, sorting and indexing of files for variant detection
samtools view -S -b -o Example.bam Example.sam
samtools sort Example.bam Example.sorted
samtools index Example.sorted.bam

41 Exome capture alignment & variant calling exercise
3. Variant detection using samtools multi-way pileup
samtools mpileup -uf chr22.fasta Example.sorted.bam | bcftools view -bvcg - > Example.bcf
bcftools view Example.bcf | vcfutils.pl varFilter -D100 > Example.vcf
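The `-D100` option asks `varFilter` to discard calls whose read depth exceeds 100, since unusually deep regions often reflect repeats or mapping artifacts. The sketch below mimics only that single maximum-depth criterion by reading the DP tag of the INFO column. The real script applies several other filters and may derive depth differently, and the two VCF records here are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=30;INDEL' into a dictionary."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flags without '=' are stored as True
    return info


def filter_by_max_depth(vcf_lines, max_depth=100):
    """Keep data lines whose INFO/DP is at most max_depth; header lines are skipped."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])  # INFO is the eighth tab-separated column
        depth = int(info.get("DP", 0))
        if depth <= max_depth:
            kept.append(line)
    return kept


# Hypothetical records for illustration, not taken from Example.vcf.
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t100\t.\tA\tG\t50\t.\tDP=30\n",
    "chr22\t200\t.\tC\tT\t60\t.\tDP=250;INDEL\n",
]
kept = filter_by_max_depth(vcf_lines)
print(kept)
```

The threshold should be chosen relative to the expected coverage of the experiment; 100 is simply the value used in this exercise.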

42 Exome capture alignment & variant calling exercise
4. Indexing VCF file for IGV display
bgzip -c Example.vcf > Example.vcf.gz
tabix -p vcf Example.vcf.gz
5. IGV display
igv.sh &
File > Load Genome…
Select chr22.genome and click Ok
File > Load from File…
Select Example.sorted.bam and click Ok
Select Example.vcf.gz and click Ok
Select nimbleGen_SeqCapEZ_exome_chr22.bed and click Ok

43 Exome capture alignment & variant calling exercise

