Published by Charleen Wilkerson. Modified over 6 years ago.
1
Next-generation sequencing data analysis using open source software
Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
ICGEB – Practical Course “Bioinformatics: Computer Methods in Molecular Biology”
June / 2017
2
NCBI SRA Portal
3
NCBI SRA Query
4
NCBI SRA 454 data
5
NCBI SRA 454 data
6
Getting data out of the NCBI SRA
7
File structure for the Demo
ls
igv marino-data ncbi public_html R workspace
8
The FASTQ format
9
Converting SRA to FASTQ
cd marino-data/sra/
ls
H_pylori_sequence.fasta SRR sra
fastq-dump --split-spot --skip-technical --clip SRR sra
Written spots for SRR sra
Written spots total
head -n 4 SRR fastq
@SRR FMQS1PV02F25Z5 length=86
GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC
+SRR FMQS1PV02F25Z5 length=86
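The four-line record layout that head -n 4 displays (header starting with @, bases, a separator starting with +, and one quality character per base) can be read with a few lines of code. The following is a minimal sketch, not part of the course material; the function name read_fastq and the two example reads are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header text after the leading '@'
    sequence: str  # base calls
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line) ends the stream
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example; these reads are NOT from the course data.
EXAMPLE = (
    "@read1 length=4\nACGT\n+read1 length=4\nIIII\n"
    "@read2 length=3\nGGA\n+\n#5I\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.name, len(record.sequence))
```

To read a real file, pass an open file handle instead of the in-memory example.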
10
Quality control assessment with FASTQC
11
Running FASTQC
fastqc SRR fastq
Started analysis of SRR fastq
Approx 5% complete for SRR fastq
Approx 10% complete for SRR fastq
Approx 15% complete for SRR fastq
Approx 20% complete for SRR fastq
Approx 25% complete for SRR fastq
Approx 30% complete for SRR fastq
Approx 35% complete for SRR fastq
Approx 40% complete for SRR fastq
Approx 45% complete for SRR fastq
Approx 50% complete for SRR fastq
Approx 55% complete for SRR fastq
Approx 60% complete for SRR fastq
Approx 65% complete for SRR fastq
Approx 70% complete for SRR fastq
Approx 75% complete for SRR fastq
Approx 80% complete for SRR fastq
Approx 85% complete for SRR fastq
Approx 90% complete for SRR fastq
Approx 95% complete for SRR fastq
Analysis complete for SRR fastq
:~/marino-data/sra$
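One of the summaries a quality-control report gives is the quality score at each read position. The sketch below illustrates the idea only and is not how FastQC itself is implemented; it assumes Phred+33 quality encoding, and the function names and the two quality strings are hypothetical.

```python
from typing import List, Sequence

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) quality encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def mean_quality_per_position(qualities: Sequence[str]) -> List[float]:
    """Average the Phred score at each read position across all reads.

    Reads may differ in length; each position is averaged over the reads
    that are long enough to reach it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for quality in qualities:
        for position, score in enumerate(phred_scores(quality)):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings, not taken from the course data.
example_qualities = ["II5", "I5"]
means = mean_quality_per_position(example_qualities)
print(means)
```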
12
Running FASTQC
Open a web browser and point to (remember to use ***your*** user instead of user0):
13
Running FASTQC
14
Running FASTQC
15
Running FASTQC
16
Running FASTQC
17
Running FASTQC
18
Running FASTQC
19
Running FASTQC
20
Running FASTQC
21
Running FASTQC
22
Running FASTQC
23
Break…
24
Converting SRA to SFF (native format)
sff-dump SRR sra
Written spots for SRR sra
Written spots total
25
How many clones should we sequence?
…according to the work of Lander and Waterman (1988), the number of “islands” or contigs formed from randomly collected sequences depends on:
G = Genome Length
L = Sequence Read Length
N = Number of Sequences Collected
T = Number of Basepairs of Overlap Needed
$$\#\,\text{Islands} = N\,e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)}$$
26
5 Mbp Genome, 500 bp reads, 25 bp overlap
# reads coverage % sequenced # contigs
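The values of this table did not survive in the transcript. As a rough sketch, the columns can be recomputed from the Lander and Waterman expression with G = 5 Mbp, L = 500 bp and T = 25 bp. The read counts below are hypothetical choices, not the ones on the original slide, and the "% sequenced" column uses the usual Poisson approximation 1 - e^(-coverage), which is an assumption about how the slide computed it.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average depth of coverage, c = L * N / G."""
    return n_reads * read_len / genome_len


def fraction_sequenced(depth: float) -> float:
    """Poisson approximation of the genome fraction covered by at least one read."""
    return 1.0 - math.exp(-depth)


def expected_islands(n_reads: int, read_len: int, genome_len: int, overlap: int) -> float:
    """Lander-Waterman expected number of islands: N * exp(-(LN/G) * (1 - T/L))."""
    depth = coverage(n_reads, read_len, genome_len)
    sigma = 1.0 - overlap / read_len
    return n_reads * math.exp(-depth * sigma)


GENOME_LEN = 5_000_000  # 5 Mbp genome (from the slide title)
READ_LEN = 500          # 500 bp reads (from the slide title)
OVERLAP = 25            # 25 bp overlap (from the slide title)

print(f"{'# reads':>8} {'coverage':>9} {'% sequenced':>12} {'# contigs':>10}")
for n_reads in (5_000, 10_000, 50_000, 100_000):  # hypothetical read counts
    depth = coverage(n_reads, READ_LEN, GENOME_LEN)
    percent = 100.0 * fraction_sequenced(depth)
    islands = expected_islands(n_reads, READ_LEN, GENOME_LEN, OVERLAP)
    print(f"{n_reads:>8} {depth:>9.1f} {percent:>12.2f} {islands:>10.1f}")
```

The expected number of islands first rises with the number of reads and then falls as overlapping reads merge into longer contigs.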
27
De novo sequence assembly
Traditional WGS assemblers (Sanger and 454):
Amos (UMD)
Arachne (Broad)
Celera assembler (JCVI/UMD)
Newbler (454)
Short-read assemblers (Illumina and SOLiD):
SOAPdenovo (BGI)
SSAKE/ABySS (GSC)
VCAKE (UNC)
Velvet (EBI)
ALLPATHS (Broad)
Euler-SR (UCSD)
28
De novo assembly with Newbler
runAssembly -o /home/user0/marino-data/H_pylori_De_novo -cpu 8 SRR sff
29
Comparative assembly with Newbler
runMapping -o /home/user0/marino-data/H_pylori_comparative -cpu 8 H_pylori_sequence.fasta SRR sff
30
De Novo and Comparative assemblies with Newbler
cat H_pylori_De_novo/454NewblerMetrics.txt
largeContigMetrics {
numberOfContigs = 30;
numberOfBases = ;
avgContigSize = 52570;
N50ContigSize = ;
largestContigSize = ;
Q40PlusBases = , 99.93%;
Q39MinusBases = 1039, 0.07%;
cat H_pylori_comparative/454NewblerMetrics.txt
numberOfContigs = 18;
numberOfBases = ;
avgContigSize = 86696;
N50ContigSize = ;
largestContigSize = ;
Q40PlusBases = , 99.89%;
Q39MinusBases = 1721, 0.11%;
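The N50ContigSize entries summarize contiguity: N50 is commonly defined as the contig length at which contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of that common definition follows; Newbler's exact computation may differ, and the contig lengths used here are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches half of the total assembled bases; the length of
    the contig that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


# Hypothetical contig lengths, not taken from the Newbler output.
example_lengths = [2, 4, 6, 8, 10]
print(n50(example_lengths))
```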
31
Celera Assembler Hybrid Assembly exercise
cd ../practice/
ll
total
-rw-r--r-- 1 user0 user Apr 1 18: frg
drwxr-xr-x 3 user0 user Apr 1 18:03 ExomeAnalysis
drwxr-xr-x 2 user0 user Apr 1 18:18 alignment
-rw-r--r-- 1 user0 user Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user Apr 1 18:05 hybrid_CA
-rw-r--r-- 1 user0 user Apr 1 18:05 illumina_pe.fq
-rw-r--r-- 1 user0 user Apr 1 18:03 read1_illumina.fq
-rw-r--r-- 1 user0 user Apr 1 18:06 read2_illumina.fq
runCA -p ca -d CA 454.frg
32
Celera Assembler Hybrid Assembly exercise
ll
total
-rw-r--r-- 1 user0 user Apr 1 18: frg
drwxr-xr-x 25 user0 user Apr 3 17:55 CA
drwxr-xr-x 3 user0 user Apr 1 18:03 ExomeAnalysis
drwxr-xr-x 2 user0 user Apr 1 18:18 alignment
-rw-r--r-- 1 user0 user Apr 1 18:05 genome.fasta
drwxr-xr-x 20 user0 user Apr 1 18:05 hybrid_CA
-rw-r--r-- 1 user0 user Apr 1 18:05 illumina_pe.fq
-rw-r--r-- 1 user0 user Apr 1 18:03 read1_illumina.fq
-rw-r--r-- 1 user0 user Apr 1 18:06 read2_illumina.fq
-rw user0 user Apr 3 17:55 runCA_user0.o207
cd CA/
ls
0-mercounts unitigger CGW 8-consensus ca.ovlStore
0-mertrim consensus ECR 9-terminator ca.ovlStore.err
0-overlaptrim consensus-coverage-stat 7-2-CGW ca.asm ca.ovlStore.list
0-overlaptrim-overlap 5-consensus-insert-sizes ECR ca.gkpStore ca.qc
1-overlapper consensus-split CGW ca.gkpStore.err ca.tigStore
3-overlapcorrection 6-clonesize CGW ca.gkpStore.errorLog runCA-logs
gatekeeper -dumpfrg ca.gkpStore > ca.frg
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping fragments from library IID 1
toAmos -a ca.asm -f ca.frg -o ca.afg
Max ID:
mkdir ca.bnk
33
Celera Assembler Hybrid Assembly exercise
bank-transact -b ca.bnk -m ca.afg
START DATE: Thu Apr 3 18:05:
Bank is: ca.bnk
0% %
AFG
Messages read:
Objects added:
Objects deleted: 0
Objects replaced: 0
END DATE: Thu Apr 3 18:05:
hawkeye ca.bnk &
[1] 8469
Opening ca.bnk... [0.06s]
Indexing Contigs [0.05s] reads in 239 contigs
Indexing Scaffolds [0.00s] 20 contigs in 5 scaffolds
Indexing Libraries [0.00s] 2 libraries
Indexing Mates [0.03s] mated reads in fragments
Indexing Reads [0.04s] reads
Features not available
Initialize Display
.Loading AssemblyStats...[0.05s]
.Loading Features [0.00s]
.Loading Libraries [0.00s]
.Loading Scaffolds [0.00s]
.Loading Contigs [0.06s]
....Loading NCharts [0.00s]
. [0.11s]
Loading Contig 1... [0.00s] 2 reads
Loading reads [0.00s]
Total Load Time: [0.33s]
34
Celera Assembler Hybrid Assembly exercise
35
Break…
36
Velvet Assembly exercise
cd ..
velveth velvet_asm 31 -fastq -shortPaired illumina_pe.fq
[ ] Reading FastQ file illumina_pe.fq;
[ ] sequences found
[ ] Done
[ ] Reading read set file velvet_asm/Sequences;
[ ] sequences found
[ ] Done
[ ] sequences in total.
[ ] Writing into roadmap file velvet_asm/Roadmaps...
[ ] Inputting sequences...
[ ] Inputting sequence 0 /
[ ] === Sequences loaded in s
[ ] Done inputting sequences
[ ] Destroying splay table
[ ] Splay table destroyed
velvetg velvet_asm
[ ] Reading roadmap file velvet_asm/Roadmaps
[ ] roadmaps read
[ ] Creating insertion markers
[ ] Ordering insertion markers
[ ] Counting preNodes
[ ] preNodes counted, creating them now
[ ] Adjusting marker info...
[ ] Concatenation over!
[ ] Writing contigs into velvet_asm/contigs.fa...
[ ] Writing into stats file velvet_asm/stats.txt...
[ ] Writing into graph file velvet_asm/LastGraph...
Final graph has 883 nodes and n50 of 6491, max 21013, total , using 0/ reads
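The 31 passed to velveth is the hash (k-mer) length used to build the assembly graph. The toy sketch below shows the general de Bruijn idea of linking overlapping k-mers. It is a simplification and not Velvet's actual data structure (for example, it ignores reverse complements and coverage); the function name and the two reads are hypothetical, and k is set to 4 only to keep the example readable.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn_edges(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Build a toy de Bruijn graph from reads.

    Every k-mer in a read adds an edge from its (k-1)-base prefix to its
    (k-1)-base suffix. Reads shorter than k contribute nothing.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical reads, not taken from illumina_pe.fq.
example_reads = ["ACGTAC", "GTACGA"]
graph = de_bruijn_edges(example_reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A node with more than one outgoing edge, such as ACG here, marks a point where the reads disagree about what follows, which is where an assembler has to break or resolve a contig.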
37
Velvet Assembly exercise
cd alignment/
cp ../hybrid_CA/9-terminator/hybrid.scf.fasta .
cp ../velvet_asm/contigs.fa .
nucmer -p hybrid_vs_illumina hybrid.scf.fasta contigs.fa
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "hybrid_vs_illumina.ntref" of length
# construct suffix tree for sequence of length
# (maximum reference length is )
# (maximum query length is )
# process 6973 characters per dot
#
# CONSTRUCTIONTIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.14
# reading input file "/home/user0/marino-data/practice/alignment/contigs.fa" of length
# matching query-file "/home/user0/marino-data/practice/alignment/contigs.fa"
# against subject-file "hybrid_vs_illumina.ntref"
# COMPLETETIME /usr/bin/mummer hybrid_vs_illumina.ntref 0.40
# SPACE /usr/bin/mummer hybrid_vs_illumina.ntref 1.37
4: FINISHING DATA
38
Velvet Assembly exercise
mummerplot -layout hybrid_vs_illumina.delta
39
Exome capture alignment & variant calling exercise
cd ../ExomeAnalysis/
1. BWA alignment
bwa aln chr22 Example.1.fastq > Example.1.sai ;
bwa aln chr22 Example.2.fastq > Example.2.sai ;
bwa sampe chr22 Example.1.sai Example.2.sai Example.1.fastq Example.2.fastq > Example.sam
40
Exome capture alignment & variant calling exercise
2. BAM conversion, sorting and indexing of files for variant detection
samtools view -S -b -o Example.bam Example.sam ;
samtools sort Example.bam Example.sorted ;
samtools index Example.sorted.bam
41
Exome capture alignment & variant calling exercise
3. Variant detection using samtools multi-way pileup
samtools mpileup -uf chr22.fasta Example.sorted.bam | bcftools view -bvcg - > Example.bcf ;
bcftools view Example.bcf | vcfutils.pl varFilter -D100 > Example.vcf
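The -D100 option of vcfutils.pl varFilter sets a maximum read depth of 100, which removes calls in regions with suspiciously high coverage. The sketch below illustrates only that depth cap by reading the DP value from the INFO column. It is not a reimplementation of varFilter, which applies several other filters; the function names and the VCF lines are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def info_depth(info: str) -> Optional[int]:
    """Return the DP (read depth) value from a VCF INFO string, if present."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return None


def filter_max_depth(lines: Iterable[str], max_depth: int = 100) -> Iterator[str]:
    """Keep header lines and records whose DP does not exceed max_depth.

    Records without a DP field are kept, since there is nothing to test.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        depth = info_depth(columns[7])  # INFO is the eighth VCF column
        if depth is None or depth <= max_depth:
            yield line


# Hypothetical VCF content, not taken from Example.vcf.
example_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t1000\t.\tA\tG\t50\t.\tDP=50;AF1=0.5\n",
    "chr22\t2000\t.\tC\tT\t80\t.\tDP=250;AF1=1\n",
]

kept = list(filter_max_depth(example_vcf, max_depth=100))
for line in kept:
    print(line, end="")
```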
42
Exome capture alignment & variant calling exercise
4. Indexing VCF file for IGV display
bgzip -c Example.vcf > Example.vcf.gz ;
tabix -p vcf Example.vcf.gz
5. IGV display
igv.sh &
File > Load Genome… Select chr22.genome and click Ok
File > Load from File… Select Example.sorted.bam and click Ok
Select Example.vcf.gz and click Ok
Select nimbleGen_SeqCapEZ_exome_chr22.bed and click Ok
43
Exome capture alignment & variant calling exercise