ReSequencing
Dowell Short Read Class
Phillip Richmond
Outline: The Plan
Organize and copy data to your own working directory
Map reads back to a reference genome
Convert SAM to BAM
Remove duplicates
Run a variant caller
Visualize variants
Plan
The first round of variant calling involves cutting the yeast genome Sigma1278b into reads, mapping them back to the S288c reference genome, and then finding all SNP differences between the two genomes.
This data is synthetic.
The reads have already been produced for you in FASTQ format as 1x50 bp reads (single-end, 50 bp each).
Getting started
Organization is KEY!!
The resequencing tutorial requires the following organization:
Make a new directory in your home directory called ReSequencing
Inside ReSequencing, make these subdirectories:
    GENOME
    FASTQ
    SAM
    VCF
    PBS
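The layout above can be created in one step. This is a minimal sketch for a POSIX shell; RESEQ_ROOT is a variable name introduced here for illustration (it is not part of the course files) and defaults to ReSequencing in your home directory.

```shell
#!/bin/sh
# RESEQ_ROOT is a hypothetical convenience variable, not part of the course files.
RESEQ_ROOT="${RESEQ_ROOT:-$HOME/ReSequencing}"

# -p creates parent directories as needed and does not fail if they already exist.
for sub in GENOME FASTQ SAM VCF PBS; do
    mkdir -p "$RESEQ_ROOT/$sub"
done

# List the result to confirm the layout.
ls "$RESEQ_ROOT"
```

Because mkdir -p is safe to repeat, the snippet can be rerun without harming an existing directory tree.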
Copying the data
Now we want to copy the data from:
    /projects/sreadgrp/homeworkfiles/ReSequencing/
Copy the FASTQ file (Sigmav7_50mers.fastq) from the FASTQ directory to your own FASTQ directory
Copy SGDv4.fasta from GENOME/ to your own GENOME/ directory
Copy the PBS files to your own PBS directory:
    IndexGenome.pbs
    MapReads.pbs
    Sam2Bam.pbs
    IndelRealign.pbs
    CallSNPs.pbs
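The copy steps above could be scripted roughly as follows. This is a sketch, not the course's own script: the function name copy_course_files and its two arguments are made up for illustration, and it assumes the PBS files sit in a PBS/ subdirectory of the course directory.

```shell
#!/bin/sh
# copy_course_files SRC DEST: copy the tutorial inputs from SRC into DEST.
# The function name and its arguments are hypothetical, for illustration only.
copy_course_files() {
    src=$1
    dest=$2
    mkdir -p "$dest/FASTQ" "$dest/GENOME" "$dest/PBS"
    cp "$src/FASTQ/Sigmav7_50mers.fastq" "$dest/FASTQ/" || return 1
    cp "$src/GENOME/SGDv4.fasta" "$dest/GENOME/" || return 1
    for pbs in IndexGenome.pbs MapReads.pbs Sam2Bam.pbs IndelRealign.pbs CallSNPs.pbs; do
        cp "$src/PBS/$pbs" "$dest/PBS/" || return 1
    done
}

COURSE_DIR=/projects/sreadgrp/homeworkfiles/ReSequencing
# Only run the copy where the course directory actually exists.
if [ -d "$COURSE_DIR" ]; then
    copy_course_files "$COURSE_DIR" "$HOME/ReSequencing"
else
    echo "Course directory $COURSE_DIR not found; run this on the class cluster." >&2
fi
```

The function stops at the first failed copy, so a missing file is noticed immediately rather than several steps later.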
Index the genome (IndexGenome.pbs)
Command:
    /opt/bowtie/bowtie2-2.0.2/bowtie2-build <in.fasta> <out_index>
My command:
    /opt/bowtie/bowtie2-2.0.2/bowtie2-build \
        /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta \
        /Users/richmonp/ReSequencing/GENOME/SGDv4_bowtie2_Index
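Before mapping, it can help to confirm that the index was fully written. For a small genome such as yeast, bowtie2-build writes six files next to the index prefix. The sketch below checks for them; the helper name index_complete is hypothetical, and the example path assumes the ReSequencing layout in your own home directory.

```shell
#!/bin/sh
# index_complete PREFIX: succeed only if all six bowtie2 index files exist
# and are non-empty. (Hypothetical helper name.)
index_complete() {
    prefix=$1
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "$prefix.$suffix" ]; then
            echo "missing or empty: $prefix.$suffix" >&2
            return 1
        fi
    done
    return 0
}

# Example call; the path assumes the directory layout of this tutorial.
if index_complete "$HOME/ReSequencing/GENOME/SGDv4_bowtie2_Index"; then
    echo "index looks complete"
fi
```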
Map the reads back to the genome (MapReads.pbs)
The mapped reads must carry a read group, which the GATK steps later in this pipeline require. It is easiest to add one at mapping time with the bowtie2 options --rg-id and --rg.
Example:
    --rg-id Sigmav7vsS288c_bowtie2 --rg SM:Sigmav7vsS288c_bowtie2
Full command:
    /opt/bowtie/bowtie2-2.0.2/bowtie2 \
        --rg-id Sigmav7vsS288c_bowtie2 --rg SM:Sigmav7vsS288c_bowtie2 \
        /Users/richmonp/ReSequencing/GENOME/SGDv4_bowtie2_Index \
        /Users/richmonp/ReSequencing/FASTQ/Sigmav7_50mers.fastq \
        -S /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sam \
        2> /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.stderr
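Since the later steps depend on the read group, it is worth confirming that it reached the SAM header, where it appears as an @RG line. A minimal sketch follows; the helper name has_read_group is hypothetical, and the example path assumes the tutorial's directory layout under your own home directory.

```shell
#!/bin/sh
# has_read_group SAM: succeed if the SAM header has an @RG line with an SM field.
# (Hypothetical helper name.)
has_read_group() {
    awk '
        /^@RG/ && /\tSM:/ { found = 1 }   # read-group header line with a sample name
        !/^@/ { exit }                     # header is over once alignments begin
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Example call; the path assumes the directory layout of this tutorial.
sam="$HOME/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sam"
if [ -f "$sam" ]; then
    if has_read_group "$sam"; then
        echo "read group present"
    else
        echo "no @RG line found; check the --rg-id and --rg options"
    fi
fi
```

The check reads only the header, so it returns quickly even on a large SAM file.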
Convert your file format using Samtools (Sam2Bam.pbs)
Usage (samtools 0.1.18 syntax, in which sort takes an output prefix and appends .bam):
    samtools view -bS <in.sam> -o <out.bam>
    samtools sort <in.bam> <out.sorted>
    samtools index <in.sorted.bam>
My commands:
    /opt/samtools/0.1.18/samtools view -bS \
        /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sam \
        -o /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.bam
    /opt/samtools/0.1.18/samtools sort \
        /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.bam \
        /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sorted
    /opt/samtools/0.1.18/samtools index \
        /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sorted.bam
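The three commands above repeat the same long path several times, which makes typos likely. One way to reduce that risk is to derive every file name from a single sample prefix. The sketch below only prints the commands for review; SAM_DIR, SAMPLE and SAMTOOLS are variable names introduced here for illustration.

```shell
#!/bin/sh
# All names below are derived from one prefix so the three steps stay in sync.
# SAM_DIR, SAMPLE and SAMTOOLS are hypothetical variable names; adjust as needed.
SAM_DIR="${SAM_DIR:-$HOME/ReSequencing/SAM}"
SAMPLE="${SAMPLE:-Sigmav7_vs_S288c_bowtie2}"
SAMTOOLS="${SAMTOOLS:-/opt/samtools/0.1.18/samtools}"

sam="$SAM_DIR/$SAMPLE.sam"
bam="$SAM_DIR/$SAMPLE.bam"
sorted_prefix="$SAM_DIR/$SAMPLE.sorted"   # samtools 0.1.18 appends .bam itself
sorted_bam="$sorted_prefix.bam"

# Print the commands instead of running them, so they can be reviewed
# and pasted into Sam2Bam.pbs.
echo "$SAMTOOLS view -bS $sam -o $bam"
echo "$SAMTOOLS sort $bam $sorted_prefix"
echo "$SAMTOOLS index $sorted_bam"
```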
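Samtools remove duplicates (Sam2Bam.pbs)
Removes duplicate reads, which typically arise from PCR amplification during library preparation.
Usage:
    samtools rmdup <in.sorted.bam> <out.rmdup.sorted.bam>
My command:
    /opt/samtools/0.1.18/samtools rmdup \
        /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sorted.bam \
        /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.sorted.bam
Note: samtools rmdup works on paired-end reads by default; its -s option is the one intended for single-end reads such as the 1x50 bp reads in this tutorial.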
Realign around indels (IndelRealign.pbs)
GATK has a two-step process for realigning reads around indels.
Step 1: Find candidate locations that may be best represented by an insertion or deletion
    GATK's RealignerTargetCreator
Step 2: Apply local realignment around the candidate locations to produce a new BAM file
    GATK's IndelRealigner
Realign around indels: RealignerTargetCreator
Usage:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -R <reference genome> -T RealignerTargetCreator (options) -I <in.sorted.rmdup.bam> -o <out.intervals>
My command:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar \
        -R /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta \
        -T RealignerTargetCreator -minReads 5 \
        -I /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.sorted.bam \
        -o /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.intervals
Realign around indels: IndelRealigner
Usage:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T IndelRealigner -model USE_READS -targetIntervals <in.intervals> -R <reference.fasta> -I <in.rmdup.sorted.bam> -o <out.rmdup.realigned.sorted.bam>
My command:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T IndelRealigner -model USE_READS \
        -targetIntervals /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.intervals \
        -R /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta \
        -I /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.sorted.bam \
        -o /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup_realigned.sorted.bam
Call variants using GATK UnifiedGenotyper (CallSNPs.pbs)
The GATK package is a Java executable, distributed as a .jar file. To run the package, type:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar
Then use -T to select the tool within the package that you want to run, which in our case is UnifiedGenotyper:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T UnifiedGenotyper
Call variants using GATK UnifiedGenotyper
Usage:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -I <in.sorted.bam> -R <in.fasta> -o <out.vcf>
My command:
    java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH \
        -R /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta \
        -I /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup_realigned.sorted.bam \
        -o /Users/richmonp/ReSequencing/VCF/Sigmav7_vs_S288c_bowtie2_gatk.vcf
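Because -glm BOTH asks for SNPs and indels together, a quick tally of the resulting VCF is a useful sanity check. The sketch below is an illustration only: the helper name count_variants is hypothetical, and it uses a simple rule that counts a record as a SNP when REF and ALT are each a single base and puts everything else (indels, multi-allelic sites) in a second "other" count.

```shell
#!/bin/sh
# count_variants VCF: print two numbers, "snps other", for a VCF file.
# (Hypothetical helper name. "other" covers indels and any record whose
# REF or ALT column is longer than one character, e.g. a multi-allelic "A,T".)
count_variants() {
    awk -F '\t' '
        /^#/ { next }                                        # skip header lines
        length($4) == 1 && length($5) == 1 { snps++; next }  # REF and ALT single base
        { other++ }
        END { printf "%d %d\n", snps, other }
    ' "$1"
}

# Example call; the path assumes the directory layout of this tutorial.
vcf="$HOME/ReSequencing/VCF/Sigmav7_vs_S288c_bowtie2_gatk.vcf"
if [ -f "$vcf" ]; then
    count_variants "$vcf"
fi
```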
View your VCF in IGV
GATK automatically indexes your VCF files, so we can now visualize both the reads and the SNPs in IGV.
Transfer both the final BAM file and the VCF file to your student directory on /projects/sreadgrp/student/<username>/
    BAM: /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup_realigned.sorted.bam
    VCF: /Users/richmonp/ReSequencing/VCF/Sigmav7_vs_S288c_bowtie2_gatk.vcf
Open up the visualization VNC window
Open IGV
Load the files
Coffee Break
Then… organize into groups of 5
Paired-end data
The main difference between paired-end and single-end data occurs at the mapping step.
Each read in the pair is denoted by either "R1" or "R2" in the file name:
    1028_S1_L001_R1_001.fastq
    1028_S1_L001_R2_001.fastq
How it changes your bowtie2 command:
Open up MapPairedReads.pbs in an editor
Notice:
    -1 /data/Avery/FASTQ/1056_S1_L001_R1_001.fastq \
    -2 /data/Avery/FASTQ/1056_S1_L001_R2_001.fastq \
The -1 is for read 1, and the -2 is for read 2
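Since the two files of a pair differ only in "R1" versus "R2", the second name can be derived from the first instead of being typed by hand. A minimal sketch, assuming the "_R1_" / "_R2_" naming shown above; the helper name mate_of is hypothetical.

```shell
#!/bin/sh
# mate_of R1_FASTQ: print the matching R2 file name.
# (Hypothetical helper; replaces the first "_R1_" in the path with "_R2_".)
mate_of() {
    printf '%s\n' "$1" | sed 's/_R1_/_R2_/'
}

r1=/data/Avery/FASTQ/1056_S1_L001_R1_001.fastq
r2=$(mate_of "$r1")

# The two arguments as they would appear in the bowtie2 command.
printf '%s\n' "-1 $r1"
printf '%s\n' "-2 $r2"
```

Deriving the mate name this way keeps -1 and -2 pointing at the same sample, which is easy to get wrong when several pairs sit in one directory.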
Now…
Copy MapPairedReads.pbs to your own PBS directory (from /projects/sreadgrp/homeworkfiles/ReSequencing/PBS/)
Copy a pair of FASTQ files to your FASTQ directory (only copy the ones listed on your group problem sheet)
First group to map, call variants, and visualize variants wins! (Prizes are not amazing.)