BF528 - Genomic Variation and SNP Analysis 02/09/2018
After the data curator aligns the NGS datasets and checks the quality and statistics of the alignment and reads, we can run some analyses. Here we will talk about variation in an individual in comparison to the reference genome.
Genomic variants
Variants can be small or large.
- < 50 bp: SNPs, indels, microsatellites… These fit into a read.
- > 1 Kbp: structural variation, either CNV (deletion, insertion, duplication) or balanced (inversion, translocation). Hard to find; use paired-end reads.
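Small variants
    reference: AA-TACGGACGGACTTTA
    read1:     AACTACGG-CGGACTTTA
    read3:     AACTACGG-CGGCCTTTA
    read4:     AACTACGG-CGGACTTGA
    read5:     AACTACGG-CGGACTTGA
- Insertion: column 3 (C present in the reads, gap in the reference)
- Deletion: column 9 (A in the reference, gap in the reads)
- SNP: column 13 (A to C, read3 only) and column 17 (T to G, read4 and read5)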
Structural variation
Genomic variants
- Homozygous variation: both chromosomes carry the variant relative to the reference.
- Heterozygous variation: only one chromosome carries the variant.
- Heterozygous events need more sampling coverage: about 15X is required to have enough power for homozygous events, and about 30X for heterozygous ones.
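Why heterozygous events need more coverage can be illustrated with a simple binomial model. This is a minimal sketch, not how any particular caller works: it assumes a fixed depth at the site, that every read from a homozygous site carries the variant, that each read from a heterozygous site carries it with probability 0.5, and a hypothetical threshold of 5 supporting reads (`MIN_ALT_READS`) before a call is made.

```python
from math import comb


def prob_at_least(min_alt_reads, depth, alt_fraction):
    """Probability of seeing at least `min_alt_reads` variant-supporting reads
    among `depth` reads, when each read carries the variant with probability
    `alt_fraction` (binomial model)."""
    return sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads, depth + 1)
    )


MIN_ALT_READS = 5  # hypothetical calling threshold, chosen for illustration

for depth in (15, 30):
    hom = prob_at_least(MIN_ALT_READS, depth, 1.0)  # homozygous: all reads carry the variant
    het = prob_at_least(MIN_ALT_READS, depth, 0.5)  # heterozygous: about half of the reads do
    print(f"{depth}X  homozygous: {hom:.4f}  heterozygous: {het:.5f}")
```

Under these assumptions a homozygous variant is always supported at 15X, while a heterozygous one is missed about 6% of the time at 15X and almost never at 30X. Real data also has sequencing errors and uneven depth, which this sketch ignores.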
Genomic variants
We show alleles as:
- 0/0: both alleles are the reference allele
- 0/1: one reference allele and one different allele
- 1/1: both alleles are the same non-reference allele
- 1/2: both alleles are non-reference but differ from each other (heterozygous)
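The genotype notation above can be decoded mechanically. The sketch below assumes a diploid genotype string as it appears in a VCF file, where `/` separates unphased alleles, `|` separates phased alleles, and `.` marks a missing call; the function name `describe_genotype` is made up for illustration.

```python
def describe_genotype(gt):
    """Translate a diploid genotype string such as '0/1' into words."""
    alleles = gt.replace("|", "/").split("/")  # treat phased and unphased alike
    if "." in alleles:
        return "missing"
    first, second = (int(a) for a in alleles)
    if first == second:
        return "homozygous reference" if first == 0 else "homozygous non-reference"
    if 0 in (first, second):
        return "heterozygous, one reference allele"
    return "heterozygous, two different non-reference alleles"


for gt in ("0/0", "0/1", "1/1", "1/2"):
    print(gt, "->", describe_genotype(gt))
```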
Genomic variants
- Germline: comparing one individual to the reference.
- Somatic: comparing two non-germline cells in an individual. First compare both to the reference, then get the differences. Example: cancer vs. normal tissue.
- Somatic calling is more complicated due to the unknown number of copies of a chromosome, and it needs higher coverage (~100X).
Genomic variants
- De novo variant calling/detection: given a BAM file, find all the variants.
- Genotyping: given a region of interest, test whether the variant exists there or not.
- De novo calling is harder; genotyping is used when we have hotspots.
Variants smaller than a read
- Such as SNPs and indels
- Almost a solved problem
- SNP calls are about 95% accurate, but the presence of SVs causes false positives. Example: HLA genes
- Small variants are random events, with about 0.1% prevalence
SNP/InDel Analysis
- One SNP per every ~1 Kbp
- ~15M common (>1%) SNPs and indels
- To study common SNPs we can use SNP arrays: haplotyping (ancestry), GWAS
- To study rare SNPs we use NGS: rare disease, fingerprinting
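The "one SNP per ~1 Kbp" density translates directly into an expected count for any stretch of sequence. This is back-of-the-envelope arithmetic only; the genome size used below (about 3.1 Gbp) is an assumed round figure, and `expected_snps` is a hypothetical helper.

```python
def expected_snps(length_bp, bp_per_snp=1000):
    """Expected number of SNPs in a region, given one SNP per `bp_per_snp` bases."""
    return length_bp / bp_per_snp


HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate genome size

print(expected_snps(HUMAN_GENOME_BP))  # on the order of 3 million SNPs per individual
print(expected_snps(50_000))           # a 50 Kbp gene region: about 50 SNPs
```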
SNP and indel density
Haplotyping
- Recombination through populations creates conserved blocks.
- SNPs in a block are inherited together.
- Looking at the common SNPs in a block reveals ancestry information.
Haplotypes
GWAS
Genome-Wide Association Studies
- Given a large group of patients (cases) and a normal population (controls), we look for common SNPs associated with the disease or phenotype.
- Association does not mean causation.
GWAS
Two important statistics:
- p-value: whether the difference between cases and controls is statistically significant
- odds ratio: how large the effect is
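Both statistics can be computed for a single SNP from a 2x2 table of carriers and non-carriers among cases and controls. The sketch below is one simple way to do it, assuming a Pearson chi-square test with one degree of freedom and no continuity correction; real GWAS pipelines typically use regression models and correct for multiple testing. The counts are hypothetical.

```python
from math import erfc, sqrt


def association_stats(case_carriers, case_noncarriers,
                      control_carriers, control_noncarriers):
    """Return (odds ratio, chi-square statistic, p-value) for a 2x2 table."""
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical counts: 60 of 100 cases and 40 of 100 controls carry the SNP
odds_ratio, chi2, p_value = association_stats(60, 40, 40, 60)
print(f"odds ratio = {odds_ratio:.2f}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

With these made-up counts the odds ratio is 2.25 and the p-value is roughly 0.005. The two numbers answer different questions: a tiny effect can be highly significant in a large cohort, and a large effect can fail to reach significance in a small one.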
Rare SNPs
- Use tools to call SNPs.
- Each individual will have thousands of unique SNPs.
Calling SNPs - samtools
samtools mpileup -u -v -r chr22:29268316-29300343 -d 150 -f ../06/ref/chr22.fa NA12878_phased_chr22.bam > NA12878_chr22_samtools_EWSR1.vcf
VCF file format
- Variants are kept in VCF format.
- Lines starting with # are header lines; each remaining line describes one variant.
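To make the layout concrete, here is a minimal sketch of reading VCF text in Python. It relies only on the fixed column order defined by the VCF format (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample). The two records and the sample name `SAMPLE1` are invented for illustration and are not output from the commands in these slides; for real work a dedicated library is a better choice than hand parsing.

```python
# Hypothetical VCF content: one meta line, the column header, two variant records
VCF_TEXT = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
chr22\t29270000\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30
chr22\t29280000\t.\tCT\tC\t45\tPASS\tDP=28\tGT:DP\t1/1:28
"""


def parse_vcf(text):
    """Parse VCF text into a list of dicts, one per variant record."""
    records, sample_names = [], []
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]  # sample columns follow FORMAT
            continue
        chrom, pos, _id, ref, alt, _qual, _filter, _info, fmt = fields[:9]
        keys = fmt.split(":")
        samples = {
            name: dict(zip(keys, value.split(":")))
            for name, value in zip(sample_names, fields[9:])
        }
        records.append({"chrom": chrom, "pos": int(pos), "ref": ref,
                        "alt": alt.split(","), "samples": samples})
    return records


for rec in parse_vcf(VCF_TEXT):
    print(rec["chrom"], rec["pos"], rec["ref"], rec["alt"], rec["samples"]["SAMPLE1"]["GT"])
```

In this made-up example the first record is a heterozygous SNP (A to G) and the second a homozygous one-base deletion (CT to C).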
Calling small variants - GATK
gatk HaplotypeCaller \
    -L chr22:29268316-29300343 \
    -R ../06/ref/chr22.fa \
    -I NA12878_phased_chr22.bam \
    -O NA12878_chr22_gatk_EWSR1.vcf.gz \
    -ERC GVCF # BP_RESOLUTION
Large variants
Structural Variation (SV)
- Balanced: inversion, translocation. Do not change the amount of DNA. Very difficult to find.
- Copy Number Variants (CNV): duplication, insertion, deletion. Change the amount of DNA; easier to find.
Large variants
- Range from mini (hundreds of base pairs) to macro (visible under a microscope) variants
- Poorly studied
- Guesses are around 15% between two individuals
- A human and primate problem
- Not random; they occur at hotspots
- NAHR (non-allelic homologous recombination) and NHEJ (non-homologous end joining), driven by repeats
- Inversions can result in deletions, and translocations in duplications
SV calling strategies
- Read signatures: read pair, read depth, split read, assembly
- Insertions can only be found by assembly
- Balanced SVs are very difficult to find (no reliable computational method)
- CNVs are almost solved
- One type of SV can cause another; events can be complex or nested…
- Causes: NAHR, NHEJ ...
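The read-depth signature is the simplest of these to sketch: in a diploid genome, the depth over a region scales with its copy number, so a region at half the baseline depth suggests a heterozygous deletion and one at 1.5 times the baseline suggests a duplication. The function below is a simplified illustration under that assumption; it ignores GC bias, mappability, and noise, which real CNV callers have to model, and `estimate_copy_number` is a hypothetical name.

```python
def estimate_copy_number(region_depth, baseline_depth, ploidy=2):
    """Estimate copy number from the ratio of regional depth to baseline depth."""
    return round(ploidy * region_depth / baseline_depth)


BASELINE = 30.0  # assumed average depth over copy-neutral regions

for depth in (0.0, 15.0, 30.0, 45.0):
    print(f"depth {depth:>4}: copy number ~ {estimate_copy_number(depth, BASELINE)}")
```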
Read signatures
Read-pair signatures for inversions
[Figure: read pairs shown on the reference and on the inverted sequence]
Read-pair signature
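Read-pair signatures can be summarized as simple rules on mate orientation and distance. The sketch below assumes a standard paired-end library in which a concordant pair has the left mate on the forward strand and the right mate on the reverse strand, at roughly the expected insert size. The insert-size mean, standard deviation, and the 3-SD cutoff are hypothetical values, and each result is only a candidate: a single discordant pair is not evidence, so real tools cluster many pairs before calling an SV.

```python
def classify_read_pair(chrom_left, chrom_right, strand_left, strand_right,
                       insert_size, mean_insert=400, sd_insert=50):
    """Classify one read pair by its signature (mates ordered by position)."""
    if chrom_left != chrom_right:
        return "translocation candidate"       # mates map to different chromosomes
    if strand_left == strand_right:
        return "inversion candidate"           # both mates on the same strand
    if strand_left == "-" and strand_right == "+":
        return "tandem duplication candidate"  # mates point away from each other
    if insert_size > mean_insert + 3 * sd_insert:
        return "deletion candidate"            # mates map too far apart
    if insert_size < mean_insert - 3 * sd_insert:
        return "insertion candidate"           # mates map too close together
    return "concordant"


print(classify_read_pair("chr22", "chr22", "+", "-", 410))   # concordant
print(classify_read_pair("chr22", "chr22", "+", "+", 410))   # inversion candidate
print(classify_read_pair("chr22", "chr22", "+", "-", 5000))  # deletion candidate
```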
SV discovery tools
- Best ones: Delly2, Lumpy, GATK (smaller variants)
- All suffer from high false positive rates (especially for balanced SVs)
- Every tool has its own size detection range.
SV validation
SVs need to be validated due to high false positive rates:
- Using long reads
- In the lab with FISH experiments
SV validation
Fluorescence In Situ Hybridization (FISH)
OMIC Tools
OMICtools: The community platform for bioinformatics
This portal has a collection of bioinformatics tools from the literature, with ratings.