BF528 - Genomic Variation and SNP Analysis

BF528 - Genomic Variation and SNP Analysis 02/09/2018

After the data curator aligns the NGS datasets and checks the quality and statistics of the alignment and the reads, we can run some analyses. Here we will talk about variation in an individual in comparison to the reference genome.

Genomic variants
  Variants can be small or large.
  < 50 bp: SNPs, indels, microsatellites, ... (fit within a read)
  > 1 kbp: structural variation, either CNV (deletion, insertion, duplication) or balanced (inversion, translocation); hard to find, use paired-end reads

Small variants
  reference: AA-TACGGACGGACTTTA
  read1:     AACTACGG-CGGACTTTA
  read3:     AACTACGG-CGGCCTTTA
  read4:     AACTACGG-CGGACTTGA
  read5:     AACTACGG-CGGACTTGA
  Labels on the slide: INsertion, DELetion, SNP
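As a rough illustration of how the toy alignment on this slide can be read, the sketch below walks each read column by column against the reference and labels the differences. The function name `classify_columns` and the labels are made up for this example, and it assumes the reads are already gapped to the same length as the reference. A mismatch seen in a single read may well be a sequencing error, so this labels columns rather than calling variants.

```python
def classify_columns(reference, read):
    """Label each alignment column where the read differs from the reference.

    Both strings must be gapped ('-') to the same length. Columns are 1-based.
    """
    events = []
    for column, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base == read_base:
            continue
        if ref_base == "-":
            # Reference has a gap: the read carries an extra base.
            events.append((column, "insertion", read_base))
        elif read_base == "-":
            # Read has a gap: a reference base is missing from the read.
            events.append((column, "deletion", ref_base))
        else:
            events.append((column, "mismatch", f"{ref_base}>{read_base}"))
    return events


# Sequences copied from the slide.
reference = "AA-TACGGACGGACTTTA"
reads = {
    "read1": "AACTACGG-CGGACTTTA",
    "read3": "AACTACGG-CGGCCTTTA",
    "read4": "AACTACGG-CGGACTTGA",
    "read5": "AACTACGG-CGGACTTGA",
}

for name, sequence in reads.items():
    print(name, classify_columns(reference, sequence))
```

In this toy data the insertion and deletion appear in every read, while the mismatches are supported by one read (read3) or two reads (read4, read5).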

Structural variation

Genomic variants

Genomic variants
  Homozygous variation: both chromosomes carry the variant relative to the reference.
  Heterozygous variation: only one chromosome carries the variant.
  Heterozygous events need more sampling coverage: 15X coverage gives enough power for homozygous events, 30X for heterozygous events.
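One way to see why heterozygous events need more coverage is a simple binomial model. The sketch below is only an illustration under strong assumptions: every read is error-free, each read at a heterozygous site comes from either chromosome with probability 0.5, and a caller needs at least `MIN_ALT_READS` supporting reads (a hypothetical threshold, not taken from the slides).

```python
from math import comb


def prob_at_least(k, depth, p=0.5):
    """P(X >= k) for X ~ Binomial(depth, p): chance of seeing k or more alt reads."""
    return sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i)
        for i in range(k, depth + 1)
    )


MIN_ALT_READS = 5  # hypothetical evidence threshold, for illustration only

for depth in (15, 30):
    # At a heterozygous site about half the reads carry the variant (p=0.5);
    # at a homozygous site every error-free read carries it (p=1.0).
    het = prob_at_least(MIN_ALT_READS, depth, p=0.5)
    hom = prob_at_least(MIN_ALT_READS, depth, p=1.0)
    print(f"depth {depth}X: heterozygous {het:.4f}, homozygous {hom:.4f}")
```

Under this model the homozygous case reaches the threshold as soon as the depth does, while the heterozygous case only approaches certainty as depth grows.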

Genomic variants
  We write the genotype as a pair of alleles:
  0/0  both alleles are the reference allele
  0/1  one reference allele and one non-reference allele
  1/1  both alleles are the same non-reference allele
  1/2  two different non-reference alleles (heterozygous)
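The genotype notation above can be decoded mechanically. The following is a minimal sketch assuming a diploid genotype string; `describe_genotype` is a name invented for this example. It also accepts `|`, which VCF uses as the separator for phased genotypes, and `.` for a missing allele.

```python
def describe_genotype(gt):
    """Describe a diploid genotype string such as '0/1' or '1|2'."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    if "." in alleles:
        return "missing"
    first, second = (int(allele) for allele in alleles)
    if first == second:
        return "homozygous reference" if first == 0 else "homozygous non-reference"
    if 0 in (first, second):
        return "heterozygous (reference + non-reference)"
    return "heterozygous (two different non-reference alleles)"


for gt in ("0/0", "0/1", "1/1", "1/2"):
    print(gt, "->", describe_genotype(gt))
```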

Genomic variants
  Germline: comparing one individual to the reference.
  Somatic: comparing two non-germline cell samples from the same individual. First compare both to the reference, then take the differences. Example: cancer vs. normal tissue.
  Somatic calling is more complicated because the number of copies of a chromosome is unknown, and it needs higher coverage (~100X).

Genomic variants
  De novo variant calling/detection: given a BAM file, find all the variants.
  Genotyping: given a region of interest, test whether the variant exists there or not.
  De novo calling is harder; genotyping is used when we have hotspots.

Variants smaller than a read
  Such as SNPs and indels.
  Almost a solved problem: called SNPs are 95% accurate, but the presence of SVs causes false positives. Example: HLA genes.
  Small variants are RANDOM events, with 0.1% prevalence.

SNP/InDel Analysis
  One SNP roughly every 1 kbp.
  ~15M common (frequency >1%) SNPs and indels.
  To study common SNPs we can use SNP arrays: haplotyping (ancestry), GWAS.
  To study rare SNPs we use NGS: rare disease, fingerprinting.

SNP and indel density

Haplotyping
  Recombination within populations leaves conserved blocks. SNPs in a block are inherited together. Looking at the common SNPs in a block reveals ancestry information.

Haplotypes

Haplotyping

GWAS: Genome-Wide Association Studies
  Given a large group of patients (cases) and a normal population (controls), we look for common SNPs associated with the disease/phenotype. Association does not mean causation.

GWAS

GWAS
  Two important statistics:
  p-value → whether the difference is statistically significant
  odds ratio → how large the effect is
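To make the two statistics concrete, here is a small sketch for a single SNP using a 2x2 table of allele counts in cases and controls. The counts are invented for illustration, and the test used is a plain Pearson chi-square with one degree of freedom and no continuity correction, which is one common choice rather than necessarily the one a given GWAS tool applies. A real study also has to correct for testing many SNPs at once.

```python
from math import erfc, sqrt


def odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds of carrying the alt allele in cases divided by the odds in controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)


def chi_square_p_value(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P-value of the Pearson chi-square test (1 degree of freedom) on a 2x2 table."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical allele counts for one SNP (not real data).
counts = dict(case_alt=300, case_ref=700, ctrl_alt=200, ctrl_ref=800)

print("odds ratio:", round(odds_ratio(**counts), 3))
print("p-value:", chi_square_p_value(**counts))
```

An odds ratio near 1 means the allele is about equally common in both groups, however small the p-value is.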

Rare SNPs
  Use tools to call SNPs. Each individual will have thousands of unique SNPs.

Calling SNPs - samtools
  samtools mpileup -u -v \
    -r chr22:29268316-29300343 \
    -d 150 \
    -f ../06/ref/chr22.fa \
    NA12878_phased_chr22.bam > NA12878_chr22_samtools_EWSR1.vcf

VCF file format Variants are kept in VCF format

VCF file format # header line
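For orientation, the sketch below splits one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) plus the optional FORMAT and per-sample columns. The function name `parse_vcf_line` and the example record are made up for illustration; the record is not taken from the NA12878 files used in class. For real work a dedicated VCF library is the safer choice.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF line; return None for header lines ('#...')."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),          # 1-based position
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),    # several alt alleles are comma-separated
        "QUAL": qual,
        "FILTER": filt,
        "INFO": info,
        "samples": [],
    }
    if len(fields) > 9:
        # FORMAT names the colon-separated keys used in every sample column.
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


# Hypothetical record, for illustration only.
example = "chr22\t29270000\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30"
print(parse_vcf_line(example))
```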

Calling small variants - GATK
  gatk HaplotypeCaller \
    -L chr22:29268316-29300343 \
    -R ../06/ref/chr22.fa \
    -I NA12878_phased_chr22.bam \
    -O NA12878_chr22_gatk_EWSR1.vcf.gz \
    -ERC GVCF  # alternative: BP_RESOLUTION

Large variants: Structural Variation (SV)
  Balanced: inversion, translocation. Do not change the amount of DNA. Very difficult to find.
  Copy Number Variants (CNV): duplication, insertion, deletion. Change the amount of DNA; easier to find.

Large variants
  Mini (hundreds of base pairs) and macro (visible under a microscope) variants.
  Poorly studied; guesses are 15% between two individuals.
  Human and primate problem.
  Not random; they occur at hotspots.
  NAHR (non-allelic homologous recombination) and NHEJ (non-homologous end joining), driven by repeats.
  Inversions result in deletions, translocations in duplications.

SV calling strategies
  Read signatures: read pair, depth, split read, assembly.
  Insertions can only be found by assembly.
  Balanced SVs are very difficult to find (no reliable computational method).
  CNVs are almost solved.
  One type of SV causes another: complex, nested, ...
  Causes: NAHR, NHEJ, ...
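As a rough sketch of the read-pair signature mentioned above, the code below labels a single mapped read pair by comparing its orientation and spacing with what a standard paired-end library would give (mates facing each other, a few hundred bp apart). The `ReadPair` class, the thresholds and the labels are all assumptions made for this example. Real callers such as Delly2 or Lumpy cluster many discordant pairs and combine several signatures before reporting anything.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapping of both mates: chromosome, start position, and strand."""
    chrom1: str
    pos1: int
    forward1: bool
    chrom2: str
    pos2: int
    forward2: bool


def classify_pair(pair, min_insert=200, max_insert=600):
    """Return the SV type this pair would hint at (hypothetical thresholds)."""
    if pair.chrom1 != pair.chrom2:
        return "translocation candidate"
    if pair.forward1 == pair.forward2:
        # Both mates on the same strand: one end lies in inverted sequence.
        return "inversion candidate"
    # Mates on opposite strands: the leftmost mate should be the forward one.
    left_is_forward = pair.forward1 if pair.pos1 <= pair.pos2 else pair.forward2
    if not left_is_forward:
        return "tandem duplication candidate"
    # Simplification: spacing is measured between start positions only.
    span = abs(pair.pos2 - pair.pos1)
    if span > max_insert:
        return "deletion candidate"
    if span < min_insert:
        return "insertion candidate"
    return "concordant"


print(classify_pair(ReadPair("chr22", 1000, True, "chr22", 1400, False)))
print(classify_pair(ReadPair("chr22", 1000, True, "chr22", 9000, False)))
print(classify_pair(ReadPair("chr22", 1000, True, "chr22", 1400, True)))
```

A single discordant pair is weak evidence, which is one reason these methods suffer from false positives.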

Read signatures

Read-pair signatures for inversions Reference Inverted

Read-pair signature

SV discovery tools
  Best ones: Delly2, Lumpy, GATK (smaller).
  All suffer from high false positive rates (especially for balanced SVs).
  Every tool has its own size detection range.

SV validation
  SVs need to be validated because of the high false positive rates:
  using long reads, or
  in the lab with FISH experiments.

SV validation Fluorescence In Situ Hybridization (FISH)

OMIC Tools
  OMICtools: the community platform for bioinformatics.
  This portal has a collection of all tools in bioinformatics from the literature, with ratings.