BF528 - Whole Genome Sequencing and Genomic Variation

Whole Genome/Exome Sequencing

Whole Genome Sequencing (WGS)
- Generate enough reads to attain:
  - >95% coverage of the source genome
  - >30x average depth
- Two strategies:
  - De novo: assemble reads into a new sequence
  - Re-sequence: refine an existing reference sequence
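The depth and coverage targets above can be related to read counts with simple arithmetic. The sketch below assumes uniformly sampled reads (the Lander-Waterman idealization), which real libraries only approximate; the function names and the example genome size are illustrative, not from the slides.

```python
import math


def average_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


def expected_breadth(depth: float) -> float:
    """Expected fraction of the genome covered by at least one read,
    assuming reads land uniformly at random (an idealization)."""
    return 1.0 - math.exp(-depth)


if __name__ == "__main__":
    genome_size = 5_000_000  # hypothetical 5 Mb genome
    n = reads_needed(30, 150, genome_size)
    print(n, average_depth(n, 150, genome_size), expected_breadth(30))
```

Under this model even 3x average depth is expected to touch about 95% of bases at least once; the 30x target is about having enough reads at each position to call variants reliably.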

De novo assembly
- Genome assembly - create a new reference ‘from scratch’
- Examine reads for overlapping sequence
- Contig - longer sequence assembled from short reads
- Scaffold - assembled contigs
- Chromosome - assembled scaffolds
- Assembly from short reads is hard
Michael Schatz, Cold Spring Harbor

De novo assembly: greedy algorithm
Kyriakidou, Maria, Helen H. Tai, Noelle L. Anglin, David Ellis, and Martina V. Strömvik. 2018. “Current Strategies of Polyploid Plant Genome Sequence Assembly.” Frontiers in Plant Science 9 (November): 1660.
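A minimal sketch of the greedy idea, assuming error-free reads from a single strand and no repeats: repeatedly merge the pair of sequences with the longest suffix-prefix overlap until no overlaps remain. This is a toy illustration, not the method of any particular assembler; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (at least `min_len` long), or 0 if there is none."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Greedily merge the best-overlapping pair until nothing overlaps."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # remaining sequences are the contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTGA"]))
```

Because each merge is chosen locally, repeats longer than the overlap can lead this procedure to join reads that do not belong together.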

De novo assembly: graph-based
- Greedy assembly creates linear sequences
  - Vulnerable to finding local optima
- Graph representation considers all sequence content simultaneously
- Graph data structure can encode variability (e.g. insertions, SNPs)
- Computationally much more expensive
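One common graph representation is a de Bruijn graph, used here as an assumed example since the slide does not name a specific graph type. Nodes are (k-1)-mers and each k-mer in a read adds an edge; a SNP then shows up as a branch point where two paths diverge and later rejoin. The sequences and function names below are made up for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k: int):
    """Build a de Bruijn graph: each k-mer links its (k-1)-mer prefix
    to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Nodes with more than one successor, i.e. places where paths diverge."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    # Two hypothetical reads that differ by a single base (A vs T).
    reads = ["GATTCAGGCT", "GATTCTGGCT"]
    graph = de_bruijn(reads, k=4)
    print(branch_nodes(graph))
    print(sorted(graph["TTC"]))
```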

Reference guided genome assembly
- Refine existing reference with new sequence
- Can discover:
  - New structural variants
  - Novel insertions/alternate haplotypes or scaffolds
  - Polymorphisms
- Faster, easier than de novo assembly
- More sensitive to existing biases in reference

Reference guided genome assembly
Kyriakidou, Maria, Helen H. Tai, Noelle L. Anglin, David Ellis, and Martina V. Strömvik. 2018. “Current Strategies of Polyploid Plant Genome Sequence Assembly.” Frontiers in Plant Science 9 (November): 1660.

Human genome reference
Genome Reference Consortium human build 38 patch 12
- Assembly GRCh37 patch 13 (hg19)
- Assembly GRCh38 patch 12 (hg38, current draft)

Human genome - GRCh37
Genome Reference Consortium human build 37 patch 13
[Figure: Assembly GRCh38.p12 from NCBI]

Whole Exome Sequencing (WES)
- Exons make up 1%-2% of the human genome sequence
- Pre-select (capture) DNA fragments from exons before sequencing
- Sequence to much greater depth than WGS
- Identify coding variants

Exome Sequence Selection
- Array-based capture
- In-solution capture

Genomic Variation

Genomic Variants
- Individual genomes from the same species vary
- WGS/WES compared with a reference can identify differences
- Variant: sequence that varies within a species
- Two general types:
  - Small: <50 bp; single nucleotide polymorphisms (SNPs), indels
  - Large: >50 bp; copy number variations, duplications, deletions, translocations, inversions
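The small/large split can be expressed as a simple rule on the reference and alternate allele strings. This sketch covers only substitutions and simple insertions or deletions; treating exactly 50 bp as "large" is an assumption, since the slide leaves that boundary open, and the function name is hypothetical.

```python
def classify_variant(ref: str, alt: str, large_threshold: int = 50):
    """Return (kind, scale) for a simple variant given REF and ALT alleles."""
    if len(ref) == len(alt):
        kind = "SNV" if len(ref) == 1 else "substitution"
        size = len(ref)
    elif len(alt) > len(ref):
        kind, size = "insertion", len(alt) - len(ref)
    else:
        kind, size = "deletion", len(ref) - len(alt)
    # Assumption: sizes at or above the threshold count as large.
    scale = "small" if size < large_threshold else "large"
    return kind, scale


if __name__ == "__main__":
    print(classify_variant("A", "C"))
    print(classify_variant("A", "ACGT"))
    print(classify_variant("A" * 61, "A"))
```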

Genomic variants

Point mutations
  reference: AA-TACGGACGGACTTTA
  read1:     AACTACGG-CGGACTTTA
  read2:     AACTACGG-CGGACTTTA
  read3:     AACTACGG-CGGCCTTTA
  read4:     AACTACGG-CGGACTTGA
  read5:     AACTACGG-CGGACTTGA
Events labeled on the slide: INsertion, DELetion, SNP
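The alignment above can be read column by column: a gap in the reference marks an insertion, a gap in the read marks a deletion, and two different bases mark a mismatch (a candidate SNV). The sketch below applies that rule to the slide's sequences and counts how many reads support each event; the function name is illustrative, and real callers also weigh base and mapping quality.

```python
from collections import Counter


def classify_columns(ref: str, read: str):
    """Compare a gapped read to a gapped reference of the same length."""
    events = []
    for col, (r, q) in enumerate(zip(ref, read)):
        if r == q:
            continue
        if r == "-":
            events.append((col, "insertion", q))
        elif q == "-":
            events.append((col, "deletion", r))
        else:
            events.append((col, "mismatch", f"{r}>{q}"))
    return events


REFERENCE = "AA-TACGGACGGACTTTA"
READS = [
    "AACTACGG-CGGACTTTA",
    "AACTACGG-CGGACTTTA",
    "AACTACGG-CGGCCTTTA",
    "AACTACGG-CGGACTTGA",
    "AACTACGG-CGGACTTGA",
]

if __name__ == "__main__":
    support = Counter(e for read in READS for e in classify_columns(REFERENCE, read))
    for event, n_reads in sorted(support.items()):
        print(event, f"{n_reads}/{len(READS)} reads")
```

The insertion and deletion are seen in every read, while each mismatch is seen in only some reads, which is why read support matters when deciding what to call.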

Single Nucleotide Polymorphisms (SNPs)
- Most commonly studied type of variant
- Types of single nucleotide alterations:
  - Variant (SNV) - any single-nucleotide change
  - Polymorphism (SNP) - SNV of appreciable frequency in a population (e.g. >1%)
- Typically base changes, e.g. A → C
- SNPs (usually) indicate shared ancestry
- May suggest disease mechanism

Structural variation
- Deletions - sequence missing
- Insertions:
  - Novel - new sequence added
  - Mobile-element - copied/moved from elsewhere in genome
- Duplications:
  - Tandem - consecutive
  - Interspersed - non-consecutive
- Inversion - segment is reversed
- Translocation - segment moves

Genotyping
- Genotype: an individual’s variant(s)
- Phenotype: an individual’s physical form
- De novo variant calling/detection: given genomic reads and a reference, find all the variants
- Genotyping: examine an individual for a priori variants at known locations
  - Can use arrays (e.g. SNP chip) or WGS/WES

Human Genomic Variants
- Humans have a diploid genome, i.e. 2 copies of each gene
- 660 million annotated human SNPs
  - 113 million of those have been validated
- Every human has on average:
  - A variant every 1 kb
  - 2-3 million SNPs
- dbSNP - NCBI repository for SNP info

Human Genomic Terminology
- Allele: one version of the sequence at a variant site
- Homozygous variant: both copies carry the same allele
  - i.e. either the same variant or the same as reference
- Heterozygous variant: the two copies carry different alleles
  - i.e. one is variant, one is reference or a different variant

Human Genomic Terminology
For coding (exonic) variants:
- Synonymous or sense: variant does not change the amino acid sequence
- Non-synonymous or mis-sense: variant causes an amino acid change
- Non-sense: causes early termination of the protein by introducing a stop codon
- Frameshift: an insertion or deletion whose length is not a multiple of 3 shifts the reading frame, recoding the entire downstream amino acid sequence
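These categories follow mechanically from the genetic code. The sketch below builds the standard codon table and classifies a single-base change within one codon, plus a simple frame check for indels. The function names are illustrative, and the "stop-lost" label for a change that removes a stop codon is an addition not covered on the slide.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change at `position` (0-2) of a DNA codon."""
    new_codon = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[new_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"  # label not used on the slide
    return "missense"


def classify_indel(ref: str, alt: str) -> str:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return "in-frame" if abs(len(ref) - len(alt)) % 3 == 0 else "frameshift"


if __name__ == "__main__":
    print(classify_snv("GAA", 2, "G"))  # GAA -> GAG
    print(classify_snv("GAG", 1, "T"))  # GAG -> GTG
    print(classify_snv("GAA", 0, "T"))  # GAA -> TAA
    print(classify_indel("A", "AC"))
```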

Genomic Variant Terminology
- Germline: inherited from parents
  - e.g. blue eyes, familial disease risk
- Somatic: acquired during life
  - e.g. tumor vs normal tissue
- Allele frequency: how common a given variant is in some population, e.g.:
  - 1% of human population
  - 30% within people with some disease

Genomic Variant Encoding
- Reference allele - reference sequence
- Alternate allele - variant sequence
- We show genotypes as:
  - 0/0 both reference allele
  - 0/1 one reference allele, one alternate
  - 1/1 both non-reference allele, homozygous
  - 1/2 both non-reference, heterozygous
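The genotype strings above can be interpreted with a few lines of code. This sketch maps a genotype string to its zygosity and computes an alternate allele frequency across a set of hypothetical samples. Accepting `|` as a separator (phased genotypes) and `.` as a missing call follows VCF convention and goes beyond what the slide shows.

```python
def zygosity(genotype: str) -> str:
    """Interpret a diploid genotype string such as '0/1'."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"
    return "heterozygous"


def alt_allele_frequency(genotypes) -> float:
    """Fraction of called alleles that are non-reference (any index other than 0)."""
    called = [a for g in genotypes for a in g.replace("|", "/").split("/") if a != "."]
    return sum(a != "0" for a in called) / len(called)


if __name__ == "__main__":
    samples = ["0/0", "0/1", "1/1", "0/1"]  # hypothetical cohort of four
    for g in samples + ["1/2"]:
        print(g, zygosity(g))
    print(alt_allele_frequency(samples))
```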

dbSNP - NCBI SNP Database

Variants Smaller Than A Read
- Finding SNPs and indels is almost a solved problem
- SNP calls are 95% accurate, given sufficient coverage
- Structural variants cause false positives
- Duplications and somatic mutations may cause 3 or more alleles to be observed

SNP and indel density is non-random

Variants Larger Than A Read
Structural Variation (SV)
- Two types:
  - Balanced - do not change the amount of DNA
  - Copy Number Variants (CNV) - change the amount of DNA
- Scales: from mini (hundreds of basepairs) to macro (visible by a microscope) variants
- Much harder to find (especially balanced)
- Non-random: SV ‘hotspots’

Haplotyping
- DNA recombines in large blocks
- SNPs in a block move around together
- Looking at the common SNPs in a block reveals ancestry information
- Linkage Disequilibrium (LD): adjacent SNPs co-occur more often than expected by chance
  - i.e. the 2 SNPs are in LD with each other
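"More often than expected" can be made concrete with the usual LD statistics, D = p(AB) - p(A)p(B) and r² = D² / (p(A)(1-p(A))p(B)(1-p(B))). The sketch below computes both from phased two-SNP haplotypes; the haplotype data and function name are made up for illustration.

```python
def linkage_disequilibrium(haplotypes, allele_a: str, allele_b: str):
    """Return (D, r^2) for two biallelic SNPs.

    `haplotypes` is a list of two-character strings: the allele at SNP 1
    followed by the allele at SNP 2 on the same chromosome.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # observed minus expected co-occurrence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


if __name__ == "__main__":
    linked = ["AG"] * 4 + ["CT"] * 4           # alleles always travel together
    independent = ["AG", "AT", "CG", "CT"]     # all combinations equally common
    print(linkage_disequilibrium(linked, "A", "G"))
    print(linkage_disequilibrium(independent, "A", "G"))
```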

Haplotyping

Genome Wide Association Studies (GWAS)
- Which SNPs are associated with a variable of interest? e.g. disease, height
- Does the frequency of any SNP differ between groups?
- Associated SNPs have:
  - Effect size - e.g. amount of increased risk
  - p-value - statistical significance of the association
- Risk allele: allele associated with increased or decreased probability of having a disease
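For a single SNP, the group comparison can be sketched as a 2x2 allele-count table with an odds ratio as the effect size and a chi-square test (1 degree of freedom) for the p-value. This is one simple allelic test, not necessarily what any given GWAS tool does, since real analyses typically use regression with covariates. The counts below are invented.

```python
import math


def allelic_association(case_risk: int, case_other: int,
                        control_risk: int, control_other: int):
    """Return (odds_ratio, chi2, p_value) for a 2x2 table of allele counts.

    Assumes all four counts are non-zero.
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: risk allele seen 60/100 times in cases, 40/100 in controls.
    print(allelic_association(60, 40, 40, 60))
```

Because a genome-wide scan repeats this test at very many SNPs, the resulting p-values need multiple-testing correction before they are interpreted.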

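Calling SNPs - samtools
  samtools mpileup -u -v \
    -r chr22:29268316-29300343 \
    -d 150 \
    -f ../06/ref/chr22.fa \
    NA12878_phased_chr22.bam \
    > NA12878_chr22_samtools_EWSR1.vcf
Note: VCF output from samtools mpileup is deprecated in newer samtools releases in favor of bcftools mpileup.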

Calling small variants - GATK
GATK - Genome Analysis Toolkit
  gatk HaplotypeCaller \
    -L chr22:29268316-29300343 \
    -R ../06/ref/chr22.fa \
    -I NA12878_phased_chr22.bam \
    -O NA12878_chr22_gatk_EWSR1.vcf.gz \
    -ERC GVCF # BP_RESOLUTION
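Both commands write VCF, a tab-separated text format with eight fixed columns followed by an optional FORMAT column and per-sample columns. The sketch below parses one data line. The example record is invented for illustration and is not output from the commands above; a GVCF will additionally contain reference-block records that this minimal parser does not interpret.

```python
def parse_vcf_line(line: str) -> dict:
    """Parse a single VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info,
        "samples": [],
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
        for sample in fields[9:]:
            record["samples"].append(dict(zip(keys, sample.split(":"))))
    return record


if __name__ == "__main__":
    # Hypothetical record, made up for this example.
    example = "chr22\t29270000\t.\tA\tG\t50.0\tPASS\tDP=30\tGT:DP\t0/1:30"
    rec = parse_vcf_line(example)
    print(rec["chrom"], rec["pos"], rec["ref"], rec["alt"], rec["samples"][0]["GT"])
```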