Analyzing DNA using Microarray and Next Generation Sequencing (1). Background. SNP array: basic design; applications (CNV, LOH, GWAS). Deep sequencing: alignment and assembly; applications (structural changes, GWAS).
The chromosome
SNP: Variations in DNA sequence. Single Nucleotide Polymorphism (SNP) --- a single letter change in the DNA. Common SNPs occur every few hundred bases. Each form is called an “allele”. Almost all SNPs have only two alleles. Allele frequencies often differ between ethnic groups.
Correlations between SNPs. Why measure the SNP alleles? DNA changes in two ways during evolution: point mutation, which creates SNPs, and recombination, which happens in large segments. As a result, alleles of adjacent SNPs are highly dependent. Haplotype: a group of alleles linked closely enough to be inherited mostly as a unit.
Why SNP? Figure 1: This diagram shows two ancestral chromosomes being scrambled through recombination over many generations to yield different descendant chromosomes. If a genetic variant marked by the A on the ancestral chromosome increases the risk of a particular disease, the two individuals in the current generation who inherit that part of the ancestral chromosome will be at increased risk. Adjacent to the variant marked by the A are many SNPs that can be used to identify the location of the variant.
Why SNP? Nature Genetics 26, (2000) Figure 1. Schematic model of trait aetiology. The phenotype under study, Ph, is influenced by diverse genetic, environmental and cultural factors (with interactions indicated in simplified form). Genetic factors may include many loci of small or large effect, G_Pi, and polygenic background. Marker genotypes, G_x, are near to (and hopefully correlated with) a genetic factor, G_p, that affects the phenotype. Genetic epidemiology tries to correlate G_x with Ph to localize G_p. Above the diagram, the horizontal lines represent different copies of a chromosome; vertical hash marks show marker loci in and around the gene, G_p, affecting the trait. The red P_i are the chromosomal locations of aetiologically relevant variants, relative to Ph. Figure labels: SNPs; the gene deciding phenotype.
SNP array The SNP array Affymetrix.com
SNP array (Affymetrix.com): 40 probes per SNP (20 for the forward strand and 20 for the reverse strand). PM/MM (perfect match/mismatch) strategy. Data summary (generating AA/AB/BB calls) is omitted here.
SNP array. Genotype calls: association analysis, linkage analysis, loss of heterozygosity. Signal strength: copy number aberration.
CNA --- Background. Copy Number Aberration (CNA): a form of chromosomal aberration; a deviation from the regular 2 copies for some segments of the chromosomes; one of the key characteristics of cancer. CNA in cancer: reduces the copy number of tumor-suppressor genes; increases the copy number of oncogenes; possibly related to metastasis.
CNA --- the statistician’s task High density arrays allow us to identify “focused CNA”: copy number change in small DNA segments. With the high per-probeset noise, how to achieve high sensitivity AND specificity?
CNA – maximizing sensitivity/specificity. Two approaches that complement each other: (1) Reducing noise at the single-probeset level: based on dose-response (Huang et al., 2006) or on sequence properties (Nannya et al., 2005). (2) Segmentation methods: smoothing; Hidden Markov Model-based methods; Circular Binary Segmentation … …
HMM data segmentation. Fridlyand et al. Journal of Multivariate Analysis, June 2004, V. 90, pp. Figure labels: Amplified, Normal, Deleted.
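To illustrate how an HMM can segment noisy probe-level data into deleted, normal and amplified states, here is a minimal sketch using Viterbi decoding. It is a generic three-state Gaussian HMM, not the specific model of Fridlyand et al.; the emission means, the noise level `SIGMA` and the self-transition probability `STAY` are all assumed values chosen for illustration.

```python
import math

STATES = ("deleted", "normal", "amplified")
# Assumed emission means for log2(copy number / 2): 1, 2 and 3 copies.
MEANS = {"deleted": -1.0, "normal": 0.0, "amplified": 0.58}
SIGMA = 0.2   # assumed per-probe noise (standard deviation)
STAY = 0.98   # assumed probability of staying in the same state


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(log_ratios):
    """Return the most likely state sequence for a list of log2 ratios."""
    if not log_ratios:
        return []
    switch = (1 - STAY) / (len(STATES) - 1)
    log_trans = {(a, b): math.log(STAY if a == b else switch)
                 for a in STATES for b in STATES}
    # Uniform prior over the initial state.
    score = {s: math.log(1 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SIGMA)
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans[(prev, s)] + log_gauss(x, MEANS[s], SIGMA)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Because switching states is penalized, a single outlying probe is absorbed into the surrounding segment, while a run of consistently shifted probes is called as a separate segment. This is how segmentation trades per-probe noise for sensitivity and specificity at the segment level.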
Forward-backward fragment assembling
Some examples: Top: model cell line, 3-copy segment in chromosome 9. Bottom: cancer sample.
LOH. Loss of Heterozygosity (LOH) happens in segments of DNA. Keith W. Brown and Karim T.A. Malik, 2001, Expert Reviews in Molecular Medicine
Discov Med Jul;12(62): LOH. On a SNP array, LOH will yield homozygous calls (AA or BB, rather than AB) for a number of consecutive SNPs.
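The idea of spotting LOH as a stretch of consecutive homozygous calls can be sketched as a simple run finder. The function name `find_loh_runs` and the threshold `min_run` are hypothetical. Long homozygous runs can also arise without LOH, so this is only a first-pass screen, not a complete LOH caller.

```python
def find_loh_runs(calls, min_run=5):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_run consecutive homozygous calls (AA or BB)."""
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call in ("AA", "BB"):
            if start is None:
                start = i          # a homozygous run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None           # a heterozygous call ends the run
    # Handle a run that extends to the last SNP.
    if start is not None and len(calls) - start >= min_run:
        runs.append((start, len(calls)))
    return runs
```

For example, with `min_run=5`, the calls `AB AA AB AA BB AA AA BB AB AA` contain one candidate region covering the five SNPs at indices 3 to 7.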
GWAS © Pasieka, Science Photo Library
GWAS
Nature Genetics 41, (2009) GWAS Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer
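A GWAS tests each SNP for association with case/control status. One common per-SNP test is the allelic chi-square test on a 2x2 table of allele counts, sketched below. The function names and the genotype-count layout `(n_AA, n_AB, n_BB)` are assumptions for illustration; this is not necessarily the test used in the study cited above.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Chi-square statistic (1 degree of freedom) comparing allele counts.

    Each argument is a tuple (n_AA, n_AB, n_BB) of genotype counts.
    """
    def allele_counts(genotypes):
        n_aa, n_ab, n_bb = genotypes
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab   # counts of A and B alleles

    observed = (allele_counts(case_counts), allele_counts(control_counts))
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    n = sum(row)
    if min(row + col) == 0:
        return 0.0   # monomorphic SNP or empty group: no test possible
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))
```

In a genome-wide scan this test is repeated for every SNP, so the significance threshold has to be corrected for multiple testing.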
DNA sequencing
Background
Alignment and Assembly. When a reference genome is available --- Alignment: Can rely on the existing reference genome as a blueprint. Align the short reads onto the reference genome. Need a few-fold coverage to cover most regions. Sequencing a whole new genome --- Assembly: Overlaps between reads are required to construct the genome. Because the reads are short, ~30-fold coverage is needed. At 3G of data per run, a new genome of human-like size (~3G bases) needs about 30 runs.
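The coverage arithmetic above can be checked with a short sketch. The function names are hypothetical, and the human-like genome size of 3e9 bases is an assumed round figure. The fraction-covered estimate assumes reads land uniformly at random (a Poisson model), which real data only approximate.

```python
import math


def runs_needed(genome_size_bp, fold_coverage, bases_per_run):
    """Number of sequencing runs needed to reach the target fold coverage."""
    return math.ceil(genome_size_bp * fold_coverage / bases_per_run)


def expected_fraction_covered(fold_coverage):
    """Expected fraction of bases hit by at least one read,
    assuming uniformly random read placement: 1 - exp(-coverage)."""
    return 1 - math.exp(-fold_coverage)


# 30-fold coverage of a 3G-base genome at 3G bases per run.
print(runs_needed(3_000_000_000, 30, 3_000_000_000))
# Fraction of the genome expected to be covered at 3-fold coverage.
print(round(expected_fraction_covered(3), 3))
```

Under this model a few-fold coverage already touches most bases, which is why alignment to a reference needs far less data than de novo assembly.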
Hash table-based alignment. Similar to BLAST in principle: (1) find potential locations; (2) local alignment.
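The two steps can be sketched as a toy seed-and-verify aligner. A hash table of reference k-mers gives candidate locations, and each candidate is then checked. As a simplification, step (2) here only counts mismatches without gaps, rather than running a full local alignment. The names `build_index`, `align_read` and `max_mismatches` are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash table mapping every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for the best ungapped placement, or None."""
    # Step 1: find potential locations from exact k-mer (seed) hits.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Step 2: verify each candidate (ungapped comparison as a stand-in
    # for local alignment).
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best
```

The index is built once per reference and reused for every read, which is what makes the hash-table approach practical for millions of short reads.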
Alignment and Assembly From read to graph:
Alignment and Assembly
de Bruijn graph assembly Red: read error.
Alignment and Assembly de Bruijn graph assembly
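A minimal sketch of de Bruijn graph construction from reads: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a contig is read off by following nodes that have exactly one successor. The function names are hypothetical, and the walk ignores the branch resolution and error correction that real assemblers need.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return graph[prefix][suffix] = number of times the k-mer was seen."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while the path has a single successor."""
    contig = start
    node = start
    seen = {start}
    while node in graph and len(graph[node]) == 1:
        nxt = next(iter(graph[node]))
        if nxt in seen:
            break                  # stop instead of looping around a cycle
        contig += nxt[-1]          # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, 4)
print(walk_unambiguous(graph, "ATG"))
```

The edge counts are kept because a read error typically creates a side branch supported by very few reads, so low-count edges are natural candidates for pruning.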
Whole genome/exome/transcriptome sequencing
Genomics. Whole genome sequencing detects all variants (SNP alleles, rare variants, mutations). These could be associated with disease: rare variants (burden testing by collapsing by gene); de novo mutations (need family tree); rare Mendelian disorders; structural variants in cancer.
Structural changes: Identification of translocations from discordant paired-end reads. Cancer Genetics 206 (2014) 432-440
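Discordant pairs are read pairs whose two ends map in a way that the library design does not predict. A minimal sketch of the flagging step follows, assuming each pair is already summarized as a tuple `(name, chrom1, pos1, chrom2, pos2)`. That layout, the function name and the `max_insert` threshold are all hypothetical. Real pipelines work from alignment files and also check read orientation.

```python
def discordant_pairs(pairs, max_insert=1000):
    """Return (name, reason) for pairs that look discordant.

    pairs: iterable of (name, chrom1, pos1, chrom2, pos2).
    """
    flagged = []
    for name, chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom2:
            # Ends on different chromosomes: candidate translocation.
            flagged.append((name, "interchromosomal"))
        elif abs(pos2 - pos1) > max_insert:
            # Ends too far apart: candidate deletion or other rearrangement.
            flagged.append((name, "insert_too_large"))
    return flagged
```

A single discordant pair can be a mapping artifact, so calls are usually made only where several independent pairs support the same pair of breakpoints.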
Structural changes: CNV by depth of coverage. Cancer Genetics 206 (2014) 432-440
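Depth-of-coverage CNV detection compares read depth in a sample against a control, window by window. The sketch below computes a log2 ratio per non-overlapping window. The function name, the per-base depth lists and the `scale` factor (a library-size correction, assumed to be known) are hypothetical choices for illustration.

```python
import math


def window_log2_ratios(sample_depth, control_depth, window, scale=1.0):
    """log2(sample / control) of summed depth in each non-overlapping window.

    scale multiplies the sample depth to correct for differences in total
    sequencing depth between the two libraries.
    """
    ratios = []
    for start in range(0, len(sample_depth) - window + 1, window):
        s = sum(sample_depth[start:start + window]) * scale
        c = sum(control_depth[start:start + window])
        if s > 0 and c > 0:
            ratios.append(math.log2(s / c))
        else:
            ratios.append(float("nan"))   # no usable depth in this window
    return ratios
```

A window going from 2 to 3 copies gives a ratio near log2(3/2), about 0.58, and a single-copy loss gives about -1. The resulting profile can then be segmented in the same way as array-based copy number data.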
Structural changes. Cancer Genetics 206 (2014) 432-440
Genotype calling
Medical Genomics Nature Reviews Genetics 11, 415 Example: Extreme-case sequencing to find rare variants associated with a disease.
GWAS