Detection of Rare-Alleles and Their Carriers Using Compressed Se(que)nsing Or Zuk Broad Institute of MIT and Harvard In collaboration.

Presentation transcript:

Detection of Rare-Alleles and Their Carriers Using Compressed Se(que)nsing. Or Zuk, Broad Institute of MIT and Harvard. In collaboration with: Amnon Amir, Dept. of Physics of Complex Systems, Weizmann Inst. of Science; Noam Shental, Dept. of Computer Science, The Open University of Israel.

The Problem: identify genotypes (disease) in a large population. [Figure: individuals labeled with genotypes AA and AB.] Specifics: large populations (hundreds to tens of thousands); rare alleles; pre-defined genomic regions.

Naïve Approach – targeted selection + next-gen sequencing: one test per individual. Collect DNA samples, apply targeted selection, then apply 9 independent tests. [Figure: per-individual genotypes (AA, AB) and the fraction of B’s out of tested alleles.] Problem: rare alleles require profiling a large number of individuals, which is still very costly. Multiplexing/barcoding provides a partial solution (laborious, expensive, often not enough different barcodes).

Our approach – targeted selection + smart pooling + next-gen sequencing. Collect DNA samples, prepare pools, apply targeted selection, apply 3 pooled tests, then reconstruct genotypes. [Figure: pooled genotypes (AA, AB) and the fraction of B’s out of tested alleles.] Advantages: fewer pools; reduced sample preparation and sequencing costs; can still achieve accurate genotypes.

Application 1: Rare recessive genetic diseases. Identify carriers of known deleterious mutations.
Genotype      Phenotype
Normal        Healthy
Carrier       Healthy!
Affected      Sick

Nationwide carrier screen

Large-scale carrier screen (rates vary across ethnic groups)
Genetic Disorder          Carrier rate
Tay-Sachs                 1:25
Cystic Fibrosis           1:30
Familial Dysautonomia     1:30
Usher Syndrome            1:40
Canavan                   1:40
Glycogen Storage          1:71
Fanconi Anemia C          1:80
Niemann-Pick              1:80
Mucolipidosis type 4      1:100
Bloom                     1:102
Nemaline Myopathy         1:108

Specific mutations – notation. “A” is the reference genome allele (…AGCGTTCT…). “B” is a variant: a single-nucleotide polymorphism (SNP), e.g. …AGTGTTCT…, or an insertion/deletion (InDel), e.g. …AGGTTCT. Carrier test screen: amplify a sample of DNA and then test. Fraction of B’s out of tested alleles: 0 for “AA”, 1/2 for “AB”.

Application 2: Genome-Wide Association Studies. Collect DNA samples from cases and controls; each individual has genotype AA, AB or BB. Count genotypes in each group, then apply a statistical test to obtain a p-value.
Genotype      Cases      Controls
AA            X_AA       Y_AA
AB            X_AB       Y_AB
BB            X_BB       Y_BB
Try ~10^5 – 10^6 different SNPs. Significant ones are called ‘discoveries’/‘associations’.
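The slide does not say which statistical test is applied to the genotype counts. As a minimal sketch, assuming a Pearson chi-square test on the cases/controls by AA/AB/BB table (one common choice, not necessarily the one used in the talk), the per-SNP p-value could be computed like this. The function name and the example counts are hypothetical.

```python
import math

def genotype_chi_square(cases, controls):
    """Pearson chi-square test on a 2 x 3 table of genotype counts.

    cases, controls: (n_AA, n_AB, n_BB) counts for one SNP.
    Returns (statistic, p_value), assuming 2 degrees of freedom, i.e. every
    genotype column has a nonzero total.
    """
    table = [list(cases), list(controls)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected

    # With 2 degrees of freedom the chi-square survival function is exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical counts: B is enriched among controls.
stat, p = genotype_chi_square(cases=(60, 30, 10), controls=(40, 30, 30))
```

Because 10^5 to 10^6 SNPs are tested, the per-SNP significance threshold has to be far stricter than 0.05; a Bonferroni-style cutoff such as 0.05 / 10^6 = 5e-8 is a common convention, though the slides do not state which threshold was used.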

What Associations are Detected? [T.A. Manolio et al. Nature 2009] Goal: push further. Find novel mutations associated with common disease, and their carriers.

What Associations are Detected? Find novel mutations associated with common disease, and their carriers. Proposed approaches: profile larger populations; look at SNPs with lower minor allele frequency; re-sequence regions where common SNPs were found, and other regions of interest.

Compressed Sensing Based Group Testing. [Diagram: next-generation sequencing technology measures the fraction of B’s in a few pooled tests instead of 9 individual tests; compressed sensing (CS) is used to infer/reconstruct the genotypes.]

Rare Allele Identification in a CS Framework. [Equation with labeled terms: individuals in the pool; # rare alleles.]
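The equation on this slide did not survive transcription. As a rough sketch of the setup its labels suggest, assume each individual i carries x_i in {0, 1, 2} rare (B) alleles, a 0/1 matrix records which individuals are in which pool, and every individual contributes the same amount of DNA. The ideal, noise-free measurement for a pool is then the number of B alleles in it divided by twice the pool size. The function name and the toy design below are illustrative, not taken from the slides.

```python
from fractions import Fraction

def pooled_b_fractions(pools, genotypes):
    """Ideal (noise-free) fraction of B alleles in each pool.

    pools: list of 0/1 rows; pools[j][i] == 1 if individual i is in pool j.
           Every pool is assumed to be non-empty.
    genotypes: genotypes[i] in {0, 1, 2} = number of B alleles (AA, AB, BB).
    """
    result = []
    for row in pools:
        size = sum(row)
        b_alleles = sum(m * g for m, g in zip(row, genotypes))
        # Each individual contributes two alleles to the pool.
        result.append(Fraction(b_alleles, 2 * size))
    return result

# Nine individuals, one heterozygous carrier (index 2), three pools of five.
genotypes = [0, 0, 1, 0, 0, 0, 0, 0, 0]
pools = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 1, 1],
]
measurements = pooled_b_fractions(pools, genotypes)
```

Here the carrier sits in the first two pools only, so the measurements are 1/10, 1/10 and 0.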

Compressed Sensing (CS). The standard CS problem: n variables, k << n equations. But x is sparse (at most s nonzero entries), and the matrix should obey certain properties (the Restricted Isometry Property); example: a random Gaussian or Bernoulli matrix. Then x can be reconstructed uniquely with k = O(s log(n/s)) equations (a.k.a. ‘measurements’), and this can be done efficiently, even for large matrices (L1 minimization).
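To make the idea of recovering a sparse x from fewer equations than variables concrete, here is a toy sketch. It deliberately swaps the L1 minimization mentioned on the slide for an exhaustive search over all genotype vectors with at most `max_carriers` carriers, which is only feasible for tiny examples. The pooling design (individual i joins pool b when bit b of i + 1 is set) and all names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def pooled_b_fractions(pools, genotypes):
    """Noise-free fraction of B alleles per pool (equal DNA amounts assumed)."""
    return [
        Fraction(sum(m * g for m, g in zip(row, genotypes)), 2 * sum(row))
        for row in pools
    ]

def reconstruct_sparse(pools, measurements, max_carriers):
    """Return every genotype vector with at most max_carriers nonzero entries
    (each 1 = AB or 2 = BB) that reproduces the measurements exactly."""
    n = len(pools[0])
    solutions = []
    for k in range(max_carriers + 1):
        for support in combinations(range(n), k):
            for values in product((1, 2), repeat=k):
                candidate = [0] * n
                for idx, val in zip(support, values):
                    candidate[idx] = val
                if pooled_b_fractions(pools, candidate) == measurements:
                    solutions.append(candidate)
    return solutions

# 9 individuals, 4 pools: individual i is in pool b if bit b of (i + 1) is set.
n = 9
pools = [[(i + 1) >> b & 1 for i in range(n)] for b in range(4)]

true_genotypes = [0] * n
true_genotypes[5] = 1  # one heterozygous carrier
measurements = pooled_b_fractions(pools, true_genotypes)
solutions = reconstruct_sparse(pools, measurements, max_carriers=1)
```

With this design every individual has a distinct pool membership pattern, so four pooled measurements identify the single carrier among nine people. The search cost grows combinatorially with n and the number of carriers, which is why an efficient method such as L1 minimization is needed at realistic scale.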

NextGenSeq Output. Output: “reads” (in the figure, each line is a “read”). Example: Illumina, a few million reads per lane; read length from a few dozen to a few hundred bases.

NextGenSeq – Targeted Sequencing. Measure the number of reads containing B out of the total number of reads. Here: 1/16.

Model Formulation. Parts of this modeling appeared in [P. Prabhu & I. Pe’er, Genome Research July 09]. Ideal measurement: the fraction of “B” reads. NGST measurement: (1) sampling noise: a finite number of reads, r, from each site, where r is itself a random variable; (2) technical errors: read errors (0.5-1%) and DNA preparation errors. Estimated frequency: [equation with a sparsity-promoting term and an error term.]
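The two noise sources named on this slide can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the slide's exact model: the number of reads is fixed rather than random, a read error is a symmetric flip between A and B, and DNA preparation errors are ignored. The function name, seed and parameter values are hypothetical; the 1% error rate is the upper end of the 0.5-1% range quoted above.

```python
import random

def simulate_pool_measurement(true_fraction, num_reads, read_error, rng):
    """Simulate the observed fraction of B reads for one pool.

    Sampling noise: each read comes from a B allele with probability
    true_fraction. Read error: each read is then misreported (A <-> B)
    with probability read_error.
    """
    b_reads = 0
    for _ in range(num_reads):
        is_b = rng.random() < true_fraction
        if rng.random() < read_error:
            is_b = not is_b
        b_reads += is_b
    return b_reads / num_reads

rng = random.Random(0)  # fixed seed so the run is repeatable
true_fraction = 1 / 16
observed = simulate_pool_measurement(true_fraction, 20000, 0.01, rng)

# Expected observed fraction under this flip model.
expected = true_fraction * (1 - 0.01) + (1 - true_fraction) * 0.01
```

Under this flip model the read error shifts the expected observed fraction from 1/16 = 0.0625 to 0.07125, a bias that matters when the true allele frequency is itself only a few percent.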

Results (simulations). arxiv v1 [f = freq. of rare allele]. Can reconstruct over 10,000 people with no errors, using only 200 lanes. Software package: Comseq [unique solver for this application: noise model, translating to CS, reconstruction..]

Results (real data). 1. Pooled-sequencing experimental data: validate the pooling part (variation in amount of DNA). 2. 1000 Genomes data: validate all other technical errors (e.g. read error, sampling error) in a large-scale experiment.

Results (dataset 1). Pooling dataset from: [Out et al., Human Mutation 2009]. 88 people in one pool – region length (hyb-selection) sequenced by 5 SNPs identified, of which 9 are ‘rare’ (carrier freq. < 4%): 5 with one carrier, 3 with two carriers, 1 with three carriers. Create ‘in-silico’ pools: randomize individuals’ identity in each pool; determine the number of carriers; sample frequencies based on the frequencies observed in the single pool for the same number of carriers.

Results (dataset 1). Pooling dataset from: [Out et al., Human Mutation 2009]. [Cartoon figure.]

Results (dataset 1). One and two carriers: real pooling results match the theoretical model. Three carriers: real pooling results are worse due to one problematic SNP. When constructing pools of at most 2 people, results match the theoretical model. [Plot: % with perfect reconstruction vs. # tests.]

Results (dataset 2). 1000 Genomes data: Pilot 3 data: exome sequencing, ~1000 genes, ~700 people. Filtered: 633 rare SNPs (MAF < 2%), of which 20 contained rare heterozygous; 364 individuals sequenced by Illumina. Create ‘in-silico’ pools: randomize individuals’ identity in each pool; determine the number of carriers; sample an individual from the pool at random, then sample a read from the set of reads for this individual.
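The in-silico pooling step described above (pick an individual from the pool at random, then pick one of that individual's reads) could be sketched as follows. The individual IDs, read counts, seed and function name are made up for illustration, and reads are reduced to a single allele call at one SNP.

```python
import random

def sample_pool_reads(pool_members, reads_by_individual, num_reads, rng):
    """Draw reads for an in-silico pool, one read at a time.

    pool_members: list of individual IDs in the pool.
    reads_by_individual: dict mapping ID -> list of allele calls ('A' or 'B')
        observed at the SNP in that individual's own sequencing data.
    """
    sampled = []
    for _ in range(num_reads):
        person = rng.choice(pool_members)  # uniform over pool members
        sampled.append(rng.choice(reads_by_individual[person]))
    return sampled

# Hypothetical per-individual reads; "ind2" is a heterozygous carrier.
reads_by_individual = {
    "ind1": ["A"] * 30,
    "ind2": ["A"] * 14 + ["B"] * 16,
    "ind3": ["A"] * 25,
    "ind4": ["A"] * 40,
}
rng = random.Random(1)  # fixed seed so the run is repeatable
pool_reads = sample_pool_reads(
    ["ind1", "ind2", "ind3", "ind4"], reads_by_individual, 4000, rng
)
b_fraction = pool_reads.count("B") / len(pool_reads)
```

Sampling the individual first gives each pool member equal weight regardless of how many reads that person has, which mimics equal DNA amounts in the pool. In this toy case the expected B fraction is (1/4) × (16/30) ≈ 0.133, slightly above the ideal 1/8 because the carrier's own reads are not split exactly 50/50.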

Results (dataset 2). Results derived from actual 1000 Genomes reads match simulations from our statistical model.

Conclusions. Generic approach: puts together sequencing and CS to identify rare allele carriers. Naturally deals with all possible scenarios of multiple carriers and heterozygous or homozygous rare alleles. Much higher efficiency than the naive approach. Can be combined with barcoding. Manuscript available on arxiv: arxiv v1 [N. Shental, A. Amir and O. Zuk, in revision]. Comseq package: code available at: [simulating, designing experiments, reconstructing genotypes..]

Thank You. Noam Shental, Amnon Amir.