Software tools for the analysis of medically important sequence variations Gabor T. Marth, D.Sc. Boston College Department of Biology

Presentation transcript:

Software tools for the analysis of medically important sequence variations Gabor T. Marth, D.Sc. Boston College Department of Biology Pfizer visit, March

Our lab focuses on three main projects… 1. software tools for clinical case-control association studies, 2. software for SNP discovery in clonal and re-sequencing data, 3. connecting HapMap and pharmaco-genetic data

1. We are developing computer software to aid tagSNP selection and association testing (discussed in more detail). [Diagram labels: user input; study specification; reference samples; computational sample database; representative computational samples; tag evaluation; marker selection; association testing; GUI user control interface with input data views, LD views, gene annotations, tags, association statistics]

We build computer tools for SNP discovery. Inherited (germ line) polymorphisms are important as they can predispose to disease. We are looking for SNPs and short INDELs. We have a 5-year NIH R01 grant to re-develop our computer package, PolyBayes©, our SNP discovery tool originally developed while the PI was at the Washington University Medical School (Marth et al. Nature Genetics 1999).

Apply our tools for genome-scale SNP mining Sachidanandam et al. Nature 2001 ~ 10 million EST WGS BAC genome reference

Extend our methods for SNP detection in medical re-sequencing data from traditional Sanger sequencers… [Trace examples: homozygous T, homozygous C, heterozygous C/T]

… and in 454 pyrosequence data: accurate base calling for de novo sequencing, and detection of heterozygotes in medical re-sequencing data (discussed in more detail). [Figures: 454 sequence from the NCBI Trace Archive; figure from Nordfors et al. Human Mutation 19: (2002)]

Developing methods to detect somatic mutations (as distinguished from inherited polymorphisms) © Brian Stavely, Memorial University of Newfoundland the detection of somatic mutations, and their distinction from inherited polymorphism, will be important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer (discussed in more detail)

Process DNA methylation data obtained with sequencing. DNA methylation is important, e.g. because hypo- and hypermethylation are consistently present in various cancers (Issa. Nature Reviews Cancer, 4, 2004: ). We are developing methods to interpret DNA methylation data obtained with sequencing, in the presence of methodological artifacts such as incomplete bisulfite conversion of unmethylated cytosines (Lewin et al. Bioinformatics, 20: , 2004).

… and tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development chromatin structure gene expression profiles copy number changes methylation profiles chromosome rearrangements repeat expansions somatic mutations

3. We are planning a project to connect multi-marker haplotypes to drug metabolic phenotypes predicting metabolic phenotypes (ADR) based on haplotype markers evolutionary origin of drug metabolizing enzyme polymorphisms

Computer software to aid case-control association studies: tagSNP selection and association testing (details) Dr. Eric Tsung

Clinical case-control association studies – concepts association studies are designed to find disease-causing genetic variants searching “significant” marker allele frequency differences between cases and controls AF(cases) AF(controls) clinical cases clinical controls genotyping cases and controls at various polymorphisms
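As a minimal sketch of the comparison described above, the following Python function tests whether a marker allele frequency differs between cases and controls. It uses a two-proportion z-test on allele counts, which is one common choice and not necessarily the statistic used in the software discussed here; the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt

def allele_freq_test(case_alt, case_total, ctrl_alt, ctrl_total):
    """Two-proportion z-test on allele counts.

    Counts are chromosomes (two per diploid individual), not individuals.
    Returns (AF in cases, AF in controls, two-sided p-value).
    """
    af_cases = case_alt / case_total
    af_ctrls = ctrl_alt / ctrl_total
    # Pooled allele frequency under the null hypothesis of no difference
    pooled = (case_alt + ctrl_alt) / (case_total + ctrl_total)
    se = sqrt(pooled * (1 - pooled) * (1 / case_total + 1 / ctrl_total))
    if se == 0:
        # Allele is fixed or absent in both groups: no evidence of a difference
        return af_cases, af_ctrls, 1.0
    z = (af_cases - af_ctrls) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return af_cases, af_ctrls, p_value

# Hypothetical counts: 300 of 1000 case chromosomes vs. 200 of 1000 control chromosomes
print(allele_freq_test(300, 1000, 200, 1000))
```

A small p-value here only flags a frequency difference at the marker; the later slides explain why natural sample-to-sample variation has to be taken into account before calling such a difference significant.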

Association study designs. Region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”). Direct or indirect: the causative variant itself, or a marker that is co-inherited with the causative variant. Single-SNP marker or multi-SNP haplotype marker. Single-stage or multi-stage.

Marker (tag) selection for association studies. For economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process where a subset of all available SNPs is chosen. 1. Hypothesis-driven (i.e. based on gene function; causative variant). 2. LD-driven: based entirely on the reduction of redundancy presented by the linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are correlated with.
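The LD-driven idea can be illustrated with a short Python sketch. It assumes phased haplotypes coded as 0/1 tuples, measures LD with the pairwise r² statistic, and picks tags greedily; the greedy rule and the 0.8 threshold are illustrative assumptions, not a description of the lab's own selection algorithm.

```python
def r_squared(haps, i, j):
    """Pairwise LD (r^2) between SNPs i and j across 0/1-coded haplotypes."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ij - p_i * p_j
    return d * d / denom

def greedy_tags(haps, threshold=0.8):
    """Repeatedly pick the SNP that captures the most still-untagged SNPs."""
    n_snps = len(haps[0])
    # For each SNP, the set of SNPs it represents (itself included)
    captures = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(haps, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags

# Toy data: SNPs 0 and 1 are perfectly correlated, SNP 2 is independent
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(greedy_tags(haps))
```

In the toy data one tag is enough for the two correlated SNPs, while the independent SNP has to be genotyped itself.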

The International HapMap project The international HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure

LD varies across samples African reference (YRI) there are large differences in LD between different human populations… European reference (CEU) … and even between samples from the same population. Other European samples

Sample-to-sample LD differences make tagSNP selection problematic groups of SNPs that are in LD in the HapMap reference samples may not be in a future set of clinical samples… … and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples… … possibly resulting in missed disease associations.

Natural marker allele frequency differences confound association testing. The HapMap reference samples (~120 chromosomes) are much smaller than clinical sample sizes (cases: 500-2,000 chromosomes; controls: 500-2,000 chromosomes). It is therefore difficult to accurately assess both the marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and the naturally occurring variation of marker allele frequency differences between cases and controls, AF(cases) vs. AF(controls), and hence difficult to assess the statistical significance of candidate associations.

We are developing technology for assessing sample-to-sample variance in silico. We estimate LD differences between HapMap and future clinical samples… …by generating “computational” samples (“cases” and “controls”) representing future clinical samples… … and use computational “proxy” samples for tabulating LD and allele frequency differences. [Diagram labels: reference, cases, controls; tag evaluation, tag selection, association testing]

Two methods of computational sample generation “HapMap” “cases” “controls” HapMap Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow. Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
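The mosaic idea behind Method 2 can be sketched in a few lines of Python. This is a deliberately simplified copying model, not the full Li & Stephens PAC scheme: it uses one uniform switch probability between adjacent SNPs in place of a recombination map, copies only from the original reference haplotypes, and all function and parameter names are hypothetical.

```python
import random

def mosaic_haplotype(reference, switch_prob=0.05, mutation_prob=0.0, rng=random):
    """Build one new 0/1 haplotype as a mosaic of reference haplotypes."""
    n_snps = len(reference[0])
    source = rng.randrange(len(reference))  # haplotype currently being copied
    new_hap = []
    for pos in range(n_snps):
        # A switch to another source haplotype mimics a recombination event
        if pos > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference))
        allele = reference[source][pos]
        # Optional imperfect copying mimics mutation
        if rng.random() < mutation_prob:
            allele = 1 - allele
        new_hap.append(allele)
    return new_hap

def computational_sample(reference, size, **kwargs):
    """Generate `size` mosaic haplotypes from the reference panel."""
    return [mosaic_haplotype(reference, **kwargs) for _ in range(size)]

# Toy reference panel with two haplotypes over six SNPs
reference = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
cases = computational_sample(reference, 4, switch_prob=0.3, rng=random.Random(1))
print(cases)
```

With `switch_prob=0` every generated haplotype is an exact copy of a reference haplotype; larger values break the reference haplotypes into shorter copied segments.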

Computational samples HapMap (CEU) Computational (PAC) Computational (Coalescent) Extra genotypes (Estonia)

MARKER EVALUATION with computational samples test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group

MARKER SELECTION with computational samples selecting tags in multiple consecutive sets of computational samples and choosing for the association study the best-performing tags

ASSOCIATION TESTING with computational samples “cases” “controls” “cases” “controls” “cases” “controls” tabulating ΔAF in “cases” vs. “controls” in multiple consecutive computational pairs of samples provides the natural range of allele frequency differences to decide if a candidate association is statistically significant AF(cases) AF(controls)
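A hedged sketch of how tabulated ΔAF values could be turned into a significance estimate: the function below compares an observed allele frequency difference against the differences seen in many computational “case”/“control” pairs and reports an empirical p-value. The add-one correction and all names are illustrative assumptions, not the tool's documented behavior.

```python
def allele_freq(sample, snp):
    """Frequency of the 1-allele at `snp` in a list of 0/1 haplotypes."""
    return sum(h[snp] for h in sample) / len(sample)

def empirical_p_value(observed_delta, null_pairs, snp):
    """Fraction of computational pairs whose |dAF| is at least the observed one.

    null_pairs: list of (cases, controls) computational samples generated
    without any disease effect, so their dAF reflects natural variation only.
    """
    null_deltas = [
        abs(allele_freq(cases, snp) - allele_freq(controls, snp))
        for cases, controls in null_pairs
    ]
    exceed = sum(1 for d in null_deltas if d >= abs(observed_delta))
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (exceed + 1) / (len(null_deltas) + 1)
```

The smallest p-value this can report is 1 / (number of pairs + 1), so the number of computational pairs limits the resolution of the test.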

Do computational samples represent future clinical genotypes realistically? We quantify the quality of representation by comparing the correlation of LD between corresponding pairs of markers (i.e. we ask: if two markers were in strong LD in one set of samples, are they ALSO in strong LD in the other set?).
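One way to make this comparison concrete is sketched below in Python: compute r² for every marker pair in each sample set, then take the Pearson correlation between the two lists of r² values. The choice of r² as the LD measure and of Pearson correlation as the summary are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def r_squared(haps, i, j):
    """Pairwise LD (r^2) between SNPs i and j across 0/1-coded haplotypes."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0
    d = p_ij - p_i * p_j
    return d * d / denom

def ld_vector(haps):
    """r^2 for every SNP pair, in a fixed pair order."""
    n_snps = len(haps[0])
    return [r_squared(haps, i, j) for i, j in combinations(range(n_snps), 2)]

def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def ld_concordance(haps_a, haps_b):
    """Correlation of pairwise LD between two sample sets typed at the same SNPs."""
    return pearson(ld_vector(haps_a), ld_vector(haps_b))
```

A value near 1 means marker pairs in strong LD in one set also tend to be in strong LD in the other; both sets must be typed at the same SNPs in the same order.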

LD difference -- comparison to extra experimental genotypes. We have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA).

AF difference -- comparisons to extra experimental genotypes according to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability

A new marker selection and association testing software tool data visualization reference samples representative computational samples representative computational sample generation advanced tag selection functionality gene annotations tags LD views gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence) association statistics advanced association testing functionality multi-level user customization including user conveniences e.g. tag prioritization based on SNP assay score

User community: companies designing new generations of whole-genome or specialized SNP arrays; researchers comparing alternative platforms (e.g. Affymetrix 500K and the Illumina 300K) to find the one most suitable for their study; clinical researchers designing candidate gene studies; researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples). The association testing features should be useful for analysts regardless of study design.

Base calling and SNP detection in sequence traces including 454 data Aaron Quinlan

Base calling and SNP detection in sequence traces including 454 “pyrogram” data PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces C C G G A T C G 5’ 3’ 5’ 3’

Heterozygote detection in sequence traces Ind. 1 Ind. 2 Ind. 3 Ind. 4

Individual traces we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions

Aggregating information from multiple traces forward/reverse sequences from same individual P(GT ) =.993 resultant genotype call P(GT | Read) =.98 P(GT | Read) =.87

Discovery vs. genotyping Prior(CT) =.001 discovery: “uninformed prior” don’t know if site is polymorphic have to test each site Prior(CT) = 0.34 genotyping: “informed prior” 1. site is known to be polymorphic 2. allele frequency estimate
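The effect of the prior, and of aggregating several traces, can be illustrated with a small Bayesian sketch in Python. It assumes just two hypotheses (heterozygous vs. homozygous) and independent traces, and the likelihood values are made up; the slides do not state how their posterior values were combined, so this is not expected to reproduce those numbers.

```python
from math import prod

def genotype_posterior(prior_het, read_likelihoods):
    """Posterior probability that a site is heterozygous.

    read_likelihoods: one (P(trace | het), P(trace | hom)) pair per trace,
    e.g. forward and reverse reads from the same individual.
    Traces are treated as independent, so their likelihoods multiply.
    """
    lik_het = prod(pair[0] for pair in read_likelihoods)
    lik_hom = prod(pair[1] for pair in read_likelihoods)
    numerator = prior_het * lik_het
    denominator = numerator + (1 - prior_het) * lik_hom
    if denominator == 0:
        raise ValueError("traces are impossible under both hypotheses")
    return numerator / denominator

# Hypothetical likelihoods for a forward and a reverse trace
evidence = [(0.9, 0.1), (0.8, 0.2)]
print(genotype_posterior(0.001, evidence))  # discovery: uninformed prior
print(genotype_posterior(0.34, evidence))   # genotyping: informed prior
```

The same trace evidence gives a much higher posterior under the informed genotyping prior than under the discovery prior, which is why discovery needs stronger evidence at each site.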

Our heterozygote detection works better than other methods. Performance measured on ~1000 alignments covering a 500 Kb region of chromosome 4. [Table columns: fraction of data analyzed; false discovery rate; fraction of heterozygotes found; fraction of homozygotes found. Surviving values: PolyBayes %97.8%; Polyphred %82.63%]

Base calling for “pyrograms”. From NCBI Trace Archive: we have access to standardized data formats. Readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle, e.g. TCAGGGGGGGGGGGACGACAAGGCGTGGGGA. The identity of consecutive bases is very reliable, but the length of mono-nucleotide runs (base number) is difficult to quantify (great for re-sequencing; but problematic for de novo sequencing).
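To illustrate why run lengths are the hard part, here is a naive Python base caller that turns a list of flow signals into bases by rounding each signal to the nearest integer run length. The flow order, the signal values, and the rounding rule are assumptions for illustration only; this is not the lab's base caller.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Naive pyrogram base calling.

    Each flow offers one nucleotide; the signal is roughly proportional to
    how many bases of that type were incorporated in that flow.
    """
    bases = []
    for k, signal in enumerate(signals):
        base = flow_order[k % len(flow_order)]
        # Round to the nearest non-negative integer run length
        run_length = max(0, int(signal + 0.5))
        bases.append(base * run_length)
    return "".join(bases)

# Short runs are easy to call: 1.02 -> 1 base, 0.05 -> 0 bases, 2.10 -> 2 bases
print(decode_flowgram([1.02, 0.05, 0.97, 2.10]))
# Long runs are not: a noisy signal of 10.4 vs. 10.6 changes the call by one base
print(len(decode_flowgram([10.4])), len(decode_flowgram([10.6])))
```

Which base appears is fixed by the flow, so base identity is reliable; a small relative error in a large signal, however, is enough to miscount a long mono-nucleotide run.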

SNP genotyping with pyrosequencers (Nordfors et al. Human Mutation 19: (2002)). We are in the process of identifying discriminating pyrogram features to use in our machine-learning methods to recognize polymorphic positions within traces.

Somatic mutation detection Michael Stromberg

Somatic mutations © Brian Stavely, Memorial University of Newfoundland the detection of somatic mutations, and their distinction from inherited polymorphism, is important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer 1. detect the mutations 2. classify whether somatic or inherited

Detecting somatic mutations with comparative data: based on comparison of cancer and normal tissue from the same individual. Often cancer tissue is highly heterogeneous, and the somatic mutant allele may be present at low allele frequency.

Detecting somatic mutations with subtraction if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence subtract apparent mutations that are present in sequence variation databases search for evidence that these mutations are genetic
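The subtraction step can be sketched as a set difference in Python. Variants are represented here as hypothetical (chromosome, position, reference allele, alternate allele) tuples, and the example entries are invented; absence from a variation database does not by itself prove a mutation is somatic, which is why the text above adds a search for further evidence.

```python
def candidate_somatic(tumor_variants, known_polymorphisms, normal_variants=()):
    """Variants seen in tumor tissue that are not explained as inherited.

    tumor_variants: variants called in cancer tissue against the reference.
    known_polymorphisms: variants listed in sequence variation databases.
    normal_variants: optional calls from matched normal tissue, if available.
    """
    inherited = set(known_polymorphisms) | set(normal_variants)
    return sorted(set(tumor_variants) - inherited)

# Invented example: one tumor variant is a known polymorphism, one is not
tumor = [("chr4", 1200, "C", "T"), ("chr4", 5300, "G", "A")]
database = [("chr4", 1200, "C", "T")]
print(candidate_somatic(tumor, database))
```

Passing matched normal-tissue calls as `normal_variants` covers the comparative design as well, since anything also present in normal tissue is removed.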

Detecting somatic mutations with subtraction: we have applied our methods for somatic mutation detection in murine mitochondrial sequences (heteroplasmy, homoplasmy). We will be applying our methods to human nuclear DNA from our collaborators.

Using new haplotype resources to connect genotype and clinical outcome in pharmaco-genetic systems the HapMap was designed as a tool to detect high-frequency (common) phenotypic (e.g. disease-causing) alleles important drug metabolizing enzymes are relatively few in number, well studied, are at known genome locations, many associated phenotypes are well described many functional alleles are known, and of high frequency (common) multi-SNP alleles are highly predictive of metabolic phenotype clinical phenotype (adverse drug reaction) less predictable ideal candidate for applying haplotype resources

Multi-marker haplotypes as accurate markers for ADRs? functional allele (known metabolic polymorphism) genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes molecular phenotype (drug concentration measured in blood plasma) clinical endpoint (adverse drug reaction) computational prediction based on haplotype structure

Resources: specifics of enzyme-drug interactions; LD and haplotype structure in the HapMap reference samples, based on high-density SNP map; functional alleles; existing DME P genotyping chips

Evolutionary questions mutation age? mutations single-origin or recurrent? geographic origin of mutations? analysis based on complete local variation structure and haplotype background of functional mutations specifics of the selection process that led to specific functional alleles?

Proposed steps of analysis haplotypes vs. metabolic phenotype? complete polymorphic structure? ethnicity? additional functional SNPs? haplotypes vs. functional alleles? haplotype block? functional allele (genotype) metabolic phenotype clinical phenotype (ADR) haplotype haplotypes vs. ADR phenotype?