Informative SNP Selection Based on Multiple Linear Regression

Presentation transcript:

Informative SNP Selection Based on Multiple Linear Regression Jingwu He Alex Zelikovsky Thank you very much, Dr. Barneva, for the kind introduction. My name is Jingwu He. Today, it is my great pleasure to give a talk at SUNY Fredonia. My topic is ****.

Outline: SNPs, haplotypes, and genotypes; Tagging problem formulation; Tagging based on multiple linear regression; Experimental results. Today, I will first give an introduction to bioinformatics. Then I will describe the biological background of my research and present how I apply algorithms to solve three problems in computational genetic epidemiology: phasing, tagging, and disease susceptibility. Finally, I will conclude my talk with my future plans.

Human Genome Length of Human Genome (DNA) ≈ 3 billion base pairs: A, C, G, or T. Our DNA is similar: 99.9% of DNA is common. The length of the human genome sequence is about 3 billion base pairs (A, C, T, G). However, our DNA is very similar. Let’s take the DNA sequences from NBA star Michael Jordan and myself and align them. How much of the DNA sequence do Mike and I have in common? Is it 50%, 80%, 90%, or something else? Students, please guess a number. (Students begin to guess.) Congratulations! You will have a promising career in bioinformatics! The answer is 99.9%.

SNPs Genome difference between any two people ≈ 0.1% of genome. These differences are Single Nucleotide Polymorphisms (SNPs). Total number of SNPs in human genome ≈ 10^7. Actually, the genome difference between any two people is about 0.1%. In total, about 10 million SNPs exist in the human population. [Figure: four aligned sequences; the SNPs are the positions where the rows differ.]
A A C A C G C C A . . . . T T C G G G G T C . . . . A G T C G A C C G . . . .
A A C A C G C C A . . . . T T C G A G G T C . . . . A G T C A A C C G . . . .
A A C A T G C C A . . . . T T C G G G G T C . . . . A G T C A A C C G . . . .
A A C A C G C C A . . . . T T C G G G G T C . . . . A G T C G A C C G . . . .

Haplotypes and Genotypes Human = diploid organism: two different “copies” of each chromosome, one from mother, one from father. Since individuals differ in SNPs, we keep only SNPs. Haplotype: SNPs in a single “copy” of a chromosome. Genotype: a pair of haplotypes. [Figure: the two copies of a chromosome from person A and the two copies from person B, labeled Haplotype 1 from A, Haplotype 2 from A, Haplotype 3 from B, Haplotype 4 from B, Genotype 1 from A, and Genotype 2 from B.] In human beings, everybody has two different copies of each chromosome except the sex chromosomes, one from the mother and one from the father. This figure shows person A's and person B's chromosome 10. Notice that each person has two copies of chromosome 10. Since individuals differ in SNPs, we disregard all other base pairs and keep only the SNPs. We describe the SNP sequences by using haplotypes and genotypes. A haplotype describes a single copy of the SNP sequence in a chromosome. A pair of haplotypes makes a genotype. In this figure, there are 4 haplotypes: 2 from person A and another 2 from B. There are two genotypes: one from A and another from B.

Cause of Variation: Mutations and Recombinations Mutation: one nucleotide is replaced with another (G -> A). Recombination: one chromatid recombines with another.

Encoding SNPs are generally bi-allelic: only two alleles in a single SNP, wild type and mutation. 0 stands for wild type, 1 stands for mutation. [Figure: haplotype pairs with heterozygous and homozygous SNP sites marked.] As computer scientists, what are our favorite sequences? Binary sequences! We are so lucky. SNPs are generally bi-allelic, with only two alleles in a single SNP: wild type and mutation. Looking at the figure at the top, we can see that in these 4 haplotypes A is the wild type and G is the mutation. When these mutations are seen at a rate greater than 1% in the population, biologists consider these variations to be SNPs. Computer scientists use 0 for wild type and 1 for mutation. What about genotypes? The rule is that if the two haplotypes’ SNPs are homozygous (either 0,0 or 1,1), the genotype’s SNP is 0 or 1, respectively. But if the two haplotypes’ SNPs are heterozygous, the genotype’s SNP is 2. From this rule, we can see that haplotype sequences are (0,1) sequences and genotype sequences are (0,1,2) sequences. After changing notation, haplotype data becomes a matrix with entries 0 and 1, and genotype data becomes a matrix with entries 0, 1, and 2. This raises an interesting problem: from a given genotype sequence, how can we obtain the correct two haplotype sequences that describe the genotype?
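
The encoding rule can be illustrated with a small Python sketch. The function names are hypothetical and not from the talk. The first function applies the 0/1/2 rule to a pair of haplotypes; the second enumerates the haplotype pairs compatible with a genotype, which shows why recovering the haplotypes is ambiguous as soon as there are two or more heterozygous sites.

```python
from itertools import product


def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into one 0/1/2 genotype."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    genotype = []
    for a, b in zip(h1, h2):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("haplotype alleles must be 0 or 1")
        # Homozygous site keeps the shared allele; heterozygous site becomes 2.
        genotype.append(a if a == b else 2)
    return genotype


def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs that produce this genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het_sites, bits):
            h1[i], h2[i] = b, 1 - b  # the two copies differ at a heterozygous site
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


g = genotype_from_haplotypes([0, 1, 1, 0], [0, 0, 1, 1])
print(g)                                   # genotype with two heterozygous sites
print(len(compatible_haplotype_pairs(g)))  # number of candidate phasings
```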

Outline: SNPs, haplotypes, and genotypes; Tagging problem formulation; Tagging based on multiple linear regression; Experimental results. Today, I will first give an introduction to bioinformatics. Then I will describe the biological background of my research and present how I apply algorithms to solve three problems in computational genetic epidemiology: phasing, tagging, and disease susceptibility. Finally, I will conclude my talk with my future plans.

Tagging Motivation Decrease the cost of SNP genotyping and data analysis. Many SNPs are linked (strongly correlated). Genotype only informative SNPs (tag SNPs); other SNPs are inferred from tag SNPs. Perform data analysis only on tag SNPs. Cost-saving ratio = m/k. Use only tag SNPs to infer non-tag SNPs. The values of some SNPs are strongly correlated. We say that these SNPs are linked. Then why do we need to spend a lot of money to genotype all SNPs? Let’s save money by genotyping only informative SNPs and then inferring the values of the other SNPs from them. We’ll call these informative SNPs “tag” SNPs. If we have k tag SNPs out of m total SNPs, the cost-saving ratio is m/k.

Tagging Problem Problem formulation: Given the full pattern of all SNPs in a sample, find the minimum number of tag SNPs that will allow the reconstruction of the complete haplotype for each individual. Computation methods: Step 1: Find tags (SNP positions) in the sample. Step 2: Reconstruct the complete haplotype. [Figure: pipeline with the labels "Find tags (0, 1, 2)", "Tag Selection Algorithm", and "SNP Prediction Algorithm".]

Tagging Methods HapBlock (K. Zhang, M.S. Waterman, et al.): greedy algorithm for tag selection, majority voting for prediction. V. Bafna, B.V. Halldorson et al.: graph algorithm for tag selection. STAMPA (E. Halperin and R. Shamir): dynamic programming for tag selection. ….. Tagging based on Multiple Linear Regression: greedy selection, multiple linear regression for prediction.

SNP Prediction Algorithm Given the values of k tags of an unknown individual x and the known full sample S, a SNP prediction algorithm Ak predicts the value of a single non-tag SNP in x, which is x(k+1). Each non-tag SNP is treated separately.
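
One straightforward reading of a prediction algorithm Ak based on multiple linear regression is sketched below in Python with NumPy. This is an illustration under stated assumptions, not necessarily the exact procedure used in the talk: the function name is hypothetical, the model includes an intercept, and the genotype codes 0/1/2 are recoded to -1/+1/0 so that the heterozygous code lies between the two homozygous ones.

```python
import numpy as np

# Assumed recoding (not stated in the talk): code 0 -> -1, code 1 -> +1, code 2 -> 0.
RECODE = np.array([-1.0, 1.0, 0.0])


def predict_snp_mlr(sample_tags, sample_target, x_tags):
    """Predict one non-tag SNP of individual x from its k tag values.

    sample_tags   : (n, k) genotype codes of the tags in the sample S
    sample_target : (n,)   genotype codes of the non-tag SNP in S
    x_tags        : (k,)   genotype codes of the tags in individual x
    """
    T = RECODE[np.asarray(sample_tags, dtype=int)]
    s = RECODE[np.asarray(sample_target, dtype=int)]
    x = RECODE[np.asarray(x_tags, dtype=int)]

    # Least-squares fit of the non-tag SNP on the tags (with an intercept).
    A = np.column_stack([np.ones(len(T)), T])
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)

    # Evaluate the fitted model on x and snap to the nearest valid code.
    estimate = coef[0] + x @ coef[1:]
    return int(np.argmin(np.abs(RECODE - estimate)))


sample_tags = [[0, 1], [1, 0], [2, 2], [0, 0], [1, 1], [2, 1]]
sample_target = [0, 1, 2, 0, 1, 2]  # toy data: identical to the first tag
print(predict_snp_mlr(sample_tags, sample_target, [2, 0]))
```

Because each non-tag SNP is treated separately, the function would be called once per non-tag SNP with the same tag matrix.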

Tag Selection based on Prediction Choosing the optimal k tags is NP-hard; there are m choose k candidate sets (m = number of total SNPs, k = number of tags). Use the Stepwise (greedy) Tag Selection Algorithm (STA) to reduce the cost and time. STA starts with the best tag t0, i.e., the tag that minimizes the error when all other SNPs are predicted from it with Ak. Then STA finds the tag t1 that is the best extension of {t0}, and continues adding the best tags until the tag set reaches the given size k.
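
The greedy loop of STA can be sketched as follows in Python with NumPy. The function names are hypothetical, and the error measure is an assumption made for the sketch: the in-sample sum of squared residuals when every non-tag column is regressed on the current tag columns. Any other error function with the same signature, for example one built on a prediction algorithm Ak and cross-validation, could be passed in instead.

```python
import numpy as np


def regression_error(G, tags):
    """Assumed error measure: squared residuals of all non-tag columns
    after least-squares regression on the tag columns (with intercept)."""
    G = np.asarray(G, dtype=float)
    rest = [j for j in range(G.shape[1]) if j not in tags]
    if not rest:
        return 0.0
    A = np.column_stack([np.ones(G.shape[0]), G[:, tags]])
    coef, *_ = np.linalg.lstsq(A, G[:, rest], rcond=None)
    return float(np.sum((G[:, rest] - A @ coef) ** 2))


def stepwise_tag_selection(G, k, error_fn=regression_error):
    """Greedily grow a tag set of size k, one best extension at a time."""
    m = np.asarray(G).shape[1]
    if not 0 <= k <= m:
        raise ValueError("k must be between 0 and the number of SNPs")
    tags = []
    while len(tags) < k:
        candidates = [j for j in range(m) if j not in tags]
        # Best extension: the SNP whose addition gives the smallest error.
        best = min(candidates, key=lambda j: error_fn(G, tags + [j]))
        tags.append(best)
    return tags


# Toy sample: SNPs 1 and 2 are determined by SNP 0, and SNP 4 copies SNP 3.
c0 = np.array([0, 1, 0, 1, 1, 0])
c3 = np.array([0, 0, 1, 1, 0, 1])
G = np.column_stack([c0, c0, 1 - c0, c3, c3])
print(stepwise_tag_selection(G, 2))
```

Each step evaluates at most m candidates, so the loop makes on the order of m times k error evaluations instead of examining all m choose k subsets.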

Projection Method for SNP Prediction Choose the resolution minimizing its distance d to the span of the tag space, span(T). [Figure: the possible resolutions s0, s1, s2 of the non-tag SNP, their projections onto span(T), which is spanned by tag t1 and tag t2, and the corresponding distances d0, d1, d2.] How do we predict a SNP if we have a limited number of tags? We choose the resolution that has the shortest distance to the span of the tag space.
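
A minimal Python/NumPy sketch of the projection idea follows. The details are assumptions made for illustration: the function name is hypothetical, the tag columns are extended with the tag values of x, each candidate value of the non-tag SNP in x gives one resolution of the extended non-tag column, and the codes 0/1/2 are recoded to -1/+1/0 before computing distances.

```python
import numpy as np

# Assumed recoding (not stated in the talk): code 0 -> -1, code 1 -> +1, code 2 -> 0.
RECODE = np.array([-1.0, 1.0, 0.0])


def predict_by_projection(sample_tags, sample_target, x_tags):
    """Pick the value of the non-tag SNP in x whose resolution is closest to span(T)."""
    # Tag columns over the n sample individuals plus x: shape (n + 1, k).
    T = RECODE[np.vstack([np.asarray(sample_tags, dtype=int),
                          np.asarray(x_tags, dtype=int)])]
    s_known = RECODE[np.asarray(sample_target, dtype=int)]

    best_value, best_dist = None, np.inf
    for value in (0, 1, 2):
        # Resolution s_value: known sample values plus the candidate value for x.
        s = np.append(s_known, RECODE[value])
        coef, *_ = np.linalg.lstsq(T, s, rcond=None)
        dist = np.linalg.norm(s - T @ coef)  # distance from s to span(T)
        if dist < best_dist:
            best_value, best_dist = value, dist
    return best_value


sample_tags = [[0, 1], [1, 0], [2, 2], [0, 0], [1, 1], [2, 1]]
sample_target = [0, 1, 2, 0, 1, 2]  # toy data: identical to the first tag
print(predict_by_projection(sample_tags, sample_target, [2, 0]))
```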

Data Sets Daly et al.: 616 kilobase region of human chromosome 5q31, genotyping 103 SNPs for 129 trios. Seven ENCODE regions from HapMap: regions ENr123 and ENm010 from 2 populations, 45 Han Chinese singles (HCB) and 44 Japanese singles (JPT); three regions (ENm013, ENr112, ENr113) from 30 CEPH family trios obtained from HapMap. STAMPA (E. Halperin and R. Shamir): two gene regions, STEAP and TRPM8, genotyping 23 and 102 SNPs for 30 trios.

Experimental Results Directly to genotype data

Multivariate Linear Regression Tagging Genotype tagging: uses fewer tags (e.g., up to half as many tags to reach 90% prediction accuracy) than STAMPA (E. Halperin and R. Shamir, ISMB 2005 and Bioinformatics). Statistical tagging: a linear combination of tags statistically covers non-tag SNPs, whereas traditional methods use a single tag to cover non-tag SNPs; uses on average 30% fewer tags than ldSelect (C.S. Carlson et al. 2004) for statistically covering all SNPs.

Thank you Any Questions? Once again, thank you all for listening to my talk. Are there any questions?