Genotype Calling Matt Schuerman

Biological Problem How do we know an individual's SNP values (genotype)? Each SNP can take one of two values (A/B), and each individual carries two copies of the SNP. Probes can be used to measure how well each SNP value matches the sample, so we need a reliable way to declare genotype values based on the probe measurements.
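To make the setup concrete, here is a minimal Python sketch of the three possible genotypes of a biallelic SNP and the kind of paired probe intensities each tends to produce; the probe names and all numbers are invented for illustration and are not from the talk.

```python
# Hypothetical probe-intensity pairs for the three genotypes of one SNP.
# The (A_probe, B_probe) values below are illustrative only.
example_intensities = {
    "AA": (2.1, 0.3),  # two A alleles: strong A-probe signal, weak B-probe signal
    "AB": (1.2, 1.1),  # one of each allele: both probes respond
    "BB": (0.2, 2.0),  # two B alleles: weak A-probe signal, strong B-probe signal
}

for genotype, (a_signal, b_signal) in example_intensities.items():
    print(f"{genotype}: A probe = {a_signal}, B probe = {b_signal}")
```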

Example Probe Reads

Computational Problem Given a set of data points, how can we partition them to maximize similarity within each subset? This is the clustering problem. The similarity function is arbitrary, but it is often based on statistical or distance measures, and several accepted algorithms exist.
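As one concrete reading of "maximize similarity within subsets", the sketch below scores a partition by the total squared distance of each point to its cluster mean; this is a common toy objective, not necessarily the criterion used later in the talk.

```python
import numpy as np

def within_cluster_scatter(points, labels):
    """Score a clustering by the summed squared distance of each point to the
    mean of its assigned cluster (smaller = tighter, more similar clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += float(np.sum((members - members.mean(axis=0)) ** 2))
    return total

# Tiny example with two obvious groups of 2-D points.
pts = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
print(within_cluster_scatter(pts, [0, 0, 1, 1]))  # small: a good partition
print(within_cluster_scatter(pts, [0, 1, 0, 1]))  # large: a poor partition
```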

Standard Solutions Algorithms exist that call HapMap genotypes with >99% accuracy, but they are not general: many hidden parameters are tuned to work on existing data. Other algorithms require prior knowledge, such as how many clusters are present. Again, not general.

My Solution I wanted a more general method with few tuned parameters; mine has almost no "tuned" parameters. I also wanted a fast solution: many accepted clustering algorithms have exponential run times, while mine is O(n²) and closer to linear in practice.

My Solution
1. Convolve a Gaussian kernel over the data to find initial cluster candidates.
2. Iteratively re-calculate cluster parameters and then re-assign data points to clusters.
3. Assign calls to clusters based on the ratio of the probe measurements.
A sketch of the per-cluster state shared by these phases follows below.
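All three phases operate on a simple per-cluster summary: a mean position and a covariance matrix in the 2-D probe-intensity space. Below is a minimal sketch of that state with a Mahalanobis-distance helper used in Phase 2; the class and method names are mine, not from the presentation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Cluster:
    """Per-cluster state used by the Phase 2 and Phase 3 sketches:
    a mean (x, y) position and a 2x2 covariance matrix."""
    mean: np.ndarray  # shape (2,)
    cov: np.ndarray   # shape (2, 2)

    def mahalanobis(self, point) -> float:
        """Mahalanobis distance from `point` to this cluster's centre."""
        diff = np.asarray(point, dtype=float) - self.mean
        return float(np.sqrt(diff @ np.linalg.inv(self.cov) @ diff))

# Example: a unit-covariance cluster centred at the origin.
c = Cluster(mean=np.zeros(2), cov=np.eye(2))
print(c.mahalanobis((3.0, 4.0)))  # -> 5.0 (equals Euclidean distance here)
```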

Phase 1: Initial clusters Bin the data points onto a grid, convolve with a 5x5 Gaussian kernel, and treat all peaks of the smoothed grid as potential clusters.
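One possible reading of Phase 1 in Python, using numpy and scipy.ndimage; the bin count, the kernel sigma, and the peak test are assumptions I made for this sketch, since the slides only specify the 5x5 Gaussian kernel.

```python
import numpy as np
from scipy import ndimage

def initial_clusters(points, bins=50):
    """Phase 1 sketch: bin (x, y) probe intensities onto a grid, smooth the
    histogram with a 5x5 Gaussian kernel, and treat every local maximum of
    the smoothed grid as a candidate cluster centre."""
    points = np.asarray(points, dtype=float)
    hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)

    # Build a 5x5 Gaussian kernel (sigma = 1 chosen arbitrarily for this sketch).
    ax = np.arange(5) - 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / 2.0)
    kernel /= kernel.sum()

    smoothed = ndimage.convolve(hist, kernel, mode="constant")

    # A bin is a peak if it equals the maximum of its 3x3 neighbourhood and is
    # populated at all; a real implementation would likely threshold harder.
    peaks = (smoothed == ndimage.maximum_filter(smoothed, size=3)) & (smoothed > 0)
    ix, iy = np.nonzero(peaks)

    # Convert peak bin indices back to (x, y) coordinates (bin centres).
    cx = (xedges[ix] + xedges[ix + 1]) / 2.0
    cy = (yedges[iy] + yedges[iy + 1]) / 2.0
    return np.column_stack([cx, cy])
```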

Phase 2: Cluster Iteration While the clusters are still changing: calculate the mean position and covariance matrix of each cluster; merge clusters that lie within 3 standard deviations of each other, using the Mahalanobis distance; and assign each data point to the cluster with the shortest Mahalanobis distance.
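A sketch of that loop, under stated assumptions: the initial covariances, the small-cluster rule, and the covariance regularisation term are my additions; the assign/re-estimate/merge structure and the 3-standard-deviation merge rule come from the slide.

```python
import numpy as np

def refine_clusters(points, centers, max_iter=50):
    """Phase 2 sketch: repeatedly (a) assign each point to the cluster with the
    smallest Mahalanobis distance, (b) re-estimate each cluster's mean and
    covariance, and (c) merge clusters whose centres sit within 3 standard
    deviations (Mahalanobis distance <= 3) of another cluster, stopping when
    the assignments no longer change."""
    points = np.asarray(points, dtype=float)
    centers = [np.asarray(c, dtype=float) for c in centers]
    covs = [np.eye(2) for _ in centers]       # assumed starting covariances
    labels = np.full(len(points), -1)

    def mahal(p, mean, cov):
        d = p - mean
        return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

    for _ in range(max_iter):
        # (a) Assign each point to the nearest cluster by Mahalanobis distance.
        new_labels = np.array([
            min(range(len(centers)), key=lambda k: mahal(p, centers[k], covs[k]))
            for p in points
        ])
        if np.array_equal(new_labels, labels):
            break                              # no change: converged
        labels = new_labels

        # (b) Re-estimate mean and covariance of every sufficiently large cluster.
        means, new_covs = [], []
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) < 3:
                continue                       # drop near-empty clusters (assumption)
            means.append(members.mean(axis=0))
            new_covs.append(np.cov(members.T) + 1e-6 * np.eye(2))
        if not means:
            break                              # degenerate case: keep last state

        # (c) Merge clusters whose centre lies within 3 standard deviations of
        # an already-kept cluster.
        keep = []
        for j in range(len(means)):
            if all(mahal(means[j], means[i], new_covs[i]) > 3.0 for i in keep):
                keep.append(j)
        centers = [means[j] for j in keep]
        covs = [new_covs[j] for j in keep]

    return np.array(centers), labels
```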

Phase 2: Cluster Iteration Iteration 1 …

Phase 2: Cluster Iteration Iteration 2 …

Phase 2: Cluster Iteration Iteration 3 …

Phase 2: Cluster Iteration Iteration 4: no change, so we are done!

Phase 3: Assigning calls Calls are based on the ratio of y to x at the center of each cluster: if y/x ≈ 1.3, call BB; if y/x ≈ 1, call AB; if y/x ≈ 0.7, call AA. When 2 or 3 clusters are present, find which of these target ratios each one is closest to.
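A minimal sketch of that assignment rule; the target ratios come from the slide, while the absolute-difference "closest ratio" rule and the example cluster centres are assumptions for illustration.

```python
def assign_calls(centers):
    """Phase 3 sketch: label each cluster AA, AB, or BB according to which of
    the target y/x ratios from the slide (0.7, 1.0, 1.3) its centre is
    closest to."""
    targets = {"AA": 0.7, "AB": 1.0, "BB": 1.3}
    calls = []
    for cx, cy in centers:
        ratio = cy / cx
        calls.append(min(targets, key=lambda g: abs(targets[g] - ratio)))
    return calls

# Hypothetical cluster centres in (x, y) probe-intensity space:
print(assign_calls([(10.0, 7.1), (8.0, 8.2), (6.0, 7.9)]))  # -> ['AA', 'AB', 'BB']
```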

Results Clustering works much better when done within populations, and the algorithm's performance is comparable across all populations. Testing on 1111 SNPs from the Affy 100K XBA CEU dataset gave 96.47% accuracy.

Results: Example Assignment Ignore the point at (10,10); the one incorrect call is shown in black.

Results Assigning calls is sometimes problematic: clusters get improperly split, clusters get improperly merged, or the grouping is right but one of the clusters is miscalled. This could probably be fixed by setting the target ratios more precisely.

Results: Sample Split Error

Results: Sample Merge Error

Conclusions Accuracy is close to that of the best published algorithms, with a faster run time and a simpler approach that requires less tuning. The method still needs to be run on more data.