Published by Robert Brown. Modified over 9 years ago.
Genotype Calling
  Matt Schuerman
Biological Problem
  How do we know an individual's SNP values (genotype)?
  Each SNP can have two values (A/B)
  Each individual has two copies of the SNP
  Probes can be used to measure how well a particular SNP matches each value
  Need a reliable way to declare values based on probe measurements
[Figure: Example Probe Reads]
Computational Problem
  Given a set of data points, how can we partition them to maximize similarity within subsets?
  This is the clustering problem
  The similarity function is arbitrary, but often based on statistical or distance measures
  Several accepted algorithms exist
Standard Solutions
  Algorithms exist which call HapMap genotypes with >99% accuracy
    Not general: many hidden parameters are tuned to work on existing data
  Other algorithms require prior knowledge, such as how many clusters are present
    Again, not general
My Solution
  Wanted a more general method with few tuned parameters
    Mine has almost no "tuned" parameters
  Wanted a fast solution
    Many accepted clustering algorithms have exponential run times
    Mine is O(n^2), but closer to linear in practice
My Solution
  1. Convolve a Gaussian kernel over the data to find initial cluster candidates
  2. Iteratively re-calculate cluster parameters, then re-assign data points to clusters
  3. Assign calls to clusters based on the ratio of probe measurements
Phase 1: Initial clusters
  Bin data points to a grid
  Convolve the grid with a 5x5 Gaussian kernel
  All peaks are considered potential clusters
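Phase 1 can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the presenter's code: the slide only fixes the 5x5 kernel size, so the grid resolution (`bins`), the kernel width (`sigma`), the function names, and the definition of a "peak" as a cell that is the maximum of its 3x3 neighbourhood are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized size x size Gaussian kernel (sigma is an assumed value)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def smooth_grid(grid, kernel):
    """Zero-padded 'same' convolution; the kernel is symmetric, so
    correlation and convolution give the same result here."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(grid, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(grid.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + grid.shape[0], j:j + grid.shape[1]]
    return out

def find_peaks(points, bins=32, sigma=1.0):
    """Bin 2-D points to a grid, smooth, and return the centers of all
    cells that are local maxima (assumed peak definition)."""
    counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    smooth = smooth_grid(counts, gaussian_kernel(5, sigma))
    padded = np.pad(smooth, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for i in range(bins):
        for j in range(bins):
            window = padded[i:i + 3, j:j + 3]  # 3x3 neighbourhood of cell (i, j)
            if smooth[i, j] > 0 and smooth[i, j] == window.max():
                cx = 0.5 * (xedges[i] + xedges[i + 1])
                cy = 0.5 * (yedges[j] + yedges[j + 1])
                peaks.append((cx, cy))
    return np.array(peaks)
```

Noise can produce more peaks than true clusters; in this scheme that is acceptable, because Phase 2 merges candidates that turn out to belong together.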
Phase 2: Cluster Iteration
  While the clusters are changing:
    Calculate the mean position and covariance matrix of each cluster
    Merge clusters within 3 standard deviations of each other, using Mahalanobis distance
    Assign each data point to the cluster with the shortest Mahalanobis distance
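The iteration above might look something like the sketch below. Several details are assumptions, since the slides do not specify them: "within 3 standard deviations" is read as "the Mahalanobis distance from one cluster's mean to the other's is below 3 in at least one direction", clusters with fewer than three points are dropped, the first assignment uses plain Euclidean distance to the Phase 1 peaks, and all function names are hypothetical.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a cluster (mean, cov)."""
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def fit(members):
    """Mean and covariance of one cluster; a tiny ridge keeps cov invertible."""
    mean = members.mean(axis=0)
    cov = np.cov(members, rowvar=False) + 1e-9 * np.eye(members.shape[1])
    return mean, cov

def iterate_clusters(points, centers, merge_sd=3.0, max_iter=100):
    centers = np.asarray(centers, dtype=float)
    # Initial assignment: nearest candidate center (assumption).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    stats = []
    for _ in range(max_iter):
        groups = [points[labels == k] for k in np.unique(labels)]
        groups = [g for g in groups if len(g) >= 3]  # drop tiny clusters (assumption)
        # Merge any pair of clusters that lie within merge_sd of each other.
        merged = True
        while merged and len(groups) > 1:
            merged = False
            for a in range(len(groups)):
                for b in range(a + 1, len(groups)):
                    mean_a, cov_a = fit(groups[a])
                    mean_b, cov_b = fit(groups[b])
                    gap = min(mahalanobis(mean_b, mean_a, cov_a),
                              mahalanobis(mean_a, mean_b, cov_b))
                    if gap < merge_sd:
                        groups[a] = np.vstack([groups[a], groups[b]])
                        del groups[b]
                        merged = True
                        break
                if merged:
                    break
        # Re-assign every point to the cluster at the shortest Mahalanobis distance.
        stats = [fit(g) for g in groups]
        new_labels = np.array(
            [int(np.argmin([mahalanobis(p, m, c) for m, c in stats])) for p in points]
        )
        if np.array_equal(new_labels, labels):
            break  # no change, so done
        labels = new_labels
    return labels, stats
```

The returned `stats` list holds one `(mean, covariance)` pair per surviving cluster, and `labels[i]` indexes into it. The sketch assumes at least one cluster keeps three or more points.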
Phase 2: Cluster Iteration (figures)
  Iteration 1
  Iteration 2
  Iteration 3
  Iteration 4: no change, so done!
Phase 3: Assigning calls
  Based on the ratio of y to x at the center of each cluster
    If y/x ~ 1.3, then call as BB
    If y/x ~ 1, then call as AB
    If y/x ~ 0.7, then call as AA
  If 2 or 3 clusters are present, then find which is closest to each of these values
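One way to read the rule above is as a small matching problem: give each cluster a distinct genotype so that the total deviation of the cluster ratios from the target ratios is as small as possible. The sketch below takes that reading. The target ratios come from the slide, while the function name, the absolute-difference cost, and the requirement that clusters receive distinct calls are assumptions.

```python
from itertools import permutations

# Target y/x ratios from the slide.
TARGET_RATIOS = {"AA": 0.7, "AB": 1.0, "BB": 1.3}

def call_clusters(centers):
    """Return one genotype call per cluster center (x, y).

    Assumes x > 0 for every center and at most three clusters.
    """
    ratios = [y / x for x, y in centers]
    if len(ratios) > len(TARGET_RATIOS):
        raise ValueError("more clusters than possible genotypes")
    # Try every way of giving the clusters distinct genotypes and keep
    # the assignment with the smallest total |ratio - target|.
    best = min(
        permutations(TARGET_RATIOS, len(ratios)),
        key=lambda calls: sum(abs(r - TARGET_RATIOS[c]) for r, c in zip(ratios, calls)),
    )
    return list(best)

print(call_clusters([(10.0, 7.2), (10.0, 9.8), (10.0, 13.5)]))
print(call_clusters([(10.0, 10.9), (10.0, 13.0)]))
```

With only three genotypes, exhaustive search over assignments is cheap, so no dedicated matching algorithm is needed.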
Results
  Clustering works much better when done within populations
  The algorithm's performance is comparable across all populations
  Tested on 1111 SNPs in the Affy 100K XBA CEU dataset, the calls were found to be 96.47% accurate
Results: Example Assignment
  [Figure] Ignore the point at (10,10). One incorrect call is shown in black.
Results
  Sometimes assigning calls is problematic
    Sometimes clusters get improperly split
    Sometimes clusters get improperly merged
    Sometimes the grouping is right, but one of the clusters is miscalled
      Could probably be fixed by setting the ratios more precisely
[Figure: Results: Sample Split Error]
[Figure: Results: Sample Merge Error]
Conclusions
  Accuracy is close to that of the best published algorithms
  Faster run time
  Simpler approach with less tuning
  Need to run on more data