Genotype Calling Matt Schuerman. Biological Problem How do we know an individual’s SNP values (genotype)? Each SNP can have two values (A/B) Each individual.


1 Genotype Calling Matt Schuerman

2 Biological Problem: How do we know an individual’s SNP values (genotype)? Each SNP can have two values (A/B). Each individual has two copies of the SNP. Probes can be used to measure how well a particular SNP matches each value. We need a reliable way to declare values based on probe measurements.

3 Example Probe Reads

4 Computational Problem: Given a set of data points, how can we partition them to maximize similarity within subsets? This is the clustering problem. The similarity function is arbitrary, but is often based on statistical or distance measures. Several accepted algorithms exist.

5 Standard Solutions: Algorithms exist that call HapMap genotypes with >99% accuracy. They are not general: many hidden parameters are tuned to work on existing data. Other algorithms require prior knowledge, such as how many clusters are present. Again, not general.

6 My Solution: Wanted a more general method with few tuned parameters; mine has almost no “tuned” parameters. Wanted a fast solution; many accepted clustering algorithms have exponential run times. Mine is O(n^2), but closer to linear in practice.

7 My Solution: 1. Convolve a Gaussian kernel over the data to find initial cluster candidates. 2. Iteratively recalculate cluster parameters and then reassign data points to clusters. 3. Assign calls to clusters based on the ratio of probe measurements.

8 Phase 1: Initial clusters. Bin data points to a grid. Convolve with a 5x5 Gaussian kernel. All peaks are considered potential clusters.
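The Phase 1 steps can be sketched in plain Python as follows. This is a minimal illustration, not the presenter's code: the grid size, the kernel width `sigma`, zero padding at the edges, and the "strictly greater than all 8 neighbours" peak rule are all assumptions, and the function names are hypothetical.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized size x size Gaussian kernel (sigma is an assumption)."""
    half = size // 2
    k = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
          for dx in range(-half, half + 1)]
         for dy in range(-half, half + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def bin_points(points, n_bins, lo, hi):
    """Count points per cell of an n_bins x n_bins grid covering [lo, hi)."""
    grid = [[0.0] * n_bins for _ in range(n_bins)]
    scale = n_bins / (hi - lo)
    for x, y in points:
        i = min(n_bins - 1, max(0, int((y - lo) * scale)))  # row index from y
        j = min(n_bins - 1, max(0, int((x - lo) * scale)))  # column index from x
        grid[i][j] += 1.0
    return grid

def convolve(grid, kernel):
    """Same-size 2D convolution with zero padding (the kernel is symmetric,
    so convolution and correlation give the same result)."""
    n, m, half = len(grid), len(grid[0]), len(kernel) // 2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    a, b = i + di, j + dj
                    if 0 <= a < n and 0 <= b < m:
                        acc += grid[a][b] * kernel[di + half][dj + half]
            out[i][j] = acc
    return out

def find_peaks(smooth):
    """Return (row, col) of cells that are positive and strictly greater
    than all of their (up to 8) neighbours."""
    n, m = len(smooth), len(smooth[0])
    peaks = []
    for i in range(n):
        for j in range(m):
            v = smooth[i][j]
            if v <= 0.0:
                continue
            neighbours = [smooth[a][b]
                          for a in range(max(0, i - 1), min(n, i + 2))
                          for b in range(max(0, j - 1), min(m, j + 2))
                          if (a, b) != (i, j)]
            if all(v > w for w in neighbours):
                peaks.append((i, j))
    return peaks

# Two tight clumps of probe readings on a 10 x 10 grid over [0, 10).
points = [(2.5, 2.5)] * 5 + [(7.5, 7.5)] * 5
smooth = convolve(bin_points(points, 10, 0.0, 10.0), gaussian_kernel())
print(find_peaks(smooth))
```

Each peak cell can then be mapped back to data coordinates and used as a candidate cluster center for the next phase.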

9 Phase 2: Cluster Iteration. While the clusters are changing: calculate the mean position and covariance matrix of each cluster; merge clusters within 3 standard deviations of each other, using Mahalanobis distance; assign each data point to the cluster with the shortest Mahalanobis distance.
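A rough sketch of this loop in plain Python is given below, for 2D points only. Several details are assumptions rather than anything stated on the slide: points start out assigned to the nearest Phase 1 seed by Euclidean distance, two clusters merge when either mean lies within 3 Mahalanobis units of the other, a small `eps` is added to the variances to keep the covariance invertible, and `max_iter` caps the loop. All function names are hypothetical.

```python
import math

def mean_cov(points):
    """Mean and sample covariance (sxx, sxy, syy) of a list of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    d = max(n - 1, 1)
    sxx = sum((p[0] - mx) ** 2 for p in points) / d
    syy = sum((p[1] - my) ** 2 for p in points) / d
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / d
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(p, mean, cov, eps=1e-6):
    """Mahalanobis distance of p from a 2D Gaussian (closed-form 2x2 inverse)."""
    sxx, sxy, syy = cov
    sxx, syy = sxx + eps, syy + eps  # assumption: regularize degenerate clusters
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))

def merge_close(groups, threshold):
    """Merge any two groups whose means are within `threshold` of each other,
    measured under either group's covariance, until no such pair remains."""
    groups = [list(g) for g in groups]
    merged = True
    while merged:
        merged = False
        stats = [mean_cov(g) for g in groups]
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = min(mahalanobis(stats[b][0], *stats[a]),
                        mahalanobis(stats[a][0], *stats[b]))
                if d < threshold:
                    groups[a].extend(groups.pop(b))
                    merged = True
                    break
            if merged:
                break
    return groups

def iterate_clusters(points, seeds, threshold=3.0, max_iter=100):
    """Alternate merge and reassignment steps until the labels stop changing."""
    labels = [min(range(len(seeds)),
                  key=lambda k: math.hypot(p[0] - seeds[k][0], p[1] - seeds[k][1]))
              for p in points]
    stats = []
    for _ in range(max_iter):
        groups = {}
        for p, lab in zip(points, labels):
            groups.setdefault(lab, []).append(p)
        stats = [mean_cov(g) for g in merge_close(groups.values(), threshold)]
        new_labels = [min(range(len(stats)),
                          key=lambda k: mahalanobis(p, *stats[k]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, stats

# One blob deliberately seeded twice, plus a distant second blob.
left = [(-2, 0), (-0.5, 0), (-0.2, 1), (-0.2, -1), (-0.2, 0.5), (-0.2, -0.5)]
right = [(-x, y) for x, y in left]
far = [(19, 20), (21, 20), (20, 19), (20, 21), (20, 20)]
labels, stats = iterate_clusters(left + right + far,
                                 seeds=[(-0.5, 0), (0.5, 0), (20, 20)])
print(len(stats), labels)
```

In this toy input the two seeds placed inside the same blob produce overlapping clusters that are merged in the first pass, while the distant blob stays separate.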

10 Phase 2: Cluster Iteration Iteration 1 …

11 Phase 2: Cluster Iteration Iteration 2 …

12 Phase 2: Cluster Iteration Iteration 3 …

13 Phase 2: Cluster Iteration Iteration 4, no change so done!

14 Phase 3: Assigning calls. Calls are based on the ratio of y to x at the center of each cluster. If y/x ~ 1.3, call BB. If y/x ~ 1, call AB. If y/x ~ 0.7, call AA. If 2 or 3 clusters are present, find which cluster is closest to each of these values.
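One way to read this rule is sketched below. The reference ratios (0.7, 1.0, 1.3) come from the slide; everything else is an assumption: clusters are sorted by y/x, each cluster receives a distinct genotype, and the order-preserving assignment with the smallest total deviation from the reference ratios wins. The function name is hypothetical, and cluster centers are assumed to have x > 0.

```python
from itertools import combinations

# Reference y/x ratios from the slide, sorted from lowest to highest.
REFERENCE_RATIOS = [("AA", 0.7), ("AB", 1.0), ("BB", 1.3)]

def call_clusters(centers):
    """Map each cluster center (x, y) to a genotype call using its y/x ratio.

    Assumption: with k clusters, pick the k distinct genotypes (kept in ratio
    order) whose reference ratios are closest in total to the observed ratios.
    """
    if not 1 <= len(centers) <= 3:
        raise ValueError("expected 1 to 3 clusters")
    # Indices of the clusters, sorted by observed ratio.
    order = sorted(range(len(centers)),
                   key=lambda i: centers[i][1] / centers[i][0])
    ratios = [centers[i][1] / centers[i][0] for i in order]
    # combinations() preserves input order, so each combo is ratio-sorted too.
    best = min(combinations(REFERENCE_RATIOS, len(centers)),
               key=lambda combo: sum(abs(r - ref)
                                     for r, (_, ref) in zip(ratios, combo)))
    calls = [None] * len(centers)
    for i, (name, _) in zip(order, best):
        calls[i] = name
    return calls

print(call_clusters([(10, 13.2), (10, 9.8), (10, 7.1)]))  # ['BB', 'AB', 'AA']
print(call_clusters([(10, 10.4), (10, 12.5)]))            # ['AB', 'BB']
print(call_clusters([(10, 7.5)]))                         # ['AA']
```

Forcing distinct calls avoids labeling two clusters with the same genotype, at the cost of a wrong call when a cluster has been improperly split.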

15 Results: Clustering works much better when done within populations. The algorithm’s performance is comparable across all populations. Testing on 1111 SNPs in the Affy 100K XBA CEU dataset found the calls to be 96.47% accurate.

16 Results: Example Assignment Ignore point at (10,10). One incorrect call in black.

17 Results: Sometimes assigning calls is problematic. Sometimes clusters get improperly split. Sometimes clusters get improperly merged. Sometimes the grouping is right, but one of the clusters is miscalled; this could probably be fixed by setting the ratios more precisely.

18 Results: Sample Split Error

19 Results: Sample Merge Error

20 Conclusions: Accuracy is close to that of the best published algorithms. Faster run time. Simpler approach with less tuning. Need to run on more data.

