Genotype Calling
Matt Schuerman

Biological Problem
How do we know an individual’s SNP values (genotype)? Each SNP can have two values (A/B). Each individual has two copies of the SNP. Probes can be used to measure how well a particular SNP matches each value. We need a reliable way to declare values based on probe measurements.

Example Probe Reads

Computational Problem
Given a set of data points, how can we partition them to maximize similarity within subsets? This is the clustering problem. The similarity function is arbitrary, but is often based on statistical or distance measures. Several accepted algorithms exist.

Standard Solutions
Algorithms exist which call HapMap genotypes with >99% accuracy. They are not general: many hidden parameters are tuned to work on existing data. Other algorithms require prior knowledge, such as how many clusters are present. Again, not general.

My Solution
Wanted a more general method with few tuned parameters. Mine has almost no “tuned” parameters. Wanted a fast solution. Many accepted clustering algorithms have exponential run times. Mine is O(n^2), but closer to linear in practice.

My Solution
1. Convolve a Gaussian kernel over the data to find initial cluster candidates.
2. Iteratively re-calculate cluster parameters and then re-assign data points to clusters.
3. Assign calls to clusters based on the ratio of probe measurements.

Phase 1: Initial clusters
Bin data points to a grid. Convolve with a 5x5 Gaussian kernel. All peaks are considered potential clusters.
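The Phase 1 steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the presenter's code: the grid size (`bins=32`), the kernel width (`sigma=1.0` grid cell), the zero padding at the grid edges, and the peak rule (a non-empty cell at least as high as its 8 neighbours) are all choices made here, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_5x5(sigma=1.0):
    """Normalized 5x5 Gaussian kernel (sigma, in grid cells, is an assumed value)."""
    ax = np.arange(-2, 3)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

def initial_cluster_candidates(points, bins=32, sigma=1.0):
    """Bin 2-D points to a grid, smooth with a 5x5 Gaussian, return peak centers."""
    # Step 1: bin the (x, y) probe measurements to a bins x bins grid.
    counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)

    # Step 2: convolve with the 5x5 kernel (zero padding at the edges).
    # The kernel is symmetric, so this shifted sum equals a true convolution.
    kernel = gaussian_kernel_5x5(sigma)
    padded = np.pad(counts, 2, mode="constant")
    smooth = np.zeros_like(counts)
    for i in range(5):
        for j in range(5):
            smooth += kernel[i, j] * padded[i:i + bins, j:j + bins]

    # Step 3: a peak is a non-empty cell that is >= all 8 of its neighbours.
    # (Exact ties between adjacent cells would yield duplicate candidates.)
    border = np.pad(smooth, 1, mode="constant", constant_values=-np.inf)
    is_peak = smooth > 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di or dj:
                neighbour = border[1 + di:1 + di + bins, 1 + dj:1 + dj + bins]
                is_peak &= smooth >= neighbour

    # Convert peak cell indices back to data coordinates (cell midpoints).
    ii, jj = np.nonzero(is_peak)
    x_mid = 0.5 * (xedges[ii] + xedges[ii + 1])
    y_mid = 0.5 * (yedges[jj] + yedges[jj + 1])
    return np.column_stack([x_mid, y_mid])
```

Isolated outliers more than two cells from any other point also show up as peaks under this rule, which is consistent with treating every peak only as a potential cluster at this stage.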

Phase 2: Cluster Iteration
While the clusters are changing: calculate the mean position and covariance matrix of each cluster; merge clusters within 3 standard deviations of each other, using Mahalanobis distance; assign each data point to the cluster with the shortest Mahalanobis distance.
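A rough sketch of this loop follows, using the usual Mahalanobis distance d(x) = sqrt((x - mean)^T cov^-1 (x - mean)). Several details are not given on the slide and are assumptions here: points start out assigned to the nearest candidate center by Euclidean distance, two clusters merge when either one's mean lies within `merge_dist` of the other under that other cluster's covariance, clusters with fewer than 3 points are dissolved, and a small `ridge` term keeps each covariance matrix invertible. All function names are hypothetical.

```python
import numpy as np

def mahalanobis(points, mean, cov):
    """Mahalanobis distance from each row of `points` to a cluster (mean, cov)."""
    diff = np.atleast_2d(points) - mean
    inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff))

def cluster_stats(points, labels, ridge):
    """Mean and covariance per cluster; clusters with < 3 points are skipped."""
    stats = {}
    for k in np.unique(labels):
        members = points[labels == k]
        if len(members) >= 3:
            cov = np.cov(members, rowvar=False) + ridge * np.eye(points.shape[1])
            stats[k] = (members.mean(axis=0), cov)
    return stats

def find_mergeable_pair(stats, merge_dist):
    """Return the first pair of cluster ids whose means are within merge_dist."""
    keys = sorted(stats)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            (mean_a, cov_a), (mean_b, cov_b) = stats[a], stats[b]
            # Assumption: use the smaller of the two directed distances.
            d = min(mahalanobis(mean_b, mean_a, cov_a)[0],
                    mahalanobis(mean_a, mean_b, cov_b)[0])
            if d < merge_dist:
                return a, b
    return None

def iterate_clusters(points, centers, merge_dist=3.0, max_iter=100, ridge=1e-6):
    """Alternate (re-estimate, merge, re-assign) until the labels stop changing."""
    centers = np.asarray(centers, dtype=float)
    # Assumed starting point: nearest candidate center by Euclidean distance.
    labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
    for _ in range(max_iter):
        old = labels.copy()
        stats = cluster_stats(points, labels, ridge)
        # Merge one pair at a time, re-estimating parameters after each merge.
        pair = find_mergeable_pair(stats, merge_dist)
        while pair is not None:
            labels[labels == pair[1]] = pair[0]
            stats = cluster_stats(points, labels, ridge)
            pair = find_mergeable_pair(stats, merge_dist)
        # Re-assign every point to the cluster with the shortest distance.
        keys = sorted(stats)
        dists = np.column_stack([mahalanobis(points, *stats[k]) for k in keys])
        labels = np.asarray(keys)[dists.argmin(axis=1)]
        if np.array_equal(labels, old):
            break  # no change, so done
    return labels
```

Here `iterate_clusters(points, centers)` takes an (n, 2) array of probe measurements and an array of candidate centers, and returns one integer cluster label per point. The sketch assumes at least one cluster keeps 3 or more points.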

Phase 2: Cluster Iteration
Iteration 1 …

Phase 2: Cluster Iteration
Iteration 4, no change so done!

Phase 3: Assigning calls
Calls are based on the ratio of y to x at the center of each cluster. If y/x ~ 1.3, then call as BB. If y/x ~ 1, then call as AB. If y/x ~ 0.7, then call as AA. If 2 or 3 clusters are present, then find which cluster is closest to each of these values.
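One way to read the calling rule is sketched below. The target ratios come from the slide; everything else is an assumption made for illustration: each cluster receives a distinct genotype, the assignment chosen is the one minimizing the total absolute difference between each cluster's y/x and its target, and cluster centers have x > 0. The name `call_clusters` is hypothetical.

```python
from itertools import permutations

# Target y/x ratios at the cluster center, as given on the slide.
TARGET_RATIOS = {"AA": 0.7, "AB": 1.0, "BB": 1.3}

def call_clusters(centers):
    """Assign a distinct genotype call to each cluster center (x, y).

    Assumption: with several clusters, pick the one-to-one assignment that
    minimizes the summed |y/x - target| over all clusters.
    """
    if not 1 <= len(centers) <= len(TARGET_RATIOS):
        raise ValueError("expected between 1 and 3 clusters")
    ratios = [y / x for x, y in centers]  # assumes x > 0

    def total_error(calls):
        return sum(abs(r - TARGET_RATIOS[c]) for r, c in zip(ratios, calls))

    # Try every ordered choice of distinct genotypes, one per cluster.
    best = min(permutations(TARGET_RATIOS, len(ratios)), key=total_error)
    return list(best)
```

For example, centers at (10, 7.2), (10, 10.1) and (10, 12.8) have ratios 0.72, 1.01 and 1.28 and are called AA, AB and BB. Two centers at (10, 9) and (10, 12) are called AB and BB, since that pairing has the smallest total error.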

Results
Clustering works much better when done within populations. The algorithm’s performance is comparable across all populations. Testing on 1111 SNPs in the Affy 100K XBA CEU dataset found the calls to be 96.47% accurate.

Results: Example Assignment
Ignore the point at (10,10). One incorrect call in black.

Results
Sometimes assigning calls is problematic. Sometimes clusters get improperly split. Sometimes clusters get improperly merged. Sometimes the grouping is right, but one of the clusters was miscalled; this could probably be fixed if the ratios were set more precisely.

Results: Sample Split Error

Results: Sample Merge Error

Conclusions
Accuracy is close to that of the best published algorithms. Faster run time. Simpler approach with less tuning. Need to run more data.
