Published by Merryl Miranda Bennett. Modified over 9 years ago.
1
Detecting copy number variations using paired-end sequence data
Nick Furlotte
CS224
May 29, 2009
2
SNPs are old news
SNPs have been the main way of quantifying genetic variation.
Attention is now switching to other, harder-to-detect variations.
Structural variations possibly account for a large portion of genetic variance.
Structural variations include:
– Insertions
– Deletions
– Inversions
– Translocations
– Copy number variations (CNVs)
3
Copy Number Variation: What does a CNV look like?
[Figure: donor sequence compared with the reference]
4
How do we find CNVs?
De novo sequence assembly is hard.
Resequencing is now an option with low-cost next-gen sequencing.
This project aims to find CNVs using next-gen sequencing reads.
5
Paired-end sequencing
[Figure: paired-end sequencing; source: Illumina website]
6
Paired-end sequencing
Read lengths are a known size.
Insert length has a distribution.
Output: [figure]
7
The Computational Problem
Given:
– A set of paired-end reads
– The mapping positions in the reference
Output:
1. A set of CNVs
2. An estimate of the boundaries of each CNV
3. For each CNV, an associated probability for the number of copies
Read and Mapping Quality
8
Proposed Method
1. Use discordant read pairs as the CNV signal.
2. Cluster discordant read pairs that explain the same CNV.
3. Estimate CNV boundaries based on the clustered reads.
4. For each cluster, calculate the probability that the number of copies = 1, 2, 3, …
9
Discordant read pair
[Figure: a concordant read pair, donor vs. reference]
10
Discordant read pair
[Figure: a discordant read pair, donor vs. reference]
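The slides show concordance only as a figure, so the following is a minimal sketch under a stated assumption: a pair counts as concordant when the forward read maps before the reverse read and the mapped span falls inside an accepted insert-length range. The names `ReadPair`, `is_discordant`, `min_insert`, and `max_insert` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    fwd_pos: int  # leftmost reference coordinate of the forward read
    rev_pos: int  # leftmost reference coordinate of the reverse read


def is_discordant(pair, read_len, min_insert, max_insert):
    """Return True when the mapped pair disagrees with the expected insert length.

    Assumption (not from the slides): the span is measured from the start of
    the forward read to the end of the reverse read.
    """
    span = pair.rev_pos + read_len - pair.fwd_pos
    # A span outside the accepted range, including a negative span when the
    # reverse read maps before the forward read, marks the pair as discordant.
    return not (min_insert <= span <= max_insert)


concordant = ReadPair(fwd_pos=100, rev_pos=300)
everted = ReadPair(fwd_pos=2000, rev_pos=1100)
print(is_discordant(concordant, read_len=36, min_insert=150, max_insert=350))
print(is_discordant(everted, read_len=36, min_insert=150, max_insert=350))
```

In practice the accepted range would be derived from the insert-length distribution of the library rather than fixed by hand.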
11
Clustering Discordant Read Pairs
Use a greedy approach:
1. Pick any discordant read pair.
2. Compare it with all other discordant read pairs and group any that are within a given distance.
3. Repeat until no more read pairs can be clustered together.
This sounds problematic, but with the right assumptions it works:
– Assume the maximum insert length is known.
– Assume that the reverse read maps into the second copy and the forward read maps into the first copy.
– Assume that CNVs are far apart.
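The greedy steps above can be sketched roughly as follows. This is an illustration, not the author's code: the distance test used here (both mates within `max_dist` of the seed pair's mates) is an assumption standing in for the talk's own cluster distance, and `greedy_cluster` and `max_dist` are hypothetical names.

```python
def greedy_cluster(pairs, max_dist):
    """Greedily group discordant read pairs.

    pairs: list of (fwd_pos, rev_pos) mapping positions.
    max_dist: assumed threshold, e.g. the maximum insert length.
    """
    unassigned = list(pairs)
    clusters = []
    while unassigned:
        # Step 1: pick any remaining discordant pair as the seed.
        seed = unassigned[0]
        cluster = [seed]
        remaining = []
        # Step 2: compare the seed with every other remaining pair.
        for pair in unassigned[1:]:
            close_fwd = abs(pair[0] - seed[0]) <= max_dist
            close_rev = abs(pair[1] - seed[1]) <= max_dist
            if close_fwd and close_rev:
                cluster.append(pair)
            else:
                remaining.append(pair)
        clusters.append(cluster)
        # Step 3: repeat on whatever could not be grouped with this seed.
        unassigned = remaining
    return clusters


pairs = [(2000, 1100), (2010, 1120), (2050, 1050), (90000, 88000)]
for cluster in greedy_cluster(pairs, max_dist=300):
    print(cluster)
```

Because each pair is only compared against the seed, the result can depend on which seed is picked first; the assumption that CNVs are far apart is what keeps that from mattering much.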
12
Read Pair Cluster Distance
[Figure labels: $x_{\min}$, $R + M_I$, $N$]
$N = x_{\min} + R + M_I$
13
Read Pair Cluster Distance
[Figure labels: $x_{\max}$, $R$, $N$]
$N = x_{\min} + R + M_I$
$N = x_{\max} + R$
$x_{\max} - x_{\min} = M_I$
[Piecewise definition ("if ... otherwise") not preserved]
14
Estimating Boundaries
We now have a set of clusters.
Simple boundary estimation:
Left boundary =
Right boundary =
15
Estimating the number of copies
Utilize coverage: position coverage ~ Poisson(coverage)
For each cluster:
– Perform a goodness-of-fit test for each coverage level (i.e., number of copies = 2 => coverage' = 2 * coverage).
– The coverage level that gives the best fit is the most likely.
Sadly, this did not work :(
So I resorted to estimating the copy number from the ratio of the mean coverage to the expected coverage.
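Both ideas on this slide can be illustrated with a short sketch. The ratio estimate follows the slide directly. For the Poisson part, the code compares log-likelihoods across candidate copy numbers, which is a simpler stand-in chosen here and not the goodness-of-fit test from the talk. All function names and `max_copies` are hypothetical.

```python
import math


def copy_ratio(coverages, expected_coverage):
    """Ratio of the mean per-position coverage to the expected coverage."""
    mean_coverage = sum(coverages) / len(coverages)
    return mean_coverage / expected_coverage


def poisson_log_likelihood(coverages, lam):
    """Log-likelihood of the counts under Poisson(lam).

    Uses log P(k; lam) = k * log(lam) - lam - log(k!).
    """
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in coverages)


def most_likely_copies(coverages, expected_coverage, max_copies=4):
    """Pick the copy number c whose rate c * expected_coverage fits best."""
    candidates = range(1, max_copies + 1)
    return max(
        candidates,
        key=lambda c: poisson_log_likelihood(coverages, c * expected_coverage),
    )


region = [78, 82, 80, 85, 75]  # made-up per-position coverage values
print(copy_ratio(region, expected_coverage=40))
print(most_likely_copies(region, expected_coverage=40))
```

Real coverage tends to be more dispersed than a Poisson model predicts, which may be one reason a fit-based test behaves poorly while the plain ratio remains usable.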
16
Simulation
Wrote a simulator tool in C:
– Simulates FASTQ paired-end reads
– Reads and writes MAQ files
– Computes coverage
Generated a 10MB random genome using mouse chr 19.
Inserted 5 CNVs spread across the genome.
Generated reads at 40x coverage.
Mapped to the fake reference using MAQ.
Applied the previously mentioned method.
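The original simulator was written in C and is not shown in the slides. As a rough illustration of the same idea, the Python sketch below builds a random reference, copies one segment in tandem to make a donor, and samples error-free read pairs from the donor. The tandem placement of the copy, the normal insert-length distribution, and every name and parameter are assumptions made for this sketch.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_genome(length, rng):
    """Random reference sequence over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def insert_tandem_copy(reference, start, length):
    """Donor in which reference[start:start+length] appears twice in a row."""
    end = start + length
    return reference[:end] + reference[start:end] + reference[end:]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(donor, coverage, read_len, mean_insert, sd_insert, rng):
    """Sample error-free read pairs from the donor at the given coverage."""
    n_pairs = int(coverage * len(donor) / (2 * read_len))
    pairs = []
    for _ in range(n_pairs):
        # Insert length drawn from a normal distribution (assumption),
        # floored so that both reads fit inside the fragment.
        insert = max(2 * read_len, int(rng.gauss(mean_insert, sd_insert)))
        start = rng.randrange(0, len(donor) - insert + 1)
        fragment = donor[start:start + insert]
        forward = fragment[:read_len]
        reverse = reverse_complement(fragment[-read_len:])
        pairs.append((forward, reverse))
    return pairs


rng = random.Random(0)
reference = random_genome(5000, rng)
donor = insert_tandem_copy(reference, start=1000, length=500)
pairs = simulate_pairs(donor, coverage=40, read_len=36,
                       mean_insert=200, sd_insert=20, rng=rng)
print(len(donor), len(pairs))
```

Reads sampled this way contain no sequencing errors, which matches the "perfect reads" condition the simulation results were obtained under.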
17
Results
Found all of the CNVs.
No false positives.
Worked exactly as predicted, because reads were perfect.

CNV | Position | Length | Min | Max | Predicted length | Copy Ratio (CNV) | Copy Ratio (Ref)
CNV 1 | 1000 | | 998 | 1968 | 970 | 2.03 | 1.003
CNV 2 | 10,000 | 1000 | 10,000 | 10,967 | 967 | 1.91 | 0.97
CNV 3 | 20,000 | 2000 | 19,999 | 21,967 | 1967 | 2.09 | 1.09
CNV 4 | 24,000 (200,300) | 2000 | 24,154 | 25,967 | 1813 | 1.95 | 1.04
CNV 5 | 1,000,000 (2000) | 10,000 | 1,002,000 | 1,009,968 | 7,968 | 1.92 | 0.99
18
Results
Applied the method to real mouse sequence data.
40x coverage on chromosome 17 for the CAST mouse strain.
Found 1456 possible CNVs.
Most of them were crazy looking.
Zeroed in on 5 that look interesting.

CNV | Min | Max | Predicted length | Copy Ratio (CNV)
CNV 1 | 27343569 | 27471598 | 128029 | 2.94
CNV 2 | 36324769 | 36348359 | 23590 | 0.85
CNV 3 | 34102149 | 36292926 | 2190777 | 0.77
CNV 4 | 19711195 | 19907779 | 196584 | 0.71
CNV 5 | 31390566 | 31446686 | 56120 | 2.4

Obviously the method needs some work.
19
Future Work
Table from Lee, et al. shows that the number of perfectly mapped reads that are discordant is very small.
Their method considers all possible mapping positions for each pair.
My method needs to consider this and toss MAQ.
20
Future Work
Replace the ad-hociness with a more formal probabilistic framework.
Consider all high-quality mapping positions (take into account low-quality mappings).
Consider the problem of repeated sequences.