1
Review of paper submitted to NAR - confidential
InPhaDel: Integrative shotgun and proximity-ligation sequencing to phase deletions with single nucleotide polymorphisms
Anand Patel 1,2, Siddarth Selvaraj 1, Vikas Bansal 3 and Vineet Bafna 1,2
1 Bioinformatics and Systems Biology Program, UCSD
2 Dept. of Computer Science and Engineering, UCSD
3 Dept. of Pediatrics, School of Medicine, UCSD
David Amar and Ron Zeira
2
Introduction Phasing. Clinical importance. WGS data.
Discordant vs. concordant reads. HiC data. Phasing using HiC data.
3
Phasing Sequencing produces a genotype.
Phasing: inference of the haplotypes.
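To make the genotype-versus-haplotype distinction concrete, the sketch below enumerates every haplotype pair consistent with an unphased genotype. The function name and input format are assumptions made for illustration; with k heterozygous sites there are 2^(k-1) candidate pairs, which is why phasing needs extra evidence.

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """genotype: list of unordered allele pairs, one per site.

    Returns the set of (hap1, hap2) string pairs that explain the genotype.
    The first heterozygous site is kept fixed so that (h1, h2) and (h2, h1)
    are not counted twice.
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        flip_at = dict(zip(het_sites[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a  # swap which haplotype carries which allele
            h1.append(a)
            h2.append(b)
        pairs.add(("".join(h1), "".join(h2)))
    return pairs

# Two heterozygous sites and one homozygous site: two possible phasings.
print(consistent_haplotype_pairs([("A", "G"), ("C", "C"), ("T", "A")]))
```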
4
Phasing using trios
5
Phasing using population
6
Phasing with deletions
Donor == Phasing option
7
Clinical importance Creates compound cis/trans interactions.
Miller syndrome: SNPs knock out both copies of DHODH.
Hemizygosity increases risk for autism and schizophrenia.
Hemizygosity with rare mutations produces diverse syndromes (velo-cardio-facial, DiGeorge, pseudoxanthoma elasticum).
8
WGS: Paired end reads
9
WGS: concordant, discordant reads
10
HiC data
11
Phasing using HiC HaploSeq: Selvaraj et al., Nature Biotechnology 2013.
De novo phasing without trio information. Basic observation: HiC reads are predominantly (>99%) intrahaplotype. Haplotype blocks are assembled by connecting reads that connect different variants.
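The block-assembly idea can be sketched as connected components over variants: any read pair that covers two or more variants links them into the same block. This is a simplified illustration and not the HaploSeq algorithm itself; `haplotype_blocks` and its input format are assumptions.

```python
def haplotype_blocks(read_links):
    """read_links: iterable of tuples of variant positions covered by one read pair.

    Returns the groups of variants that are transitively connected by reads.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for variants in read_links:
        variants = list(variants)
        for v in variants:
            find(v)  # register every variant, even if it is alone on its read
        for v in variants[1:]:
            parent[find(v)] = find(variants[0])  # union with the first variant

    blocks = {}
    for v in parent:
        blocks.setdefault(find(v), []).append(v)
    return sorted(sorted(block) for block in blocks.values())

# A long-range HiC pair (5000, 90000) joins variants that short reads cannot.
print(haplotype_blocks([(100, 5000), (5000, 90000), (200000, 200500)]))
```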
12
This paper Method for phasing deletions.
Uses both HiC and WGS data from a single individual. Method is based on supervised learning. Models are learned on chromosomes 2,3, and 4 and are validated on the other chromosomes. Results estimate that HiC enhanced phasing in ~30% of the deletions.
13
Outline Basic idea. Features. Simulations. CV results. Test results.
Additional validations.
14
InPhaDel basic idea
Input:
Heterozygous SNVs (genomic positions).
Sequences of parental alleles (pA, pB).
Deletion calls: each is a breakpoint pair (a,b).
WGS reads, HiC reads.
Output:
A model that can classify deletions into: pA, pB, homozygous, or inconsistent.
15
WGS features Count concordant read pairs inside deletion and around deletion ends. Count discordant reads. Repeat for each allele and normalize.
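A minimal sketch of the WGS counts for one deletion call (a,b) follows. The thresholds `max_insert` and `flank`, the `ReadPair` type, and the feature names are assumptions for illustration; the per-allele split and the normalization step are left out because the slides do not specify them.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    start: int  # leftmost mapped coordinate of the pair
    end: int    # rightmost mapped coordinate of the pair

def wgs_features(pairs, a, b, max_insert=1000, flank=500):
    """Count read pairs that support or contradict a deletion between a and b."""
    feats = {"conc_inside": 0, "conc_left_bp": 0, "conc_right_bp": 0, "discordant": 0}
    for p in pairs:
        concordant = (p.end - p.start) <= max_insert  # assumed definition
        if concordant:
            if a <= p.start and p.end <= b:
                feats["conc_inside"] += 1      # sequence inside the deletion is present
            elif p.start < a < p.end:
                feats["conc_left_bp"] += 1     # pair spans the left breakpoint
            elif p.start < b < p.end:
                feats["conc_right_bp"] += 1    # pair spans the right breakpoint
        elif a - flank <= p.start <= a and b <= p.end <= b + flank:
            feats["discordant"] += 1           # ends flank the deletion: supports it
    return feats

pairs = [ReadPair(11000, 11400), ReadPair(9800, 10200),
         ReadPair(14900, 15300), ReadPair(9700, 15200), ReadPair(2000, 2400)]
print(wgs_features(pairs, a=10000, b=15000))
```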
16
HiC features Count reads inside deletion, around deletion BP and around cut sites for each orientation and each allele.
17
Boosting performance
Recap:
Instance is a deletion.
Features are the read counts (by type).
Labels are pA, pB, hom, or inconsistent.
423 high quality deletions from NA12878.
Manual and automatic annotation, >1kb.
Initial CV results had high variance and sometimes low performance.
Solution: simulate additional instances.
18
Simulating instances
Chromosomes 2,3,4 had 99 deletions: 25 homozygous, 60 heterozygous and 14 incorrect Yoruba.
Simulated 6 pairs of chrom. 2,3,4, each with 50 deletions.
Deletion size was randomly selected using a distribution from Mills et al.
Overall: 1050 deletions (256 homozygous, 263 pA, 263 pB, 268 incorrect).
After filtering: 159 hom., 261 pA, 273 pB, 263 incorrect (criteria not explained).
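Placing simulated deletions can be sketched as drawing a size from an empirical distribution and a random non-overlapping position. The size bins below are invented placeholders, not the Mills et al. distribution, and the uniform label draw and the function name are likewise assumptions; the slides do not say how labels or "incorrect" calls were generated.

```python
import random

LABELS = ["hom", "pA", "pB", "incorrect"]

def simulate_deletions(chrom_len, n, size_bins, rng):
    """Place n non-overlapping deletions on one chromosome.

    size_bins: list of (size_in_bp, weight) pairs standing in for an
    empirical size distribution (hypothetical values in the example).
    """
    sizes, weights = zip(*size_bins)
    placed = []
    while len(placed) < n:
        size = rng.choices(sizes, weights=weights, k=1)[0]
        start = rng.randrange(0, chrom_len - size)
        end = start + size
        # Reject positions that overlap a deletion already placed.
        if all(end <= s or start >= e for s, e, _ in placed):
            placed.append((start, end, rng.choice(LABELS)))
    return sorted(placed)

dels = simulate_deletions(10_000_000, 50, [(1000, 5), (5000, 3), (20000, 1)],
                          random.Random(0))
print(len(dels), dels[:2])
```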
19
Simulating WGS reads
Given the new deletions we have a new reference genome S’.
Use wgsim to draw positions for reads.
Reads are created at 81x coverage, 500bp fragment size, 100bp read size.
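The coverage figure translates into a number of read pairs through coverage = (pairs × 2 × read length) / genome length. The helper below is a sketch of that arithmetic; the 1 Mb region in the example is an arbitrary illustration, not a value from the paper.

```python
def read_pairs_for_coverage(genome_len, coverage, read_len):
    """Number of paired-end read pairs needed to reach a target mean coverage."""
    return round(coverage * genome_len / (2 * read_len))

# 81x coverage with 100bp paired reads over a 1 Mb region.
print(read_pairs_for_coverage(1_000_000, 81, 100))  # 405000
```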
20
Simulating HiC No HiC data simulator.
Shuffle each read pair (a,b) into a new position on the new reference genome.
Keep:
Same distance between the two reads of the pair.
Same read pair orientation (+/+, -/-, +/-, -/+).
Proximity to a HindIII cut site.
Use read pairs of distance < 40kb.
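One way to read the shuffle is sketched below: each pair keeps its distance, its orientation, and its offset from the nearest cut site, but is re-anchored at a randomly chosen cut site in the new reference. This is an interpretation of the slide, not the authors' code; the function names and the offset-preserving rule are assumptions.

```python
import bisect
import random

def nearest_site(sites, pos):
    """Closest cut site to pos; sites must be sorted and non-empty."""
    i = bisect.bisect_left(sites, pos)
    candidates = sites[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s - pos))

def shuffle_pairs(pairs, old_sites, new_sites, genome_len, rng, max_dist=40_000):
    """pairs: list of (a, b, orientation) with a <= b on the original reference."""
    out = []
    for a, b, orient in pairs:
        dist = b - a
        if not 0 <= dist < max_dist:
            continue  # only short-range pairs are kept
        offset = a - nearest_site(old_sites, a)  # proximity to the cut site
        while True:  # assumes at least one cut site yields an in-bounds pair
            new_a = rng.choice(new_sites) + offset
            if 0 <= new_a and new_a + dist < genome_len:
                break
        out.append((new_a, new_a + dist, orient))
    return out

pairs = [(1050, 6050, "+/-"), (2000, 90000, "+/+")]
print(shuffle_pairs(pairs, [1000, 20000], [5000, 30000, 70000], 100_000,
                    random.Random(0)))
```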
21
Simulating HiC Authors mention that we only want to keep the diagonal similar.
HiC data of high proximity reads contain allele interactions within the same haplotype (Selvaraj et al. 2013).
22
Training classifiers Nested 5-fold cross validation.
K-nearest neighbors: K = 2, 4, 8, 16, 32.
Linear SVM: C (margin parameter) = 0.1, 1, 10, 100.
Random forest: tree depth (2, 5, 10, 20), num trees (10, 20, 50, 100).
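Nested cross validation separates hyperparameter choice (inner folds) from performance estimation (outer folds). The sketch below shows the structure with a small hand-written k-nearest-neighbors classifier and the K grid from the slide; it is a generic illustration under assumed data structures, not the authors' pipeline, and the SVM or random forest grids would plug into the same loop.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: sum((u - v) ** 2
                                                 for u, v in zip(item[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def folds(data, n):
    return [data[i::n] for i in range(n)]

def nested_cv(data, k_grid=(2, 4, 8, 16, 32), n_folds=5, seed=0):
    """Return one (chosen k, outer-fold accuracy) tuple per outer fold."""
    data = data[:]
    random.Random(seed).shuffle(data)
    outer = folds(data, n_folds)
    results = []
    for i, test in enumerate(outer):
        train = [item for j, f in enumerate(outer) if j != i for item in f]
        inner = folds(train, n_folds)

        def inner_score(k):
            # Mean validation accuracy over the inner folds; never sees `test`.
            scores = []
            for m, val in enumerate(inner):
                fit = [item for j, f in enumerate(inner) if j != m for item in f]
                scores.append(accuracy(fit, val, k))
            return sum(scores) / len(scores)

        best_k = max(k_grid, key=inner_score)
        results.append((best_k, accuracy(train, test, best_k)))
    return results
```

Each instance is a `(feature_tuple, label)` pair, so calling `nested_cv(instances)` on a list of labelled deletions returns five outer-fold scores whose spread gives a sense of the variance mentioned earlier.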
23
Results: cross validation
Panels: 3 chrom.; 3 chrom. + simulated chrom.; chroms 1-10.
24
WGS/HiC data effects
25
Summarizing features Reduce 32 features to 6 features.
Total WGS concordant reads.
Total WGS discordant reads.
WGS+HiC concordant pA/pB.
WGS discordant pA/pB.
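The reduction to six summary features can be sketched as summing groups of the detailed counts. The key layout `(source, kind, allele, region)` and all feature names are hypothetical, chosen only to illustrate the aggregation; the slides do not list the 32 original features.

```python
def summarize(features):
    """Collapse detailed read counts into six summary features.

    features: dict mapping (source, kind, allele, region) -> count, where
    source is "wgs" or "hic", kind is "conc" or "disc", and allele is
    "pA", "pB", or None for reads not assigned to an allele.
    """
    summary = {"wgs_conc_total": 0, "wgs_disc_total": 0,
               "conc_pA": 0, "conc_pB": 0, "disc_pA": 0, "disc_pB": 0}
    for (source, kind, allele, region), n in features.items():
        if source == "wgs":
            summary[f"wgs_{kind}_total"] += n   # totals ignore the allele
        if kind == "conc" and allele in ("pA", "pB"):
            summary[f"conc_{allele}"] += n      # WGS and HiC pooled
        if source == "wgs" and kind == "disc" and allele in ("pA", "pB"):
            summary[f"disc_{allele}"] += n      # WGS only
    return summary

detailed = {
    ("wgs", "conc", "pA", "inside"): 3,
    ("wgs", "conc", "pB", "inside"): 1,
    ("wgs", "conc", None, "left_bp"): 4,
    ("wgs", "disc", "pA", "flank"): 2,
    ("hic", "conc", "pA", "inside"): 5,
    ("hic", "conc", "pB", "cut_site"): 6,
}
print(summarize(detailed))
```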
26
Results: test set 256 deletions in chrom. withheld from training.
Total accuracy 85.9%±4.3%.
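The slide does not say how the ±4.3% was derived. One reading, stated here as an assumption, is a 95% normal-approximation binomial interval over the 256 test deletions, which reproduces the reported half-width.

```python
import math

def binomial_halfwidth(accuracy, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

# 85.9% accuracy on 256 deletions.
print(round(binomial_halfwidth(0.859, 256), 3))  # 0.043
```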
27
Results: test set 2
28
WGS/HiC data effects Differences due to classification errors or exclusion?
29
Additional validations
Pendleton et al. single-molecule WGS.
1323 deletions > 1kb. 336 annotated maternal/paternal.
407/423 deletion overlap. 194 annotated.
93% of shared deletions have same phasing.
InPhaDel accuracy is 73% (of 336 deletions).
What were the features of the tested instances? HiC data?
How many accurate predictions are not in the training?
Top 8 deletions were validated using PCR.
30
Summary Novel method for phasing deletions.
Novel use of HiC data and a novel HiC read shuffle simulator. HiC responsible for phasing 33% of deletions. Take home message: HiC can be used for new purposes.
31
Comments I Better to test the whole flow (HaploSeq + InPhaDel) on a new sample (GM06690?).
CV results: report results separately for the simulated and real instances. Should report only on the real instances.
HiC contribution of 33%: was the phasing of SNVs based on HaploSeq only?
Simulation contribution does not seem significant. Did they see the results on the test set and then add the simulations? OVERFITTING?
32
Comments II Deletion data:
Manual selection! Easy cases?
Excluded deletions: the criteria are not well explained.
The data of Pendleton et al. had three times more deletions.
Size filter: tailored for >1kb deletions. Mills: most dels < 1kb.
Results on feature importance are missing.
Comparison to Pendleton: input/output? What were the features of the tested instances? HiC data? How many accurate predictions are not in the training?
33
Other comments HiC shuffler:
How to handle deletions?
Seems to smooth the data.
Loses 3D structure.
Visualization should be improved.
Some methods are not adequately explained.