How to Solve NP-hard Problems in Linear Time


David Tse, Stanford University

Computational Genomics Information Theory

The communication problem (C.E. Shannon, 1948, Figure 1). Computation: maximum-likelihood decoding is NP-hard. Information: let's find the information limit first.

Information before computation. [Plot: probability of error versus data rate, with the capacity marked.] Capacity-achieving codes that can be efficiently decoded were discovered (after 70 years).

Computational genomics. De novo genome assembly: Bresler, Bresler & T., BMC Bioinformatics, 2013; Shomorony, Courtade & T., Bioinformatics, 2016; Kamath, Shomorony, Xia, Courtade & T., Genome Research, 2017. De novo transcriptome assembly: Kannan, Pachter & T., 2016. Haplotype phasing: Chen, Kamath, Suh & T., ICML 2016.

23 pairs of chromosomes. Haplotype.

[Figure: maternal and paternal sequences, with the major allele and minor allele marked at a SNP.]

Haplotype phasing. [Figure: major and minor alleles.]

High-Throughput Sequencing. Read length << inter-SNP distance. We don't know which chromosome each read comes from, and there is no linking information!

Linking Information. Some types of reads come in a pair: mate-pair reads, with each pair a few SNPs apart. 10X generates bar-coded reads that are tens to hundreds of SNPs apart. Long reads like PacBio or ONT can also provide linking information.

Haplotype phasing with noisy linking reads. Each pair of linked reads gives noisy parity information.

Information before computation. How many reads are needed to phase reliably? How to phase efficiently?

Haplotype phasing = community recovery. [Figure: Adamic & Glance, 2005.] Chen, Kamath, Suh & T., “Community recovery for graphs with locality”, ICML 2016.

Back to the example. [Figure: community a and community b.]

Combinatorial optimization approach. Maximum-likelihood community recovery is NP-hard (reduction from MAXCUT). Hajek, Wu & Xu, “Achieving Exact Cluster Recovery Threshold via Semidefinite Programming”, Trans. Info. Theory, 2016.

Information theoretic approach. Uniform linking model: linking reads are equally likely to be between any pair of SNPs. Information limit: the optimal number of reads for phasing n >> 1 SNPs, where p = error rate [formula not captured]. Hajek, Wu & Xu, “Achieving Exact Cluster Recovery Threshold via Semidefinite Programming”, Trans. Info. Theory, 2016.
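
To make the uniform linking model concrete, here is a minimal Python sketch of how noisy parity reads could be simulated. It assumes each SNP carries a hidden label of +1 or -1 (its community) and that a read between SNPs i and j reports the product of their labels, flipped with probability p. The function name `simulate_linking_reads` and the +1/-1 encoding are illustrative choices, not taken from the slides.

```python
import random

def simulate_linking_reads(labels, num_reads, p, rng):
    """Sample noisy parity reads under a uniform linking model (illustrative).

    labels: hidden +1/-1 label per SNP (assumed encoding of the two communities).
    p: probability that a read reports the wrong parity.
    Returns a list of (i, j, parity) tuples with parity in {+1, -1}.
    """
    n = len(labels)
    reads = []
    for _ in range(num_reads):
        # Every pair of distinct SNPs is equally likely to be linked.
        i, j = rng.sample(range(n), 2)
        parity = labels[i] * labels[j]  # +1 if same community, -1 otherwise
        if rng.random() < p:
            parity = -parity            # sequencing noise flips the parity
        reads.append((i, j, parity))
    return reads
```

For example, `simulate_linking_reads([1, -1, 1, 1], 10, 0.1, random.Random(0))` returns ten reads, each wrong with probability 0.1.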

Coverage depth vs error rate

Simulations: n = 100,000 SNPs. [Plot against # of reads, with the information limit marked.] 100 Monte Carlo runs to get each point. Each run takes ~15 seconds on a Mac Air.

Genie-aided lower bound. Suppose a genie tells you the correct community of all SNPs except one. This gives a lower bound on the # of linking reads [formula not captured].

Efficient approximate recovery. [Figure: SNPs labeled as community a or community b.]

Two-step Algorithm. Step I: approximate recovery using a spectral algorithm. Step II: refinement using majority vote for each SNP.
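
The two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: reads are `(i, j, parity)` tuples with parity in {+1, -1}, the spectral step takes the sign of the leading eigenvector of the signed read matrix, and that eigenvector is computed here by plain power iteration (a swapped-in choice that works when the signal dominates the noise). All function names are hypothetical.

```python
import random

def spectral_estimate(n, reads, iterations=100, seed=0):
    """Step I: sign of the leading eigenvector of the signed read matrix."""
    adj = [[] for _ in range(n)]
    for i, j, y in reads:
        adj[i].append((j, y))
        adj[j].append((i, y))
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # One sparse matrix-vector product; cost is linear in the number of reads.
        w = [sum(y * v[j] for j, y in adj[i]) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [1 if x >= 0 else -1 for x in v]

def majority_refine(n, reads, estimate):
    """Step II: each SNP takes the label favored by a majority of its reads."""
    votes = [0] * n
    for i, j, y in reads:
        votes[i] += y * estimate[j]  # read says: label_i = y * label_j
        votes[j] += y * estimate[i]
    # Keep the Step I label when the vote is tied.
    return [1 if s > 0 else -1 if s < 0 else estimate[k]
            for k, s in enumerate(votes)]

def two_step_phase(n, reads):
    return majority_refine(n, reads, spectral_estimate(n, reads))
```

The output is only defined up to a global sign flip, since swapping the names of the two communities leaves every parity unchanged.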

Contact maps. Uniform linking model vs. 10X data (Zheng et al., Nature Biotech, 2016). Local linking model (Chen et al., ICML, 2016).

Spectral-Stitching Algorithm. Step I: approximate recovery on overlapping blocks via the spectral algorithm. Step II: stitching across blocks.

Spectral-Stitching Algorithm. Step I: approximate recovery on overlapping blocks via the spectral algorithm. Step II: stitching across blocks. Step III: refinement using majority vote for each SNP.
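
The stitching step can be illustrated with a short Python sketch. It assumes each block estimate is a list of +1/-1 labels for consecutive SNPs, that each block is only correct up to its own global sign flip, and that consecutive blocks overlap and together cover a contiguous range. Each new block is flipped, if needed, to agree with the already stitched labels on the overlap. The function name `stitch_blocks` and the `(start, labels)` input format are hypothetical.

```python
def stitch_blocks(block_estimates):
    """Align overlapping per-block estimates into one label sequence.

    block_estimates: list of (start, labels) in left-to-right order, where
    labels[k] is the +1/-1 estimate for SNP start + k.
    """
    first_start, first_labels = block_estimates[0]
    stitched = {first_start + k: s for k, s in enumerate(first_labels)}
    for start, labels in block_estimates[1:]:
        # Positive agreement: the block is already aligned with what came before.
        agreement = sum(stitched[start + k] * s
                        for k, s in enumerate(labels)
                        if start + k in stitched)
        flip = -1 if agreement < 0 else 1
        for k, s in enumerate(labels):
            # Keep earlier labels on the overlap; add only the new SNPs.
            stitched.setdefault(start + k, flip * s)
    return [stitched[pos] for pos in sorted(stitched)]
```

For instance, stitching `(0, [1, -1, -1, 1, 1])` with `(3, [-1, -1, 1, -1, -1])` flips the second block, because it disagrees with the first on SNPs 3 and 4.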

Evaluation on 10X data. Deligiannis, Jiang, Zhu & T. Dataset: NA12878_WGS. 4 metrics: # of unphased SNPs, N50 of phased blocks, short switch error rate, long switch error rate. Zheng et al., Nature Biotech, 2016.

N50 of phased blocks. [Figure: SNPs along a chromosome grouped into phased blocks, with the 50% mark and the N50 indicated.]
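
As a rough illustration, N50 can be computed as below. The sketch assumes the common definition: sort the blocks from longest to shortest and report the length of the block at which the running total first reaches half of the overall total. Whether length is counted in SNPs or base pairs, and whether the total is the sum of block lengths (as assumed here) or the chromosome length, depends on the evaluation convention.

```python
def n50(block_lengths):
    """Return the N50 of a list of phased-block lengths (0 for an empty list)."""
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        # Stop at the first block where the longest blocks cover half the total.
        if 2 * running >= total:
            return length
    return 0
```

With block lengths `[2, 3, 4, 5, 6]` the total is 20; the two longest blocks (6 and 5) already cover 11, so the N50 is 5.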

Switch errors within a phased block: short switch error and long switch error. [Figure: output compared with ground truth.]
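
One way to count the two kinds of switch error is sketched below. Conventions differ between papers, so the definitions here are an assumption: a switch point is a position where the output changes from agreeing with the ground truth to disagreeing (or back), a short switch error is a single isolated SNP with the wrong phase (two adjacent switch points), and every other switch point is a long switch error. The function name `count_switch_errors` is hypothetical.

```python
def count_switch_errors(output, truth):
    """Return (short, long) switch error counts within one phased block."""
    mismatch = [o != t for o, t in zip(output, truth)]
    # A switch point sits at i when the agreement status changes from i-1 to i.
    switches = [i for i in range(1, len(mismatch))
                if mismatch[i] != mismatch[i - 1]]
    short_errors = 0
    long_errors = 0
    k = 0
    while k < len(switches):
        if k + 1 < len(switches) and switches[k + 1] == switches[k] + 1:
            short_errors += 1  # one SNP flipped, then immediately flipped back
            k += 2
        else:
            long_errors += 1   # the phase stays switched for more than one SNP
            k += 1
    return short_errors, long_errors
```

A block whose output is the exact complement of the ground truth has no switch points and therefore no errors, which matches the fact that phasing is only defined up to swapping the two haplotypes. An isolated flip at the first or last SNP yields a single switch point and is counted as long in this sketch.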

Runtimes of spectral stitching. Single core, Intel i7-5500U CPU @ 2.40GHz.

Conclusion. Many computational genomics problems have NP-hard combinatorial optimization formulations, but there are often efficient alternatives that reach the information limit. Haplotype phasing is a case study. Optimality on theoretical models translates to benefits on real data.