Provable Learning of Noisy-OR Networks

Presentation transcript:

Provable Learning of Noisy-OR Networks
Rong Ge, Duke University
Joint work with Sanjeev Arora, Tengyu Ma, Andrej Risteski
"Provable Learning of Noisy-OR Networks", STOC 2017, arXiv:1612.08795
"New practical algorithms for learning Noisy-OR networks via symmetric NMF"

Latent Variable Models
Linear latent variable models can be learned by tensor decomposition. Nonlinear latent variable models are harder to learn. A key example is Noisy-OR networks (defined on the next slide) [Shwe et al. '91][Jordan et al. '99], which can be viewed as simpler versions of RBMs.

Disease-Symptom Networks
A bipartite graph: diseases d on one side, observed symptoms s on the other, connected by a weight matrix W. The m diseases d_j occur independently, each with probability ρ. Edge weight: Pr[s_i = 0 | d_j = 1] = exp(−W_ij). QMR-DT: 570 diseases, ~4k symptoms, ~45k edges.

Noisy-OR
[Slide figure] Each present disease j triggers symptom i independently with probability 1 − exp(−W_ij); the symptom fires if any present disease triggers it. Example from the figure: two diseases that each trigger a symptom with probability 50% together trigger it with probability 1 − 0.5 × 0.5 = 75%.
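To make the generative model concrete, here is a minimal NumPy sketch of sampling from a noisy-OR disease-symptom network; the function name and interface are my own choices, and the 50%/50% → 75% example from the slide is reproduced at the bottom.

```python
import numpy as np

def sample_noisy_or(W, rho, num_samples, seed=None):
    """Sample symptom vectors from a noisy-OR disease-symptom network.

    W    : (n_symptoms, m_diseases) nonnegative weights,
           Pr[s_i = 0 | only d_j = 1] = exp(-W[i, j]).
    rho  : probability that each disease is present, independently.
    """
    rng = np.random.default_rng(seed)
    n, m = W.shape
    d = (rng.random((num_samples, m)) < rho).astype(float)   # diseases present
    p_off = np.exp(-(d @ W.T))                                # Pr[symptom stays off]
    s = rng.random((num_samples, n)) >= p_off                 # symptom fires
    return s.astype(np.int8), d.astype(np.int8)

# The slide's example: two diseases that each trigger a symptom with
# probability 50% together trigger it with probability 1 - 0.5 * 0.5 = 75%.
W = np.array([[np.log(2.0), np.log(2.0)]])                    # 1 symptom, 2 diseases
s, _ = sample_noisy_or(W, rho=1.0, num_samples=200_000, seed=0)
print(s.mean())                                               # ~0.75
```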

Our Results
Theorem [AGMR'17a] (informal): A poly-time algorithm recovers W with O(ρ√m) relative error in the ℓ2-norm in each column (disease). Example: for ρ = 5/m, the relative error is O(1/√m). Fewer requirements on the structure of the network.
Theorem [AGMR'17b] (informal): If the network has a nice combinatorial structure (true for QMR-DT), an algorithm recovers W with accuracy ε using poly(n, m, 1/ε) samples and running time. Structure assumptions similar to previous works [Halpern-Sontag'13][Jernite-Halpern-Sontag'13], but a faster algorithm. Can recover 300 diseases on synthetic data.

Topic Models vs. Noisy-OR
Topic models: a topic = a distribution over words; with multiple topics, words are drawn from a mixture distribution. Goal: given documents, find the topics. Many algorithms with guarantees.
Noisy-OR: a disease = a set of symptoms; with multiple diseases, the observed symptoms are the union of each disease's symptoms. Goal: given patients, find the diseases. Few algorithms with guarantees.
Why are topic models easier to learn?

Linear vs. Nonlinear Models
Document with 2 topics: generate 30% of the words from topic 1 and 70% from topic 2; the final document is all of the words, so E[doc] = 0.3·topic1 + 0.7·topic2.
Patient with 2 diseases: generate symptoms from disease 1 and symptoms from disease 2; the final symptoms are the union of the two sets, so E[patient] is a nonlinear expression.
Our idea: we need a way to linearize the model!
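A tiny numerical illustration, with hypothetical numbers, of why the patient's expectation is nonlinear: conditioned on two present diseases, the probability of a symptom is a product-form union of independent triggers, not a sum of per-disease contributions.

```python
# f1, f2 are the per-disease probabilities of triggering one particular
# symptom, f = 1 - exp(-W). The numbers below are made up for illustration.
f1, f2 = 0.8, 0.5

linear = f1 + f2                       # what adding contributions would give
noisy_or = 1 - (1 - f1) * (1 - f2)     # union of independent triggers
print(linear, noisy_or)                # 1.3 vs 0.9 -> the expectation is nonlinear
```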

Idea: PMI and Linearization
Pointwise mutual information (PMI) matrix: PMI(x, y) > 0 ⇒ x, y are positively correlated, and vice versa. Symptoms i, j share a disease ⇒ PMI_ij > 0.
Claim 1: PMI = ρ Σ_{k=1}^{m} F_k F_k^⊤ + ρ²·(higher-order terms) ≈ ρ Σ_{k=1}^{m} F_k F_k^⊤, where F_k = 1 − exp(−W_k).
Idea: use a Taylor expansion; taking the log linearizes the product.
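As a sketch of how the PMI matrix might be estimated from samples: since the noisy-OR probabilities of symptoms being absent factor as a product over diseases, it is the PMI of the absence events whose log expansion gives ρFF^⊤ plus higher-order terms. Treat this convention, the function name, and the eps smoothing as assumptions of the sketch.

```python
import numpy as np

def empirical_pmi_matrix(S, eps=1e-12):
    """Empirical PMI matrix of binary symptom data S (one 0/1 row per patient).

    Computed on the events "symptom absent", whose probabilities factor over
    diseases in the noisy-OR model, so the log turns the product into a sum
    and PMI ≈ rho * F F^T up to higher-order terms (Claim 1 on the slide).
    """
    A = 1.0 - S.astype(float)                 # indicator of symptom ABSENT
    p0 = A.mean(axis=0)                       # Pr[s_i = 0]
    joint0 = (A.T @ A) / A.shape[0]           # Pr[s_i = 0, s_j = 0]
    PMI = np.log(joint0 + eps) - np.log(np.outer(p0, p0) + eps)
    # The diagonal is not a meaningful estimate of the model quantity
    # (see the "Additional Difficulties" slide at the end).
    return PMI
```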

PMI Tensor
A similar expression holds for the PMI tensor: an n×n×n tensor that measures 3-wise correlations, defined by a combination analogous to the inclusion-exclusion formula.
Claim 2: it has similar higher-order terms (systematic error) as the PMI matrix.
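A corresponding sketch for the order-3 PMI tensor, using an inclusion-exclusion-style combination of log-probabilities of joint absences; with this sign convention the leading term is ρ Σ_k F_k ⊗ F_k ⊗ F_k under the same expansion as the matrix case. The paper's exact normalization is an assumption here, and the dense n×n×n construction is only for illustration on small n.

```python
import numpy as np

def empirical_pmi_tensor(S, eps=1e-12):
    """Empirical 3-way PMI tensor of binary data (absence-event convention).

    PMI3[i,j,k] = log P(i,j absent) + log P(j,k absent) + log P(i,k absent)
                  - log P(i,j,k absent) - log P(i) - log P(j) - log P(k).
    Builds the full n x n x n tensor, so use only for small n.
    """
    A = 1.0 - S.astype(float)                              # symptom ABSENT
    N, n = A.shape
    lp1 = np.log(A.mean(axis=0) + eps)                     # log Pr[s_i = 0]
    lp2 = np.log((A.T @ A) / N + eps)                      # log Pr[s_i = 0, s_j = 0]
    lp3 = np.log(np.einsum('ti,tj,tk->ijk', A, A, A) / N + eps)
    return (lp2[:, :, None] + lp2[None, :, :] + lp2[:, None, :]
            - lp3
            - lp1[:, None, None] - lp1[None, :, None] - lp1[None, None, :])
```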

Plan: Tensor Decompositions with PMI
[AFHKL12, AGHKT14]: tensor decomposition for topic models.
[Slide diagram] word-word correlation + 3-word correlation → tensor decomposition → topic matrix.

Plan: Tensor Decompositions with PMI
[Slide diagram] Hope: PMI matrix + PMI tensor → tensor decomposition → weight matrix W.
Challenge: we only have access to the tensor plus a systematic error.

Systematic Error in PMI Matrix
Claim 1++: PMI ∈ ℝ^{n×n} is approximately rank-m:
PMI ≈ ρ F F^⊤ + ρ² G G^⊤ + negligible terms, where F = 1 − exp(−W), G = 1 − exp(−2W).
Question: how to recover the span of F from PMI?
Attempt: use a standard matrix perturbation theorem (Davis-Kahan, Wedin, ...):
recovery error ≲ ρ · max_x [x^⊤ G G^⊤ x / (σ_m(F)² ‖x‖²)] ≲ ρm,
which is vacuous since ρm ≥ 1.
Difficulty: the matrices F and G are not well-conditioned (QMR-DT: condition number > 40, √m < 80).
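The "attempt" above recovers a candidate span of F as the top-m eigenspace of the symmetric PMI matrix. A minimal sketch (function name is my own):

```python
import numpy as np

def top_m_subspace(PMI, m):
    """Orthonormal basis of the top-m eigenspace of the PMI matrix,
    used as a candidate for the column span of F (PMI ≈ rho F F^T + error)."""
    PMI = (PMI + PMI.T) / 2                        # enforce symmetry numerically
    eigvals, eigvecs = np.linalg.eigh(PMI)         # eigenvalues in ascending order
    return eigvecs[:, np.argsort(eigvals)[::-1][:m]]
```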

Relative Matrix Perturbation
F and G are very similar: F = 1 − exp(−W), G = 1 − exp(−2W).
Intuition: be more tolerant of perturbations along the large singular directions. Traditional theorems (Davis-Kahan, Wedin) do not differentiate these cases; we need a new relative matrix perturbation theorem.
[Slide figure contrasting a large perturbation with a small perturbation.]

Relative Matrix Perturbation
Main Lemma: recovery error ≲ ρ · max_x [x^⊤ G G^⊤ x / x^⊤ (F F^⊤ + σ_m(F)² I) x] =: ρτ.
On QMR-DT, τ ≤ 6; τ is provably a small constant for random sparse graphs. This suffices to get a good approximation of the span of F. The lemma needs to be generalized to asymmetric matrices/tensors (done in the paper).
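The quantity τ in the main lemma is the largest generalized eigenvalue of the pair (GG^⊤, FF^⊤ + σ_m(F)²I), so on synthetic instances where F and G are known it can be computed directly. A sketch, with my own function name:

```python
import numpy as np

def relative_tau(F, G):
    """tau = max_x  (x^T G G^T x) / (x^T (F F^T + sigma_m(F)^2 I) x)."""
    n = F.shape[0]
    sigma_m = np.linalg.svd(F, compute_uv=False).min()   # m-th (smallest) singular value
    A = G @ G.T
    B = F @ F.T + sigma_m**2 * np.eye(n)                 # positive definite
    # Largest generalized eigenvalue of (A, B) via B^{-1/2} A B^{-1/2}.
    w, V = np.linalg.eigh(B)
    B_inv_half = (V / np.sqrt(w)) @ V.T
    return float(np.linalg.eigvalsh(B_inv_half @ A @ B_inv_half).max())
```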

Quick Summary
PMI can approximately linearize a log-linear model. Better matrix/tensor perturbation results can handle the systematic error.
Challenge: the PMI tensor requires many samples. Next: use the structure of the disease/symptom graph to get a faster algorithm! ("New practical algorithms for learning Noisy-OR networks via symmetric NMF")

Anchor Words and Anchor Symptoms
In the matrix F, rows = words = symptoms and columns = topics = diseases. An anchor row has only one nonzero entry; a non-anchor row has more than one. An anchor symptom is a symptom that appears in only one disease.
[Arora-Ge-Moitra '12, AGH+ '12]: efficient algorithms to learn topic models with anchor words!
Difficulty: for QMR-DT, not all diseases have anchor symptoms.

Layered Structure
Only a subset of the diseases have anchor symptoms. Idea: learn these diseases first, and then remove them.

Layered Structure
Now all remaining diseases have anchor symptoms. We can repeat the procedure T times; T = 7 suffices for QMR-DT.

Layered Structure
Sequential 2-anchor condition: as long as not all diseases have been recovered, some remaining disease has at least 2 anchor symptoms.
Comparison: [Halpern-Sontag'13] requires a known graph structure. [Jernite-Halpern-Sontag'13]: the graph needs to be quartet-learnable; sample complexity depends exponentially on T.

From Noisy-OR to Symmetric NMF
Recall the PMI matrix: equivalently, PMI ≈ ρ F F^⊤ (one needs to be careful with the higher-order terms). Focus on the exact case: PMI = ρ F F^⊤, a symmetric nonnegative matrix factorization!

Symmetric NMF with the Sequential 2-Anchor Condition
High-level algorithm (a code sketch of the whole loop follows the peeling-off slide below):
REPEAT
  Find all anchor symptoms.
  Learn the diseases with at least two anchors.
  Remove these diseases from the graph.
UNTIL all diseases are learned.

Finding Anchor Symptoms
[Slide figure: PMI = ρ F F^⊤]
Observation 1: if two anchor symptoms correspond to the same disease, their rows in the PMI matrix are duplicates (proportional to each other).
Observation 2: when we try to subtract this component, no entry should become negative (the factorization is symmetric and nonnegative).

Learning the Diseases and Peeling Off
Let i, j be anchor symptoms of disease p. By symmetry, we only need to learn a scaling: F_p = λ·PMI_j (the j-th row of the PMI matrix), and PMI_{i,j} = ρ F_p(i) F_p(j), so the scaling λ can be learned from PMI_{i,j}! Remove disease p by subtracting ρ F_p F_p^⊤ from the PMI matrix.
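Putting the last three slides together, here is a self-contained sketch of the peeling algorithm in the idealized exact-NMF setting (PMI = ρFF^⊤ exactly, ρ known, no diagonal issues). All names, tolerances, and the brute-force anchor-pair search are illustrative choices, not the paper's implementation.

```python
import numpy as np

def peel_noisy_or(PMI, rho, tol=1e-8, max_rounds=10):
    """Peeling sketch for the exact symmetric NMF setting: PMI = rho * F F^T, F >= 0.

    Repeat: find a pair of anchor symptoms (two proportional rows of the
    residual), recover the corresponding column F_p with the scaling fixed by
    PMI[i, j] = rho * F_p[i] * F_p[j], check nonnegativity, subtract
    rho * F_p F_p^T, and continue until the residual vanishes.
    """
    R = np.array(PMI, dtype=float)
    n = R.shape[0]
    columns = []
    for _ in range(max_rounds):
        if np.abs(R).max() < tol:
            break                                   # everything is explained
        progress = False
        for i in range(n):
            for j in range(i + 1, n):
                ri, rj = R[i], R[j]
                if np.abs(ri).max() < tol or np.abs(rj).max() < tol:
                    continue
                # Anchor test: rows of the residual belonging to two anchors
                # of the same disease are proportional to each other.
                mask = np.abs(ri) > tol
                ratios = rj[mask] / ri[mask]
                if ratios.max() - ratios.min() > tol or ratios[0] <= 0:
                    continue
                # Learn the column: F_p = R[j] / (rho * F_p[j]), where
                # rho * F_p[j] = sqrt(rho * R[i, j] * ratios[0]).
                denom = rho * R[i, j] * ratios[0]
                if denom <= 0:
                    continue
                Fp = rj / np.sqrt(denom)
                peeled = R - rho * np.outer(Fp, Fp)
                if peeled.min() < -tol:             # would break nonnegativity
                    continue
                R, progress = peeled, True
                columns.append(Fp)
        if not progress:
            break
    return np.array(columns).T                      # one recovered column per disease
```

In the noisy setting the exact proportionality and nonnegativity tests would have to be replaced by robust versions; this sketch only shows the structure of the loop.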

Synthetic Experiments
Runs within 45 minutes (vanilla Matlab implementation). With 100M samples, the algorithm can find the correct support for the 1st layer, and the columns have relative error ≈ 0.01. It can identify 70% of the diseases in the 2nd layer, but fails on the 3rd layer because the noise is too large.

Open Problems
More practical algorithms for learning Noisy-OR networks (especially improving the sample complexity).
A better generative model for QMR-DT (why does it have a layered structure?).
Learning more nonlinear models (RBMs, deep belief networks, etc.).
Thank You!

Additional Difficulties
We do not have access to the diagonal entries. Solution: partition the symptoms into 3 parts and use an asymmetric tensor decomposition.
Traditional tensor decomposition algorithms are not robust enough. Solution: use a Sum-of-Squares approach.
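A small sketch of the first fix: split the symptoms into three disjoint groups and work only with cross-group blocks of the PMI matrix, which never touch the missing diagonal; the asymmetric decomposition itself is not shown, and the function name and random split are my own choices.

```python
import numpy as np

def partition_blocks(PMI, seed=None):
    """Split symptoms into three disjoint groups and return the off-diagonal
    blocks of the PMI matrix; these avoid the (unavailable) diagonal entries
    and can feed an asymmetric decomposition."""
    rng = np.random.default_rng(seed)
    n = PMI.shape[0]
    A, B, C = np.array_split(rng.permutation(n), 3)
    return PMI[np.ix_(A, B)], PMI[np.ix_(B, C)], PMI[np.ix_(A, C)], (A, B, C)
```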