Fill-in-The-Blank Using Sum Product Network

Pan Pan Cheng, CS 898 Final Project, 2017

Outline
- Background
  - Probabilistic Graphical Model + Deep Architecture
  - Definition
  - Example
- Application: Fill-in-the-blank
  - Data source: Microsoft Research Sentence Completion Challenge (2011)
- Future Work

Probabilistic Graphical Models (PGMs)
- Examples: Bayesian networks, Markov networks (figure source: [Koller et al. 2009])
- In many NLP applications, the interest is to estimate the underlying distribution and infer the following quantities:
  - Joint probabilities: language modeling → infer how likely a certain sequence of words is in a language, e.g. Pr(W1, W2, W3, W4, W5) in a 5-gram model (see the sketch below)
  - Conditional probabilities: sentiment analysis → infer how likely a given statement is positive, e.g. Pr(Positive | Context)
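To make the joint-probability query concrete, here is a minimal sketch of chain-rule scoring in a toy bigram model; the `bigram` table and its probabilities are made up for illustration, and a real 5-gram model would condition on the previous four words rather than one.

```python
# Toy bigram model: Pr(w_i | w_{i-1}), with made-up probabilities.
bigram = {
    ("<s>", "the"): 0.4, ("the", "dog"): 0.2,
    ("dog", "barked"): 0.3, ("barked", "loudly"): 0.25,
}

def joint_probability(words):
    """Chain rule under a first-order Markov assumption:
    Pr(w1..wn) = prod_i Pr(w_i | w_{i-1})."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram.get((prev, w), 1e-6)   # tiny floor for unseen pairs
        prev = w
    return p

print(joint_probability(["the", "dog", "barked", "loudly"]))  # 0.4 * 0.2 * 0.3 * 0.25
```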

Deep Architecture
- Stack many layers: RNNs, CNNs, etc.
- Potentially much more powerful than shallow architectures [Bengio, 2009]
- But inference is even harder

Sum-Product Networks (SPNs)
- Proposed by Poon & Domingos (2011)
- A Sum-Product Network S over RVs X is a rooted, weighted DAG consisting of distribution leaves (network inputs) and sum and product nodes (inner nodes)
- A leaf n defines a tractable, possibly unnormalized, distribution over some RVs in X
- A nonnegative weight w is associated with each edge linking a sum node n to its child nodes
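A minimal sketch of this definition over two binary RVs X1 and X2; the structure and weights below are made up for illustration, the leaves are indicator distributions, and setting both indicators of a variable to 1 marginalizes it out.

```python
def leaves(x1=None, x2=None):
    """Indicator leaves; None means the variable is unobserved (marginalized out)."""
    ind = lambda val, target: 1.0 if val is None or val == target else 0.0
    return {"x1": ind(x1, 1), "not_x1": ind(x1, 0),
            "x2": ind(x2, 1), "not_x2": ind(x2, 0)}

def spn(l):
    """Root sum node over two product nodes; weights are nonnegative and sum to 1."""
    p1 = l["x1"] * l["x2"]          # product node over {X1, X2}
    p2 = l["not_x1"] * l["not_x2"]  # product node over {X1, X2}
    return 0.7 * p1 + 0.3 * p2      # sum node (the root)

print(spn(leaves(x1=1, x2=1)))  # joint Pr(X1=1, X2=1) = 0.7
print(spn(leaves(x1=1)))        # marginal Pr(X1=1) = 0.7, same single bottom-up pass
```

The second call is why inference is tractable: marginals come from the same linear-time bottom-up evaluation, which is the point of the SPN-vs-PGM comparison on the next slide.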

SPNs vs PGMs
- SPNs are orders of magnitude faster
- The cost of exact inference in PGMs can be exponential in the worst case → approximate techniques are needed
- Learning: SPN 2-3 hours vs. PGM (DBN) days
- Inference: SPN < 1 second vs. PGM (DBN) minutes or hours

SPNs – NNs
- In a classic MLP, a hidden layer computes weighted sums of its inputs; an SPN can be viewed as a DAG of MLP-like layers built from two node types:
  - SUM node: weighted sum of its children
  - PRODUCT node: product of its children
(Figure source [Di Mauro et al. 2016])
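The same toy SPN from the earlier sketch, written in this layer style as a minimal sketch: the product layer has no parameters, and the sum layer behaves like a linear layer with nonnegative weights and no bias. The masks and weights are the illustrative ones used above.

```python
import numpy as np

# Leaf indicators [x1, not_x1, x2, not_x2] for the evidence X1=1, X2=1.
leaf_values = np.array([1.0, 0.0, 1.0, 0.0])

# Each row selects the leaves feeding one product node.
product_mask = np.array([[1, 0, 1, 0],
                         [0, 1, 0, 1]], dtype=bool)
products = np.array([leaf_values[row].prod() for row in product_mask])

sum_weights = np.array([0.7, 0.3])   # nonnegative "linear layer" weights, no bias
print(sum_weights @ products)        # 0.7, matching the node-by-node evaluation
```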

SPNs – Example Figure source [Poon et al. 2011]

SPNs – Example Figure source [Di Mauro et al. 2016]

SPNs – Learning
- Parameter learning → estimate the weights w from data, treating the SPN as a latent RV model or as a NN
- Structure learning → build the network from data by assigning scores to tentative structures or by exploiting constraints

SPNs – Learning
- Handcrafted structure → parameter learning [Poon and Domingos 2011; Gens and Domingos 2012] (see the sketch below)
- Random structures → parameter learning [Rashwan et al. 2016]
- Structure learning → parameter learning (fine-tuning) [Zhao et al. 2016]
- Learn both weights and structure at the same time [Gens and Domingos 2013]
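A minimal sketch of the first regime (hand-crafted structure, then parameter learning), reusing the two-component toy SPN from the earlier sketch. The data counts are made up, and the sum weights are parameterized by a softmax so that gradient ascent on the log-likelihood keeps them nonnegative and normalized.

```python
import math

# Fully observed toy data over (X1, X2); only (1,1) and (0,0) have support
# under the hand-crafted structure above.
data = [(1, 1)] * 70 + [(0, 0)] * 30

def softmax(t):
    m = max(t)
    e = [math.exp(v - m) for v in t]
    return [v / sum(e) for v in e]

theta, lr = [0.0, 0.0], 0.1   # unconstrained parameters behind the two sum weights
for _ in range(200):
    w = softmax(theta)
    # For S(x) = w[0]*[x=(1,1)] + w[1]*[x=(0,0)],
    # d log S(x) / d theta_k = [child k matches x] - w[k].
    grad = [0.0, 0.0]
    for x in data:
        match = 0 if x == (1, 1) else 1
        for k in range(2):
            grad[k] += ((1.0 if k == match else 0.0) - w[k]) / len(data)
    theta = [theta[k] + lr * grad[k] for k in range(2)]

print(softmax(theta))  # approaches [0.7, 0.3], the empirical frequencies
```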

Experiment – Fill-in-the-blank
- Related to: speech understanding, machine comprehension tasks
- Data source: Microsoft Research Sentence Completion Challenge (2011)
- Data summary:
  - Training set: texts selected from 522 19th-century novels (~41.5 million words, 227 MB)
  - Test set: 1040 sentences selected from Sherlock Holmes novels, where 1 word is missing and 5 choices are given for the missing word

Experiment – cont'
- The training texts are completely unannotated and untokenized;
- They do not have the same format as the test set.

Experiment – cont’ A single word in the original sentence has been replaced by an impostor word with similar occurrence statistics.
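A minimal sketch of one plausible way to score such a test item: fill the blank with each candidate, take the n-gram window around it, and keep the candidate the model likes best. `log_likelihood` is a hypothetical function mapping an n-token window to a log-probability (e.g. an SPN over embedded 5-grams), not part of the original slides.

```python
def best_candidate(tokens, blank_index, candidates, log_likelihood, n=5):
    """Rank the 5 candidate words by the model score of the filled-in window."""
    lo = max(0, blank_index - (n - 1) // 2)   # center the window on the blank
    window = tokens[lo:lo + n]
    pos = blank_index - lo
    best, best_score = None, float("-inf")
    for word in candidates:
        filled = list(window)
        filled[pos] = word                     # substitute the candidate word
        score = log_likelihood(filled)
        if score > best_score:
            best, best_score = word, score
    return best
```

Since the correct answer is the word from the original sentence, accuracy is simply the fraction of test items for which the top-scoring candidate is that word.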

Experiment – cont'
- Learning via random structure was used
- 2 different pre-trained word embedding libraries:
  - Google word2vec model: 300d/word
  - GloVe (Global Vectors for Word Representation): 50d & 100d/word
- Extract sentences → extract all possible 5-gram & 10-gram tokens for training the corresponding language models using SPNs (see the sketch below)
- For the 300d embedding vectors, the total size is ~500 GB
- Trained on a personal desktop: 16 GB DDR3 (1333 MHz), i7-2600, GTX 1060 (6 GB)
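A minimal sketch of this preprocessing step, assuming GloVe vectors in their standard text format (one word plus its values per line) and an already lower-cased, tokenized corpus; the file name `glove.6B.50d.txt` and the `corpus_tokens` list are illustrative.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file: each line is 'word v1 v2 ... vD'."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def ngram_samples(tokens, vectors, n=5):
    """Slide an n-token window over the corpus and concatenate the n word
    embeddings into one training vector (n * d values per sample)."""
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(w in vectors for w in window):   # skip windows with OOV words
            yield np.concatenate([vectors[w] for w in window])

# Illustrative usage with 50d GloVe vectors: each sample has 5 * 50 = 250 values.
# vectors = load_glove("glove.6B.50d.txt")
# samples = list(ngram_samples(corpus_tokens, vectors, n=5))
```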

Experiment – cont'
Training summary (N-gram / embedding size / training file size / training time):
- 5-gram, 300d: 29.3 GB (at most 5k samples/file); ~3 hrs for 1 epoch, ~14 hrs for 25 epochs
- 5-gram, 50d: 69.1 GB; ~5 hrs for 1 epoch, ~45 hrs for 25 epochs
- 10-gram, 50d: 66 GB (at most 50k samples/file); ~4 hrs for 1 epoch, ~7 hrs for 25 epochs
- 10-gram, 100d: 83 GB (at most 25k samples/file); ~8 hrs for 1 epoch

Experiment – cont'
- Human accuracy: ~90%
- State-of-the-art models: 50-60% (ensembling), 40-50% (single model)
- Accuracy of SPNs: ~30% (similar to N-gram & Word2Vec)
(Figure source [Woods et al. 2016])

Future Work
- Can we incorporate the recurrent idea into SPNs? (Mazen: Recurrent Sum-Product Networks for Sequence Data Modeling)
- How to do structure learning in this case? (Abdullah: Online structure learning for sum-product networks with Gaussian leaves)
- How to train with discriminative learning, i.e. to learn the conditional probability Pr(Word | Context)?
- How to build a scalable library to train SPNs?
- Code: https://github.com/ppcheng/cs898-spn-experiments and https://github.com/KalraA/Tachyon

Thank You !!

SPNs – Conditions
- Product nodes (consistency): no variable appears in one child while its negation appears in another
- Sum nodes (completeness): all children cover the same set of variables
- Valid SPN → correctly represents a probability distribution
(Figure source [Poon et al. 2011])
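A minimal sketch of checking these conditions on a small dict-based SPN; the representation is illustrative, not from the slides. For product nodes it checks decomposability (disjoint child scopes), a stricter condition that implies the consistency stated above.

```python
def scope(node):
    """Set of RV names a node depends on; leaves name their variable explicitly."""
    if node["type"] == "leaf":
        return {node["var"]}
    return set().union(*(scope(c) for c in node["children"]))

def is_valid(node):
    """Completeness at sum nodes, decomposability at product nodes, recursively."""
    if node["type"] == "leaf":
        return True
    scopes = [scope(c) for c in node["children"]]
    if node["type"] == "sum":
        ok = all(s == scopes[0] for s in scopes)                        # complete
    else:
        ok = sum(len(s) for s in scopes) == len(set().union(*scopes))   # decomposable
    return ok and all(is_valid(c) for c in node["children"])

# The toy SPN from the earlier sketches: both indicators of a variable share its scope.
leaf = lambda v: {"type": "leaf", "var": v}
toy = {"type": "sum", "children": [
    {"type": "product", "children": [leaf("X1"), leaf("X2")]},   # x1 * x2
    {"type": "product", "children": [leaf("X1"), leaf("X2")]},   # not_x1 * not_x2
]}
print(is_valid(toy))  # True: complete sum node, decomposable product nodes
```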