Paraphrase Generation Using Deep Learning Prasanna Vaidya Co-Founder, DiscoveryAI
Agenda What is Paraphrase Generation? Use Cases Building Blocks Technologies Publicly Available Datasets & Compute Power Evaluation Metrics Important Research Papers Questions & Answers
What is Paraphrase Generation? Paraphrasing, the act of expressing the same meaning in different ways, is an important subtask in many Natural Language Processing (NLP) applications. How old is your child? —> Age of your kid?
Why it is important & Use Cases Information Retrieval Conversational Systems Content Summarisation
Research Areas Recognition - Identify if two textual units are paraphrases of each other Extraction - Extract paraphrase instances from a thesaurus or a corpus Generation - Generate a reference paraphrase given a source text
Building Blocks
Word Embeddings Word embedding is a technique where words or phrases from the vocabulary are mapped to vectors of real numbers. King http://projector.tensorflow.org
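The idea above can be sketched with cosine similarity: related words point in similar directions in the vector space. The embeddings below are toy values made up for illustration, not vectors from a trained model.

```python
import numpy as np

# Toy 4-dimensional embeddings (illustrative values, not from a trained model).
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.2]),
    "queen": np.array([0.7, 0.7, 0.1, 0.3]),
    "apple": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words end up closer together in the vector space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high (~0.99)
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low  (~0.27)
```

Real embeddings (word2vec, GloVe) are learned from large corpora and typically have 100-300 dimensions; the TensorFlow projector linked above visualises exactly this kind of geometry.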
Neural Networks
Limitations of Neural Networks Neural Networks don’t have memory. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Enter Recurrent Neural Nets They are networks with loops in them, allowing information to persist.
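A minimal sketch of that loop: the same cell is applied at every time step, and the hidden state h carries information forward. Weights are random, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8

# Parameters of a single recurrent cell (random, for illustration only).
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One step of a vanilla RNN: the new state mixes the current
    input with the previous state -- this is how information persists."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll the loop over a 5-step input sequence.
h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (8,)
```

The final h summarises the whole sequence; in practice repeated multiplication by W_hh is what makes long-range gradients vanish or explode, motivating the LSTM below.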
Limitations of RNNs I grew up in Pune…I speak fluent Marathi. In theory, RNNs are absolutely capable of handling such “long-term dependencies.” Sadly, in practice, RNNs don’t seem to be able to learn them.
Long Short Term Memory LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is their default behavior.
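The mechanism behind that default behaviour can be sketched as a single LSTM step: gates decide what to forget, what to write, and what to expose, while the cell state c acts as long-term memory. Weights are random and biases are omitted; this is a simplified illustration, not a full implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 8

# One weight matrix per gate, acting on [h_prev; x_t] (random, for illustration).
W_f, W_i, W_o, W_c = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM step: gates control the flow in and out of the cell state."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)                    # forget gate: what to erase
    i = sigmoid(W_i @ z)                    # input gate: what new info to write
    o = sigmoid(W_o @ z)                    # output gate: what to expose as h
    c = f * c_prev + i * np.tanh(W_c @ z)   # cell state: the long-term memory
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Because c is updated additively (f * c_prev + new content) rather than squashed through a nonlinearity at every step, gradients can flow across many time steps.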
Similarity with Machine Translation The paraphrasing task can be modelled as a machine translation task. How are you? —> ¿cómo estás?
Encoder Decoder Model The encoder encodes the input sequence into an internal representation called the 'context vector', which the decoder uses to generate the output sequence. The lengths of the input and output sequences can be different.

import seq2seq
from seq2seq.models import SimpleSeq2Seq

model = SimpleSeq2Seq(input_dim=5, hidden_dim=10, output_length=8, output_dim=8, depth=3)
model.compile(loss='mse', optimizer='rmsprop')
Publicly Available Datasets http://paraphrase.org/#/download https://www.kaggle.com/quora/question-pairs-dataset https://www.microsoft.com/en-us/download/details.aspx?id=52398
Compute Requirements Training on PPDB lasted 32 hours on a p2.xlarge instance on AWS.
Evaluation Metrics BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. METEOR (Metric for Evaluation of Translation with Explicit Ordering) is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. https://github.com/jhclark/multeval
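The core of BLEU is clipped n-gram precision; the sketch below shows only the unigram case (full BLEU also combines higher-order n-grams and a brevity penalty, as in the multeval tool linked above). The example sentences are the paraphrase pair from earlier in the deck.

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision, the building block of BLEU: each
    candidate word counts only up to the number of times it appears
    in the reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return clipped / max(len(candidate), 1)

reference = "how old is your child".split()
candidate = "age of your kid".split()
print(modified_unigram_precision(candidate, reference))  # 0.25 (only "your" matches)
```

Note how this penalises a perfectly good paraphrase for sharing few surface words with the reference; this is why METEOR (which also matches stems and synonyms) is often reported alongside BLEU for paraphrase generation.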
Results - How are you?
how you doin ' , man
uh , how are you
how ya been
how ya feelin ' , kid
how the hell are you
Important Research Papers Neural Paraphrase Generation with Stacked Residual LSTM Networks https://arxiv.org/pdf/1610.03098.pdf Paraphrase Generation with Deep Reinforcement Learning https://arxiv.org/abs/1711.00279 A Deep Generative Framework for Paraphrase Generation https://www.cse.iitk.ac.in/users/piyush/papers/deep-paraphrase-aaai2018.pdf
Thank You! Questions? prasanna@discovery.ai @getprasannav