Qual Presentation
Daniel Khashabi

Outline
- My own line of research
- Papers:
  - Fast Dropout Training, ICML, 2013
  - Distributional Semantics Beyond Words: Supervised Learning of Analogy and Paraphrase, TACL, 2013

Motivation
- Developing tools for word similarity
- Useful in many applications
- For example, paraphrase detection:
  - "The Iraqi foreign minister warned of disastrous consequences if Turkey launched an invasion." / "Iraq has warned that a Turkish incursion would have disastrous results." (paraphrases)
  - "I can be there on time." / "I can't be there on time." (not paraphrases)

Motivation (2)
- Developing tools for word similarity
- We need to solve an easier problem first
- For example, SAT analogy tests
- Very important for understanding hierarchies of word semantics

Motivation (3)
- Developing tools for word similarity
- We need to solve an easier problem first
- Compositional behavior of word semantics
  - For example: "Search Engine" as a single compound vs. the separate words "Search" and "Engine"

Motivation (4)
- Developing tools for word similarity
- We need to solve an easier problem first
- Compositional behavior of word semantics
- For example: understanding noun-modifier questions

Designing semantic features
- Feature engineering
  - An important step in semantic modeling of words
  - The rest is just learning the task in a fully supervised fashion
- Types of features:
  - Log-frequency
  - PPMI (Positive Pointwise Mutual Information)
  - Semantic similarity
  - Functional similarity

Feature generation
- Features:
  - Log-frequency
  - PPMI (Positive Pointwise Mutual Information)
  - Semantic similarity
  - Functional similarity
- All features are generated from a large collection of documents
- Word-context definition: three word-context matrices
  - Rows correspond to words/phrases in WordNet
  - Columns correspond to contexts (words to the left and right of the word/phrase)

Pointwise Mutual Information
- PMI (Pointwise Mutual Information)
- PPMI (Positive PMI)
- One useful definition for the probabilities: the fraction of times a context appears with a word
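For reference, the standard definitions (my notation; probabilities estimated from co-occurrence counts f_{wc} with total count N):

    \mathrm{pmi}(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}, \qquad
    \mathrm{ppmi}(w, c) = \max\bigl(0,\; \mathrm{pmi}(w, c)\bigr),

    p(w, c) = \frac{f_{wc}}{N}, \qquad
    p(w) = \frac{\sum_{c} f_{wc}}{N}, \qquad
    p(c) = \frac{\sum_{w} f_{wc}}{N}.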

Pointwise Mutual Information (2)
- Only the words or phrases that exist in WordNet
- And appear with frequency more than 100 in the corpus
- Find the words to the left and right of the word (its context) in phrases
- Create the word-context frequency matrix F:
  - F_ij is the number of times word i appears in context j
[Table: the forty paradigm words]

Pointwise Mutual Information (3)
- Create the word-context frequency matrix F
- Create the PPMI matrix X from F
[Diagrams: F (words by contexts) and X (words by contexts)]
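A minimal numpy sketch of this step, with toy counts (the smoothing and bookkeeping of the actual system are omitted):

    import numpy as np

    # Toy word-context count matrix F: rows = words, columns = contexts.
    F = np.array([[10., 0., 2.],
                  [ 3., 4., 0.],
                  [ 0., 1., 8.]])

    total = F.sum()
    p_wc = F / total                        # joint probability estimates p(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal p(w)
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal p(c)

    with np.errstate(divide="ignore"):      # log(0) -> -inf, clipped below
        pmi = np.log(p_wc / (p_w * p_c))

    X = np.maximum(pmi, 0.0)                # PPMI matrix: negative/undefined PMI -> 0
    print(X)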

Pointwise Mutual Information (4)
- Given the PPMI matrix X:
  - the word corresponds to the i-th row
  - the context corresponds to the j-th column
- For an n-tuple of words, one can generate PPMI features from the corresponding rows of X
[Diagram: X (words by contexts)]

A vector space for domain similarity
- Designed to capture the topic of a word
- Construct a frequency matrix D:
  - Rows correspond to words in WordNet
  - Columns correspond to nearby nouns
- Given a term, search the corpus for it
  - Choose the noun closest to the right/left of the term
  - And increment the corresponding entry of D by one
[Diagram: D (words by nearby nouns)]

A vector space for functional similarity
- Exactly the same as the domain similarity measure
- Except that it is built using the verbal context
- Construct a frequency matrix S:
  - Rows correspond to terms in WordNet
  - Columns correspond to nearby verbs
- Given a term, search the corpus for it
  - Choose the verb closest to the right/left of the term
  - And increment the corresponding entry of S by one
[Diagram: S (terms by nearby verbs)]
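A toy sketch of this counting step. The corpus format, tag set, and term list below are my own stand-ins, not the paper's pipeline; the domain-similarity matrix D is built the same way with nearby nouns instead of verbs.

    from collections import defaultdict

    # Hypothetical POS-tagged corpus: each sentence is a list of (token, tag) pairs.
    tagged_corpus = [
        [("the", "DT"), ("dog", "NN"), ("chased", "VB"), ("the", "DT"), ("cat", "NN")],
        [("a", "DT"), ("cat", "NN"), ("sleeps", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")],
    ]
    terms = {"dog", "cat", "mat"}      # stand-in for the WordNet term list

    counts = defaultdict(int)          # (term, nearby_verb) -> count

    for sentence in tagged_corpus:
        for i, (tok, _tag) in enumerate(sentence):
            if tok not in terms:
                continue
            # closest verb to the left and to the right of the term
            left = next(((t, p) for t, p in reversed(sentence[:i]) if p.startswith("VB")), None)
            right = next(((t, p) for t, p in sentence[i + 1:] if p.startswith("VB")), None)
            for hit in (left, right):
                if hit is not None:
                    counts[(tok, hit[0])] += 1

    print(dict(counts))  # e.g. {('dog', 'chased'): 1, ('cat', 'chased'): 1, ('cat', 'sleeps'): 1, ('mat', 'sleeps'): 1}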

Using semantic and functional similarity
- Given the frequency matrix:
  - Keep a lower-dimensional representation
  - Keep the values corresponding to the k biggest eigenvalues
- Given a word, take the corresponding row vector
  - A parameter is used to tune sensitivity with respect to the eigenvalues
- Given two words whose similarity we want to find:
  - Find the corresponding vectors
  - Compute the cosine between the vectors
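A minimal numpy sketch of this step, assuming a truncated SVD of the frequency matrix (the singular values play the role of the eigenvalues mentioned above) and an exponent p as the sensitivity parameter; the symbol p and the toy matrix are my own.

    import numpy as np

    def low_rank_vectors(M, k, p=1.0):
        """Row vectors of M projected onto the top-k singular directions.

        p rescales the singular values (p = 1 keeps them as-is, p = 0 whitens).
        """
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :k] * (s[:k] ** p)      # one k-dimensional vector per row of M

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy frequency matrix: rows = terms, columns = nearby verbs (or nouns).
    S = np.array([[5., 1., 0., 2.],
                  [4., 0., 1., 3.],
                  [0., 6., 5., 0.]])

    vecs = low_rank_vectors(S, k=2, p=0.5)
    print(cosine(vecs[0], vecs[1]))   # similarity between term 0 and term 1
    print(cosine(vecs[0], vecs[2]))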

5-choice SAT tests
- 374 five-choice SAT questions
- Each question can be converted into five 4-tuples
- Each positive 4-tuple could be converted to:
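Concretely, the conversion of a question into five 4-tuples (my notation, which may differ from the slide's): a question with stem pair a : b and choices c_1 : d_1, ..., c_5 : d_5 becomes the tuples

    \langle a, b, c_i, d_i \rangle, \qquad i = 1, \dots, 5,

where the tuple built from the answer-key choice is labeled positive and the other four are labeled negative.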

Results on 5-choice SAT
- The top ten results on the SAT analogy questions (table)
- SuperSim answers the SAT questions in a few minutes
- LRA requires nine days
- SuperSim and LRA are not significantly different according to Fisher's exact test at the 95% confidence level

SAT with 10 choices
- Adding more negative instances
- In general, if a tuple is a positive analogy, a permuted version of it is negative
- For example (one such permutation):
  - Positive: a : b :: c : d
  - Negative: a : b :: d : c
- This generates 5 more negative instances

SemEval-2012 Task 2
- Class and subclass labels + examples
- Gathered using Mechanical Turk:
  - 75 subcategories
  - An average of 41 word pairs per subcategory
- Example entry: CLASS-INCLUSION, Taxonomic, 50.0: "weapon:spear" "vegetable:carrot" "mammal:porpoise" "pen:ballpoint" "wheat:bread"

SemEval-2012 Task 2
- SuperSim is trained on the 5-choice SAT questions and tested on the SemEval data
- It gives the best correlation coefficient

Compositional similarity: the data
- Noun-modifier questions based on WordNet
- Create tuples of the form:
- Example:
- Any question gives one positive instance and six negative instances

Compositional similarity: 7-choice questions
- A total of 2,180 questions:
  - 680 questions for training
  - 1,500 questions for testing
- Any question gives one positive instance and six negative instances
- Train a classifier on the tuples to obtain probabilities, then rank the choices
- The holistic approach is noncompositional:
  - The stem bigram is represented by a single context vector
  - As if it were a unigram
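A minimal sketch of the "train a classifier, rank choices by probability" step, using scikit-learn and random stand-in features (the real features are the PPMI, domain, and functional ones described above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Stand-in training data: one feature vector per (question, choice) tuple,
    # label 1 for the correct choice, 0 for the distractors.
    X_train = rng.normal(size=(680 * 7, 50))
    y_train = np.tile([1, 0, 0, 0, 0, 0, 0], 680)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # At test time, score the 7 choices of one question and pick the most probable.
    X_question = rng.normal(size=(7, 50))
    scores = clf.predict_proba(X_question)[:, 1]   # P(correct) for each choice
    print(int(np.argmax(scores)))                  # index of the predicted answer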

Compositional similarity: 14-choice questions
- Any question gives one positive instance and six negative instances
- Positive instance:
- Negative instance, e.g.: word, fantasy, wonderland
- This gives 7 more negative instances (14 choices)

Compositional similarity: ablation experiment
- Analyzing the effect of each feature type on the 14-choice test
- PPMI features are the most important

Compositional similarity: a closer look at PPMI
- The PPMI features fall into three subsets
- For example:
- One of the subsets is more important than the others

Compositional similarity: holistic training
- Holistic training data:
  - Extract noun-modifier pairs from WordNet
  - Call each pair a pseudo-unigram and treat it as a unigram
  - Use the component words as distractors

Compositional similarity: holistic training (2)
- Training on the holistic questions ("Holistic")
- Compared with the standard training
- The test set is the standard test set in both cases
- There is a drop when training with the holistic samples
- The reason is not entirely clear, but it seems to be due to the nature of the holistic training data

References
- Some figures from:
  - Hinton, Geoffrey E., et al. "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint, 2012.

Past research
- My own line of research
- At the intersection of machine learning, NLP, and vision
- Past:
  - Multi-agent systems (RoboCup, 2009)
  - Object tracking (2009; internship)
  - Neural networks (SMC, 2011)
  - Recursive neural networks for NLP (2013)
  - Gaussian processes (2013)
  - Regression tree fields for joint denoising and demosaicing (2013; internship)

Current Line of Research
- Many models in natural language processing use hard constraints
- Example: dependency parsing

Current Line of Research
- Many models in natural language processing use hard constraints
- Example: sentence alignment (Burkett & Klein, 2012)

Current Line of Research
- Many models in natural language processing use hard constraints
- Example: sentence alignment (Burkett & Klein, 2012)
- Hard constraints are a big challenge, since they create large combinatorial problems
- Instead, we can separate the inference over soft constraints from the inference over hard constraints
  - Soft inference: Expectation Propagation
  - Hard inference: projection with respect to the hard constraints
[Diagram: complicated distribution -> (EP projection) -> known distribution -> (constrained projection) -> constrained distribution]

Current Line of Research
- Expectation Propagation:
  - A method to perform approximate Bayesian inference in graphical models
  - Works by projecting expectations (EP soft-projection)

Current Line of Research
- Expectation Propagation:
  - A method to perform approximate Bayesian inference in graphical models
  - Works by projecting expectations
- EP projection:
- Constrained projection:
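A common way to write these two projections (my notation; not necessarily the exact formulas from the slides): for an intractable distribution p and a tractable family Q,

    q_{\mathrm{EP}} = \arg\min_{q \in Q} \; \mathrm{KL}\bigl(p \,\|\, q\bigr),

which, when Q is an exponential family with sufficient statistics \phi, reduces to moment matching, \mathbb{E}_{q_{\mathrm{EP}}}[\phi(x)] = \mathbb{E}_{p}[\phi(x)]. The constrained projection additionally restricts the minimization to the set C of distributions consistent with the hard constraints:

    q_{\mathrm{con}} = \arg\min_{q \in Q \cap C} \; \mathrm{KL}\bigl(p \,\|\, q\bigr).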

SVM
- Primal form:
- Relaxed form:
- Dual form:
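For reference, the standard textbook forms of the linear SVM (the slide's notation may differ):

    \text{Primal:}\quad \min_{w, b} \; \tfrac{1}{2}\lVert w \rVert^2
    \quad \text{s.t.}\;\; y_i\,(w^\top x_i + b) \ge 1 \;\; \forall i

    \text{Relaxed (soft margin):}\quad \min_{w, b, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C\sum_i \xi_i
    \quad \text{s.t.}\;\; y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0

    \text{Dual:}\quad \max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
    \quad \text{s.t.}\;\; 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i y_i = 0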

Proof:

Proof (continued):

Proof: Hinge regression closed form
- Suppose we want to minimize:

Proof: Hinge regression closed form
- Proof:

Proof: Hinge regression closed form
- Proof (continued):

Fisher's exact test?
- A statistical significance test
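As an illustration of the kind of comparison referred to on the results slide (the counts below are made up, not the paper's numbers): comparing two systems' correct/incorrect counts on a question set with SciPy's Fisher exact test.

    from scipy.stats import fisher_exact

    # Hypothetical 2x2 contingency table: rows = systems, columns = (correct, incorrect).
    table = [[205, 169],   # system A on 374 questions
             [210, 164]]   # system B on 374 questions

    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(p_value)   # a large p-value -> the difference is not statistically significant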