That's What She Said: Double Entendre Identification Kiddon & Brun 2011.


That's What She Said: Double Entendre Identification Kiddon & Brun 2011

Introduction
- Double entendre: an expression that can be understood in two ways, one innocuous and one risqué
- Double entendre identification has not been previously researched
- A very difficult problem: most cases require deep semantic and cultural understanding

Introduction
- "That's what she said" (TWSS) jokes: a subset of double entendres
- Repopularized by the TV show "The Office"; now an Internet meme
- Example: late-evening basketball practice: "I was trying all night, but I just could not get it in!"

Introduction
TWSS as a metaphor identification problem:
- Metaphor involves an analogical mapping between domains: terminology of a source domain is used to describe situations in a target domain
- Terms are literal in the source domain and nonliteral in the target domain

Introduction
Other research in computational metaphor identification:
- Learning selectional preferences of words in multiple domains to identify nonliteral usage
- SVMs trained on labeled data to distinguish metaphoric from literal language

Method
Applies methods from metaphor identification:
- Mappings between two domains
  - Innocuous source and erotic target
- Selectional preferences
  - Identify adjectival selectional preferences of sexually explicit nouns relative to other nouns
  - Examine the relationship between structures in the erotic domain and nonerotic contexts
- Goal for this domain is high precision (correctly identified TWSSs)
  - Low recall tolerated (better to miss an opportunity than to make a socially awkward mistake)

Method
DEviaNT: Double Entendre via Noun Transfer
- An SVM model with features that capture two TWSS characteristics:
  - TWSSs are likely to contain nouns that are euphemisms for sexually explicit nouns
  - TWSSs share common structure with sentences in the erotic domain

Method
Word classes created for the algorithm:
- SN: a set of sexually explicit nouns
  - 76 manually selected nouns predominantly used in sexual contexts
  - Grouped into 9 categories based on which sexual object, body part, or participant they identify
- SN⁻ ⊂ SN: the nouns that are likely targets of euphemism; |SN⁻| = 61
- BP: the set of body-part nouns; the approximation contains 98 body parts

Method
Corpora for comparison:
- Source domain: erotica corpus
  - Textfiles.com/sex/EROTICA
  - 1.5 million sentences
  - Unparsable text removed
  - Parsed with the Stanford Parser
- Target domain: the Brown Corpus
  - Already tagged!

Method
Corpora modified to be more generic:
- All numbers replaced with the CD tag
- Proper nouns replaced with the NNP tag
- Nouns that are elements of SN tagged as SN
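To make the preprocessing concrete, here is a minimal sketch (not the authors' code) of this genericization step, assuming sentences arrive as (token, POS-tag) pairs from a tagger or parser; the two-noun SN set is an illustrative stand-in:

```python
# Minimal sketch of the corpus genericization described above.
SN = {"rod", "meat"}  # illustrative subset of the paper's 76 sexually explicit nouns

def genericize(tagged_sentence):
    """Replace numbers with CD, proper nouns with NNP, and SN nouns with an SN tag."""
    out = []
    for token, tag in tagged_sentence:
        if tag == "CD":                       # any number -> generic CD token
            out.append(("CD", "CD"))
        elif tag in ("NNP", "NNPS"):          # any proper noun -> generic NNP
            out.append(("NNP", "NNP"))
        elif token.lower() in SN:             # sexually explicit noun -> SN tag
            out.append((token.lower(), "SN"))
        else:
            out.append((token.lower(), tag))
    return out

print(genericize([("Bob", "NNP"), ("ate", "VBD"), ("3", "CD"), ("hotdogs", "NNS")]))
# [('NNP', 'NNP'), ('ate', 'VBD'), ('CD', 'CD'), ('hotdogs', 'NNS')]
```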

Method
NS(n): noun sexiness function
- For each noun, an adjective count vector contains the frequency of each adjective modifying that noun in the union of the erotica and Brown corpora
- NS(n) = the maximum cosine similarity, over each noun in SN⁻, of the tf-idf weighted adjective count vectors
- Nouns that occurred fewer than 200 times, occurred fewer than 50 times with adjectives, or were associated with 3x as many adjectives that never occurred with nouns in SN were assigned the minimum sexiness value (SO not sexy!)
- Example nouns with high NS: "rod" and "meat"
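A minimal sketch of NS(n) under this definition, assuming adjective count vectors (noun → {adjective: count}) have already been collected from the pooled corpora; `doc_freq` and all other names are illustrative, not from the paper:

```python
import math

def tfidf(counts, doc_freq, n_nouns):
    """tf-idf weight one noun's adjective count vector.
    doc_freq[a] = number of distinct nouns that adjective a modifies."""
    return {a: c * math.log(n_nouns / doc_freq[a]) for a, c in counts.items()}

def cosine(u, v):
    dot = sum(u[a] * v[a] for a in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def noun_sexiness(noun, adj_vectors, sn_minus, doc_freq, n_nouns):
    """NS(n): max cosine similarity to any euphemism target in SN-."""
    u = tfidf(adj_vectors[noun], doc_freq, n_nouns)
    return max((cosine(u, tfidf(adj_vectors[t], doc_freq, n_nouns))
                for t in sn_minus if t in adj_vectors), default=0.0)
```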

Method
AS(a): adjective sexiness function
- Measures how likely an adjective a is to modify a noun in SN
- Computed as the relative frequency of a in sentences with at least one noun in SN
- Example adjectives with high AS: "hot" and "wet"
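AS(a) is simpler; a sketch under the same tagged-sentence representation (names illustrative):

```python
def adjective_sexiness(adj, sentences, SN):
    """AS(a): fraction of adjective a's occurrences that fall in
    sentences containing at least one noun from SN."""
    total = in_sexy_sentence = 0
    for sent in sentences:                     # sent = [(token, tag), ...]
        has_sn = any(tok.lower() in SN for tok, tag in sent
                     if tag.startswith("NN"))
        for tok, tag in sent:
            if tag.startswith("JJ") and tok.lower() == adj:
                total += 1
                in_sexy_sentence += has_sn     # bool counts as 0/1
    return in_sexy_sentence / total if total else 0.0
```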

Method
VS(v): verb sexiness function
- Measures how much more likely a verb phrase is to appear in an erotic than a nonerotic context
- S_E = the set of sentences in the erotica corpus; S_B = the set of sentences in the Brown Corpus
- A verb phrase v is the substring containing the verb, bordered on each side by the closest noun or pronoun; where no such border exists, the verb itself is that endpoint of v

Method
VS(v): verb sexiness function (continued)
- VS(v) approximates the probability of v appearing in an erotic versus a nonerotic context, using counts in S_E and S_B normalized so that P(s ∈ S_E) = P(s ∈ S_B)
- That is, VS(v) is the probability that v ∈ s implies s is in an erotic context
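A sketch of one plausible reading of both the VP extraction rule and the normalized VS estimate; the equal-mass weighting reflects the constraint P(s ∈ S_E) = P(s ∈ S_B), and all names are illustrative:

```python
def extract_vp(tags, i):
    """Span around the verb at index i, bordered on each side by the
    nearest noun or pronoun; the verb itself is an endpoint if none exists."""
    noun_like = lambda t: t.startswith("NN") or t.startswith("PRP")
    lo = max((j for j in range(i) if noun_like(tags[j])), default=i)
    hi = min((j for j in range(i + 1, len(tags)) if noun_like(tags[j])), default=i)
    return (lo, hi)

def verb_sexiness(v, count_E, count_B, n_E, n_B):
    """VS(v) ~ P(erotic | v in s), weighting each corpus equally."""
    pe = count_E.get(v, 0) / n_E    # relative frequency in the erotica corpus
    pb = count_B.get(v, 0) / n_B    # relative frequency in the Brown Corpus
    return pe / (pe + pb) if (pe + pb) else 0.0
```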

Method
Features: DEviaNT uses two categories of features in identification:
- Noun Euphemisms
- Structural Elements

Method
Noun Euphemisms
- Does s contain a noun ∈ SN?
- Does s contain a noun ∈ BP?
- Does s contain a noun n such that NS(n) equals the minimum ("not sexy") value?
- Average NS(n) over all nouns n ∈ s such that n ∉ SN ∪ BP

Method
Structural Elements
- Does s contain a verb that never occurs in S_E?
- Does s contain a VP that never occurs in S_E?
- Average VS(v) over all VPs v ∈ s
- Average AS(a) over all adjectives a ∈ s
- Does s contain an adjective a that never occurs in any sentence of S_E ∪ S_B together with a noun in SN?

Method
Structural Elements
Also captures Basic Structure via:
- Number of non-punctuation tokens
- Number of punctuation tokens
- For each pronoun and each POS tag, the number of times it occurs in s, bucketed as {0, 1, 2+}
- Category of the subject (noun, pronoun, etc.)
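Putting the preceding three slides together, here is a minimal sketch (assumed helper tables, not the authors' code) of assembling a feature dictionary for one tagged sentence s; the VP-based and Basic Structure features are elided for brevity:

```python
def deviant_features(sent, SN, BP, NS, AS, NS_MIN, verbs_in_E):
    """Noun-euphemism and structural features for sent = [(token, tag), ...].
    NS/AS are precomputed dicts; NS_MIN is the 'not sexy' floor value."""
    nouns = [t.lower() for t, g in sent if g.startswith("NN")]
    adjs  = [t.lower() for t, g in sent if g.startswith("JJ")]
    verbs = [t.lower() for t, g in sent if g.startswith("VB")]
    other = [n for n in nouns if n not in SN and n not in BP]
    return {
        "has_sn_noun":    any(n in SN for n in nouns),
        "has_bp_noun":    any(n in BP for n in nouns),
        "has_min_ns":     any(NS.get(n, NS_MIN) == NS_MIN for n in other),
        "avg_ns":         sum(NS.get(n, NS_MIN) for n in other) / len(other)
                          if other else 0.0,
        "has_novel_verb": any(v not in verbs_in_E for v in verbs),
        "avg_as":         sum(AS.get(a, 0.0) for a in adjs) / len(adjs)
                          if adjs else 0.0,
        # average VS over extracted VPs, the novel-VP flag, the adjective/SN
        # co-occurrence flag, and the Basic Structure counts would follow here
    }
```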

Method
SVM classifier
- Default parameters, with the option to fit logistic regression curves to the outputs for precision/recall analysis
MetaCost metaclassifier
- Reclassifies the training data to produce a single cost-sensitive classifier
- Cost of a false positive set to 100x that of a false negative
  - Being correct is more important than not missing any TWSSs
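MetaCost itself is a Weka metaclassifier; as a rough scikit-learn stand-in (an assumption, not the authors' setup), asymmetric class weights can approximate the 100:1 cost ratio:

```python
from sklearn.svm import SVC

# Misclassifying a negative (non-TWSS, class 0) as positive is treated as
# 100x as costly as the reverse, approximating the paper's cost matrix.
clf = SVC(kernel="rbf", probability=True, class_weight={0: 100, 1: 1})
# clf.fit(X_train, y_train)
# clf.predict_proba(X_test)[:, 1]  # scores for precision/recall analysis
```

Unlike MetaCost, which relabels training examples using bagged estimates of class probabilities, class weighting simply rescales the SVM's per-class penalty; the precision-over-recall effect is similar.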

Evaluation
The goal of the evaluation is to show that their features can compete with baseline approaches.
Training data:
- Positive: 2,001 examples from twssstories.com, a website of user-submitted TWSS jokes
- Negative: 2,001 sentences, 667 from each of three sites:
  - textsfromlastnight.com: racy texts
  - fmylife.com/intimacy: love-life stories
  - wikiquotes.org: quotes from famous American speakers and films

Evaluation
Baselines:
- Naïve Bayes classifier on unigram features
- SVM on unigram features
- SVM on unigram and bigram features
- MetaCost versions of each
- DEviaNT with only Basic Structure features
Baseline SVM models used the same parameters and kernel functions as DEviaNT. DEviaNT was also tested with unigram features added, but they did not improve performance.
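For reference, the unigram baselines are straightforward to reproduce in spirit; a sketch with scikit-learn (data variables assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Unigram Naive Bayes baseline; for the unigram+bigram SVM, swap in
# ngram_range=(1, 2) and sklearn.svm.SVC with DEviaNT's kernel settings.
baseline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
# baseline.fit(train_sentences, train_labels)
# baseline.predict(test_sentences)
```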

Results
- DEviaNT and Basic Structure have the highest precision of all systems tested
- DEviaNT reaches 71.4% precision; the unigram SVM w/o MetaCost maxes out at 59.2%

Results
Compared the sentences that DEviaNT, Basic Structure, and the unigram SVM w/o MetaCost most confidently classified as TWSS:
- DEviaNT returned 28 sentences, all tied for most likely to be a TWSS
  - 20 were true positives
  - 2 of the 8 false positives were actually TWSSs, such as "Yeah, but his hole really smells sometimes."
- Basic Structure returned 16 sentences
  - 11 were true positives
  - Of these, 7 were also in DEviaNT's most-confident set
- The unigram SVM w/o MetaCost returned 130 sentences, of which 77 were true positives
- DEviaNT identified TWSSs that rely on noun euphemisms, such as "Don't you think these buns are a little too big for this meat?", which Basic Structure missed
- DEviaNT has much lower recall than the unigram SVM, but it accomplishes the goal of high precision
- If the training data were a balanced subset of the test data, DEviaNT's precision would be 0.995

Conclusion
- Experiments indicate that noun-euphemism and erotic-domain structure features contribute to improving the precision of TWSS identification
- It could be possible to generalize this technique to other types of double entendres and humor