“Cheap” Tricks for NLP: An “Invited” Talk
Craig Martell
Associate Professor, Naval Postgraduate School
Director, NLP Lab

Overview
We’ve been doing work on microtext since before it was “microtext”.
– About NPS
– NPS Chat Corpus (v1 and v2?)
  – Overview
  – Goal (Jane Lin, NSA)
  – Age Detection Task (MAJ Jenny Tam, USA)
– POS and Dialogue Act tagging: using Treebank to bootstrap (Lt Col Eric Forsyth, USAF)
– But do we really even need to POS-tag? (CAPT James Hitt, USN)
  – Getting by “on the cheap”
– Authorship detection in Twitter (LT Sarah Boutwell, USN)
– Good scientific goals for the community (??)

NPS NLP Lab
NPS is both a university and part of the DoD.
As a university, we work on the same types of sponsored research as civilian universities:
– DARPA*, IARPA, MURIs, NSF, etc.
– Standard competitive process
– Standard academia/industry expectations for results
– Same tenure and promotion process
As a part of the DoD, we do work more directly for sponsors:
– DoD, DARPA*, NRO, NSA, etc.
– Depending on the type of money, results need to be more operationally applicable
– We have had some cool results using “cheap” tricks that could point to more “normal” academic research

Some Recent and Current Work
IARPA SCIL
– Persuasion detection
– Sub-group detection
– In forums, chat, etc. (“microtext”)
– With UMD, UCSC, and Temple
DoD, etc.
– Topic detection in IRC chat (Adams 2008)
– Authorship “signal boosting” with large author sets
  – Any boost is remarkably useful to analysts
– Project away the topic signal from documents for a cleaner authorship signal (topic does most of the work)
– L1 detection from English-L2 documents
– “On-phone” NLP (the above and more): accuracy vs. computational power

The NPS Chat Corpus, V1
Gathered 495,000 posts in age-based rooms
– In accordance with the terms of service of the chat service
To abide by the Privacy Act, we hand-anonymized 10,000 posts and tagged them for dialogue act and part of speech
– Go to Web Page

The NPS Chat Corpus, V2?
We have also gathered data to aid in doing conversational thread extraction:
– Essentially, we want to cluster posts according to which conversation they’re in
– Not necessarily mutually exclusive clusters
We gathered data similar to that gathered by Elsner and Charniak at Brown:
– They gathered IRC data from Linux tech help
– We added iPhone and Python tech help, and Physics Q&A
– It has all been hand-“clustered” into conversations
– Working with UCSC CS (Lyn Walker) and Linguistics (Pranav Anand) to augment the annotation to include dialogue acts and “attachment” instead of clusters

First Use: Age Detection
Second Youth Internet Safety Survey (2005) (YISS-2):
– Decrease in youths receiving solicitations
– The number of dangerous sexual overtures/aggressive solicitations has not declined
– In 35% of the aggressive episodes, youths did not think the solicitations were serious enough to tell anyone
– Only 7% of the aggressive solicitations were reported to law enforcement, an ISP, or another authority
There is a need for an automated system that can recognize adults conversing with teens and alert parents to possibly inappropriate conversations.

Tam – Chat Classification
NPS Chat Corpus (Talk City chat data)
– Teens, 20s, 30s, 40s, 50+
Perverted Justice (IM chat logs)
– Pseudo-victims (adults posing as minors)
– Convicted criminals (solicitation of minors)
Binary classification:
– teens vs. adults
– teens vs. a specific age group
– teens vs. pseudo-victims (how similar are actual teens to adults pretending to be teens?)
– criminals vs. teens (looking for minors soliciting minors)
– criminals vs. pseudo-victims
Classification tool
– Linear support vector machine with different slack variables (a sketch follows below)
Result: 80-90% success at distinguishing teens from adults. Most important is distinguishing teens from 20s: >90%, the current state of the art in the field!
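A minimal sketch of this kind of binary age-group classification, assuming Python and scikit-learn. The toy posts, labels, and word-n-gram features are placeholders, not the actual NPS pipeline; sweeping the SVM's C parameter stands in for the "different slack variables" above.

# Sketch: linear SVM over chat posts with a sweep of the slack (C) parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: chat posts labeled with an age group.
posts = [
    "hey whats up lol",
    "omg that is so cool",
    "I will forward the minutes after the meeting.",
    "Please let me know a convenient time to call.",
]
labels = ["teen", "teen", "adult", "adult"]

# "Different slack variables" corresponds to trying several values of C.
for c in (0.1, 1.0, 10.0):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=c))
    model.fit(posts, labels)
    print(c, model.predict(["u there? brb"]))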

Forsyth – Dialogue Act/POS Tagging
An experiment in cross-domain NLP: we wanted to POS-tag chat (Lt Col Eric Forsyth, USAF).
Backoff chain: lex. bigrams → bigrams → lex. unigrams → unigrams → MLE from training data (see the sketch below)
– WSJ train, chat test: 57.4% accuracy
  – Not surprising: chat is not like WSJ
– Treebank train, chat test: 65.8% accuracy
  – Includes ATIS and Switchboard; chat is somewhat speech-like
Bootstrapped and hand-corrected POS tags for 10,000 posts
– Chat train, chat test: 73.7%
– But add the 10,000 chat posts to Treebank: 87.1%
  – Using an HMM tagger trained on the combination: 90.8%
Using these part-of-speech tags as part of the input, we can dialogue-act tag at 83.2% accuracy.
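A rough sketch of a backoff POS-tagging chain in the spirit of the one above, assuming NLTK and its bundled Treebank sample. The exact backoff stages and the NPS chat training data are not reproduced here.

# Sketch: bigram tagger backing off to a unigram tagger, then a default tag.
import nltk
from nltk.corpus import treebank
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

nltk.download("treebank", quiet=True)
train_sents = treebank.tagged_sents()

# Each stage backs off to the next when it has no evidence for a context,
# ending in an MLE-style default of a very frequent tag ("NN").
default = DefaultTagger("NN")
unigram = UnigramTagger(train_sents, backoff=default)
bigram = BigramTagger(train_sents, backoff=unigram)

print(bigram.tag("lol r u going to the meeting".split()))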

Hitt – Dialogue Act/POS Tagging
But does POS-tagging quality matter for dialogue-act tagging (our actual goal in chat, SMS, etc.)?
Sure, but it doesn’t have to be that good.
Instead of using chat data at all, we (CAPT James Hitt, USN) simply took the MLE tag for each word string (no WSD) from pre-existing resources (Treebank and Brown combined); a sketch follows below.
Just using these “cheap” parts of speech, we get 83.23% dialogue-act tagging accuracy.
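A minimal sketch of the "cheap" MLE tagging idea above: assign each word its single most frequent tag in Treebank plus Brown, with no context. It assumes NLTK and maps both corpora to the universal tagset so their tags are comparable; the dialogue-act model built on top is not shown.

# Sketch: per-word MLE tag lookup built from Treebank + Brown.
from collections import Counter, defaultdict
from itertools import chain

import nltk
from nltk.corpus import brown, treebank

nltk.download("treebank", quiet=True)
nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

counts = defaultdict(Counter)
for word, tag in chain(treebank.tagged_words(tagset="universal"),
                       brown.tagged_words(tagset="universal")):
    counts[word.lower()][tag] += 1

# One tag per word string: its corpus-wide most frequent tag.
mle_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def cheap_tag(tokens, unknown="NOUN"):
    # No disambiguation at all; unseen words get a default tag.
    return [(t, mle_tag.get(t.lower(), unknown)) for t in tokens]

print(cheap_tag("do you want to chat later".split()))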

L1 Language Identification
Using the International Corpus of Learner English
– For each author, L2 = English (except for a native-speaker control group)
– Texts are in English
– Task: guess the author’s L1
Using character 3-grams, we (LT Charles Ahn, USN) got 81.3% accuracy (sketched below).
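A minimal sketch of L1 identification from English text with character 3-grams, assuming Python and scikit-learn. The texts and L1 labels are made-up placeholders; ICLE itself is licensed and not bundled here, and the classifier choice is illustrative.

# Sketch: character 3-gram features feeding a multi-class classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I am agree with the author about this topic.",
    "In my country we discuss often about such themes.",
    "The informations given in the article are very interesting.",
    "He explained me the problem very clearly yesterday.",
]
l1_labels = ["ES", "DE", "FR", "RU"]  # hypothetical L1 codes

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),  # character 3-grams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, l1_labels)
print(model.predict(["She suggested me to read more books."]))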

L1 Language Identification

CPOS and L1 Identification
Interestingly, CPOS n-grams work very well here too (a rough sketch follows below).
(Results table not captured in the transcript; its cells contained average counts of documents over 26 trials. ML = multi-class logistic regression.)
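A rough sketch of classifying L1 from POS-tag n-grams, under the assumption that "CPOS" refers to coarse part-of-speech tags. The POS sequences below are hand-written placeholders; in practice they would come from a tagger run over each document, and the classifier is the multi-class logistic regression named above.

# Sketch: documents represented only by coarse POS-tag n-grams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each "document" is the sequence of coarse POS tags for its words.
pos_docs = [
    "PRON VERB VERB ADP DET NOUN",   # e.g. "I am agree with the idea"
    "PRON VERB PRON DET NOUN ADV",   # e.g. "He explained me the problem clearly"
    "DET NOUN VERB ADV ADJ",         # e.g. "The informations are very interesting"
]
l1_labels = ["ES", "FR", "DE"]  # hypothetical L1 codes

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),  # POS n-grams up to trigrams
    LogisticRegression(max_iter=1000),
)
model.fit(pos_docs, l1_labels)
print(model.predict(["PRON VERB PRON ADP VERB"]))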

Boutwell – Authorship Detection in Twitter
Hot off the presses
Built a “social network” from the Twitter garden hose
– Use it to simulate SMS messages within the group
– If my phone is stolen, can it tell that it isn’t me writing the SMS?
So, what do we need to do authorship detection over “SMS”?
– There doesn’t seem to be a lot of authorship signal in a single SMS
– Well, not in one message, but in 23 there is
  – With a stream of 23 messages, we got 90% accuracy over 10 authors (sketched below)
  – Authors are consistent in how they deal with the constraints?
  – More error/success analysis needed
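A minimal sketch of the "stream of messages" idea: classify authorship over a window of consecutive short messages rather than a single message. It assumes Python and scikit-learn; the window size of 23 comes from the slide, but the character-n-gram features, the SVM, and the message streams are illustrative assumptions.

# Sketch: windowed authorship classification over short-message streams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

WINDOW = 23  # messages per classification instance, as in the slide

def windows(messages, size=WINDOW):
    # Concatenate consecutive messages into fixed-size windows.
    return [" ".join(messages[i:i + size])
            for i in range(0, len(messages) - size + 1, size)]

# Hypothetical per-author message streams.
streams = {
    "author_a": ["gonna be late", "lol ok"] * 30,
    "author_b": ["On my way now.", "See you soon!"] * 30,
}
docs, labels = [], []
for author, msgs in streams.items():
    for w in windows(msgs):
        docs.append(w)
        labels.append(author)

model = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
                      LinearSVC())
model.fit(docs, labels)
print(model.predict([" ".join(["omw, late again", "lol"] * 12)]))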

Research to be explored
Can we build a better scientific understanding of different domains of text and develop a theory of what will be useful from pre-existing domains? What will be needed from the new domain?
How much can we actually do with as little as possible?
– Do we need to parse?
– Should we normalize (e.g., expand “ur”), or generate new grammars? I argue we should build new models sooner rather than later.
– How do we get parallel corpora?
– How do we establish best practices for Mechanical Turk?