Detecting flames and insults in text. A. Mahmud, K. Z. Ahmed, M. Khan. International Conference on Natural Language Processing, 2008. dspace.bracu.ac.bd

Presentation transcript:

Introduction
Some people take it as fun to post personally attacking or insulting messages in on-line communication. – E.g. 'wiki vandalism'
An automated system that helps a user distinguish flames from information – lets the user decide whether or not to read an article beforehand.

Introduction
Some messages contain insulting words or phrases but are still considered factual information. – 'X is an idiot' vs. 'Y said that X is an idiot'
Normal text-searching methods, or simply looking for obscene expressions, would annotate both as flames. – So the authors outline a more sophisticated sentence-classification system using Natural Language Processing.

Related Work
Smokey (Spertus, 1997) is a well-known flame recognition system:
– Looks for insulting words and syntactic constructs.
– Uses semantic rules to produce features.
– Uses Quinlan's C4.5 decision-tree generator to classify each message as a flame or not.
– Correctly categorizes 64% of the flames and 98% of the non-flames in a test set of 460 messages.

Our contrast with Smokey
Smokey's semantic rules are classification rules that
– match patterns or the syntactic positions of word sequences in a sentence.
– This work instead tries to extract semantic information from the general semantic structure, to interpret the basic meaning of a sentence.
Smokey classifies at the message level; our system classifies at the sentence level.
We do not include any sociolinguistic observations or site-specific information.

Methodology
Annotation scheme, in contrast to subjective language:
– Subjective language is language used to express private states in the context of a text or conversation (Wiebe et al., 2004).
– Automatic subjectivity analysis would also be useful for flame recognition (Wiebe et al., 2004).
– Flames can be viewed as an extreme subset of subjective language (Martin, 2002).
We consider neither context nor surroundings.
– Annotate any speech event such as said, told, etc. as a 'factive' event.
– Example: Mary said, "John is an idiot."

Methodology
In our annotation scheme, the root verb said is marked 'factive' in the corresponding dependency structure.

Methodology
Preprocessing
Two constraints for interpreting meaning from the parser output:
1. It is easiest to extract semantic information from the dependency structure of a simple sentence. – So we split each sentence into its corresponding clauses.
2. In a simple sentence, an event or verb must follow its corresponding subject. – If this order is reversed, we need to swap them.

Methodology
The steps of our preprocessing part:
1. Put each sentence on its own line.
2. Replace the factive event 'according to' with 'accorded that' and swap the subject and the event. "According to Mary, John didn't go downstairs." → "Mary accorded that John didn't go downstairs."
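Step 2 can be sketched as a simple rewrite rule. This is a hypothetical helper, not the authors' code; the function name and the regular expression are assumptions for illustration.

```python
import re

def normalize_according_to(sentence):
    # Rewrite "According to X, Y" as "X accorded that Y" so the
    # subject precedes its factive event (a sketch of step 2).
    m = re.match(r"\s*According to ([^,]+),\s*(.*)", sentence, re.IGNORECASE)
    if m:
        return f"{m.group(1).strip()} accorded that {m.group(2).strip()}"
    return sentence
```

Applied to the slide's example, the function returns "Mary accorded that John didn't go downstairs."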

Methodology
3. Punctuation marks (" ") delimit the scope of a speaker's speech; a trailing attribution is moved in front of its quotation:
"John is waiting for the lift. He didn't go downstairs," Mary replied while talking with Lisa. → Mary replied, "John is waiting for the lift. He didn't go downstairs." Mary replied while talking with Lisa.
"John is waiting for the lift," replied Mary, adding, "He didn't go downstairs." → Mary replied, "John is waiting for the lift." adding, "He didn't go downstairs."
4. Tag each sentence using the Stanford parser.
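Moving a trailing attribution in front of its quotation (step 3) might be sketched as below. The pattern covered is deliberately minimal and assumed: only the form '"Q," X replied.' is handled.

```python
import re

def reorder_quote(sentence):
    # Move a trailing attribution ('"Q," Mary replied.') in front
    # of the quotation ('Mary replied, "Q."') -- a minimal sketch.
    m = re.match(r'^"(.*)," (\w+ \w+)\.?$', sentence)
    if m:
        return f'{m.group(2)}, "{m.group(1)}."'
    return sentence
```

A full implementation would also handle multi-sentence quotes and mid-sentence attributions, as in the slide's second example.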

Methodology
5. Separate each sentence into clauses at the clause separators. Some separators need special consideration:
, (comma): Check whether it actually separates two clauses. If so, split the sentence; otherwise leave it. After splitting, make adjustments if needed.

Methodology
- (dash), -- (double dash): Example: We will not tolerate it anyway, because we have to win the match - said John Smith yesterday.
Here the first and second clauses (separated by the comma) are the speech of John Smith. Put a quotation mark (") at the beginning of the first clause and at the end of the clause preceding the dash-separated clause, then do the adjustment for quotation marks as described in step 3.

Methodology
and: As with the comma, check whether it separates two clauses.
who and which: Example: The speaker here is John Smith, who is also president of the club. Result: The speaker here is John Smith. John Smith is also president of the club. The process is the same for the separator which.
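The who-splitting step could be sketched as follows. The antecedent heuristic (taking the noun phrase after the last "is") is an assumption for this sketch; the real system resolves the antecedent from the parse.

```python
import re

def split_who_clause(sentence):
    # Split "... is X, who Y" into the main clause and "X Y",
    # using the noun phrase after "is" as the antecedent.
    # A sketch only; real antecedent resolution needs the parser.
    m = re.match(r"^(.* is )(.+?), who (.+?)\.?$", sentence)
    if m:
        main = (m.group(1) + m.group(2)).strip() + "."
        sub = m.group(2) + " " + m.group(3) + "."
        return [main, sub]
    return [sentence]
```

On the slide's example this reproduces the stated result: "The speaker here is John Smith." and "John Smith is also president of the club."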

Methodology
Note that in this preprocessing section, every separator is kept, in angle brackets, before the clause it separated. The preprocessing tasks in this section are all predetermined.

Methodology
Processing
We do some bookkeeping before and after processing each clause by manipulating a stack. After traversing each dependency tree we obtain nested sources: agents and experiencers with their corresponding events or verbs.

Methodology
Stack Manipulation
Check whether the first word of the clause is a verb.
– If a verb is found, check whether it starts a new sentence, or whether this clause was separated by while or because. If so, take the last agent from the stack, not the experiencer.
– We call the separators while and because 'scope detachers'.
– For any other separator, take the last experiencer.
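The agent/experiencer lookup on the stack might be modeled like this. The data shapes (role/name pairs, `None` standing for a new sentence) are assumptions of the sketch, not the authors' implementation.

```python
def resolve_subject(stack, separator, starts_with_verb):
    # stack: list of (role, name) pairs, most recent last;
    # role is "agent" or "experiencer".
    # Scope detachers (while, because) and new sentences take the
    # last agent; any other separator takes the last experiencer.
    scope_detachers = {"while", "because", None}  # None = new sentence
    if not starts_with_verb:
        return None
    wanted = "agent" if separator in scope_detachers else "experiencer"
    for role, name in reversed(stack):
        if role == wanted:
            return name
    return None
```

For the slide's example ("Mary said John is an idiot while talking with Lisa."), a while-separated verb-initial clause resolves to the agent Mary rather than the experiencer John.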

Methodology
– Example: Mary said John is an idiot while talking with Lisa.
– The second clause, "talking with Lisa", starts with the verb talking and was separated by while, so we take the last agent, Mary. The clause becomes: "Mary talking with Lisa."

Methodology
– In case of any other separator, the last experiencer, John, should be taken instead.
– If the next sentence or clause is within a scope of punctuation marks, detach all scopes except the scope opener. Example: Mary said, "I like fish and vegetables. I hate meat." Detach all scopes except the scope opener Mary:
Agent: Mary, Event: said
– Agent: I, Event: like
– Agent: I, Event: hate
Agent: Mary, Event: said (retained)

Methodology
– If the next clause is a separated clause but not within punctuation marks, check the separator and detach scopes in reverse order until an agent is found or the stack becomes empty. If an agent is found and the separator is a 'scope detacher', detach that agent as well. Example: Mary told that Lisa said that John is an idiot doesn't know any behavior.

Methodology
Marking Phase
First turn every insulting phrase into a single word by putting a '-' between its words. – Ex: get a life becomes get-a-life.
Next, mark each word in a sentence if it belongs to any of the following categories. All potentially insulting elements are marked with a '*'.
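Joining multi-word insulting phrases into single tokens is a plain string substitution; the lexicon contents below are illustrative assumptions.

```python
def hyphenate_phrases(text, phrases):
    # Replace each known insulting phrase with its hyphenated form,
    # e.g. "get a life" -> "get-a-life", so it parses as one word.
    # Longest phrases first, so a phrase is not broken by a sub-phrase.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "-"))
    return text
```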

Methodology
* insulting phrases: any insulting phrase such as get-a-life, get-lost, etc.
* insulting words: any insulting word: stupid, idiot, nonsense, cheat, etc.
* comparable: objects a human being can be insultingly compared to, such as donkey, dog, etc.
humanObj: any word referring to a human being, such as he, she, we, they, people, Bangladeshi, Chinese.

Methodology
attributive: personal attributes of a human being, such as behavior, manner, character, etc.
factive: speech events such as said, told, asked, etc. In this context insults and insulted are also factive events.
evaluative: verbs used to evaluate a human being's personal attributes, such as know, show, have, has, expressed, etc.

Methodology
modifier: modal verbs: should, would, must, etc.
comparableVerb: auxiliary verbs used to compare a human being with a 'comparable': is, are, was, and were.
Each word is also marked with its 'tag' (part-of-speech) property. – For example, the word behave is marked as behave/VB.

Methodology
Tree Annotation
The tree is built from the parser output, incorporating the marking-phase categories as Boolean properties of each node. We also give each node three basic properties:
– label: the word itself.
– tag: the part of speech of the word.
– edgeFromParent: the relation between the node and its parent.
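A node with these properties could be modeled as below. The field names follow the slide; the class itself and the particular category flags shown are assumptions of the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                 # the word itself
    tag: str                   # part of speech, e.g. "NN"
    edge_from_parent: str      # dependency relation to the parent
    insulted: bool = False     # marking-phase category flags
    factive: bool = False
    human_obj: bool = False
    children: list = field(default_factory=list)
```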

Methodology
Detection
A set of predefined rules is applied to each node while traversing the tree. When visiting a node we must know two things:
1. The root node of the current sub-tree being visited.
2. The relation to that root, i.e. which sub-tree we are traversing.

Methodology
Rules:
1. When the root verb is 'factive':
– if the subject's 'tag' is NNP (proper noun) or PRP (personal pronoun),
– and its 'edgeFromParent' property does not indicate a passive subject,
– then the subject becomes an agent; in any other case it becomes an experiencer.
– Ex: Peoples say, "We are democratic." Here Peoples is an experiencer because its 'tag' is NNS (common noun, plural).
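Rule 1 reduces to a small predicate. The passive-subject label `nsubjpass` follows Stanford typed-dependency conventions; the rest is an illustrative sketch.

```python
def classify_subject(root_is_factive, subject_tag, edge_from_parent):
    # Rule 1: a factive root with an NNP/PRP subject that is not
    # passive yields an agent; anything else is an experiencer.
    if (root_is_factive
            and subject_tag in ("NNP", "PRP")
            and edge_from_parent != "nsubjpass"):
        return "agent"
    return "experiencer"
```

The slide's example follows directly: Peoples is tagged NNS, so it is classified as an experiencer even under the factive verb say.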

Methodology
2. If a dependency structure has no verb at the root and the current node is 'insulted', then set the root to 'insulted'.

Methodology
3. If any insulting word or phrase is found in the subject part:
– set the 'insulted' property of the current root to true;
– the subject becomes an experiencer, no matter what its corresponding event is.

Methodology
4. If the 'edgeFromParent' property of the current 'insulted' node is dobj (direct object) or iobj (indirect object):
– set the parent node (which must be a verb node) to 'insulted'; its corresponding subject becomes an experiencer.
– Figure: 'nonsense' is 'insulted' and a dobj.
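Rule 4's propagation step could look like this; nodes are modeled as plain dicts for the sketch, and the key names are assumptions.

```python
def propagate_object_insult(node, parent):
    # Rule 4: an 'insulted' direct or indirect object marks its
    # governing verb (the parent node) as 'insulted' too.
    if node["insulted"] and node["edge"] in ("dobj", "iobj"):
        parent["insulted"] = True
    return parent
```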

Methodology
5. If the root verb has a 'negative' modifier and the current node's 'insulted' property is true:
– Check the root's children: only if one of the child nodes has the specified 'label' does the root's 'insulted' property become true; otherwise it is false.

Methodology
6. If an 'insulted' node's 'edgeFromParent' property is as, like, or to, and the subject is a 'humanObj', then the root becomes 'insulted'.
– Ex: He thinks like a donkey. (He is the humanObj.)

Methodology
7. If the current node in the subject part is 'comparable' and the root verb is a 'comparableVerb', then check whether any 'humanObj' appears in the object part; if so, set the root's 'insulted' property to true.

Methodology
8. If a 'humanObj' is a modifier of an 'insulted' node, then the current root's 'insulted' property becomes true.
– Otherwise, if a 'humanObj' is a modifier of the root verb and the root verb also has an 'insulted' node as a modifier, then the root becomes 'insulted'.

Methodology
9. If a node's property is 'attributive', we perform two checks in sequence:
– First, check whether the root node is 'evaluative' or a 'comparableVerb'.
– If so, next check whether the root node has a 'negative' modifier or is modified by a 'modifier'.
– If that also holds, set the root's 'insulted' property to true.
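The two sequential checks of rule 9 can be sketched as a single function; the dict keys are assumed names for this sketch.

```python
def rule_9(root, node):
    # Rule 9: an 'attributive' node under an evaluative or
    # comparable-verb root that carries a negative or modal
    # modifier makes the root 'insulted',
    # e.g. "He doesn't know any behavior."
    if not node["attributive"]:
        return root
    if root["evaluative"] or root["comparable_verb"]:
        if root["has_negative"] or root["has_modifier"]:
            root["insulted"] = True
    return root
```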

Methodology
10. If the current node is a 'humanObj' and its 'edgeFromParent' property is by, then it is considered an agent or experiencer.

Result and Discussion

Limitations
1. The system can annotate and distinguish an abusive or insulting sentence only if the relevant words or phrases exist in the lexicon.
2. Our preprocessing part is not yet foolproof against all exceptions.
3. We do not yet handle erroneous input, such as misplaced commas or unmatched punctuation marks, in our implemented system.
4. Our performance depends heavily on the sentence detector of the OpenNLP tools and on the Stanford parser. – Since the Stanford parser is probabilistic, its output is not guaranteed to be correct; in those cases our system also gives wrong output.