Using Linguistic Analysis and Classification Techniques to Identify Ingroup and Outgroup Messages in the Enron Email Corpus.


1 Using Linguistic Analysis and Classification Techniques to Identify Ingroup and Outgroup Messages in the Enron Corpus

2 Introduction
“America's most innovative company from 1999 to 2000”
Enron was the 7th largest company in the United States
Enron had 21,000 employees in mid-2001
Went bankrupt in December 2001
This study applied linguistic analysis to the publicly available Enron corpus

3 Research Direction
Can linguistic cues used in deception detection be used to identify other classifications?
  Ingroup vs. outgroup communication
Motivators
  Baseline truth and deception may be too difficult or costly to identify
  Existing automated techniques could be readily applied

4 Ingroup and Outgroup Communication
Social Identity Theory (Tajfel and Turner): discrimination in favor of the ingroup and in opposition to the outgroup
  Includes prejudice, stereotyping, negotiation, and language use
Linguistic Masking (Platow and Broadie)
  Done with strategic use of passive and active voice

5 Linguistic Cues for Deception Detection
Existing research on automated linguistic analysis of asynchronous computer-mediated communication
  Better than chance
Existing and established cues and automation techniques could be applied to similar classification schemes
Cues identified by Twitchell et al. (2005) used in this study

6 Methodology
Define our selection criteria
  Ingroup: communication between people who were found guilty, submitted a guilty plea, or were awaiting trial
  Outgroup: communication from a person who was found guilty, submitted a guilty plea, or was awaiting trial to a person not convicted or charged
Identify ingroup and outgroup members
  News articles
  Court transcripts
Extract senders from emails
Identify ingroup and outgroup messages

Enron uses an email address convention that includes the first and last name of almost every sender. By parsing the sender addresses in the over 200,000 emails in the database, we were able to increase the number of employees we can identify from 151 to over 5,000, including almost everyone Arab had identified as being proven guilty, submitting a guilty plea, or awaiting trial. To make the ingroup and outgroup sample sizes the same, we took a random sample of 29 outgroup messages for comparison.
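The selection criteria above can be read as a simple labeling rule. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function name `label_message` and the example names are hypothetical, and the handling of messages with mixed recipient lists (skipped here) is an assumption because the slides do not address it.

```python
from typing import Iterable, Optional, Set


def label_message(sender: str, recipients: Iterable[str],
                  ingroup: Set[str]) -> Optional[str]:
    """Label a message 'ingroup' or 'outgroup', or None if it fits neither.

    `ingroup` holds the people found guilty, pleading guilty, or awaiting
    trial (per the slide, identified from news articles and court transcripts).
    """
    if sender not in ingroup:
        # Both criteria require a sender who belongs to the ingroup.
        return None
    recipient_list = list(recipients)
    if not recipient_list:
        return None
    if all(r in ingroup for r in recipient_list):
        return "ingroup"
    if all(r not in ingroup for r in recipient_list):
        return "outgroup"
    # Assumption: mixed recipient lists are left unlabeled.
    return None


# Hypothetical names, for illustration only.
members = {"alice", "bob"}
print(label_message("alice", ["bob"], members))    # ingroup
print(label_message("alice", ["carol"], members))  # outgroup
```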

7 Methodology (cont)
Email Identification
  Publicly available corpus appears to include email from 151 employees
  Parsing the sender/receiver addresses (using the Enron naming convention, which includes first and last name) resulted in an actual employee count of over 5,000
  Identified 29 ingroup messages and over 600 outgroup messages
  Random sample of 29 outgroup messages was used for analysis
Analyze identified emails with GATE and Weka
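The address parsing and balanced sampling steps might look roughly like the sketch below. The slides say only that the address convention includes first and last name; the exact `first.last@enron.com` pattern, the function names, and the fixed random seed are assumptions made for illustration.

```python
import random
import re
from typing import List, Optional, Tuple

# Assumed pattern: first.last@enron.com (the slides do not give the exact format).
NAME_ADDRESS = re.compile(r"^([a-z'\-]+)\.([a-z'\-]+)@enron\.com$")


def parse_sender(address: str) -> Optional[Tuple[str, str]]:
    """Return (first, last) if the address follows the assumed convention."""
    match = NAME_ADDRESS.match(address.strip().lower())
    if match is None:
        return None  # e.g. distribution lists or external addresses
    return match.group(1).capitalize(), match.group(2).capitalize()


def balanced_sample(outgroup_ids: List[str], n: int, seed: int = 0) -> List[str]:
    """Draw n outgroup messages at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(outgroup_ids, n)


print(parse_sender("Jane.Doe@enron.com"))       # ('Jane', 'Doe'), hypothetical address
print(parse_sender("announcements@enron.com"))  # None

# Hypothetical message ids standing in for the 600+ outgroup messages.
sample = balanced_sample([f"msg{i}" for i in range(600)], 29)
print(len(sample))  # 29
```

Sampling without replacement keeps the 29 outgroup messages distinct, which matches the goal of equal ingroup and outgroup sample sizes.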

8 Methodology: GATE and Weka
GATE (General Architecture for Text Engineering)
  Open source text engineering program
  Extracted 39 features for each message
Weka (Waikato Environment for Knowledge Analysis)
  Classification engine supporting decision trees, neural networks, and other AI algorithms
Short list of the features used to analyze the Enron corpus:
  Lexical_Diversity
  Emotiveness
  Pausality
  Word_Quantity
  Verb_Quantity
  Modifier_Quantity
  Sentence_Quantity
  passive_verb_ratio
  modal_verb_ratio
  Affect_Ratio

9 Automated Analysis Results
Using a J48 decision tree with ten-fold cross-validation
  Accurately classified 48 out of 58 messages as ingroup or outgroup (82.8% accuracy)
Cues
  Only 5 out of 39 cues were needed for classification:
    Pleasantness
    Average Sentence Length
    Verb Quantity
    You References
    Passive Verb Ratio
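To make the evaluation arithmetic concrete, here is a small sketch of how 58 messages split into ten folds and how the reported accuracy follows from 48 correct classifications. It is illustrative only: Weka shuffles and stratifies its folds, whereas this simplified version assigns indices round-robin, and the function names are hypothetical.

```python
from typing import List


def k_fold_indices(n_items: int, k: int) -> List[List[int]]:
    """Assign item indices to k folds round-robin (no shuffling or stratification)."""
    folds: List[List[int]] = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)
    return folds


def accuracy(correct: int, total: int) -> float:
    """Fraction of items classified correctly."""
    return correct / total


folds = k_fold_indices(58, 10)
print([len(f) for f in folds])            # [6, 6, 6, 6, 6, 6, 6, 6, 5, 5]
print(round(100 * accuracy(48, 58), 2))   # 82.76
```

Each fold serves once as the held-out test set while the other nine train the tree, so every one of the 58 messages is classified exactly once.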

10 J48 Decision Tree
Pleasantness <= 0.007673: false (19.0/1.0)
Pleasantness > 0.007673
|   Average_Sentence_Length <= 34.5
|   |   Verb_Quantity <= 8: false (3.0)
|   |   Verb_Quantity > 8
|   |   |   You_References <= : true (20.0)
|   |   |   You_References >
|   |   |   |   passive_verb_ratio <= 0: true (9.0/1.0)
|   |   |   |   passive_verb_ratio > 0: false (2.0)
|   Average_Sentence_Length > 34.5: false (5.0)

Pleasantness is the ratio of pleasant words to total words.
Average sentence length is found by dividing the total number of words by the total number of sentences.
Verb quantity is the total number of verbs in the message.
You References is defined as the ratio of second person pronouns to the total number of words in the message.
Passive Verb Ratio is the ratio of passive verbs to the total number of verbs in the message.
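The cue definitions and the tree on this slide can be sketched as follows. This is a hedged illustration, not the GATE or Weka implementation: it assumes the per-message counts (verbs, passive verbs, pleasant words, and so on) have already been produced by a tagger and lexicon, the names `MessageCounts`, `cues`, and `classify` are hypothetical, and the `You_References` split value is missing from the transcript, so it is left as a required parameter rather than guessed. The slides also do not say which class the tree's `true` label denotes, so the function returns the raw label.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MessageCounts:
    """Pre-computed counts for one message (assumes words > 0 and sentences > 0)."""
    words: int
    sentences: int
    verbs: int
    passive_verbs: int
    pleasant_words: int
    second_person_pronouns: int


def cues(c: MessageCounts) -> Dict[str, float]:
    """Compute the five cues exactly as the slide defines them."""
    return {
        "Pleasantness": c.pleasant_words / c.words,
        "Average_Sentence_Length": c.words / c.sentences,
        "Verb_Quantity": float(c.verbs),
        "You_References": c.second_person_pronouns / c.words,
        # Guard against messages with no verbs at all.
        "passive_verb_ratio": c.passive_verbs / c.verbs if c.verbs else 0.0,
    }


def classify(f: Dict[str, float], you_threshold: float) -> bool:
    """Walk the tree; `you_threshold` is unknown in the source and must be supplied."""
    if f["Pleasantness"] <= 0.007673:
        return False
    if f["Average_Sentence_Length"] > 34.5:
        return False
    if f["Verb_Quantity"] <= 8:
        return False
    if f["You_References"] <= you_threshold:
        return True
    return f["passive_verb_ratio"] <= 0


# Hypothetical counts and a placeholder threshold, for illustration only.
example = MessageCounts(words=100, sentences=5, verbs=10,
                        passive_verbs=0, pleasant_words=2,
                        second_person_pronouns=1)
print(classify(cues(example), you_threshold=0.05))  # True
```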

11 Future Research Directions
Perform similar analysis on transcripts of wire-tapped phone conversations (also publicly available)
Perform additional research to identify deceptive and non-deceptive messages from the Enron corpus
Explore additional ingroup and outgroup scenarios

12 Questions?

