Published by Candice Melton. Modified over 9 years ago.
1
Data Mining of E-Mails to Support Periodic & Continuous Assurance
Glen L. Gray, California State University at Northridge
Roger Debreceny, University of Hawai`i at Mānoa
5th Symposium on Information Systems Assurance
Toronto, October 2007
2
In this Presentation
Continuous monitoring of emails – why?
Technologies
  Social Network Analysis
  Text analysis
Challenges
Opportunities
3
Continuous Monitoring of Emails – Why?
Increased focus on forensic approaches to auditing
Increased interest in continuous assurance and monitoring of business processes
Emails = Organization’s DNA
Evidential matter on:
  Employee & management fraud (overrides)
  Compliance (e.g., HIPAA)
  Loss of intellectual property
  Corporate policies
4
Enron Email Archive
Released by Federal Energy Regulatory Commission
500K emails
151 Enron employees
Cleaned version at Carnegie Mellon: www.cs.cmu.edu/~enron/
Relational DB version at USC: www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf
5
Email Mining Targets
6
Content Analysis
7
Key Word Queries
Yes, people do say self-incriminating things in their emails
  Fraud
  Corporate dysfunction
Overwhelming false positives
  Need “smart” compound queries
Good continuous auditing (CA) candidate
  Already scanning for spam, porn, etc.
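One way to read a "smart" compound query is as a rule that fires only when terms from more than one category co-occur in the same message, rather than on any single key word. The sketch below illustrates that idea; the term lists and function names are hypothetical and were chosen only for illustration, not taken from the presentation.

```python
import re

# Hypothetical term lists, for illustration only.
CONCEALMENT_TERMS = ("delete", "shred", "hide", "off the books")
FINANCIAL_TERMS = ("invoice", "reserve", "earnings", "write-off")

def contains_any(text, terms):
    """Return True if any term appears as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in terms
    )

def compound_hit(text):
    """Flag a message only when both term categories co-occur in it."""
    return contains_any(text, CONCEALMENT_TERMS) and contains_any(text, FINANCIAL_TERMS)

if __name__ == "__main__":
    print(compound_hit("Please shred the invoice copies before the audit."))
    print(compound_hit("Please send me the invoice."))
```

Requiring co-occurrence is one simple way to cut false positives relative to single-word matching; a real deployment would likely need tuned term lists and proximity rules.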
8
Sender Deception -- Content
Deceptive emails include:
  Fewer first-person pronouns, to dissociate the senders from their own words
  Fewer exclusive words, such as "but" and "except", indicating a less complex story
  More negative emotion words, because of the sender’s underlying feeling of guilt
  More action verbs, again indicating a less complex story
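The cues above can be turned into per-message rates by counting words from small lexicons. The following is a minimal sketch under stated assumptions: the word lists are illustrative placeholders rather than a validated lexicon, and action verbs are left out because counting them would need part-of-speech tagging.

```python
import re
from collections import Counter

# Illustrative placeholder lexicons; not a validated word list.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without", "although"}
NEGATIVE_EMOTION = {"hate", "angry", "worthless", "afraid", "sad"}

def cue_rates(text):
    """Return the share of words in each cue category for one message."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty messages
    counts = Counter(words)

    def rate(lexicon):
        return sum(counts[word] for word in lexicon) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "exclusive": rate(EXCLUSIVE),
        "negative_emotion": rate(NEGATIVE_EMOTION),
    }

if __name__ == "__main__":
    print(cue_rates("I hate this, but I will do it."))
```

Rates, rather than raw counts, keep long and short messages comparable.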
9
Sender Deception -- Identification
Writeprint features
  Lexical -- characters & words
    Function words
    Root words
  Syntactic -- sentences
  Structural -- paragraphs
  Content-specific
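As a rough illustration of the feature families listed above, the sketch below computes one or two features from each of the lexical, syntactic, and structural groups. The specific features, the short function-word list, and the name `writeprint` are assumptions made for this example; published writeprint feature sets are far larger.

```python
import re

# A tiny, illustrative function-word list (an assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")

def writeprint(text):
    """Compute a small feature dictionary for one message."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words) or 1  # avoid division by zero
    lower = [w.lower() for w in words]

    features = {
        # Lexical: character- and word-level measures
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Syntactic: sentence-level measures
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        # Structural: paragraph-level measures
        "paragraph_count": len(paragraphs),
    }
    # Lexical: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = lower.count(fw) / n_words
    return features

if __name__ == "__main__":
    print(writeprint("The cat sat.\n\nThe dog ran to the park."))
```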
10
Sender Deception -- Identification
Number of potential features unlimited
Optimum number can vary by context and language
Developing user profiles and comparing new emails to profiles would be challenging for real-time CA
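To make the profile idea concrete, a user profile could be as simple as the mean and standard deviation of each feature over that user's past emails, with a new email scored by how far its features fall from those means. This is only a sketch of one possible approach (a mean absolute z-score); the function names and the scoring rule are assumptions, not the method of the presentation.

```python
from statistics import mean, pstdev

def build_profile(feature_rows):
    """Summarize past emails as {feature: (mean, population std dev)}."""
    keys = feature_rows[0].keys()
    return {
        key: (
            mean([row[key] for row in feature_rows]),
            pstdev([row[key] for row in feature_rows]),
        )
        for key in keys
    }

def deviation_score(profile, features):
    """Mean absolute z-score of a new email; zero-variance features are skipped."""
    z_scores = [
        abs(features[key] - mu) / sd
        for key, (mu, sd) in profile.items()
        if sd > 0
    ]
    return sum(z_scores) / len(z_scores) if z_scores else 0.0

if __name__ == "__main__":
    history = [{"avg_word_length": 4.0}, {"avg_word_length": 5.0}]
    profile = build_profile(history)
    print(deviation_score(profile, {"avg_word_length": 6.5}))
```

Even this simple scheme has to store and refresh a profile per user, which hints at why real-time use is hard.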
11
Temporal/Log Analysis
12
Volume & Velocity
Volume = number of emails a person sends and/or receives over a period of time
Velocity = how quickly the volume changes
Many external factors (e.g., vacations, seasonal activities) impact these numbers
Need “rolling histogram”
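The two definitions above can be sketched directly: volume as a per-day count, velocity as the change between consecutive periods, and a trailing rolling mean as one simple stand-in for the "rolling histogram". The function names and the choice of a daily period are assumptions made for this example.

```python
from collections import Counter
from datetime import date, timedelta

def daily_volume(send_dates, start, days):
    """Count emails per day for `days` consecutive days beginning at `start`."""
    counts = Counter(send_dates)
    return [counts[start + timedelta(days=i)] for i in range(days)]

def velocity(series):
    """Change in volume between consecutive periods."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def rolling_mean(series, window):
    """Trailing rolling mean; early entries average over the data available so far."""
    result = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

if __name__ == "__main__":
    sent = [date(2007, 10, 1), date(2007, 10, 1), date(2007, 10, 3)]
    volume = daily_volume(sent, date(2007, 10, 1), 3)
    print(volume, velocity(volume), rolling_mean(volume, 2))
```

Comparing each period against the rolling baseline, rather than against a fixed threshold, is one way to absorb vacations and seasonal swings.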
13
Volume & Velocity
Key issue -- determining the optimum time intervals to sample the data
Continuous monitoring cannot be continuous in terms of sampling in real time
Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives
Optimum time interval could vary by job title
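The effect of the sampling interval is easy to see by bucketing the same timestamps at different granularities. The sketch below is illustrative only; the function name and the three interval labels are assumptions, and the week label uses ISO week numbering.

```python
from collections import Counter
from datetime import datetime

def bucket_counts(timestamps, interval):
    """Count emails per bucket, where interval is 'hour', 'day', or 'week'."""
    def label(ts):
        if interval == "hour":
            return ts.strftime("%Y-%m-%d %H")
        if interval == "day":
            return ts.strftime("%Y-%m-%d")
        if interval == "week":
            iso = ts.isocalendar()  # (ISO year, ISO week, ISO weekday)
            return f"{iso[0]}-W{iso[1]:02d}"
        raise ValueError(f"unknown interval: {interval}")

    return Counter(label(ts) for ts in timestamps)

if __name__ == "__main__":
    stamps = [
        datetime(2007, 10, 1, 9, 5),
        datetime(2007, 10, 1, 9, 45),
        datetime(2007, 10, 1, 17, 0),
        datetime(2007, 10, 4, 8, 0),
    ]
    for interval in ("hour", "day", "week"):
        print(interval, dict(bucket_counts(stamps, interval)))
```

Finer buckets produce spikier series and therefore more apparent anomalies, which is consistent with the false-positive concern above.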
14
Social Network Analysis
15
Social relationships as an undirected graph
Importance of understanding relationships within the flow of email exchanges
16
Social Network Analysis in Emails
Emails are semi-structured data
  sender
  primary recipient(s)
  copied recipient(s)
  date
  subject line
Social groups and cliques
CA = who doesn’t belong?
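The header fields listed above are enough to build a simple undirected, weighted graph of who exchanges email with whom, and a first-cut "who doesn't belong?" check can ask whether a new message reaches anyone the sender has never corresponded with. This is a minimal sketch; the dictionary keys (`sender`, `to`, `cc`), function names, and addresses are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def build_graph(emails):
    """Undirected weighted graph: graph[a][b] = messages exchanged between a and b."""
    graph = defaultdict(lambda: defaultdict(int))
    for message in emails:
        for recipient in chain(message["to"], message["cc"]):
            graph[message["sender"]][recipient] += 1
            graph[recipient][message["sender"]] += 1
    return graph

def unfamiliar_recipients(graph, message):
    """Recipients with no prior edge to the sender in the historical graph."""
    known = graph.get(message["sender"], {})
    return [r for r in chain(message["to"], message["cc"]) if r not in known]

if __name__ == "__main__":
    history = [
        {"sender": "ann@example.com", "to": ["bob@example.com"], "cc": []},
        {"sender": "bob@example.com", "to": ["ann@example.com"], "cc": ["cy@example.com"]},
    ]
    graph = build_graph(history)
    new_message = {
        "sender": "ann@example.com",
        "to": ["bob@example.com"],
        "cc": ["outsider@example.net"],
    }
    print(unfamiliar_recipients(graph, new_message))
```

Detecting groups and cliques proper would need a graph library or a dedicated algorithm; the neighbor check here is only the simplest possible screen.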
17
Thread Analysis – This?
[Diagram: email thread laid out along a Time axis, with nodes labeled S, R, and C]
18
Thread Analysis – Or this?
[Diagram: alternative layout of the email thread along a Time axis, with nodes labeled S, R, and C]
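When messages carry reply references, such as the standard `Message-ID` and `In-Reply-To` header fields, one way to decide between competing readings of a thread is to follow those references back to a root message. The sketch below assumes each message has already been parsed into a dictionary with hypothetical keys `id` and `in_reply_to`; it is an illustration, not the analysis used in the presentation.

```python
def thread_roots(messages):
    """Map each message id to the id of the root message of its thread."""
    parent = {m["id"]: m.get("in_reply_to") for m in messages}

    def root(message_id):
        seen = set()  # guards against accidental reference cycles
        while parent.get(message_id) in parent and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    return {message_id: root(message_id) for message_id in parent}

if __name__ == "__main__":
    msgs = [
        {"id": "m1"},
        {"id": "m2", "in_reply_to": "m1"},
        {"id": "m3", "in_reply_to": "m2"},
        {"id": "m4", "in_reply_to": "missing"},
    ]
    print(thread_roots(msgs))
```

A reply whose parent is missing from the archive is treated as its own root, which is a common situation in partial collections.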
19
Integrating Content Analysis and Social Network Analysis
20
Challenges of Email Mining
Textual
  Inconsistent use of abbreviations
  Misspelled words
  Smileys, etc.
Replies, replies, and more replies…
Inability to identify:
  Identities of email participants (e.g., anon@anon.mail.sender.net)
  Roles and responsibilities
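The "replies, replies" problem means the same text is counted many times unless quoted material is stripped before content analysis. The sketch below shows one simple heuristic, assuming replies are quoted with leading `>` characters or follow an "Original Message" separator line; real mail clients vary widely, so this is illustrative rather than complete.

```python
import re

# Separator some mail clients insert above forwarded or quoted text.
ORIGINAL_MESSAGE = re.compile(r"\s*-+\s*Original Message\s*-+", re.IGNORECASE)

def strip_quoted_replies(body):
    """Keep only the new text of a message, dropping quoted earlier messages."""
    kept = []
    for line in body.splitlines():
        if ORIGINAL_MESSAGE.match(line):
            break  # everything below the separator is the earlier message
        if line.lstrip().startswith(">"):
            continue  # skip lines quoted with '>'
        kept.append(line)
    return "\n".join(kept).strip()

if __name__ == "__main__":
    body = (
        "Sounds good.\n"
        "> Can we meet Friday?\n"
        "\n"
        "-----Original Message-----\n"
        "From: someone\n"
    )
    print(strip_quoted_replies(body))
```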
21
What Do the Enron Emails Show?
People do say the darnedest things
What did he know and when did he know it?
Verified numerous bodies of email data mining research
  Content analysis
  Social network analysis
22
Tools
Content monitoring
  eSoft Corporation’s ThreatWall
  Symantec’s Mail Security 8x00 Series
  Vericept Corporation’s Vericept Content 360º
  Reconnex Corporation’s iGuard Appliance
  InBoxer, Inc. Anti-Risk Appliance
Social networks
  Microsoft SNARF
  Heer’s Vizster
23
Research Opportunities
24
Research Questions
Role of email monitoring in overall CA environment?
Join SNA with examination of textual patterns
Link SNA with control environment
  Frauds/control overrides footprint?
What email cleaning is required for CA purposes?
Privacy and policy issues?
Lessons from existing commercial products?
25
Your Questions
Thank You
glen.gray@csun.edu
rogersd@hawaii.edu