Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset


1 Strategies for Cleaning Organizational Emails with an Application to Enron Email Dataset
- Yingjie Zhou, Research Assistant, RPI
- Mark Goldberg, Professor, RPI
- Malik Magdon-Ismail, Associate Professor, RPI
- William A. Wallace, Professor, RPI
Supported by NSF Grants #0324947, #0323324, #0634875, #0522672, and by ONR Grant #N00014-06-1-0466

2 Outline (6/8/2007, NAACSOS 2007)
- Introduction
- Properties of Organizational Emails
- Difficulties in Cleaning Organizational Emails
- Procedures of Cleaning Organizational Emails
- Introduction to Enron Email Dataset
- Application of Cleaning Procedures to Enron Email Dataset
- Results
- Conclusions and Future Work

3 Introduction
- Emails
- Organizational emails
  - Inter-organizational emails
  - Intra-organizational emails
- The features of organizational email data give it potential for various studies.
- Email data has its own problems and is noisy.

4 Properties of Organizational Emails
- Emails are formatted, and the format is usually well defined and followed.
- Emails are normally stored on a server and can be easily collected.
- Emails are unobtrusive.
- Emails are time stamped.
In addition:
- The senders and recipients of the emails are employees of the organization.
- Each employee is normally assigned one or more unique email addresses within the organizational domain.

5 Difficulties in Cleaning Organizational Emails
- Multiple email addresses, names, or IDs exist for the same person.
- Duplicate emails exist.
- The content of the email is difficult to extract.

6 Procedures of Cleaning Organizational Emails
- Map aliases to employees
  - Parse last name, first name, and email ID in headers
[Diagram labels: Organizational Email Dataset; Raw Formats and Extracted Formats for Employee 1, Employee 2, ..., Employee N; Generalized Formats]
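The parsing step on this slide could look roughly like the sketch below. It is not the authors' code: the function names `parse_name` and `email_id` are hypothetical, and the rules (treat a comma as "Last, First", drop parenthetical tags, ignore middle initials) are assumptions based on the header formats quoted elsewhere in the deck.

```python
import re

QUOTES = "\"'\u201c\u201d"  # straight and curly quotes seen around names

def parse_name(raw):
    """Return (first, last) in lowercase from 'Last, First M.' or 'First M Last'."""
    s = raw.strip().strip(QUOTES).lower()
    s = re.sub(r"\(.*?\)", " ", s)  # drop tags such as "(Legal)"
    if "," in s:
        # Assumed convention: text before the comma is the last name.
        last, _, rest = s.partition(",")
        tokens = rest.replace(".", " ").split()
        first = tokens[0] if tokens else ""
        return first, last.strip()
    tokens = s.replace(".", " ").split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    return tokens[0], tokens[-1]  # middle names and initials are ignored

def email_id(address):
    """Return the part of an address before '@', lowercased."""
    return address.strip().strip(QUOTES).lower().partition("@")[0]

print(parse_name("Allen, Phillip K."))
print(parse_name("phillip k allen"))
print(email_id("pallen@ect.enron.com"))
```

Both name spellings reduce to the same (first, last) pair, which is what makes a later alias-to-employee mapping possible.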

7 Procedures of Cleaning Organizational Emails (Cont’d)
- Remove duplicate emails
  - content + date + recipients
- Consolidate date and time
  - Convert to machine time
- Extract email content
  - Signatures
  - Features of parent email message
  - Greetings and names
[Diagram labels: Organizational Email Dataset, Generalized Formats, Unique Message Email Dataset, Remove Duplicates, Employee Email Dataset, Cleaned Employee Email Dataset, Date & Time Consolidation, Content Extraction]
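A minimal sketch of the duplicate-removal and date-consolidation steps, assuming each message is a dict with hypothetical keys `body`, `date`, and `to`. The slide only says that duplicates are identified by content + date + recipients; the whitespace collapsing, lowercasing of recipients, and SHA-1 hashing are illustrative choices, not the authors' stated method.

```python
import hashlib
from email.utils import parsedate_to_datetime

def to_machine_time(date_header):
    """Convert an RFC 2822 Date header to seconds since the Unix epoch."""
    return int(parsedate_to_datetime(date_header).timestamp())

def dedup_key(body, date_header, recipients):
    """Build a key from content + date + recipients."""
    normalized_body = " ".join(body.split())  # assumption: ignore whitespace differences
    normalized_rcpts = ",".join(sorted(r.strip().lower() for r in recipients))
    payload = "\n".join(
        [normalized_body, str(to_machine_time(date_header)), normalized_rcpts]
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def remove_duplicates(messages):
    """Keep the first message seen for each key, preserving order."""
    seen = set()
    unique = []
    for m in messages:
        key = dedup_key(m["body"], m["date"], m["to"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique
```

Converting the date to machine time before hashing means two copies of a message whose Date headers express the same instant in different zones still collapse to one key.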

8 Introduction to Enron Email Dataset
- The Federal Energy Regulatory Commission (FERC) posted the Enron email dataset on the web in May 2002
  - 619,446 emails
- Professor Leslie Kaelbling of MIT purchased the dataset
- SRI: integrity and security
- Professor William W. Cohen: CMU dataset
  - 150 user folders
  - 517,431 emails
  - 400 MB

9 Introduction to Enron Email Dataset (Cont’d)
- Sender
- Receiver/Receivers
- Date + Time
- Subject
- Body
  - Forwarded or replied text
  - Signature
- Attachment
Sample message:
    Message-ID:
    Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
    From: eugenio.perez@enron.com
    To: sally.beck@enron.com
    Subject: Self Evaluation - Short Version
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Eugenio Perez
    X-To: Sally Beck
    X-cc:
    X-bcc:
    X-Folder: \Sally_Beck_Nov2001\Notes Folders\All documents
    X-Origin: BECK-S
    X-FileName: sbeck.nsf

    Please let me know if you need anything else.

    Regards,
    Eugenio
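As an illustration, the fields listed on this slide can be pulled out of a raw message with Python's standard `email` package. This is a sketch, not the tooling used in the study. The sample below is abridged from the slide: the Message-ID is left out because its value is missing in the transcript, the empty X- fields and X-Folder are omitted, and the line breaks in the body are assumed. The helper name `split_message` is hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

RAW = """\
Date: Thu, 30 Nov 2000 08:50:00 -0800 (PST)
From: eugenio.perez@enron.com
To: sally.beck@enron.com
Subject: Self Evaluation - Short Version
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Eugenio Perez
X-To: Sally Beck
X-Origin: BECK-S

Please let me know if you need anything else.

Regards,
Eugenio
"""

def split_message(raw):
    """Split a raw message into the fields listed on the slide."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        # getaddresses handles comma-separated recipient lists
        "recipients": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "x_from": msg["X-From"],
        "x_to": msg["X-To"],
        "body": msg.get_payload(),
    }

fields = split_message(RAW)
print(fields["sender"], fields["recipients"], fields["subject"])
```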

10 Introduction to Enron Email Dataset (Cont’d)
- From, To, Cc, Bcc
- X-From, X-To, X-cc, X-bcc
Example 1: davis-d\deleted_items\101
    From: dana.davis@enron.com
    To: dana.davis@enron.com
    X-From: Davis, Mark Dana
    X-To: Davis, Dana
Example 2: cash-m\sent_items\505
    From: michelle.cash@enron.com
    To: legal
    X-From: Cash, Michelle
    X-To: Taylor, Mark E (Legal)
Callouts on the examples: "Doesn’t make sense!" and "Wrong!"
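The two examples on this slide suggest simple sanity checks that could flag suspicious headers before alias mapping. The sketch below is an assumption about how such checks might be written, not a procedure described in the talk. `header_problems` is a hypothetical helper that takes a dict of header values.

```python
def header_problems(headers):
    """Return a list of human-readable problems found in one message's headers."""
    problems = []
    for field in ("From", "To"):
        value = headers.get(field, "")
        if "@" not in value:
            # e.g. "To: legal" in Example 2
            problems.append(f"{field} is not a full address: {value!r}")
    same_address = headers.get("From") == headers.get("To")
    different_names = headers.get("X-From") != headers.get("X-To")
    if same_address and different_names:
        # e.g. Example 1: one address, two different display names
        problems.append("From and To match but X-From and X-To differ")
    return problems

example1 = {
    "From": "dana.davis@enron.com",
    "To": "dana.davis@enron.com",
    "X-From": "Davis, Mark Dana",
    "X-To": "Davis, Dana",
}
example2 = {
    "From": "michelle.cash@enron.com",
    "To": "legal",
    "X-From": "Cash, Michelle",
    "X-To": "Taylor, Mark E (Legal)",
}
print(header_problems(example1))
print(header_problems(example2))
```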

11 Application of Cleaning Procedures to Enron Email Dataset
Name and address variants shown for one employee (diagram text, in slide order):
- phillip k allen; phillip allen; allen, phillip; allen, phillip k.
- phillip k allen; allen, phillip; allen, phillip k.
- phillip.k.allen@enron.com; phillip.allen@enron.com; pallen@enron.com; pallen70@hotmail.com; pallen@ect.enron.com; pallen@hotmail.com; pallen@enron.com
- "phillip allen"; "pallen@enron.com"; phillip; phillip allen; "allen, phillip k"; pallen@enron.com
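One way to use such a list of variants is as a lookup table from normalized alias to employee. The sketch below assumes a hypothetical employee key `allen-p` (modeled on the folder names `davis-d` and `cash-m` quoted in the deck) and a hypothetical `normalize` rule. The bare first name "phillip" is deliberately left out of the table because it could match more than one person.

```python
QUOTES = "\"'\u201c\u201d"  # straight and curly quotes

def normalize(raw):
    """Lowercase, strip quotes, collapse spaces; drop periods in names only."""
    s = raw.strip().strip(QUOTES).lower()
    if "@" not in s:
        s = s.replace(".", "")
    return " ".join(s.split())

# Assumption: the table is curated per employee from the observed variants.
ALIAS_TABLE = {
    "allen-p": [
        "phillip k allen", "phillip allen", "allen, phillip", "allen, phillip k.",
        "phillip.k.allen@enron.com", "phillip.allen@enron.com", "pallen@enron.com",
        "pallen70@hotmail.com", "pallen@ect.enron.com", "pallen@hotmail.com",
    ],
}

def build_index(table):
    """Invert the table into a normalized-alias -> employee mapping."""
    index = {}
    for employee, aliases in table.items():
        for alias in aliases:
            index[normalize(alias)] = employee
    return index

INDEX = build_index(ALIAS_TABLE)

def resolve(raw):
    """Return the employee key for a raw name or address, or None if unknown."""
    return INDEX.get(normalize(raw))

print(resolve("Allen, Phillip K."))
print(resolve("PAllen@ECT.enron.com"))
```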

12 Application of Cleaning Procedures to Enron Email Dataset (Cont’d)
- 150 folders => 156 employees
- 517,431 emails => 252,830 unique emails
- All emails are from the same time zone, and emails with wrong dates are discarded
- 22,241 emails among 156 employees from Nov. 1998 to Jun. 2002
- "Original Message", "Forwarded by", "Thanks", "Regards", etc.
- Signatures
    Susan S. Bailey
    Senior Legal Specialist
    Enron Wholesale Services
    Legal Department
    1400 Smith Street, Suite 3803A
    Houston, Texas 77002
    phone: (713) 853-4737
    fax: (713) 646-3490
    email: susan.bailey@enron.com
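The marker strings on this slide hint at how content extraction might work: cut the body at the first quoted-message marker, then at a closing line. The sketch below is one possible reading, not the authors' implementation. It assumes a sign-off counts only when "Thanks" or "Regards" stands alone on a line, and it does not detect signature blocks that lack such a sign-off.

```python
import re

# A line containing either marker starts the forwarded or replied text.
QUOTE_MARKER = re.compile(
    r"^.*(original message|forwarded by).*$", re.IGNORECASE | re.MULTILINE
)
# A line holding only a closing word (plus punctuation) starts the sign-off.
SIGNOFF = re.compile(
    r"^[ \t]*(thanks|regards)[ \t,.!]*$", re.IGNORECASE | re.MULTILINE
)

def extract_content(body):
    """Return the author's own text, without quoted parents or sign-off."""
    m = QUOTE_MARKER.search(body)
    if m:
        body = body[:m.start()]
    m = SIGNOFF.search(body)
    if m:
        body = body[:m.start()]
    return body.strip()

sample = "Please let me know if you need anything else.\n\nRegards,\nEugenio"
print(extract_content(sample))
```

Requiring the closing word to stand alone avoids truncating a message whose first sentence happens to begin with "Thanks".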

13 Conclusions and Future Work
Conclusions
- In general, the procedures are practical and served well in cleaning the Enron emails.
Future Work
- Name disambiguation
- Misdirected email detection
- Broadcast email removal
- Various analyses

14 Thank you! Any Comments?

