Strategies for Cleaning Organizational Emails with an Application to Enron Dataset

Yingjie Zhou, Research Assistant, RPI
Mark Goldberg, Professor, RPI
Malik Magdon-Ismail, Associate Professor, RPI
William A. Wallace, Professor, RPI

Supported by the NSF Grants # , # , # , # , and by the ONR Grant # N
6/8/2007 NAACSOS

Outline
- Introduction
- Properties of Organizational Emails
- Difficulties in Cleaning Organizational Emails
- Procedures of Cleaning Organizational Emails
- Introduction to Enron Dataset
- Application of Cleaning Procedures to Enron Dataset
- Results
- Conclusions and Future Work
Introduction
- Emails
  - Organizational emails
    - Inter-organizational emails
    - Intra-organizational emails
- The features of organizational email data make it a promising source for various studies
- Email data has its own problems and is noisy
Properties of Organizational Emails
- Emails are formatted, and the format is usually defined and followed.
- Emails are normally stored on a server and can be easily collected.
- Emails are unobtrusive.
- Emails are time stamped.
In addition:
- The senders and recipients of the emails are employees of the organization.
- Each employee is normally assigned one or more unique email addresses within the organizational domain.
Difficulties in Cleaning Organizational Emails
- Multiple email addresses, names, or IDs exist for the same person.
- Duplicate emails exist.
- The content of the email is difficult to extract.
Procedures of Cleaning Organizational Emails
- Map aliases to employees
  - Parse last name, first name, and email ID in email headers
[Diagram labels: Organizational Email Dataset; Raw Formats and Extracted Formats for each of Employee 1, Employee 2, ..., Employee N; Generalized Formats]
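The alias-to-employee mapping above can be pictured as a lookup table. The following is a minimal sketch only, not the authors' implementation: the class name `AliasTable`, its methods, and the sample IDs and addresses are all hypothetical, and the only normalization assumed is lowercasing and whitespace collapsing.

```python
class AliasTable:
    """Hypothetical lookup table from an alias (address, name, or ID) to one employee."""

    def __init__(self):
        self._owner = {}  # normalized alias -> employee id

    @staticmethod
    def _key(alias):
        # Assumption: aliases match regardless of case and extra whitespace.
        return " ".join(alias.lower().split())

    def register(self, employee_id, aliases):
        """Attach every alias to employee_id; refuse an alias owned by someone else."""
        for alias in aliases:
            key = self._key(alias)
            existing = self._owner.setdefault(key, employee_id)
            if existing != employee_id:
                raise ValueError(f"alias {alias!r} already belongs to {existing!r}")

    def resolve(self, alias):
        """Return the employee id for an alias, or None if the alias is unknown."""
        return self._owner.get(self._key(alias))


# Made-up sample data, for illustration only.
table = AliasTable()
table.register("emp-001", ["Sample.Person@example.com", "person, sample", "sperson"])
table.register("emp-002", ["other.person@example.com"])
```

Raising an error on a conflicting alias is a design choice of this sketch: an alias claimed by two employees is a case for manual review rather than silent reassignment.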
Procedures of Cleaning Organizational Emails (Cont'd)
- Remove duplicate emails
  - content + date + recipients
- Consolidate date and time
  - Convert to machine time
- Extract content
  - Signatures
  - Features of parent message
  - Greetings and names
[Diagram labels: Organizational Email Dataset, Generalized Formats, Remove Duplicates, Unique Message Dataset, Date & Time Consolidation, Employee Email Dataset, Content Extraction, Cleaned Employee Email Dataset]
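Duplicate removal on content + date + recipients, with the date converted to machine time, might look like the sketch below. It assumes each message is a dict with hypothetical keys `body`, `date`, and `to`, that dates are RFC 2822 header strings, and that whitespace differences and recipient order should not matter. The slides do not state how the comparison was actually done.

```python
import hashlib
from email.utils import parsedate_to_datetime


def to_machine_time(date_header):
    """Convert an RFC 2822 Date header to integer seconds since the Unix epoch."""
    return int(parsedate_to_datetime(date_header).timestamp())


def dedup_key(body, date_header, recipients):
    """Build a fingerprint from content + date + recipients."""
    normalized_body = " ".join(body.split())  # collapse whitespace
    normalized_recipients = ",".join(sorted(r.strip().lower() for r in recipients))
    raw = f"{to_machine_time(date_header)}|{normalized_recipients}|{normalized_body}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def remove_duplicates(messages):
    """Keep the first message seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for msg in messages:
        key = dedup_key(msg["body"], msg["date"], msg["to"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Made-up sample data: the first two entries are copies stored in different folders.
sample = [
    {"body": "Meeting at noon.", "date": "Mon, 01 Jan 2001 10:00:00 -0800",
     "to": ["a@example.com", "b@example.com"]},
    {"body": "Meeting  at noon.\n", "date": "Mon, 01 Jan 2001 10:00:00 -0800",
     "to": ["B@example.com", "a@example.com"]},
    {"body": "Meeting moved to 1pm.", "date": "Mon, 01 Jan 2001 11:00:00 -0800",
     "to": ["a@example.com"]},
]
unique_messages = remove_duplicates(sample)
```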
Introduction to Enron Dataset
- The Federal Energy Regulatory Commission (FERC) posted the Enron email dataset on the web in May of [...]
  - [...],446 emails
- Professor Leslie Kaelbling from MIT purchased the dataset
- SRI: integrity and security
- Professor William W. Cohen: CMU email dataset
  - 150 user folders
  - 517,431 emails
  - 400 MB
Introduction to Enron Dataset (Cont'd)
Parts of an email:
- Sender
- Receiver/Receivers
- Date + Time
- Subject
- Body
  - ? Forwarded or replied text
  - ? Signature
- Attachment

Sample message:
Message-ID:
Date: Thu, 30 Nov :50: (PST)
From:
To:
Subject: Self Evaluation - Short Version
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Eugenio Perez
X-To: Sally Beck
X-cc:
X-bcc:
X-Folder: \Sally_Beck_Nov2001\Notes Folders\All documents
X-Origin: BECK-S
X-FileName: sbeck.nsf

Please let me know if you need anything else.

Regards,
Eugenio
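A message in this layout can be split into header fields and body with Python's standard `email` package. The sketch below is illustrative only: the function name `parse_message`, the returned keys, and the sample message (with `example.com` addresses) are assumptions made for this example, not data from the Enron dataset.

```python
from email import message_from_string
from email.utils import getaddresses

# Made-up message that follows the same header layout as the sample above.
RAW = """\
Message-ID: <example-1@example.com>
Date: Mon, 01 Jan 2001 10:00:00 -0800 (PST)
From: sender@example.com
To: first.recipient@example.com, second.recipient@example.com
Subject: Status
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-From: Sender, Sample
X-To: Recipient, First
X-cc:
X-bcc:
X-Origin: SAMPLE-S

Please let me know if you need anything else.
"""


def parse_message(raw):
    """Return the sender, recipients, date, subject, X- fields, and body of one message."""
    msg = message_from_string(raw)
    return {
        "from": msg.get("From", "").strip(),
        # getaddresses splits a comma-separated address list into (name, address) pairs.
        "to": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "x_from": msg.get("X-From", "").strip(),
        "x_to": msg.get("X-To", "").strip(),
        "date": msg.get("Date", "").strip(),
        "subject": msg.get("Subject", "").strip(),
        "body": msg.get_payload(),  # plain string for a non-multipart message
    }


parsed = parse_message(RAW)
```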
Introduction to Enron Dataset (Cont'd)
Header fields: From, To, Cc, Bcc and X-From, X-To, X-cc, X-bcc

Example 1: davis-d\deleted_items\101
From:
To:
X-From: Davis, Mark Dana
X-To: Davis, Dana

Example 2: cash-m\sent_items\505
From:
To: legal
X-From: Cash, Michelle
X-To: Taylor, Mark E (Legal)

Callouts on the slide: "Doesn't make sense!" and "Wrong!"
Application of Cleaning Procedures to Enron Dataset
Name formats listed on the slide for one employee:
- phillip k allen
- phillip allen
- allen, phillip
- allen, phillip k.
- phillip k allen
- allen, phillip
- allen, phillip k.
- “phillip allen”
- phillip
- phillip allen
- “allen, phillip k”
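One way to reduce name variants like these to a common form is sketched below. This is an assumption-laden illustration, not the procedure used in the talk: `normalize_name` and `group_aliases` are hypothetical helpers, middle initials are simply dropped, and a bare first name is kept apart because it cannot be tied to a last name without further evidence (name disambiguation is listed as future work).

```python
import re
from collections import defaultdict


def normalize_name(raw):
    """Reduce a raw name string to a (first, last) pair in lowercase."""
    # Drop straight and curly double quotes, then treat periods as spaces.
    cleaned = re.sub(r'["\u201c\u201d]', "", raw).lower().replace(".", " ")
    if "," in cleaned:
        # "last, first [middle]" layout
        last_part, _, given_part = cleaned.partition(",")
        given = given_part.split()
        last = " ".join(last_part.split())
    else:
        # "first [middle] last" layout, or a single token
        tokens = cleaned.split()
        if len(tokens) > 1:
            given, last = tokens[:-1], tokens[-1]
        else:
            given, last = tokens, ""
    first = given[0] if given else ""
    return first, last


def group_aliases(raw_names):
    """Group raw name strings by their normalized (first, last) pair."""
    groups = defaultdict(set)
    for raw in raw_names:
        groups[normalize_name(raw)].add(raw)
    return dict(groups)


variants = [
    "phillip k allen", "phillip allen", "allen, phillip",
    "allen, phillip k.", "\u201cphillip allen\u201d", "phillip",
]
groups = group_aliases(variants)
```

With these inputs, every full-name variant falls under the key `("phillip", "allen")`, while the bare `phillip` stays under `("phillip", "")`.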
Application of Cleaning Procedures to Enron Dataset (Cont'd)
- 150 folders => 156 employees
- 517,431 emails => 252,830 unique emails
- All emails are from the same time zone, and emails with wrong dates are discarded
- 22,241 emails among 156 employees from Nov – Jun
- Content markers: "Original Message", "Forwarded by", "Thanks", "Regards", etc.
- Signatures, for example:
    Susan S. Bailey
    Senior Legal Specialist
    Enron Wholesale Services
    Legal Department
    1400 Smith Street, Suite 3803A
    Houston, Texas
    phone: (713)
    fax: (713)
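The markers listed above suggest a simple cut-off rule for content extraction. The sketch below is one possible reading, not the authors' code: it keeps the text before the first quoted or forwarded block, and before the first line that consists only of a sign-off word. The function name `extract_content`, both regular expressions, and the sample text are assumptions.

```python
import re

# Assumption: these phrases mark the start of replied or forwarded text.
QUOTE_MARKERS = re.compile(r"original message|forwarded by", re.IGNORECASE)
# Assumption: a line holding only a sign-off word starts the signature.
SIGN_OFF = re.compile(r"^\s*(thanks|regards)\s*[,.!]?\s*$", re.IGNORECASE)


def extract_content(body):
    """Return the author's own text, without the signature and parent message."""
    kept = []
    for line in body.splitlines():
        if QUOTE_MARKERS.search(line) or SIGN_OFF.match(line):
            break  # everything from here on is signature or parent message
        kept.append(line)
    return "\n".join(kept).strip()


# Made-up message body, for illustration only.
sample_body = """Please review the draft before Friday.

Regards,
Sample Person
Sample Department

-----Original Message-----
From: Someone Else
Subject: Draft

Here is the draft.
"""
content = extract_content(sample_body)
```

Matching the sign-off only when it fills a whole line avoids cutting a sentence such as "Thanks for the update, I will reply tomorrow." in the middle of the message.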
Conclusions and Future Work
Conclusions
- In general, the procedures are practical and served well in cleaning the Enron emails.
Future Work
- Name disambiguation
- Misdirected email detection
- Broadcast email removal
- Various analyses
Thank you! Any Comments?