Email Classification Results for Folder Classification on the Enron Dataset
Overall Goals: To help users manage large volumes of email … by helping them to sort their email into folders.
Immediate Goals: To establish a credible test corpus. To create baseline results for classification. To analyze possible future techniques.
The “Enron” Corpus: Previous email classification experiments have used “toy” collections. Enron emails were collected from actual business users. They were made public through legal proceedings.
The Enron Corpus: 158 users. 200,399 emails. Average of 757 emails per user.
Enron Data Analysis: Most users do use folders to classify their email. Some users with many emails still have few folders. Users with more emails tend to have more emails in each folder.
Email Representation: From. To, CC. Subject. Body. Date/Time? Thread? Attachments? etc. …?
Approaches: Using a bag of words. Pipeline: email data → “bag of words” → SVM → classification decision.
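The bag-of-words pipeline above merges all parts of a message into one token-count vector before it reaches the classifier. A minimal sketch of that representation step follows. The field names (`from`, `to_cc`, `subject`, `body`), the tokenizer, and the sample message are assumptions made for illustration, not details taken from the slides.

```python
import re
from collections import Counter

# Hypothetical field names; the slides list From, To/CC, Subject and Body.
FIELDS = ("from", "to_cc", "subject", "body")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(message):
    """Merge every field of one message into a single token-count vector."""
    counts = Counter()
    for field in FIELDS:
        counts.update(tokenize(message.get(field, "")))
    return counts


# Made-up message, used only to show the shape of the output.
msg = {
    "from": "alice@example.com",
    "to_cc": "bob@example.com",
    "subject": "Gas contract",
    "body": "Please review the gas contract today.",
}
bow = bag_of_words(msg)
```

Because every field feeds the same counter, a word in the subject and the same word in the body become indistinguishable, which is the property the per-field approach is meant to avoid.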
Approaches: Using separate SVMs for each section. Pipeline: email data → SVMs (one per section) → LLSF → classification decision.
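The per-section approach needs a step that turns several SVM scores into one score per folder. One way to picture the LLSF (linear least squares fit) step is as an ordinary least-squares fit of combination weights, sketched below in plain Python. The score values, the +1/-1 label encoding, and the function names are assumptions for illustration. The slides do not say how the regression was set up.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations touch both sides at once.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares_weights(scores, labels):
    """Fit weights w minimizing the sum of (scores[i] . w - labels[i])**2."""
    k = len(scores[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(row[i] * row[j] for row in scores) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(scores, labels)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def combine(weights, field_scores):
    """Weighted sum of the per-field SVM scores for one message."""
    return sum(w * s for w, s in zip(weights, field_scores))


# Made-up per-field SVM scores (from, subject, body) for four messages,
# with +1 meaning "in the folder" and -1 meaning "not in the folder".
scores = [[0.9, 0.2, 1.1], [0.4, -0.3, 0.8], [-0.7, 0.1, -1.0], [-0.2, -0.5, -0.6]]
labels = [1, 1, -1, -1]
weights = least_squares_weights(scores, labels)
combined = [combine(weights, row) for row in scores]
```

This sketch has no regularization and fails with a division by zero when the score columns are linearly dependent. A practical version would guard against that.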
Approach: Data was split in half, chronologically. A “flat” (not hierarchical) approach was used. An SVM was trained for each folder, for each user, for each field. The SVM for each folder was trained using all of the emails for that user. Combination weights were found with a regression for each folder. Thresholding was performed for optimal F1 score, using the “scut” method.
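Two of the steps above, the chronological split and the per-folder threshold chosen to maximize F1, can be sketched in a few lines. This is an illustration under stated assumptions. The `timestamp` key, the choice of candidate thresholds (every observed score), and the sample numbers are hypothetical, and the slides do not say which portion of the data the thresholds were tuned on.

```python
def chronological_split(messages):
    """Sort by timestamp; the earlier half trains, the later half tests."""
    ordered = sorted(messages, key=lambda m: m["timestamp"])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for one folder."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def scut_threshold(scores, in_folder):
    """Pick the score cutoff for one folder that maximizes F1."""
    best_threshold, best_f1 = None, -1.0
    # Assumed candidate set: every distinct score seen for this folder.
    for candidate in sorted(set(scores)):
        predicted = [s >= candidate for s in scores]
        f1 = f1_score(predicted, in_folder)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Made-up combined scores for one folder and the true memberships.
scores = [0.9, 0.7, 0.4, 0.2, -0.1]
in_folder = [True, True, False, True, False]
threshold, f1 = scut_threshold(scores, in_folder)
```

Because the threshold is tuned separately for each folder, a rare folder can end up with a very different cutoff from a common one.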
“Enron” Results Analysis: Some data fields are clearly more useful than others. Unsurprisingly, the “To, CC” data is the least useful. Body is the most useful field, followed closely by sender. Using all fields works better than using any single field alone. Linearly combining fields works better than the bag-of-words approach. Because it’s an SVM, the linear weights are not directly interpretable.
Enron Results Analysis: F1 classification score is unrelated to the number of emails a user has.
Enron Results Analysis: F1 score is somewhat correlated with the number of folders a user has. Emails are much harder to classify for users with many folders.
Enron Thread Analysis: 200,399 messages. 101,786 threads. 30,091 non-trivial threads. 61.63% of messages are in non-trivial threads. Among non-trivial threads: average of 4.1 messages/thread, median of 2 messages/thread.
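Thread statistics like these can be computed from a list of thread sizes, as in the sketch below. It assumes a "non-trivial" thread is one with more than one message, which the slides do not define, and the sample sizes are made up. The last lines check how the reported figures relate: 61.63% of 200,399 messages spread over 30,091 non-trivial threads gives roughly the reported 4.1 messages per thread.

```python
from statistics import mean, median


def thread_statistics(thread_sizes):
    """Summarize threads given the number of messages in each one."""
    total_messages = sum(thread_sizes)
    # Assumption: a non-trivial thread has more than one message.
    nontrivial = [size for size in thread_sizes if size > 1]
    return {
        "messages": total_messages,
        "threads": len(thread_sizes),
        "nontrivial_threads": len(nontrivial),
        "pct_in_nontrivial": 100.0 * sum(nontrivial) / total_messages,
        "mean_nontrivial_size": mean(nontrivial),
        "median_nontrivial_size": median(nontrivial),
    }


# Made-up thread sizes, used only to show the output.
stats = thread_statistics([1, 1, 1, 2, 2, 5])

# Consistency check on the figures reported in the slides.
implied_mean = 0.6163 * 200_399 / 30_091
```

The function raises an error if no thread has more than one message, since the mean of an empty list is undefined.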
Enron Thread Analysis: The largest threads are potentially the most useful, but they are also the least common. Threads are also redundant with other kinds of evidence. Since threads are detected by subject and sender, much of the thread information is redundant. Also, emails in the same thread tend to have similar bodies. The largest thread in the Enron corpus is 1124 copies of the same message … all in the “Deleted Items” folder for a particular user!
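To make the redundancy point concrete, here is a rough sketch of thread detection keyed on subject and the people involved. The slides say only that threads are detected "by subject and sender". Stripping "Re:"/"Fw:" prefixes, matching on the full participant set, and the `from`/`to`/`subject` keys are all assumptions introduced here, not the authors' procedure.

```python
import re
from collections import defaultdict

# Assumed normalization: strip leading reply/forward markers such as "Re:".
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Drop reply/forward prefixes and case so replies share one subject."""
    return PREFIX.sub("", subject).strip().lower()


def group_threads(messages):
    """Group messages whose normalized subject and participant set match."""
    threads = defaultdict(list)
    for msg in messages:
        # Assumption: sender plus recipients identify the conversation,
        # so a reply (sender and recipient swapped) lands in the same thread.
        participants = frozenset([msg["from"], *msg["to"]])
        key = (normalize_subject(msg["subject"]), participants)
        threads[key].append(msg)
    return list(threads.values())


# Made-up messages, used only to show the grouping.
messages = [
    {"from": "a@example.com", "to": ["b@example.com"], "subject": "Gas deal"},
    {"from": "b@example.com", "to": ["a@example.com"], "subject": "RE: Gas deal"},
    {"from": "a@example.com", "to": ["c@example.com"], "subject": "Lunch"},
]
thread_sizes = sorted(len(t) for t in group_threads(messages))
```

Since the thread key is built entirely from the subject and address fields, a classifier that already sees those fields gains little new evidence from thread membership.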