Email Classification Results for Folder Classification on Enron Dataset.


Overall Goals  To help users manage large volumes of email …  … by helping them to sort their email into folders.

Immediate Goals  To establish a credible email test corpus  To create baseline results for email classification  To analyze possible future techniques

The “Enron Email” Corpus  Previous email classification experiments have used “toy” collections.  Enron emails were collected from actual business users.  Made public through legal proceedings.

The Enron Corpus  158 users  200,399 emails  Average of 757 emails per user

Enron Data Analysis  Most users do use folders to classify their email.  Some users with many emails still have few folders.  Users with more emails tend to have more emails in each folder.
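Per-user statistics like these can be computed directly from (user, folder) records. A minimal sketch, with invented user and folder names standing in for real Enron mailboxes:

```python
from collections import defaultdict

# Toy records, one (user, folder) pair per email; the names are
# hypothetical, not taken from the corpus.
emails = [
    ("userA", "inbox"), ("userA", "legal"), ("userA", "legal"),
    ("userB", "inbox"), ("userB", "research"),
]

folders_per_user = defaultdict(set)
emails_per_user = defaultdict(int)
for user, folder in emails:
    folders_per_user[user].add(folder)
    emails_per_user[user] += 1

for user in sorted(emails_per_user):
    n_mail = emails_per_user[user]
    n_fold = len(folders_per_user[user])
    print(f"{user}: {n_mail} emails, {n_fold} folders, "
          f"{n_mail / n_fold:.1f} emails/folder")
```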

Email Representation  From  To, CC  Subject  Body  Date/Time?  Thread?  Attachments?  etc.?
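These fields can be pulled out of a raw message with Python's standard-library `email` module. A small sketch; the message itself is invented for illustration:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A toy message; addresses and content are illustrative, not from the corpus.
raw = """\
From: user.one@enron.com
To: user.two@enron.com
Cc: user.three@enron.com
Subject: Gas contract draft
Date: Mon, 14 May 2001 09:30:00 -0700

Draft attached. Comments welcome by Friday.
"""

msg = message_from_string(raw)
fields = {
    "from": msg["From"],                       # sender
    "to_cc": f"{msg['To']}, {msg['Cc']}",      # recipients, merged as one field
    "subject": msg["Subject"],
    "body": msg.get_payload(),                 # plain-text body
    "date": parsedate_to_datetime(msg["Date"]),
}
print(fields["from"], "->", fields["subject"])
```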

Approaches  Using a single bag-of-words SVM  data → “bag of words” → SVM → classification decision
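The bag-of-words pipeline can be sketched with scikit-learn (my choice of library, not the authors'); the training emails and folder names below are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented training emails (all fields concatenated into one text) and folders.
texts = [
    "gas contract draft attached please review",
    "draft of the power purchase contract",
    "lunch on friday before the game",
    "are you free for dinner saturday",
]
folders = ["legal", "legal", "personal", "personal"]

vectorizer = CountVectorizer()            # the "bag of words" step
X = vectorizer.fit_transform(texts)       # data -> term-count vectors
clf = LinearSVC().fit(X, folders)         # SVM -> classification decision

new = vectorizer.transform(["please review the contract draft"])
print(clf.predict(new)[0])
```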

Approaches  Using separate SVMs for each section  data → per-section SVMs → LLSF → classification decision

Approach  Data was split in half, chronologically.  A “flat” approach was used (not hierarchical).  An SVM was trained for each folder, for each user, for each field.  The SVM for each folder was trained using all of the emails for that user.  Combination weights were found with a regression for each folder.  Thresholding was performed for optimal F1 score, using the “scut” method.
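The score-based thresholding step can be sketched as follows. The function name and example numbers are illustrative; the scores stand in for the regression-combined per-field SVM outputs for one folder:

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def scut_threshold(scores, labels):
    """Pick the per-folder score threshold that maximizes F1 on held-out data."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_f1, best_t = score, t
    return best_t

# Hypothetical combined scores for one folder, with true membership labels.
scores = [0.9, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
print(scut_threshold(scores, labels))
```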

“Enron” Results Analysis  Obviously some data fields are more useful than others.  Unsurprisingly, the “To, CC” data is the least useful.  Body is the most useful field, followed closely by sender.  Using all fields works better than using any particular field alone.  Linearly combining fields works better than the bag-of-words approach.  Because it’s an SVM, the linear weights are not directly interpretable.

Enron Results Analysis  F1 classification score is unrelated to the number of emails a user has.

Enron Results Analysis  F1 score is somewhat correlated with the number of folders a user has.  Emails are much harder to classify for users with many folders.

Enron Thread Analysis  200,399 messages  101,786 threads  30,091 non-trivial threads  61.63% of messages are in non-trivial threads  Average of 4.1 messages/thread  Median of 2 messages/thread
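Given a thread index, these statistics are straightforward to compute. A toy sketch with invented threads (in the corpus, threads were detected from subject and sender):

```python
from statistics import mean, median

# Hypothetical thread index: normalized subject line -> message ids.
threads = {
    "meeting notes": ["m1", "m2", "m3", "m4"],
    "re: gas deal": ["m5", "m6"],
    "fw: holiday schedule": ["m7"],    # trivial thread: a single message
}

sizes = [len(msgs) for msgs in threads.values()]
nontrivial = [s for s in sizes if s > 1]
total_msgs = sum(sizes)
in_nontrivial = sum(nontrivial)

print(f"{len(threads)} threads, {len(nontrivial)} non-trivial")
print(f"{100 * in_nontrivial / total_msgs:.2f}% of messages in non-trivial threads")
print(f"mean {mean(nontrivial):.1f}, median {median(nontrivial)} messages/thread")
```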

Enron Thread Analysis  The largest threads are potentially the most useful. But the largest threads are also the least common.  Threads are redundant with other kinds of evidence: since threads are detected by subject and sender, much of the thread information is already captured by those fields, and emails in the same thread tend to have similar bodies.  The largest thread in the Enron corpus is 1,124 copies of the same message … all in the “Deleted Items” folder of a single user!