Characteristic Identifier Scoring and Clustering for Email Classification By Mahesh Kumar Chhaparia.


Email Clustering
Given a set of unclassified emails, the objective is to produce high-purity clusters while keeping the training requirements low.
Outline:
– Characteristic Identifier Scoring and Clustering (CISC): Identifier Set, Scoring, Clustering, Directed Training
– Comparison of CISC with some of the traditional ideas in clustering
– Comparison of CISC with POPFile (Naïve Bayes classifier)
– Caveats
– Conclusion

Evaluation
Evaluation on the Enron dataset for the following users (purity measured w.r.t. the grouping already available):
Lokay-M, Beck-S, Sanders-R, Williams-w, Farmer-D, Kitchen-L, Kaminski-V
Table columns: User, Number of folders, Number of messages, Messages in smallest folder, Messages in largest folder (the per-user values did not survive in this transcript).
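The slides report purity against the existing folder grouping but do not spell out the formula. A minimal sketch, assuming the standard definition (each cluster is credited with its most common folder, and the credits are summed and divided by the number of messages), might look like this. The message ids and folder names are made up for illustration.

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of messages that fall in the majority folder of their cluster.

    clusters: list of lists of message ids
    labels:   dict mapping message id -> folder name (the existing grouping)
    """
    total = sum(len(cluster) for cluster in clusters)
    if total == 0:
        return 0.0
    majority = 0
    for cluster in clusters:
        if cluster:
            # Size of the largest single-folder group inside this cluster.
            majority += Counter(labels[m] for m in cluster).most_common(1)[0][1]
    return majority / total

# Hypothetical data: two clusters over five messages.
clusters = [["m1", "m2", "m3"], ["m4", "m5"]]
labels = {"m1": "A", "m2": "A", "m3": "B", "m4": "B", "m5": "B"}
score = purity(clusters, labels)  # (2 + 2) / 5
```

Under this definition many small clusters score well trivially, which is why the later slides caution that cluster counts differ between methods.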

CISC: Identifier Set
– Sender and recipients
– Words from the subject starting with an uppercase letter
– Tokens from the message body:
  – Word sequences in which each word starts with an uppercase letter (length [2,5] only), split about stopwords (excluding them)
  – Acronyms (length [2,5] only)
  – Words followed by an apostrophe and ‘s’, e.g. TW’s is extracted as TW
  – Words or phrases in quotes, e.g. “Trans Western”
  – Words in which any character other than the first is uppercase, e.g. eSpeak, ThinkBank
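The identifier rules above can be approximated with a few regular expressions. This is only a sketch under stated assumptions, not the author's implementation: the stopword list is a tiny placeholder, "length [2,5]" is read as 2 to 5 words for sequences and 2 to 5 letters for acronyms, and the function names and the sample message are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption; the slides do not give one).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def capitalized_runs(text, min_len=2, max_len=5):
    """Runs of consecutive capitalized words, broken at stopwords and punctuation."""
    runs, current = [], []
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
    for token in tokens + ["."]:  # trailing sentinel flushes the last run
        if token[0].isupper() and token.lower() not in STOPWORDS:
            current.append(token)
            continue
        if min_len <= len(current) <= max_len:
            runs.append(" ".join(current))
        current = []
    return runs

def extract_identifiers(sender, recipients, subject, body):
    ids = {sender, *recipients}
    # Subject words starting with an uppercase letter.
    ids.update(re.findall(r"\b[A-Z][A-Za-z0-9]*\b", subject))
    # Capitalized word sequences of 2 to 5 words.
    ids.update(capitalized_runs(body))
    # Acronyms of 2 to 5 uppercase letters.
    ids.update(re.findall(r"\b[A-Z]{2,5}\b", body))
    # Possessives: TW's -> TW (straight or curly apostrophe).
    ids.update(re.findall(r"\b([A-Za-z][A-Za-z0-9]*)['’]s\b", body))
    # Words or phrases in quotes.
    ids.update(p.strip() for p in re.findall(r'["“]([^"“”]+)["”]', body))
    # Words with an uppercase character after the first one (eSpeak, ThinkBank).
    ids.update(re.findall(r"\b[A-Za-z][a-z0-9]*[A-Z][A-Za-z0-9]*\b", body))
    return ids

# Hypothetical message built from the slide's own examples.
ids = extract_identifiers(
    "alice@example.com",
    ["bob@example.com"],
    "Capacity Release update",
    'The Capacity Release Report for TW\'s pipeline was sent via eSpeak to "Trans Western" and CERN.',
)
```

Note that the last rule also matches plain acronyms such as TW; collecting everything into a set makes that overlap harmless.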

CISC: Scoring
Sender:
– Initial idea: generate clusters of addresses whose frequency of communication is above some threshold
  (+) Identifies “good” clusters of communication
  (-) Difficult to score when an email has addresses spread across more than one cluster
  (-) Fixed partitioning that is difficult to update

CISC: Scoring (Contd…)
Sender:
– Need a notion of soft clustering with both recipients and content
– Generate a measure of the sender's non-variability with respect to the addresses it co-occurs with or the content it discusses in emails
– Example:
  1 → {2,3} {3,4} {2,3,4} in Folder 1
  2 → {1} {3} {4} {1} {3} {1,3} in Folder 2
  Emphasizes social clusters {1,2,3} {1,3,4}
  Classify 2 → {1,3,4}
  – Traditionally: Folder 2 (address frequency based)
  – CISC: Folder 1 (social cluster based)
– Difficult to say upfront which is better!
– Efficacy discussed later
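The example above can be replayed in a few lines. The sketch below reads each entry as "sender followed by a recipient set" and contrasts two simple rules: counting how often each address appears per folder, and comparing whole participant sets. Jaccard overlap is used here purely as a stand-in for the social-cluster score, which the slides do not define, and all function names are hypothetical.

```python
from collections import Counter

# (sender, recipients, folder) triples taken from the slide's example.
history = [
    (1, {2, 3}, "Folder 1"), (1, {3, 4}, "Folder 1"), (1, {2, 3, 4}, "Folder 1"),
    (2, {1}, "Folder 2"), (2, {3}, "Folder 2"), (2, {4}, "Folder 2"),
    (2, {1}, "Folder 2"), (2, {3}, "Folder 2"), (2, {1, 3}, "Folder 2"),
]

def participants(sender, recipients):
    return {sender} | set(recipients)

def frequency_choice(sender, recipients):
    """Pick the folder in which the message's addresses appear most often."""
    counts = {}
    for s, r, folder in history:
        counts.setdefault(folder, Counter()).update(participants(s, r))
    query = participants(sender, recipients)
    return max(counts, key=lambda f: sum(counts[f][a] for a in query))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def social_choice(sender, recipients):
    """Pick the folder holding the most similar participant set (assumed measure)."""
    query = participants(sender, recipients)
    best = {}
    for s, r, folder in history:
        sim = jaccard(query, participants(s, r))
        best[folder] = max(best.get(folder, 0.0), sim)
    return max(best, key=best.get)

by_frequency = frequency_choice(2, {1, 3, 4})
by_social = social_choice(2, {1, 3, 4})
```

With these inputs the address counts favour Folder 2 (13 occurrences against 10), while the set comparison favours Folder 1, whose message among {1,2,3,4} matches the new message exactly. This reproduces the disagreement shown on the slide.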

CISC: Scoring (Contd…)
Words or Phrases:
– Generate a measure of each word's or phrase's importance
– Use the context captured through the co-occurring text
– Sample scenarios for score generation:
  Different functional groups in a company mentioning “Conference Room” → Low score
  A single shipment discussion for company “CERN” → High score
  Several different topic discussions (financial, operational etc.) for company “TW” → Low score
Clustering: Pair each message with its highest-similarity message, then merge clusters sharing at least one message to produce disjoint clusters
Directed Training:
– For each cluster, identify a message likely to belong to the majority class
– Ask the user to classify this message
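The clustering and directed-training steps can be sketched as follows. Several pieces are assumptions rather than the author's method: message similarity is plain Jaccard overlap of identifier sets (CISC uses scored identifiers), merging is done with a union-find structure, and the message suggested for labeling is the one most similar to the rest of its cluster. All names and the sample data are hypothetical.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_best_match(features):
    """features: dict mapping message id -> set of identifiers."""
    parent = {m: m for m in features}

    def find(m):
        # Union-find lookup with path halving.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for m, ids in features.items():
        others = [o for o in features if o != m]
        if not others:
            continue
        best = max(others, key=lambda o: jaccard(ids, features[o]))
        # Pair the message with its best match; pairs that share a
        # message end up in the same disjoint cluster.
        if jaccard(ids, features[best]) > 0:
            parent[find(m)] = find(best)

    clusters = {}
    for m in features:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

def suggest_for_labeling(cluster, features):
    """Assumed heuristic: the message most similar to the rest of its cluster."""
    return max(
        cluster,
        key=lambda m: sum(jaccard(features[m], features[o]) for o in cluster if o != m),
    )

# Hypothetical identifier sets for five messages.
features = {
    "m1": {"TW", "Capacity Release Report"},
    "m2": {"TW", "Capacity Release Report", "alice@example.com"},
    "m3": {"TW", "alice@example.com"},
    "m4": {"CERN", "Geneva"},
    "m5": {"CERN", "Geneva", "bob@example.com"},
}
clusters = cluster_by_best_match(features)
suggestions = {suggest_for_labeling(c, features) for c in clusters}
```

Each message only needs one close neighbour to join a cluster, which matches the observation in the conclusion that an email may be "close" to only a few emails in its class.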

Efficacy of TF-IDF Cosine Similarity
Clustering emails using the traditional TF-IDF cosine similarity measure is not very effective!
Note: Both the TF-IDF and CISC figures use only word and phrase tokens. The number of clusters differs between the two cases, but the purity figures indicate the discriminative capability of the respective algorithms.
Table columns: User, TF-IDF (% purity before merging), TF-IDF (% purity), CISC (% purity); rows: Lokay-M, Beck-S, Sanders-R, Williams-w (the values did not survive in this transcript).
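For reference, the baseline similarity can be written out in a few lines. The slides do not say which TF-IDF variant was used, so this sketch assumes raw term counts weighted by log(N / document frequency) and whitespace tokenization; the sample texts are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical message texts.
docs = [
    "capacity release report for TW",
    "TW capacity release schedule",
    "CERN shipment invoice",
]
vectors = tfidf_vectors(docs)
related = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

Unlike the CISC identifier set, every token contributes here, so common words shared by unrelated messages also raise the similarity.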

Efficacy of Social Cluster Based Scoring
Results
Table columns: User, CISC with social clusters (% purity), CISC without social clusters (% purity); rows: Lokay-M, Beck-S, Sanders-R, Williams-w (the values did not survive in this transcript).

CISC vs. POPFile
Results (% purity, with the number of training messages in parentheses):
                                      Lokay-M        Beck-S         Sanders-R      Williams-w
CISC                                  80.47 (265)    52.81 (218)    75.67 (146)    91.40 (153)
(row label lost; presumably POPFile)  [lost] (614)   71.47 (587)    84.79 (332)    93.38 (365)
Purity may sometimes (marginally) decrease with an increasing training set in POPFile!

Conclusion
Given a set of unclassified emails, the proposed strategy obtains higher clustering purity with lower training requirements than POPFile and the TF-IDF based method.
Key differentiators:
– Incorporates a combination of communication-cluster and content-variability based scoring for senders instead of the usual TF-IDF scoring or Naïve Bayes word model (POPFile)
– Picks a set of high-selectivity features for the final message similarity model rather than retaining most of the content of messages (i.e. all non-stopwords)
– Observes and uses the fact that any email in a class may be “close” to only a small number of emails rather than to all emails in that class
– Finally, helps lower training requirements through “directed training” rather than indiscriminate training over as many emails as possible

Future Work
– Design and evaluation for non-corporate datasets
– Tuning of message similarity scoring
  – Different weights for the score components
  – Different range normalization for different components to boost proportionally
  – Test feature score proportional to its length
– Richer feature set
  – Phrases following ‘the’
  – Test with substring-free collection, e.g. “TW Capacity Release Report” and “TW” are replaced with “Capacity Release Report” and “TW”
– Hierarchical word scoring to change granularity of clustering
– Online classification using training-directed feature extraction
– Merging high-purity clusters effectively to further reduce training requirements

Q&A