
Effective Information Access Over Public Email Archives Progress Report William Lee, Hui Fang, Yifan Li For CS598CXZ Spring 2005.


1 Effective Information Access Over Public Email Archives Progress Report William Lee, Hui Fang, Yifan Li For CS598CXZ Spring 2005

2 Introduction and Motivation
- Information within a newsgroup or a mailing list has largely been underutilized.
- For now, access to that data is restricted to traditional search and browsing.
- Mail traffic also grows rapidly.
  - For example, the Tomcat (the Java-based web application engine) mailing list received more than 37,000 messages from March 2003 to March 2004. That's around 101 messages per day!
- Can we access that information more effectively?

3 Existing Technologies
- Search
- Browse

4 Project Goals
- Thread detection
  - Detect topic shifts within a thread.
  - Challenge: we could not find such cases in our collection, so we will not explore this in our project. It is still an interesting research question in the email domain.
- Clustering
  - Group similar threads together.
  - Challenges: How to define the similarity function between two threads? How to evaluate the clustering results?
- Summarizing
  - Generate a summary for each cluster.
  - Challenges: How to identify the important parts of each cluster? How to evaluate the summarization results?
- Interface to view the clustering results

5 The Corpus
- Newsgroup archives for 3 Computer Science classes (CS473, CS475, and CS225) at UIUC for Fall 2004.
- Each newsgroup contains the messages for a complete semester of the given class.
- Unlike previous newsgroup clustering tasks:
  - We use a thread instead of an individual message as the unit.
  - We cluster based on subtopics within a newsgroup.

6 Progress So Far
- Implemented clustering using the CEES (Conversation Extraction and Evaluation Service) architecture.
  - CEES provides an architecture to:
    - Gather messages and construct thread trees
    - Parse, index, search, and cluster threads
    - Integrate with Lucene and Weka
    - Cluster threads using different fields
- Manually created the judgment files used to evaluate the clustering results.
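
The slides do not say how CEES builds thread trees. As a rough illustration only, the sketch below assumes each message carries an identifier and an optional reply link (as the Message-ID and In-Reply-To mail headers do). The function names and the dictionary keys `id` and `in_reply_to` are hypothetical and are not taken from CEES.

```python
from collections import defaultdict

def build_thread_trees(messages):
    """Group messages into thread trees using reply links.

    `messages` is a list of dicts with the hypothetical keys 'id' and
    'in_reply_to' (None for a message that starts a thread).
    Returns (roots, children), where `children` maps a message id to
    the ids of its direct replies.
    """
    known_ids = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in known_ids:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is missing from the archive:
            # treat this message as the start of a thread.
            roots.append(m["id"])
    return roots, dict(children)

def thread_members(root, children):
    """Return all message ids in the thread rooted at `root`, depth-first."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        # Push replies in reverse so they are visited in archive order.
        stack.extend(reversed(children.get(node, [])))
    return out
```

Each root, together with the messages reachable from it, then forms one thread, which is the unit the clustering step works on.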

7 Clustering
- Use an agglomerative clustering algorithm.
- Similarity = dot product of the Okapi-weighted vectors of corresponding fields.
- Computes the similarity of:
  - Contents
  - Subject
  - Contents without quoted text
  - First message
  - Rest of thread
  - Rest of thread without quoted text
  - Participants in a thread (email addresses in the "From:" field)
  - Linear regression using all the above features
  - Logistic regression using all the above features
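
A minimal sketch of the single-field case, under several assumptions the slides leave open: the Okapi weighting is taken to be a BM25-style term weight with the common defaults `k1=1.2` and `b=0.75` and a non-negative idf variant, the linkage is assumed to be average-link, and clustering stops at a fixed number of clusters. All function names are illustrative.

```python
import math
from collections import Counter

def okapi_vectors(docs, k1=1.2, b=0.75):
    """Turn tokenized documents into sparse BM25-style weighted vectors."""
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        vec = {}
        for term, f in tf.items():
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            vec[term] = idf * f * (k1 + 1) / (f + length_norm)
        vectors.append(vec)
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def agglomerative(vectors, num_clusters):
    """Average-link agglomerative clustering; returns lists of indices."""
    sim = [[dot(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def link(a, c):
        # Average pairwise similarity between two clusters.
        return sum(sim[i][j] for i in a for j in c) / (len(a) * len(c))

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        # Merge the most similar pair and drop the absorbed cluster.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Here each "document" would be one field of a thread (for example, its subject tokens). The regression variants on the slide would replace `dot` with a learned combination of the per-field similarities, which this sketch does not cover.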

8 Cluster Quality Measures (He2002)
- Overall Entropy = 0.5 * Cluster Entropy + 0.5 * Class Entropy
[Figure: example grouping of items in the clustering result versus the actual classes, illustrating cluster entropy and class entropy]
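
The slide gives only the weighted combination, not the definitions of the two terms. The sketch below assumes the usual reading: cluster entropy measures how mixed the actual classes are inside each produced cluster, class entropy measures how scattered each actual class is across clusters, and both are averaged with weights proportional to group size. The base-2 logarithm, the lack of normalization, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def _weighted_entropy(groups):
    """Size-weighted average entropy of the label mix inside each group."""
    total = sum(len(labels) for labels in groups.values())
    result = 0.0
    for labels in groups.values():
        size = len(labels)
        h = -sum((c / size) * math.log2(c / size)
                 for c in Counter(labels).values())
        result += (size / total) * h
    return result

def cluster_entropy(assigned, actual):
    """How mixed the actual classes are within each produced cluster."""
    groups = defaultdict(list)
    for cluster, cls in zip(assigned, actual):
        groups[cluster].append(cls)
    return _weighted_entropy(groups)

def class_entropy(assigned, actual):
    """How scattered each actual class is across the produced clusters."""
    groups = defaultdict(list)
    for cluster, cls in zip(assigned, actual):
        groups[cls].append(cluster)
    return _weighted_entropy(groups)

def overall_entropy(assigned, actual, beta=0.5):
    """Weighted combination; beta=0.5 matches the weighting on this slide."""
    return (beta * cluster_entropy(assigned, actual)
            + (1 - beta) * class_entropy(assigned, actual))
```

Combining the two terms matters because each can be driven to zero on its own: putting every thread in its own cluster gives zero cluster entropy, and putting all threads in one cluster gives zero class entropy. Lower values are better for all three measures.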

9 Clustering Performance
[Charts: Cluster Entropy; Class Entropy]

10 Clustering Performance (2)
- Overall Entropy = 0.53 * Cluster Entropy + 0.47 * Class Entropy

11 Remaining Work
- Clustering
  - Find a more reasonable cluster quality measure.
  - Study why the learned similarity function sometimes performs worse than the baseline.
  - Find a better way to learn the similarity function.
- Summarization
  - Divide it into two subtasks:
    - Summarization of announcement-driven discussions
    - Summarization of question-driven discussions
  - Evaluation
    - Create judgment files
    - Define evaluation measures

