Effective Information Access Over Public Archives: Progress Report. William Lee, Hui Fang, Yifan Li. For CS598CXZ, Spring 2005.
Introduction and Motivation. Information in newsgroups and mailing lists has largely been underutilized. For now, access to this data is restricted to traditional search and browsing. Mail traffic also grows rapidly. For example, the Tomcat (the Java-based web application engine) mailing list has more than 37,000 messages from March 2003 to March. That’s around 101 messages per day! Can we access this information more effectively?
Existing Technologies: Search, Browse.
Project Goals. Thread Detection: detect topic shifts within a thread. Challenge: we could not find such cases in our collection, so we will not explore this in our project, although it remains an interesting research question in this domain. Clustering: group similar threads together. Challenges: How should the similarity function between two threads be defined? How should the clustering results be evaluated? Summarization: generate a summary for each cluster. Challenges: How can the important parts of each cluster be identified? How should the summarization results be evaluated? Interface: view the clustering results.
The Corpus. Newsgroup archives for 3 Computer Science classes (CS473, CS475, and CS225) at UIUC for Fall. Each newsgroup contains the messages of a complete semester for the given class. Unlike previous newsgroup clustering tasks, we use a thread instead of an individual message as the unit, and we cluster based on subtopics within a newsgroup.
Progress So Far. Implemented clustering using the CEES (Conversation Extraction and Evaluation Service) architecture. CEES provides an architecture to: gather messages and construct thread trees; parse, index, search, and cluster threads; integrate with Lucene and Weka; cluster threads using different fields. Created the judgment files for evaluating the clustering results manually.
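The "gather messages and construct thread trees" step can be illustrated with a minimal sketch. The slides do not describe how CEES links messages, so this assumes each message carries a Message-ID and an optional In-Reply-To value; the dictionary keys `id` and `parent` and both function names are hypothetical.

```python
from collections import defaultdict

def build_threads(messages):
    """Link messages into thread trees using reply pointers.

    `messages` is a list of dicts with hypothetical keys:
    'id' (Message-ID) and 'parent' (In-Reply-To, or None).
    Returns (roots, children): the root ids in input order and
    a parent-id -> [child ids] map.
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("parent")
        if parent in known:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is missing from the archive:
            # treat the message as the start of a new thread.
            roots.append(m["id"])
    return roots, dict(children)

def thread_members(root, children):
    """Return every message id in the thread rooted at `root`, depth-first."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        out.append(node)
        # Reverse so that earlier replies are visited first.
        stack.extend(reversed(children.get(node, [])))
    return out
```

Treating a reply whose parent is absent as a new root keeps incomplete archives usable, at the cost of occasionally splitting one conversation into two threads.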
Clustering. Uses an agglomerative clustering algorithm. Similarity = dot product of the Okapi-weighted vectors of corresponding fields. Similarity is computed over: contents; subject; contents without quotes; first message; rest of thread; rest of thread without quotes; participants in a thread (addresses in the “From:” field). Two learned combinations are also used: linear regression using all the above features, and logistic regression using all the above features.
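As a rough illustration of the similarity and clustering steps on this slide, the sketch below builds Okapi-weighted term vectors for one field, takes their dot product, and merges clusters agglomeratively. Several details are assumptions not given in the slides: the BM25 form of Okapi weighting with k1 = 1.2 and b = 0.75, a non-negative IDF variant, average-link merging, and stopping at a fixed number of clusters `k`. It is not the CEES implementation.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults (assumed; not stated in the slides)

def okapi_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} Okapi/BM25 vector."""
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vectors = []
    for d in docs:
        norm = K1 * (1 - B + B * len(d) / avg_len)  # length normalization
        vec = {}
        for term, tf in Counter(d).items():
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            vec[term] = idf * tf * (K1 + 1) / (tf + norm)
        vectors.append(vec)
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def agglomerative(vectors, k):
    """Average-link agglomerative clustering; returns k lists of indices."""
    sim = [[dot(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        # Merge the most similar pair of clusters.
        x, y = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Each field listed on the slide (subject, contents, first message, and so on) would get its own set of vectors and its own similarity score; those per-field scores are the features a regression model could then combine.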
Cluster Quality Measures (He2002). Overall Entropy = 0.5 * Cluster Entropy + 0.5 * Class Entropy. [Figure: panels labeled Cluster Entropy and Class Entropy, with the labels Result and Actual]
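The overall entropy formula can be made concrete with a short sketch. The slides do not reproduce the exact definitions from He2002, so this assumes size-weighted Shannon entropy with log base 2: cluster entropy measures how mixed the true classes are inside each cluster, and class entropy measures how scattered each true class is across clusters. The function names and the `beta` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict

def _weighted_entropy(groups, labels):
    """Size-weighted Shannon entropy of `labels` within each group of `groups`."""
    n = len(groups)
    by_group = defaultdict(list)
    for g, l in zip(groups, labels):
        by_group[g].append(l)
    total = 0.0
    for members in by_group.values():
        size = len(members)
        h = -sum((c / size) * math.log2(c / size)
                 for c in Counter(members).values())
        total += size / n * h
    return total

def cluster_entropy(clusters, classes):
    """How mixed the true classes are inside each produced cluster."""
    return _weighted_entropy(clusters, classes)

def class_entropy(clusters, classes):
    """How scattered each true class is across the produced clusters."""
    return _weighted_entropy(classes, clusters)

def overall_entropy(clusters, classes, beta=0.5):
    """beta * cluster entropy + (1 - beta) * class entropy; lower is better."""
    return (beta * cluster_entropy(clusters, classes)
            + (1 - beta) * class_entropy(clusters, classes))
```

Combining the two terms matters because each one alone has a trivial optimum: putting every thread in its own cluster drives cluster entropy to zero, while putting all threads in one cluster drives class entropy to zero.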
Clustering Performance. [Figure: Cluster Entropy and Class Entropy]
Clustering Performance (2). Overall Entropy = 0.53 * Cluster Entropy + ... * Class Entropy
Remaining Work. Clustering: find a more reasonable cluster quality measure; study why the learned similarity function sometimes performs worse than the baseline; find a better way to learn the similarity function. Summarization: divide it into two subtasks, summarization of announcement-driven discussions and summarization of question-driven discussions. Evaluation: create judgment files; define evaluation measures.