Effective Information Access Over Public Email Archives Progress Report William Lee, Hui Fang, Yifan Li For CS598CXZ Spring 2005.


Introduction and Motivation
- Information within a newsgroup or a mailing list has largely been underutilized.
- For now, access to that data is restricted to traditional search and browsing.
- Mail traffic also grows rapidly.
  - For example, the Tomcat (the Java-based web application engine) mailing list has more than 37,000 messages from March 2003 to March [year missing]. That's around 101 messages per day!
- Can we access this information more effectively?

Existing Technologies
- Search
- Browse

Project Goals
- Thread Detection
  - Detect topic shifts within a thread.
  - Challenge: We could not find such cases in our collection, so we will not explore this in our project. It is still an interesting research question in this domain.
- Clustering
  - Group similar threads together.
  - Challenges: How to define the similarity function between two threads? How to evaluate the clustering results?
- Summarization
  - Generate a summary for each cluster.
  - Challenges: How to identify the important parts of each cluster? How to evaluate the summarization results?
- Interface to view the clustering results

The Corpus
- Newsgroup archives for 3 Computer Science classes (CS473, CS475, and CS225) at UIUC for Fall [year missing].
- Each newsgroup contains the messages for a complete semester of the given class.
- Unlike previous newsgroup clustering tasks:
  - We use a thread instead of an individual message as the unit.
  - We cluster based on subtopics within a newsgroup.

Progress So Far
- Implemented clustering using the CEES (Conversation Extraction and Evaluation Service) architecture
  - CEES provides an architecture to:
    - Gather messages and construct thread trees
    - Parse, index, search, and cluster threads
    - Integrate with Lucene and Weka
    - Cluster threads using different fields
- Created the judgment files for evaluating the clustering results manually
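
The "construct thread trees" step can be illustrated with a minimal Python sketch. This is not the CEES implementation: the message keys `id` and `in_reply_to` and both function names are hypothetical, and the sketch assumes each message records at most one parent, as an In-Reply-To style header would.

```python
from collections import defaultdict


def build_thread_trees(messages):
    """Group messages into reply trees.

    `messages` is a list of dicts with the hypothetical keys 'id' and
    'in_reply_to' (None for a message that starts a thread).
    Returns (roots, children): root ids in input order, and a map from a
    message id to the ids of its direct replies.
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in known and parent != m["id"]:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is missing from the archive:
            # treat the message as the start of its own thread.
            roots.append(m["id"])
    return roots, dict(children)


def thread_members(root, children):
    """Return every message id in the thread rooted at `root`, depth-first."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        # Reverse so that earlier replies are visited first.
        stack.extend(reversed(children.get(node, [])))
    return out
```

Each root then identifies one thread, and the messages returned by `thread_members` can be concatenated to form the thread-level fields used for clustering.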

Clustering
- Use an agglomerative clustering algorithm.
- Similarity = dot product of the Okapi-weighted vectors of the corresponding fields.
- Compute the similarity of:
  - Contents
  - Subject
  - Contents without quotes
  - First message
  - Rest of thread
  - Rest of thread without quotes
  - Participants in a thread (addresses in the “From:” header)
  - Linear regression using all the above features
  - Logistic regression using all the above features
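
A small Python sketch of clustering on a single field follows. Several details are assumptions, because the slide does not state them: the Okapi weighting is taken to be BM25 with `k1=1.2` and `b=0.75` and a non-negative IDF, the linkage is taken to be average link, and the function names are made up for illustration. The project itself used Lucene and Weka, not this code.

```python
import math
from collections import Counter


def okapi_vectors(docs, k1=1.2, b=0.75):
    """Turn tokenized, non-empty documents into sparse BM25-weighted vectors."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avg_len)  # length normalization
        vectors.append({
            t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)


def agglomerative(vectors, num_clusters):
    """Average-link agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    sim = [[dot(u, v) for v in vectors] for u in vectors]

    def link(a, c):
        return sum(sim[i][j] for i in a for j in c) / (len(a) * len(c))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average similarity.
        x, y = max(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[x] += clusters[y]
        del clusters[y]
    return clusters
```

To compare fields as in the list above, the same functions would be run once per field (subject, first message, and so on), each time passing the tokens of that field for every thread.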

Cluster Quality Measures (He2002)
- Overall Entropy = 0.5 * Cluster Entropy + 0.5 * Class Entropy
[Figure: Cluster Entropy and Class Entropy; labels: Result, Actual]
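
The weighted combination of the two entropies can be sketched in Python as follows. The slide gives only the weighted sum, so the definitions here are assumptions: cluster entropy is taken to be the size-weighted entropy of class labels inside each cluster, class entropy is the same quantity with the roles swapped, and logarithms are base 2 with no normalization. All function names are hypothetical.

```python
import math
from collections import Counter


def _weighted_entropy(groups, labels):
    """Size-weighted average entropy of `labels` inside each group."""
    total = len(groups)
    members = {}
    for g, l in zip(groups, labels):
        members.setdefault(g, []).append(l)
    result = 0.0
    for items in members.values():
        counts = Counter(items)
        h = -sum((c / len(items)) * math.log2(c / len(items))
                 for c in counts.values())
        result += len(items) / total * h
    return result


def cluster_entropy(clusters, classes):
    """How mixed the true classes are inside each cluster."""
    return _weighted_entropy(clusters, classes)


def class_entropy(clusters, classes):
    """How widely each true class is scattered across clusters."""
    return _weighted_entropy(classes, clusters)


def overall_entropy(clusters, classes, beta=0.5):
    """Weighted combination; beta=0.5 matches the weights on the slide."""
    return (beta * cluster_entropy(clusters, classes)
            + (1 - beta) * class_entropy(clusters, classes))
```

Under these definitions, putting every thread in one cluster drives class entropy to zero, and putting every thread in its own cluster drives cluster entropy to zero, which is why the two are combined; lower values are better.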

Clustering Performance
[Charts: Cluster Entropy; Class Entropy]

Clustering Performance (2)
- Overall Entropy = 0.53 * Cluster Entropy + [weight missing] * Class Entropy

Remaining Work
- Clustering
  - Find a more reasonable cluster quality measure.
  - Study why the learned similarity function sometimes performs worse than the baseline.
  - Find a better way to learn the similarity function.
- Summarization
  - Divide it into two subtasks:
    - Summarization of announcement-driven discussions
    - Summarization of question-driven discussions
  - Evaluation
    - Create judgment files
    - Choose evaluation measures