
1 CEES: Intelligent Access to Public Email Conversations
William Lee, Hui Fang, Yifan Li, ChengXiang Zhai, University of Illinois at Urbana-Champaign

Public Email Conversations
– Information within newsgroups and mailing lists has largely been underutilized.
– For now, access to these data is restricted to traditional searching and browsing.
– Mail traffic also grows exponentially.
– Can we access this information more effectively?

Architecture
– CEES = Conversation Extraction and Evaluation Service
– Provides a general framework for email-related research
– Integrates popular open-source projects such as Lucene, Hibernate, Tapestry, Weka, and more
– Features object-to-relational mapping of mail metadata, mail threading, flexible indexing, conversation clustering, and a web-based GUI

Clustering and Similarity Function
– Goal: find commonly discussed topics in a set of conversations (threads)
– Use agglomerative clustering with complete link
– Learn similarity functions from different "perspectives" of threads: authors, date, subject, contents, contents without quote, first message, reply, reply without quote
– Use linear and logistic regression to learn the combined similarity function

Results
– Corpus: three Computer Science class newsgroups from the University of Illinois at Urbana-Champaign
– Three different human taggers grouped messages into subtopics
– 3-way cross validation: use one tagger's judgment file as the training set and test on the other two
– Use class and cluster entropy as the comparison metric

Conversation Map
– Search and browse are the existing access technologies
– Conversation visualization derived from the treemap
– Clusters sorted by two extra time dimensions: intra-cluster time and inter-cluster time
– Allows the user to adjust the similarity threshold to "zoom" in on the more similar threads

Conclusion
– Clustering: learning the similarity function is effective for combining different "perspectives"
– Conversation map: gives an overview of a conversation group; effective use of 2-D space
– Future work: derive better algorithms to learn the similarity function; faster clustering algorithms for mining patterns in conversations; summarization of conversation clusters
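The clustering step described in the text can be sketched as follows: per-"perspective" similarities (subject, authors, contents, ...) are combined with learned weights, and threads are then grouped by complete-link agglomerative clustering. This is a minimal illustration, not the CEES implementation; the function names, the Jaccard subject similarity, and the threshold value are assumptions.

```python
# Sketch of complete-link agglomerative clustering over threads,
# with a linearly combined similarity function. Illustrative only.
from itertools import combinations

def combined_similarity(t1, t2, perspective_sims, weights):
    """Linear combination of per-perspective similarity scores
    (the weights would be learned by linear/logistic regression)."""
    return sum(w * sim(t1, t2) for sim, w in zip(perspective_sims, weights))

def complete_link_clustering(threads, sim, threshold):
    """Repeatedly merge the most similar pair of clusters.
    Complete link: cluster-to-cluster similarity is the *minimum*
    pairwise similarity between their members."""
    clusters = [[t] for t in threads]
    while len(clusters) > 1:
        best = None
        for (i, a), (j, b) in combinations(enumerate(clusters), 2):
            s = min(sim(x, y) for x in a for y in b)
            if best is None or s > best[0]:
                best = (s, i, j)
        s, i, j = best
        if s < threshold:      # the "zoom" knob: a higher threshold
            break              # keeps only the most similar threads
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy single-perspective similarity: word overlap of thread subjects.
def subject_sim(t1, t2):
    a, b = set(t1.split()), set(t2.split())
    return len(a & b) / len(a | b)

threads = ["hw1 deadline", "hw1 deadline extension", "midterm room"]
print(complete_link_clustering(threads, subject_sim, threshold=0.4))
# → [['hw1 deadline', 'hw1 deadline extension'], ['midterm room']]
```

Raising the threshold toward 1.0 breaks the merge loop earlier, which is one simple way to realize the conversation map's "zoom to the more similar threads" behavior.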

2 Autonomic Data Integration Systems
Autonomic computing: shift workload from administrators and developers onto the system
– Self-tuning, self-maintaining, self-recovering from failures, self-improving
– Examples: self-tuning databases and query optimizers; self-profiling software for bug and bottleneck detection; self-recovering distributed systems
Data integration systems have many complex components:
– Global schema
– Sources, wrappers, and source schemas
– Semantic mappings between source and global schemas
They are currently built (semi-)manually in an error-prone and laborious process, and are extremely difficult to maintain over changing sources.
Goal: build autonomic data integration systems!

3 The AIDA Project (Automatic Integration of DAta)
Improving automatic methods
– Schema and ontology matching [SIGMOD-01, WWW-02, SIGMOD-04]
– Entity matching and integration [IJCAI-03, IEEE Intelligent-03]
– Global interface construction [SIGMOD-04]
Reducing costs of system construction (the focus of this talk)
– Mass collaboration to build systems [WebDB-03, IJCAI-03]
Monitoring data sources and maintaining DI systems
– Recognition of changes in source data
– Detection and repair of failures in DI system components
Goals: fast system deployment, minimal human effort, automatic adjustment to changes, continuous improvement

4 The MOBS Approach: Mass COllaboration to Build Systems
– Shift workload from developers onto the user population
– Build the system accurately with low individual effort
[Diagram: MOBS sits between automatic techniques/developers and the user population, spanning the tasks of form recognition, attribute matching, source discovery, system initialization, and query translation]

5 MOBS Applied to the Deep Web: Query Interface Matching
1. Decompose the task into binary statements
2. Initialize a small functioning system
3. Solicit and merge user answers to expand the system
Statements for matching "Writer":
– "Writer = Author" ?
– "Writer = Title" ?
– "Writer = Price" ?
[Figure: example book-sales query interfaces with attributes such as Title, Author/Writer, Price/Cost, Pub/Year]
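Step 1 of the slide, decomposing an attribute-matching task into binary statements, can be sketched in a few lines. The attribute names come from the slide; the function itself is an illustrative assumption, not MOBS code.

```python
# Turn one unmatched source attribute ("Writer") into yes/no
# statements against the global schema, so each user interaction
# answers exactly one binary question. Illustrative sketch.

def binary_statements(source_attr, global_schema):
    """One user-answerable yes/no statement per candidate match."""
    return [f'"{source_attr} = {g}" ?' for g in global_schema]

global_schema = ["Author", "Title", "Price"]
for q in binary_statements("Writer", global_schema):
    print(q)
# A stream of yes/no votes is then collected per statement and
# merged (step 3) to decide which mapping the users support.
```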

6 How to Solicit User Answers: Incentive Models
1. Leverage a monopoly or better-service system
2. Piggy-back on a helper application
3. Deploy in a volunteer or community environment
[Figure: example prompt over a Barnes & Noble form, "Is this form a Book Sales source?" with YES/NO buttons and the attributes Author, Title, Pub, Price]

7 How to Merge User Answers: Bayesian Learning
– Use a dynamic Bayesian network as a generative feedback model
– Estimate user behavioral parameters from evaluation answers
– Converge statements from teaching answers
[Figure: example query interfaces with attributes Title, Author/Writer, Price/Cost, Pub/Year]
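The merging step above uses a dynamic Bayesian network; as a much-simplified stand-in, the same idea can be sketched as reliability-weighted voting, where each user's reliability is estimated from evaluation statements with known truth and then weights their votes on unknown teaching statements. All names and data below are hypothetical.

```python
# Simplified stand-in for the DBN-based merging: estimate each
# user's reliability from known-truth "evaluation" statements,
# then weight their votes on unknown "teaching" statements.

def estimate_reliability(user_answers, gold):
    """Fraction of known-truth statements the user got right."""
    correct = sum(user_answers[s] == truth for s, truth in gold.items())
    return correct / len(gold)

def merge_votes(votes, reliabilities):
    """votes: {user: True/False} for one statement.
    Accept the statement if the reliability-weighted 'yes' mass wins."""
    yes = sum(reliabilities[u] for u, v in votes.items() if v)
    no = sum(reliabilities[u] for u, v in votes.items() if not v)
    return yes > no

gold = {"Writer = Title": False, "Pub = Year": True}   # evaluation answers
answers = {
    "alice": {"Writer = Title": False, "Pub = Year": True},   # reliable
    "bob":   {"Writer = Title": True,  "Pub = Year": False},  # unreliable
}
rel = {u: estimate_reliability(a, gold) for u, a in answers.items()}

votes = {"alice": True, "bob": False}   # both vote on "Writer = Author"
print(merge_votes(votes, rel))          # alice's weighted yes wins → True
```

Unlike this sketch, the actual generative model updates its estimates dynamically as answers arrive, but the division of labor is the same: evaluation answers calibrate users, teaching answers converge statements.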

8 Applicability of the MOBS Approach
We have applied MOBS in various settings:
– Scale: from a small community intranet to a highly trafficked website
– Users: from cooperative expert volunteers to unpredictable novice users
...and to several DI tasks:
– Form recognition: 24 forms, 17 bookstore forms
– Interface matching: 17 interfaces, 155 attributes (average 9 attributes per interface)
– Hub discovery: 30 department sites, 30 hubs
– Data extraction: 26 homepages, 155 slots (average 6 slots per homepage)
– Mini-CiteSeer: 17 homepages, 22 publication lists (average 1.3 publication lists per homepage)
Deep Web tasks: form recognition, interface matching. Surface Web tasks: hub discovery, data extraction, Mini-CiteSeer.

9 Simulation and Real-World Results
Form Recognition: helper = DB course website, 132 undergrad students; duration = 5 days, completed; progress = completed 24/24 interfaces, found 17 bookstore interfaces; precision = 1.0 (0.7 ML); recall = 0.89 (0.89 ML); average user workload = 7.4 answers
Interface Matching: helper = DB course website, 132 undergrad students; duration = 7 days, stopped; progress = completed 10/17 interfaces, matched 65 total attributes; precision = 0.97 (0.63 ML); recall = 0.97 (0.63 ML); average user workload = 12.5 answers
Hub Discovery: helper = IR course website, 28 undergrad students; duration = 21 days, stopped; progress = completed 15/30 sites, found 15 hubs; precision = 0.87 (0.27 ML); recall = 0.87 (0.27 ML); average user workload = 16.1 answers
Mini-CiteSeer: helper = Google search engine, 21 researchers, friends, family; duration = 19 days, completed; progress = completed 17/17 pages (94 lists), found 19 pubs; precision = 1.0; recall = 0.86; average user workload = 8.7 answers
Simulated user-reliability distributions:
– P1: Uniform [0,1]
– P2: Uniform [0.3,0.7]
– P3: Uniform [0.5,0.9]
– P4: Bell [0,1]
– P5: Bell [0.3,0.7]
– P6: Bell [0.5,0.9]
– P7: Bimodal {0.2,0.8}
– P8: 90% Uniform [0,0.4], 10% {0.8}
– P9: 10% {0.1}, 50% Uniform [0.5,0.7], 40% Uniform [0.8,1]
– P10: 10% {0.3}, 90% Uniform [0.7,1]

10 Conclusion & Future Work
MOBS is an effective data integration tool:
– Requires small start-up and administrative costs
– Solicits minimal effort per user
– Constructs the system accurately
– Complements existing DI techniques
– Applies to various scenarios and DI domains
Future work:
– Leverage implicit feedback
– Intelligently maintain the system
– More tightly integrate with existing DI techniques
– Deploy compelling real-world applications
http://anhai.cs.uiuc.edu/home/projects/aida.html
