CEES: Intelligent Access to Public Email Conversations. Panels: Public Email Conversations, Architecture, Clustering, Results, Conversation Map, Conclusion. William Lee,

Presentation transcript:

CEES: Intelligent Access to Public Email Conversations
William Lee, Hui Fang, Yifan Li, ChengXiang Zhai
University of Illinois at Urbana-Champaign

Public Email Conversations
– Information within newsgroups or mailing lists has largely been underutilized.
– For now, access to that data is restricted to traditional searching and browsing.
– Mail traffic also grows exponentially.
– Can we access that information more effectively?

Architecture
– CEES = Conversation Extraction and Evaluation Service
– Provides a general framework for email-related research
– Integrates with popular open source projects such as Lucene, Hibernate, Tapestry, Weka, and more
– Features object-to-relational mapping of mail metadata, mail threading, flexible indexing, conversation clustering, and a web-based GUI

Clustering and Similarity Function
– Goal: Find commonly discussed topics from a set of conversations (threads)
– Use agglomerative clustering with complete link
– Learn similarity functions from different “perspectives” of threads:
  – authors, date, subject, contents, contents without quote, first message, reply, reply without quote
  – Use Linear and Logistic Regression to learn the combined similarity function

Results
– Use 3 Computer Science class newsgroups from Univ. of Illinois at Urbana-Champaign as the corpus
– Three different human taggers group messages into subtopics
– 3-way cross validation, using one group’s judgment file as the training set and testing on the other two
– Use class and cluster entropy as the comparison metric

Conversation Map
– Existing technologies: Search, Browse
– Conversation visualization derived from Treemap
– Clusters sorted by two extra time dimensions: Intra-Cluster Time and Inter-Cluster Time
– Allows the user to adjust the similarity threshold to “zoom” in on the more similar threads

Conclusion
– Clustering
  – Learning the similarity function can be effective in combining different “perspectives”
– Conversation Map
  – Can give an overview of a conversation group
  – Effective use of 2-D space
– Future Work
  – Derive better algorithms to learn the similarity function
  – Faster clustering algorithms that work on mining patterns in conversations
  – Summarization of conversation clusters
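The clustering step described above (complete-link agglomerative clustering over a combined similarity, with an adjustable threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions: the per-perspective similarity values, the weights, and the function names are hypothetical, and the weights are hand-picked here rather than learned by regression as on the poster.

```python
from itertools import combinations

def combined_similarity(features, weights):
    """Linear combination of per-perspective similarities (hypothetical weights)."""
    return sum(weights[name] * value for name, value in features.items())

def complete_link_clusters(n_items, sim, threshold):
    """Agglomerative clustering with complete link.

    sim(i, j) returns the similarity of items i and j. Two clusters are merged
    only while their least similar cross-cluster pair is still >= threshold.
    """
    clusters = [[i] for i in range(n_items)]
    while len(clusters) > 1:
        best_pair, best_score = None, threshold
        for a, b in combinations(range(len(clusters)), 2):
            # Complete link: cluster similarity is the minimum pairwise similarity.
            score = min(sim(i, j) for i in clusters[a] for j in clusters[b])
            if score >= best_score:
                best_pair, best_score = (a, b), score
        if best_pair is None:
            break  # no pair of clusters is similar enough
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy data: made-up similarities between four threads for three perspectives.
pairwise = {
    (0, 1): {"subject": 0.9, "contents": 0.8, "authors": 0.5},
    (0, 2): {"subject": 0.1, "contents": 0.2, "authors": 0.0},
    (0, 3): {"subject": 0.1, "contents": 0.1, "authors": 0.0},
    (1, 2): {"subject": 0.2, "contents": 0.1, "authors": 0.5},
    (1, 3): {"subject": 0.2, "contents": 0.2, "authors": 0.0},
    (2, 3): {"subject": 0.8, "contents": 0.7, "authors": 1.0},
}
weights = {"subject": 0.5, "contents": 0.4, "authors": 0.1}  # placeholder values

def sim(i, j):
    return combined_similarity(pairwise[tuple(sorted((i, j)))], weights)

print(complete_link_clusters(4, sim, threshold=0.5))
print(complete_link_clusters(4, sim, threshold=0.05))
```

Raising the threshold keeps only tightly related threads together, which is one way to read the "zoom" control mentioned for the Conversation Map; lowering it merges everything into fewer, looser groups.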

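The poster names class and cluster entropy as the comparison metric but does not give the formulas. A minimal sketch of one common definition follows, assuming cluster entropy is the size-weighted entropy of human-assigned classes inside each cluster and class entropy is the size-weighted entropy of cluster assignments inside each class. The function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def _weighted_entropy(groups, label_of):
    """Size-weighted average entropy of label_of within each group."""
    total = sum(len(members) for members in groups.values())
    result = 0.0
    for members in groups.values():
        counts = Counter(label_of[m] for m in members)
        h = -sum((c / len(members)) * log2(c / len(members)) for c in counts.values())
        result += (len(members) / total) * h
    return result

def cluster_and_class_entropy(cluster_of, class_of):
    """Return (cluster_entropy, class_entropy); lower is better for both."""
    clusters, classes = {}, {}
    for item in cluster_of:
        clusters.setdefault(cluster_of[item], []).append(item)
        classes.setdefault(class_of[item], []).append(item)
    return _weighted_entropy(clusters, class_of), _weighted_entropy(classes, cluster_of)

# Toy example: four threads, two system clusters, two human-assigned subtopics.
cluster_of = {"t1": 0, "t2": 0, "t3": 1, "t4": 1}
perfect = {"t1": "hw1", "t2": "hw1", "t3": "exam", "t4": "exam"}
mixed = {"t1": "hw1", "t2": "exam", "t3": "hw1", "t4": "exam"}

print(cluster_and_class_entropy(cluster_of, perfect))
print(cluster_and_class_entropy(cluster_of, mixed))
```

Under this definition a clustering that matches the human grouping exactly scores zero on both measures, while clusters that mix subtopics evenly score one bit each.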
Autonomic Data Integration Systems

Autonomic Computing
– Shift workload from administrators & developers onto system
– Self-tuning, self-maintaining, self-recovering from failures, self-improving
– Examples:
  – Self-tuning databases and query optimizers
  – Self-profiling software for bug and bottleneck detection
  – Self-recovering distributed systems

Data Integration Systems
– Many complex components:
  – Global schema
  – Sources, wrappers, and source schemas
  – Semantic mappings between source and global schemas
– Currently built (semi-)manually in an error-prone and laborious process
– Extremely difficult to maintain over changing sources

Build Autonomic Data Integration Systems!

The AIDA Project
Automatic Integration of DAta

Improving Automatic Methods
– Schema & ontology matching [SIGMOD-01, WWW-02, SIGMOD-04]
– Entity matching & integration [IJCAI-03, IEEE Intelligent-03]
– Global interface construction [SIGMOD-04]

Reducing Costs of System Construction
– Mass collaboration to build systems [WebDB-03, IJCAI-03]

Monitoring Data Sources and Maintaining DI Systems
– Recognition of changes in source data
– Detection and repair of failures in DI system components

– Fast system deployment
– Minimal human effort
– Automatic adjustment to changes
– Continuous improvement

The Focus of This Talk

The MOBS Approach
Mass COllaboration to Build Systems

– Shift workload from developers onto user population
– Build system accurately with low individual effort

[Figure labels: Automatic Techniques, Developers, User Population, MOBS; Form Recognition, Attribute Matching, Source Discovery, System Initialization, Query Translation]

MOBS Applied to the Deep Web

MOBS for Query Interface Matching
1. Decompose task into binary statements
2. Initialize small functioning system
3. Solicit and merge user answers to expand the system

Statements for Matching “Writer”
– “Writer = Author”?
– “Writer = Title”?
– “Writer = Price”?

[Figure: query interface forms with attribute labels such as Title, Cost, Writer, Pub, Author, Authors, Price, Year]
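Step 1 above, decomposing a matching task into binary statements, might look something like the following sketch. The `Statement` type and the function name are hypothetical and only illustrate how one source attribute expands into one yes/no question per candidate attribute.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    """A yes/no question of the form 'source_attr = candidate_attr?'."""
    source_attr: str
    candidate_attr: str

    def question(self):
        return f'"{self.source_attr} = {self.candidate_attr}"?'

def decompose_matching_task(source_attr, candidate_attrs):
    """Turn 'which attribute does source_attr match?' into binary statements."""
    return [Statement(source_attr, candidate) for candidate in candidate_attrs]

statements = decompose_matching_task("Writer", ["Author", "Title", "Price"])
for statement in statements:
    print(statement.question())
```

Each statement can then be shown to a user as a single question, so no individual has to solve the whole matching problem.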

How to Solicit User Answers

Incentive Models
– Leverage a monopoly or better-service system
– Piggy-back on a helper application
– Deploy in a volunteer or community environment

[Figure: example question shown to a user for a Barnes & Noble form with fields Author, Title, Pub, Price: “Is this form a Book Sales source?” YES / NO]

How to Merge User Answers

Bayesian Learning
– Use a dynamic Bayesian network as a generative feedback model
– Estimate user behavioral parameters from evaluation answers
– Converge statements from teaching answers

[Figure: query interface forms with attribute labels such as Title, Cost, Writer, Pub, Author, Authors, Price, Year]
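The slide describes a dynamic Bayesian network, which is not reproduced here. As a much simpler stand-in that shows the same two-stage idea, the sketch below estimates each user's reliability from evaluation answers (statements whose truth is already known) and then combines teaching answers with a naive Bayes vote that assumes users answer independently. All names, the smoothing constant, and the toy answers are assumptions for illustration.

```python
from math import exp, log

def estimate_reliability(evaluation_answers, truth, smoothing=1.0):
    """Estimate P(user answers correctly) from answers to known statements.

    evaluation_answers: iterable of (user, statement, answer) with boolean answers.
    Smoothing keeps every estimate strictly between 0 and 1.
    """
    correct, total = {}, {}
    for user, statement, answer in evaluation_answers:
        total[user] = total.get(user, 0) + 1
        correct[user] = correct.get(user, 0) + (answer == truth[statement])
    return {u: (correct[u] + smoothing) / (total[u] + 2 * smoothing) for u in total}

def merge_answers(teaching_answers, reliability, prior=0.5):
    """Return P(statement is true) per statement under a naive Bayes model."""
    log_odds = {}
    for user, statement, answer in teaching_answers:
        r = reliability.get(user, 0.5)  # unknown users carry no weight
        step = log(r / (1 - r))
        log_odds.setdefault(statement, log(prior / (1 - prior)))
        log_odds[statement] += step if answer else -step
    return {s: 1 / (1 + exp(-value)) for s, value in log_odds.items()}

# Toy data: two users answer two evaluation statements with known truth.
truth = {"eval-1": True, "eval-2": False}
evaluation = [
    ("alice", "eval-1", True), ("alice", "eval-2", False),  # both correct
    ("bob", "eval-1", False), ("bob", "eval-2", True),      # both wrong
]
reliability = estimate_reliability(evaluation, truth)

teaching = [("alice", "Writer = Author", True), ("carol", "Writer = Author", False)]
posterior = merge_answers(teaching, reliability)
print(reliability, posterior)
```

In this toy run the user with no evaluation history contributes nothing, so the merged belief follows the user with a track record; a fuller model would also update reliability over time, which this sketch does not attempt.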

Applicability of the MOBS Approach

We Have Applied MOBS in Various Settings…
– Scale: from a small community intranet to a highly trafficked website
– Users: from cooperative expert volunteers to unpredictable novice users

… and to Several DI Tasks
– Deep Web: Form Recognition, Interface Matching
– Surface Web: Hub Discovery, Data Extraction, Mini-CiteSeer

Task data sets
– Form Recognition: 24 forms, 17 bookstore forms
– Interface Matching: 17 interfaces, 155 attributes (average 9 attributes per interface)
– Hub Discovery: 30 department sites, 30 hubs
– Data Extraction: 26 homepages, 155 slots (average 6 slots per homepage)
– Mini-CiteSeer: 17 homepages, 22 publication lists (average 1.3 publication lists per homepage)

Simulation and Real-World Results

Form Recognition
– Helper Application: DB course website, 132 undergrad students
– Duration & Status: 5 days, Completed
– Current Progress: Completed 24/24 interfaces, Found 17 bookstore interfaces
– Precision: 1.0 (0.7 ML)
– Recall: 0.89 (0.89 ML)
– Avg User Workload: 7.4 answers

Interface Matching
– Helper Application: DB course website, 132 undergrad students
– Duration & Status: 7 days, Stopped
– Current Progress: Completed 10/17 interfaces, Matched 65 total attributes
– Precision: 0.97 (0.63 ML)
– Recall: 0.97 (0.63 ML)
– Avg User Workload: 12.5 answers

Hub Discovery
– Helper Application: IR course website, 28 undergrad students
– Duration & Status: 21 days, Stopped
– Current Progress: Completed 15/30 sites, Found 15 hubs
– Precision: 0.87 (0.27 ML)
– Recall: 0.87 (0.27 ML)
– Avg User Workload: 16.1 answers

Mini-Citeseer
– Helper Application: Google search engine, 21 researchers, friends, family
– Duration & Status: 19 days, Completed
– Current Progress: Completed 17/17 pages (94 lists), Found 19 pubs
– Avg User Workload: answers

Simulation legend
P1 – Uniform [0,1]
P2 – Uniform [0.3,0.7]
P3 – Uniform [0.5,0.9]
P4 – Bell [0,1]
P5 – Bell [0.3,0.7]
P6 – Bell [0.5,0.9]
P7 – Bimodal {0.2,0.8}
P8 – 90% Uniform [0,0.4], 10% {0.8}
P9 – 10% {0.1}, 50% Uniform [0.5,0.7], 40% Uniform [0.8,1]
P10 – 10% {0.3}, 90% Uniform [0.7,1]
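The precision and recall figures reported above follow the standard definitions. For readers who want to see them concretely, here is a minimal sketch over sets of attribute matches; the matches shown are invented for illustration and are not taken from the experiments.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two collections of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical matches produced by a system versus a hand-built gold standard.
predicted_matches = {
    ("Writer", "Author"), ("Cost", "Price"), ("Pub", "Year"), ("Title", "Title"),
}
gold_matches = {
    ("Writer", "Author"), ("Cost", "Price"), ("Title", "Title"),
    ("Pub", "Publisher"), ("Authors", "Author"), ("Year", "Year"),
}

print(precision_recall(predicted_matches, gold_matches))
```

Precision is the share of produced matches that are correct, and recall is the share of correct matches that were produced.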

Conclusion & Future Work

MOBS is an Effective Data Integration Tool
– Requires small start-up and administrative costs
– Solicits minimal effort per user
– Constructs system accurately
– Complements existing DI techniques
– Applies to various scenarios and DI domains

Future Work
– Leverage implicit feedback
– Intelligently maintain system
– More tightly integrate with existing DI techniques
– Deploy compelling real-world applications