Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search Mathias Verbeke, Bettina Berendt,


Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search Mathias Verbeke, Bettina Berendt, Siegfried Nijssen Dept. Computer Science, KU Leuven

Agenda
– Motivation: Diversity → Diversity-aware tools → (our) Context
– Main part: Measures of diversity → Tool
– Outlook

Motivation (1): Diversity is...
– Speaking different languages (etc.) → localisation / internationalisation
– Having different abilities → accessibility
– Liking different things → collaborative filtering
– Structuring the world in different ways → ?

Motivation (2): Diversity-aware applications...
– Must have a (formal) notion of diversity
– Can follow a
  – “personalization approach” → adapt to the user’s value on the diversity variable(s) → transparently? Is this paternalistic?
  – “customization approach” → show the space of diversity → allow choice / semi-automatic!

(Our) Context
1. Diversity and Web usage: language, culture
2. Family of tools focussing on interactive sense-making helped by data mining
  – PORPOISE: global and local analysis of news and blogs + their relations
  – STORIES: finding + visualisation of “stories” in news
  – CiteseerCluster: literature search + sense-making
  – Damilicious: CiteseerCluster + re-use/transfer of semantics + diversity

Measuring grouping diversity
Diversity = 1 – similarity = 1 – normalized mutual information (NMI)
[Figure: example groupings with NMI = 0 and NMI = 0.35; label: “By colour &”]
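The measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slide does not say which normalisation of mutual information is used, so the arithmetic mean of the two entropies is an assumption here, and the function names `nmi` and `grouping_diversity` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a grouping given as a list of group labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two groupings of the same documents."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(n * nij / (count_a[x] * count_b[y]))
               for (x, y), nij in joint.items())

def nmi(a, b):
    """NMI, normalised by the mean of the two entropies (an assumption)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 and hb == 0:
        return 1.0  # both groupings put everything in one group
    return mutual_information(a, b) / ((ha + hb) / 2)

def grouping_diversity(a, b):
    """Diversity = 1 - similarity = 1 - NMI, as on the slide."""
    return 1.0 - nmi(a, b)

# Four documents grouped by colour and by shape: the groupings are independent.
by_colour = ["red", "red", "blue", "blue"]
by_shape = ["square", "circle", "square", "circle"]
print(grouping_diversity(by_colour, by_shape))
```

Position i in both lists refers to the same document; group labels themselves need not match between users, since NMI only looks at how documents are partitioned.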

Measuring user diversity
– “How similarly do two users group documents?”
– For each query q, consider their groupings gr: per-query diversity gdiv(A,B,q)
– For various queries: aggregate to gdiv(A,B)
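A small sketch of the aggregation step, under stated assumptions: the transcript does not preserve the aggregation formula, so averaging the per-query diversities is an assumption, and `aggregate_gdiv`, `diversity_matrix` and all numbers below are made up for illustration.

```python
from statistics import mean

def aggregate_gdiv(per_query):
    """Aggregate per-query diversities {query: gdiv(A,B,q)} into one value.

    Uses the plain average over queries (an assumption, not the slide's formula).
    """
    if not per_query:
        raise ValueError("need at least one query")
    return mean(per_query.values())

def diversity_matrix(users, pair_scores):
    """Build a symmetric user-by-user matrix of aggregated diversities.

    pair_scores maps an unordered user pair (A, B) to {query: gdiv(A,B,q)}.
    """
    index = {u: i for i, u in enumerate(users)}
    matrix = [[0.0] * len(users) for _ in users]
    for (a, b), per_query in pair_scores.items():
        d = aggregate_gdiv(per_query)
        matrix[index[a]][index[b]] = d
        matrix[index[b]][index[a]] = d
    return matrix

# Illustrative values only; they are not results from the presentation.
scores = {
    ("U0", "U1"): {"web mining": 0.2, "data mining": 0.4, "RFID": 0.3},
    ("U0", "U2"): {"web mining": 0.6, "data mining": 0.5, "RFID": 0.7},
    ("U1", "U2"): {"web mining": 0.5, "data mining": 0.5, "RFID": 0.5},
}
for row in diversity_matrix(["U0", "U1", "U2"], scores):
    print(row)
```

A matrix of this shape is the kind of input a multidimensional scaling routine would take to place users in a 2D plot.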

... and now: the application domain... that’s only the 1st step!

Workflow
– Query
– Automatic clustering
– Manual regrouping
– Re-use
  1. Learn + present way(s) of grouping
  2. Transfer the constructed concepts

Concepts
– Extension
  – the instances in a group
– Intension
  – Ideally: “squares vs. circles”
  – Pragmatically: defined via a classifier

Step 1: Retrieve
– CiteseerX via OAI
– Output: set of
  – document IDs,
  – document details
  – their texts
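As a rough sketch of what retrieval via OAI (OAI-PMH) can look like: the code below builds a `ListRecords` request URL and parses a Dublin Core response into document IDs, details and text. The base URL is a placeholder rather than the real CiteseerX endpoint, the helper names are hypothetical, and using `dc:description` as the document text is an assumption (full texts would need a separate download step).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core XML namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Extract id, title and description from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    docs = []
    for record in root.iterfind(".//oai:record", NS):
        docs.append({
            "id": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "text": record.findtext(".//dc:description", default="", namespaces=NS),
        })
    return docs

# Hand-written sample response, so the sketch runs without network access.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample title</dc:title>
          <dc:description>Sample abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("https://example.org/oai"))  # placeholder endpoint
print(parse_records(SAMPLE))
```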

Step 2: Cluster
– “the classic bibliometric solution”: CiteseerCluster:
  – Similarity measure: co-citation, bibliometric coupling, word or LSA similarity, combinations
  – Clustering algorithm: k-means, hierarchical
– Damilicious: phrases → Lingo
– How to choose the “best”?
  – Experiments: Lingo better than k-means at reconstruction and extension-over-time
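Two of the similarity measures named above can be illustrated with raw counts. This is only a sketch with hypothetical function names and toy data: co-citation counts how often two documents are cited together by a third, and coupling counts how many references two documents share. How CiteseerCluster normalises or combines these counts is not stated on the slide.

```python
from collections import Counter
from itertools import combinations

def coupling(references):
    """Number of shared references for every pair of citing documents.

    references maps each document to the set of documents it cites.
    """
    return {(a, b): len(references[a] & references[b])
            for a, b in combinations(sorted(references), 2)}

def co_citation(references):
    """Number of documents that cite both members of each cited pair."""
    counts = Counter()
    for cited in references.values():
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return counts

# Toy citation data: d1..d3 are search results, x..z are cited papers.
refs = {
    "d1": {"x", "y"},
    "d2": {"y", "z"},
    "d3": {"x", "y"},
}
print(coupling(refs))     # d1 and d3 share two references
print(co_citation(refs))  # x and y are cited together twice
```

Either count matrix could then be fed to a clustering algorithm such as k-means or hierarchical clustering after conversion to a similarity or distance.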

Step 3 (a): Re-organise & work on document groups

Step 3 (b): Visualising document groups

Steps 4+5: Re-use
– Basic idea:
  1. learn a classifier from the final grouping (Lingo phrases)
  2. apply the classifier to a new search result → “re-use semantics”
– Whose grouping?
  – One’s own
  – Somebody else’s
– Which search result?
  – “the same” (same query, structuring by somebody else)
  – “more of the same” (same query, later time → more docs)
  – “related” (... measured how? ...)
  – arbitrary
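The two re-use steps can be sketched with a deliberately simple stand-in classifier. Damilicious builds on Lingo phrases; the code below instead uses a nearest-centroid rule over bag-of-words vectors with cosine similarity, purely to show the learn-then-apply pattern. All function names and example texts are hypothetical; the fallback label “Other topics” mirrors the remainder group mentioned elsewhere in the slides.

```python
import math
from collections import Counter

def bag(text):
    """Bag-of-words vector for a text (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def learn_classifier(grouping):
    """Step 1: turn a final grouping {label: [texts]} into one centroid per group."""
    return {label: sum((bag(t) for t in texts), Counter())
            for label, texts in grouping.items()}

def apply_classifier(model, texts, fallback="Other topics"):
    """Step 2: assign each new document to the most similar group."""
    result = {}
    for text in texts:
        scores = {label: cosine(bag(text), centroid) for label, centroid in model.items()}
        best = max(scores, key=scores.get)
        result[text] = best if scores[best] > 0 else fallback
    return result

# A user's final grouping (toy data), re-used on a new search result.
grouping = {
    "web mining": ["web usage mining of server logs", "mining web link structure"],
    "rfid": ["rfid tag data streams", "rfid sensor tracking"],
}
model = learn_classifier(grouping)
print(apply_classifier(model, ["mining web logs", "rfid tag tracking", "quantum chemistry"]))
```

The grouping passed to `learn_classifier` can be one's own or somebody else's, which is what makes the learned concepts transferable between users.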

Visualising user diversity (1)
Simulated users with different strategies
– U0: did not change anything (“System”)
– U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings
– U2: attempted to move everything that did not fit well into the remainder group “Other topics”, & better fit; 10 regroupings
– U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings
– U4: regrouping by author and institution; 5 regroupings
→ 5×5 matrix of diversities gdiv(A,B,q) → multidimensional scaling

Visualising user diversity (2)
aggregated using gdiv(A,B)
[Figure: panels labelled Web mining, Data mining, RFID]

Evaluating the application
– Clustering only: Does it generate meaningful document groups?
  – yes (tradition in bibliometrics) – but: data?
  – Small expert evaluation of CiteseerCluster
– Clustering & regrouping
  – End-user experiment with CiteseerCluster
  – 5-person formative user study of Damilicious

Summary and (some) open questions
– Damilicious: a tool that helps users in sense-making, exploring diversity, and re-using semantics
– diversity measures when queries and result sets are different?
– how to best present diversity?
  – How to integrate into an environment supporting user and community contexts (e.g., Niederée et al. 2005)?
  – Incentives to use the functionalities?
– how to find the best balance between similarity and diversity?
– which measures of grouping diversity are most meaningful?
  – Extensional?
  – Intensional? Structure-based? Hybrid? (cf. ontology matching)
– which other sources of user diversity?
Thanks!