Evaluating Hierarchical Clustering of Search Results
Juan Cigarrán, Anselmo Peñas, Julio Gonzalo, Felisa Verdejo
Departamento de Lenguajes y Sistemas Informáticos, UNED, Spain (nlp.uned.es)
SPIRE 2005, Buenos Aires

Overview
Scenario
Assumptions
Features of a Good Hierarchical Clustering
Evaluation Measures
– Minimal Browsing Area (MBA)
– Distillation Factor (DF)
– Hierarchy Quality (HQ)
Conclusion

Scenario
Complex information needs:
– Compile information from different sources
– Inspect the whole list of documents (more than 100 documents)
Help the user to:
– Find the relevant topics
– Discriminate relevant from irrelevant documents
Approach:
– Hierarchical clustering via Formal Concept Analysis

Problem
How to define and measure the quality of a hierarchical clustering?
How to compare different clustering approaches?

Previous assumptions
Each cluster contains only those documents fully described by its descriptors.

                  d1   d2   d3   d4
Physics            X    X    X    X
Nuclear physics         X    X
Astrophysics                      X

Two readings of the resulting hierarchy: under this assumption, the Physics cluster holds only d1, with children Nuclear physics {d2, d3} and Astrophysics {d4}; without it, the Physics cluster would hold all of d1, d2, d3, d4 and repeat the documents of its children.
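This assumption can be sketched in a few lines of Python; the incidence table below is the toy example from this slide, and the function name is ours:

```python
# Toy document/descriptor incidence table from the slide.
docs = {
    "d1": {"Physics"},
    "d2": {"Physics", "Nuclear physics"},
    "d3": {"Physics", "Nuclear physics"},
    "d4": {"Physics", "Astrophysics"},
}

def cluster_contents(descriptors, docs):
    """A cluster holds only the documents *fully described* by its
    descriptors, i.e. whose attribute set equals the cluster's intent."""
    return {d for d, attrs in docs.items() if attrs == descriptors}

print(sorted(cluster_contents({"Physics"}, docs)))                     # ['d1']
print(sorted(cluster_contents({"Physics", "Nuclear physics"}, docs)))  # ['d2', 'd3']
```

Note how d2, d3 and d4 do not appear in the Physics cluster, even though Physics describes them, because their full descriptions are more specific.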

Previous assumptions
'Open world' perspective.

                      d1   d2   d3
Physics                X         X
Jokes                       X    X
Jokes about physics              X

In the resulting lattice, Jokes about physics {d3} lies below both Physics {d1} and Jokes {d2}, appearing once as a shared child rather than being duplicated under each parent.

Good Hierarchical Clustering
The content of the clusters:
– Clusters should not mix relevant with non-relevant information

Good Hierarchical Clustering
The hierarchical arrangement of the clusters:
– Relevant information should lie along the same path

Good Hierarchical Clustering
The number of clusters:
– The number of clusters should be substantially lower than the number of documents
How clusters are described:
– Cognitive load of reading a cluster description
– Ability to predict the relevance of the information a cluster contains (not addressed here)

Evaluation Measures
Criterion:
– Minimise the browsing effort needed to find ALL the relevant information
Baseline:
– The original document list returned by the search engine

Evaluation Measures
Consider:
– Content of the clusters
– Hierarchical arrangement of the clusters
– Size of the hierarchy
– Cognitive load of reading a document (in the baseline): K_d
– Cognitive load of reading a node descriptor (in the hierarchy): K_n
Requirement:
– Relevance assessments must be available

Minimal Browsing Area (MBA)
The minimal set of nodes the user has to traverse to find ALL the relevant documents while minimising the number of irrelevant ones.
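As a rough sketch of this definition (our own simplification: nodes are treated as flat document sets, ignoring traversal order, and the node contents are hypothetical), the MBA can be found by brute force on small hierarchies:

```python
from itertools import combinations

def minimal_browsing_area(nodes, relevant):
    """Brute-force sketch: pick the subset of nodes that covers ALL the
    relevant documents while adding the fewest irrelevant ones.
    Feasible only for small hierarchies; nodes are plain document sets."""
    best_cost, best_subset = None, None
    for r in range(1, len(nodes) + 1):
        for subset in combinations(range(len(nodes)), r):
            covered = set().union(*(nodes[i] for i in subset))
            if relevant <= covered:                  # all relevant docs found
                cost = len(covered - relevant)       # irrelevant docs read
                if best_cost is None or cost < best_cost:
                    best_cost, best_subset = cost, subset
    return best_subset

# Hypothetical clusters: node 0 = {d1}, node 1 = {d2, d3}, node 2 = {d4, d5}
nodes = [{"d1"}, {"d2", "d3"}, {"d4", "d5"}]
print(minimal_browsing_area(nodes, {"d1", "d2", "d3"}))  # (0, 1)
```

Nodes 0 and 1 together contain every relevant document and no irrelevant one, so they form the MBA; adding node 2 would only add irrelevant reads.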

Distillation Factor (DF)
– The ability to isolate relevant information compared with the original document list (a gain factor: DF > 1 means the hierarchy improves on the list)
– Considers only the cognitive load of reading documents
– The formula (an image in the original slides, reconstructed here from the worked example): DF(L) = Precision_MBA / Precision(L), which is equivalent to |L| / |documents in the MBA|, since the MBA contains all the relevant documents by definition

Distillation Factor (DF): Example
Document list: Doc1 (+), Doc2 (-), Doc3 (+), Doc4 (+), Doc5 (-), Doc6 (-), Doc7 (+)
Precision = 4/7; Precision_MBA = 4/5
DF(L) = (4/5) / (4/7) = 7/5 = 1.4
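The arithmetic of this example can be checked directly (a minimal sketch; the relevance flags are taken from the slide):

```python
def precision(reads):
    """Fraction of relevant documents among those read.
    reads: list of booleans, True = relevant."""
    return sum(reads) / len(reads)

# Original list from the slide: Doc1..Doc7, '+' relevant, '-' irrelevant
ranked = [True, False, True, True, False, False, True]
# Documents read inside the MBA: all 4 relevant ones plus 1 irrelevant
mba_reads = [True, True, True, True, False]

df = precision(mba_reads) / precision(ranked)
print(df)  # ≈ 1.4 (= 7/5)
```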

Distillation Factor (DF): Counterexample
Precision = 4/8; Precision_MBA = 4/4
DF = (4/4) / (4/8) = 8/4 = 2
A bad clustering can still score a good DF. The fix is to extend the DF measure with the cognitive cost of taking browsing decisions, which leads to the Hierarchy Quality (HQ) measure.

Hierarchy Quality (HQ)
Assumption:
– When a node in the MBA is explored, all its lower neighbours have to be considered: some will in turn be explored, the rest discarded
– N_view: the set of lower neighbours of the nodes belonging to the MBA (|N_view| = 8 in the slide's figure)

Hierarchy Quality (HQ)
– K_n and K_d are directly related to the retrieval scenario in which the experiments take place
– The researcher must tune K = K_n / K_d before conducting the experiment
– HQ > 1 indicates that the clustering improves on the original list
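The HQ formula itself was a figure in the original slides and did not survive extraction. The sketch below is a hypothetical reconstruction consistent with the definitions above (baseline reading cost divided by hierarchy browsing cost, with K = K_n/K_d), not the authors' exact formula; the example numbers combine |N_view| = 8 from the previous slide's figure with the counterexample's counts, purely for illustration:

```python
def hq(list_len, n_view, mba_docs, k=0.5):
    """Hypothetical reconstruction of Hierarchy Quality: cost of reading the
    whole baseline list divided by the cost of browsing the hierarchy.
    list_len: documents in the original list (each costs K_d)
    n_view:   node descriptors inspected along the MBA (each costs K_n)
    mba_docs: documents contained in the MBA's nodes (each costs K_d)
    k:        K_n / K_d, tuned to the retrieval scenario."""
    return list_len / (k * n_view + mba_docs)

# Illustrative figures: |L| = 8, |N_view| = 8, 4 documents in the MBA.
# With k = 0.5, the naive DF of 2 collapses to HQ = 1: no real improvement.
print(hq(8, 8, 4))  # 1.0
```

With n_view = 0 (no browsing decisions to make) the expression reduces to |L| / |MBA docs|, i.e. the document-only gain that DF measures.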

Hierarchy Quality (HQ): Example

Conclusions and Future Work
A framework for comparing different clustering approaches, taking into account:
– Content of the clusters
– Hierarchical arrangement of the clusters
– Cognitive load of reading document and node descriptions
The framework is adaptable to the retrieval scenario in which the experiments take place.
Future work:
– Conduct user studies and compare their results with the automatic evaluation; the results will reflect the quality of the descriptors and will be used to fine-tune the K_d and K_n parameters

Thank you!