Scatter/Gather : A Cluster Based Approach to Large Document Collections Alyssa Katz LIS 551 March 23, 2003.

Introduction
- Alternate uses for document clustering
- Give document clustering a second chance!

Old Approach
- Compare document clustering with vector space (VS) models
  - Cluster searches are, for the most part, inferior to VS searches
  - Document clustering algorithms are SLOW
- CONCLUSION: Document clustering should be used only as a way to accelerate VS searches

New Approach
- Document clustering is not bad, just misunderstood
- The REAL question is: How can clustering be effective in its own right?
- THE ANSWER: The “Scatter/Gather Method”

Searching vs. Browsing
Searching:
- Specific information need
- User has a good idea of keywords or search terms
- Faster, more pointed
Browsing:
- User wants more general info
- User is not familiar with the vocabulary, or doesn’t want to commit to a specific set of words
- User will sift through info to find what he wants

Solution
- Use clustering to browse a system the way one would browse a table of contents
- Provide a function that lets the user alternate between browsing and searching

Scatter/Gather
- User is presented with short summaries of a small number of document groups
- User selects one or more groups for further study
- Continue this process until the individual document level is reached
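The browsing loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the implementation from the original system: the seed-based grouping in scatter, the Jaccard word-overlap measure, the most-frequent-word summaries, and all function and parameter names (scatter, gather, summarize, choose, min_size) are assumptions made for the example.

```python
from collections import Counter


def tokens(doc):
    """Lowercased set of whitespace-separated words in a document."""
    return set(doc.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scatter(docs, k):
    """Toy clustering (an assumption): the first k documents act as seeds,
    and every document joins the group of its most similar seed."""
    seeds = docs[:k]
    groups = [[] for _ in seeds]
    for doc in docs:
        best = max(range(len(seeds)),
                   key=lambda i: jaccard(tokens(doc), tokens(seeds[i])))
        groups[best].append(doc)
    return [g for g in groups if g]


def summarize(group, n_words=3):
    """Short summary of a group: its most frequent words (ties broken alphabetically)."""
    counts = Counter(w for doc in group for w in tokens(doc))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n_words]]


def gather(groups, chosen):
    """Merge the groups the user selected into one smaller collection."""
    return [doc for i in chosen for doc in groups[i]]


def scatter_gather(docs, k, choose, min_size):
    """Alternate scatter and gather until the collection is small enough to read.
    `choose` stands in for the user: it receives the group summaries and
    returns the indices of the groups to keep."""
    while len(docs) > min_size:
        groups = scatter(docs, k)
        if len(groups) < 2:
            break  # nothing left to distinguish
        chosen = choose([summarize(g) for g in groups])
        selected = gather(groups, chosen)
        if not selected or len(selected) == len(docs):
            break  # the selection did not narrow the collection
        docs = selected
    return docs
```

In a real interface, choose would be the user clicking on group summaries; here any function that maps a list of summaries to a list of group indices will do.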

Example
1. 5000 articles in the NYT News Service
2. International News
3. Kuwait and Germany and Oil
4. Articles about effect of invasion on oil market, U.S. military deployment in Kuwait
5. Document

Requirements
- New algorithms
  - One that can appropriately cluster large document collections
  - One that can generate adequate summaries of these document collections

Solution
- Buckshot algorithm for the first requirement
  - Employs a random sample of the documents
- Fractionation for the second requirement
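As a rough sketch of the Buckshot idea: cluster only a small random sample of the documents, then use the resulting clusters as centers and assign every document to its nearest center. The sample size of about sqrt(k * n) follows the usual description of Buckshot. Everything else here is a simplifying assumption for illustration: raw term counts as vectors, cosine similarity, merging by centroid similarity in place of a group-average criterion, and the helper names (vectorize, agglomerate, buckshot).

```python
import math
import random
from collections import Counter


def vectorize(doc):
    """Raw term-count vector for a document (an assumption; no weighting)."""
    return Counter(doc.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Summed vector of a cluster; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def agglomerate(vectors, k):
    """Repeatedly merge the two clusters with the most similar centroids
    until only k clusters remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


def buckshot(docs, k, seed=0):
    """Cluster a random sample of about sqrt(k * n) documents, then assign
    all n documents to the nearest resulting center."""
    rng = random.Random(seed)
    vectors = [vectorize(d) for d in docs]
    sample_size = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    sample = rng.sample(vectors, sample_size)
    centers = [centroid(c) for c in agglomerate(sample, k)]
    groups = [[] for _ in centers]
    for doc, vec in zip(docs, vectors):
        best = max(range(len(centers)), key=lambda i: cosine(vec, centers[i]))
        groups[best].append(doc)
    return groups
```

The slow quadratic merging step runs only on the sample, so its cost grows with k * n and not with n squared; the final assignment pass is a single sweep over the collection.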

Application to Scatter/Gather
- Basically, clustering is done beforehand, and real-time searches do not cluster from scratch
- Real-time searches just refine what already exists