Creating and Visualizing Document Classification
J. Gelernter, D. Cao, R. Lu, E. Fink, J. Carbonell

Justification for fuzzy document classification
Search aims are fuzzy: how can you know exactly what you’re looking for when you don’t know the possibilities?
Belkin et al. (1982) call this an “anomalous state of knowledge.”
Fuzzy clusters therefore reflect the searcher’s cognitive state.

Research overview
Hypothesis: Fuzzy clustering and visualization of results should save time by directing searchers to the level of results that they wish to view (rather than breaking off arbitrarily at the bottom of the screen).
Setting: a prototype digital library for paleontology.

Talk overview
Background: classification is often done with algorithms alone, de-emphasizing the document
Approach:
* Facets and browse categories
* Metadata generation
* Classifier algorithms
* Visualization: labels and color grid
Findings from paleontologist experiments:
* positive response to our fuzzy classification
* muted response to our fuzzy visualization

Background: fuzzy clustering
Text classification is well researched (see the review by Sebastiani, 2002).
Results depend on the algorithm used (k-nearest neighbor, naïve Bayes, support vector machines, etc.) and on the document representation (bag of words, or with natural language processing factors).
Our work differs from others’ in its emphasis on document representation, which we hoped would provide greater precision.
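As a minimal sketch of the simplest representation named above, the bag of words, the Python below counts lowercase word tokens in a text. The function name `bag_of_words` and the tokenization rule (runs of ASCII letters) are assumptions made for illustration; they are not the representation used in this work.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    # Assumed tokenization: runs of ASCII letters. Real systems vary.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


example = bag_of_words("Fossil leaves and fossil seeds")
print(example["fossil"])  # word order is discarded, only counts remain
```

A representation like this ignores where in the document a word occurs, which is the kind of information a richer document representation can keep.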

Background: fuzzy info visualization
“Research in visualisation of fuzzy systems is still at an early stage” (Pham and Brown, 2003)
Approaches include:
-- location on the page, with the top being most relevant (see left, ours)
-- 3D
-- icons (see left)
-- color gradations, with dark most relevant (ours)
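The two approaches marked "ours" (position on the page and color gradation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's code: the three-level scale, the gray values 220 and 60, and the names `shade_for` and `order_for_page` are all hypothetical.

```python
def shade_for(level, max_level=3, lightest=220, darkest=60):
    """Map a relevance level (1..max_level) to a gray hex color.

    Darker means more relevant. The gray range is an assumed choice.
    """
    if not 1 <= level <= max_level:
        raise ValueError("level out of range")
    if max_level == 1:
        value = darkest
    else:
        fraction = (level - 1) / (max_level - 1)
        value = round(lightest - fraction * (lightest - darkest))
    return "#{0:02x}{0:02x}{0:02x}".format(value)


def order_for_page(results):
    """Sort (title, level) pairs so the most relevant appear at the top."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical results: (title, relevance level)
results = [("Article A", 1), ("Article B", 3), ("Article C", 2)]
for title, level in order_for_page(results):
    print(title, shade_for(level))
```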

Pre-set queries: facets based on user needs

Queries are supported by a controlled vocabulary, or ontology

Metadata generation: classification according to article rhetoric (could be improved)

Knowledge engineering rather than machine learning, for a small document set
Rules for finding matches of document to query
Example: Ma [number], Mya [number], Myr [number], B.P. [number] in a document match to the associated time periods
Rules for clustering documents into fuzzy categories (requires metadata generation)
Example:
*** Highly relevant if match found in title or abstract
** Relevant if match found in caption…
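The two kinds of rule above can be sketched in Python as follows. This is a hedged illustration, not the system's rule set. It assumes the number precedes the unit in running text (as in "150 Ma"), that B.P. values are plain years, and that documents are dictionaries with `title`, `abstract` and `caption` fields. The period boundaries are approximate, illustrative values, and all names (`AGE_PATTERN`, `PERIODS`, `find_ages_ma`, `periods_for`, `relevance`) are hypothetical. Only the two relevance rules shown on the slide are modeled; the elided ones are not.

```python
import re

# Assumed form: a number followed by an age unit, e.g. "150 Ma", "2.5 Myr".
AGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(Mya|Myr|Ma|B\.P)\b\.?")

# Approximate, illustrative boundaries in millions of years (older, younger).
PERIODS = {
    "Triassic": (252.0, 201.0),
    "Jurassic": (201.0, 145.0),
    "Cretaceous": (145.0, 66.0),
}


def find_ages_ma(text):
    """Return every age mentioned in text, converted to millions of years."""
    ages = []
    for number, unit in AGE_PATTERN.findall(text):
        value = float(number)
        if unit == "B.P":  # assumed to be years before present
            value /= 1_000_000
        ages.append(value)
    return ages


def periods_for(text):
    """Return the set of period names that the ages in text fall into."""
    found = set()
    for age in find_ages_ma(text):
        for name, (older, younger) in PERIODS.items():
            if younger <= age < older:
                found.add(name)
    return found


def relevance(doc, period):
    """Apply the two rules shown on the slide; other rules are not modeled."""
    in_title = period in periods_for(doc.get("title", ""))
    in_abstract = period in periods_for(doc.get("abstract", ""))
    if in_title or in_abstract:
        return "***"  # highly relevant
    if period in periods_for(doc.get("caption", "")):
        return "**"  # relevant
    return None


doc = {
    "title": "An Allosaurus from 150 Ma deposits",
    "abstract": "",
    "caption": "Specimen dated 70 Mya",
}
print(relevance(doc, "Jurassic"), relevance(doc, "Cretaceous"))
```

Because the second rule depends on knowing which text is a title, abstract or caption, it only works once metadata generation has split the article into those parts.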

To solve the problem of showing uncertainty clusters in a familiar list

To solve the problem of showing more results per screen as well as showing clusters

Participant experiments (algorithm testing)
Participants: 3 paleontologists (an undergraduate, a graduate student and a museum curator)
Method: compare the classifications made by people and by the system for the same articles
Sample: 30 articles, a mix of training and non-training set articles, from 3 categories: gingko (3 levels of relevancy), allosaurus (3 levels of relevancy), neither
RESULTS: 70% of system classifications agreed with at least 1/3 of participant ratings
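One way to compute an agreement figure of this kind is sketched below. It assumes that "at least 1/3 of participant ratings" means the system's label matches the label given by at least one of the three participants for that article. That reading, the function name `agreement_rate`, the label strings and the data are all hypothetical and are not taken from the experiment.

```python
def agreement_rate(system_labels, participant_labels):
    """Fraction of articles whose system label matches at least one participant label."""
    agreed = sum(
        1
        for article, label in system_labels.items()
        if label in participant_labels[article]
    )
    return agreed / len(system_labels)


# Hypothetical ratings for four articles, three participants each.
system = {
    "a1": "allosaurus-high",
    "a2": "neither",
    "a3": "allosaurus-low",
    "a4": "neither",
}
people = {
    "a1": ["allosaurus-high", "allosaurus-medium", "allosaurus-high"],
    "a2": ["neither", "neither", "neither"],
    "a3": ["allosaurus-high", "allosaurus-medium", "allosaurus-medium"],
    "a4": ["neither", "allosaurus-low", "neither"],
}
print(agreement_rate(system, people))
```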

Participant experiments (interface testing)
Pilot testing with the paleontologist in our group
Paleontology conferences:
– Spring 2009 NAPC (North American Paleontological Convention) – 17 returned
– Fall 2009 SVP (Society of Vertebrate Paleontology)
Ask 3 graduate or undergraduate students in paleontology to classify the articles – results not yet returned
Questionnaires:
– Spring questionnaire: design focus
– Fall: comparative focus (features as well as design)
RESULTS:
58.8% liked our labels
35.7% liked our grid

Future directions
To improve fuzzy classification: adapt the CiteSeer parse algorithm to improve our classification
To improve visualization: list view with labels and colors for uncertainty levels

Contributions in summary
(1) Fuzzy result groupings represent the “fuzzy” concept of the search aim as it exists in the user’s mind, so uncertainty labels are appreciated
(2) Fuzzy color blocks that represent abstract categories are not liked; stick to minor modifications of the familiar list

References
Belkin, N.J., Oddy, R.N. and Brooks, H.M. (1982). ASK for information retrieval. Part I: Background and theory; Part II: Results of a design study. Journal of Documentation, vol. 3, no. 2 & 3.
Pham, B. and Brown, R. (2003). Analysis of visualization requirement for fuzzy systems. Proceedings of the 1st international conference on computer graphics and interactive techniques in Australasia and South East Asia, Melbourne, Australia, 181 ff.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34 (1), 1-47.