Architecture for graphical maps of Web contents Krzysztof Ciesielski, Michal Draminski, Mieczyslaw Klopotek, Mariusz Kujawiak, Slawomir Wierzchon Institute.

Presentation transcript:

Architecture for graphical maps of Web contents
Krzysztof Ciesielski, Michal Draminski, Mieczyslaw Klopotek, Mariusz Kujawiak, Slawomir Wierzchon
Institute of Computer Science, PAS, Warsaw; University of Podlasie, Siedlce; Białystok University of Technology

Agenda
– Motivation
– Architecture
– Map interface
– Map creation
– Map clustering
– Execution time of map creation
– Convergence of map creation
– Future direction

Motivation
– The Web, and intranets as well, are becoming increasingly content-rich.
– A good way of presenting massive document sets in an understandable form will be crucial in the near future.
– The BEATCA project envisages a user-friendly presentation of the content of moderate-size document collections (with millions of documents).

Our approach
– The presentation method is based on the WebSOM map idea, enriched with novel methods of document analysis, clustering and visualization.
– A special architecture has been developed to enable experiments with different variants of the map creation algorithm.
– Our research aims at creating a full-fledged search engine (working name: Beatca) for small collections of documents, capable of presenting on-line replies to queries in graphical form on a document map.

Architecture
We follow the general architecture of search engines:
– documents are prepared for retrieval by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation;
– the map creator is then applied, turning the vector-space representation into a form suitable for on-the-fly map generation;
– the maps are used by the query processor when responding to users' queries.
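The indexer step on this slide can be illustrated with a minimal sketch. It assumes plain TF-IDF weighting and a naive tokenizer; the slides do not say which term weighting BEATCA uses, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Assumption: weight = term frequency * log(N / document frequency).
    """
    term_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    return [
        {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for counts in term_counts
    ]

docs = ["maps of web documents", "web search engine", "document maps and clustering"]
vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive larger weights, so in the first document "of" (one document) outweighs "web" (two documents). A real indexer would also strip HTML markup and apply stemming and stop-word removal.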

Architecture [Diagram; labels: HT Base, Vector Base, Map, Base Registry, Robot, Indexer, Optimizer, Mapper, Search Engine]

User interface
– Search results are presented on a document map.
– The map can take one of two forms:
  – the traditional flat map
  – the rotating torus
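The difference between the flat map and the torus can be shown through the distance between map cells. The sketch below assumes a rectangular grid of cells addressed as (row, column); on a torus both axes wrap around, so opposite edges are neighbours. The function name is hypothetical.

```python
def torus_distance(a, b, rows, cols):
    """Euclidean distance between cells a and b on a grid that wraps around."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # On a torus the shorter way round may cross the edge of the grid.
    dr = min(dr, rows - dr)
    dc = min(dc, cols - dc)
    return (dr * dr + dc * dc) ** 0.5
```

On a 10 x 10 grid, cells (0, 0) and (0, 9) are nine steps apart on the flat map but only one step apart on the torus, which removes the border effects of a flat map.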

Rotating torus representation of the map

How are the maps created?
– A modified WebSOM method is used.
– The modification is based on our observation of a radical reduction of document vector variation.
– Multi-level maps
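For orientation, here is a minimal sketch of one training step of a standard self-organizing map, the technique underlying WebSOM. It is the textbook update rule, not the authors' modified method (the slides do not describe the modification); the flat grid, the Gaussian neighbourhood and all names are assumptions.

```python
import math

def best_matching_unit(grid, x):
    """Return (row, col) of the cell whose reference vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(grid):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def som_step(grid, x, alpha, radius):
    """Move the winner and its map neighbours towards x (in place)."""
    wr, wc = best_matching_unit(grid, x)
    for r, row in enumerate(grid):
        for c, w in enumerate(row):
            grid_dist2 = (r - wr) ** 2 + (c - wc) ** 2
            if grid_dist2 <= radius ** 2:
                # Gaussian neighbourhood: cells nearer the winner move more.
                h = math.exp(-grid_dist2 / (2.0 * max(radius, 1e-9) ** 2))
                row[c] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]
    return (wr, wc)

# A 2 x 2 map of 2-dimensional reference vectors.
grid = [[[0.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]]
winner = som_step(grid, [0.9, 0.1], alpha=0.5, radius=0)
```

With `radius=0` only the winning cell is updated; training repeats this step over all document vectors while `alpha` and `radius` shrink.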

A map for 20 newsgroups

A detailed map for Syskill&Webert 4 document groups

A high level map for Syskill&Webert 4 document groups

Clustering groups of documents
– A fuzzy ISODATA method is used.
– Entropy-based.
– Initialisation with a minimum-weight spanning tree.
– Clustered documents are labeled by weighted centroids of cell reference vectors, modified with entropy.
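A sketch of the membership computation at the heart of fuzzy clustering, assuming that "fuzzy ISODATA" here refers to the classic fuzzy c-means scheme with Euclidean distance. The entropy-based modifications mentioned on the slide are not covered, and the fuzzifier `m` is an assumed parameter.

```python
def memberships(point, centroids, m=2.0):
    """Fuzzy c-means membership of `point` in each cluster (values sum to 1)."""
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, cen)) ** 0.5 for cen in centroids
    ]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]
```

A point at distance 1 from one centroid and distance 2 from another gets memberships 0.8 and 0.2 when `m = 2`: unlike in hard clustering, every document belongs to every cluster to some degree.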

Approximate clustering using minimal spanning tree for 5 newsgroups
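One common way to obtain an approximate clustering from a minimum spanning tree is to build the tree and then cut its longest edges. The sketch below assumes this is roughly what is meant; the slides do not give the authors' exact procedure. It uses Prim's algorithm on Euclidean distances, expects 1 <= k <= number of points, and all names are hypothetical.

```python
def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (dist, i, j) edges."""
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the tree built so far.
        d, i, j = min(
            (dist(i, j), i, j) for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((d, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and return a cluster id per point."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    label = list(range(len(points)))

    def find(i):
        while label[i] != i:
            i = label[i]
        return i

    for _, i, j in kept:
        label[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

The resulting groups can then seed an iterative method such as the fuzzy clustering mentioned on the previous slide.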

Label candidates for clusters (5 newsgroups)
Word rank | Cluster #1 sci.math | Cluster #2 sci.med / sci.math | Cluster #3 talk.religion.misc (a) | Cluster #4 soc.culture.israel | Cluster #5 comp.windows.x | Cluster #6 talk.religion.misc (b)
1 | die | cipher | men | israel | boot | funding
2 | probable | block | raped | palestinian | windows | study
3 | theory | stream | women | gun | files | taxes
4 | registers | key | children | aziz | menus | stock
5 | mathematics | otp | child | iraqis | lib | health
6 | equation | algorithms | sex | koppel | icon | market
7 | krhsmsoc [clusters #1 to #3; word boundaries not recoverable] | israeli | label | social
8 | cos | simon | father | jews | folder | mercer
9 | sequence | combinations | paternity | resolution | msvcrtd | governing
10 | tex | shen | feminist | oliver | pcr | vaccinations
11 | space | distinction | trolling | utah | daffyd | measurement
12 | gravitational | encryption | white | johncshortcutss [clusters #4 to #6; word boundaries not recoverable]
13 | wave | epimethius | lib | nra | netzero | duke
14 | latex | randomness | england | 1991 | obj | quantum
15 | pdf | smartcard | support | firearms | tab | jama
16 | mac | entropy | woman | settlements | kernel | hopems
17 | files | yahoo | black | palestine | duck | bushes
18 | israelici [clusters #1 and #2; word boundary not recoverable] | brother | permitted | installed | computer
19 | debt | model | chat | gis | backup | companies
20 | unsigned | lottery | media | iraq | desktop | diabetes

Experiments with execution time
The impact of the following factors on the speed of map creation was investigated:
– Map size
– Optimization method
  – Dictionary optimization (extreme entropy and extreme frequency)
  – Reference vector optimization
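Dictionary optimization can be pictured as pruning terms whose statistics are extreme. The sketch below assumes that "extreme frequency" means terms occurring in too few or too many documents, and that "extreme entropy" means terms spread almost uniformly over the collection; the slides do not define either criterion. The thresholds and function names are hypothetical.

```python
import math
from collections import Counter

def term_entropy(term, doc_counts):
    """Shannon entropy (bits) of how a term's occurrences spread over documents."""
    freqs = [c[term] for c in doc_counts if c[term] > 0]
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs)

def prune_dictionary(doc_counts, min_df, max_df, max_entropy):
    """Keep terms that are neither too rare, too common, nor too evenly spread."""
    doc_freq = Counter(t for c in doc_counts for t in c)
    return sorted(
        t
        for t, df in doc_freq.items()
        if min_df <= df <= max_df and term_entropy(t, doc_counts) <= max_entropy
    )

# One Counter of term occurrences per document.
doc_counts = [
    Counter({"the": 3, "map": 2}),
    Counter({"the": 3, "torus": 1}),
    Counter({"the": 3, "map": 1}),
    Counter({"the": 3, "xyzzy": 1}),
]
kept = prune_dictionary(doc_counts, min_df=2, max_df=3, max_entropy=1.5)
```

A smaller dictionary shortens every document vector and every reference vector, which is why it affects the time needed to create a map.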

Convergence
We checked how the convergence of the maps to a stable state depends on:
– the type of alpha function (search radius reduction)
– the type of winner search method
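Two common shapes for a decay function of this kind are sketched below, assuming that alpha is a value reduced over the training iterations `t = 0 .. t_max`. The slides do not say which functions were compared; the start and end values are illustrative.

```python
def linear_alpha(t, t_max, alpha0=0.5):
    """Decrease linearly from alpha0 at t=0 to 0 at t=t_max."""
    return alpha0 * (1.0 - t / t_max)

def exponential_alpha(t, t_max, alpha0=0.5, alpha_end=0.01):
    """Decrease geometrically from alpha0 at t=0 to alpha_end at t=t_max."""
    return alpha0 * (alpha_end / alpha0) ** (t / t_max)
```

The exponential schedule drops quickly at first and then flattens, so the map freezes earlier than under the linear schedule, which keeps adapting at a steady rate until the end.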

Future research
We intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects.
– Bayesian networks will be applied in particular to classify documents, to accelerate document clustering, to construct a thesaurus supporting query enrichment, and to extract keywords.
– Immuno-genetic systems will be used for adaptive document clustering, by referring to the mechanism of so-called metadynamics; for extraction of compact characteristics of document groups, by exploiting the mechanism of construction of universal and specialized antibodies; and for visualisation and adjustment of the resolution of document maps.

Thank you. Any questions?