BINGO! and Daffodil: Personalized Exploration of Digital Libraries and Web Sources

Martin Theobald, Max-Planck-Institut für Informatik
Claus-Peter Klas, Universität Duisburg-Essen

Theobald, Klas: Personalized Exploration of Digital Libraries and Web Sources

Overview
  - Challenges in Web Recommendations for Digital Library Users
  - DAFFODIL
  - BINGO!
  - Meta Data Processing
  - System Integration via Web Services
  - Experiments

1. Recommendations for Digital Library Users

Current recommendation systems are based on
  - User collaboration & mining of user behaviour
  - Pre-computed document similarities (clustering, cosine measure, co-citations, etc.) limited to a single DL index

Challenges in providing Web recommendations
  - Well organized and structured Digital Library world vs. heterogeneous and chaotic Web
  - Unified view on related publications and topic-specific Web documents (e.g., authors’ homepages)
  - “Virtual links” from meta data through additional Deep Web search
      - Topic-specific hub pages
      - Full text sources for training
  - Filtering, user feedback, and iterative topic refinement

2. DAFFODIL: Motivation

[Diagram: digital libraries DL 1, DL 2, ..., DL k and their individual services]

  - Many access points, e.g., DBLP, CiteSeer, ACM, Springer, Achilles, HCIBIB
  - Different query formats & services
  - Insufficient coverage
  - Insufficient functionality

Combination of Services & Information

[Diagram: DAFFODIL as a Virtual Digital Library with services on top of DL 1, DL 2, ..., DL k]

  - One access point
  - Integration of DLs and individual services
  - Meta data extraction
  - Synergy effects
  - Strategic information search
  - Information organization in personal folders

2.2. Lifecycle of Scientific Search as an Interactive and Iterative Process

[Diagram labels: Information Sources; Retrieve Information; Interpret Retrieved Information; Create Knowledge; Structured Personal Lib]

Personal Library

Publications organized as structured meta data
Digital Library Objects (DLOs)
  - Authors, title
  - Web links
  - Journals
  - Conferences
  - Keywords, abstracts
  - Queries
Meta data export in XML or BibTeX
Basis for information exchange with BINGO!

3. BINGO!

Focused crawler
  - Combination of a traditional crawler with a classifier and a link analyzer for topic distillation
  - Crawls mainly within the densely connected neighbourhood graph for the topics of interest
  - Filters out irrelevant documents directly at the crawler frontier

Information portal generation
  - Initialized by a set of training documents (e.g., bookmarks)
  - Extends and maintains a user-specific topic directory (Yahoo!)

Expert queries
  - Initialized by a single-topic keyword query
  - Exploits an existing index or external search engine for retrieving initial hubs (training & seed)
  - Improves recall
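The focused-crawler idea on this slide can be illustrated with a best-first crawl that scores every fetched page and expands only the relevant ones. This is a minimal sketch under strong assumptions, not BINGO!'s implementation: the in-memory `WEB` graph, the keyword-overlap `relevance` function (a stand-in for the trained classifier), and the threshold are all hypothetical.

```python
import heapq

# Hypothetical toy "Web": url -> (page text, outgoing links).
WEB = {
    "seed": ("workflow management systems", ["a", "b"]),
    "a": ("workflow engines and process models", ["c"]),
    "b": ("cooking recipes for pasta", ["d"]),
    "c": ("petri nets for workflow analysis", []),
    "d": ("more pasta recipes", []),
}

# Assumed topic vocabulary; a real system would use a trained classifier.
TOPIC_TERMS = {"workflow", "process", "petri"}


def relevance(text):
    """Fraction of topic terms occurring in the text (placeholder score)."""
    words = set(text.split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def focused_crawl(seeds, threshold=0.3, limit=100):
    """Best-first crawl that drops irrelevant pages at the frontier."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    accepted = []
    while frontier and len(accepted) < limit:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # irrelevant: neither stored nor expanded
        accepted.append(url)
        for link in links:
            if link not in seen and link in WEB:
                seen.add(link)
                # A link inherits the score of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return accepted


crawled = focused_crawl(["seed"])
```

In this toy graph the off-topic page `b` is rejected, so the page `d` behind it is never visited at all.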

3.1. Focused Crawler Architecture

[Architecture diagram. Components: World Wide Web, Crawler, URL queue, Document analyzer, Feature selection, SVM classifier, HITS link analyzer, Taxonomy index. Data passed between them: feature vectors (TF*IDF), topic-specific features, hubs & authorities, training documents. Loop: Training; Iterative (Re-)Training, Feedback]
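The "Hubs & Authorities" output of the link analyzer refers to the HITS algorithm. The following is a minimal sketch of the standard HITS iteration on a small, made-up link graph; the graph, the node names, and the fixed iteration count are assumptions for illustration and say nothing about how BINGO! implements this step.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) scores for a dict mapping node -> list of successors."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages the node links to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical link graph: two hub pages pointing at candidate authorities.
GRAPH = {
    "hub1": ["p1", "p2"],
    "hub2": ["p1", "p2", "p3"],
    "p1": [],
    "p2": [],
    "p3": [],
}
hub_scores, auth_scores = hits(GRAPH)
```

Pages cited by both hubs (`p1`, `p2`) end up with a higher authority score than `p3`, and the hub that cites more authorities gets the higher hub score.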

Meta Data Processing

DLOs do provide
  - Comprehensive meta data describing the publication

DLOs may provide
  - Links to full text sources (e.g., PS, PDF or DOC files)
  - Links to homepages, DL portals, or other Web references

Focused crawler requires
  - Full text documents for feature extraction and training
  - Topic-specific hubs as start pages of the focused crawl

Solution: Additional Deep Web queries using DLO contents

Web Service Framework (WSF)
  - Replace the form submission method of the browser (HTTP POST) with a new WS
      - Detection of attribute labels/names next to form fields
      - Form types determine attribute types (e.g., String, Date, etc.)
  - Generation of a WSDL description for the WS parameters
  - Automatic deployment under a local Tomcat Web Server
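The first WSF step, detecting form fields and deriving typed parameters from them, can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: the sample form, the `TYPE_MAP` from input types to parameter types, and the dictionary returned by `describe_form` are hypothetical, and generating real WSDL or deploying the service is not shown.

```python
from html.parser import HTMLParser

# Assumed mapping from HTML input types to service parameter types.
TYPE_MAP = {"text": "String", "date": "Date", "number": "Integer"}


class FormFieldExtractor(HTMLParser):
    """Collects the action URL and the typed input fields of an HTML form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.params = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            input_type = attrs.get("type", "text")
            if input_type in ("submit", "reset", "button"):
                return  # control elements are not query parameters
            self.params[attrs["name"]] = TYPE_MAP.get(input_type, "String")


def describe_form(html):
    """Return a simple service description derived from the form markup."""
    parser = FormFieldExtractor()
    parser.feed(html)
    return {"endpoint": parser.action, "parameters": parser.params}


# Hypothetical search form, not the markup of any real portal.
SAMPLE_FORM = """
<form action="/search" method="post">
  Author: <input type="text" name="author">
  Year: <input type="number" name="year">
  <input type="submit" value="Search">
</form>
"""
description = describe_form(SAMPLE_FORM)
```

A description of this kind carries the same information a WSDL document would need for the service parameters: one named, typed parameter per form field plus the endpoint to submit to.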

[Figure: the advanced search form of DBLP Trier (HTML tags) is processed by the WSF into a WSDL description registered at the local UDDI]

Ontology-Enabled Form Matching & Portal Selection

Ontology service
  - Incorporates WordNet, Cyc, and personalized domain ontologies
  - Mapping of attribute names to semantic concepts
  - Quantified relationships between concepts
  - Approximate matches between similar attributes in a DLO and WSDL

Portal selection
  - Aggregation of attribute similarities between a DLO and WSDL
  - Ranking for the top matching portals to each meta data query

Unified handling of heterogeneous portals
  - Pre-computed (static) mapping on schema level only
  - Automatic form completion and submission

[Figure: a DLO extracted from Achilles matched against the WSDL for DBLP Trier. Relationship weights: Synonym 1.0, Hypernym 0.8]
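Portal selection by aggregated attribute similarity can be sketched as follows. The weights 1.0 for synonyms and 0.8 for hypernyms are taken from the slide; everything else is an assumption made for illustration: the term pairs in `RELATIONS`, the portal names, and the choice of averaging the best match per DLO attribute as the aggregation function.

```python
# Hypothetical mini-ontology: (term, term) -> relationship weight.
# Weights follow the slide (synonym 1.0, hypernym 0.8); the pairs are made up.
RELATIONS = {
    ("author", "creator"): 1.0,  # treated as synonyms
    ("title", "name"): 0.8,      # treated as hypernym relation
    ("year", "date"): 0.8,       # treated as hypernym relation
}


def similarity(a, b):
    """Similarity of two attribute names: 1.0 if equal, else the ontology weight."""
    if a == b:
        return 1.0
    return max(RELATIONS.get((a, b), 0.0), RELATIONS.get((b, a), 0.0))


def portal_score(dlo_attrs, portal_params):
    """Average, over the DLO attributes, of the best-matching portal parameter."""
    if not dlo_attrs or not portal_params:
        return 0.0
    best = [max(similarity(a, p) for p in portal_params) for a in dlo_attrs]
    return sum(best) / len(best)


def rank_portals(dlo_attrs, portals):
    """Return (portal, score) pairs, best match first."""
    scored = [(name, portal_score(dlo_attrs, params)) for name, params in portals.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


DLO = ["author", "title", "year"]
PORTALS = {
    "portal_a": ["creator", "name", "date"],  # every attribute matches approximately
    "portal_b": ["author", "keywords"],       # only one attribute matches
}
ranking = rank_portals(DLO, PORTALS)
```

Here `portal_a` ranks first because all three DLO attributes find an approximate counterpart, while `portal_b` matches only `author` exactly.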

System Integration

Web Service Infrastructure
  - Two standalone Tomcat Web Servers for WS routing
      - Daffodil at Universität Duisburg
      - Bingo! + WSF at MPII Saarbrücken

Asynchronous coupling
  - Distributed multi-user requests
  - Queuing of recommendation requests at Daffodil (FIFO)
  - Integration of Bingo! into Daffodil’s agent-based structure

Simple communication protocol (via RPCs)
  - 1) Daffodil: Init request with basic crawler parameters
  - 2) Bingo!: Fetch folders along with user id & timestamp
  - 3) Bingo!: Return top recommendations
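The three-step protocol with FIFO queuing can be mimicked in a single process. This is a rough sketch only: the class and function names are hypothetical, plain method calls stand in for the RPCs between the two Tomcat servers, and the "crawl" is replaced by a placeholder that derives example URLs from the folder contents.

```python
from collections import deque
from dataclasses import dataclass, field
import time


@dataclass
class Request:
    user_id: str
    crawler_params: dict
    timestamp: float = field(default_factory=time.time)


class DaffodilStub:
    """Stand-in for the Daffodil side: queues requests and serves user folders."""

    def __init__(self, folders):
        self.folders = folders  # user_id -> list of DLO titles
        self.queue = deque()    # FIFO queue of pending recommendation requests

    def init_request(self, user_id, crawler_params):
        # Step 1: Daffodil issues an init request with basic crawler parameters.
        self.queue.append(Request(user_id, crawler_params))

    def fetch_folder(self, request):
        # Step 2: Bingo! fetches the folder along with user id and timestamp.
        return {
            "user_id": request.user_id,
            "timestamp": request.timestamp,
            "dlos": self.folders[request.user_id],
        }


def bingo_worker(daffodil, default_top_k=2):
    """Stand-in for the Bingo! side: serves queued requests in FIFO order."""
    results = []
    while daffodil.queue:
        request = daffodil.queue.popleft()
        folder = daffodil.fetch_folder(request)
        # Placeholder for the focused crawl: fake URLs built from DLO titles.
        urls = ["http://example.org/" + t.replace(" ", "-") for t in folder["dlos"]]
        top_k = request.crawler_params.get("top_k", default_top_k)
        # Step 3: return the top recommendations for this user.
        results.append((folder["user_id"], urls[:top_k]))
    return results


daffodil = DaffodilStub({
    "alice": ["workflow nets", "process mining", "petri nets"],
    "bob": ["web ir"],
})
daffodil.init_request("alice", {"top_k": 2})
daffodil.init_request("bob", {})
results = bingo_worker(daffodil)
```

Because requests are only appended to and popped from opposite ends of the queue, users are served in the order in which their requests arrived, which is the FIFO behaviour named on the slide.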

[Architecture diagram. DAFFODIL (Distributed Agents for User-Friendly Access of Digital Libraries): connected to DL 1 ... DL k; holds user folders (Core DB, Web IR, Workflow) with DLOs and meta data (Author = ..., Title = ...); contains the BINGO! Agent for Recommendations. BINGO! (Bookmark-Induced Gathering of Information): connected to Web Portal 1 ... Web Portal m; contains the Web Service Framework, the Ontology-Enabled Portal Selector, and the Focused Crawler, fed by training docs and full text sources. Both sides communicate over the Web Service Infrastructure in three steps: 1) init, 2) fetch, 3) return]

6. Experiments: User Folder for “Workflow”

WSF: DBLP Trier, HCIBIB, Achilles, NCSTRL, Springer, Google adv.

First Iteration & Feedback

Second Iteration

Conclusions & Outlook

Coupling of Bingo! and Daffodil provides significant added value for advanced users
  - For Daffodil: Context-based search over the Web
  - For Bingo!: Access to semi-structured meta data and the Deep Web

Future work:
  - More exhaustive experimental evaluation
  - Extensive study of relevance feedback
  - DLO extraction from recommended URLs
  - Personalized crawling as a publicly available service for Daffodil users (dedicated server)

Try it out!