Universität Stuttgart Universitätsbibliothek Information Retrieval on the Grid? Results and suggestions from Project GRACE Werner Stephan Stuttgart University.

Slides:



Advertisements
Similar presentations
1 Ontolog OOR Use Case Review Todd Schneider 1 April 2010 (v 1.2)
Advertisements

The DRIVER Infrastructure (Digital Repository Infrastructure Vision for European Research) Paolo Manghi ISTI - National Research Council, Italy.
Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.
Chapter 2. Slide 1 CULTURAL SUBJECT GATEWAYS CULTURAL SUBJECT GATEWAYS Subject Gateways  Started as links of lists  Continued as Web directories  Culminated.
Text mining Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
The Experience Factory May 2004 Leonardo Vaccaro.
Information Retrieval in Practice
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
GGF Toronto Spitfire A Relational DB Service for the Grid Peter Z. Kunszt European DataGrid Data Management CERN Database Group.
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
NextGRID & OGSA Data Architectures: Example Scenarios Stephen Davey, NeSC, UK ISSGC06 Summer School, Ischia, Italy 12 th July 2006.
DataGrid Kimmo Soikkeli Ilkka Sormunen. What is DataGrid? DataGrid is a project that aims to enable access to geographically distributed computing power.
Chapter 19: Information Retrieval
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
Knowledge Portals and Knowledge Management Tools
Overview of Search Engines
A Comparsion of Databases and Data Warehouses Name: Liliana Livorová Subject: Distributed Data Processing.
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
© 2003, Prentice-Hall Chapter Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Break Out Session on Infrastructure and Technology: A Report Vipul Kashyap AOS Workshop, Rome, 15 November 2001
Interoperability Scenario Producing summary versions of compound multimedia historical documents.
GRACE Project IST EGAAP meeting – Den Haag, 25/11/2004 Giuseppe Sisto – Telecom Italia Lab.
Search Engines and Information Retrieval Chapter 1.
1 The BT Digital Library A case study in intelligent content management Paul Warren
1 Chapter 19: Information Retrieval Chapter 19: Information Retrieval Relevance Ranking Using Terms Relevance Using Hyperlinks Synonyms., Homonyms,
Introduction: Databases and Database Users
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Miguel Branco CERN/University of Southampton Enabling provenance on large-scale e-Science applications.
Knowledge Representation and Indexing Using the Unified Medical Language System Kenneth Baclawski* Joseph “Jay” Cigna* Mieczyslaw M. Kokar* Peter Major.
7. Approaches to Models of Metadata Creation, Storage and Retrieval Metadata Standards and Applications.
ITIS 1210 Introduction to Web-Based Information Systems Chapter 27 How Internet Searching Works.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
ICDL 2004 Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science Old Dominion University.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
The SEE-GRID initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no SE4SEE A Grid-Enabled Search.
LRI Université Paris-Sud ORSAY Nicolas Spyratos Philippe Rigaux.
Indexing Mathematical Abstracts by Metadata and Ontology IMA Workshop, April 26-27, 2004 Su-Shing Chen, University of Florida
Presented by Scientific Annotation Middleware Software infrastructure to support rich scientific records and the processes that produce them Jens Schwidder.
Data Grid Research Group Dept. of Computer Science and Engineering The Ohio State University Columbus, Ohio 43210, USA David Chiu & Gagan Agrawal Enabling.
WEB MINING. In recent years the growth of the World Wide Web exceeded all expectations. Today there are several billions of HTML documents, pictures and.
1 GRID Based Federated Digital Library K. Maly, M. Zubair, V. Chilukamarri, and P. Kothari Department of Computer Science Old Dominion University February,
Presented by Jens Schwidder Tara D. Gibson James D. Myers Computing & Computational Sciences Directorate Oak Ridge National Laboratory Scientific Annotation.
A GRID solution for Gravitational Waves Signal Analysis from Coalescing Binaries: preliminary algorithms and tests F. Acernese 1,2, F. Barone 2,3, R. De.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
Louisa Casely-Hayford e-Science The ISIS Facilities Ontology and OntoMaintainer Louisa Casely-Hayford and Shoaib Sufi.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
JISC/NSF PI Meeting, June Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer.
Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.
MIT Artificial Intelligence Laboratory — Research Directions The START Information Access System Boris Katz
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Data Grid Research Group Dept. of Computer Science and Engineering The Ohio State University Columbus, Ohio 43210, USA David Chiu and Gagan Agrawal Enabling.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
Integrated Departmental Information Service IDIS provides integration in three aspects Integrate relational querying and text retrieval Integrate search.
Developing GRID Applications GRACE Project
September 2003, 7 th EDG Conference, Heidelberg – Roberta Faggian, CERN/IT CERN – European Organization for Nuclear Research The GRACE Project GRid enabled.
The Agricultural Ontology Server (AOS) A Tool for Facilitating Access to Knowledge AGRIS/CARIS and Documentation Group Food and Agriculture Organization.
WP5 – Infrastructure Operations Test and Production Infrastructures StratusLab kick-off meeting June 2010, Orsay, France GRNET.
InSilicoLab – Grid Environment for Supporting Numerical Experiments in Chemistry Joanna Kocot, Daniel Harężlak, Klemens Noga, Mariusz Sterzel, Tomasz Szepieniec.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Regional Operations Centres Core infrastructure Centres
CCNT Lab of Zhejiang University
Information Retrieval
Information Retrieval
Context Interoperability Submission Search Preservation
Data Mining Chapter 6 Search Engines
Information Retrieval and Web Design
Information Retrieval and Web Design
Presentation transcript:

Universität Stuttgart Universitätsbibliothek Information Retrieval on the Grid? Results and suggestions from Project GRACE Werner Stephan Stuttgart University Library IATUL 2005

2 Universität Stuttgart Universitätsbibliothek Project Participants

3 Universität Stuttgart Universitätsbibliothek What is Information Retrieval? Retrieval of unstructured, textual information typically stored in various document formats  Unstructured: does not include, for example, metadata in Dublin Core Simple  Textual: does not include numeric data (like produced by experimental instrumentation in High Energy Physics)  Document formats: does not include database federation

4 Universität Stuttgart Universitätsbibliothek Information Retrieval  Typical approach: indexing  Pre-processing text applying - at least to some extent - natural language processing  Resulting index stored in a format optimized for rapid querying  Exotic approaches:  Post-retrieval processing (typically in meta-search)  Concept indexing (similar to manual keyword annotation, only automatic)  Information retrieval is “text crunching”

5 Universität Stuttgart Universitätsbibliothek What is an ontology? A list, sometimes hierarchical, of agreed upon subject headings Example: Title: Travel in Britain Subject: Tourism--UK Title: Sightseeing around England Subject: Tourism--UK A uniform descriptor for a single concept

6 Universität Stuttgart Universitätsbibliothek What is concept based indexing?  Concept based indexing is similar to manual keyword annotation  Automatic process on paragraph level  Uses an existing ontology

7 Universität Stuttgart Universitätsbibliothek Is Grid a Solution?  “Text crunching” is a heavy computational task  Resulting indices of enormous size  What is grid good at?  Batch pre-processing (e.g., text indexing)  Massive data storage (e.g., indices)  Grid - an ideal solution for pre-processing (indexing) documents for information retrieval

8 Universität Stuttgart Universitätsbibliothek Limitations of the Current Grids  Interaction with end-user  Complex installation and maintenance  Complex certification  Grid job failure rate  Weak monitoring  Primitive distributed data management

9 Universität Stuttgart Universitätsbibliothek Grid in the GRACE Context  Infrastructure for Enabling Grids for E-Science in Europe (EGEE) in Italy  GILDA (Grid INFN Laboratory for Dissemination Activities)  Run by Istituto Nazionale Fisica Nucleare (INFN)  Based on the Large Hadron Collider Computing Grid LCG-2  Grace testbed INFN-Grid  2 nodes  Turin (5 CPUs, 300 GB Storage Space)  Milan (4 CPUs, 250 GB Storage Space)

10 Universität Stuttgart Universitätsbibliothek Content Sources on Grid  Scarce content sources on Grid  Abundant content sources on WWW  Web-based content sources distributed  Meta-search is required to use web-based content sources  In meta-search scenario document texts not available for pre-processing (not accessible without submitting a query)

11 Universität Stuttgart Universitätsbibliothek How to Acquire Content Sources?  Harvesting instead of crawling  Ontology concepts used as queries submitted to multiple content sources  Downloaded document processed  Indices stored on grid  Queries repeatedly submitted to keep information updated

12 Universität Stuttgart Universitätsbibliothek GRACE  High Energy Physics Keyword Index (HEP)  Not a full-fledged ontology  Useful terminology  Targeted content sources:  CERN Document Server (CDS)  Google Scholar

13 Universität Stuttgart Universitätsbibliothek Pre-processing Content  Keyword indexing (like regular search engines)  Concept indexing (using HEP terminology)  Categorization  Extracting additional lexical patterns  Clustering documents accordingly

14 Universität Stuttgart Universitätsbibliothek Concept based indexing in GRACE -- Knowledge Domains The user can choose from a variety of subjects and content sources

15 Universität Stuttgart Universitätsbibliothek User enters a search term Search results are indexed according to the HEP (High Energy Physics) Keyword Index

16 Universität Stuttgart Universitätsbibliothek Automatic Categories in addition to concept based indexing GRACE also creates its own categories based on the content of the resulting documents

17 Universität Stuttgart Universitätsbibliothek  Interoperability layer: single-point access to multiple content sources (similar to database federation)  Unified presentation of information originating from multiple document formats and languages  Structured and concise view of large amounts of information (similar to data warehouse) Concept based indexing plus categorization as a suitable retrieval concept for a grid based application

18 Universität Stuttgart Universitätsbibliothek Benefits  Extends sharing of computational and storage resources to knowledge resources  Allows members with limited resources to join forces in order to build powerful and hand-tailored information retrieval solutions  Brings the collaboration to the level of “thinking together”

19 Universität Stuttgart Universitätsbibliothek Related Initiatives  GGF GIR-WG  Promotes information retrieval as issue for grid  Assumes that content sources are available on grid  IBM “Masala”  IBM DB2 on grid  Similar: distributed data sources  Different: structured data (data warehousing)

Distributed Content Sources Distributed Databases GRACE Search Engine Data WarehouseContent Index & Categorization

21 Universität Stuttgart Universitätsbibliothek Expectations from Future Grids  Improved interactivity with end-user (high hopes in WSRF: grid services to be used just as Web Services are today)  Simple installation and maintenance  Simple certification  Bullet-proof grid jobs  Effective monitoring  Highly efficient distributed data management (storage elements exposed as databases with uniform schema)

22 Universität Stuttgart Universitätsbibliothek Future Research Directions  Maximize use of ontologies for information retrieval on grid  Extend use of ontologies for information presentation  Improve harvesting through use of ontologies  Tighten integration with grid infrastructure  Follow closely advances in standardization  Dependence on the grid infrastructure  Extend integration of GRACE with Replica Manager