2009.04.27 - SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

Slides:

Advertisements

Similar presentations

National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids for Collection Federation Reagan W. Moore University.

Advertisements

A Workflow Engine with Multi-Level Parallelism Supports Qifeng Huang and Yan Huang School of Computer Science Cardiff University

Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc. E-Libraries E204 May, 2003.

Provenance-Aware Storage Systems Margo Seltzer April 29, 2005.

Kensington Oracle Edition: Open Discovery Workflow Meets Oracle 10g Professor Yike Guo.

SLIDE 1FIST Shanghai Digging Into Data: Data Mining for Information Access Ray R. Larson University of California, Berkeley Paul Watry.

ELPUB 2006 June Bansko Bulgaria1 Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Department of Computer.

Grid & Libraries, 10/18/04.1 Second Invitational Berkeley – Academia Sinica Grid Digital Libraries Workshop, Taipei, October 18, 2004 Grid Middleware Application.

1 Archiving Workflow between a Local Repository and the National Library Archive Experiences from the DiVA Project Eva Müller, Peter Hansson, Uwe Klosa,

Information Retrieval in Practice

Search Engines and Information Retrieval

Robust Tools for Archiving and Preserving Digital Data Joseph JaJa, Mike Smorul, and Mike McGann Institute for Advanced Computer Studies Department of.

PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.

SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

1 BrainWave Biosolutions Limited Accelerating Life Science Research through Technology.

SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

PAWN: A Novel Ingestion Workflow Technology for Digital Preservation Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall.

SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday.

Overview of Search Engines

Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.

SLIDE 1ISGC Taipei, Taiwan Grid-based Search and Data Mining Using Cheshire3 In collaboration with Robert Sanderson University of Liverpool.

Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.

CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.

Search Engines and Information Retrieval Chapter 1.

SLIDE 1IS 240 – Spring 2013 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.

Using SRB and iRODS with the Cheshire3 Information Framework Building Data Grids with iRODS May, 2008 National e-Science Centre Edinburgh Dr Robert.

MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.

A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.

Introduction to Apache OODT Yang Li Mar 9, What is OODT Object Oriented Data Technology Science data management Archiving Systems that span scientific.

Indo-US Workshop, June23-25, 2003 Building Digital Libraries for Communities using Kepler Framework M. Zubair Old Dominion University.

1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.

Design of a Search Engine for Metadata Search Based on Metalogy Ing-Xiang Chen, Che-Min Chen,and Cheng-Zen Yang Dept. of Computer Engineering and Science.

SLIDE 1DID Meeting - Montreal Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California,

1 CS 502: Computing Methods for Digital Libraries Lecture 19 Interoperability Z39.50.

ICDL 2004 Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science Old Dominion University.

Grid Architecture William E. Johnston Lawrence Berkeley National Lab and NASA Ames Research Center (These slides are available at grid.lbl.gov/~wej/Grids)

1 Computing Challenges for the Square Kilometre Array Mathai Joseph & Harrick Vin Tata Research Development & Design Centre Pune, India CHEP Mumbai 16.

NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.

Indexing Mathematical Abstracts by Metadata and Ontology IMA Workshop, April 26-27, 2004 Su-Shing Chen, University of Florida

Presented by Scientific Annotation Middleware Software infrastructure to support rich scientific records and the processes that produce them Jens Schwidder.

1 GRID Based Federated Digital Library K. Maly, M. Zubair, V. Chilukamarri, and P. Kothari Department of Computer Science Old Dominion University February,

Presented by Jens Schwidder Tara D. Gibson James D. Myers Computing & Computational Sciences Directorate Oak Ridge National Laboratory Scientific Annotation.

CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.

SLIDE 1INFOSCALE Hong Kong Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Paul Watry Richard Marciano.

Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.

Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.

Commission on Cyberinfrastructure for the Humanities and Social Sciences Metadata as Infrastructure, Interoperability, and the Larger Context Michael Buckland,

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

June 3-6, 2003E-Society Lisbon Automatic Metadata Discovery from Non-cooperative Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science.

Developing GRID Applications GRACE Project

SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

September 2003, 7 th EDG Conference, Heidelberg – Roberta Faggian, CERN/IT CERN – European Organization for Nuclear Research The GRACE Project GRid enabled.

5/29/2001Y. D. Wu & M. Liu1 Content Management for Digital Library May 29, 2001.

Grid Services for Digital Archive Tao-Sheng Chen Academia Sinica Computing Centre

SLIDE 1NaCTeM Launch -Manchester National Center for Text Mining Launch Event Ray R. Larson University of California, Berkeley School of Information.

Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,

Information Retrieval in Practice

Joseph JaJa, Mike Smorul, and Sangchul Song

Dept. of Computer Science University of Liverpool

Data Mining Chapter 6 Search Engines

Introduction to Information Retrieval

Storing and Accessing G-OnRamp’s Assembly Hubs outside of Galaxy

Chaitali Gupta, Madhusudhan Govindaraju

L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher

Presentation transcript:

SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 21: Grid-Based IR

SLIDE 2IS 240 – Spring 2009 Mini-TREC Proposed Schedule –February 16 - Database and previous Queries –March 2 - report on system acquisition and setup –March 9 - New Queries for testing –April 21 - Results due (let me know where your result files are located) –April 27 - Evaluation results and system rankings returned –May 11 - Group reports and discussion

SLIDE 3IS 240 – Spring 2009 All Minitrec Runs

SLIDE 4IS 240 – Spring 2009 All Groups – Best Runs

SLIDE 5IS 240 – Spring 2009 All Groups – Best Runs + RRL

SLIDE 6IS 240 – Spring 2009 Results Data trec_eval runs for each submitted file have been put into a new directory called RESULTS in your group directories The trec_eval parameters used for these runs are “-o” for the “.res” files and “-o –q” for the “.resq” files. The “.dat” files contain the recall level and precision values used for the preceding plots The qrels for the Mini-TREC queries are available now in the /projects/i240 directory as “MINI_TREC_QRELS”

SLIDE 7IS 240 – Spring 2009 Mini-TREC Reports In-Class Presentations May 8 th Written report due May 8 th (Last day of Class) – 4-5 pages Content –System description –What approach/modifications were taken? –results of official submissions (see RESULTS) –results of “post-runs” – new runs with results using MINI_TREC_QRELS and trec_eval

SLIDE 8IS 240 – Spring 2009 Term Paper Should be about 8-15 pages on: – some area of IR research (or practice) that you are interested in and want to study further –Experimental tests of systems or IR algorithms –Build an IR system, test it, and describe the system and its performance Due May 8 th (Last day of class)

SLIDE 9IS 240 – Spring 2009 Today Review –Web Search Engines Web Search Processing –Cheshire III Design Credit for some of the slides in this lecture goes to Eric Brewer

SLIDE 10IS 240 – Spring 2009

SLIDE 11IS 240 – Spring 2009

SLIDE 12IS 240 – Spring 2009

SLIDE 13IS 240 – Spring 2009

SLIDE 14IS 240 – Spring 2009

SLIDE 15IS 240 – Spring 2009 Grid-based Search and Data Mining Using Cheshire3 In collaboration with Robert Sanderson University of Liverpool Department of Computer Science Presented by Ray R. Larson University of California, Berkeley School of Information

SLIDE 16IS 240 – Spring 2009 Overview The Grid, Text Mining and Digital Libraries –Grid Architecture –Grid IR Issues Cheshire3: Bringing Search to Grid-Based Digital Libraries –Overview –Grid Experiments –Cheshire3 Architecture –Distributed Workflows

SLIDE 17IS 240 – Spring 2009 Grid middleware Chemical Engineering Applications Application Toolkits Grid Services Grid Fabric Climate Data Grid Remote Computing Remote Visualization Collaboratories High energy physics Cosmology Astrophysics Combustion.…. Portals Remote sensors..… Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Storage, networks, computers, display devices, etc. and their associated local services Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)

SLIDE 18IS 240 – Spring 2009 Chemical Engineering Applications Application Toolkits Grid Services Grid Fabric Grid middleware Climate Data Grid Remote Computing Remote Visualization Collaboratories High energy physics Cosmology Astrophysics Combustion Humanities computing Digital Libraries … Portals Remote sensors Text Mining Metadata management Search & Retrieval … Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Storage, networks, computers, display devices, etc. and their associated local services Grid Architecture (ECAI/AS Grid Digital Library Workshop) Bio-Medical

SLIDE 19IS 240 – Spring 2009 Grid-Based Digital Libraries Large-scale distributed storage requirements and technologies Organizing distributed digital collections Shared Metadata – standards and requirements Managing distributed digital collections Security and access control Collection Replication and backup Distributed Information Retrieval issues and algorithms

SLIDE 20IS 240 – Spring 2009 Grid IR Issues Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (I.e. speed) Very large-scale distribution of resources is a challenge for sub-second retrieval Different from most other typical Grid processes, IR is potentially less computing intensive and more data intensive In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

SLIDE 21IS 240 – Spring 2009 Introduction Cheshire History: –Developed at UC Berkeley originally –Solution for library data (C1), then SGML (C2), then XML –Monolithic applications for indexing and retrieval server in C + TCL scripting Cheshire3: –Developed at Liverpool, plus Berkeley –XML, Unicode, Grid scalable: Standards based –Object Oriented Framework –Easy to develop and extend in Python

SLIDE 22IS 240 – Spring 2009 Introduction Today: –Version –Mostly stable, but needs thorough QA and docs –Grid, NLP and Classification algorithms integrated Near Future: –June: Version 1.0 Further DM/TM integration, docs, unit tests, stability –December: Version 1.1 Grid out-of-the-box, configuration GUI

SLIDE 23IS 240 – Spring 2009 Context Environmental Requirements: –Very Large scale information systems Terabyte scale (Data Grid) Computationally expensive processes (Comp. Grid) Digital Preservation Analysis of data, not just retrieval (Data/Text Mining) Ease of Extensibility, Customizability (Python) Open Source Integrate not Re-implement "Web 2.0" – interactivity and dynamic interfaces

SLIDE 24IS 240 – Spring 2009 Context Data Grid Layer Data Grid SRB iRODS Digital Library Layer Application Layer Web Browser Multivalent Dedicated Client User Interface Apache+ Mod_Python+ Cheshire3 Protocol Handler Process Management Kepler Cheshire3 Query Results Query Results ExportParse Document Parsers Multivalent,... Natural Language Processing Information Extraction Text Mining Tools Tsujii Labs,... Classification Clustering Data Mining Tools Orange, Weka,... Query Results Search / Retrieve Index / Store Information System Cheshire3 User Interface MySRB PAWN Process Management Kepler iRODS rules Term Management Termine WordNet... Store

SLIDE 25IS 240 – Spring 2009 Cheshire3 Object Model UserStore User ConfigStore Object Database Query Record Transformer Records Protocol Handler Normaliser IndexStore Terms Server Document Group Ingest Process Documents Index RecordStore Parser Document Query ResultSet DocumentStore Document PreParser Extracter

SLIDE 26IS 240 – Spring 2009 Object Configuration One XML 'record' per non-data object Very simple base schema, with extensions as needed Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases) Allows workflows to reference by identifier but act appropriately within different contexts. Allows multiple administrators to define objects without reference to each other

SLIDE 27IS 240 – Spring 2009 Grid Focus on ingest, not discovery (yet) Instantiate architecture on every node Assign one node as master, rest as slaves. Master then divides the processing as appropriate. Calls between slaves possible Calls as small, simple as possible: (objectIdentifier, functionName, *arguments) Typically: ('workflow-id', 'process', 'document-id')

SLIDE 28IS 240 – Spring 2009 Grid Architecture Master Task Slave Task 1 Slave Task N Data Grid GPFS Temporary Storage (workflow, process, document) fetch document document extracted data

SLIDE 29IS 240 – Spring 2009 Grid Architecture - Phase 2 Master Task Slave Task 1 Slave Task N Data Grid GPFS Temporary Storage (index, load) store index fetch extracted data

SLIDE 30IS 240 – Spring 2009 Workflow Objects Written as XML within the configuration record. Rewrites and compiles to Python code on object instantiation Current instructions: –object –assign –fork –for-each –break/continue –try/except/raise –return –log (= send text to default logger object) Yes, no if!

SLIDE 31IS 240 – Spring 2009 Workflow example workflow.SimpleWorkflow Unparsable Record ”Loaded Record:” + input.id

SLIDE 32IS 240 – Spring 2009 Text Mining Integration of Natural Language Processing tools Including: –Part of Speech taggers (noun, verb, adjective,...) –Phrase Extraction –Deep Parsing (subject, verb, object, preposition,...) –Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi) Planned: Information Extraction tools

SLIDE 33IS 240 – Spring 2009 Data Mining Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes Focus on automatic classification for predefined categories rather than clustering Algorithms integrated/implemented: –Perceptron, Neural Network (pure python) –Naïve Bayes (pure python) –SVM (libsvm integrated with python wrapper) –Classification Association Rule Mining (Java)

SLIDE 34IS 240 – Spring 2009 Data Mining Modelled as multi-stage PreParser object (training phase, prediction phase) Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM) Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore Document vectors generated per index per document, so integrated NLP document normalization for free

SLIDE 35IS 240 – Spring 2009 Data Mining + Text Mining Testing integrated environment with 500,000 medline abstracts, using various NLP tools, classification algorithms, and evaluation strategies. Computational grid for distributing expensive NLP analysis Results show better accuracy with fewer attributes:

SLIDE 36IS 240 – Spring 2009 Applications (1) Automated Collection Strength Analysis Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries. The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records. This involved very large scale processing of records to: –Deduplicate millions of records –Enrich deduplicated records against database of 45 million –Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

SLIDE 37IS 240 – Spring 2009 Applications (1) Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems. The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining

SLIDE 38IS 240 – Spring 2009 Applications (2) Assessing the Grade Level of NSDL Education Material The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid. Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL. We determined the vocabulary-based grade-level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier). This processing was done on the Teragrid cluster at SDSC.

SLIDE 39IS 240 – Spring 2009 Cheshire3 Grid Tests Running on an 30 processor cluster in Liverpool using PVM (parallel virtual machine) Using 16 processors with one “master” and 22 “slave” processes we were able to parse and index MARC data at about records per second On a similar setup 610 Mb of TEI data can be parsed and indexed in seconds

SLIDE 40IS 240 – Spring 2009 SRB and SDSC Experiments We worked with SDSC to include SRB support We are planning to continue working with SDSC and to run further evaluations using the TeraGrid server(s) through a “small” grant for CPU hours – SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GBs) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network. Planned large-scale test collections include NSDL, the NARA repository, CiteSeer and the “million books” collections of the Internet Archive

SLIDE 41IS 240 – Spring 2009 Conclusions Scalable Grid-Based digital library services can be created and provide support for very large collections with improved efficiency The Cheshire3 IR and DL architecture can provide Grid (or single processor) services for next-generation DLs Available as open source via: