2012.06.12 FIST 2012 - Shanghai. Digging Into Data: Data Mining for Information Access. Ray R. Larson, University of California, Berkeley; Paul Watry.

Presentation transcript:

SLIDE 1 FIST 2012 - Shanghai. Digging Into Data: Data Mining for Information Access. Ray R. Larson, University of California, Berkeley; Paul Watry, University of Liverpool; Richard Marciano, University of North Carolina, Chapel Hill

SLIDE 2 The idea behind the Digging into Data Challenge is to address how "big data" changes the research landscape for the humanities and social sciences. Second round of an international (US, Canada, UK, Netherlands) collaboration of funders:
–Requires each project to represent at least two countries
–Big data (but small funding)
–Many contributed data sources available
Report on DID 1: One Culture: Computationally Intensive Research in the Humanities and Social Sciences (CLIR)

SLIDE 3 Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Goals:
–Text mining and NLP techniques to extract content (named persons, places, time periods/events) and associate context
Data:
–Internet Archive Books Collection (with associated MARC records where available) ~1.2 TB
–JSTOR ~1 TB
–Context sources: SNAC archival and library authority records
Tools:
–Cheshire3 – DL search and retrieval framework
–iRODS – policy-driven distributed data storage
–CITRIS/IBM cluster ~400 cores
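The extraction goal above (pulling named persons, places, and time references out of raw text) can be sketched with a toy gazetteer-and-regex tagger. This is only an illustration of the input/output shape; the project's actual pipeline uses trained NLP taggers (see slide 24), and the gazetteer entries here are invented:

```python
import re

# Toy gazetteer: a real system would use authority files (e.g. SNAC).
PLACE_GAZETTEER = {"Shanghai", "Berkeley", "Liverpool"}

def extract_entities(text):
    """Return candidate (type, surface-form) pairs from raw text."""
    entities = []
    # Runs of capitalized tokens are candidate proper names.
    for match in re.finditer(r"(?:[A-Z][a-z]+ ?)+", text):
        name = match.group().strip()
        kind = "PLACE" if name in PLACE_GAZETTEER else "PERSON?"
        entities.append((kind, name))
    # Four-digit years stand in for time periods/events.
    for year in re.findall(r"\b(1[5-9]\d\d|20\d\d)\b", text):
        entities.append(("DATE", year))
    return entities

print(extract_entities("Ray Larson spoke in Shanghai in 2012."))
```

A real tagger would also disambiguate (is "Berkeley" a place or a person?) by linking each candidate to authority records, which is exactly the "associate context" step the slide names.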

SLIDE 4 Overview
–Digging Into Data overview
–The Grid and Digital Libraries
–Cheshire3: overview, Cheshire3 architecture, distributed workflows, DataGrid experiments

SLIDE 5 Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan) [layer diagram]
–Applications: chemical engineering, climate, high energy physics, cosmology, astrophysics, combustion, …
–Application Toolkits: Data Grid, remote computing, remote visualization, collaboratories, portals, remote sensors, …
–Grid Services (middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
–Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services

SLIDE 6 But… what about… applications and data that are NOT for scientific research? Things like:
–Humanities?
–Social Sciences?

SLIDE 7 Grid Architecture (ECAI/AS Grid Digital Library Workshop) [extended layer diagram]
–Applications: chemical engineering, climate, high energy physics, cosmology, astrophysics, combustion, bio-medical, humanities computing, digital libraries, …
–Application Toolkits: Data Grid, remote computing, remote visualization, collaboratories, portals, remote sensors, text mining, metadata management, search & retrieval, …
–Grid Services (middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
–Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services

SLIDE 8 Grid-Based Digital Libraries: Needs
–Large-scale distributed storage requirements and technologies
–Organizing distributed digital collections
–Shared metadata – standards and requirements
–Managing distributed digital collections
–Security and access control
–Collection replication and backup
–Distributed information retrieval support and algorithms

SLIDE 9 But… haven't Hadoop and its menagerie already solved everything?
–Yes – many tasks can now be done with great scale-up
–And no – most Hadoop solutions are batch-oriented and geared more towards summarization than information access
–Maybe – we are looking at replacing or supplementing the low-level data management with Hadoop tools

SLIDE 10 Grid/Cloud IR Issues
–We want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
–Very large-scale distribution of resources is (still) a challenge for sub-second retrieval
–Unlike most typical Grid/Cloud processes, IR is potentially less computing-intensive and more data-intensive
–In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
–We have developed the Cheshire3 system to evaluate and manage these issues; Cheshire3 is actually one component in a larger Grid-based environment
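Precision and recall, the retrieval-performance measures this slide wants to preserve, are simple set ratios; a minimal sketch with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list vs. judged-relevant set:
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # 0.5 (2 of 4 retrieved) and ~0.667 (2 of 3 relevant found)
```

The point of the slide is that distributing the index must not change these numbers, only the wall-clock time to produce the result set.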

SLIDE 11 Cheshire3 Environment [architecture diagram: Cheshire3 over SRB or iRODS storage]

SLIDE 12 Cheshire3 Environment: iRODS
–iRODS: integrated Rule-Oriented Data System
–DataGrid distributed storage system for storing large amounts of data
–Originally developed at the San Diego Supercomputer Center; now an open source platform with work at DICE (UNC)
Advantages:
–Rule-based storage policy management, including replication
–Storage resource abstraction: logical identifiers vs. 'physical' identifiers
–Mountable as a filesystem
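The logical-vs-physical identifier split described above can be sketched as a small resolver. The class, method names, and URIs below are illustrative assumptions, not iRODS's actual API:

```python
class LogicalNamespace:
    """Map stable logical identifiers to replaceable physical replicas."""

    def __init__(self):
        self._replicas = {}  # logical id -> list of physical URLs

    def register(self, logical_id, physical_url):
        self._replicas.setdefault(logical_id, []).append(physical_url)

    def resolve(self, logical_id):
        # Callers only ever see the logical id; storage can move or be
        # replicated underneath without breaking any references.
        replicas = self._replicas.get(logical_id)
        if not replicas:
            raise KeyError(logical_id)
        return replicas[0]

ns = LogicalNamespace()
ns.register("/archive/books/b1", "srb://sdsc.edu/vault7/0001.dat")
ns.register("/archive/books/b1", "irods://unc.edu/mirror/0001.dat")
print(ns.resolve("/archive/books/b1"))  # first available replica
```

This indirection is what makes the rule-based replication on the slide safe: a policy can add or retire physical copies while every catalog record keeps pointing at the same logical name.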

SLIDE 13 Cheshire3 Environment: Kepler/Ptolemy
–Workflow processing environment developed at UC Berkeley (Ptolemy) and SDSC (Kepler), plus others including LLNL, UCSD and the University of Zurich
–Director/Actor model: actors perform tasks together as directed
–Workflow environments such as Kepler are designed to let researchers design and execute flexible processing sequences for complex data analysis
–They provide a graphical user interface that allows users at any level, from a variety of disciplines, to design these workflows in a drag-and-drop manner
–This provides a platform that can integrate text mining techniques and methodologies, either as part of an internal Cheshire3 workflow or as an external workflow configured using Kepler
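The director/actor model the slide mentions can be sketched in a few lines: actors are interchangeable processing steps and a director decides when each one fires. The class names are illustrative Python stand-ins, not Kepler's actual (Java) API:

```python
class Actor:
    """An actor is a reusable processing step: token in, token out."""

    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def fire(self, token):
        return self.fn(token)

class SequentialDirector:
    """A director schedules actors; this one fires them in a fixed
    sequence, loosely like Kepler's synchronous dataflow director."""

    def __init__(self, actors):
        self.actors = actors

    def run(self, token):
        for actor in self.actors:
            token = actor.fire(token)
        return token

pipeline = SequentialDirector([
    Actor("tokenize", lambda text: text.split()),
    Actor("lowercase", lambda toks: [t.lower() for t in toks]),
    Actor("count", len),
])
print(pipeline.run("Digging Into Data"))  # 3
```

Because scheduling lives in the director, the same actors can be re-wired (or re-scheduled in parallel) without touching their code, which is what makes the drag-and-drop composition on the slide possible.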

SLIDE 14 C3 Major Use Cases
–NaCTeM: the Cheshire system is being used in the UK National Centre for Text Mining as a primary means of integrating information retrieval systems with text mining and data analysis systems
–NARA: a prototype which demonstrated use of the Cheshire3 environment for indexing and retrieval in a preservation environment; included a web crawl of all information related to the Columbia shuttle disaster
–NSDL Analysis: 200 GB of web-crawled data from the NSDL (National Science Digital Library), analyzing each document for grade level based on vocabulary; we are using LSI and cluster analysis to categorize the crawled documents
–CURL Data: millions of records of library bibliographic data from major research libraries in the UK
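Grade-level scoring of the kind used in the NSDL analysis is commonly approximated with a readability formula. A sketch using the standard Flesch-Kincaid grade-level equation (the slide does not say which measure the project used, and the syllable counter here is a rough heuristic):

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
       0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words)) - 15.59)

print(round(flesch_kincaid_grade("The cat sat on the mat."), 2))
```

Scores below zero simply mean "readable before first grade"; longer sentences and longer words push the score up toward higher grade levels.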

SLIDE 15 Cheshire Digital Library System
–Cheshire was originally created at UC Berkeley and more recently co-developed at the University of Liverpool
–The system is widely used in the United Kingdom for production digital library services, including: Archives Hub, JISC Information Environment Service Registry, Resource Discovery Network, British Library ISTC service
–The system has recently gone through a complete redesign into its current incarnation, Cheshire3, enabling Grid-based IR over the Data Grid

SLIDE 16 Cheshire3 IR Overview
XML Information Retrieval Engine:
–3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool
–Uses Python for flexibility and extensibility, with C/C++-based libraries for processing speed
–Standards-based: XML, XSLT, CQL, SRW/U, Z39.50, OAI, to name a few
–Grid-capable: uses distributed configuration files, workflow definitions, and PVM or MPI to scale from one machine to thousands of parallel nodes
–Free and open source software

SLIDE 17 Cheshire3 Object Model [diagram]

SLIDE 18 Cheshire3 Object Model [diagram; components include: Server, Database, Protocol Handler, Query, ResultSet, Document Group, Ingest Process, Document, PreParser, Parser, Extracter, Normaliser, Record, Transformer, Index, Terms, and the DocumentStore, RecordStore, IndexStore, ConfigStore, UserStore and User objects]

SLIDE 19 Object Configuration
–Each non-data object has an XML configuration, with a common base schema and extensions as needed
–Configurations can be treated as Records: store them in regular RecordStores and access/distribute them via regular IR protocols (requires a 'bootstrap' to find the configuration for the configStore)
–Each object has a 'pseudo-unique' identifier: unique within the current context (server, database, etc.); identifiers can be re-applied at a lower level
–Workflows are objects in all of the above ways
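A minimal sketch of the "configuration as record" idea: an object's XML config is parsed into its identifier plus settings, exactly as if it had been fetched from a RecordStore. The XML shape below is invented for illustration; Cheshire3's actual configuration schema differs:

```python
import xml.etree.ElementTree as ET

# Hypothetical object configuration, stored like any other record.
CONFIG_XML = """
<config id="eadParser" type="parser">
  <option name="validate">true</option>
  <option name="namespace">urn:isbn:1-931666-22-9</option>
</config>
"""

def load_config(record_xml):
    """Parse a config 'record' into (identifier, type, settings), so the
    object can be instantiated after retrieval over any IR protocol."""
    root = ET.fromstring(record_xml)
    settings = {opt.get("name"): opt.text for opt in root.iter("option")}
    return root.get("id"), root.get("type"), settings

ident, kind, settings = load_config(CONFIG_XML)
print(ident, kind, settings["validate"])
```

Storing configurations this way means the same search, replication, and distribution machinery that manages data records also manages the system's own wiring.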

SLIDE 20 Cheshire3 Workflows
–Cheshire3 workflows use a simple, nonstandard XML definition; this is intentional, as the workflows are specific to (and dependent on) the Cheshire3 architecture
–They replace the lines of boring code required for every new database and, most importantly, the lines of code needed for distributed processing
–They need to be easy to understand and easy to create
How do workflows help us in massively parallel processing?

SLIDE 21 Distributed Processing
–Each node in the cluster instantiates the configured architecture, potentially through a single ConfigStore
–Master nodes then run a high-level workflow to distribute the processing amongst slave nodes by reference to a subsidiary workflow
–As object interaction is well defined in the model, the result of a workflow is equally well defined; this allows easy chaining of workflows, either locally or spread throughout the cluster
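The master/slave pattern on this slide (a master workflow scattering work to a subsidiary workflow on each node, then gathering the well-defined results) can be sketched with Python's multiprocessing in place of PVM/MPI. The workflow bodies are illustrative stand-ins:

```python
from multiprocessing import Pool

def subsidiary_workflow(path):
    """Per-node workflow: ingest one document, return its result.
    (A stand-in that just 'indexes' the path name's length.)"""
    return (path, len(path))

def master_workflow(paths, nodes=4):
    """High-level workflow: scatter paths to workers, gather results.
    Chaining is trivial because every workflow's output is well defined."""
    with Pool(nodes) as pool:
        return pool.map(subsidiary_workflow, paths)

if __name__ == "__main__":
    results = master_workflow([f"/data/doc{i}.xml" for i in range(8)])
    print(results[0])  # ('/data/doc0.xml', 14)
```

Because the subsidiary workflow's output type is fixed, the master can feed the gathered list straight into a follow-on workflow (e.g. the sort/load phase on slide 25) without any per-database glue code.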

SLIDE 22 TeraGrid Experiments
–We worked with SDSC to run evaluations using the TeraGrid, through two "small" grants of CPU hours
–SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel Itanium 2 processors, for a peak performance of 3.1 teraflops; the nodes have four gigabytes (GB) of physical memory each, run SuSE Linux, and use Myricom's Myrinet cluster interconnect network
–Large-scale test collections now include MEDLINE, NSDL, the NARA preservation prototype, and the CURL bibliographic data; we hope to use CiteSeer and the "million books" collections of the Internet Archive
–Using 100 machines, we processed 1.8 million MEDLINE records at a sustained rate of 15,385 per second; with all 256 machines, taking into account additional process-management overhead, we could index the entire 16-million-record collection in around 7 minutes
–Using 32 machines, we processed 16 million bibliographic records at a rate of 35,700 records per second; this equates to real-time searching of the Library of Congress
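The scaling claims above are easy to check with back-of-the-envelope arithmetic; this sketch assumes near-linear scale-up from the measured 100-node rate, which is why the slide's "around 7 minutes" allows for extra process-management overhead:

```python
# Measured: 100 nodes indexed MEDLINE at ~15,385 records/second.
rate_100 = 15_385
records = 16_000_000

# Naive linear extrapolation to all 256 nodes:
rate_256 = rate_100 * 256 / 100           # ~39,386 records/second
minutes_256 = records / rate_256 / 60     # ~6.8 minutes, before overhead
print(round(minutes_256, 1))              # 6.8

# Separate run: 32 nodes over bibliographic records.
biblio_rate = 35_700
minutes_32 = records / biblio_rate / 60   # ~7.5 minutes for 16M records
print(round(minutes_32, 1))               # 7.5
```

Both estimates land near the quoted 7-minute figure, so the numbers on the slide are internally consistent.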

SLIDE 23 TeraGrid Indexing [diagram: Master1 reads file paths from iRODS (JSTOR collection) and distributes File Path1 … File PathN to Slave1 … SlaveN; slaves fetch Object1 … ObjectN and write Extracted Data1 … Extracted DataN to GPFS temp storage]

SLIDE 24 TeraGrid Indexing: Slave [diagram: each slave pulls objects from iRODS (JSTOR) and runs a document pipeline – MVD document parser, XML parser, data cleaning, XPath extraction, etc. – followed by an NLP tagger with a proximity noun/verb filter and phrase detection, writing results to GPFS temp storage]

SLIDE 25 TeraGrid Indexing 2 [diagram: Master1 issues sort/load requests; the slaves' extracted data on GPFS temp storage is merged and loaded]

SLIDE 26 Search Phase [diagram: a web interface at Berkeley sends SRW search requests against index sections held at Liverpool & UNC; SRB URIs are returned to the Multivalent browser, with the underlying data in iRODS (JSTOR)] In order to locate matching records, the web interface retrieves the relevant chunks of index from the SRB on demand.
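Fetching only the relevant chunk of a remote index, as described above, can be sketched as a two-step lookup: resolve the term to one chunk's storage URI, then fetch just that chunk. The chunking-by-prefix scheme and URIs are illustrative assumptions, not Cheshire3's actual index layout:

```python
# Hypothetical: a term index split into chunks by first letter, each
# chunk addressable by a (logical) storage URI.
CHUNK_URIS = {
    "a-f": "srb://store/idx/chunk0",
    "g-m": "srb://store/idx/chunk1",
    "n-z": "srb://store/idx/chunk2",
}

# Stand-in for remote storage: chunk contents keyed by URI.
REMOTE = {"srb://store/idx/chunk1": {"grid": [3, 17, 42]}}

def chunk_for(term):
    """Resolve a term to the URI of the one chunk that could hold it."""
    for span, uri in CHUNK_URIS.items():
        lo, hi = span.split("-")
        if lo <= term[0] <= hi:
            return uri
    raise KeyError(term)

def search(term):
    """Fetch only the needed chunk, then look up the term's postings."""
    uri = chunk_for(term)
    chunk = REMOTE.get(uri, {})  # in reality: an on-demand SRB/iRODS read
    return chunk.get(term, [])

print(search("grid"))  # [3, 17, 42]
```

The payoff is that a query touches one small chunk over the network instead of pulling a multi-gigabyte index to the web interface.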

SLIDE 27 Search Phase 2 [diagram: the Multivalent browser resolves the SRB URI of an object and retrieves the original object from iRODS (JSTOR)]

SLIDE 28 Summary
–Indexing and IR work very well in the Grid environment, with the expected scaling behavior for multiple processes
Still in progress:
–We are collecting the complete (English) books collection from the Internet Archive
–We are extracting place names, personal names and corporate names, and linking them with reference sources (such as LoC Name Authorities, VIAF and SNAC)

SLIDE 29 Thank you! Available via