NaCTeM Launch - Manchester (2005.03.21): National Center for Text Mining Launch Event. Ray R. Larson, University of California, Berkeley, School of Information.

Presentation transcript:

SLIDE 1: NaCTeM Launch - Manchester
National Center for Text Mining Launch Event
Ray R. Larson, University of California, Berkeley, School of Information Management and Systems
Thanks to Dr. Eric Yen, Prof. Michael Buckland, Dr. Rob Sanderson and Prof. Marti Hearst for parts of this presentation.

SLIDE 2: Overview
– The Grid, Text Mining and Digital Libraries
– Cheshire3: Bringing Search and Text Mining to Grid-Based Digital Libraries
– Other Related Berkeley work: The BioText Project

SLIDE 3: Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
[Layered diagram, top to bottom]
– Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, ...
– Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, ...
– Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
– Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

SLIDE 4: Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Layered diagram, top to bottom]
– Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, Humanities computing, Digital Libraries, Bio-Medical, ...
– Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, ...
– Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
– Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

SLIDE 5: Grid-Based Digital Libraries
– Large-scale distributed storage requirements and technologies
– Distributed Information Retrieval issues and algorithms
– Organizing distributed digital collections
– Shared Metadata: standards and requirements
– Managing distributed digital collections
– Security and access control
– Collection replication and backup

SLIDE 6: Cheshire3 Overview
XML Information Retrieval Engine
– 3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool.
– Uses Python for flexibility and extensibility, but imports C/C++ based libraries for processing speed.
– Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI, to name a few.
– Grid capable: uses distributed configuration files, workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes.
– Free and Open Source Software (GPL Licence).
– (under development!)
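To make the "standards based" point concrete, here is a minimal sketch of how a client might send a CQL query over SRU (the URL-based form of SRW/U). The parameter names (`operation`, `version`, `query`, `maximumRecords`) come from the SRU 1.1 specification; the endpoint URL and the helper name `build_sru_url` are assumptions made for illustration and are not taken from the slides or from Cheshire3 itself.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL that carries a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,                 # CQL, e.g. dc.title = "text mining"
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint: any SRU-capable server would be addressed the same way.
url = build_sru_url("http://localhost:8080/sru/demo", 'dc.title = "text mining"')
print(url)
```

Because the query travels as an ordinary URL parameter, the same request could in principle be issued against any SRU-compliant server, which is the interoperability the slide is pointing at.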

SLIDE 7: Cheshire3 Server Overview
[Architecture diagram of the Cheshire III server. Recoverable labels: API; Indexing; Search; Scan; Clustering; Normalization; Transforms; Protocol Handler; DB API; Network; Result Sets; User Info; Access Info; Authentication; Config & Control; XML Config & Metadata Info; Indexes; Local DB; Staff UI; Apache Interface; Server Control; Remote Systems (any protocol); User/Clients. Protocols and interfaces shown: native calls, Z39.50, SOAP, OAI, JDBC, SRW, OpenURL, UDDI, WSRP, OGIS, Fetch ID, Put ID.]

SLIDE 8: Cheshire Additions for NaCTeM

SLIDE 9: Cheshire3 Object Model
[Diagram relating the following objects: UserStore, Index, PreParser, ProtocolHandler, DocumentGroup, Document, Extracter, Normaliser, Database, Server, Transformer, Parser, Record, IndexStore, RecordStore]

SLIDE 10: Cheshire3 Data Objects
DocumentGroup:
– A collection of Document objects (e.g. from a file, directory, or external search)
Document:
– A single item, in any format (e.g. PDF file, raw XML string, relational table)
Record:
– A single item, represented as parsed XML
Query:
– A search query, in the form of CQL (an abstract query language for Information Retrieval)
ResultSet:
– An ordered list of pointers to records
Index:
– An ordered list of terms extracted from Records
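The data objects above can be sketched as a handful of small Python classes. This is an illustrative toy, not Cheshire3's actual class hierarchy: the class and attribute names (`Document.data`, `Record.tree`, `Index.postings`, and so on) are assumptions, and a ResultSet is reduced to a plain list of record identifiers.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Document:
    """A single item in any format; here simply a raw string."""
    data: str
    mime_type: str = "text/xml"


@dataclass
class Record:
    """A single item represented as parsed XML."""
    record_id: int
    tree: ET.Element


@dataclass
class Index:
    """Terms extracted from Records, each pointing back to record ids."""
    postings: dict = field(default_factory=dict)

    def add(self, term, record_id):
        self.postings.setdefault(term, []).append(record_id)

    def terms(self):
        # "An ordered list of terms": return the terms in sorted order.
        return sorted(self.postings)

    def lookup(self, term):
        # Stand-in for a ResultSet: an ordered list of pointers to records.
        return list(self.postings.get(term, []))


doc = Document("<rec><title>Text Mining</title></rec>")
rec = Record(1, ET.fromstring(doc.data))

idx = Index()
for word in rec.tree.findtext("title").lower().split():
    idx.add(word, rec.record_id)

result_set = idx.lookup("mining")
print(idx.terms(), result_set)
```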

SLIDE 11: Cheshire3 Process Objects
PreParser:
– Given a Document, transform it into another Document (e.g. PDF to Text, Text to XML)
Parser:
– Given a Document as a raw XML string, return a parsed Record for the item.
Transformer:
– Given a Record, transform it into a Document (e.g. via XSLT, from XML to PDF, or XML to relational table)
Extracter:
– Extract terms of a given type from an XML sub-tree (e.g. extract Dates, Keywords, Exact string value)
Normaliser:
– Given the results of an extracter, transform the terms, maintaining the data structure (e.g. CaseNormaliser)
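The chain of process objects can be pictured as a pipeline of small functions, each consuming the previous one's output. The sketch below is only an illustration of that idea: the function names are hypothetical, documents are plain strings, and none of this is Cheshire3's real API.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_to_xml_preparser(text):
    """PreParser: turn a plain-text Document into an XML Document."""
    return "<doc><body>" + escape(text) + "</body></doc>"


def parser(xml_string):
    """Parser: turn a raw XML string into a parsed Record."""
    return ET.fromstring(xml_string)


def keyword_extracter(record, path):
    """Extracter: pull keyword terms (with counts) from an XML sub-tree."""
    terms = {}
    for node in record.findall(path):
        for word in re.findall(r"\w+", node.text or ""):
            terms[word] = terms.get(word, 0) + 1
    return terms


def case_normaliser(terms):
    """Normaliser: lower-case the terms, keeping the term -> count structure."""
    normalised = {}
    for term, count in terms.items():
        key = term.lower()
        normalised[key] = normalised.get(key, 0) + count
    return normalised


record = parser(text_to_xml_preparser("Grid search & grid text mining"))
terms = case_normaliser(keyword_extracter(record, "body"))
print(terms)
```

Note how "Grid" and "grid" are separate terms after extraction and are merged only by the normaliser, which is why the two steps are kept as distinct objects.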

SLIDE 12: Cheshire3 Abstract Objects
Server:
– A logical collection of databases
Database:
– A logical collection of Documents, their Record representations and Indexes of extracted terms.
Workflow:
– A 'meta-process' object that takes a workflow definition in XML and converts it into executable code.
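The Workflow idea, an XML definition turned into something executable, can be sketched in a few lines. The `<workflow>`/`<step>` vocabulary and the `STEPS` registry below are invented for this example; the slides do not show Cheshire3's actual workflow schema, so treat this purely as an illustration of the principle.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry mapping step names to callables.
STEPS = {
    "strip": str.strip,
    "lower": str.lower,
    "split": str.split,
}


def compile_workflow(xml_definition):
    """Convert an XML workflow definition into a single callable."""
    root = ET.fromstring(xml_definition)
    funcs = [STEPS[step.get("name")] for step in root.findall("step")]

    def run(value):
        # Apply each configured step in document order.
        for func in funcs:
            value = func(value)
        return value

    return run


definition = """<workflow>
  <step name="strip"/>
  <step name="lower"/>
  <step name="split"/>
</workflow>"""

workflow = compile_workflow(definition)
tokens = workflow("  Text Mining on the Grid  ")
print(tokens)
```

Keeping the processing order in configuration rather than in code is what allows the same definition to be shipped to many nodes, as the Grid-related slides suggest.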

SLIDE 13: Cheshire3 Grid Tests
– Running on a 30-processor cluster in Liverpool using PVM (Parallel Virtual Machine)
– Using 17 processors, with one "master" and 16 "slave" processes, we were able to parse and index MARC data at about records per second
– On a similar setup, 610 Mb of TEI data can be parsed and indexed in seconds
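The master/worker arrangement described above can be sketched as follows. This is not PVM and not the code used in the Liverpool tests: Python's standard `concurrent.futures.ThreadPoolExecutor` is swapped in as a stand-in so the example is self-contained, and the function names and toy records are assumptions. Threads in CPython will not give real CPU parallelism here; the point is only the split, index, and merge pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def index_chunk(chunk):
    """Worker: build a partial inverted index for one chunk of records."""
    partial = {}
    for record_id, text in chunk:
        for term in set(text.lower().split()):
            partial.setdefault(term, []).append(record_id)
    return partial


def parallel_index(records, workers=4):
    """Master: split the records, farm out chunks, merge partial indexes."""
    chunks = [records[i::workers] for i in range(workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(index_chunk, chunks):
            for term, ids in partial.items():
                merged.setdefault(term, []).extend(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


records = [
    (1, "grid digital libraries"),
    (2, "text mining"),
    (3, "grid text search"),
]
index = parallel_index(records, workers=2)
print(index["grid"], index["text"])
```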

SLIDE 14: Other Work at Berkeley
BioText Project
– Directed by Prof. Marti Hearst of SIMS
– Currently working on a number of areas in NLP analysis and use of Bio-Medical texts
– Developing new and efficient methods for abbreviation recognition
Slide Credit: Prof Marti Hearst

SLIDE 15: BioText: Main Goals
– Sophisticated Text Analysis
– Annotations in Database
– Improved Search Interface
Slide Credit: Prof Marti Hearst

SLIDE 16: BioText: A Two-Sided Approach
[Diagram]
– Resources: SwissProt, BLAST, MeSH, GO, WordNet, Medline, Journal Full Text
– Side one: Sophisticated Database Design & Algorithms
– Side two: Empirical Computational Linguistics Algorithms
Slide Credit: Prof Marti Hearst

SLIDE 17: Computational Language Goals
– Recognizing and annotating entities within textual documents
– Identifying semantic relations among entities
– To (eventually) be used in tandem with semi-automated reasoning systems.
Slide Credit: Prof Marti Hearst

SLIDE 18: Computational Linguistics Goals
– Mark up text with semantic relations
Slide Credit: Prof Marti Hearst

SLIDE 19: Database Research Issues
Efficient querying and updating
– Semi-structured information
– Fuzzy synonyms
– Collection subsets
Efficiently and effectively combining
– Relational databases
– Text databases
Layers of processing
– Hierarchical Ontologies
Slide Credit: Prof Marti Hearst

SLIDE 20: Abbreviation Recognition
Example of BioText work (Schwartz and Hearst 03)...
Fast, simple algorithm for recognizing abbreviation definitions.
– Simpler and faster than the rest
– Higher precision and recall
– Idea: Work backwards from the end
Examples:
– In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
– Gcn5-related N-acetyltransferase (GNAT)
Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
Slide Credit: Prof Marti Hearst
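The "work backwards from the end" idea can be sketched as below. This is a simplified Python rendering of the long-form matching step as described in the published Schwartz and Hearst algorithm, not the BioText project's own code: the function names are assumptions, only the "long form (SHORT)" pattern is handled, and the algorithm's additional validity checks on short and long forms are left out.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally match at the start of a word.
        while ((l_index >= 0 and candidate[l_index].lower() != ch) or
               (s_index == 0 and l_index > 0
                and candidate[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_abbreviations(sentence):
    """Find 'long form (SHORT)' pairs in a sentence."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not short_form:
            continue
        words = sentence[:match.start()].split()
        # Search window: at most min(|A| + 5, |A| * 2) words before the bracket.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


s1 = ("In eukaryotes, the key to transcriptional regulation of the Heat Shock "
      "Response is the Heat Shock Transcription Factor (HSF).")
s2 = "Gcn5-related N-acetyltransferase (GNAT)"
print(extract_abbreviations(s1))
print(extract_abbreviations(s2))
```

Scanning from the right means the closest plausible definition is found first, which is why the match for HSF starts at the second "Heat" in the sentence rather than the first.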