National Center for Text Mining Launch Event (NaCTeM Launch, Manchester, 2005.03.21). Ray R. Larson, University of California, Berkeley, School of Information.


SLIDE 1: National Center for Text Mining Launch Event
NaCTeM Launch, Manchester, 2005.03.21
Ray R. Larson, University of California, Berkeley, School of Information Management and Systems
Thanks to Dr. Eric Yen, Prof. Michael Buckland, Dr. Rob Sanderson and Prof. Marti Hearst for parts of this presentation.

SLIDE 2: Overview
The Grid, Text Mining and Digital Libraries
Cheshire3: Bringing Search and Text Mining to Grid-Based Digital Libraries
Other Related Berkeley work: The BioText Project

SLIDE 3: Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
[Diagram: layered Grid architecture, top to bottom]
Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, …
Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, …
Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

SLIDE 4: Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Diagram: the same layered Grid architecture, extended with digital library and text mining components]
Applications: Chemical Engineering, Climate, High energy physics, Cosmology, Astrophysics, Combustion, Humanities computing, Digital Libraries, Bio-Medical, …
Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, …
Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

SLIDE 5: Grid-Based Digital Libraries
Large-scale distributed storage requirements and technologies
Distributed Information Retrieval issues and algorithms
Organizing distributed digital collections
Shared Metadata – standards and requirements
Managing distributed digital collections
Security and access control
Collection replication and backup

SLIDE 6: Cheshire3 Overview
XML Information Retrieval Engine
– 3rd generation of the UC Berkeley Cheshire system, as co-developed at the University of Liverpool.
– Uses Python for flexibility and extensibility, but imports C/C++ based libraries for processing speed.
– Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI to name a few.
– Grid capable. Uses distributed configuration files, workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes.
– Free and Open Source Software (GPL Licence).
– http://www.cheshire3.org/ (under development!)

SLIDE 7: Cheshire3 Server Overview
[Diagram: Cheshire III server architecture. Labelled components: API, Indexing, Search, Scan, Clustering, Normalization, Protocol Handler, DB API, Local DB, Indexes, Result Sets, XML Config & Metadata Info, User Info, Access Info, Authentication, Config & Control, Server Control, Staff UI Config, Network, Apache Interface, Remote Systems (any protocol), User/Clients. Protocols and interfaces shown: native calls, Z39.50, SOAP, OAI, JDBC, SRW, OpenURL, UDDI, WSRP, OGIS, Fetch ID, Put ID.]

SLIDE 8: Cheshire Additions for NaCTeM

SLIDE 9: Cheshire3 Object Model
[Diagram: objects shown are Server, Database, UserStore, RecordStore, IndexStore, Index, ProtocolHandler, DocumentGroup, Document, PreParser, Parser, Record, Transformer, Extracter, Normaliser.]

SLIDE 10: Cheshire3 Data Objects
DocumentGroup:
– A collection of Document objects (e.g. from a file, directory, or external search)
Document:
– A single item, in any format (e.g. PDF file, raw XML string, relational table)
Record:
– A single item, represented as parsed XML
Query:
– A search query, in the form of CQL (an abstract query language for Information Retrieval)
ResultSet:
– An ordered list of pointers to records
Index:
– An ordered list of terms extracted from Records

SLIDE 11: Cheshire3 Process Objects
PreParser:
– Given a Document, transform it into another Document (e.g. PDF to Text, Text to XML)
Parser:
– Given a Document as a raw XML string, return a parsed Record for the item.
Transformer:
– Given a Record, transform it into a Document (e.g. via XSLT, from XML to PDF, or XML to relational table)
Extracter:
– Extract terms of a given type from an XML sub-tree (e.g. extract Dates, Keywords, Exact string value)
Normaliser:
– Given the results of an extracter, transform the terms, maintaining the data structure (e.g. CaseNormaliser)
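
Illustration (not part of the original slides): the data and process objects on Slides 10 and 11 chain into an indexing pipeline of the form Document, PreParser, Parser, Record, Extracter, Normaliser, Index. The sketch below mimics that chain with small stand-in classes. Every class, method and field name here is hypothetical and chosen only to echo the slide vocabulary; none of it is the actual Cheshire3 API, and the "index" is just a sorted dictionary of terms to record ids.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Document:
    """Stand-in for a Document: a single item in any format (here, a string)."""
    data: str


@dataclass
class Record:
    """Stand-in for a Record: a single item represented as parsed XML."""
    rec_id: int
    dom: ET.Element


class TextToXmlPreParser:
    """PreParser: Document in, Document out (plain text wrapped as XML)."""
    def process(self, doc):
        return Document(f"<doc><text>{escape(doc.data)}</text></doc>")


class XmlParser:
    """Parser: raw XML Document in, parsed Record out."""
    def process(self, doc, rec_id):
        return Record(rec_id, ET.fromstring(doc.data))


class KeywordExtracter:
    """Extracter: pull whitespace-separated keywords from an XML sub-tree."""
    def process(self, element):
        return (element.text or "").split()


class CaseNormaliser:
    """Normaliser: transform terms while keeping the list structure."""
    def process(self, terms):
        return [term.lower() for term in terms]


def build_index(texts):
    """Run each text through the chain and return {term: [record ids]}."""
    pre_parser, parser = TextToXmlPreParser(), XmlParser()
    extracter, normaliser = KeywordExtracter(), CaseNormaliser()
    postings = {}
    for rec_id, text in enumerate(texts):
        record = parser.process(pre_parser.process(Document(text)), rec_id)
        terms = normaliser.process(extracter.process(record.dom.find("text")))
        for term in set(terms):  # one posting per record, even if a term repeats
            postings.setdefault(term, []).append(record.rec_id)
    return dict(sorted(postings.items()))  # ordered list of terms, as on Slide 10
```

Calling build_index(["Text Mining", "Grid text retrieval"]) yields an index in which "text" points at both records. Swapping in a different PreParser or Normaliser changes one stage without touching the others, which is the point of keeping the stages as separate objects.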

SLIDE 12: Cheshire3 Abstract Objects
Server:
– A logical collection of databases
Database:
– A logical collection of Documents, their Record representations and Indexes of extracted terms.
Workflow:
– A 'meta-process' object that takes a workflow definition in XML and converts it into executable code.

SLIDE 13: Cheshire3 Grid Tests
Running on a 30-processor cluster in Liverpool using PVM (Parallel Virtual Machine)
Using 17 processors, with one "master" and 16 "slave" processes, we were able to parse and index MARC data at about 10,000 records per second
On a similar setup, 610 Mb of TEI data can be parsed and indexed in seconds
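
Illustration (not part of the original slides): the master/worker arrangement on Slide 13 can be pictured as one coordinating process that splits the records into batches, hands each batch to a worker, and merges the partial results. The sketch below uses a Python thread pool purely as a stand-in for PVM nodes. The function names, the batch size and the toy "index" (term counts) are all assumptions made for illustration, and the sketch says nothing about how Cheshire3 actually distributes work or about the throughput figures quoted above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def index_batch(records):
    """Worker: 'parse and index' a batch, here by counting lowercased terms."""
    counts = Counter()
    for record in records:
        counts.update(record.lower().split())
    return counts


def run_master(records, n_workers=16, batch_size=2):
    """Master: split records into batches, farm them out, merge the results."""
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(index_batch, batches):
            merged.update(partial)  # merging happens only in the master
    return merged
```

Because each batch is indexed independently and only the merge step is serial, adding workers helps for as long as the per-batch work dominates the cost of merging.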

SLIDE 14: Other Work at Berkeley
BioText Project
– Directed by Prof. Marti Hearst of SIMS
– Currently working on a number of areas in NLP analysis and use of Bio-Medical texts
– Developing new and efficient methods for Abbreviation recognition
Slide Credit: Prof Marti Hearst

SLIDE 15: BioText: Main Goals
Sophisticated Text Analysis
Annotations in Database
Improved Search Interface
Slide Credit: Prof Marti Hearst

SLIDE 16: BioText: A Two-Sided Approach
[Diagram: data sources SwissProt, Blast, MeSH, GO, WordNet, Medline and Journal Full Text, approached from two sides]
Sophisticated Database Design & Algorithms
Empirical Computational Linguistics Algorithms
Slide Credit: Prof Marti Hearst

SLIDE 17: Computational Language Goals
Recognizing and annotating entities within textual documents
Identifying semantic relations among entities
To (eventually) be used in tandem with semi-automated reasoning systems.
Slide Credit: Prof Marti Hearst

SLIDE 18: Computational Linguistics Goals
Mark up text with semantic relations
Slide Credit: Prof Marti Hearst

SLIDE 19: Database Research Issues
Efficient querying and updating
– Semi-structured information
– Fuzzy synonyms
– Collection subsets
Efficiently and effectively combining
– Relational databases
– Text databases
Layers of processing
– Hierarchical Ontologies
Slide Credit: Prof Marti Hearst

SLIDE 20: Abbreviation Recognition
Example of BioText work (Schwartz and Hearst 03)…
Fast, simple algorithm for recognizing abbreviation definitions.
– Simpler and faster than the rest
– Higher precision and recall
– Idea: Work backwards from the end
Examples:
– In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
– Gcn5-related N-acetyltransferase (GNAT)
Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
Slide Credit: Prof Marti Hearst
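
Illustration (not part of the original slides): the "work backwards from the end" idea on Slide 20 can be sketched as a right-to-left character match. Starting from the last character of the short form and the last character of the text before the parenthesis, each short-form character is matched against the nearest earlier character in the text, and the first short-form character must additionally land at the start of a word. The code below is a minimal sketch in the spirit of that description, not the authors' implementation. The function names, the regular expression, and the candidate window of min(|A| + 5, 2|A|) words are assumptions, and the validity filters a real system would apply to short forms are omitted.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short_form against candidate right to left; return the long form or None."""
    l_idx = len(candidate) - 1
    for s_idx in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            continue  # skip punctuation inside the short form
        # Move left until the character matches; the first short-form
        # character must also sit at the beginning of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None  # ran out of text before matching every character
        l_idx -= 1
    # Start the long form at the beginning of the word holding the last match.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_abbreviations(sentence):
    """Return (short form, long form) pairs for 'long form (SHORT)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not short_form:
            continue
        n = len(short_form)
        words = sentence[:match.start()].split()
        # Assumed window: only the last min(n + 5, 2n) words are considered.
        window = " ".join(words[-min(n + 5, n * 2):])
        long_form = find_best_long_form(short_form, window)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

On the two slide examples, the first sentence should pair HSF with "Heat Shock Transcription Factor" rather than the earlier "Heat Shock Response", because matching starts at the end of the text and stops at the nearest word-initial "H" that completes the match. The second example pairs GNAT with "Gcn5-related N-acetyltransferase", where "A" and "T" are matched inside a single word.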

