CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.

CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool azaroth@liv.ac.uk http://www.nactem.ac.uk

CNI, 3rd April 2006 Slide 2 Overview Text Mining? NaCTeM Consortium Components Service Infrastructure Future Work

CNI, 3rd April 2006 Slide 3 Centre for... National Centre for... what was that? Ticks Mining! TEXT

CNI, 3rd April 2006 Slide 4... Text Mining?... Text Mining? Text Mining: No canonical definition Commonly used definition based on Data Mining: “The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.” “The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”

CNI, 3rd April 2006 Slide 5... Text Mining?... Text Mining? Typical Data Mining Functions: Classification Association Rule Mining Clustering Useful when applied to texts, but doesn't fulfill the definition as they don't discover “facts”. Information Retrieval also doesn't discover facts.

CNI, 3rd April 2006 Slide 6... Text Mining?... Text Mining? Need to understand the meaning of the text: Part of Speech tagging Clauses Named Entity Recognition Find correlations of entities Infer information from logical chains Result: New Knowledge

CNI, 3rd April 2006 Slide 7 Other Benefits Other Benefits Plus a lot more: Improved document classification Automatic semantic annotation of documents Improved access -- search by semantics and concepts Improved clustering of documents by concept Summarization Visualization techniques

CNI, 3rd April 2006 Slide 8 Event Extraction Event Extraction Extract events from the text along with information about the participants Can be modeled as relationships between named entities Extracting events allows discovery of hidden temporal correlations eg: Google refuses to announce plans. Google's stock falls. Improves understanding of the semantics, improving the functions based around those semantics

CNI, 3rd April 2006 Slide 9 NaCTeM NaCTeM Hosted at University of Manchester Participants: Universities of Manchester, Liverpool, Salford Plus: San Diego Supercomputer Centre, University of Tokyo, University of Geneva, University of California Berkeley Six full time posts for 3 years (2005-2007) Plus active board of directors and experts Current Director: Professor Jun'ichi Tsujii from U.Tokyo Funding: JISC, BBSRC, EPSRC

CNI, 3rd April 2006 Slide 10 NaCTeM Aims NaCTeM Aims Provide text mining oriented services Facilitate access to text mining resources User support, advice, training and consultancy Participate in international research Formulate best practice guidelines Increase awareness of text mining in all domains Develop links with industrial partners involved in text mining

CNI, 3rd April 2006 Slide 11 Components Components Liverpool: Cheshire3 (Information framework) Manchester: CAFETIERE (Entity recognition, event extraction) Salford: TerMine (Automatic term recognition) SDSC: Storage Resource Broker (Data grid) UC Berkeley: Cheshire, TM/IR expertise U.Tokyo: GENIA, ENJU (Text analysis tools) U.Geneva: User studies and evaluation

CNI, 3rd April 2006 Slide 12 Cheshire3 Cheshire3 Information Processing Framework Liverpool and UC Berkeley Standards based: XML, SRU, Unicode, etc. Scalable: Single machine to Grid (PVM, MPI, SRB) Extensible: Python + C, Object Oriented with stable API Work ongoing to integrate Data Mining tools and other information processing applications

CNI, 3rd April 2006 Slide 13 Cheshire3 Examples Cheshire3 Examples Integrated tools from other participants in preparation for NaCTeM service infrastructure. Medline: 4350 records/second using 60 concurrent processes on SDSC's Teragrid cluster 440 seconds to index 1 field from 16 million MARC records Distributed network of Archival Descriptions in the UK NARA ERA prototype system with SDSC

CNI, 3rd April 2006 Slide 14 CAFETIERE CAFETIERE Entity Recognition and Annotation University of Manchester Discovers named entities in part of speech tagged text Discovers temporal events referring to those entities Integration of ontologies and term processing Rules based

CNI, 3rd April 2006 Slide 15 CAFETIERE Example CAFETIERE Example

CNI, 3rd April 2006 Slide 16 TerMine TerMine Automatic Term Recognition University of Salford/Manchester Discovers important terms Assigns 'C-value' score to rank terms Interaction with terminology databases for term management

CNI, 3rd April 2006 Slide 17 TerMine Example TerMine Example

CNI, 3rd April 2006 Slide 18 U. Tokyo Tools U. Tokyo Tools Natural Language Parsing University of Tokyo Tagger, Chunker, ENJU, GENIA Necessary for any text mining application Fast and accurate http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/ http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/

CNI, 3rd April 2006 Slide 19 Tokyo Tools Example Tokyo Tools Example

CNI, 3rd April 2006 Slide 20 Tokyo Tools Example2 Tokyo Tools Example2

CNI, 3rd April 2006 Slide 21 Service Infrastructure Service Infrastructure NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (eg other data sets, ontologies etc) Requirements: Lots of processing power Lots of storage capacity Easily extensible/configurable service framework Access to cutting edge TM, DM and IR tools

CNI, 3rd April 2006 Slide 22 Service Infrastructure Service Infrastructure Processing provided by UK National Grid Service Data Storage via SDSC's Storage Resource Broker Important to store multiple versions of each document Cheshire3 provides the Grid enabled information infrastructure Plus information retrieval and data mining tools Manchester and Tokyo provide the text mining tools Stable tools integrated into Cheshire3 already

CNI, 3rd April 2006 Slide 23 Service Infrastructure Service Infrastructure Initial NaCTeM services will be focused on the bio domain: Bio-informatics is a growing field Interest from both academic and corporate sectors Large datasets/services available (MeSH, Medline,...) Web portal interaction Then expand into other areas, such as Social Sciences and Historical text analysis.

CNI, 3rd April 2006 Slide 24 Future Work Future Work Services for other domains GUI Workflow configuration Integration of user developed services and applications Maximizing workflow potential with 'smart' components Standardizing annotation schemas Conference/Workshop Other?

CNI, 3rd April 2006 Slide 25 Thank You Thank You Questions?... Reception!

CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.

Similar presentations

Presentation on theme: "CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.

Similar presentations

Presentation on theme: "CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool."— Presentation transcript:

Similar presentations

About project

Feedback