1 National Centre for Text Mining
Mission
– To provide TM tools for users, in particular scientists and researchers
– To coordinate activities in the TM community
Core Partners
– University of Manchester: NLP and DM
– Salford University: Terminology
– Liverpool University: IR and Digital Archive
External Partners
– San Diego SC, UC Berkeley, University of Geneva, University of Tokyo
2 National Centre for Text Mining: Biomedical domain
3 Strategy and Roadmap for TM in Biomedicine
Vast number of Google/Yahoo users: satisfied
Small number of users: unsatisfied
Huge demand for specialized TM tools in biomedical domains
Current TM tools, though successful in some business applications, do not meet the requirements of users in biomedical domains.
– What are the requirements for TM for users in biomedical domains?
– What technologies should be integrated into future TM for science?
More demand-oriented approach
– Is the nature of TM in scientific fields different from that of business applications?
More publicity and marketing
4 From technological seeds
5 Science: Knowledge
Raw Data
Unstructured Information (Text)
Semi-structured Information (XML + Text)
Structured Information (Databases)
Ontology-based KMS
Natural Language Processing
Intelligent Text Management System
Effective management of text and knowledge is the key.
6 Intelligent TM systems
– Retrieval: Intelligent Information Retrieval and Question Answering
– Integration: Integration of Text with Data and Knowledge
– Discovery: Text Mining and Knowledge Discovery
7 From Text to Knowledge
Language Domain
Non-Trivial Mappings: Terminology, NLP, Paraphrasing
Knowledge Domain (motivated independently of language)
– Ontology
– Relationships among concepts: Metabolic Pathways, Signal Pathways, Association between Diseases and Genes, ……
8 Examples of Technical Seeds
Term Variants
– Terms (names of proteins, genes, diseases, symptoms, etc.) denote basic conceptual units in the knowledge domain.
Syntactic Variants
– Relationships and complex conceptual units are mapped to sentences.
Term Acquisition from Text
– New terms (basic conceptual units) are constantly introduced.
Resource building for specialized domains is crucial.
10 Acronym: NF-kappa B, NF kappa B, NFKB, NF-KB, NF kB
Expanded form: nuclear factor-kappa B, nuclear-factor kappa B, nuclear factor kappa B, nuclear factor κB, Nuclear Factor kappa B, ………..
Other node in the figure: factor
Relation labels in the figure: Spelling variation, Synonym, Hypernym
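The spelling variation on this slide can be illustrated with a small normalizer that maps surface forms to one lookup key. This is only a sketch under stated assumptions: the function name `normalize_term`, the `GREEK` table, and the specific rules (hyphen and case folding, spelling out Greek letters) are hypothetical and are not taken from the NaCTeM tools.

```python
import re
import unicodedata

# Hypothetical table: Greek letters spelled out so "κB" and "kappa B" agree.
GREEK = {"α": "alpha", "β": "beta", "κ": "kappa"}


def normalize_term(term: str) -> str:
    """Collapse spelling variants of a term to a single lookup key."""
    text = unicodedata.normalize("NFKC", term)
    for letter, name in GREEK.items():
        text = text.replace(letter, f" {name} ")
    # Split fused forms such as "kappaB" into "kappa B" before lowercasing.
    text = re.sub(r"(kappa|alpha|beta)(?=[A-Z])", r"\1 ", text)
    text = text.lower()
    # Treat hyphens, underscores, dots and slashes as word separators.
    text = re.sub(r"[-_./]", " ", text)
    return " ".join(text.split())


expanded_forms = [
    "nuclear factor-kappa B",
    "nuclear-factor kappa B",
    "nuclear factor kappa B",
    "nuclear factor κB",
    "Nuclear Factor kappa B",
]
keys = {normalize_term(form) for form in expanded_forms}
print(keys)
```

Rules of this kind only cover spelling variation. Linking an acronym such as NFKB to its expanded form, or a term to its synonyms and hypernyms, would need a dictionary or another resource beyond string normalization.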
11 Automatically Generated Term Variants (1)
NF kappa B; Transcription Factor NF kappa B; NF-kappa B; NF kB; Immunoglobulin Enhancer-Binding Protein; Immunoglobulin Enhancer Binding Protein; Transcription Factor NF-kB; Transcription Factor NF kB; Factor NF-kB, Transcription; nuclear factor kappa beta; NF kappaB; NF kappa B chain; NF kappa B subunit; Transcription Factor NF-kappa B; NF-kB, Transcription Factor; NF-kB; Neurofibromatosis Type kappa B 0
12 Automatically Generated Term Variants (2)
tumor necrosis factor A; TNF A; tumor necrosis factor; TNF alpha; TNFA; TNF; Tumour necrosis factor alpha; Tumor Necrosis Factor alpha; Tumor Necrosis Factor-Alpha; TUMOR NECROSIS FACTOR.ALPHA; Tumor necrosis factor alpha; Tumor Necrosis Factor-alpha; TNF-Alpha; TNF-alpha 6899
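The slides do not say how these variants were generated. As a purely illustrative sketch, one simple way to produce candidates of this kind is to combine per-token alternatives with different joiners. The function `generate_variants` and the `alternatives` table below are hypothetical.

```python
from itertools import product


def generate_variants(tokens, alternatives, joiners=(" ", "-")):
    """Enumerate candidate surface forms of a multi-token term.

    tokens: canonical tokens, e.g. ["TNF", "alpha"].
    alternatives: hypothetical map from a token to its surface forms.
    joiners: separators tried between the head and the final token.
    """
    options = [alternatives.get(token, [token]) for token in tokens]
    variants = set()
    for choice in product(*options):
        if len(choice) == 1:
            variants.add(choice[0])
            continue
        head, last = " ".join(choice[:-1]), choice[-1]
        for joiner in joiners:
            variants.add(head + joiner + last)
    return sorted(variants)


candidates = generate_variants(["TNF", "alpha"], {"alpha": ["alpha", "Alpha", "A"]})
for candidate in candidates:
    print(candidate)
```

Combinatorial generation of this sort overproduces, so candidates would normally be checked against a corpus or a curated resource before being accepted.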
13 Examples of Technical Seeds
Term Variants
– Terms (names of proteins, genes, diseases, symptoms, etc.) denote basic conceptual units in the knowledge domain.
Syntactic Variants
– Relationships and complex conceptual units in the knowledge domain are mapped to sentences in the language domain.
Term Acquisition from Text
– New terms (basic conceptual units) are constantly introduced.
Resource building for specialized domains is crucial.
14 Non-trivial Mapping
Language Domain: Spelling Variants, Synonyms, Acronyms
Knowledge Domain: motivated independently of language
Syntactic Variants: same relations with different structures
[A] protein activates [B] (Pathway extraction)
– Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
– Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
– Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
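To see why this mapping is called non-trivial, consider a naive surface pattern for "[A] activates [B]". The sketch below is a deliberately simple regular expression, not the parser-based approach used in this work. The pattern `ACTIVATE` and the function `extract_activation` are hypothetical, and the elided parts of the example sentences are left out rather than filled in.

```python
import re

# Naive pattern: a capitalized name (optionally followed by "protein" or
# "gene") that stands directly before the verb "activate(s)".
ACTIVATE = re.compile(
    r"(?P<agent>[A-Z][\w()-]*(?:\s+(?:protein|gene))?)\s+"
    r"(?:could\s+|can\s+)?activates?\s+"
    r"(?:the\s+)?(?P<target>.+?)(?:[.,;]|$)"
)


def extract_activation(sentence):
    """Return (agent, target) if the surface pattern matches, else None."""
    match = ACTIVATE.search(sentence)
    if match is None:
        return None
    return match.group("agent"), match.group("target").strip()


sentences = [
    "We postulate that only phosphorylated PHO2 protein could activate "
    "the transcription of PHO5 gene.",
    "Full-length Staufen protein lacking this insertion is able to "
    "associate with oskar mRNA and activate its translation.",
    "Transcription initiation by the sigma(54)-RNA polymerase holoenzyme "
    "requires an enhancer-binding protein that is thought to contact "
    "sigma(54) to activate transcription.",
]
results = [extract_activation(sentence) for sentence in sentences]
print(results)
```

Only the first sentence places the agent directly before the verb, so only that one matches. In the other two the agent is separated from "activate" by coordination or an infinitival clause, which is the kind of syntactic variation that motivates deeper analysis.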
15 Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
Sentence: The protein is activated by it
POS tags: DT NN VBZ VBN IN PRP
Parse tree node labels: dt np vp vp pp np np pp vp s
Dependency labels: arg1, arg2, mod
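The value of a predicate-argument structure is that active and passive sentences reduce to the same relation. The toy below illustrates that idea over POS-tagged tokens. It is a hand-written rule for two-noun sentences, not Enju and not an HPSG parser. The names `Relation` and `to_relation`, the lemma table, and the convention that arg1 is the logical subject and arg2 the logical object are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    predicate: str
    arg1: str  # assumed convention: logical subject
    arg2: str  # assumed convention: logical object


NOUN_TAGS = {"NN", "NNS", "NNP", "PRP"}


def to_relation(tagged, lemmas):
    """Map a simple two-noun tagged sentence to a Relation, or None.

    tagged: list of (word, POS tag) pairs.
    lemmas: hypothetical map from a verb's surface form to its lemma.
    """
    nouns = [word.lower() for word, tag in tagged if tag in NOUN_TAGS]
    if len(nouns) != 2:
        return None
    participle = next((w for w, t in tagged if t == "VBN"), None)
    has_by = any(word.lower() == "by" for word, _ in tagged)
    if participle is not None and has_by:
        # Passive: the surface subject is the logical object.
        return Relation(lemmas.get(participle, participle), nouns[1], nouns[0])
    verb = next((w for w, t in tagged if t.startswith("VB")), None)
    if verb is None:
        return None
    return Relation(lemmas.get(verb, verb), nouns[0], nouns[1])


lemmas = {"activated": "activate", "activates": "activate"}
passive = to_relation(
    [("The", "DT"), ("protein", "NN"), ("is", "VBZ"),
     ("activated", "VBN"), ("by", "IN"), ("it", "PRP")],
    lemmas,
)
active = to_relation(
    [("It", "PRP"), ("activates", "VBZ"), ("the", "DT"), ("protein", "NN")],
    lemmas,
)
print(passive)
print(active)
```

Both inputs yield the same relation, which is what lets a downstream extraction rule be written once against the structure instead of once per surface form.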
16 Text Archive with Feature Objects
Managing texts, data representation and their semantics
Components: Text DB, DB of Feature Objects, Database Module
Record fields: Text ID, Start position of the region, End position of the region, Annotator, Content
Database Module: Copy and Unification; Specialization by unification
Data representation, Text, Semantics
Text fragment in the figure: Ubiquitin E is bound with
– Fine-grained units of information
– Context dependency
– Persistent nature of knowledge and information
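The record fields and the "specialization by unification" operation named on this slide can be sketched as a stand-off annotation record whose content is a nested feature structure. This is a minimal illustration under assumptions: the class `FeatureObject`, the functions `unify` and `specialize`, and the example values are hypothetical and do not describe the actual database module.

```python
import copy
from dataclasses import dataclass, replace


@dataclass
class FeatureObject:
    text_id: str    # which text the region belongs to
    start: int      # start position of the region
    end: int        # end position of the region
    annotator: str  # who or what produced the annotation
    content: dict   # feature structure describing the region


def unify(a, b):
    """Unify two nested-dict feature structures; return None on a clash."""
    result = copy.deepcopy(a)
    for key, value in b.items():
        if key not in result:
            result[key] = copy.deepcopy(value)
        elif isinstance(result[key], dict) and isinstance(value, dict):
            merged = unify(result[key], value)
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != value:
            return None
    return result


def specialize(obj, extra, annotator):
    """Copy a feature object and specialize its content by unification."""
    content = unify(obj.content, extra)
    if content is None:
        return None
    return replace(obj, annotator=annotator, content=content)


base = FeatureObject("doc1", 0, 9, "tagger", {"type": "protein"})
specific = specialize(base, {"name": "ubiquitin"}, annotator="curator")
clash = specialize(base, {"type": "gene"}, annotator="curator")
print(specific)
print(clash)
```

Because `specialize` returns a copy, the original record stays unchanged, which fits the slide's emphasis on the persistent nature of stored information.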
17 Demo (The website demo is not available now.)