Text Mining and Knowledge Management
Junichi Tsujii
GENIA Project, Kototoi Project ( tokyo.ac.jp/GENIA/)
Computer Science, University of Tokyo
[Figure: Increase in Medline. Yearly increments and cumulative accumulation of records, plotted by year; accumulation axis runs from 0 to 14,000,000.]
TEXT MINING for Bio-Medicine in Japan
1. Institute for Medical Science (IMS), U-Tokyo: Information Extraction from Text for Signal Pathways
2. Japan Bio-Informatics Research Centre (JBIRC): Interpretation of Micro-Array Data
3. Research Institute for Genetics (RIG): Disease-Gene Association
4. Research Institute for Natural Science (Riken): Tool for Curators for GO Annotation
Resource Building for TM in BM: GENIA Project ( )
GENIA Corpus (Annotated Text)
Information Exploitation System: Kototoi Project ( )
Adaptable POS Tagger (Bio-Tagger), NER adapted for BM
Parser based on HPSG (Enju), ML for Text Processing
TEXT Mining = DATA Mining + BOW?
BOW: “Bag of Words” Model
TM products on the market come with fanciful visualization facilities and trend analysis tools.
The model does not work because
(1) Language is a complex system
(2) Language is inherently associated with knowledge
Mining + NLP + Knowledge Management
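One way to see the limitation of the bag-of-words model for relation extraction: two sentences stating opposite relations can have identical bags. The sketch below is illustrative only; the example sentences and the `bag_of_words` helper are assumptions, not part of any GENIA tool.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Lower-case, split on whitespace, and count words; word order is discarded."""
    return Counter(sentence.lower().split())

# Hypothetical sentences with opposite argument roles.
s1 = "protein A activates protein B"
s2 = "protein B activates protein A"

# The two bags are equal even though the stated relations differ.
same_bag = bag_of_words(s1) == bag_of_words(s2)
print(same_bag)
```

Because the bags coincide, no model that sees only the bag can tell which protein activates which.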
Effective management of knowledge and information is the key
A huge amount of raw data:
Unstructured Information (Text)
Semi-structured Information (XML+Text)
Structured Information (Data bases)
Ontology-based KMS, Natural Language Processing, Information Exploitation
Overview of GENIA System
[Diagram; recoverable labels follow]
Information Extraction Module: Identify & classify terms; Identify events
Retrieval Module: Request enhancement; Spawn request; Classify documents
Interface Module: GUI; HTML conversion; System integration
Corpus Module: Markup generation / compilation; Annotated corpus construction
Database Module: DB design / access / management; DB construction
Concept Module: BK design / construction / compilation
Data and resources: Raw (OCR) Text, Structure, Annotated Corpus, Document, Named-Entity, Event, Database, Ontology, Markup language, Data model, Background Knowledge, MEDLINE
Other labels: Security, User, IR Request, Abstract, Full Paper
Non-Trivial Mappings
Language Domain: Linguistic expressions
Knowledge Domain: Concepts and Relationships among Them (motivated independently of language)
Terminology, NLP, Paraphrasing
1. Size of Knowledge
2. Context Dependency
3. Evolving Nature of Science
4. Hypothetical Nature of Ontology
5. Inconsistency
Terms: address
Concepts: address-as-a-speech, address-as-a-mail-address, address-as-a-street-address
A term is introduced, without an explicit understanding of what it means, in order for one to make statements about it.
Semantic Web by Tim Berners-Lee et al., Scientific American (2001)
Language Domain / Concept Domain
A cluster of realizations of terms
Automatically Generated Variants (score axis: 1.000 to 0)
NF kappa B
Transcription Factor NF kappa B
NF-kappa B
NF kB, Transcription Factor
NF kB
Immunoglobulin Enhancer-Binding Protein
Immunoglobulin Enhancer Binding Protein
Enhancer-Binding Protein, Immunoglobulin
kappa B Enhancer Binding Protein
Transcription Factor NF-kB
Transcription Factor NF kB
Factor NF-kB, Transcription
nuclear factor kappa beta
NF kappaB
NF kappa B chain
NF kappa B subunit
Transcription Factor NF-kappa B
NF-kB, Transcription Factor
NF-kB
Neurofibromatosis Type kappa B
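The slide does not say how these variants or their scores were produced. As a rough sketch of the general idea only, the following applies a few hand-written rewrite rules to a seed term: abbreviation of "kappa B" to "kB", space/hyphen alternation, fusing "kappa B" into "kappaB", and adding a head phrase in plain or inverted order. The rules, the function name, and the default head phrase are assumptions for illustration, and no scoring is attempted.

```python
from typing import Set

def generate_variants(term: str, head: str = "Transcription Factor") -> Set[str]:
    """Generate surface variants of `term` with a few assumed rewrite rules."""
    base = {term}
    # Rule 1 (assumed): abbreviate "kappa B" to "kB".
    base |= {t.replace("kappa B", "kB") for t in base}
    # Rule 2 (assumed): replace the space after "NF" with a hyphen.
    base |= {t.replace("NF ", "NF-", 1) for t in base}
    # Rule 3 (assumed): fuse "kappa B" into "kappaB".
    base |= {t.replace("kappa B", "kappaB") for t in base}

    variants = set(base)
    for t in base:
        variants.add(f"{head} {t}")   # head phrase in front
        variants.add(f"{t}, {head}")  # inverted, index-style form
    return variants

variants = generate_variants("NF kappa B")
for v in sorted(variants):
    print(v)
```

The output overlaps with the list on the slide (for example "NF-kappa B" and "Transcription Factor NF-kB") but does not reproduce it; synonyms such as "Immunoglobulin Enhancer-Binding Protein" cannot be derived by string rewriting and would need a dictionary.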
Non-trivial Mapping
Language Domain / Knowledge Domain (motivated independently of language)
Spelling Variants, Synonyms, Acronyms
Same relations with different structures: [A] protein activates [B] (Pathway extraction)
Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein])
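Reading `>` as "contains", which is the usual interpretation in region algebra and is assumed here, the query `[sentence] > ([arg1_activate] > [protein])` asks for sentences containing an arg1-of-activate region that itself contains a protein region. A minimal sketch with regions as character offsets; the annotation data and the `containing` helper are hypothetical and not the actual retrieval engine.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive

def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Return the regions of `outer` that contain at least one region of `inner`."""
    return [
        (s, e)
        for (s, e) in outer
        if any(s <= s2 and e2 <= e for (s2, e2) in inner)
    ]

# Hypothetical stand-off annotations over one text.
sentences = [(0, 40), (41, 90)]
arg1_activate = [(0, 12), (50, 58)]   # spans filling arg1 of "activate"
proteins = [(4, 12), (70, 80)]        # spans tagged as protein

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

In this toy data the second sentence holds both an arg1 region and a protein region, but the protein does not sit inside the arg1 span, so only the first sentence is returned. A plain keyword co-occurrence search could not make that distinction.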
Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
[Parse tree figure. Node labels: dt, np, vp, vp, pp, np, np, pp, vp, s. Predicate-argument links: arg1, arg2, mod]
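To illustrate why predicate-argument structure helps with "same relations with different structures", the sketch below maps an active clause and its passive counterpart to the same structure. It is a toy pattern matcher over POS-tagged tokens, not how Enju works. The lemma table, the function name, and the convention that arg1 is the logical subject and arg2 the logical object are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple

class PredArg(NamedTuple):
    predicate: str
    arg1: str  # logical subject (assumed convention)
    arg2: str  # logical object (assumed convention)

LEMMAS = {"activates": "activate", "activated": "activate"}  # hypothetical lemma table
NOMINAL = {"NN", "NNS", "NNP", "PRP"}  # Penn Treebank noun and pronoun tags

def simple_pred_arg(tagged: List[Tuple[str, str]]) -> PredArg:
    """Handle only 'X VBZ Y' and 'X is VBN by Y' clauses with one-word heads."""
    tags = [t for _, t in tagged]
    passive = "VBN" in tags
    v = tags.index("VBN") if passive else tags.index("VBZ")
    before = [w.lower() for w, t in tagged[:v] if t in NOMINAL]
    after = [w.lower() for w, t in tagged[v + 1:] if t in NOMINAL]
    subj, other = before[-1], after[-1]
    pred = LEMMAS[tagged[v][0].lower()]
    # In the passive, the surface subject is the logical object.
    return PredArg(pred, other, subj) if passive else PredArg(pred, subj, other)

passive = [("The", "DT"), ("protein", "NN"), ("is", "VBZ"),
           ("activated", "VBN"), ("by", "IN"), ("it", "PRP")]
active = [("It", "PRP"), ("activates", "VBZ"), ("the", "DT"), ("protein", "NN")]

print(simple_pred_arg(passive))
print(simple_pred_arg(active))
```

Both clauses yield the same predicate with the same arguments, so a single pattern over the normalized structure matches either surface form.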
Term: Ribosomal large subunit assembly and maintenance
and in its absence, deficient 60 S ribosomes are assembled which are inactive in protein synthesis resulting in cell lethality.
Mutations that completely abolish recognition of 26 S rRNA, however, block the formation of 60S particles, demonstrating that binding of L25 to this rRNA is an essential step in the assembly of the large ribosomal subunit.
Depletion of Saccharomyces cerevisiae ribosomal protein L16 causes decrease in 60S ribosomal subunits and formation of half-mer polyribosomes.
Without L3, apparent synthesis of several 60 S subunit proteins diminished, and 60S subunit did not assemble.
A similar phenomenon occurred, when a second strain, synthesis of ribosomal protein L29 was prevented.
Language Domain / Concept Domain
Process of Ribosomal subunit assembly
A cluster of realizations of terms
Information and Knowledge Exploitation System
as an integrated management system for raw data, semi-structured data, text and structured databases
+ Mining Tools (Task Specific Software)
Text Archive with Feature Objects
Managing texts, data representation and their semantics
Text DB / DB of Feature Objects (Data Base Module)
Feature object fields: Text ID; Start Position of the region; End Position of the region; Annotator; Content
Copy and Unification; Specialization by unification
Adding more augmented information induced by inference, type restriction, unification
Data representation, Text, Semantics
Ubiquitin E is bound with
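The fields listed on the slide suggest stand-off annotation: each feature object points at a region of a stored text and carries content that can be specialized by unification. The following is a minimal sketch under strong assumptions: content is a flat attribute-value dictionary, unification fails on any conflicting value, and a specialized copy is stored as a new object. The class, function names, and example values are hypothetical; the actual system's typed feature structures are not described in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FeatureObject:
    text_id: str        # which text in the Text DB
    start: int          # start position of the region
    end: int            # end position of the region
    annotator: str      # tool or person that produced the annotation
    content: Dict[str, str] = field(default_factory=dict)

def unify(a: Dict[str, str], b: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Merge two flat feature dicts; return None if any value conflicts."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged

def specialize(obj: FeatureObject, extra: Dict[str, str],
               annotator: str) -> Optional[FeatureObject]:
    """Copy `obj` and add information by unification; the original is untouched."""
    content = unify(obj.content, extra)
    if content is None:
        return None
    return FeatureObject(obj.text_id, obj.start, obj.end, annotator, content)

# Hypothetical example: a protein mention covering characters 0..9 of text "doc-1".
base = FeatureObject("doc-1", 0, 9, "ner", {"type": "protein"})
spec = specialize(base, {"name": "ubiquitin"}, "inference")
conflict = specialize(base, {"type": "gene"}, "inference")
print(spec)
print(conflict)
```

Keeping the original object and storing the specialized copy alongside it mirrors the "Copy and Unification" step: later tools can add information without overwriting earlier annotations.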