Text Mining and Knowledge Management Junichi Tsujii GENIA Project, Kototoi Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/) Computer Science, University.


1 Text Mining and Knowledge Management
Junichi Tsujii
GENIA Project, Kototoi Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
Computer Science, University of Tokyo

2 Increase in Medline
[Figure: yearly increments (axis 0 to 600,000) and accumulation (axis 0 to 14,000,000) in Medline, plotted by year from 1964 to 2002]

3 TEXT MINING for Bio-Medicine in Japan
1. Institute for Medical Science (IMS), U-Tokyo: Information Extraction from Text for Signal Pathways
2. Japan Bio-Informatics Research Centre (JBIRC): Interpretation of Micro-Array Data
3. Research Institute for Genetics (RIG): Disease-Gene Association
4. Research Institute for Natural Science (Riken): Tool for Curators for GO Annotation

4 TEXT MINING for Bio-Medicine in Japan
1. Institute for Medical Science (IMS), U-Tokyo: Information Extraction from Text for Signal Pathways
2. Japan Bio-Informatics Research Centre (JBIRC): Interpretation of Micro-Array Data
3. Research Institute for Genetics (RIG): Disease-Gene Associations
4. Research Institute for Natural Science (Riken): Tool for Curators for GO Annotation
Resource Building for TM in BM: GENIA Project (1998 - )
GENIA Corpus (Annotated Text)
Information Exploitation System: Kototoi Project (2000 - )
Adaptable POS Tagger (Bio-Tagger), NER adapted for BM
Parser based on HPSG (Enju), ML for Text Processing

5 TEXT Mining = DATA Mining + BOW ?
BOW: “Bag of Words” Model
The model does not work because
(1) Language is a complex system
(2) Language is inherently associated with knowledge
Mining + NLP + Knowledge Management
TM products on market with fanciful visualization facilities and trend analysis tools
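
To illustrate point (1) on this slide, here is a minimal sketch of why a bag-of-words representation can lose the relation a sentence expresses. The two sentences and the `bag_of_words` helper are hypothetical examples, not taken from the presentation.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(sentence.lower().split())


# Two hypothetical sentences that state opposite relations.
s1 = "protein A activates protein B"
s2 = "protein B activates protein A"

# Both sentences produce the same token counts, so the model cannot tell
# which protein is the activator and which is the target.
same_bag = bag_of_words(s1) == bag_of_words(s2)
print(same_bag)
```

The comparison comes out equal because only token counts are kept, which is the kind of structural information an NLP component would need to recover.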

6 Effective management of knowledge and information is the key
A Huge amount of Raw Data
Unstructured Information (Text)
Semi-structured Information (XML+Text)
Structured Information (Data bases)
Ontology-based KMS, Natural Language Processing, Information Exploitation

7 Overview of GENIA System
Information Extraction Module: Identify & classify terms; Identify events
Retrieval Module: Request enhancement; Spawn request; Classify documents
Interface Module: GUI; HTML conversion; System integration
Concept Module
Corpus Module: Markup generation / compilation; Annotated corpus construction
Database Module: DB design / access / management; DB construction
BK design / construction / compilation
Other diagram labels: Raw (OCR) Text, Structure, Annotated Corpus, Document, Named-Entity, Event, Database, Ontology, Markup language, Data model, Background Knowledge, MEDLINE, Security, User, IR Request, Abstract, Full Paper

8 Non-Trivial Mappings
Language Domain: Linguistic expressions
Knowledge Domain: Concepts and Relationships among Them (motivated independently of language)
1. Size of Knowledge
2. Context Dependency
3. Evolving Nature of Science
4. Hypothetical Nature of Ontology
5. Inconsistency
Terminology, NLP, Paraphrasing

9 Non-Trivial Mappings (same diagram and list as slide 8, with the last label reading "Paraphrase")

10 Terms: address
Concepts: address-as-a-speech, address-as-a-mail-address, address-as-a-street-address
A term is introduced without an explicit understanding of what it means, so that one can make statements about it.
Semantic Web by Tim Berners-Lee et al., Scientific American (2001)

11 Language Domain, Concept Domain
A cluster of realizations of terms

12 Automatically Generated Variants
1.000  NF kappa B  128
0.500  Transcription Factor NF kappa B  0
0.429  NF-kappa B  912
0.286  NF kB, Transcription Factor  0
0.286  NF kB  0
0.286  Immunoglobulin Enhancer-Binding Protein  0
0.286  Immunoglobulin Enhancer Binding Protein  0
0.286  Enhancer-Binding Protein, Immunoglobulin  0
0.286  kappa B Enhancer Binding Protein  0
0.286  Transcription Factor NF-kB  0
0.286  Transcription Factor NF kB  0
0.286  Factor NF-kB, Transcription  0
0.286  nuclear factor kappa beta  2
0.286  NF kappaB  1
0.273  NF kappa B chain  0
0.273  NF kappa B subunit  0
0.214  Transcription Factor NF-kappa B  0
0.214  NF-kB, Transcription Factor  0
0.214  NF-kB  67
0.200  Neurofibromatosis Type kappa B  0
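
The slide does not say how the variants or their scores were produced. As a rough illustration only, the sketch below generates orthographic variants of a term by joining its tokens with different separators; the function name `spelling_variants` and the separator set are assumptions, and no scoring is attempted.

```python
from itertools import product


def spelling_variants(tokens: list[str], separators=(" ", "-", "")) -> set[str]:
    """Join the tokens with every combination of the given separators."""
    variants = set()
    # One separator choice is needed for each gap between adjacent tokens.
    for seps in product(separators, repeat=len(tokens) - 1):
        parts = [tokens[0]]
        for sep, tok in zip(seps, tokens[1:]):
            parts.append(sep + tok)
        variants.add("".join(parts))
    return variants


variants = spelling_variants(["NF", "kappa", "B"])
for v in sorted(variants):
    print(v)
```

Forms such as "NF-kappa B" and "NF kappaB" from the slide fall out of this enumeration; synonyms such as "Immunoglobulin Enhancer-Binding Protein" would need a thesaurus and are outside the scope of the sketch.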

13 Non-Trivial Mappings (same diagram and list as slide 8, with the last label reading "Paraphrase")

14 Non-trivial Mapping
Language Domain; Knowledge Domain (motivated independently of language)
Spelling Variants, Synonyms, Acronyms
Same relations with different structures
[A] protein activates [B] (Pathway extraction)
Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein])
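
The query `[sentence] > ([arg1_activate] > [protein])` can be read as "sentences that contain an arg1-of-activate region that itself contains a protein region". The sketch below assumes `>` means containment and that each annotation is a character-offset region; the `Region` type, the `containing` function, and all offsets are hypothetical, not part of the GENIA tools.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def containing(outer: list[Region], inner: list[Region]) -> list[Region]:
    """Return the regions of `outer` that contain at least one region of `inner`."""
    return [
        o for o in outer
        if any(o.start <= i.start and i.end <= o.end for i in inner)
    ]


# Hypothetical annotation layers over one text.
sentences = [Region(0, 50), Region(50, 120)]
arg1_activate = [Region(60, 90)]               # subject argument of "activate"
proteins = [Region(10, 20), Region(65, 80)]

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)
```

Only the second sentence is returned: the first sentence mentions a protein, but not in the arg1 position of "activate", which is the distinction a keyword search would miss.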

15 Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
Parse tree node labels: dt np vp vp pp np np pp vp s
Relation labels: arg1, arg2, mod

16

17 Non-Trivial Mappings (same diagram and list as slide 8, with the last label reading "Paraphrase")

18

19

20 Term: Ribosomal large subunit assembly and maintenance
and in its absence, deficient 60 S ribosomes are assembled which are inactive in protein synthesis resulting in cell lethality.
Mutations that completely abolish recognition of 26 S rRNA, however, block the formation of 60S particles, demonstrating that binding of L25 to this rRNA is an essential step in the assembly of the large ribosomal subunit.
Depletion of Saccharomyces cerevisiae ribosomal protein L16 causes decrease in 60S ribosomal subunits and formation of half-mer polyribosomes.
Without L3, apparent synthesis of several 60 S subunit proteins diminished, and 60S subunit did not assemble.
A similar phenomenon occurred when, in a second strain, synthesis of ribosomal protein L29 was prevented.

21 Language Domain, Concept Domain
Process of Ribosomal subunit assembly
A cluster of realizations of terms

22 Information and Knowledge Exploitation System as an integrated management system of raw data, semi-structured data, text and structured data base + Mining Tools (Task Specific Software)

23 Text Archive with Feature Objects
Managing texts, data representation and their semantics
Data Base Module: Text DB, DB of Feature Objects
Text ID, Start Position of the region, End Position of the region, Annotator, Content
Copy and Unification
Specialization by unification
Adding more augmented information induced by inference, type restriction, unification
Data representation, Text, Semantics
Ubiquitin E is bound with
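
One way to picture a feature object is as a stand-off annotation record that points at a region of a stored text. The sketch below assumes the five labels on the slide (Text ID, start, end, annotator, content) are the record's fields and models "specialization by unification" as a merge of nested dictionaries that fails on conflicting values. The class name, the `unify` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FeatureObject:
    text_id: str      # which text in the text DB
    start: int        # start position of the region
    end: int          # end position of the region
    annotator: str    # tool or person that produced the annotation
    content: dict[str, Any] = field(default_factory=dict)


def unify(a: dict, b: dict) -> Optional[dict]:
    """Merge two feature structures; return None if atomic values conflict."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            sub = unify(result[key], b_val)
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != b_val:
            return None
    return result


# Hypothetical annotation over the first nine characters of a text.
obj = FeatureObject("text-0001", 0, 9, "tagger", {"type": "protein"})

# Adding information specializes the object; a conflicting type fails.
specialized = unify(obj.content, {"name": "Ubiquitin"})
clash = unify(obj.content, {"type": "gene"})
print(specialized, clash)
```

Because `unify` returns a new dictionary, the original object is left untouched, which matches the "copy and unification" wording on the slide under the assumptions stated above.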

24 Overview of GENIA System (same diagram as slide 7)

