Text Mining and Knowledge Management. Junichi Tsujii, GENIA Project, Kototoi Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), Computer Science, University of Tokyo.

Presentation transcript:

Text Mining and Knowledge Management
Junichi Tsujii
GENIA Project, Kototoi Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
Computer Science, University of Tokyo

Increase in Medline
[Chart: yearly increments and cumulative accumulation of Medline records, plotted by year; accumulation axis runs from 0 to 14,000,000]

TEXT MINING for Bio-Medicine in Japan
1. Institute for Medical Science (IMS), U-Tokyo: Information Extraction from Text for Signal Pathways
2. Japan Bio-Informatics Research Centre (JBIRC): Interpretation of Micro-Array Data
3. Research Institute for Genetics (RIG): Disease-Gene Associations
4. Research Institute for Natural Science (Riken): Tool for Curators for GO Annotation
Resource Building for TM in BM: GENIA Project ( )
- GENIA Corpus (Annotated Text)
Information Exploitation System: Kototoi Project ( )
- Adaptable POS Tagger (Bio-Tagger), NER adapted for BM
- Parser based on HPSG (Enju), ML for Text Processing

Text Mining = Data Mining + BOW?
BOW: "Bag of Words" model
TM products on the market offer fanciful visualization facilities and trend analysis tools.
The model does not work because:
(1) Language is a complex system
(2) Language is inherently associated with knowledge
Mining + NLP + Knowledge Management
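To illustrate why a bag-of-words model falls short for relation-style facts, here is a minimal sketch. The tokenizer (lowercasing and whitespace splitting) and the two example sentences are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(sentence.lower().split())

# Two sentences that state opposite relations (hypothetical examples).
a = bag_of_words("PHO2 activates PHO5")
b = bag_of_words("PHO5 activates PHO2")

# The bags are identical, so the model cannot tell who activates whom.
print(a == b)  # True
```

Because the direction of the relation lives in the sentence structure, any representation that only counts words loses it, which is the motivation for adding NLP to the mining step.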

Effective management of knowledge and information is the key
A huge amount of raw data:
- Unstructured Information (Text)
- Semi-structured Information (XML+Text)
- Structured Information (Data bases)
Ontology-based KMS, Natural Language Processing, Information Exploitation

Overview of GENIA System
- Information Extraction Module: identify & classify terms; identify events
- Retrieval Module: request enhancement; spawn request; classify documents
- Interface Module: GUI; HTML conversion; system integration
- Concept Module, Corpus Module: markup generation / compilation; annotated corpus construction
- Database Module: DB design / access / management; DB construction; BK design / construction / compilation
- Other diagram labels: Raw (OCR) Text, Structure, Annotated Corpus, Document, Named-Entity, Event, Database, Ontology, Markup language, Data model, Background Knowledge, MEDLINE, Security, User, IR Request, Abstract, Full Paper

Non-Trivial Mappings
Language Domain: linguistic expressions (Terminology, NLP, Paraphrasing)
Knowledge Domain: concepts and relationships among them, motivated independently of language
1. Size of Knowledge
2. Context Dependency
3. Evolving Nature of Science
4. Hypothetical Nature of Ontology
5. Inconsistency

Terms and Concepts
Term: address
Concepts: address-as-a-speech, address-as-a-mail-address, address-as-a-street-address
A term is introduced, without an explicit understanding of what it means, in order for one to make statements on it.
Semantic Web, by Tim Berners-Lee et al., Scientific American (2001)

Language Domain / Concept Domain
A cluster of realizations of terms

Automatically Generated Variants (score axis from 1.000 down to 0)
- NF kappa B
- Transcription Factor NF kappa B
- NF-kappa B
- NF kB, Transcription Factor
- NF kB
- Immunoglobulin Enhancer-Binding Protein
- Immunoglobulin Enhancer Binding Protein
- Enhancer-Binding Protein, Immunoglobulin
- kappa B Enhancer Binding Protein
- Transcription Factor NF-kB
- Transcription Factor NF kB
- Factor NF-kB, Transcription
- nuclear factor kappa beta
- NF kappaB
- NF kappa B chain
- NF kappa B subunit
- Transcription Factor NF-kappa B
- NF-kB, Transcription Factor
- NF-kB
- Neurofibromatosis Type kappa B
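The slide does not say how the variants were generated or scored. As a rough sketch of the general idea, the snippet below applies three hand-written rewrite rules (abbreviation, hyphenation, and comma inversion) to a single term. The rules and the function name `generate_variants` are assumptions for illustration only, not the method used in the GENIA or Kototoi projects.

```python
def generate_variants(term: str) -> set[str]:
    """Expand a term with a few illustrative rewrite rules (hypothetical rules)."""
    variants = {term}
    # Rule 1: abbreviate "kappa B" as "kB".
    variants |= {v.replace("kappa B", "kB") for v in variants}
    # Rule 2: hyphenate "NF" with the token that follows it.
    variants |= {v.replace("NF ", "NF-") for v in variants}
    # Rule 3: comma inversion, "Transcription Factor X" -> "X, Transcription Factor".
    head = "Transcription Factor "
    variants |= {f"{v[len(head):]}, {head.strip()}"
                 for v in variants if v.startswith(head)}
    return variants

for v in sorted(generate_variants("Transcription Factor NF kappa B")):
    print(v)
```

Several of the strings produced this way, such as "NF-kB, Transcription Factor" and "Transcription Factor NF-kB", also appear in the slide's list. A real system would additionally need a score for each variant, which this sketch omits.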

Non-trivial Mapping
Language Domain / Knowledge Domain (motivated independently of language)
- Spelling variants, synonyms, acronyms
- Same relations with different structures
Pattern: [A] protein activates [B] (pathway extraction)
Example sentences:
- Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
- Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
- Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein])
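The query `[sentence] > ([arg1_activate] > [protein])` can be read as "sentences containing an arg1-of-activate region that itself contains a protein region". The sketch below assumes that `>` means containment, as is common in region algebras; the slide does not define the operator. The region representation, the function name `containing`, and the offsets are all hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive

def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Assumed '>' operator: regions of `outer` that contain at least one region of `inner`."""
    return [o for o in outer
            if any(o[0] <= i[0] and i[1] <= o[1] for i in inner)]

# Hypothetical annotations over one text.
sentences = [(0, 50), (51, 120)]
arg1_activate = [(60, 85)]        # span marked as the first argument of "activate"
proteins = [(10, 20), (70, 85)]   # spans marked as protein names

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentences, containing(arg1_activate, proteins))
print(hits)  # [(51, 120)]
```

The first sentence mentions a protein but not as the first argument of "activate", so it is not retrieved. This is the distinction that keyword matching alone cannot make.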

Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
Sentence: The protein is activated by it
POS tags: The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
[Parse tree figure: phrase nodes dt, np, vp, pp, s; dependency labels arg1, arg2, mod]
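A predicate-argument structure is useful because active and passive sentences reduce to the same relation. The sketch below is not Enju's output format or API; it is a hand-built toy that assumes arg1 is the logical subject and arg2 the logical object. The class `PredArg` and the function `normalize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PredArg:
    predicate: str
    arg1: Optional[str]  # logical subject (assumed reading of arg1)
    arg2: Optional[str]  # logical object (assumed reading of arg2)

def normalize(verb: str, subject: str, voice: str,
              obj: Optional[str] = None,
              by_agent: Optional[str] = None) -> PredArg:
    """Map a toy surface analysis to a voice-independent structure."""
    if voice == "passive":
        # The surface subject is the logical object; the by-phrase holds the agent.
        return PredArg(verb, by_agent, subject)
    return PredArg(verb, subject, obj)

# "The protein is activated by it" versus "It activates the protein"
passive = normalize("activate", subject="protein", voice="passive", by_agent="it")
active = normalize("activate", subject="it", voice="active", obj="protein")
print(passive == active)  # True
```

Once both surface forms map to one structure, a single extraction pattern over arg1 and arg2 covers them, instead of one pattern per syntactic variation.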

Term: Ribosomal large subunit assembly and maintenance
Example sentences:
- ... and in its absence, deficient 60S ribosomes are assembled which are inactive in protein synthesis, resulting in cell lethality.
- Mutations that completely abolish recognition of 26S rRNA, however, block the formation of 60S particles, demonstrating that binding of L25 to this rRNA is an essential step in the assembly of the large ribosomal subunit.
- Depletion of Saccharomyces cerevisiae ribosomal protein L16 causes decrease in 60S ribosomal subunits and formation of half-mer polyribosomes.
- Without L3, apparent synthesis of several 60S subunit proteins diminished, and 60S subunit did not assemble.
- A similar phenomenon occurred when, in a second strain, synthesis of ribosomal protein L29 was prevented.

Language Domain / Concept Domain
Concept: Process of Ribosomal subunit assembly
A cluster of realizations of terms

Information and Knowledge Exploitation System as an integrated management system of raw data, semi-structured data, text and structured data base + Mining Tools (Task Specific Software)

Text Archive with Feature Objects
Managing texts, data representation and their semantics
Feature object fields:
- Text ID
- Start position of the region
- End position of the region
- Annotator
- Content
Data Base Module: Text DB; DB of Feature Objects
Operations:
- Copy and unification
- Specialization by unification
- Adding more augmented information induced by inference, type restriction, unification
Diagram labels: Data representation, Text, Semantics
Example text: Ubiquitin E is bound with
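The fields listed on this slide suggest stand-off annotation: each feature object points at a text region and carries structured content that can be specialized by unification. The sketch below is one possible reading, not the Kototoi implementation. The class name, the dictionary-based feature structures, and the example values are assumptions, and the unifier ignores type hierarchies and structure sharing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class FeatureObject:
    text_id: str      # which text the region belongs to
    start: int        # start position of the region
    end: int          # end position of the region
    annotator: str    # tool or person that produced the annotation
    content: Dict[str, Any] = field(default_factory=dict)

def unify(a: Dict[str, Any], b: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Merge two feature structures; return None if atomic values conflict."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            merged = unify(result[key], b_val)
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != b_val:
            return None
    return result

# Hypothetical annotation of the span "Ubiquitin" in a document.
ner = FeatureObject("doc1", 0, 9, "NER", {"type": "protein", "name": "Ubiquitin"})
extra = {"type": "protein", "role": {"event": "binding"}}

print(unify(ner.content, extra))                    # specialized structure
print(unify({"type": "protein"}, {"type": "gene"}))  # None (conflict)
```

Keeping annotations separate from the text lets several annotators describe the same region, with unification adding information only when the descriptions are compatible.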
