NLP for Biomedicine - Ontology building and Text Mining - Junichi Tsujii GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/) Computer Science Graduate.

Presentation transcript:

NLP for Biomedicine - Ontology building and Text Mining - Junichi Tsujii, GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), Computer Science, Graduate School of Information Science and Technology, University of Tokyo, JAPAN

My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
4.1 Terms and NE
4.2 Resource Building
4.3 Event Recognition
5. Concluding Remarks

Why NLP in Biomedicine?
–From Biology and Medical Sciences
–From Natural Language Processing

Genome sequencing (by D. Devos).

Sequence, structure and function (diagram labels: Sequence, Structure, Function; Information Exploitation)

Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or implication of proteins with previously characterized functions in a separate process. The use of available information (published papers, etc.) is a key step for the discovery process, since in many cases weak or indirect evidences about possible relations hidden in the literature are used to substantiate working hypothesis that are experimentally explored. [C.Blaschke, A.Valencia: 2001]

Revolution in LT in the last decade Information Knowledge Language Texts Grammar Syntax-Semantic Mapping Interpretation based on Knowledge Machine Learning Knowledge Acquisition Statistical Biases Huge Ontology: Next Revolution ? Bio-Medical Application: UMLS, Gene Ontology, etc.

What can we do with NLP in biomedical domains? Examples

Protein-Protein Interaction extracted from texts by C. Blaschke

Organized Knowledge through terms by C. Blaschke

From Data to Understanding : Interpretation by Language Oliveros, Blaschke et al., GIW 2000

Information Extraction from Texts; Question Answering (QA) Systems

Characteristics of Signal Pathway (1): Granularity of Knowledge Units
Different types of entities which are interrelated with each other:
–Cells, sub-locations of cells
–Proteins, substructures of proteins, subclasses of proteins
–Ions, other chemical substances
–Genes, RNA, DNA
Figure: G-protein coupled receptor pathway model (from TRANSPATH)

CSNDB (National Institute of Health Sciences): a data- and knowledge-base for signaling pathways of human cells.
–It compiles information on biological molecules, sequences, structures, functions, and the biological reactions which transfer cellular signals.
–Signaling pathways are compiled as binary relationships of biomolecules and represented by automatically drawn graphs.
–CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
–The final goal is a computerized model of various biological phenomena.

Example 1. A Standard Reaction
Signal_Reaction: “EGF receptor → Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction “SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]

Example 3. A Polymerization Reaction
Signal_Reaction: “Ah receptor + HSP90  ”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction “PAS domain of Ah receptor”
Activity “inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
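
To make the "binary relationships represented by graphs" idea concrete, here is a minimal Python sketch that stores the standard reaction from Example 1 as a record and derives an adjacency list from it. The class name `SignalReaction`, its field names and `build_graph` are hypothetical, chosen only to mirror the slots shown on the slide; this is not CSNDB's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalReaction:
    """One binary signaling relation (hypothetical mirror of the slide's slots)."""
    from_molecule: str
    to_molecule: str
    effect: str
    tissue: str = ""
    interaction: str = ""
    reference: str = ""


def build_graph(reactions):
    """Turn binary relations into an adjacency list: source -> [(target, effect)]."""
    graph = defaultdict(list)
    for reaction in reactions:
        graph[reaction.from_molecule].append((reaction.to_molecule, reaction.effect))
    return dict(graph)


# The standard reaction from Example 1, copied from the slide.
reactions = [
    SignalReaction(
        from_molecule="EGF receptor",
        to_molecule="Grb2",
        effect="activation",
        tissue="liver",
        interaction="SH2+phosphorylated Tyr",
        reference="Yamauchi_1997",
    ),
]
graph = build_graph(reactions)
```

A graph drawn from such records needs only the two endpoints and the effect; the remaining slots stay attached to the record as evidence.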

Theories in Science; Observed Data; Observable / Non-Observable; Data Mining

Objects of Science Knowledge In Mind Non-Observable Descriptions Of Knowledge Observable Observed Data Quantitative Data Mathematical Formula Qualitative, Structures, Classification Ontology Texts

Objects Of Science Knowledge In Mind Non-Observable Descriptions Of Knowledge Observable Natural Language Incomplete System Diversity Ambiguity

Objects of Science Knowledge In Mind Non-Observable Observable Observed Data Quantitative Data Mathematical Formula Qualitative, Structures, Classification Ontology Texts Descriptions Of Knowledge Data Mining + Text Mining

Knowledge in Mind; Descriptions of Knowledge; Observable / Non-Observable; Characteristics of Language; Text Mining; Objects of Science; Data Mining; Characteristics of Knowledge

Terms are the basic units of knowledge
–Classification, Features
–NE Recognition
–Event Recognition
–Semantic Disambiguation

Task difficulties in molecular biology
–Inconsistent naming conventions, e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2; NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …
–Wide-spread synonymy: many synonyms in wide usage, e.g. PKB and Akt; cyclin-dependent kinase inhibitor p27, p27kip1
–Open, growing vocabulary for many classes
–Cross-over of names between classes depending on context, e.g. Protein vs DNA
–Frequent use of coordination inside term formations
Diversity: Linking Problem, Lexicon, Static Processing
Ambiguity: Term Recognition, Context Dependent, Dynamic Processing

Ambiguity: Abbreviation Extraction (Schwartz 2003)
–Extracts short and long form pairs
Short form: AA
Long forms: Alcoholic Anonymous, American, Americans, Arachidonic acid, arachidonic acid, amino acid, amino acids, anaemia, anemia, …
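
Short and long form pairs such as "AA / arachidonic acid" can be found with a right-to-left character match between a parenthesized short form and the words before it. The Python sketch below is loosely modeled on that idea and should be read as a simplified illustration, not as the cited system: the function names, the length limits and the single "long form (SF)" pattern are assumptions made for this example.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left inside the candidate text.

    Returns the shortest suffix of `candidate` that covers every character of
    the short form, or None. The first short-form character must start a word.
    """
    pos = len(candidate) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches (and, for the first
        # short-form character, sits at the beginning of a word).
        while (pos >= 0 and candidate[pos].lower() != ch) or (
            s == 0 and pos > 0 and candidate[pos - 1].isalnum()
        ):
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
    start = candidate.rfind(" ", 0, pos + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Collect (short form, long form) pairs from 'long form (SF)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not 2 <= len(short_form) <= 10 or not short_form[0].isalnum():
            continue
        words = sentence[: match.start()].split()
        # Assumed window: only a few words before the parenthesis are searched.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


pairs = extract_pairs("Release of arachidonic acid (AA) was measured.")
```

Run over many abstracts, the same short form collects several long forms, which is exactly the ambiguity the slide lists for "AA".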

Experiment [Tsuruoka et al., SIGIR 03]
Corpus
–MEDLINE: the largest collection of abstracts in the biomedical domain
Rule learning
–83,142 abstracts
–Obtained rules: 14,158
Evaluation
–18,930 abstracts
–Count the occurrences of each generated variant.

Results: “NF-kappa B” (columns: generation probability, generated variant, frequency)
–NF-kappa B (input, generation probability 1.0)
–NF-kappaB
–nF-kappa B
–Nf-kappa B
–NF kappa B
–NF-kappa b (frequency 0)
–…

Results: “antiinflammatory effect” (columns: generation probability, generated variant, frequency)
–antiinflammatory effect (input, generation probability 1.0)
–anti-inflammatory effect
–antiinflammatory effects
–Antiinflammatory effect
–antiinflammatory-effect
–anti-inflammatory effects (frequency 23)
–…

Results: “tumour necrosis factor alpha” (columns: generation probability, generated variant, frequency)
–tumour necrosis factor alpha (input, generation probability 1.0)
–tumor necrosis factor alpha
–tumour necrosis factor-alpha
–Tumour necrosis factor alpha
–tumor necrosis factor alpha
–Tumor necrosis factor alpha (frequency 8)
–…
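
The result slides above pair each generated variant with a generation probability. One way such a generator could work is sketched below in Python: string rewrite rules carry a probability, and a variant's probability is the product of the rules applied to reach it. The three rules and their probabilities are invented for illustration (the experiment learned 14,158 rules from MEDLINE), and `generate_variants` is a hypothetical name, not the authors' implementation.

```python
import heapq

# Hypothetical rewrite rules: (substring, replacement, probability).
RULES = [
    ("-", " ", 0.3),
    (" ", "", 0.2),
    ("tumour", "tumor", 0.5),
]


def generate_variants(term, rules, min_prob=0.05):
    """Best-first expansion of a term into spelling variants.

    Each rule application multiplies the variant's probability by the rule's
    probability; variants below `min_prob` are pruned, which also guarantees
    termination because every rule probability is below 1.
    """
    best = {term: 1.0}
    queue = [(-1.0, term)]
    while queue:
        neg_prob, current = heapq.heappop(queue)
        prob = -neg_prob
        if prob < best.get(current, 0.0):
            continue  # a better derivation of this string was already expanded
        for source, target, rule_prob in rules:
            start = current.find(source)
            while start != -1:
                variant = current[:start] + target + current[start + len(source):]
                new_prob = prob * rule_prob
                if new_prob >= min_prob and new_prob > best.get(variant, 0.0):
                    best[variant] = new_prob
                    heapq.heappush(queue, (-new_prob, variant))
                start = current.find(source, start + 1)
    return sorted(best.items(), key=lambda item: -item[1])


variants = generate_variants("NF-kappa B", RULES)
```

The input term always comes first with probability 1.0, matching the "(Input)" row of the result tables; counting each variant in held-out abstracts would then give the frequency column.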

Genia Ontology: Substance
+-substance-+-compound-+-organic-+-nucleic_acid-+-poly_nucleotides
            |          |         |              +-nucleotide
            |          |         |              +-DNA
            |          |         |              +-RNA
            |          |         +-amino_acid-+-peptide
            |          |         |            +-amino_acid_monomer
            |          |         |            +-protein
            |          |         +-lipid
            |          |         +-carbohydrate
            |          |         +-other_organic_compounds
            |          +-inorganic
            +-atom

Genia Ontology: Source
+-source-+-natural-+-organism-+-multi_cell
         |         |          +-mono_cell
         |         |          +-virus
         |         +-body_part
         |         +-tissue
         |         +-cell_type
         +-artificial-+-cell_line
         |            +-other_artificial_sources
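
A class hierarchy like this is easy to handle as a nested mapping. The Python sketch below encodes the source branch as nested dictionaries and looks up the path from the root to a class; the variable `GENIA_SOURCE` and the helper `path_to` are hypothetical names introduced for this example, not part of any GENIA distribution.

```python
# The "source" branch of the ontology as nested dicts (leaf = empty dict).
GENIA_SOURCE = {
    "source": {
        "natural": {
            "organism": {"multi_cell": {}, "mono_cell": {}, "virus": {}},
            "body_part": {},
            "tissue": {},
            "cell_type": {},
        },
        "artificial": {"cell_line": {}, "other_artificial_sources": {}},
    }
}


def path_to(tree, target, prefix=()):
    """Depth-first search returning the tuple of class names from root to target."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            return path
        found = path_to(children, target, path)
        if found:
            return found
    return None


virus_path = path_to(GENIA_SOURCE, "virus")
```

Knowing the full path lets an annotation tagged with a leaf class also be counted under each of its ancestors.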

Number of Tagged Objects
Texts: 2,500 MEDLINE abstracts
–Papers on transcription factors in human blood cells
–550,000 words, 20,000 sentences
Tagged objects: 147,000
–Protein: ~77,000
–DNA: ~24,000
–RNA: ~2,400
–Source: ~27,000
–Other: ~37,000

Distributions of Semantic Classes

Extension of GENIA Ontology
Small classes (to be embedded in UMLS)
–5242 terms labelled with ‘other_names’ class
Events, Biological reactions: 3800
Disease: 636
–Names of Diseases: 501
–Treatments: 61
–Diagnoses: 52
–Pathology: 3
–Others: 39
Experiments: 578
–Methods: 493
–Materials: 25
–Others: 60
Others: 228

Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)
Recognize “names” in the text
–Technical terms expressing proteins, genes, cells, etc.
–Identify and classify
Example: Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.
Classes marked in the example: DNA, PROTEIN, DNA, CELLTYPE

NE Task as Classification
Classify each word to a class (tag) representing the semantic class and the position in the term
–The task is reduced to a tagging task
–We can use methods developed for tagging
–The structure is encoded in a tag
BIO (Begin, Inside, and Other) tagging
–The first word of a term of class X is tagged B-X and the following words of the same term are tagged I-X
–Words outside any term are tagged O (OTHER)
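
The BIO encoding can be shown in a few lines of Python. The sketch converts term spans to one tag per word and back, using the phrase "activity of class II promoters in B cells" from the example sentence; the function names and the (start, end, label) span format are assumptions made for this illustration.

```python
def spans_to_bio(tokens, spans):
    """Encode (start, end, label) term spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags):
    """Decode BIO tags back into (start, end, label) spans."""
    spans = []
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final term
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["activity", "of", "class", "II", "promoters", "in", "B", "cells"]
spans = [(2, 5, "DNA"), (6, 8, "CELLTYPE")]
tags = spans_to_bio(tokens, spans)
```

Because the term boundaries are recoverable from the tags alone, any word-level tagger can be used for term recognition.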

NE Tagging Illustrated
Classify a word depending on the context: context → conversion to features → classifier
Words: activity of class II promoters in
POS tags: N P N Sym Ns P
BIO tags: O O B-DNA I-DNA
Deterministic tagging:
–Only the most probable tag at each word (SVM)
Viterbi tagging:
–The most probable sequence among all (probabilistic models)
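
The difference between the two decoding strategies can be seen on a toy example. In the Python sketch below, every number is invented: per-word tag probabilities for the three words "class II promoters", and a transition table whose only constraint is that I-DNA may not follow O. Picking the best tag per word yields an inconsistent sequence, while Viterbi decoding finds the best sequence as a whole. The function names are hypothetical.

```python
import math

TAGS = ["B-DNA", "I-DNA", "O"]

# Invented per-word tag probabilities for "class II promoters".
EMISSIONS = [
    {"B-DNA": math.log(0.4), "I-DNA": math.log(0.1), "O": math.log(0.5)},
    {"B-DNA": math.log(0.1), "I-DNA": math.log(0.7), "O": math.log(0.2)},
    {"B-DNA": math.log(0.1), "I-DNA": math.log(0.6), "O": math.log(0.3)},
]

# Log transition scores: everything allowed except O -> I-DNA.
TRANSITIONS = {(p, c): 0.0 for p in TAGS for c in TAGS}
TRANSITIONS[("O", "I-DNA")] = float("-inf")


def greedy(emissions, tags):
    """Deterministic tagging: the most probable tag at each word."""
    return [max(tags, key=lambda t: scores[t]) for scores in emissions]


def viterbi(emissions, transitions, tags):
    """Viterbi tagging: the most probable tag sequence among all."""
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            column[cur] = best[-1][prev] + transitions[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best.append(column)
        back.append(pointers)
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


greedy_tags = greedy(EMISSIONS, TAGS)
viterbi_tags = viterbi(EMISSIONS, TRANSITIONS, TAGS)
```

Here the greedy result starts with O followed by I-DNA, which does not describe any term, whereas the Viterbi result is B-DNA I-DNA I-DNA.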

The GENIA Corpus [Tateishi HLT02, Ohta PSB00, ISMB02]
–Annotated MEDLINE abstracts
–A gold standard for biomedical NLP tasks
–# of abstracts: 670
–# of sentences, # of tokens (words), # of named entities, # of semantic classes: 5, ,216 23, 
–,000-abstract version soon
–Big enough to make SVM usage nontrivial
–Small enough to make sparseness serious

The ME Method
Maximum Entropy model built from feature functions f_i, each with a weight
–Feature function: defined on the context, the target term and the tag; same as the features in SVMs
–The Viterbi algorithm is used for tagging
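
For reference, a conditional maximum entropy model is usually written as below. The slide's own formula did not survive in this transcript, so the notation here is an assumption: t is a tag, c is the context of the target word, f_i is a feature function and λ_i is its weight.

```latex
P(t \mid c) = \frac{1}{Z(c)} \exp\Big( \sum_i \lambda_i f_i(c, t) \Big),
\qquad
Z(c) = \sum_{t'} \exp\Big( \sum_i \lambda_i f_i(c, t') \Big)
```

Since the model yields a probability for each tag in each context, the Viterbi algorithm can combine these per-word distributions into the most probable tag sequence.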

SOHMM modeling (J. Kim et al., ACL03)
SOHMM modeling
–No assumption is made arbitrarily.
–Instead, a context classification function is induced from a corpus.
SOHMM learning
–Inducing the context classification function
–Estimating parameters
Context: a set of contextual feature values which are visible at the moment of predicting.
Context classification function: a function from sets of contextual feature values to context patterns grouped appropriately.

Experimental Results
Two tables, one for biological source recognition and one for biological substance recognition. Each reports precision, recall and F-score for four matching methods: hard matching, soft matching left, soft matching right, soft matching either.

Event Recognition Identity of events in our mind Disambiguation of different events by context

Problem: Syntactic Variations
–RAF6 activates NF-kappaB.
–Lck is activated by autophosphorylation at Tyr 394.
–Anandamide induces vasodilation by activating vanilloid receptors.
–the activation of Rap1 by C3G
–the GTPase-activating protein rhoGAP
–the stress-activated group of MAP kinases
Common frame: ACTIVATOR activate ACTIVATEE
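
To illustrate what mapping surface variation onto one frame means, the Python sketch below covers three of the six constructions with hand-written regular expressions. This is a deliberately naive stand-in, assumed only for this example; the pattern list and `extract_activation` are hypothetical names, and the remaining constructions show why plain surface patterns are not enough.

```python
import re

# One surface pattern per construction; all map to the same two roles.
PATTERNS = [
    re.compile(r"^(?P<activator>.+?) activates (?P<activatee>.+?)\.?$"),
    re.compile(r"^(?P<activatee>.+?) is activated by (?P<activator>.+?)\.?$"),
    re.compile(r"^the activation of (?P<activatee>.+?) by (?P<activator>.+?)\.?$"),
]


def extract_activation(phrase):
    """Return the ACTIVATOR/ACTIVATEE roles, or None if no pattern applies."""
    text = phrase.strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return {
                "activator": match.group("activator"),
                "activatee": match.group("activatee"),
            }
    return None


active = extract_activation("RAF6 activates NF-kappaB.")
passive = extract_activation("Lck is activated by autophosphorylation at Tyr 394.")
nominal = extract_activation("the activation of Rap1 by C3G")
unmatched = extract_activation("the GTPase-activating protein rhoGAP")
```

The active, passive and nominalized forms all land in the same frame, but compound modifiers such as "GTPase-activating" fall through.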

Verbs Related to Biological Events Frequent Verbs in 100 MEDLINE Abstracts

Argument Frame Extractor
133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences
–Extracted uniquely
–Extracted with ambiguity
–Parsing failures
–Extractable from PPs
–Not extractable: 27
–Memory limitation, etc.: 17
68%

Revolution in LT in the last decade Information Knowledge Language Texts Grammar Syntax-Semantic Mapping Interpretation based on Knowledge Machine Learning Knowledge Acquisition Statistical Biases Huge Ontology: Next Revolution ? Bio-Medical Application: UMLS, Gene Ontology, etc.

Genome sequencing (by D. Devos). Actual demands in the real world, with more homogeneous user groups and more concrete criteria for evaluating results

Resources available
–MEDLINE abstracts (4,000; about 1 million words)
–GENIA ontology
–POS tags
–Semantic tags
–Structural tags
–Co-reference annotations (with a Singaporean team)
–Lexical resources mapped to existing ontology