Application of the NLP techniques to IE and IR CREST.

Presentation transcript:

Outline
Background
Building NLP resources
  GENIA
Extracting Disease-Gene Associations from MEDLINE
  H-Invitational
  Extracting DGAs by machine learning
An IR system for predicate-argument relations
  MEDUSA

Application to the biomedical domain
Plenty of text
  MEDLINE database: 12 million abstracts
  Need for effective IE and IR
Domain knowledge
  Gene Ontology, KEGG, UMLS, ICD, …
Other information sources
  A variety of molecular databases
  DNA sequences, motifs, diseases, molecular interactions, etc.

Developing NLP resources
Resources for NLP research
  Domain knowledge
  Training data for ML-based techniques
  Test data for evaluating the transferability of a system
We are now developing…
  GENIA
    Ontology
    Corpus

GENIA corpus
4,000 MEDLINE abstracts
  Selected by MeSH terms (Human, Blood cells, Transcription factors)
  XML format
Contents
  Named entities (Kim et al., 2003)
  Part-of-speech (Tateisi et al., 2004)
  Parse trees
  Co-reference (Institute for Infocomm Research, Singapore)

GENIA named-entity corpus
Terms are annotated based on the semantic classes in the GENIA ontology
Size
  2,000 abstracts
  Number of terms: 92,723
  Vocabulary size: 36,568
Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes …
  Semantic classes marked in the example: DNA, virus, cell_type
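As a rough illustration of how inline term annotations of this kind can be read back, here is a minimal sketch. The element name `term`, the attribute `sem`, and the assignment of the three classes to spans of the example sentence are assumptions made for illustration; they are not taken from the actual GENIA XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline markup: the element name "term" and the attribute "sem"
# are illustrative assumptions, not the real GENIA schema. The span-to-class
# assignment below is likewise an assumed reading of the slide.
SNIPPET = (
    '<sentence>The <term sem="DNA">peri-kappa B site</term> mediates '
    '<term sem="virus">human immunodeficiency virus type 2</term> '
    'enhancer activation in <term sem="cell_type">monocytes</term></sentence>'
)


def extract_terms(xml_text):
    """Return (surface string, semantic class) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [("".join(t.itertext()), t.get("sem")) for t in root.iter("term")]


def class_counts(terms):
    """Count how many annotated terms fall into each semantic class."""
    counts = {}
    for _, sem in terms:
        counts[sem] = counts.get(sem, 0) + 1
    return counts


terms = extract_terms(SNIPPET)
counts = class_counts(terms)
```

Statistics such as the number of terms and the vocabulary size quoted on the slide could be gathered in the same way over a whole corpus.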

GENIA part-of-speech corpus
Each token is annotated with its part-of-speech tag.
Size
  2,000 abstracts
  20,544 sentences
  501,054 words (about half the size of the Penn Treebank)
Example: The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

GENIA treebank
Based on the standard of the Penn Treebank
Size
  200 abstracts (1,500 abstracts by the end of this fiscal year)
Example: CD3-epsilon expression is controlled by a downstream T lymphocyte-specific enhancer element
  Phrase labels shown in the example tree: NP, ADJP, NP, PP, VP, S

GENIA corpus
Used in more than 240 institutions
  Japan (28), Asia (54), North America (63), Europe (62), etc.
De facto standard for evaluating biomedical named-entity recognition systems
  BioNLP workshop at Coling 2004: named-entity recognition shared task
  Institute for Infocomm Research (Singapore), Stanford University (USA), University of Edinburgh (UK), University of Wisconsin-Madison (USA), Pohang University of Science and Technology (Korea), University of Alberta (Canada), University of Duisburg-Essen (Germany), Korea University (Korea), National Taiwan University (Taiwan), …

H-Invitational Disease Edition
Disease group, JBIRC (June 25, 2004)
[Workflow diagram. Box labels: Select specific disease; Specific disease; Known disease gene; Genomic region of interest (GROI); SNPs (1) public, (2) private; Gene expression (1) public, (2) private; AND/OR; H-InvDB; Other DB; Literature (PubMed); Dictionary; Text-mining; Scoring system (PANDA); List of genes; Genes with high score; Synthetic analysis; Final result]

Disease-Gene Associations extracted from MEDLINE
  DGA explorer (demo)

Text
  1.5 million MEDLINE abstracts
  Selected by MeSH terms: Disease Category AND (Amino Acids, Peptides, and Proteins OR Genetic Structures)
Parsing
  All the sentences were parsed by the HPSG parser
  Using a PC cluster (100 processors with GXP)
  Time: 10 days
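The boolean MeSH selection above can be sketched as a simple predicate over the categories attached to each abstract. The category names follow the slide; the record layout (a dict with a `mesh_categories` set) and the sample records are hypothetical stand-ins, not the format of real MEDLINE records.

```python
# Category names are copied from the slide. The record layout and the sample
# records are hypothetical and exist only to exercise the filter.
DISEASE = "Disease Category"
PROTEIN = "Amino Acids, Peptides, and Proteins"
GENETIC = "Genetic Structures"


def is_selected(categories):
    """Disease Category AND (Amino Acids, Peptides, and Proteins OR Genetic Structures)."""
    cats = set(categories)
    return DISEASE in cats and (PROTEIN in cats or GENETIC in cats)


def select_abstracts(records):
    """Keep only the records whose MeSH categories satisfy the query."""
    return [r for r in records if is_selected(r["mesh_categories"])]


records = [
    {"pmid": "A", "mesh_categories": {DISEASE, PROTEIN}},
    {"pmid": "B", "mesh_categories": {DISEASE}},
    {"pmid": "C", "mesh_categories": {GENETIC}},
    {"pmid": "D", "mesh_categories": {DISEASE, GENETIC}},
]
selected = [r["pmid"] for r in select_abstracts(records)]
```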

Disease-Gene Associations in texts
  These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.
  Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.

Training data
  All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.
  Dominant radial drusen and Arg345Trp EFEMP1 mutation.
  The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.
  These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.
All co-occurrences are classified as relevant or irrelevant by a domain expert.

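Maximum entropy learning
Log-linear model
  $p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, where $f_i$ is a feature function and $\lambda_i$ is its weight
Features
  Bag-of-words
  Local context
  Gene/disease name
  Predicate-argument structures
  …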
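To make the log-linear form concrete, the sketch below computes the conditional probability of each label as the exponentiated weighted sum of feature functions, normalized over all labels. The two feature functions and their weights are made up for illustration; in the actual system the weights would be estimated from the expert-labeled training data.

```python
import math


def log_linear_prob(x, labels, feature_fns, weights):
    """Return p(y | x) for every label y under a log-linear model.

    score(y) = sum_i w_i * f_i(x, y); probabilities are exp(score) / Z(x).
    """
    scores = {
        y: sum(w * f(x, y) for f, w in zip(feature_fns, weights)) for y in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: v / z for y, v in exp_scores.items()}


# Hypothetical binary feature functions over a bag of lowercased words.
def f_mutation(x, y):
    return 1.0 if "mutation" in x["words"] and y == "relevant" else 0.0


def f_survival(x, y):
    return 1.0 if "survival" in x["words"] and y == "irrelevant" else 0.0


LABELS = ["relevant", "irrelevant"]
FEATURES = [f_mutation, f_survival]
WEIGHTS = [1.5, 2.0]  # made-up values, not learned weights

x = {"words": {"dominant", "radial", "drusen", "arg345trp", "efemp1", "mutation"}}
probs = log_linear_prob(x, LABELS, FEATURES, WEIGHTS)
```

With only the first feature firing, the model reduces to a logistic function of its weight, which is an easy sanity check on the normalization.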

Features of predicate-argument structures (1)
  Dedifferentiation of adenoid cystic carcinoma: report of a case implicating p53 gene mutation.
  [Diagram labels: X, ARG2, gene/disease]

Features of predicate-argument structures (2)
  These results suggested that targeted disruption of Cyp19 caused anovulation and precocious depletion of ovarian follicles.
  Furthermore, AML cells with methylated p15(INK4B) tended to express higher levels of DNMT1 and 3B.
  [Diagram labels: X, disease/gene, ARG2, ARG1, gene/disease]
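One possible reading of the second pattern is that a predicate X links the gene and the disease through two different argument slots. The sketch below turns such a configuration into a feature string. The relation records are hand-written for the first example sentence and are not actual HPSG parser output; the record layout and the feature-string format are assumptions.

```python
def pas_features(relations, gene, disease):
    """Emit one feature per predicate that links the gene and the disease
    through two different argument slots."""
    feats = []
    for rel in relations:
        roles = {}
        for label, text in rel["args"].items():
            if gene in text:
                roles["GENE"] = label
            if disease in text:
                roles["DISEASE"] = label
        # Skip predicates where both names sit inside the same argument.
        if len(roles) == 2 and roles["GENE"] != roles["DISEASE"]:
            feats.append(
                "pred={}|{}=GENE|{}=DISEASE".format(
                    rel["predicate"], roles["GENE"], roles["DISEASE"]
                )
            )
    return feats


# Hand-written, hypothetical predicate-argument records for the sentence
# "... targeted disruption of Cyp19 caused anovulation ...".
relations = [
    {
        "predicate": "suggest",
        "args": {
            "ARG1": "These results",
            "ARG2": "that targeted disruption of Cyp19 caused anovulation",
        },
    },
    {
        "predicate": "cause",
        "args": {
            "ARG1": "targeted disruption of Cyp19",
            "ARG2": "anovulation and precocious depletion of ovarian follicles",
        },
    },
]
features = pas_features(relations, gene="Cyp19", disease="anovulation")
```

The outer verb "suggest" is ignored here because both names fall inside its single clausal argument, so only "cause" yields a feature.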

Extraction accuracy
Training/test data: 2,253 sentences
10-fold cross validation
Table columns: features, recall, precision, f-score
Table rows (features): N/A, bag of words, local context, predicate-argument structures
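The evaluation setup named above can be sketched as follows: precision, recall, and f-score for the "relevant" class, plus a split of the 2,253 sentences into 10 folds. The contiguous, unshuffled fold layout is an assumption; the slide does not say how the folds were formed.

```python
def precision_recall_f(gold, predicted, positive="relevant"):
    """Precision, recall and F-score (harmonic mean) for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


gold = ["relevant", "relevant", "irrelevant", "irrelevant"]
pred = ["relevant", "irrelevant", "relevant", "irrelevant"]
scores = precision_recall_f(gold, pred)
folds = list(k_fold_indices(2253, k=10))
```

Each sentence appears in exactly one test fold, so the per-fold scores can be averaged to give one row of the table.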

MEDUSA: An IR system for predicate-argument structures
Example: search for sentences in which the subject of the verb "activate" is a protein.
  Simple: Since the PHO2 Asp-230 mutant mimics Ser-230-phosphorylated PHO2, we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
  With a relative pronoun: Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
  Coordination: Full-length Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to localize the RNA to the posterior.
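The point of the three examples is that a single query over normalized predicate-argument relations can match all of them, whatever the surface syntax. The sketch below shows the idea over a tiny hand-built index. The `Relation` record, its fields, and the index contents are hypothetical; they are not MEDUSA's actual data model or Enju's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One hypothetical predicate-argument relation extracted from a sentence."""
    sentence_id: int
    predicate: str   # lemma of the verb
    arg1_head: str   # head word of the subject
    arg1_class: str  # semantic class assigned to the subject


# Hand-built index: sentences 1-3 stand for the simple, relative-pronoun and
# coordination examples; sentence 4 is an invented non-matching case.
INDEX = [
    Relation(1, "activate", "protein", "protein"),
    Relation(2, "require", "initiation", "other"),
    Relation(2, "activate", "protein", "protein"),
    Relation(3, "associate", "protein", "protein"),
    Relation(3, "activate", "protein", "protein"),
    Relation(4, "activate", "treatment", "other"),
]


def search(index, predicate, subject_class):
    """Return ids of sentences where the predicate's subject has the given class."""
    return sorted(
        {
            r.sentence_id
            for r in index
            if r.predicate == predicate and r.arg1_class == subject_class
        }
    )


hits = search(INDEX, "activate", "protein")
```

A keyword search for "protein" and "activate" would also return sentence 4 if it happened to mention a protein elsewhere; querying the relation itself avoids that.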

MEDUSA demonstration
  100,000 MEDLINE abstracts
  Parsed by Enju
  Genes and diseases are annotated using the UMLS dictionary

Summary
GENIA corpus
  Parts of speech, named entities, parse trees
Extracting gene-disease associations from MEDLINE
  Machine learning with HPSG parse results
An IR system for predicate-argument structures
  MEDUSA

Software and resources
GENIA
  Named-entity corpus
  Part-of-speech corpus
  Parse tree corpus
  Co-reference (Singapore)
  Part-of-speech tagger
  Named-entity tagger (soon)
HPSG parse results (100,000 MEDLINE abstracts)
Enju (HPSG parser)
MEDUSA
LiLFeS
Amis