Information Extraction from Literature Yue Lu BeeSpace Seminar Oct 24, 2007.


Information Extraction from Literature Yue Lu BeeSpace Seminar Oct 24, 2007

Outline
- Overview of BeeSpace v4
- Entity Recognition
- Relation Extraction

Overview
BeeSpace v4
- Deeper semantic base than the current v3 system
- Entities and relations vs. mutual information
Four levels
- Level1: Entity Recognition
- Level2: Entity Association Mining
- Level3: Relation Extraction
- Level4: Inference and Hypothesis Generation

Overview
Level1: Entity Recognition (detailed later)
Level2: Entity Association Mining
- Assumes entities are already properly tagged
- Uses the co-occurrence patterns of entities to extract semantics
- e.g. a bee biologist may want to know which genes are important for foraging behavior
- Similar to the TREC Genomics 2007 task
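The co-occurrence idea behind Level2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (one list of `(entity_text, entity_type)` pairs per sentence), the function names, and the toy entities `geneA` and `geneB` are all hypothetical and do not come from the BeeSpace system, and raw sentence-level counts stand in for whatever association measure is actually used.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tagged_sentences):
    """Count sentences in which two distinct tagged entities appear together.

    tagged_sentences: iterable of lists of (entity_text, entity_type) pairs,
    one list per sentence (hypothetical input format).
    """
    counts = Counter()
    for entities in tagged_sentences:
        unique = sorted(set(entities))          # each entity counted once per sentence
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts

def associated(counts, target, wanted_type):
    """Rank entities of wanted_type by how often they co-occur with target."""
    scores = Counter()
    for (a, b), n in counts.items():
        for x, y in ((a, b), (b, a)):           # pairs are unordered, so check both sides
            if x == target and y[1] == wanted_type:
                scores[y[0]] += n
    return scores.most_common()

# Toy data: entity names are made up for illustration only.
sentences = [
    [("geneA", "Gene"), ("foraging", "Behavior")],
    [("geneA", "Gene"), ("foraging", "Behavior"), ("geneB", "Gene")],
    [("geneB", "Gene"), ("maxillary segment", "Anatomy")],
]
counts = cooccurrence_counts(sentences)
ranking = associated(counts, ("foraging", "Behavior"), "Gene")
print(ranking)
```

A query such as "which genes are important for foraging behavior" then reduces to ranking Gene entities by their count with the Behavior entity; a real system would likely normalize these counts rather than use them raw.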

TREC Genomics 2007
e.g. “Which [PATHWAYS] are possibly involved in the disease ADPKD?”
Currently only retrieval techniques are used
- Gene synonym expansion
- Conjunctive query interpretation
- User relevance feedback
Tagged entities would definitely help

Overview
Level3: Relation Extraction
- Goal is to extract the relations between entities
- Generally requires entities to be properly tagged first
- Detailed later
Level4: Inference and Hypothesis Generation
- Inference on knowledge base
- Graph mining

Outline
- Overview of BeeSpace v4
- Entity Recognition
- Relation Extraction

Entity Recognition
Gene
Example:
- Although mxp and Pb display very similar expression patterns, pb null embryos develop normally

Entity Recognition
Anatomy
Example:
- In normal embryos, mxp is expressed in the maxillary and labial segments, whereas ectopic expression is observed in some GOF variants.

Entity Recognition
Biological process
Example:
- Amongst these are the Bicoid, the Nanos, and the terminal class gene products, some of which are oncoproteins involved in signal transduction for the formation of terminal structures in the embryo.

Entity Recognition
Pathways
Example:
- Several signal transduction pathways have been described in Drosophila, and this review explores the potential of oncogene studies using one of those pathways - the terminal class signal transduction pathway - to better understand the cellular mechanisms of proto-oncogenes that mediate cellular responses in vertebrates including humans

Entity Recognition
Protein family
Example:
- While non-arthropod orthologs have been found for many Drosophila eye developmental genes, this has not been the case for the glass (gl) gene, which encodes a zinc finger transcription factor required for photoreceptor cell specification, differentiation, and survival.

Entity Recognition
CRE (cis-regulatory elements)
Example:
- A synthetic, 23-bp ecdysterone regulatory element (EcRE), derived from the upstream region of the Drosophila melanogaster hsp27 gene, was inserted adjacent to the herpes simplex virus thymidine kinase promoter fused to a bacterial gene for chloramphenicol acetyltransferase (CAT).

Entity Recognition
Phenotype
Definition:
- A set of observable physical characteristics of an individual organism
Example:
- Fog, dumpy

Entity Recognition
Class1: Small Variation (Dictionary/Ontology)
- Organism, Anatomy, Biological Process, Pathway, Protein Family
Class2: Medium Variation
- Gene, cis-Regulatory Element
Class3: Large Variation
- Phenotype, Behavior

Entity Recognition
Generally can be defined as a classification problem
Boils down to feature definition
- Class1: matching a word in the Dictionary/Ontology
- Class2: prefix/suffix of the word, POS tags, …
- Class3: ?
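To make "feature definition" for Class2 concrete, here is a minimal Python sketch of per-token features. The slide names only prefix/suffix and POS tags; the affix length of 3, the `has_digit` feature, and the function name are assumptions added for illustration, and the POS tags in the example are assigned by hand rather than by a tagger.

```python
def token_features(tokens, pos_tags, i, affix_len=3):
    """Build a feature dict for the token at position i.

    Prefix/suffix and POS tag follow the slide; has_digit and the
    neighbouring POS tags are illustrative additions (assumptions).
    """
    word = tokens[i]
    return {
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "<S>",
        "next_pos": pos_tags[i + 1] if i + 1 < len(tokens) else "</S>",
        "has_digit": any(c.isdigit() for c in word),
    }

# Hand-tagged fragment taken from the CRE example sentence.
tokens = ["the", "hsp27", "gene"]
pos_tags = ["DT", "NN", "NN"]
feats = token_features(tokens, pos_tags, 1)
print(feats)
```

Feature dicts like this would be fed to a classifier that decides, token by token, whether the word is part of a gene or cis-regulatory element mention.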

Entity Recognition
First focus on Class1
- Relatively simple; Class2 and Class3 need training examples
- Useful in entity association mining
- Useful in facilitating extraction of many interesting relations
Related work: Textpresso

Textpresso
Input: full-text C. elegans literature
Output: tagged XML format
Defined a Textpresso ontology
- First category is biological entities
Manually curated a lexicon of names
Implemented with Perl regular expressions
We could reuse some of the regular expressions
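Lexicon-driven tagging with regular expressions, as described for Class1 entities, might look roughly like the Python sketch below. It is not Textpresso's code or output format: the `<entity type="...">` markup, the tiny `LEXICON`, and the function names are all hypothetical, and Python's `re` module stands in for the Perl expressions mentioned on the slide.

```python
import re

# Hypothetical mini-lexicon; a real one would be curated from the resources
# listed for each entity type.
LEXICON = {
    "Anatomy": ["maxillary segment", "labial segment"],
    "Organism": ["Drosophila melanogaster", "Drosophila"],
}

def build_pattern(lexicon):
    """Compile one alternation over all terms, longest term first."""
    terms = sorted(
        ((term, cat) for cat, ts in lexicon.items() for term in ts),
        key=lambda pair: -len(pair[0]),
    )
    lookup = {term.lower(): cat for term, cat in terms}
    alternation = "|".join(re.escape(term) for term, _ in terms)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, lookup

def tag(text, pattern, lookup):
    """Wrap every lexicon match in an XML-style tag carrying its category."""
    def repl(match):
        category = lookup[match.group(1).lower()]
        return '<entity type="%s">%s</entity>' % (category, match.group(1))
    return pattern.sub(repl, text)

pattern, lookup = build_pattern(LEXICON)
tagged = tag(
    "mxp is expressed in the maxillary segment of Drosophila melanogaster.",
    pattern,
    lookup,
)
print(tagged)
```

Sorting the terms longest-first makes "Drosophila melanogaster" win over the shorter "Drosophila". The sketch does no plural or spelling-variant handling and does not XML-escape the input, both of which a production tagger would need.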

Entity Recognition
Resources:
- Organism: Entrez gene table, Textpresso, BeeSpace DB
- Anatomy: FlyBase
- Biological Process, Cellular Component, Molecular Function: Textpresso
- Pathway: KEGG
- Protein Family: PDB, NCBI

Outline
- Overview of BeeSpace v4
- Entity Recognition
- Relation Extraction

Expression Location
- The expression of a gene in some location (tissues, body parts)
Homology/Orthology
- One gene is homologous to another gene

Relation Extraction
Biological process
- One gene has some role in a biological process
Genetic/Physical/Regulatory Interaction
- One gene interacts with another gene in a certain fashion (3 types of relations)
- A simple case: Protein-Protein Interaction (PPI)

Relation Extraction
Generally can be defined as a classification problem, which requires training data
Domain adaptation?
- An example of PPI

PPI
Problem definition:
- Gene/protein names are already tagged
- A known list of interaction words (133 words)
- Classify each tuple (p1, p2, interWord) within a single sentence
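The problem definition implies a candidate-generation step before classification: every pair of tagged proteins in a sentence is combined with every interaction word found in it. A minimal Python sketch follows. The function name, the token-index input format, and the two-word `INTERACTION_WORDS` set are assumptions; the actual 133-word list is not given in the slides, and the protein names are placeholders.

```python
from itertools import combinations

# Toy stand-in for the 133-word interaction list, which the slides do not show.
INTERACTION_WORDS = {"binds", "inhibits"}

def candidate_tuples(tokens, protein_idx, interaction_words):
    """Enumerate (p1, p2, interWord) candidates within one sentence.

    tokens: list of word tokens; protein_idx: positions of tokens that are
    already tagged as gene/protein names (hypothetical input format).
    """
    inter_idx = [i for i, t in enumerate(tokens) if t.lower() in interaction_words]
    candidates = []
    for p1, p2 in combinations(sorted(protein_idx), 2):
        for w in inter_idx:
            candidates.append((tokens[p1], tokens[p2], tokens[w]))
    return candidates

tokens = ["ProtA", "binds", "ProtB", "and", "inhibits", "ProtC"]
tuples = candidate_tuples(tokens, [0, 2, 5], INTERACTION_WORDS)
for t in tuples:
    print(t)
```

Each candidate is then labeled as a true or false PPI by the classifier. With three proteins and two interaction words the sentence above yields six candidates, most of which should be rejected.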

PPI
Methods
- Learning algorithm: Maximum Entropy
- Context features, e.g. lexical forms, POS tags …
- Less dependent on domain
“Extracting protein-protein interactions using simple contextual features training data” BioNLP Workshop on HLT-NAACL 06
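For reference, a maximum entropy classifier is usually written in the standard conditional form below. The notation is not from the slides: here x is assumed to be a candidate tuple (p1, p2, interWord) with its sentence context, y is the label (PPI or not), the f_i are the context features, and the λ_i are the learned weights.

```latex
P(y \mid x) = \frac{\exp\left(\sum_{i} \lambda_i f_i(x, y)\right)}
                   {\sum_{y'} \exp\left(\sum_{i} \lambda_i f_i(x, y')\right)}
```

With only two labels this is equivalent to logistic regression over the same features.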

PPI
Training/testing data:
- BioCreative
- 1000 hand-labeled sentences, 3964 tuples
- 5-fold cross validation
Performance
- avgpr =
- avgre =
- avgf1 =
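The sketch below shows one common way to compute such cross-validation figures, assuming that avgpr, avgre and avgf1 denote precision, recall and F1 averaged over the five folds (the slides do not define them). The function names and the toy labels are illustrative only and are not results from the talk.

```python
def prf(gold, predicted):
    """Precision, recall and F1 for binary labels (1 = PPI, 0 = no PPI)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_over_folds(fold_results):
    """Average each of (precision, recall, F1) across folds."""
    n = len(fold_results)
    return tuple(sum(r[k] for r in fold_results) / n for k in range(3))

# Toy labels for two folds; made up for illustration.
fold1 = prf([1, 1, 0, 0], [1, 0, 1, 0])
fold2 = prf([1, 0, 1, 0], [1, 0, 1, 0])
avgpr, avgre, avgf1 = average_over_folds([fold1, fold2])
print(avgpr, avgre, avgf1)
```

Averaging per-fold scores is only one convention; pooling the counts from all folds before computing the metrics is another and generally gives slightly different numbers.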

PPI
Training data:
- BioCreative
- 1000 hand-labeled sentences, 3964 tuples
Testing data (different domain):
- Bee collection
Performance (judged by Moushumi):
- Total number of tuples extracted as PPI instances: 92
- Precision: 63%

PPI
Misclassification examples
Type1: No interaction
Sentence: Pretreatment of platelet suspension with phospholipase A2 from N. naja atra or A. mellifera venom (50 μg/ml) inhibited platelet aggregation induced by sodium arachidonate or collagen, but not induced by thrombin or ionophore A
False: (collagen, thrombin, induced)
True: relation between protein and platelet aggregation; no PPI

PPI
Misclassification examples
Type2: Incorrect interaction word
Sentence: IgG antibody was able to inhibit binding of IgE antibody in the PLA radioallergosorbent test (RAST) from 10-40% at a molar excess of 10- to 1000-fold.
False: (IgG antibody, IgE antibody, binding)
True: (IgG antibody, IgE antibody, inhibit)

PPI
Misclassification examples
Type3: Incorrect protein involved
Sentence: AChE exhibits a butyrylcholinesterase (BuChE) activity that represents about 14% of AChE activity.
False: (AChE, AChE, exhibits)
True: (AChE, BuChE, exhibits)

PPI
Possible improvement
- Syntactic patterns: “Optimizing syntax-patterns for discovering protein-protein interactions” In Proc ACM Symposium on Applied Computing, SAC, Bioinformatics Track,
- Parse tree
- Dependency parsing
- …

The End