Gully A. Burns, Pradeep Dasigi, Eduard H. Hovy

Presentation transcript:

Extracting Evidence Fragments for Distant Supervision of Molecular Interactions
Gully A. Burns (1), Pradeep Dasigi (2), Eduard H. Hovy (2)
(1) Intelligent Systems Division, USC's Information Sciences Institute
(2) Language Technologies Institute, Carnegie Mellon University

Studying a Complex Molecular Interaction
Innocenti et al. 2002 (PMID: 11777939)
“Thus, under physiological conditions, the coimmunoprecipitation of Eps8 and Sos-1 depends on the integrity of the Eps8–E3b1 interaction, pointing to the existence of a physiological S/E/E8 complex.”

Distant Supervision based on Subfigure Annotations

https://www.ebi.ac.uk/intact/
European Bioinformatics Institute, Cambridge, England
A large-scale, high-quality, manually curated database
Basic data structure: [PMID, Subfig, Protein1, Protein2, Protein3, Methods]
Content: ~14K curated papers in total
~1K open access
6320 molecular interactions from 899 papers with sub-figure references
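As a rough illustration of the basic data structure listed on this slide, the record could be modelled as below. This is a minimal sketch, not the database's actual schema: the class name `InteractionRecord`, the helper `from_row`, the `|` separator for multiple methods, and the example row (including the subfigure label `1A`) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class InteractionRecord:
    """One curated interaction tied to a subfigure (hypothetical model)."""
    pmid: str                   # PubMed ID of the curated paper
    subfig: str                 # subfigure label, e.g. "1A"
    proteins: Tuple[str, ...]   # up to three participants per the slide
    methods: Tuple[str, ...]    # detection method(s)


def from_row(row: Sequence[str]) -> InteractionRecord:
    """Build a record from [PMID, Subfig, Protein1, Protein2, Protein3, Methods].

    Empty protein slots are dropped; methods are split on '|' (an assumption).
    """
    pmid, subfig, p1, p2, p3, methods = row
    proteins = tuple(p for p in (p1, p2, p3) if p)
    method_list = tuple(m.strip() for m in methods.split("|") if m.strip())
    return InteractionRecord(pmid, subfig, proteins, method_list)


# Illustrative row only; it is not copied from the database.
example = from_row(["11777939", "1A", "Eps8", "Sos-1", "", "coimmunoprecipitation"])
```

Keeping the proteins as a tuple of present values, rather than three fixed fields, makes it easy to compare a curated record against an extracted set of participants later on.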

Mid-Level Discourse Structure in Results Sections
Figure labels: Context, Expt 1A, Expt 1B, Expt 1C, Interpretation
Innocenti et al. 2002 (PMID: 11777939)

Discourse Types

Low-Level Discourse Structure in Experimental Text
Innocenti et al. 2002 (PMC2173577)

SciDT: Scientific Discourse Tagger
SciDT uses deep learning (Dasigi et al. 2017, arXiv:1702.05398). The input is the list of sub-sentence clauses in a given paragraph. Word2vec embeddings trained on Wikipedia, all PubMed abstracts, and all PMC open access articles (Pyysalo et al. 2013) are used to obtain D, which is converted via ‘Summarization’ to Dsumm and then passed to an LSTM-RNN for labeling.

Discourse Tagging Performance
Accuracies and F-scores from 5-fold cross-validation of SciDT in various settings, compared with a Conditional Random Field baseline model (Burns et al. 2016, Database).
Dasigi et al. 2017, arXiv:1702.05398

Heuristics for Identifying Evidence Fragments
For each subfigure mention in the text (e.g., ‘Fig. 1 A’), scan backwards over consecutive sentences S1 and S2. Any of the following 3 conditions indicates that S2 is the first sentence of the evidence fragment:
(1) S1 contains (a) clauses that are tagged as ‘hypotheses’, ‘problems’, or ‘facts’, or (b) clauses that are tagged as ‘results’ or ‘implications’ that also contain external citations. S2 contains (a) clauses that are goals or methods, or (b) results/implications with no external citations.
(2) Both S1 and S2 contain references to subfigures that are entirely disjoint.
(3) S2 is a section heading.
Then scan forward for similar conditions to indicate that S1 is the last sentence of the evidence fragment.
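The backward scan described on this slide can be sketched as follows. This is an illustrative reading, not the authors' implementation: the `Sentence` class, the singular tag names, and the function names are hypothetical, and the code assumes that the first condition requires both its S1 part and its S2 part to hold, which the slide does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Tag names are assumptions; the slide lists them in the plural.
CONTEXT_TAGS = {"hypothesis", "problem", "fact"}
CLAIM_TAGS = {"result", "implication"}
SETUP_TAGS = {"goal", "method"}


@dataclass
class Sentence:
    # Each clause is (discourse tag, contains an external citation?)
    clauses: List[Tuple[str, bool]]
    subfigs: Set[str] = field(default_factory=set)  # subfigures mentioned
    is_heading: bool = False


def closes_previous(s1: Sentence) -> bool:
    """S1 part of condition 1: background, or cited results/implications."""
    return any(tag in CONTEXT_TAGS or (tag in CLAIM_TAGS and cited)
               for tag, cited in s1.clauses)


def opens_new(s2: Sentence) -> bool:
    """S2 part of condition 1: goals/methods, or uncited results/implications."""
    return any(tag in SETUP_TAGS or (tag in CLAIM_TAGS and not cited)
               for tag, cited in s2.clauses)


def is_boundary(s1: Sentence, s2: Sentence) -> bool:
    """True if S2 should be treated as the first sentence of a fragment."""
    cond1 = closes_previous(s1) and opens_new(s2)   # assumed conjunction
    cond2 = bool(s1.subfigs) and bool(s2.subfigs) and s1.subfigs.isdisjoint(s2.subfigs)
    cond3 = s2.is_heading
    return cond1 or cond2 or cond3


def fragment_start(sentences: List[Sentence], mention_idx: int) -> int:
    """Scan backwards from a subfigure mention to the fragment's first sentence."""
    i = mention_idx
    while i > 0 and not is_boundary(sentences[i - 1], sentences[i]):
        i -= 1
    return i


# Toy paragraph: background fact, then goal/method, then a result citing Fig. 1 A.
toy = [
    Sentence([("fact", False)]),
    Sentence([("goal", False), ("method", False)]),
    Sentence([("result", False)], subfigs={"1A"}),
]
start = fragment_start(toy, 2)
```

In the toy paragraph the scan stops at index 1, because the preceding sentence is a background fact and the sentence at index 1 states a goal. The forward scan for the last sentence would mirror `fragment_start` with the same boundary test.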

Delineating Experiments Across Results Sections
Innocenti et al. (2002)
Small-scale manual evaluation of 5 papers (695 clauses, 133 figure references), checking whether each clause was correctly labeled with its subfigures: Precision = 0.53, Recall = 0.94, F-Score = 0.74.
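The slide does not spell out how the clause-level scores were counted or averaged across papers. One plausible way to score clause-to-subfigure labels, shown here purely as a sketch on invented toy data, is to compare sets of (clause id, subfigure) pairs; the function name `precision_recall` and the pair representation are assumptions.

```python
from typing import Set, Tuple

Label = Tuple[int, str]  # (clause id, subfigure label), a hypothetical encoding


def precision_recall(predicted: Set[Label], gold: Set[Label]) -> Tuple[float, float]:
    """Precision and recall over (clause, subfigure) label pairs."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the predicted fragments cover every gold clause but also spill
# over onto neighbouring clauses, so recall is high and precision is lower.
gold = {(1, "1A"), (2, "1A"), (3, "1B")}
predicted = {(0, "1A"), (1, "1A"), (2, "1A"), (3, "1B"), (4, "1B"), (5, "1B")}
p, r = precision_recall(predicted, gold)
```

Under this counting scheme, fragments that extend past the true experiment boundaries lower precision without affecting recall.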

The INTACT Evidence Fragment Corpus
doi:10.6084/m9.figshare.5007992.v5
Data set defined as a Research Object
INTACT data expressed as BioPAX
National Library of Medicine’s ‘BioC’ format: http://bioc.sourceforge.org
BioC Linked Data: http://purl.org/bioc
Lightweight and flexible: Document / Passage / Annotation / Location
~20K evidence fragments
Supplemented with the SPAR, Dublin Core, and PROV ontologies
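The Document / Passage / Annotation / Location nesting named on this slide can be pictured with plain data classes. This sketch does not use any BioC library and is not the corpus's actual serialization; the infon keys (`type`, `subfigure`), the example text, and the helper `annotation_text` are illustrative assumptions. It follows the BioC convention that a location's offset is measured from the start of the document, so the passage offset is subtracted before slicing.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Location:
    offset: int   # character offset from the start of the document
    length: int   # span length in characters


@dataclass
class Annotation:
    id: str
    infons: Dict[str, str]        # free-form key/value metadata
    locations: List[Location]


@dataclass
class Passage:
    offset: int                   # where this passage starts in the document
    text: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Document:
    id: str
    passages: List[Passage] = field(default_factory=list)


def annotation_text(passage: Passage, ann: Annotation) -> str:
    """Recover the text covered by an annotation within its passage."""
    parts = []
    for loc in ann.locations:
        start = loc.offset - passage.offset
        parts.append(passage.text[start:start + loc.length])
    return " ".join(parts)


# Invented example: one passage whose whole span is marked as an evidence fragment.
text = "Eps8 binds Sos-1."
fragment = Annotation(
    id="F1",
    infons={"type": "evidence_fragment", "subfigure": "1A"},  # hypothetical keys
    locations=[Location(offset=100, length=len(text))],
)
passage = Passage(offset=100, text=text, annotations=[fragment])
doc = Document(id="11777939", passages=[passage])
```

Because annotations point at spans by offset rather than copying text, the same passage can carry discourse tags, subfigure references, and evidence fragments side by side.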

Event Extraction with the REACH System ‘Out-of-the-Box’
https://github.com/clulab/reach
A biomedical event extraction system developed by Mihai Surdeanu’s group at the University of Arizona (http://clulab.org/)
Applies an extensive library of extraction patterns to detect binding, phosphorylation, and activation
Not (yet) designed to examine molecular interaction evidence
Baseline performance: a ‘Complex Assembly’ event is correctly identified 43.5% of the time; the INTACT record is perfectly reconstructed 5.6% of the time
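The two baseline figures on this slide measure different things: whether any ‘Complex Assembly’ event was found, and whether the curated record was reproduced exactly. A hedged sketch of how such a pair of rates could be computed is given below. It does not call REACH or parse its output; the input format, the function name `score_baseline`, and the definition of a perfect reconstruction as an exact match between extracted participants and curated proteins are all assumptions.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, Sequence[str]]             # (event type, participant names)
Case = Tuple[Sequence[str], Sequence[Event]]  # (curated proteins, extracted events)


def score_baseline(cases: List[Case]) -> Tuple[float, float]:
    """Return (rate with a Complex Assembly event, rate with an exact match)."""
    if not cases:
        return 0.0, 0.0
    found = 0
    exact = 0
    for curated, events in cases:
        complexes = [set(parts) for etype, parts in events
                     if etype == "Complex Assembly"]
        if complexes:
            found += 1
        # Assumed criterion: some extracted complex has exactly the curated proteins.
        if any(c == set(curated) for c in complexes):
            exact += 1
    return found / len(cases), exact / len(cases)


# Invented toy cases, not real extraction output.
toy_cases: List[Case] = [
    (["A", "B"], [("Complex Assembly", ["A", "B"])]),   # exact match
    (["A", "B", "C"], [("Complex Assembly", ["A", "B"])]),  # found, incomplete
    (["D", "E"], [("Phosphorylation", ["D", "E"])]),    # wrong event type
    (["F", "G"], []),                                   # nothing extracted
]
found_rate, exact_rate = score_baseline(toy_cases)
```

The gap between the two rates in the toy data mirrors the pattern on the slide: detecting that a complex exists is much easier than recovering every participant.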

Acknowledgements
Anita de Waard, Hans Chalupsky, Sandra Orchard
DARPA Big Mechanism under ARO contract W911NF-14-1-0436
‘EvidX’ project (NLM) R01LM012592-01