© 2002 The MITRE Corporation. ALL RIGHTS RESERVED. Co-Chair: Alexander Yeh, MITRE Corp. Data: FlyBase (http://www.flybase.org) July 2002 KDD Cup 2002 Task1:

Presentation transcript:

KDD Cup 2002 Task 1: Information Extraction from Biomedical Articles

Co-Chair: Alexander Yeh, MITRE Corp.
Data: FlyBase (http://www.flybase.org)
July 2002

Task Background

- Biomedical information exists in
  - Research literature
  - More than 280 semi-structured databases. Some examples:
    - FlyBase: fruit fly genes and proteins
    - Mouse Genome Database (MGD)
    - Protein Information Resource (PIR)
- Some databases act as distillations of subsets of the literature
- Currently, curators (people) read the literature to manually update (curate) the databases

FlyBase: Example of Data Curation

Curators Cannot Keep Up with the Literature!

[Chart: FlyBase References By Year]

Task Rationale and Description

- We want to see what can be done to help automate this curation process
- Start fairly simple. Try to help automate part of what one group of curators needs to do:
  - Determine which papers need to be curated for fruit fly gene expression information
    - “Curate” means read in detail to make entries in the database
  - Want to curate those papers containing experimental results on gene products (RNA transcripts and proteins)

Task Description: For a Set of Papers on Genetics or Molecular Biology

- Given for each paper
  - The full text of that paper
  - A list of the genes mentioned in that paper
- Determine for each paper
  - Does that paper contain any curatable gene product information (experimental results)?
  - For each gene mentioned in the paper, does that paper have experimental results for
    - Transcript(s) of that gene?
    - Protein(s) of that gene?
- Also produce a ranked list of the papers
  - Rank the curatable papers before the non-curatable papers
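The inputs and outputs described on this slide can be sketched as simple records. This is a minimal illustration only: the class names, field names, paper IDs, and confidence scores are all assumptions made for the example, not the contest's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Paper:
    """Input for one paper: its full text plus the genes it mentions."""
    paper_id: str          # hypothetical identifier
    full_text: str
    genes: List[str]


@dataclass
class Prediction:
    """Output for one paper (field names are illustrative assumptions)."""
    paper_id: str
    curatable: bool        # any curatable gene product information?
    score: float           # confidence, used only to order the ranked list
    # gene -> {"transcript": bool, "protein": bool}
    products: Dict[str, Dict[str, bool]] = field(default_factory=dict)


def ranked_list(predictions: List[Prediction]) -> List[str]:
    """Order paper IDs so the most likely curatable papers come first."""
    ordered = sorted(predictions, key=lambda p: p.score, reverse=True)
    return [p.paper_id for p in ordered]


# Toy predictions with made-up IDs and scores.
preds = [
    Prediction("paper-A", False, 0.10),
    Prediction("paper-B", True, 0.92,
               {"per": {"transcript": True, "protein": False}}),
    Prediction("paper-C", True, 0.55),
]
print(ranked_list(preds))
```

Keeping a numeric score separate from the yes/no decision lets one system answer both the ranking question and the curatable/non-curatable question.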

Task is Harder Than It First Appears

- Interested in results applicable to “regular” (found in the wild) flies, not mutants
- Genes have multiple names (synonyms)
  - Given a list of the known synonyms
  - But list may be incomplete
  - Some names can refer to more than one gene
    - E.g., “Clk” is a symbol for one gene (Clock) and is also a synonym for another gene (period, symbol is “per”)
- Contestants given evidence of experimental results found in the training data,
  - But only in the form that is recorded in the FlyBase database
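The synonym ambiguity can be made concrete by inverting a synonym table into a name-to-genes index. The sketch below uses only the “Clk” example from the slide; the table layout and function names are assumptions for illustration, and a real synonym list would be far larger and, as noted, possibly incomplete.

```python
from collections import defaultdict

# Tiny synonym table built from the slide's example:
# gene symbol -> names known to refer to that gene.
SYNONYMS = {
    "Clk": ["Clk", "Clock"],
    "per": ["per", "period", "Clk"],
}


def build_name_index(synonyms):
    """Invert the table: name -> set of gene symbols it may refer to."""
    index = defaultdict(set)
    for symbol, names in synonyms.items():
        for name in names:
            index[name].add(symbol)
    return index


def candidate_genes(name, index):
    """Return every gene a name could denote, sorted for stable output."""
    return sorted(index.get(name, set()))


index = build_name_index(SYNONYMS)
print(candidate_genes("Clk", index))     # two candidates: the name is ambiguous
print(candidate_genes("period", index))  # one candidate
print(candidate_genes("tim", index))     # not in this table: empty list
```

A name that maps to more than one symbol needs extra context from the paper to resolve, and a name missing from the index may still be a real synonym that the list does not cover.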

Evidence of Results Found in Training Data as Recorded in FlyBase

- Database (DB) records what evidence is found in a training paper, but not where in that paper
- The evidence is often recorded in a “normalized” form and domain knowledge is needed to find the corresponding text, e.g.,
  - DB: Assay mode: “immunolocalization”
  - Text (PubMed ID# ): “Figure 12. …Whole-mount tissue staining using an affinity-purified anti-PHM antibody in the CNS … This view displays only a portion of the CNS”
  - Term “immunolocalization” is not in the text
  - Instead, text describes an instance of performing immunolocalization
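One simple way to bridge a normalized database term and the wording of a paper is a hand-built list of cue phrases per term. The sketch below is hypothetical throughout: the cue list is invented for this example and is not taken from FlyBase or from any contest entry.

```python
# Hypothetical cue phrases for one normalized assay term (an assumption,
# standing in for the domain knowledge the slide says is needed).
CUES = {
    "immunolocalization": ["antibody", "staining", "immunostain"],
}


def find_cues(assay_term, text, cues=CUES):
    """Return the cue phrases for assay_term that occur in text (case-insensitive)."""
    lowered = text.lower()
    return [cue for cue in cues.get(assay_term, []) if cue in lowered]


# Figure caption fragment quoted on the slide.
caption = ("Whole-mount tissue staining using an affinity-purified "
           "anti-PHM antibody in the CNS")

print("immunolocalization" in caption.lower())   # the normalized term itself is absent
print(find_cues("immunolocalization", caption))  # but cue phrases are present
```

Plain substring matching is deliberately crude; it only illustrates why a literal search for the database term fails while related vocabulary still appears in the text.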

Task Details

- Task has 3 sub-tasks that contribute equally to the overall score
  - 1. Ranked list of papers (curatable before non-curatable)
    - Look at area under ROC curve
    - ROC = Receiver Operating Characteristic: Prob(detection) versus Prob(false alarm)
  - 2. Yes/No decisions on the papers being curatable (having any results of interest)
    - Use balanced F-score
  - 3. Yes/No decisions for having results for each type of product (transcript, protein) for each gene mentioned
    - Use balanced F-score
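The two measures named on this slide can be sketched in a few lines. The data below is a toy example, and the equal-weight average in `overall` is only one assumed reading of "contribute equally"; the slide does not give the exact combination formula.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_f(gold, predicted):
    """Balanced F-score: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def overall(auc, f_paper, f_product):
    """Assumed combination: unweighted mean of the three sub-task scores."""
    return (auc + f_paper + f_product) / 3


# Toy data: 1 = curatable, 0 = not curatable.
gold = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
decisions = [1 if s >= 0.5 else 0 for s in scores]   # threshold is arbitrary

auc = roc_auc(gold, scores)
f_paper = balanced_f(gold, decisions)
print(round(auc, 3), round(f_paper, 3))
```

The ranked-list measure depends only on the order of the scores, while the F-score depends on where the yes/no cut-off is placed, so a system can do well on one and poorly on the other.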

Result Statistics

- 18 teams submitted 32 entries
- Entries from 7 “countries”:
  - Japan, Taiwan, Singapore, India, UK, Portugal, USA
- Team type (lead member, some teams had multiple types):
  - Company: 7
  - University: 8
  - Other: 3

WINNER

- Combined team from ClearForest and Celera
  - Contacts: Yizhar Regev, Michal Finkelstein
  - Also had the best score in each of the 3 sub-tasks

    Sub-task               Best   Median
    Ranked-list            84%    69%
    Yes/No curate paper    78%    58%
    Yes/No gene products   67%    35%

  - The top entries for the ranked-list sub-task all had close scores for this sub-task

Honorable Mentions (East to West):

- Combined team from the Design Technology Institute Ltd., the Mechanical Engineering Dept. at the National University of Singapore, and the Genome Institute of Singapore
  - Contact: Shi Min
- Combined team from the data mining group at Imperial College (UK) and Inforsense Limited
  - Contacts: Huma Lodhi, Yong Zhang
- Combined team from Verity, Inc. and Exelixis, Inc.
  - Contact: Bin Chen

Acknowledgements

- People at FlyBase, especially
  - William Gelbart
  - Beverly Matthews
  - Leyla Bayraktaroglu
  - David Emmert
  - Don Gilbert
- Project team at MITRE
  - Alexander Morgan
  - Lynette Hirschman