BioCreAtIvE: Critical Assessment for Information Extraction in Biology. Granada, Spain, March 28-31, 2004. Task 2: Functional annotation of gene products.



Task description
The assignment of GO annotations to human proteins
This is currently done by curators at Swiss-Prot
The full text of journal articles was used (636 training documents from the Journal of Biological Chemistry)
Three subtasks

Subtasks
1) “Recover” text that provides evidence for the GO annotation: given a (doc, protein, GO term) triplet, find the segment of text supporting this annotation
2) Provide GO annotations for human proteins: given a (doc, protein) pair, return all GO terms that could be associated with this pair
3) Select relevant papers: detect which papers are relevant for a protein, in the sense that they contain information suitable for deriving a GO annotation, and provide the evidence text

Evaluation
Predictions were made in the form of triplets (protein, paper, GO term) plus a piece of evidence text
More than 30,000 of these individual results were submitted and had to be reviewed by the GO curators
The scoring scheme for both GO terms and proteins was:
“high”: the GO term or the protein was correct
“generally”: for proteins, this means that the specific protein is not mentioned, but a homologue from another organism or a reference to the protein family is
“low”: the prediction was wrong

Results – Task 2.1

Results – Task 2.2

Summary of approaches adopted by some participants
User17: Soumya Ray and Mark Craven (University of Wisconsin)
User20: Francisco M. Couto et al. (from Portugal and France)
User4: Frédéric Ehrler and Patrick Ruch (University (Hospital) of Geneva)

Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text (User17)

Informative Term Model
Identify terms that are characteristic of a given GO term
Collect training data from the databases of other organisms: SGD, MGI, RGD, TAIR
Perform a chi-squared test to identify the informative terms
Null hypothesis: the distributions of a term in the two classes (support and background) are identical

Informative Term Model (cont’d)
Support set: the set of articles and abstracts associated with the GO term
Background set: the remaining articles and abstracts
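The chi-squared selection of informative terms can be sketched as follows. This is a minimal illustration, not the participants' implementation: the function names, the representation of documents as word sets, the 0.05 significance level (critical value 3.841 for one degree of freedom) and the extra requirement that a term be more frequent in the support set are all assumptions made for the example.

```python
from typing import Dict, List, Set


def chi_squared(support_with: int, n_support: int,
                background_with: int, n_background: int) -> float:
    """Pearson chi-squared statistic for a 2x2 table (class x term presence)."""
    # Rows: support, background. Columns: term present, term absent.
    observed = [
        [support_with, n_support - support_with],
        [background_with, n_background - background_with],
    ]
    row_totals = [n_support, n_background]
    col_totals = [observed[0][0] + observed[1][0],
                  observed[0][1] + observed[1][1]]
    total = n_support + n_background
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def informative_terms(support: List[Set[str]], background: List[Set[str]],
                      critical_value: float = 3.841) -> Dict[str, float]:
    """Terms for which the null hypothesis (same distribution in both sets) is rejected.

    Assumes both document lists are non-empty.
    """
    vocabulary = set().union(*support, *background)
    scores: Dict[str, float] = {}
    for term in vocabulary:
        in_support = sum(term in doc for doc in support)
        in_background = sum(term in doc for doc in background)
        stat = chi_squared(in_support, len(support),
                           in_background, len(background))
        # Assumption: keep only terms over-represented in the support set.
        over_represented = (in_support / len(support)
                            > in_background / len(background))
        if stat > critical_value and over_represented:
            scores[term] = stat
    return scores
```

Each document is reduced to the set of words it contains, so the test compares document frequencies rather than raw term counts; a real system would likely add tokenization, stemming, and a correction for multiple testing.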

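FiGO: Finding GO Terms in Unstructured Text (User20)
Calculate the information content of each word w occurring in the GO term names (formula not preserved in this transcript), where #w is the number of GO terms whose name contains w, and #max is the maximum number of GO terms whose names contain a common word
The information content of a term’s name n is then derived from the information content of its words (formula not preserved)
A GO term may have multiple names (synonyms), which are combined into a single information content for the term (formula not preserved)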
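The formulas on this slide did not survive in the transcript. One plausible reading, offered only as an assumption that is consistent with the definitions of #w and #max (the negative logarithm, the sum over the words of a name, and the maximum over a term's names are all assumed rather than taken from the slide), is:

```latex
\begin{align*}
IC(w) &= -\log \frac{\#w}{\#\mathit{max}} \\
IC(n) &= \sum_{w \in n} IC(w) \\
IC(t) &= \max_{n \in \mathit{names}(t)} IC(n)
\end{align*}
```

Under this reading, the most widespread word has an information content of 0, and words that occur in fewer GO term names score higher.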

Annotation with a piece of text
Given a piece of text p, the local information content (LIC) of each term t is defined with respect to p (formula not preserved in this transcript)
FiGO identifies a term t in a piece of text p when its local information content is sufficiently close to its information content (IC); the threshold is a parameter α ∈ [0,1] representing how close LIC must be to IC to decide that t is referred to in p
Thus the parameter α controls the trade-off between the recall and precision of FiGO
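The thresholding step can be illustrated with a short sketch. Because the slide's formulas are missing, everything numeric here is an assumption: word information content is taken as -log(#w / #max), a name's information content as the sum over its words, a term's as the maximum over its names, the local information content as the same sum restricted to words that also occur in the text, and "sufficiently close" as LIC / IC ≥ α. The function names and the whitespace tokenizer are hypothetical.

```python
import math
from typing import Dict, List, Set


def words(text: str) -> Set[str]:
    """Lowercased whitespace tokens (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def word_information_content(names_by_term: Dict[str, List[str]]) -> Dict[str, float]:
    """Assumed form: IC(w) = -log(#w / #max)."""
    counts: Dict[str, int] = {}
    for names in names_by_term.values():
        # Count each GO term once per word, however many of its names contain it.
        term_words = set().union(*(words(name) for name in names))
        for w in term_words:
            counts[w] = counts.get(w, 0) + 1
    max_count = max(counts.values())
    return {w: -math.log(c / max_count) for w, c in counts.items()}


def find_terms(text: str, names_by_term: Dict[str, List[str]],
               alpha: float) -> List[str]:
    """Return the terms whose local IC in `text` reaches alpha times their IC."""
    ic = word_information_content(names_by_term)
    text_words = words(text)
    found = []
    for term, names in names_by_term.items():
        # Assumed: a term's IC is the maximum over its names (synonyms).
        term_ic = max(sum(ic[w] for w in words(name)) for name in names)
        # Assumed: local IC only counts the name's words that occur in the text.
        local_ic = max(sum(ic[w] for w in words(name) & text_words)
                       for name in names)
        if term_ic > 0 and local_ic / term_ic >= alpha:
            found.append(term)
    return found
```

Raising `alpha` toward 1 requires nearly all of a name's informative words to appear in the text, which favors precision; lowering it accepts partial matches, which favors recall.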

Preliminary Report on the BioCreative Experiment: Task Presentation, System Description and Preliminary Results (User4)
An IR approach
Index the collection of GO terms as if they were documents
Treat each document (MEDLINE abstract) as a query to be categorized into GO categories
Combine two retrieval engines: a vector space model (TF-IDF) and a pattern matcher
Two types of indexing unit: stems (Porter-like) and linguistically motivated phrases (noun phrases)
The UMLS is also used for string normalization
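The idea of indexing GO terms as documents and using an abstract as the query can be sketched with a small vector space model. This is an illustration under stated assumptions, not the participants' system: the class name `GoTermIndex`, the smoothed IDF `log(1 + N / df)`, the cosine ranking and the whitespace tokenizer are all choices made for the example, and the stemming, noun-phrase indexing, pattern matcher and UMLS normalization are left out.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercased whitespace tokens; a real system would stem them."""
    return text.lower().split()


class GoTermIndex:
    """TF-IDF index in which every GO term name is treated as a document."""

    def __init__(self, go_terms: Dict[str, str]) -> None:
        counts = {tid: Counter(tokenize(name)) for tid, name in go_terms.items()}
        doc_freq: Counter = Counter()
        for term_counts in counts.values():
            doc_freq.update(term_counts.keys())
        n_terms = len(go_terms)
        # Assumed smoothed IDF so that very common words keep a positive weight.
        self.idf = {w: math.log(1 + n_terms / df) for w, df in doc_freq.items()}
        self.vectors = {tid: self._weigh(c) for tid, c in counts.items()}

    def _weigh(self, counts: Counter) -> Dict[str, float]:
        # Words unknown to the index are dropped from the vector.
        return {w: tf * self.idf[w] for w, tf in counts.items() if w in self.idf}

    def rank(self, abstract: str) -> List[Tuple[str, float]]:
        """Rank GO terms by cosine similarity to the abstract, used as the query."""
        query = self._weigh(Counter(tokenize(abstract)))
        q_norm = math.sqrt(sum(v * v for v in query.values()))
        results = []
        for tid, vec in self.vectors.items():
            dot = sum(weight * query.get(w, 0.0) for w, weight in vec.items())
            if dot > 0:
                d_norm = math.sqrt(sum(v * v for v in vec.values()))
                results.append((tid, dot / (q_norm * d_norm)))
        return sorted(results, key=lambda r: r[1], reverse=True)
```

Reversing the usual roles of query and document keeps the index small (one entry per GO term) and returns a ranked list of candidate categories for every abstract.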

Summary
IR-like approaches achieve higher recall
Almost all approaches depend on the collection of GO terms
GO term expansion (synonyms, related terms/phrases) seems important