
Presentation transcript:

Relevance Detection Approach to Gene Annotation
Aid to automatic annotation of databases
Annotation flow
–Extraction of molecular function of a gene from literature
–Then annotation of this function with a term in a controlled vocabulary
Premise
–If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them

Data
GeneRIF/GO term pairs
–Paired if they reference the same MEDLINE article
–Manually filtered for obvious errors
–550 pairs from 335 distinct genes
GO concept = GO term + definition
GeneRIFs and GO concepts are too short for simple keyword matching
Treated as an IR problem
–Similar to TREC novelty track
–Compute relevance and similarity of 2 sentences

Document set: TREC Genomics 2003 docs
Each sentence within a GeneRIF/GO concept pair treated as an IR query
Similarity between the 2 computed based on the top 200 docs retrieved by each query
Best recall = 78.2% (precision = 22.1%)
Best precision = 66.2% (recall = 46.9%)
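The slide does not say which similarity measure was applied to the two retrieved document sets. As a minimal sketch, assuming a Jaccard overlap over the top-ranked document IDs and a hypothetical link threshold (neither is from the original work), the comparison could look like this:

```python
def top_k_overlap(ranked_a, ranked_b, k=200):
    """Jaccard overlap of the top-k document IDs returned for two queries.

    ranked_a / ranked_b are lists of document IDs in rank order.
    Jaccard is an assumption; the slides do not name the measure used.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


def is_linked(ranked_a, ranked_b, k=200, threshold=0.1):
    """Propose a GeneRIF/GO link when the overlap reaches a threshold.

    The threshold value is hypothetical and would need tuning.
    """
    return top_k_overlap(ranked_a, ranked_b, k) >= threshold
```

Raising the threshold would be expected to trade recall for precision, which is the kind of trade-off the two reported operating points reflect.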

GO Dependence Relations
Previous work (PSB)
–Using substring matching between GO codes
–Derived from annotation databases, using vector space models, co-occurrence, association rule-mining
ChEBI:
–Chemical Entities of Biological Interest
–Preferred names + synonyms
–IS_A (poly)hierarchy

Methods
String matching
If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship
–First order relationship
–ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity
Also, in a dependence relationship with the ancestors
–Second order relationship
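The whole-word rule for first order relationships can be sketched with a regular expression. This is an illustration only: the function names and the tiny term list are hypothetical, and second order (ancestor) relationships are not covered.

```python
import re
from itertools import combinations


def contains_entity(go_term, entity):
    """True if entity occurs in go_term as a whole word.

    A match may not be preceded or followed by a letter or digit, so
    "carbon" matches "carbon-oxygen" but not "carbonic".
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, entities):
    """Return (term_a, term_b, entity) for GO terms sharing an entity."""
    pairs = []
    for entity in entities:
        matching = [t for t in go_terms if contains_entity(t, entity)]
        for term_a, term_b in combinations(matching, 2):
            pairs.append((term_a, term_b, entity))
    return pairs


# Illustrative input, not data from the study.
terms = [
    "carbonic anhydrase activity",
    "carbon-oxygen lyase activity",
    "carbon utilization",
]
print(first_order_pairs(terms, ["carbon"]))
```

With this input, only the two terms that contain "carbon" as a whole word are paired; "carbonic anhydrase activity" is left out, as in the slide's example.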

Results
55% of GO terms contain a ChEBI entity
56% of dependent pairs with a ChEBI term found in the PSB study were identified in this study
Less than 1% of GO term pairs found in this study were identified by the PSB study
Issues
–How to validate potential relationships?
–Usual naming/synonym ambiguity!
–Substrings not used: imidazolonepropionase

Disease Text Classification
Task: Classification of text into one of 26 disease classes
Used full text and weighted sections according to information distribution published by other groups

Data Preparation
HTML full-text documents, semi-automatic section division
Tokenisation, stemming, stop word filtering, part-of-speech tagging
Dataset: 21*25 positive full-text articles, 33 negative full-text articles
10-fold cross-validation
Nearest centroid classifier

Results
Baseline: 56% F-score
Additional preprocessing: 67%
–10,000 stopword filter
–Only nouns
Section weighting: 74%
–Abstract and Introduction weighted highest
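A nearest centroid classifier with section weighting can be sketched as follows. The weight values, the cosine similarity, and the toy documents are all assumptions for illustration; the slides only state that Abstract and Introduction were weighted highest.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: only the ordering (Abstract and Introduction
# highest) comes from the slides, the numbers do not.
SECTION_WEIGHTS = {"abstract": 3.0, "introduction": 3.0}


def weighted_vector(doc):
    """doc maps a section name to its list of (preprocessed) tokens."""
    vec = Counter()
    for section, tokens in doc.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for token in tokens:
            vec[token] += weight
    return vec


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_docs):
    """Average the weighted vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter()
    for label, doc in labelled_docs:
        sums[label].update(weighted_vector(doc))
        counts[label] += 1
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def classify(doc, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = weighted_vector(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data, not from the study.
training = [
    ("asthma", {"abstract": ["airway", "inflammation"], "methods": ["cohort"]}),
    ("diabetes", {"abstract": ["insulin", "glucose"], "methods": ["cohort"]}),
]
centroids = train_centroids(training)
query = {"abstract": ["insulin"], "methods": ["airway"]}
print(classify(query, centroids))
```

In the toy query, "insulin" in the abstract outweighs "airway" in the methods section, so the weighting decides the class.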

From Nonsense to Sense in Healthcare
Questions: Diagnosis, Prognosis, Therapy, Prevention
Medicine finds disease mechanisms by first finding cures
–Currently by trial and error: try drug, then test
–Future: test, then try drug
Biomarkers
–Normality -> dysfunction -> disease
–There are prognostic markers before any diagnostic markers

Integrative Genomics
Looking for hidden connections over a wide field, e.g.
–Immune system works too hard = rheumatoid arthritis
–Immune system doesn’t work hard enough = infectious diseases

Term Disambiguation
40% of genes have a homonym problem
For 300 genes = 1 million MEDLINE articles
After disambiguation = 60,000 articles
93% accuracy in assigning the correct ID to ambiguous genes
Use contextual fingerprints:
–Experts choose 5 abstracts about a concept
–Fingerprint then created for that concept