Relevance Detection Approach to Gene Annotation
Aid to automatic annotation of databases
Annotation flow
–Extraction of the molecular function of a gene from the literature
–Annotation of this function with a term in a controlled vocabulary
Premise
–If the document sets retrieved by a GeneRIF and a GO concept are similar, then a link can be made between them

Data
GeneRIF/GO term pairs
–Paired if they reference the same MEDLINE article
–Manually filtered for obvious errors
–550 pairs from 335 distinct genes
GO concept = GO term + definition
GeneRIFs and GO concepts are too short for simple keyword matching
Treated as an IR problem
–Similar to TREC novelty track
–Compute relevance and similarity of 2 sentences

Document set: TREC Genomics 2003 docs
Each sentence within a GeneRIF/GO concept pair treated as an IR query
Similarity between the 2 computed from the top 200 docs retrieved by each query
Best recall = 78.2% (precision = 22.1%)
Best precision = 66.2% (recall = 46.9%)

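The slides do not say which measure was used to compare the two retrieved document sets. As a minimal sketch, assuming a plain Jaccard overlap of the top-k document IDs and a hypothetical decision threshold, the linking step might look like this (the function names, the threshold value and the toy document IDs are all illustrative, not taken from the talk):

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of document IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link(generif_docs, go_docs, k=200, threshold=0.1):
    """Compare the top-k ranked document IDs retrieved by each query.

    Returns (score, linked). The threshold is an assumed tuning knob:
    raising it should favour precision, lowering it should favour recall.
    """
    score = jaccard(generif_docs[:k], go_docs[:k])
    return score, score >= threshold


# Toy ranked result lists (made-up IDs), compared on the top 4 only.
generif_hits = ["d1", "d2", "d3", "d4"]
go_hits = ["d3", "d4", "d5", "d6"]
score, linked = link(generif_hits, go_hits, k=4, threshold=0.3)
```

Sweeping such a threshold is one way a recall-oriented and a precision-oriented operating point, like the two reported above, could arise.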
GO Dependence Relations
Previous work (PSB)
–Using substring matching between GO codes
–Derived from annotation databases, using vector space models, co-occurrence, association rule-mining
ChEBI
–Chemical Entities of Biological Interest
–Preferred names + synonyms
–IS_A (poly)hierarchy

Methods
String matching
If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship
–First order relationship
–ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity
Also in a dependence relationship with the ancestors
–Second order relationship

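The whole-word rule for first order relationships can be sketched with a regular expression. This is an illustration under assumptions rather than the method actually used: the helper names are hypothetical, "word boundary" is taken to mean "not adjacent to a letter or digit", and the term list is a toy sample.

```python
import re
from itertools import combinations


def contains_entity(go_term, entity):
    """True if `entity` occurs in `go_term` as a whole word, i.e. it is
    bounded by string edges, whitespace or punctuation on both sides."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_term, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, entities):
    """Pair up GO terms that mention the same ChEBI entity."""
    pairs = set()
    for entity in entities:
        hits = sorted(t for t in go_terms if contains_entity(t, entity))
        for a, b in combinations(hits, 2):
            pairs.add((a, b, entity))
    return pairs


# Toy input: "carbon" is inside "carbonic" but not as a whole word.
go_terms = [
    "carbonic anhydrase activity",
    "carbon-oxygen lyase activity",
    "carbon utilization",
]
pairs = first_order_pairs(go_terms, ["carbon", "oxygen"])
```

Here "carbonic anhydrase activity" is left out because "carbon" is followed by a letter, while the hyphen in "carbon-oxygen" counts as punctuation. Second order relationships would additionally need the ChEBI IS_A hierarchy, which is not modelled in this sketch.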
Results
55% of GO terms contain a ChEBI entity
56% of dependent pairs with a ChEBI term found in the PSB study were identified in this study
Less than 1% of GO term pairs found in this study were identified by the PSB study
Issues
–How to validate potential relationships?
–Usual naming/synonym ambiguity!
–Substrings not used: imidazolonepropionase

Disease Text Classification
Task: classification of text into one of 26 disease classes
Used full text and weighted sections according to information distribution published by other groups

Data Preparation
HTML full text documents, semi-automatic section division
Tokenisation, stemming, stop word filtering, part of speech tagging
Dataset: 21*25 positive full text articles, 33 negative full text articles
10-fold cross validation
Nearest centroid classifier

Results
Baseline: 56% F-score
Additional preprocessing: 67%
–10,000 stopword filter
–Only nouns
Section weighting: 74%
–Abstract and Introduction weighted highest

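A nearest centroid classifier with section weighting can be sketched in a few lines. The weights, the cosine similarity, the class labels and the toy documents below are all assumptions made for illustration. The slides only state that Abstract and Introduction were weighted highest, not the values or the similarity measure.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: Abstract and Introduction count double.
SECTION_WEIGHTS = {"abstract": 2.0, "introduction": 2.0}


def vectorise(doc):
    """doc maps section name -> list of tokens; returns weighted term counts."""
    vec = Counter()
    for section, tokens in doc.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for token in tokens:
            vec[token] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(labelled_docs):
    """Average the weighted vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter()
    for label, doc in labelled_docs:
        sums[label].update(vectorise(doc))
        counts[label] += 1
    return {label: {t: x / counts[label] for t, x in vec.items()}
            for label, vec in sums.items()}


def predict(centroids, doc):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorise(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


training = [
    ("asthma", {"abstract": ["asthma", "airway"], "methods": ["cohort"]}),
    ("diabetes", {"abstract": ["insulin", "glucose"], "methods": ["cohort"]}),
]
centroids = train(training)
label = predict(centroids, {"abstract": ["airway"],
                            "methods": ["cohort", "glucose"]})
```

In this toy case the abstract term "airway" outweighs the methods term "glucose", so the document goes to the first class. With uniform weights the two classes would tie.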
From Nonsense to Sense in Healthcare
Questions
–Diagnosis, Prognosis, Therapy, Prevention
Medicine finds disease mechanisms by first finding cures
–Currently by trial and error: try drug, then test
–Future: test, then try drug
Biomarkers
–Normality -> dysfunction -> disease
–There are prognostic markers before any diagnostic markers

Integrative Genomics
Looking for hidden connections over a wide field, e.g.
–Immune system works too hard = rheumatoid arthritis
–Immune system doesn't work hard enough = infectious diseases

Term Disambiguation
40% of genes have a homonym problem
For 300 genes = 1 million MEDLINE articles
After disambiguation = 60,000 articles
93% accuracy in assigning the correct ID to ambiguous genes
Use contextual fingerprints:
–Experts choose 5 abstracts about a concept
–Fingerprint then created for that concept
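
The slides do not describe how a fingerprint is built or compared. One plausible reading, sketched here purely as an assumption, is a word profile counted over the expert-chosen abstracts, with an ambiguous mention assigned to the concept whose profile best matches the surrounding text. The gene IDs and abstracts below are invented placeholders.

```python
import math
from collections import Counter


def fingerprint(abstracts):
    """Count, for each word, how many seed abstracts mention it."""
    counts = Counter()
    for text in abstracts:
        counts.update(set(text.lower().split()))
    return counts


def similarity(fp, text):
    """Cosine similarity between a fingerprint and the words of a text."""
    words = Counter(text.lower().split())
    dot = sum(fp[w] * c for w, c in words.items())
    norm = (math.sqrt(sum(v * v for v in fp.values()))
            * math.sqrt(sum(c * c for c in words.values())))
    return dot / norm if norm else 0.0


def assign_id(fingerprints, text):
    """Pick the concept ID whose fingerprint best matches the text."""
    return max(fingerprints, key=lambda gid: similarity(fingerprints[gid], text))


# Two hypothetical genes sharing one symbol, each with made-up seed abstracts.
fingerprints = {
    "GENE_A": fingerprint(["prostate tumour marker serum",
                           "serum marker prostate cancer"]),
    "GENE_B": fingerprint(["peptidase enzyme cytosol",
                           "cytosol enzyme cleaves peptides"]),
}
chosen = assign_id(fingerprints, "elevated serum marker in prostate cancer")
```

A real system would use five abstracts per concept, as the slide states, and would most likely match concepts rather than raw words.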