Published by Kerry Preston. Modified over 9 years ago.
1
Relevance Detection Approach to Gene Annotation
Aid to automatic annotation of databases
Annotation flow
–Extraction of molecular function of a gene from literature
–Then annotation of this function with a term in a controlled vocabulary
Premise
–If the document sets retrieved by a GeneRIF and a GO concept are similar then a link can be made between them
2
Data
GeneRIF/GO term pairs
–Paired if they reference the same MEDLINE article
–Manually filtered for obvious errors
–550 pairs from 335 distinct genes
GO concept = GO term + definition
GeneRIFs and GO concepts too short for simple keyword matching
Treated as an IR problem
–Similar to TREC novelty track
–Compute relevance and similarity of 2 sentences
3
Document set - TREC Genomics 2003 docs
Each sentence within GeneRIF/GO concept pair treated as IR query
Similarity between the 2 computed based on top 200 docs retrieved by each query
Best Recall = 78.2% (prec = 22.1%)
Best Precision = 66.2% (rec = 46.9%)
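The slide does not say which similarity measure was applied to the two retrieved document sets. As a minimal sketch, assuming a Jaccard overlap over document IDs and a hypothetical decision threshold (both are illustrative choices, not taken from the slides), the linking step could look like this:

```python
def jaccard(docs_a, docs_b):
    """Jaccard overlap between two collections of document IDs."""
    a, b = set(docs_a), set(docs_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def linked(generif_docs, go_docs, top_k=200, threshold=0.1):
    """Decide whether a GeneRIF and a GO concept should be linked.

    generif_docs / go_docs: ranked lists of document IDs returned when each
    sentence is used as an IR query. top_k follows the slide (top 200 docs);
    threshold is a hypothetical value chosen only for illustration.
    """
    return jaccard(generif_docs[:top_k], go_docs[:top_k]) >= threshold
```

Raising the threshold would be expected to trade recall for precision, which matches the two operating points reported on the slide.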
4
GO Dependence Relations
Previous work (PSB)
–Using substring matching between GO codes
–Derived from annotation databases, using vector space models, co-occurrence, association rule-mining
ChEBI: www.ebi.ac.uk/chebi/
–Chemical Entities of Biological Interest
–Preferred names + synonyms
–IS_A (poly)hierarchy
5
Methods
String matching
If the same ChEBI entity is used within 2 GO codes, they are in a dependence relationship
–First order relationship
–ChEBI term must be a whole word or surrounded by punctuation, e.g. carbonic anhydrase activity is not related to carbon-oxygen lyase activity
Also in a dependence relationship with the ancestors
–Second order relationship
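The whole-word rule for first order relationships can be sketched with a regular expression. This is an illustration under assumptions: the function names are hypothetical, matching is case-insensitive, and "surrounded by punctuation" is read as "not adjacent to a letter or digit".

```python
import re


def contains_entity(go_name, entity):
    """True if `entity` occurs in `go_name` as a whole word.

    The entity must not be directly preceded or followed by a letter or
    digit, so "carbon" matches "carbon-oxygen" but not "carbonic".
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_name, flags=re.IGNORECASE) is not None


def shared_entities(go_a, go_b, entities):
    """ChEBI entity names that appear as whole words in both GO term names."""
    return {
        e for e in entities
        if contains_entity(go_a, e) and contains_entity(go_b, e)
    }
```

With the slide's example, `shared_entities("carbonic anhydrase activity", "carbon-oxygen lyase activity", ["carbon", "oxygen"])` returns an empty set, so no first order relationship is proposed.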
6
Results
55% of GO terms contain a ChEBI entity
56% of dependent pairs with a ChEBI term found in PSB study were identified in this study
Less than 1% of GO term pairs found in this study were identified by the PSB study
Issues
–How to validate potential relationships?
–Usual naming/synonym ambiguity!
–Substrings not used: imidazolonepropionase
7
Disease Text Classification
Task: Classification of text into one of 26 disease classes
Used full text and weighted sections according to information distribution published by other groups
8
Data Preparation
HTML full text documents, semi-automatic section division
Tokenisation, stemming, stop word filtering, part-of-speech tagging
Dataset: 21*25 positive full text articles, 33 negative full text articles
10-fold cross validation
Nearest centroid classifier
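A nearest centroid classifier assigns a document to the class whose mean vector is closest. The sketch below assumes raw term counts and cosine similarity; the slides do not state the term weighting or distance actually used, and all function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def centroid(vectors):
    """Mean term-count vector of a list of Counter objects."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def train(labelled_docs):
    """labelled_docs: iterable of (tokens, label). Returns label -> centroid."""
    by_class = defaultdict(list)
    for tokens, label in labelled_docs:
        by_class[label].append(Counter(tokens))
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def predict(centroids, tokens):
    """Return the label whose centroid is most similar to the document."""
    vec = Counter(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

In practice the tokens would be the output of the preprocessing steps listed above (stemmed, stop words removed), and `train`/`predict` would be run once per fold of the cross validation.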
9
Results
Baseline: 56% F-score
Additional preprocessing: 67%
–10,000 stopword filter
–Only nouns
Section weighting: 74%
–Abstract and Introduction weighted highest
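Section weighting can be sketched as scaling each term count by the weight of the section it appears in before the document vector reaches the classifier. The weights below are hypothetical placeholders; the slides only state that Abstract and Introduction were weighted highest.

```python
from collections import Counter

# Hypothetical weights for illustration only; the actual values are not
# given in the slides.
SECTION_WEIGHTS = {
    "abstract": 3.0,
    "introduction": 3.0,
    "methods": 1.0,
    "results": 1.0,
    "discussion": 1.0,
}


def weighted_vector(sections, weights=SECTION_WEIGHTS, default=1.0):
    """Build a term vector in which each occurrence counts by section weight.

    sections: dict mapping section name -> list of tokens in that section.
    Sections missing from `weights` fall back to `default`.
    """
    vec = Counter()
    for name, tokens in sections.items():
        weight = weights.get(name, default)
        for token in tokens:
            vec[token] += weight
    return vec
```

A term that appears once in the abstract and once in the methods would then contribute 4.0 rather than 2.0 to the document vector.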
10
From Nonsense to Sense in Healthcare
Questions: Diagnosis, Prognosis, Therapy, Prevention
Medicine finds disease mechanisms by first finding cures
–Currently by trial and error: try drug then test
–Future: test then try drug
Biomarkers
–Normality -> dysfunction -> disease
–There are prognostic markers before any diagnostic markers
11
Integrative Genomics
Looking for hidden connections over a wide field, e.g.
–Immune system works too hard = rheumatoid arthritis
–Immune system doesn’t work hard enough = infectious diseases
12
Term Disambiguation
40% of genes have a homonym problem
For 300 genes = 1 million MEDLINE articles
After disambiguation = 60,000 articles
93% accuracy in assigning the correct ID to ambiguous genes
Use contextual fingerprints:
–Experts choose 5 abstracts about a concept
–Fingerprint then created for that concept