DI FC UL
Gene Function Prediction by Mining Biomedical Literature
Pooja Jain, Master in Bioinformatics
Supervisor: Mário Jorge Costa Gaspar da Silva
Co-Supervisor: Jörg Dieter Becker
Introduction: Context
- The central problems of the post-genomic era:
  - Data management
  - Annotation of the data
- Annotation is crucial
- Main source of annotations: published literature
Introduction: Motivation & Objective
- The main motivations were:
  - The lack of functional annotations
  - Reducing the time and effort needed for manual annotation
- Objective: prediction of gene functions from biomedical literature
Outline of Presentation
- Introduction
- Concepts
- ProFAL
- APEG
- Results
- Conclusions & Future Directions
Some Concepts
- Text Mining
- Ontology
- Information Modeling
Related Work: ProFAL
[Diagram labels: Biological Database, Literature Database, GOA, GO Terms, FiGO, Relevant Documents, Annotations, Validated Annotations. Stages: Retrieval, Extraction, Validation.]
APEG (Arabidopsis Pollen Expressed Genes)
- What is APEG?
  - Repository of Arabidopsis pollen-expressed genes
  - Web interface for different user types
- What are its contents?
  - Results from expression studies
  - Cross-references to GenBank, SwissProt and TAIR
  - Cross-references to relevant literature
  - Automatically extracted knowledge
ProFAL APEG: Class Model
Population of APEG
- Genome chip: get probe set identifier
- Search at TAIR: get TAIR Id, SwissProt Id and GenBank Id
- Search at Pfam: get family
- Result: input for APEG
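The population steps can be pictured as a chain of lookups that ends in one record per gene. The following is only a minimal sketch under stated assumptions: the `GeneRecord` fields follow the identifiers named on the slide, while `build_record`, the two lookup tables, the choice of the SwissProt Id as the key for the Pfam search, and every identifier value are hypothetical placeholders rather than real TAIR or Pfam data or interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GeneRecord:
    """One APEG input row; field names mirror the identifiers on the slide."""
    probe_set_id: str
    tair_id: Optional[str] = None
    swissprot_id: Optional[str] = None
    genbank_id: Optional[str] = None
    pfam_family: Optional[str] = None


def build_record(probe_set_id: str,
                 tair_lookup: Dict[str, Dict[str, str]],
                 pfam_lookup: Dict[str, str]) -> GeneRecord:
    """Chain the lookups: probe set -> TAIR cross-references -> Pfam family.

    Both tables are hypothetical stand-ins for the searches at TAIR and Pfam.
    """
    record = GeneRecord(probe_set_id=probe_set_id)
    xrefs = tair_lookup.get(probe_set_id)
    if xrefs is None:
        return record  # probe set not found: leave the other fields empty
    record.tair_id = xrefs.get("tair")
    record.swissprot_id = xrefs.get("swissprot")
    record.genbank_id = xrefs.get("genbank")
    # Assumption: the Pfam family is looked up by SwissProt Id.
    if record.swissprot_id is not None:
        record.pfam_family = pfam_lookup.get(record.swissprot_id)
    return record


# Placeholder identifiers, invented purely for illustration.
TAIR_LOOKUP = {"PROBE_1": {"tair": "TAIR_1", "swissprot": "SP_1", "genbank": "GB_1"}}
PFAM_LOOKUP = {"SP_1": "FAMILY_1"}

record = build_record("PROBE_1", TAIR_LOOKUP, PFAM_LOOKUP)
print(record)
```

A probe set that is missing from the TAIR table yields a record with only `probe_set_id` set, so incomplete genes can still be stored and revisited later.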
Document Retrieval
- Inputs: TAIR Id, SwissProt Id, GenBank Id
- Search at PubMed
- Get PubMed Id
- Get abstract
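One way to script the retrieval steps is through the NCBI E-utilities, using `esearch` to obtain PubMed Ids and `efetch` to obtain abstracts. The slides do not say how PubMed was actually queried, so this is an assumption, as are the helper names `build_search_url` and `build_fetch_url`, the `OR` query, and the placeholder identifiers. The sketch only builds the request URLs and performs no network access.

```python
from typing import Iterable, List
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_search_url(identifiers: Iterable[str], retmax: int = 20) -> str:
    """URL for an esearch query matching any of the gene identifiers."""
    term = " OR ".join(identifiers)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_fetch_url(pubmed_ids: List[str]) -> str:
    """URL for an efetch request returning plain-text abstracts."""
    params = {
        "db": "pubmed",
        "id": ",".join(pubmed_ids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder identifiers and PubMed Ids, invented for illustration.
search_url = build_search_url(["TAIR_1", "SP_1", "GB_1"])
fetch_url = build_fetch_url(["111", "222"])
print(search_url)
print(fetch_url)
```

Keeping URL construction separate from the HTTP call makes the query logic easy to check before any request is sent.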
Annotation Extraction
- GO term from GOA: lateral root morphogenesis
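To give a feel for what extracting a GO term from an abstract involves, the sketch below uses plain word containment: a term is reported when every word of its name occurs in the text. This is a deliberately simplified stand-in and not the FiGO method used by ProFAL. The function names, the placeholder term identifiers and the example sentence are all invented for illustration.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_go_terms(abstract: str, go_terms: Dict[str, str]) -> List[str]:
    """Return ids of GO terms whose every word occurs in the abstract."""
    words = tokenize(abstract)
    return [go_id for go_id, name in go_terms.items() if tokenize(name) <= words]


# Placeholder ids; the term names are only examples.
GO_TERMS = {
    "TERM_A": "lateral root morphogenesis",
    "TERM_B": "pollen tube growth",
}

# Invented sentence standing in for a PubMed abstract.
abstract = "The mutant shows defects in lateral root morphogenesis."
matches = find_go_terms(abstract, GO_TERMS)
print(matches)
```

Word containment ignores word order, negation and abbreviations, which is one way false positives of the kind discussed in the results can arise.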
Results: Evaluation and Validation
[Diagram labels: Document Retrieval, Inspection, Automatic Extraction (observed, by ProFAL), Manual Extraction (expected, by curator), Comparison]
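Comparing the observed annotations (by ProFAL) with the expected ones (by a curator) is commonly summarized with precision and recall. The sketch below shows that standard calculation on sets of (gene, GO term) pairs. The `evaluate` function and the toy pairs are illustrative assumptions and do not reproduce the data of this study.

```python
from typing import Dict, Set, Tuple

Annotation = Tuple[str, str]  # (gene id, GO term id)


def evaluate(observed: Set[Annotation], expected: Set[Annotation]) -> Dict[str, float]:
    """Compare extracted annotations with curated ones."""
    tp = len(observed & expected)   # extracted and confirmed by the curator
    fp = len(observed - expected)   # extracted but not confirmed
    fn = len(expected - observed)   # curated but missed by the extraction
    precision = tp / (tp + fp) if observed else 0.0
    recall = tp / (tp + fn) if expected else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy annotation pairs, invented for illustration.
observed = {("gene1", "term1"), ("gene1", "term2"), ("gene2", "term3")}
expected = {("gene1", "term1"), ("gene2", "term4")}

scores = evaluate(observed, expected)
print(scores)
```

In this toy case one of three extracted annotations is correct, so precision is 1/3, and one of two curated annotations is found, so recall is 1/2.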
Results: Document Retrieval
- 55 distinct documents were retrieved for 71 of the 147 genes (48%), using 117 distinct citations
- SP = SwissProt, GB = GenBank
Results: Annotation Extraction
Results: Observations
- Documents were retrieved for 48% of the genes
- Low precision and recall
Results Analysis
- The main reason was a high number of false positives (FP)
- FP annotations were derived from:
  - Terms from outside the Molecular Function ontology of GO
  - Obsolete and non-existent GO terms
  - Incoherent evidence text
  - Evidence texts containing numbers, abbreviations and negation
Improved Results
- Improvements implemented:
  - Use only GO terms from the Molecular Function ontology
  - Avoid obsolete and non-existent GO terms
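Both improvements amount to a filter over the extracted GO term identifiers. The sketch below assumes the ontology is available as a dictionary of term records whose `namespace` and `is_obsolete` fields follow the tags of the same name in the GO OBO file. The `GoTerm` class, the `filter_extracted` function and all identifiers are hypothetical and are not taken from the ProFAL implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class GoTerm:
    go_id: str
    name: str
    namespace: str           # e.g. "molecular_function", "biological_process"
    is_obsolete: bool = False


def filter_extracted(extracted_ids: Iterable[str],
                     ontology: Dict[str, GoTerm]) -> List[str]:
    """Keep ids that exist, are not obsolete, and are Molecular Function terms."""
    kept = []
    for go_id in extracted_ids:
        term = ontology.get(go_id)
        if term is None:
            continue  # non-existent GO term
        if term.is_obsolete:
            continue  # obsolete GO term
        if term.namespace != "molecular_function":
            continue  # term from another GO ontology
        kept.append(go_id)
    return kept


# Placeholder ontology entries, invented for illustration.
ONTOLOGY = {
    "TERM_MF": GoTerm("TERM_MF", "example binding", "molecular_function"),
    "TERM_BP": GoTerm("TERM_BP", "example process", "biological_process"),
    "TERM_OLD": GoTerm("TERM_OLD", "example old term", "molecular_function", True),
}

kept = filter_extracted(["TERM_MF", "TERM_BP", "TERM_OLD", "TERM_MISSING"], ONTOLOGY)
print(kept)
```

Applying the filter before validation removes three of the false-positive sources listed in the analysis without touching the extraction step itself.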
Discussion
- Specific annotations
- Existing vs. extracted annotations: TP annotations for 20 of 31 genes
- Probable functions: 21 functions for 8 of 31 genes
Conclusions
- APEG database system
- Improvements in ProFAL
- In my opinion, text mining is useful for biologists
Future Directions
- Improvement in document retrieval
- Integration of an NLP technique
- Usability study of the proposed approach
- Validation of the approach with a larger set of genes
Key References
- Couto, F., Silva, M. & Coutinho, P. (2004). FiGO: Finding GO terms in unstructured text. EMBO BioCreative Workshop - Handouts, Granada, Spain.
- Becker, J.D., Boavida, L., Carneiro, J., Haury, M. & Feijó, J.A. (2003). Transcriptional profiling of Arabidopsis tissues reveals the unique characteristics of the pollen transcriptome. Plant Physiology, 133.
- Couto, F., Silva, M. & Coutinho, P. (2003). ProFAL: PROtein Functional Annotation through Literature. In E. Pimentel, N.R. Brisaboa & J. Gomez, eds., VIII Conference on Software Engineering and Databases (JISBD), Alicante, Spain.
- Shatkay, H. & Feldman, R. (2003). Mining the biomedical literature in the genomic era: an overview. Journal of Computational Biology, 10.
- Mack, R. & Hehenberger, M. (2002). Text-based knowledge discovery: search and mining of life-sciences documents. Drug Discovery Today, 7, S89-S98.
- The Gene Ontology Consortium (2001). Creating the Gene Ontology Resource: Design and Implementation. Genome Research, 11.
Thank you for your attention