1
Classification of Gene-Phenotype Co-Occurrences in Biological Literature Using Maximum Entropy
CIS 630 - Term Project Proposal
November 1, 2002
Sharon Diskin
2
Motivation
Numerous biological databases are manually curated
– painstakingly slow process in which curators review relevant literature
– any reliable automation of this process would be of great help
Wealth of biological literature available
– Medline currently contains over 12 million journal articles
Perhaps we can use the manually curated data in the biological databases to help with the task of information extraction from biological literature
– Automatic annotation of abstracts or complete articles
3
Pilot Study: Genes and Disease
[Diagram labels: OMIM MorbidMap; Medline Abstracts; Annotated Abstracts (Automatic); Annotated Abstracts (Manual); Andy Schein’s Work; This Term Project]
Does this co-occurrence of gene and phenotype belong to our “OMIM Relation”?
4
Some Examples of Automated Annotation
“In the present study, we screened four cell lines of human neuroblastoma (NB-1, NB-16, NB-19, and NH-6) for tumorigenicity and metastatic capacity in nude mice and found that NB-19 cells caused osteolytic lesions after s.c. injection into mice.”
“Angiotensin converting-enzyme (ACE) inhibitors decrease mortality after myocardial infarction among patients with depressed left ventricular function.”
True Positive: False Positive:
5
Phase I – Feature Selection
Analysis of Corpus
– Interested in: binary classification of gene-phenotype pairs that co-occur in a given sentence (in our relation or not?)
– Question: where are the meaningful words located? Between gene and disease? Is it sufficient to only look at a single sentence?
Vocabulary Selection
– Simple bag of words with threshold
– Top words based on mutual information
Word Counts as Features
– Raw counts of words vs. scaled counts
Consider the use of positional information
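The vocabulary-selection options above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the whitespace tokenizer, the function names, the `GENE`/`DISEASE` placeholder tokens, and the four toy sentences are all assumptions made for the example. Mutual information is computed between the presence of a word in a sentence and the binary class label.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return sentence.lower().split()

def vocab_by_threshold(sentences, min_count=2):
    """Bag-of-words vocabulary: keep words seen at least min_count times."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    return {w for w, c in counts.items() if c >= min_count}

def mutual_information(sentences, labels, word):
    """Mutual information (bits) between word presence and the class label."""
    n = len(sentences)
    joint = Counter()
    for s, y in zip(sentences, labels):
        joint[(word in tokenize(s), y)] += 1
    mi = 0.0
    for (present, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (p, _), v in joint.items() if p == present) / n
        p_y = sum(v for (_, l), v in joint.items() if l == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def top_words_by_mi(sentences, labels, vocab, k):
    """Top-k vocabulary words by mutual information (ties broken alphabetically)."""
    return sorted(vocab, key=lambda w: (-mutual_information(sentences, labels, w), w))[:k]

def count_features(sentence, vocab, scale=False):
    """Raw word counts, or counts scaled by sentence length when scale=True."""
    tokens = tokenize(sentence)
    counts = Counter(t for t in tokens if t in vocab)
    if scale and tokens:
        return {w: c / len(tokens) for w, c in counts.items()}
    return dict(counts)

# Hypothetical toy corpus: 1 = pair is in the relation, 0 = it is not.
sentences = [
    "mutations in GENE cause DISEASE",
    "GENE mutations cause familial DISEASE",
    "GENE inhibitors decrease mortality after DISEASE",
    "GENE inhibitors were given to patients with DISEASE",
]
labels = [1, 1, 0, 0]
vocab = vocab_by_threshold(sentences, min_count=2)
best = top_words_by_mi(sentences, labels, vocab, k=3)
```

In this toy corpus the placeholder tokens occur in every sentence, so they carry no information about the label, while words such as "mutations" separate the two classes perfectly.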
6
Phase II – Maximum Entropy Model
Estimate the conditional distribution of the class label given an instance of gene-phenotype co-occurrence
– co-occur in a sentence
– labeled instance represented as set of word count features
Use the labeled training data to estimate the expected value of the word counts (features) for each class
– training data used to set constraints on conditional probability
Use Improved Iterative Scaling (IIS) to find a classifier of an exponential form which satisfies the constraints represented by the training data
– calculate parameters of maximum entropy model
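A minimal sketch of the exponential-form classifier and its constraints is given below. Note that it does not implement IIS: for brevity it fits the parameters by plain gradient ascent on the conditional log-likelihood, where the gradient for each (class, word) weight is the empirical expected count minus the model's expected count. The small L2 penalty, the step size, the function names, and the toy instances are assumptions made for the example.

```python
import math

def predict_proba(weights, x):
    """p(y | x) proportional to exp(sum_i weights[y][word_i] * count_i)."""
    scores = {y: sum(w.get(word, 0.0) * c for word, c in x.items())
              for y, w in weights.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def expected_counts(weights, data, empirical):
    """Expected word counts per class: from the labels (empirical=True)
    or under the model's conditional distribution (empirical=False)."""
    exp = {y: {} for y in weights}
    for x, y_true in data:
        probs = predict_proba(weights, x)
        for y in weights:
            mass = (1.0 if y == y_true else 0.0) if empirical else probs[y]
            for word, c in x.items():
                exp[y][word] = exp[y].get(word, 0.0) + mass * c
    return exp

def train_maxent(data, classes, steps=500, lr=0.1, l2=0.01):
    """Gradient ascent stand-in for IIS (assumption, not the proposal's method)."""
    weights = {y: {} for y in classes}
    target = expected_counts(weights, data, empirical=True)  # the constraints
    for _ in range(steps):
        model = expected_counts(weights, data, empirical=False)
        for y in classes:
            for word in target[y]:
                w = weights[y].get(word, 0.0)
                grad = target[y][word] - model[y][word] - l2 * w
                weights[y][word] = w + lr * grad
    return weights

# Hypothetical labeled instances: (word-count features, class label).
data = [
    ({"mutations": 1, "cause": 1}, 1),
    ({"mutations": 1, "familial": 1}, 1),
    ({"inhibitors": 1, "mortality": 1}, 0),
    ({"inhibitors": 1, "patients": 1}, 0),
]
weights = train_maxent(data, classes=[0, 1])
```

After training, the model's expected word counts per class should be close to the empirical ones, which is the sense in which the training data constrain the conditional distribution. With the L2 penalty the match is approximate rather than exact.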
7
Phase III – Evaluation
Cross Validation on Labeled Examples
– Manual Annotation (based on Andy’s review of automated annotation)
– Automatic Annotation (based on Andy’s pattern matching)
Some questions we are interested in:
– What is the accuracy?
– What are the sources of error? Poor feature selection? Have we oversimplified the problem? How can we improve?
– How much does our accuracy suffer if we use only automatic annotation? Can we improve the automated annotation (and hence our classification accuracy)?
– Does this method have potential for extracting information from the literature that does not yet exist in a structured database?
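One way the cross-validation step could be organized is sketched below. The harness takes separate training and evaluation labels so that, for instance, a classifier trained on automatic annotation can be scored against manual annotation. The function names, the interleaved fold assignment, and the majority-class baseline (a placeholder for the real classifier) are assumptions made for the example.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(instances, train_labels, eval_labels, train_fn, predict_fn, k=5):
    """Mean held-out accuracy over k folds.

    train_labels are used to fit each fold's model; eval_labels are used to
    score it. Pass the same list twice for ordinary cross-validation.
    """
    n = len(instances)
    accuracies = []
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_x = [instances[i] for i in range(n) if i not in held_out]
        train_y = [train_labels[i] for i in range(n) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, instances[i]) == eval_labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Placeholder classifier: always predict the majority training label.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Hypothetical labels for ten instances (feature content is irrelevant here).
toy_instances = list(range(10))
toy_labels = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
baseline_accuracy = cross_validate(toy_instances, toy_labels, toy_labels,
                                   train_majority, predict_majority, k=5)
```

A majority-class baseline of this kind gives a floor against which the accuracy of the maximum entropy classifier can be judged.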
8
Potential Plans for the Future
Consideration of other classification methods
Potentially merge work with Ted’s Named Entity tagger for genes
Try to do some information extraction
– how is this gene involved in a given phenotype?
Try with other databases
– are issues similar?