CIS Term Project Proposal November 1, 2002 Sharon Diskin


1 CIS 630 - Term Project Proposal
Classification of Gene-Phenotype Co-Occurrences in Biological Literature Using Maximum Entropy
Sharon Diskin, November 1, 2002

2 Motivation
- Numerous biological databases are manually curated
  - a painstakingly slow process in which curators review the relevant literature
  - any reliable automation of this process would be of great help
- A wealth of biological literature is available
  - Medline currently contains over 12 million journal articles
- Perhaps we can use the manually curated data in the biological databases to help with the task of information extraction from biological literature
  - automatic annotation of abstracts or complete articles

3 Pilot Study: Genes and Disease
[Diagram labels: OMIM MorbidMap; Medline Abstracts; Pattern Matching; Annotated Abstracts (Automatic); WordFreak; Annotated Abstracts (Manual); Classification. Regions labeled "Andy Schein’s Work" and "This Term Project".]
Question for the classifier: Does this co-occurrence of gene and phenotype belong to our “OMIM Relation”?
Notes:
- OMIM: an overview of genes and genetic phenotypes (termed ‘disease’ from here on)
  - started by Victor McKusick at Johns Hopkins in the late 50’s
  - maintained by researchers at Johns Hopkins and around the world; derived from the biological literature
  - made available to the public through NCBI (National Center for Biotechnology Information) at NIH
  - approx. [number missing] diseases cataloged, simple Mendelian as well as some complex
- Medline: over 12 million journal articles

4 Some Examples of Automated Annotation
True Positive: “Angiotensin converting-enzyme (ACE) inhibitors decrease mortality after myocardial infarction among patients with depressed left ventricular function.”
False Positive: “In the present study, we screened four cell lines of human neuroblastoma (NB-1, NB-16, NB-19, and NH-6) for tumorigenicity and metastatic capacity in nude mice and found that NB-19 cells caused osteolytic lesions after s.c. injection into mice.”
Notes:
- ACE: an enzyme involved in blood pressure regulation and in susceptibility to myocardial infarction (heart attack)
- NB: here NB refers to cell lines and not to a gene. We want to be able to distinguish between these cases.
- The classification task is divided into 3 phases: feature selection, model building, evaluation
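The two examples above can be reproduced with a naive sentence-level matcher. This is only a sketch: the tiny lexicons and the function name `find_cooccurrences` are hypothetical stand-ins, not the pattern matcher actually used in the pilot study. It shows why a cell-line name such as NB-19 is flagged as a gene mention.

```python
import re

# Hypothetical mini-lexicons (assumption); a real run would draw on OMIM MorbidMap.
GENE_SYMBOLS = {"ACE", "NB"}
PHENOTYPES = {"myocardial infarction", "neuroblastoma"}


def find_cooccurrences(sentence):
    """Return (gene, phenotype) pairs that both appear in one sentence."""
    lowered = sentence.lower()
    # Gene symbols are matched as whole, case-sensitive tokens.
    genes = sorted(g for g in GENE_SYMBOLS
                   if re.search(r"\b" + re.escape(g) + r"\b", sentence))
    # Phenotype names are matched as case-insensitive substrings.
    phenotypes = sorted(p for p in PHENOTYPES if p in lowered)
    return [(g, p) for g in genes for p in phenotypes]


true_positive = ("Angiotensin converting-enzyme (ACE) inhibitors decrease "
                 "mortality after myocardial infarction among patients with "
                 "depressed left ventricular function.")
false_positive = ("we screened four cell lines of human neuroblastoma "
                  "(NB-1, NB-16, NB-19, and NH-6) for tumorigenicity")

print(find_cooccurrences(true_positive))   # ACE with myocardial infarction
print(find_cooccurrences(false_positive))  # NB with neuroblastoma (spurious)
```

The matcher cannot tell the two cases apart, because the token "NB" inside "NB-1" matches the gene symbol. Separating such cases is the job of the classifier.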

5 Phase I – Feature Selection
- Analysis of Corpus
  - Interested in: binary classification of gene-phenotype pairs that co-occur in a given sentence (in our relation or not?)
  - Question: where are the meaningful words located? Between gene and disease? Is it sufficient to look only at a single sentence?
- Vocabulary Selection
  - Simple bag of words with threshold
  - Top words based on mutual information
- Word Counts as Features
  - Raw counts of words vs. scaled counts
  - Consider the use of positional information
  - If we only look at the sentence level, then perhaps there is no need to scale
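The two vocabulary selection options and the raw word-count features can be sketched as follows. The whitespace tokenizer, the toy sentences (with gene and disease mentions replaced by the placeholders GENE and DISEASE), and all function names are assumptions made for illustration, not part of the proposal.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return sentence.lower().split()


def threshold_vocabulary(sentences, min_count=2):
    """Bag of words with threshold: keep words seen at least min_count times."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    return {w for w, c in counts.items() if c >= min_count}


def mutual_information(sentences, labels, word):
    """Mutual information (bits) between presence of `word` and the class label."""
    n = len(sentences)
    joint = Counter((word in tokenize(s), y) for s, y in zip(sentences, labels))
    mi = 0.0
    for (present, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (p, _), v in joint.items() if p == present) / n
        p_y = sum(v for (_, lab), v in joint.items() if lab == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def top_words_by_mi(sentences, labels, k):
    """Rank all words by mutual information; ties are broken alphabetically."""
    vocab = {w for s in sentences for w in tokenize(s)}
    ranked = sorted(vocab,
                    key=lambda w: (-mutual_information(sentences, labels, w), w))
    return ranked[:k]


def word_count_features(sentence, vocab):
    """Raw (unscaled) counts of vocabulary words in one sentence."""
    counts = Counter(tokenize(sentence))
    return {w: counts[w] for w in sorted(vocab) if counts[w] > 0}


# Toy corpus (assumption): label 1 = in the relation, 0 = not in the relation.
sentences = ["GENE mutations cause DISEASE",
             "GENE variants cause DISEASE",
             "GENE cells injected in DISEASE mice",
             "DISEASE cells express GENE"]
labels = [1, 1, 0, 0]

print(threshold_vocabulary(sentences))
print(top_words_by_mi(sentences, labels, 2))
print(word_count_features(sentences[0], {"cause", "cells"}))
```

On this toy corpus the frequency threshold keeps the placeholders "gene" and "disease", which occur in every sentence and carry no class information, while the mutual information ranking prefers "cause" and "cells".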

6 Phase II – Maximum Entropy Model
- Estimate the conditional distribution of the class label given an instance of gene-phenotype co-occurrence
  - an instance is a gene and a phenotype that co-occur in a sentence
  - each labeled instance is represented as a set of word count features
- Use the labeled training data to estimate the expected value of the word counts (features) for each class
  - the training data are used to set constraints on the conditional probability
- Use Improved Iterative Scaling (IIS) to find a classifier of exponential form that satisfies the constraints represented by the training data
  - this calculates the parameters of the maximum entropy model
- Max Ent: among the distributions that satisfy the constraints, should prefer the most uniform
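A minimal sketch of a classifier of this exponential form is given below. Note the substitution: for brevity it fits the weights by plain gradient ascent on the conditional log-likelihood rather than by IIS. Both aim to maximize the same likelihood, and the gradient is exactly the gap between the empirical and model-expected feature counts that the constraints drive to zero. The function names, toy data, and hyperparameters are assumptions.

```python
import math


def predict_proba(weights, features):
    """p(c | x) proportional to exp(sum over words of weight[c][w] * count_w(x))."""
    scores = {c: sum(lam.get(w, 0.0) * v for w, v in features.items())
              for c, lam in weights.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def train_maxent(instances, labels, classes=(0, 1), steps=200, rate=0.5):
    """Fit weights by gradient ascent (a stand-in for IIS, see the note above)."""
    weights = {c: {} for c in classes}
    n = len(instances)
    for _ in range(steps):
        gradient = {c: {} for c in classes}
        for features, label in zip(instances, labels):
            probs = predict_proba(weights, features)
            for c in classes:
                indicator = 1.0 if c == label else 0.0
                for w, v in features.items():
                    # empirical feature count minus model-expected feature count
                    gradient[c][w] = (gradient[c].get(w, 0.0)
                                      + (indicator - probs[c]) * v)
        for c in classes:
            for w, g in gradient[c].items():
                weights[c][w] = weights[c].get(w, 0.0) + rate * g / n
    return weights


# Toy word-count features (assumption): label 1 = in the relation, 0 = not.
train_x = [{"cause": 1, "mutations": 1}, {"cause": 1, "variants": 1},
           {"cells": 1, "mice": 1}, {"cells": 1, "express": 1}]
train_y = [1, 1, 0, 0]

weights = train_maxent(train_x, train_y)
print(predict_proba(weights, {"cause": 1}))  # leans toward class 1
print(predict_proba(weights, {"cells": 1}))  # leans toward class 0
print(predict_proba(weights, {}))            # no evidence: uniform
```

With no active features the model falls back to the uniform distribution, which matches the "prefer most uniform" remark: the model departs from uniform only as far as the observed feature counts require.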

7 Phase III – Evaluation
Cross Validation on Labeled Examples
- Manual Annotation (based on Andy’s review of automated annotation)
- Automatic Annotation (based on Andy’s pattern matching)
Some questions we are interested in:
- What is the accuracy?
- What are the sources of error? Poor feature selection? Have we oversimplified the problem? How can we improve?
- How much does our accuracy suffer if we use only automatic annotation?
- Can we improve the automated annotation (and hence our classification accuracy)?
- Does this method have potential for extracting information from the literature that does not yet exist in a structured database?
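Cross validation on the labeled examples could look roughly like the sketch below. It is generic over the training and prediction functions. The majority-class baseline, the strided fold assignment, and the toy labels are assumptions chosen to keep the example self-contained; in practice the maximum entropy model would be plugged in instead.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Assign indices 0..n-1 to k folds by striding (requires k <= n)."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(instances, labels, train_fn, predict_fn, k=5):
    """Train on k-1 folds, test on the held-out fold; return per-fold accuracy."""
    accuracies = []
    for test_idx in k_fold_indices(len(instances), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(instances) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, instances[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


def train_majority(train_x, train_y):
    # Baseline "model": the most frequent label in the training folds.
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, features):
    return model


# Toy data (assumption): the features are ignored by the baseline.
instances = [{"cause": 1}] * 6
labels = [1, 1, 1, 0, 1, 1]

scores = cross_validate(instances, labels, train_majority, predict_majority, k=3)
print(scores, sum(scores) / len(scores))
```

Running the same procedure once with manually annotated labels and once with automatically annotated labels would give a direct estimate of how much accuracy is lost by relying on automatic annotation alone.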

8 Potential Plans for the Future
- Consideration of other classification methods
- Potentially merge work with Ted’s Named Entity tagger for genes
- Try to do some information extraction: how is this gene involved in a given phenotype?
- Try with other databases: are the issues similar?

