Classification of Gene-Phenotype Co-Occurrences in Biological Literature Using Maximum Entropy CIS 630 - Term Project Proposal November 1, 2002 Sharon Diskin.



Motivation
Numerous biological databases are manually curated
  – painstakingly slow process: curators review relevant literature
  – any reliable automation of this process would be of great help
Wealth of biological literature available
  – Medline currently contains over 12 million journal articles
Perhaps we can use the manually curated data in the biological databases to help with the task of information extraction from biological literature
  – automatic annotation of abstracts or complete articles

Pilot Study: Genes and Disease
[Diagram labels: OMIM MorbidMap; Medline Abstracts; Annotated Abstracts (Automatic); Andy Schein’s Work; Annotated Abstracts (Manual); This Term Project]
Does this co-occurrence of gene and phenotype belong to our “OMIM Relation”?

Some Examples of Automated Annotation
“In the present study, we screened four cell lines of human neuroblastoma (NB-1, NB-16, NB-19, and NH-6) for tumorigenicity and metastatic capacity in nude mice and found that NB-19 cells caused osteolytic lesions after s.c. injection into mice.”
“Angiotensin converting-enzyme (ACE) inhibitors decrease mortality after myocardial infarction among patients with depressed left ventricular function.”
True Positive: False Positive:

Phase I – Feature Selection
Analysis of corpus
  – Interested in: binary classification of gene-phenotype pairs that co-occur in a given sentence (in our relation or not?)
  – Question: where are the meaningful words located? Between gene and disease? Is it sufficient to look only at a single sentence?
Vocabulary selection
  – Simple bag of words with threshold
  – Top words based on mutual information
Word counts as features
  – Raw counts of words vs. scaled counts
Consider the use of positional information
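The vocabulary selection and word-count features of Phase I can be sketched roughly as below. This is only an illustration under stated assumptions, not the project's actual code: the tokenizer, the function names, the threshold value, and the four toy sentences with their labels are all made up for the example, and mutual information is computed between word presence in a sentence and the binary class label.

```python
import math
from collections import Counter

def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation (a crude stand-in)."""
    tokens = (tok.strip(".,;:()\"'") for tok in sentence.lower().split())
    return [tok for tok in tokens if tok]

def threshold_vocabulary(sentences, min_count=2):
    """Bag-of-words vocabulary: keep words seen at least min_count times in the corpus."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {w for w, c in counts.items() if c >= min_count}

def mutual_information(sentences, labels, word):
    """Mutual information (bits) between presence of `word` and the class label."""
    n = len(sentences)
    joint = Counter()
    for s, y in zip(sentences, labels):
        joint[(word in tokenize(s), y)] += 1
    mi = 0.0
    for (present, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (p, _), v in joint.items() if p == present) / n
        p_y = sum(v for (_, lab), v in joint.items() if lab == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def top_words_by_mi(sentences, labels, vocabulary, k):
    """The k vocabulary words with the highest mutual information (ties broken alphabetically)."""
    ranked = sorted(vocabulary,
                    key=lambda w: (-mutual_information(sentences, labels, w), w))
    return ranked[:k]

def count_features(sentence, vocabulary, scaled=False):
    """Raw word counts restricted to the vocabulary, or counts scaled to sum to 1."""
    counts = Counter(t for t in tokenize(sentence) if t in vocabulary)
    total = sum(counts.values())
    if scaled and total:
        return {w: c / total for w, c in counts.items()}
    return dict(counts)

# Made-up toy corpus: label 1 = "in our relation", label 0 = "not in our relation".
toy_sentences = [
    "gene mutation causes disease",
    "gene mutation causes syndrome",
    "gene expressed in cells",
    "gene expressed in tissue",
]
toy_labels = [1, 1, 0, 0]

vocab = threshold_vocabulary(toy_sentences, min_count=2)
print(sorted(vocab))
print(top_words_by_mi(toy_sentences, toy_labels, vocab, k=4))
print(count_features(toy_sentences[0], vocab, scaled=True))
```

Positional information (for example, counting only the words between the gene and the phenotype mention) could be added by restricting which tokens are passed to `count_features`; that variant is not shown here.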

Phase II – Maximum Entropy Model
Estimate the conditional distribution of the class label given an instance of gene-phenotype co-occurrence
  – gene and phenotype co-occur in a sentence
  – labeled instance represented as a set of word count features
Use the labeled training data to estimate the expected value of the word counts (features) for each class
  – training data used to set constraints on the conditional probability
Use Improved Iterative Scaling (IIS) to find a classifier of exponential form that satisfies the constraints represented by the training data
  – calculate the parameters of the maximum entropy model
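A minimal sketch of the exponential-form classifier described in Phase II follows. Note the substitution: instead of Improved Iterative Scaling, it fits the parameters by plain gradient ascent on the conditional log-likelihood, whose gradient is exactly the gap between the empirical and model-expected feature values per class, so it pushes toward the same constraints. The function names, learning rate, epoch count, and toy feature dictionaries are assumptions made for illustration.

```python
import math

def predict_proba(weights, features):
    """p(y | x) proportional to exp(sum over words of lambda[y][word] * count(word))."""
    scores = {y: sum(lam.get(w, 0.0) * v for w, v in features.items())
              for y, lam in weights.items()}
    top = max(scores.values())                      # subtract max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_maxent(instances, labels, classes=(0, 1), epochs=200, lr=0.5):
    """Gradient ascent on conditional log-likelihood (a stand-in for IIS, not IIS itself)."""
    weights = {y: {} for y in classes}
    n = len(instances)
    for _ in range(epochs):
        grad = {y: {} for y in classes}
        for x, y_true in zip(instances, labels):
            probs = predict_proba(weights, x)
            for y in classes:
                indicator = 1.0 if y == y_true else 0.0
                for w, v in x.items():
                    # empirical feature value minus model-expected feature value
                    grad[y][w] = grad[y].get(w, 0.0) + (indicator - probs[y]) * v
        for y in classes:
            for w, g in grad[y].items():
                weights[y][w] = weights[y].get(w, 0.0) + lr * g / n
    return weights

# Made-up word-count features; label 1 = "in our relation", 0 = "not in our relation".
toy_instances = [
    {"gene": 1, "mutation": 1},
    {"gene": 1, "mutation": 2},
    {"gene": 1, "expressed": 1},
    {"gene": 1, "expressed": 1, "cells": 1},
]
toy_labels = [1, 1, 0, 0]

model = train_maxent(toy_instances, toy_labels)
print(predict_proba(model, {"gene": 1, "mutation": 1}))
print(predict_proba(model, {"gene": 1, "expressed": 1}))
```

On separable toy data like this the weights keep growing with more epochs; a real run would likely need a stopping criterion or a prior on the parameters.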

Phase III – Evaluation
Cross validation on labeled examples
  – Manual annotation (based on Andy’s review of automated annotation)
  – Automatic annotation (based on Andy’s pattern matching)
Some questions we are interested in:
  – What is the accuracy?
  – What are the sources of error? Poor feature selection? Have we oversimplified the problem? How can we improve?
  – How much does our accuracy suffer if we use only automatic annotation? Can we improve the automated annotation (and hence our classification accuracy)?
  – Does this method have potential for extracting information from the literature that does not yet exist in a structured database?
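The cross-validation step of Phase III might look something like the sketch below. It is generic over the classifier: `train_fn` and `predict_fn` are hypothetical callables, and a majority-class baseline stands in for the maximum entropy model so the example stays self-contained. The fold count and toy labels are made up, and the folds are contiguous and unshuffled, which a real experiment would probably want to change.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one (assumes k <= n)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(instances, labels, train_fn, predict_fn, k=5):
    """Mean held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(instances), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(instances) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, instances[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)

# Majority-class baseline used as a placeholder classifier.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Made-up labels; the instances are empty feature dicts because the baseline ignores them.
toy_labels = [1, 1, 1, 0, 1, 1]
toy_instances = [{} for _ in toy_labels]

baseline_accuracy = cross_validate(toy_instances, toy_labels,
                                   train_majority, predict_majority, k=3)
print(baseline_accuracy)
```

To compare manual and automatic annotation, one option is to train on the automatically labeled examples in each fold and score against the manual labels of the held-out examples; that would need a second label list passed through `cross_validate`, which this sketch does not do.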

Potential Plans for the Future
Consideration of other classification methods
Potentially merge work with Ted’s Named Entity tagger for genes
Try to do some information extraction
  – how is this gene involved in a given phenotype?
Try with other databases
  – are the issues similar?