CIS Term Project Proposal November 1, 2002 Sharon Diskin

Presentation transcript:

Classification of Gene-Phenotype Co-Occurrences in Biological Literature Using Maximum Entropy
CIS 630 - Term Project Proposal, November 1, 2002, Sharon Diskin

Motivation
- Numerous biological databases are manually curated.
  - Curation is a painstakingly slow process in which curators review the relevant literature.
  - Any reliable automation of this process would be of great help.
- A wealth of biological literature is available.
  - Medline currently contains over 12 million journal articles.
- Perhaps we can use the manually curated data in the biological databases to help with the task of information extraction from biological literature.
  - Automatic annotation of abstracts or complete articles.

Pilot Study: Genes and Disease Classification
[Diagram; labels recovered from the slide: OMIM MorbidMap, Medline Abstracts, Pattern Matching, Annotated Abstracts (Automatic), WordFreak, Annotated Abstracts (Manual). Two regions of the diagram are marked "Andy Schein's Work" and "This Term Project".]
Classification question: does this co-occurrence of gene and phenotype belong to our "OMIM Relation"?
OMIM
- Overview of genes and genetic phenotypes (termed "disease" from here on).
- Started by Victor McKusick at Johns Hopkins in the late 1950s.
- Maintained by researchers at Johns Hopkins and around the world; derived from the biological literature.
- Made available to the public through NCBI (National Center for Biotechnology Information) at NIH.
- Approximately 14,000 diseases cataloged: simple Mendelian as well as some complex.
Medline
- Over 12 million journal articles.

Some Examples of Automated Annotation
True Positive: "Angiotensin converting-enzyme (ACE) inhibitors decrease mortality after myocardial infarction among patients with depressed left ventricular function."
- ACE: an enzyme involved in blood pressure regulation and in susceptibility to myocardial infarction (heart attack).
False Positive: "In the present study, we screened four cell lines of human neuroblastoma (NB-1, NB-16, NB-19, and NH-6) for tumorigenicity and metastatic capacity in nude mice and found that NB-19 cells caused osteolytic lesions after s.c. injection into mice."
- NB: here NB refers to cell lines and not to a gene.
We want to be able to distinguish between these cases.
The classification task is divided into 3 phases:
- feature selection
- model building
- evaluation

Phase I – Feature Selection
Analysis of Corpus
- Interested in binary classification of gene-phenotype pairs that co-occur in a given sentence: in our relation or not?
- Question: where are the meaningful words located? Between gene and disease? Is it sufficient to look only at a single sentence?
Vocabulary Selection
- Simple bag of words with threshold
- Top words based on mutual information
Word Counts as Features
- Raw counts of words vs. scaled counts
- Consider the use of positional information
- If we look only at the sentence level, then perhaps there is no need to scale
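The two vocabulary-selection options and the raw vs. scaled counts could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the whitespace tokenizer, the `min_count` threshold, and the toy sentences (with GENE and DISEASE standing in for tagged mentions) are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy, made-up sentences; label 1 = "in our relation", 0 = not.
SENTENCES = [
    "Mutations in GENE cause DISEASE",
    "GENE cells were injected into mice",
    "GENE mutations cause DISEASE in families",
    "DISEASE cells and GENE lines in mice",
]
LABELS = [1, 0, 1, 0]

def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(".,;:()\"'") for tok in sentence.lower().split())
    return [tok for tok in tokens if tok]

def vocabulary_by_threshold(sentences, min_count=2):
    """Bag of words: keep every word seen at least `min_count` times."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {w for w, c in counts.items() if c >= min_count}

def mutual_information(sentences, labels, word):
    """Mutual information (bits) between presence of `word` and the label."""
    n = len(sentences)
    joint = Counter()
    for s, y in zip(sentences, labels):
        joint[(word in tokenize(s), y)] += 1
    mi = 0.0
    for (present, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (p, _), v in joint.items() if p == present) / n
        p_y = sum(v for (_, l), v in joint.items() if l == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def top_words_by_mi(sentences, labels, vocab, k):
    """The k words of `vocab` with the highest mutual information."""
    ranked = sorted(
        vocab, key=lambda w: (-mutual_information(sentences, labels, w), w)
    )
    return ranked[:k]

def count_features(sentence, vocab, scaled=False):
    """Word counts over `vocab`; optionally scaled to sum to 1."""
    counts = Counter(t for t in tokenize(sentence) if t in vocab)
    total = sum(counts.values())
    if scaled and total:
        return {w: c / total for w, c in counts.items()}
    return dict(counts)
```

In this toy data a word such as "gene" appears in every sentence and so carries no information about the label, which is the kind of word a mutual-information ranking would push down while a plain frequency threshold would keep.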

Phase II – Maximum Entropy Model
- Estimate the conditional distribution of the class label given an instance of gene-phenotype co-occurrence.
  - An instance is a gene and a phenotype that co-occur in a sentence.
  - Each labeled instance is represented as a set of word count features.
- Use the labeled training data to estimate the expected value of the word counts (features) for each class.
  - The training data are used to set constraints on the conditional probability.
- Use Improved Iterative Scaling (IIS) to find a classifier of exponential form that satisfies the constraints represented by the training data.
  - This calculates the parameters of the maximum entropy model.
- Max Ent: among the distributions that satisfy the constraints, prefer the most uniform.
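As a rough illustration of the model described above, the sketch below fits a two-class exponential-form classifier on word count features. Note the substitution: it uses plain batch gradient ascent on the conditional log-likelihood rather than Improved Iterative Scaling, because gradient ascent is shorter to write down; both aim at the same constraint, namely that the model's expected feature counts match the counts observed in the training data. The function names, learning rate, epoch count, and toy instances are assumptions for the example.

```python
import math

def predict_proba(weights, bias, features):
    """P(in relation | features) under the exponential (log-linear) form."""
    score = bias + sum(weights.get(w, 0.0) * v for w, v in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def train_maxent(instances, labels, epochs=500, lr=0.1):
    """Batch gradient ascent on the conditional log-likelihood.

    Stand-in for IIS (an assumption of this sketch). The gradient for each
    word is its empirical count in positive instances minus the count the
    current model expects, so it vanishes once the constraints are met.
    """
    weights, bias = {}, 0.0
    n = len(instances)
    for _ in range(epochs):
        grad, grad_bias = {}, 0.0
        for feats, y in zip(instances, labels):
            err = y - predict_proba(weights, bias, feats)
            grad_bias += err
            for w, v in feats.items():
                grad[w] = grad.get(w, 0.0) + err * v
        bias += lr * grad_bias / n
        for w, g in grad.items():
            weights[w] = weights.get(w, 0.0) + lr * g / n
    return weights, bias

def feature_expectations(weights, bias, instances, labels):
    """Empirical vs. model-expected word counts for the positive class."""
    empirical, expected = {}, {}
    for feats, y in zip(instances, labels):
        p = predict_proba(weights, bias, feats)
        for w, v in feats.items():
            empirical[w] = empirical.get(w, 0.0) + y * v
            expected[w] = expected.get(w, 0.0) + p * v
    return empirical, expected

# Toy, made-up word count features; label 1 = "in our relation".
INSTANCES = [
    {"cause": 1, "mutations": 1},
    {"cells": 1, "mice": 1},
    {"cause": 1, "families": 1},
    {"cells": 1, "lines": 1},
]
LABELS = [1, 0, 1, 0]

WEIGHTS, BIAS = train_maxent(INSTANCES, LABELS)
```

Calling `feature_expectations(WEIGHTS, BIAS, INSTANCES, LABELS)` after training gives a quick check of how closely the constraints are satisfied. On perfectly separable toy data like this the unregularized weights keep growing with more epochs, so a real run would likely need a stopping rule or a prior.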

Phase III – Evaluation
Cross Validation on Labeled Examples
- Manual annotation (based on Andy's review of the automated annotation)
- Automatic annotation (based on Andy's pattern matching)
Some questions we are interested in:
- What is the accuracy?
- What are the sources of error? Poor feature selection? Have we oversimplified the problem? How can we improve?
- How much does our accuracy suffer if we use only automatic annotation?
- Can we improve the automated annotation (and hence our classification accuracy)?
- Does this method have potential for extracting information from the literature that does not yet exist in a structured database?
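The cross-validation step could look something like the sketch below. It is only an outline under stated assumptions: the helper names are hypothetical, the folds are contiguous and unshuffled (real data would normally be shuffled first), and a majority-class baseline stands in for the maximum entropy classifier so that the example runs on its own.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of examples")
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(instances, labels, train_fn, predict_fn, k=5):
    """Mean held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(instances), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(instances) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train_fn(train_x, train_y)
        correct = sum(
            predict_fn(model, instances[i]) == labels[i] for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)

# Stand-in classifier (an assumption of this sketch): always predict the
# most common training label. A real run would plug in the trained model.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]

def predict_majority(model, instance):
    return model

# Toy, made-up labels; the instances themselves are ignored by the baseline.
TOY_LABELS = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
TOY_INSTANCES = [{} for _ in TOY_LABELS]

BASELINE_ACCURACY = cross_validate(
    TOY_INSTANCES, TOY_LABELS, train_majority, predict_majority, k=3
)
```

Running the same routine once with manually annotated labels and once with automatically annotated labels would give one way to approach the question of how much accuracy suffers under automatic annotation alone, and a majority-class baseline is a useful floor to compare against.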

Potential Plans for the Future
- Consideration of other classification methods
- Potentially merge work with Ted's Named Entity tagger for genes
- Try to do some information extraction: how is this gene involved in a given phenotype?
- Try with other databases: are the issues similar?