Seeking Abbreviations From MEDLINE Jeffrey T. Chang Hinrich Schütze Russ B. Altman Presented by: Bo Han.

Presentation transcript:

Challenge
• Huge
  – Contains 12 million citations back to 1966
• Growing
  – 400,000 citations per year
• Common and uncontrolled use of abbreviations
  – Authors create abbreviations in different ways

Example: Abbreviation → Definition
• VDR → vitamin D receptor
• PTU → propylthiouracil
• JNK → c-Jun N-terminal kinase
• IFN → interferon
• ATL → adult T-cell leukemia
• Beta-EP → beta-endorphin

Previous Methods
• Match initial letters (acronyms)
  – Filter out some common words
  – Example: Office of Nuclear Waste Isolation (ONWI)
• Use heuristics
  – Favor matches on the first letter, on syllable boundaries, or on upper-case letters
  – Difficulty: finding optimal weights
• Use lexicon rules

Motivation
• An automatic method for detecting abbreviations is a first step toward understanding the biological literature
• Rather than text-mining techniques, sequence alignment is used to find candidates
• A classification algorithm is used to confirm candidates

Method
1. Scan the text for possible abbreviations
2. Align the candidates
3. Convert the abbreviations and alignments into feature vectors
4. Classify with a machine learning method

Finding Candidates
• Pattern: long form (abbreviation)
• Assumption: ignore abbreviations longer than two words
• Within parentheses, stop at a comma or semicolon
• 3N for long form
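The candidate-finding rules on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `find_candidates` is hypothetical, and the reading of "3N" as "keep at most 3N words before the parenthesis, where N is the number of alphanumeric characters in the abbreviation" is an assumption.

```python
import re

# Matches "(...)" without nesting and captures the text inside the parentheses.
PAREN = re.compile(r"\(([^()]*)\)")

def find_candidates(text):
    """Return (prefix, abbreviation) pairs for each 'long form (abbreviation)' pattern."""
    candidates = []
    for match in PAREN.finditer(text):
        # Within the parentheses, stop at the first comma or semicolon.
        abbrev = re.split(r"[,;]", match.group(1))[0].strip()
        # Ignore candidates longer than two words.
        if not abbrev or len(abbrev.split()) > 2:
            continue
        # Assumption: N counts the alphanumeric characters of the abbreviation.
        n = sum(ch.isalnum() for ch in abbrev)
        if n == 0:
            continue
        # Assumption: keep at most 3N words before the opening parenthesis.
        words = text[:match.start()].split()
        prefix = " ".join(words[-3 * n:])
        candidates.append((prefix, abbrev))
    return candidates

print(find_candidates("The vitamin D receptor (VDR) binds DNA."))
```

The prefix is deliberately generous: it is only a search window, and the alignment step is what decides which of its words actually belong to the long form.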

Aligning Abbreviations
• Longest Common Subsequence (LCS)
• Sequence alignment: maximize the number of matched letters between the long form and the abbreviation
• O(NM)

Aligning Abbreviations
• Antioxidant response element
  A R E
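The alignment step can be illustrated with the textbook dynamic-programming solution for the longest common subsequence, which fills an N×M table and therefore runs in O(NM). This is a sketch under stated assumptions: the function name `lcs_align` is hypothetical, matching is case-insensitive by assumption, and the traceback returns only one of possibly several optimal alignments.

```python
def lcs_align(long_form, abbrev):
    """Return (long_form_index, abbrev_index) pairs of one optimal LCS alignment."""
    a, b = long_form.lower(), abbrev.lower()
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Trace back from the bottom-right corner to recover matched positions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

pairs = lcs_align("Antioxidant response element", "ARE")
print(len(pairs))  # number of abbreviation letters matched
```

Because several alignments can match the same number of letters, a later scoring step is needed to tell a plausible alignment (letters at word beginnings) from an accidental one.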

Computing Features in an Alignment
• Rather than using a scoring matrix, they use a feature vector
• Choosing features

Feature                      Weight
Lower/Upper Case
Beginning of Word            5.54
End of Word
Syllable Boundary            2.08
After Aligned Letter         1.50
Letters Aligned              3.67
Words Skipped
Aligned Letters per Word     0.70
Constant                     -9.70
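To make the idea of turning an alignment into a feature vector concrete, here is a simplified sketch that computes a handful of the listed features from a set of aligned positions. The exact feature definitions are not given on the slide, so every definition below (fractions of aligned letters, counting skipped words, and so on) is an assumption, as is the function name `alignment_features`.

```python
import re

def alignment_features(long_form, abbrev, aligned_positions):
    """Compute a few simplified alignment features (all definitions are assumptions)."""
    # Character spans (start, end) of each whitespace-separated word.
    spans = [m.span() for m in re.finditer(r"\S+", long_form)]
    starts = {start for start, _ in spans}
    ends = {end - 1 for _, end in spans}
    aligned = list(aligned_positions)
    n_aligned = len(aligned)
    n_abbrev = sum(ch.isalnum() for ch in abbrev)
    # Words of the long form that contain no aligned letter.
    skipped = sum(
        1 for start, end in spans
        if not any(start <= pos < end for pos in aligned)
    )

    def fraction(count, total):
        return count / total if total else 0.0

    return {
        "beginning_of_word": fraction(sum(p in starts for p in aligned), n_aligned),
        "end_of_word": fraction(sum(p in ends for p in aligned), n_aligned),
        "letters_aligned": fraction(n_aligned, n_abbrev),
        "words_skipped": skipped,
        "aligned_per_word": fraction(n_aligned, len(spans)),
    }

# Positions 0, 8 and 10 are the "v", "D" and "r" of "vitamin D receptor".
print(alignment_features("vitamin D receptor", "VDR", [0, 8, 10]))
```

The syllable-boundary and case features are left out here because they need resources or definitions the slide does not describe.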

Scoring Alignment
• Supervised learning
• Training set = 1000 random candidates
• Binary logistic regression classifier
  – P: probability of seeing an abbreviation
  – X: feature vector
  – W: weight vector
  – Find the w that maximizes the difference between positive and negative examples
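A binary logistic regression classifier scores a feature vector x with weights w as P = 1 / (1 + exp(-(w·x + c))), where c is the constant term. The sketch below applies that standard formula using only the weights that are legible in this transcript; features whose weights did not survive are omitted, so the resulting probabilities are illustrative and will not match the published model. The names `WEIGHTS`, `CONSTANT`, and `abbreviation_probability` are hypothetical.

```python
import math

# Weights as they appear in the transcript; features with missing weights are
# omitted, so this is an incomplete, illustrative model only.
WEIGHTS = {
    "beginning_of_word": 5.54,
    "syllable_boundary": 2.08,
    "after_aligned_letter": 1.50,
    "letters_aligned": 3.67,
    "aligned_per_word": 0.70,
}
CONSTANT = -9.70

def abbreviation_probability(features):
    """Logistic score for a dict of feature values (missing features count as 0)."""
    z = CONSTANT + sum(
        weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

strong = {name: 1.0 for name in WEIGHTS}
print(round(abbreviation_probability(strong), 3))
print(abbreviation_probability({}) < 0.001)
```

The large negative constant means a candidate starts out as very unlikely and must accumulate evidence from several features before its probability rises above one half.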

Evaluation
• A gold standard corpus
• Recall (84%)
  – Example: ten correct docs not retrieved out of 50 correct docs: 40/50 = 80%
• Precision (81%)
  – Example: 65 out of 100 retrieved docs are relevant: 65/100 = 65%
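The two worked examples on this slide follow the usual definitions: recall is the share of correct items that were retrieved, and precision is the share of retrieved items that are correct. A short sketch, with hypothetical function names, reproduces the slide's arithmetic:

```python
def recall(true_positives, false_negatives):
    """Fraction of all correct items that were retrieved."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Fraction of retrieved items that are correct."""
    return true_positives / (true_positives + false_positives)

# 50 correct docs, 10 of them not retrieved.
print(recall(40, 10))
# 100 retrieved docs, 65 of them relevant.
print(precision(65, 35))
```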

Test it on-line

Disadvantage
• Strict form
• Context is not considered
• Some grammar rules are ignored in the feature vector
ubMed&list_uids= &dopt=Abstract

Conclusions
• A novel algorithm for finding abbreviations
• Combining sequence alignment and machine learning
• Further work is expected to improve performance

Reference
Chang JT, Schütze H, and Altman RB (2002). Creating an Online Dictionary of Abbreviations from MEDLINE. The Journal of the American Medical Informatics Association. 9(6):
Acronym finder.
Iliopoulos I, Enright A, Ouzounis C. Textquest: document clustering of medline abstracts for concept discovery in molecular biology. Pac Symp Biocomput 2001;