A Probabilistic Term Variant Generator for Biomedical Terms. Yoshimasa Tsuruoka and Jun'ichi Tsujii. CREST, JST; The University of Tokyo.

Presentation transcript:

Outline: Probabilistic Term Variant Generator; Generation Algorithm; Application: Dictionary Expansion.

Background: Information extraction from biomedical documents. Recognizing technical terms (e.g. DNA, protein names). Example: "We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated …"

Technical Term Recognition. Machine learning based: identifies the regions of terms; gives no ID information. Dictionary-based: compares the strings with each entry in the dictionary; gives ID information.

Problems of Dictionary-based Approaches. Spelling variation degrades recall; remedy: approximate string searching. False positives degrade precision; remedy: filtering by machine learning.

Exact String Searching: Example. Text: "Phorbol myristate acetate induced Egr-1 mRNA …" Dictionary: EGP; EGR-1; EGR-1 binding protein; … None of the dictionary entries matches the text exactly.

Edit Distance. Defines the distance between two strings as the minimum cost of a sequence of three kinds of operations: substitution, insertion, and deletion. Ex.) board → abord: cost = 2 (delete 'a' and add 'a').
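
The definition above can be sketched with the standard dynamic-programming (Levenshtein) recurrence. This is a minimal illustration that assumes unit cost for every operation; the function name `edit_distance` is chosen here and does not come from the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit-cost substitution, insertion and deletion."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # the slide's example pair
print(edit_distance("Egr-1", "EGR-1"))  # case differences count as substitutions
```

With unit costs, case variation such as Egr-1 versus EGR-1 is penalised like any other substitution.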

Automatic Generation of Spelling Variants: Variant Generator. Input: NF-Kappa B → Generator → Output: NF-Kappa B (1.0), NF Kappa B (0.9), NF kappa B (0.6), NF kappaB (0.5), NFkappaB (0.3), … Each generated variant is associated with its generation probability.

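Generation Algorithm. Recursive generation: T cell (1.0) generates T-cell (0.5) and T cells (0.2), which in turn generate T-cells (0.1). The probability of a variant is the probability of its parent multiplied by the probability of the applied operation: P_variant = P_parent × P_op.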
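
A minimal sketch of the recursive generation, reproducing the T cell example. The two operations (`hyphenate`, `pluralize`), their probabilities, and the pruning threshold are assumptions picked to match the slide's numbers; they are not the learned rules of the actual generator.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# An operation maps a term to (variant, operation probability) pairs.
Operation = Callable[[str], List[Tuple[str, float]]]


def hyphenate(term: str) -> List[Tuple[str, float]]:
    # Hypothetical operation: replace the first space with a hyphen (P_op = 0.5).
    return [(term.replace(" ", "-", 1), 0.5)] if " " in term else []


def pluralize(term: str) -> List[Tuple[str, float]]:
    # Hypothetical operation: append a plural "s" (P_op = 0.2).
    return [(term + "s", 0.2)] if not term.endswith("s") else []


def generate(
    term: str,
    prob: float = 1.0,
    threshold: float = 0.05,  # assumed cut-off that stops the recursion
    operations: Sequence[Operation] = (hyphenate, pluralize),
    seen: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Recursively generate variants; each child gets P_parent * P_op."""
    if seen is None:
        seen = {}
    # Stop when the variant is too unlikely or already reached at least as probably.
    if prob < threshold or seen.get(term, 0.0) >= prob:
        return seen
    seen[term] = prob
    for operation in operations:
        for variant, p_op in operation(term):
            generate(variant, prob * p_op, threshold, operations, seen)
    return seen


for variant, p in sorted(generate("T cell").items(), key=lambda item: -item[1]):
    print(f"{variant} ({p:.1f})")
```

Keeping the highest probability per variant in `seen` means a variant reachable by several operation orders (here T-cells) is expanded only once.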

Collecting Examples of Spelling Variation. Abbreviation extraction (Schwartz 2003): extracts short and long form pairs. Short form | Long form: AA | Alcoholic Anonymous American Americans Arachidonic acid arachidonic acid anaemia anemia …

Learning Operation Rules. Operations for generating variants: substitution, deletion, insertion. Context: character-level context, i.e. the preceding (following) two characters. Operation probability.

Probabilistic Rules
Probability | Left context | Target | Right context | Operation
0.96 | * | | End of String | Delete
0.96 | Start of String | I | m | Replace I with i
0.95 | * | H | yd | Replace H with h
: | : | : | : | :
0.75 | ph | | End of String | Insert y
: | : | : | : | :
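
One way to represent and apply context-sensitive rules like these is sketched below. The `Rule` class, the `^`/`$` boundary sentinels, the reading of `*` as "any context", and the input words are all assumptions made for illustration; the slides do not specify an implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

START, END = "^", "$"  # hypothetical sentinels for Start/End of String


@dataclass(frozen=True)
class Rule:
    probability: float
    left: str         # left context; "*" is assumed to mean "any context"
    target: str       # characters to rewrite; "" for an insertion
    right: str        # right context; "*" is assumed to mean "any context"
    replacement: str  # new characters; "" for a deletion


def apply_rule(term: str, rule: Rule) -> List[Tuple[str, float]]:
    """Return (variant, probability) for every position where the rule matches."""
    padded = START + term + END
    results = []
    for i in range(1, len(padded)):
        if not padded.startswith(rule.target, i):
            continue
        j = i + len(rule.target)
        if j > len(padded) - 1:  # the target must not swallow the END sentinel
            continue
        left_ok = rule.left == "*" or padded[:i].endswith(rule.left)
        right_ok = rule.right == "*" or padded[j:].startswith(rule.right)
        if left_ok and right_ok:
            variant = padded[1:i] + rule.replacement + padded[j:-1]
            results.append((variant, rule.probability))
    return results


# Three rules from the table, applied to illustrative (hypothetical) inputs.
print(apply_rule("Immunoglobulin", Rule(0.96, START, "I", "m", "i")))
print(apply_rule("Hydrocortisone", Rule(0.95, "*", "H", "yd", "h")))
print(apply_rule("angiograph", Rule(0.75, "ph", "", END, "y")))
```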

Example (1)
Generation Probability | Generated Variant | Frequency
1.0 (Input) | NF-kappa B |
 | NF-kappaB |
 | nF-kappa B |
 | Nf-kappa B |
 | NF kappa B |
 | NF-kappa b | 0
: | : | :

Example (2)
Generation Probability | Generated Variant | Frequency
1.0 (Input) | antiinflammatory effect |
 | anti-inflammatory effect |
 | antiinflammatory effects |
 | Antiinflammatory effect |
 | antiinflammatory-effect |
 | anti-inflammatory effects | 23
: | : | :

Example (3)
Generation Probability | Generated Variant | Frequency
1.0 (Input) | tumour necrosis factor alpha |
 | tumor necrosis factor alpha |
 | tumour necrosis factor-alpha |
 | Tumour necrosis factor alpha |
 | tumor necrosis factor alpha |
 | Tumor necrosis factor alpha | 8
: | : | :

Application: Dictionary Expansion. Expanding each entry in the dictionary. Threshold of generation probability: 0.1. Maximum number of variants for each entry: 20.
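
The expansion step can be sketched as follows, using the slide's settings (threshold 0.1, at most 20 variants per entry). The function `expand_dictionary`, the stand-in generator `toy_generator`, and the choice to count the original entry toward the limit are assumptions for illustration only.

```python
from typing import Callable, Dict, Iterable


def expand_dictionary(
    entries: Iterable[str],
    generate_variants: Callable[[str], Dict[str, float]],
    threshold: float = 0.1,
    max_variants: int = 20,
) -> Dict[str, str]:
    """Map every kept variant back to the dictionary entry it came from."""
    expanded: Dict[str, str] = {}
    for entry in entries:
        variants = generate_variants(entry)
        # Keep variants at or above the threshold, most probable first.
        kept = sorted(
            ((v, p) for v, p in variants.items() if p >= threshold),
            key=lambda item: -item[1],
        )[:max_variants]
        for variant, _ in kept:
            expanded.setdefault(variant, entry)  # first entry to claim a variant wins
    return expanded


def toy_generator(entry: str) -> Dict[str, float]:
    # Hypothetical stand-in for the variant generator, using the slide's example.
    if entry == "NF-Kappa B":
        return {"NF-Kappa B": 1.0, "NF Kappa B": 0.9, "NF kappa B": 0.6,
                "NF kappaB": 0.5, "NFkappaB": 0.3}
    return {entry: 1.0}


print(expand_dictionary(["NF-Kappa B"], toy_generator))
```

Mapping each variant back to its source entry preserves the ID information that the dictionary-based approach relies on.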

Protein Name Recognition. Information extraction; longest match; GENIA corpus.
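
Longest-match lookup against a dictionary can be sketched as below. Whitespace tokenisation, the `max_len` window, and the sample dictionary (which assumes the variant "Egr-1" was added by expansion) are illustrative assumptions, not details given in the slides.

```python
from typing import List, Set, Tuple


def longest_match(
    tokens: List[str], dictionary: Set[str], max_len: int = 5
) -> List[Tuple[int, int, str]]:
    """Scan left to right, preferring the longest dictionary entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


dictionary = {"EGP", "EGR-1", "Egr-1", "EGR-1 binding protein"}
tokens = "Phorbol myristate acetate induced Egr-1 mRNA".split()
print(longest_match(tokens, dictionary))
```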

Results of Dictionary Expansion

Conclusion. Probabilistic variant generator: learning from actual examples. Dictionary expansion by the generator improves recall without loss of precision.