Mining the Medical Literature. Chirag Bhatt, October 14th, 2004.


Why mine data? Medical, genomics, and proteomics research. Find causal links between symptoms or diseases and drugs or chemicals. Gene comparison.

An example. Problem: What is causing an uncharacteristic behavior in protein production? Solution: Find which genes have a role to play in amino acid synthesis. How? Search through online literature for genes that play a role in amino acid synthesis.

Search vs. Discover

                             Search (goal oriented)   Discover (opportunistic)
Structured data (database)   Data retrieval           Data mining
Unstructured data (text)     Information retrieval    Text mining

Data Retrieval: Company database, e.g. customer records, product inventory. Search entity: (structured) records. Query (goal-driven): What is the address of our client? How many widgets are in stock? SQL, Oracle, DB2, etc.
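A goal-driven query over structured records can be sketched as follows. This is only an illustration: the table name, column names, and stock figures are invented, and SQLite stands in for whatever database product a company actually uses.

```python
import sqlite3

# In-memory database with a hypothetical inventory table (invented data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, in_stock INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("widget", 42), ("gadget", 7)],
)

# "How many widgets are in stock?" expressed as a structured query.
row = conn.execute(
    "SELECT in_stock FROM inventory WHERE product = ?", ("widget",)
).fetchone()
widgets_in_stock = row[0]
conn.close()
```

The point of the sketch is that the question maps directly onto named fields, which is what distinguishes data retrieval from searching free text.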

Information Retrieval: Google, A9, AltaVista. Query (goal-driven). Search entity: (unstructured) documents in variable formats (html, pdf, etc.).

Data Mining: Structured data set. Generally a large amount of (historical) data. Find relations, patterns, or trends in the database (opportunistic), e.g. "beer and diapers".
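The "beer and diapers" example is usually explained with support and confidence over shopping baskets. The sketch below assumes that framing; the transactions are invented for illustration and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_support(baskets):
    """Fraction of baskets that contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
beer_diaper_support = support[("beer", "diapers")]
beer_to_diaper_conf = confidence(transactions, "beer", "diapers")
```

Pairs with both high support and high confidence are the kind of unexpected pattern that data mining is meant to surface.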

Text Mining: Unstructured data set: documents, publications, abstracts, web pages. Discover useful and previously unknown "gems" of information in large text collections using patterns, trends, and domain knowledge.

Need for mining text: Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation).

Why Text Mining in Medical Literature? Many multi-functional genes. Screen functionally interesting ones. Complexity of needs increasing: individual genes -> family of genes. Manual text mining? Not really! Availability of published literature online.

Functionally Coherent Genes: A group of genes that exhibit similar experimental features, e.g. amino acid metabolism, electron transport, stress response.

Difficulties faced in finding functionally coherent genes: Most genes express multi-functionality. Some genes have been studied extensively and some have only just been discovered.

Semantic neighbor: Two articles are semantic neighbors if they have similar word usage. Use statistical natural language processing to access and interpret online text.
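One simple way to make "similar word usage" concrete is cosine similarity between word-count vectors. The sketch below assumes that measure; the slides do not say which similarity function was used, and the abstracts and function names are invented for illustration.

```python
import math
from collections import Counter


def word_vector(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_neighbors(doc_id, abstracts, k=2):
    """Return the k articles whose word usage is closest to doc_id."""
    target = word_vector(abstracts[doc_id])
    scored = [
        (cosine(target, word_vector(text)), other)
        for other, text in abstracts.items()
        if other != doc_id
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical abstracts, invented for illustration only.
abstracts = {
    "a1": "amino acid synthesis in yeast requires gene expression",
    "a2": "yeast gene expression during amino acid synthesis",
    "a3": "electron transport chain in mitochondria",
}
neighbors = semantic_neighbors("a1", abstracts, k=1)
```

A real system would likely weight words (for example by inverse document frequency) and drop stop words; raw counts are used here only to keep the sketch short.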

Methodology

Find semantic neighbors in the document set. If any article about a common function refers to at least one gene in the group, then the group is functionally coherent.

Neighbor divergence scoring method: Each article's relevance to the gene group is scored by the number of its neighbors that have references to the group.
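The scoring rule above can be sketched as a simple count. The data structures here (a neighbor list per article and a set of referenced genes per article) are assumptions made for illustration, and the article and gene identifiers are placeholders.

```python
def article_scores(neighbors, references, gene_group):
    """Score each article by how many of its neighbors reference the group.

    neighbors:  article id -> list of semantic-neighbor article ids
    references: article id -> set of genes the article refers to
    gene_group: iterable of genes in the group being tested
    """
    group = set(gene_group)
    scores = {}
    for article, nbrs in neighbors.items():
        # A neighbor counts if it refers to at least one gene in the group.
        scores[article] = sum(1 for n in nbrs if references.get(n, set()) & group)
    return scores


# Hypothetical data, invented for illustration only.
neighbors = {"a1": ["a2", "a3"], "a2": ["a1", "a5"], "a3": ["a4"]}
references = {
    "a1": {"G1"},
    "a2": {"G1", "G2"},
    "a3": {"G9"},
    "a4": set(),
    "a5": {"G2"},
}
scores = article_scores(neighbors, references, {"G1", "G2"})
```

The resulting score list, one entry per article, is the input to the distribution comparison that gives neighbor divergence its name.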

Neighbor divergence scores: If the score distribution differs from a Poisson distribution, then the gene group represents a biological function. The log ratio for a Poisson distribution should be flat along the horizontal axis.
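A minimal sketch of the comparison against a Poisson distribution follows. It assumes the Poisson mean is taken to be the sample mean of the scores and measures the difference with a Kullback-Leibler style sum; the original work may estimate the expected distribution differently, and the score lists are invented.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing the integer count k under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def log_ratios(scores):
    """Log of observed frequency over Poisson probability, per score value."""
    n = len(scores)
    lam = sum(scores) / n  # assumption: Poisson mean = sample mean
    observed = Counter(scores)
    return {
        k: math.log((observed[k] / n) / poisson_pmf(k, lam))
        for k in sorted(observed)
    }


def divergence(scores):
    """Sum of p * log(p / q) over observed score values (KL-style divergence)."""
    n = len(scores)
    lam = sum(scores) / n
    observed = Counter(scores)
    return sum(
        (c / n) * math.log((c / n) / poisson_pmf(k, lam))
        for k, c in observed.items()
    )


# Hypothetical article scores, invented for illustration only.
random_group_scores = [0, 0, 1, 0, 1, 0, 0, 2, 0, 1]
coherent_group_scores = [0, 0, 0, 0, 0, 0, 0, 0, 9, 11]
random_div = divergence(random_group_scores)
coherent_div = divergence(coherent_group_scores)
```

For scores that look Poisson, the log ratios stay close to zero and the divergence is small; a group with a few articles that score far above the rest produces a much larger divergence.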

Need to filter results: Well-studied genes generally tend to have semantic neighbors that refer to the same gene. Such a neighbor may not be relevant to the group function but still increases the score (a false positive). So only articles that refer to different genes are considered.

Evaluation: Report the percentile of a functional group of genes. Calculate precision and recall at different cutoff levels (next slide). Replace legitimate genes in the group with irrelevant genes.

Precision and Recall
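Precision and recall at a score cutoff can be computed as in the sketch below. The group scores and labels are invented, and treating precision as 1.0 when nothing exceeds the cutoff is a convention chosen here, not something stated in the slides.

```python
def precision_recall(scored_groups, cutoff):
    """Precision and recall when groups scoring >= cutoff are called functional.

    scored_groups: list of (score, is_functional) pairs
    """
    predicted = [label for score, label in scored_groups if score >= cutoff]
    true_positives = sum(predicted)
    total_positives = sum(label for _, label in scored_groups)
    # Convention (assumption): no predictions means no wrong predictions.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical (score, is_functional) pairs, invented for illustration only.
groups = [
    (9.1, True),
    (7.4, True),
    (6.0, False),
    (3.2, True),
    (1.5, False),
    (0.4, False),
]
strict = precision_recall(groups, cutoff=5.0)
loose = precision_recall(groups, cutoff=3.0)
```

Lowering the cutoff typically raises recall, because more true groups are admitted; precision may move in either direction depending on how many random groups are admitted along with them.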

Results Sample Space: 19 known yeast groups and 1900 random groups

Results

Replacing functional genes

Limitations of neighbor divergence: Neighbor divergence helps group genes but does not tell us their function. The work is based on abstracts only. Searching the entire literature may prove challenging. Break into smaller components.

Another mining approach: Extracting synonymous gene and protein terms.

Why find synonyms? Genes and proteins are often associated with multiple names across articles and subdomains. More names keep getting added as new functional or structural information is discovered. Finding synonyms improves search and analysis.

Current work: Biological databases such as GenBank and SWISSPROT include synonyms, but they are not up to date, there is disagreement on some synonyms, and manual curation and review is laborious. Need for automation.

Two-step problem: (1) Identifying gene and protein names: done by state-of-the-art taggers. (2) Determining whether these names are synonymous: we'll discuss more on this…

Current synonym approaches: Synonymous gene and protein names represent the same biological substance: identical biological functions; same gene or amino acid sequences. Other approaches: string matching; matching abbreviations to full forms.

Gene and Protein Tagging: Identification step. Uses BLAST techniques and domain knowledge to pick out gene and protein terms. Heuristics: synonyms usually occur within the same sentence; synonyms are mentioned in the first few pages of an article.

Synonym detection approaches: Unsupervised: "Similarity", based on contextual similarity. Semi-supervised: "Snowball", extracts structured relations using patterns. Supervised: text classification/SVM. Hand-crafted extraction: GPE. Combined system.

Combined Approach: Combine the output of Snowball, SVM, and GPE. Each system gives a confidence score for each synonym pair, where s is a synonym pair and Conf_E(s) is the confidence assigned to s by the individual extraction system E.
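The combining formula itself does not appear in this transcript. One plausible rule, assumed here rather than taken from the slide, treats the systems as independent and combines their confidences noisy-or style: Conf(s) = 1 - prod over E of (1 - Conf_E(s)). The per-system scores below are invented.

```python
def combined_confidence(confidences):
    """Noisy-or combination of per-system confidences (an assumed rule).

    confidences: extraction system name -> Conf_E(s) in [0, 1]
    """
    remaining_doubt = 1.0
    for conf in confidences.values():
        # Each system independently removes a share of the remaining doubt.
        remaining_doubt *= 1.0 - conf
    return 1.0 - remaining_doubt


# Hypothetical confidences for one candidate synonym pair s.
per_system = {"Snowball": 0.5, "SVM": 0.5, "GPE": 0.0}
overall = combined_confidence(per_system)
```

Under this rule a pair found by several systems ends up with higher confidence than any single system gave it, and a system that assigns zero confidence leaves the result unchanged.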

Unsupervised: Similarity. Context based: all words occurring within an x-word window. False positives are very common. Run time: O(|lexicon|^3).
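The context-window idea can be sketched as follows: collect the words within x positions of each candidate name, then compare the resulting context vectors. Cosine similarity is an assumption here, and the sentences and names (lara, larb, kinx) are invented placeholders.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, terms, window=2):
    """Count the words within `window` positions of each term occurrence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in terms:
                start = max(0, i - window)
                context = tokens[start:i] + tokens[i + 1:i + 1 + window]
                vectors[token].update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sentences and names, invented for illustration only.
sentences = [
    "the protein lara binds the receptor strongly",
    "the protein larb binds the receptor weakly",
    "the enzyme kinx degrades cellular lipids quickly",
]
vectors = context_vectors(sentences, {"lara", "larb", "kinx"}, window=2)
sim_lara_larb = cosine(vectors["lara"], vectors["larb"])
sim_lara_kinx = cosine(vectors["lara"], vectors["kinx"])
```

The example also hints at why false positives are common: two unrelated proteins that merely appear in similar sentences would score just as highly as true synonyms.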

Semi-supervised - Snowball Manual feedback mechanism

Supervised: Text Classification. Input: known synonym pairs. Automatically find contexts and assign weights. Train a classifier to distinguish between "positive" and "negative" contexts, e.g. "A also known as B" and "A regulates B".
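The idea of learning weights for contexts can be sketched with a much simpler stand-in than the SVM named in the slides: smoothed log-odds weights for the words that appear between two names. Everything below, including the training sentences and function names, is an illustrative assumption.

```python
import math
from collections import Counter


def middle_context(sentence, a, b):
    """Words between the first mention of a and the first mention of b."""
    tokens = sentence.lower().split()
    i, j = tokens.index(a.lower()), tokens.index(b.lower())
    lo, hi = min(i, j), max(i, j)
    return tokens[lo + 1:hi]


def train_weights(examples):
    """Smoothed log-odds weight per word from (context_words, is_synonym) pairs."""
    pos, neg = Counter(), Counter()
    for words, is_synonym in examples:
        (pos if is_synonym else neg).update(words)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)  # add-one smoothing
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(weights, words):
    """Positive totals suggest a synonym context; unseen words contribute 0."""
    return sum(weights.get(w, 0.0) for w in words)


# Hypothetical training sentences, invented for illustration only.
training = [
    (middle_context("A also known as B", "A", "B"), True),
    (middle_context("A also called B", "A", "B"), True),
    (middle_context("A regulates B", "A", "B"), False),
    (middle_context("A binds to B", "A", "B"), False),
]
weights = train_weights(training)
synonym_score = score(weights, middle_context("X also known as Y", "X", "Y"))
relation_score = score(weights, middle_context("X regulates Y", "X", "Y"))
```

A real classifier would use richer features and far more training pairs, but the sketch shows how "positive" and "negative" contexts pull the score in opposite directions.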

Why a Combined Approach? Snowball and SVM are machine-learning based and capture synonyms that may be missed by GPE. GPE is knowledge-based. Snowball and SVM have many false positives. Combine the advantages of both.

Results

Summary: Text mining. Semantic neighbor. Neighbor divergence. Precision and recall. Synonym detection approaches. Comments / Questions?