UCB BioText TREC 2003 Genomics Track Participants: Marti Hearst Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz University of California, Berkeley Genomics:


UCB BioText: TREC 2003 Genomics Track
Participants: Marti Hearst, Gaurav Bhalotia, Preslav Nakov, Ariel Schwartz
University of California, Berkeley
Genomics: tasks 1 and 2

Overview
- The UCB BioText group took part in Task 1 and Task 2
- Task 1: Information Retrieval + Information Extraction (+ Text Classification)
- Task 2: Text Classification + Information Extraction
- Commonalities between the two tasks
  - Named entity recognition in the text
    - Genes and synonyms
    - MeSH concepts
  - Text classification algorithms

MeSH Hierarchy
- Unique identifier: e.g. Abdomen has D
- UMLS semantic tags: e.g. Enzyme, Gene or Genome, Mammal, Tissue, Virus etc.
- Alphanumeric descriptor codes
  - Top-level categories: [A] Anatomy, [B], [C], [D], [E], [F], [G], [H] Physical Sciences, [I], [J], [K], ...
  - Under [A] Anatomy: Body Regions [A01], Musculoskeletal System [A02], Digestive System [A03], Respiratory System [A04], Urogenital System [A05], ...
  - Under Body Regions [A01]: Abdomen [A01.047], Back [A01.176], Breast [A01.236], Extremities [A01.378], Head [A01.456], Neck [A01.598], ...
  - Under [H] Physical Sciences: Electronics, Astronomy, Nature, Time, ...
  - Under Electronics: Amplifiers; Electronics, Medical; Transducers; ...

Task 1

TREC Task 1: Overview
- Search 525,938 MEDLINE records
  - Titles, abstracts, MeSH category terms, citation information
- Topics
  - Taken from the GeneRIF portion of the LocusLink database
  - We are supplied with gene names
- Definition of a GeneRIF
  - For gene X, find all MEDLINE references that focus on the basic biology of the gene or its protein products from the designated organism.
  - Basic biology includes isolation, structure, genetics and function of genes/proteins in normal and disease states.

TREC Task 1: Sample Query
  Homo sapiens  OFFICIAL_GENE_NAME  ets variant gene 6 (TEL oncogene)
  Homo sapiens  OFFICIAL_SYMBOL     ETV
  Homo sapiens  ALIAS_SYMBOL        TEL
  Homo sapiens  PREFERRED_PRODUCT   ets variant gene
  Homo sapiens  PRODUCT             ets variant gene
  Homo sapiens  ALIAS_PROT          TEL1 oncogene
- The first column is the official topic number (1-50).
- The second column contains the LocusLink ID for the gene.
- The third column contains the name of the organism.
- The fourth column contains the gene name type.
- The fifth column contains the gene name.
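The five-column topic layout described above could be read with a few lines of Python. This is only a sketch: the tab separator, the `QueryLine` type, the function names, and the topic number and LocusLink ID in `sample` are all assumptions made for illustration, not values from the track data.

```python
from typing import NamedTuple


class QueryLine(NamedTuple):
    topic: int         # official topic number (1-50)
    locuslink_id: str  # LocusLink ID of the gene
    organism: str      # organism name in LocusLink terminology
    name_type: str     # e.g. OFFICIAL_GENE_NAME, ALIAS_SYMBOL
    name: str          # the gene name itself


def parse_query_line(line: str, sep: str = "\t") -> QueryLine:
    """Split one five-column topic line into named fields (separator is assumed)."""
    topic, locus_id, organism, name_type, name = line.rstrip("\n").split(sep, 4)
    return QueryLine(int(topic), locus_id, organism, name_type, name.strip())


def names_by_topic(lines):
    """Group every name variant under its (topic, organism) pair."""
    grouped = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        q = parse_query_line(line)
        grouped.setdefault((q.topic, q.organism), []).append((q.name_type, q.name))
    return grouped


# Placeholder topic number (1) and LocusLink ID ("0000"); not real track data.
sample = [
    "1\t0000\tHomo sapiens\tOFFICIAL_GENE_NAME\tets variant gene 6 (TEL oncogene)",
    "1\t0000\tHomo sapiens\tALIAS_SYMBOL\tTEL",
]
```

Grouping by topic and organism keeps all name variants of one gene together, which is the natural input for synonym expansion and organism filtering.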

General Architecture
[Figure: system architecture diagram; the "has GeneRIF" classifier enters with weight 0.01]

Main Challenges
- Task 1: given a gene and an organism, find documents likely to have a GeneRIF
- Relevance judgment: GeneRIF references from LocusLink
- Main challenges
  - Ranking
  - Recall
    - Find more gene synonym variations
  - Precision
    - Filter out abstracts with genes from incorrect organisms
    - Lower the rank of documents not likely to have a GeneRIF

Gene Synonym List Creation

How to Find Gene Name Synonyms?
- Strategy: compile a list of gene names from the text
  - Start with a list of gene names from LocusLink and MeSH
  - Use an n-gram-based approximate match algorithm to find alternative representations of these genes in MEDLINE abstracts
  - Look for commonalities and regularities
  - Create a set of name transformation rules
    - Some are better than others
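An n-gram-based approximate match of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the group's algorithm: character trigrams, set-based (rather than multiset) Dice scoring, lowercasing, and the function names are all choices made here. The 0.5 lower bound follows the Dice range quoted in the slides on expansion pairs.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lowercased string."""
    s = text.lower()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient of two sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def candidate_variants(known_name, phrases, n=3, low=0.5):
    """Score each phrase against a known gene name; keep those with Dice >= low."""
    target = char_ngrams(known_name, n)
    scored = [(dice(target, char_ngrams(p, n)), p) for p in phrases]
    return sorted(((s, p) for s, p in scored if s >= low), reverse=True)
```

Calling `candidate_variants("TEL oncogene", phrases_from_abstracts)` would return scored candidates for manual inspection, which matches the slides' description of deriving rules by looking for regularities.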

Gene Expansion: Sample Expansion Pairs
- Matches whose Dice coefficient falls between 0.5 and 1.0
[Table: sample expansion pairs]

Gene Expansion: High Confidence Rules
- Matches whose Dice coefficient falls between 0.5 and 1.0
- Rules determined by inspection
[Table: high-confidence rules]

Organism Filtering

Organism Filtering: Strategy
- Problem: the query describes the organism name using the LocusLink terminology, which differs from MEDLINE's
- Strategy: semi-automatically determine the translation
  - For a given LocusLink organism name, search for that term against the MEDLINE title, abstract, and MeSH terms
  - Display the most frequent MeSH terms that result
  - The correct translation appeared as one of the top 3
- Could be a useful strategy for other translation problems
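The counting step of this strategy could be sketched as follows. The record layout (dicts with `title`, `abstract`, and `mesh` keys), the substring search, and the function name are assumptions for illustration; the slides only say that the most frequent MeSH terms of the matching records are displayed for a person to choose from.

```python
from collections import Counter


def top_mesh_terms(organism_name, records, k=3):
    """Return the k most frequent MeSH terms among records mentioning the organism.

    records: iterable of dicts with 'title', 'abstract' (str) and 'mesh' (list of str).
    """
    needle = organism_name.lower()
    counts = Counter()
    for rec in records:
        mesh = list(rec.get("mesh", []))
        # Search title, abstract, and MeSH terms, as on the slide.
        searchable = " ".join([rec.get("title", ""), rec.get("abstract", "")] + mesh).lower()
        if needle in searchable:
            counts.update(mesh)
    return counts.most_common(k)
```

Because a human picks the translation from the short list, a simple frequency ranking is enough; the sketch makes no attempt to decide automatically.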

Organism Filtering: Results
[Table: sample top-ranked MeSH terms]

GeneRIF Classification

GeneRIF Classification: Training
- Used for our second run
- Motivation
  - Only MEDLINE documents that have been assigned GeneRIFs are considered relevant
- Strategy to improve precision
  - Identify documents likely to have a GeneRIF assigned
  - Naïve Bayes classifier (WEKA ML tools)
- Training
  - 50 gene names, not in the TREC training/testing set
  - Train on the 1000 top-ranked documents for each gene
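For readers unfamiliar with the classifier family, here is a minimal multinomial Naive Bayes with add-one smoothing. The group used WEKA's implementation; this standalone Python class is a substitute written for illustration, and the class name, the token-list input, and the smoothing choice are assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Tiny multinomial Naive Bayes over token lists (add-one smoothing)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + v
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:                # ignore unseen tokens
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

In the setting described above, the two classes would be "has GeneRIF" and "no GeneRIF", and the classifier's output would feed into the ranking as one more weighted signal.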

GeneRIF Classification: Results

Document Ranking

Document Ranking
- DB2 Net Search Extender
- Score = weighted sum:
    1.0 * (H compared to phrases in titles)
  + w2 * (H compared to phrases in abstracts)
  + w3 * (L compared to phrases in titles)
  + w4 * (L compared to phrases in abstracts)
  + w5 * (query MeSH compared to document MeSH)
- H: high-confidence gene rules
- L: low-confidence gene rules
- w2 to w5 stand for the remaining weights
- Weights determined experimentally
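The weighted-sum scoring can be illustrated with a short sketch. Only the 1.0 weight on the first term is given in the slide; every other weight below is a made-up placeholder, and the signal names, the `combined_score` and `rank` helpers, and the idea that each signal is a number per document are assumptions. The actual system computed its scores inside DB2 Net Search Extender.

```python
# Only the 1.0 on "high_title" comes from the slide; the other weights are
# hypothetical placeholders chosen purely so the example runs.
WEIGHTS = {
    "high_title": 1.0,      # H rules matched against title phrases
    "high_abstract": 0.5,   # H rules matched against abstract phrases (placeholder)
    "low_title": 0.3,       # L rules matched against title phrases (placeholder)
    "low_abstract": 0.1,    # L rules matched against abstract phrases (placeholder)
    "mesh": 0.2,            # query MeSH vs. document MeSH (placeholder)
}


def combined_score(match_scores, weights=WEIGHTS):
    """Weighted sum of per-signal match scores; missing signals count as 0."""
    return sum(w * match_scores.get(name, 0.0) for name, w in weights.items())


def rank(docs, weights=WEIGHTS):
    """docs: {doc_id: {signal_name: score}}. Return doc ids, best first."""
    return sorted(docs, key=lambda d: combined_score(docs[d], weights), reverse=True)
```

Keeping the weights in one dictionary makes it easy to tune them experimentally, as the slide says was done.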

Document Retrieval and Ranking

Task 1: TREC Evaluation
- MAP on TREC training data
  - using GeneRIF classifier:
  - without GeneRIF classifier:
- MAP on TREC testing data
  - using GeneRIF classifier:
  - without GeneRIF classifier:
- Analysis
  - Using the classifier performs better on 27 out of 50 queries (equal on 12).
  - Tuning the parameters on the test set (tried afterwards) results in only minor improvement.
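MAP (mean average precision) is the mean, over queries, of the average of the precision values at each rank where a relevant document appears. A minimal sketch, assuming each ranked list contains no duplicate document ids; the function names are chosen here for illustration:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank over the ranks where a relevant doc appears.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For Task 1, `relevant_ids` for a query would be the documents referenced by the gene's GeneRIFs in LocusLink.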

Task 2

TREC Task 2
Problem definition: given GeneRIFs formatted as:
  J Biol Chem 2002 Sep 13;277(37): the death effector domain of FADD is involved in interaction with Fas
  Nucleic Acids Res 2002 Aug 15;30(16): In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis w …
reproduce the GeneRIF from the MEDLINE record.

Preliminary study
- Find the GeneRIF text in the abstract
  - 33,662 MEDLINE abstracts with GeneRIFs
- Best match of the GeneRIF text in the abstract
  - Modified Unigram Dice coefficient
  - Accepted if scored above 80%
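A best-match search of this kind might be sketched as below. The slides do not say how the Dice coefficient was modified, so this uses the plain set-based unigram Dice; the tokenizer, the function names, and the candidate list (for example the title plus each abstract sentence) are likewise assumptions.

```python
import re


def unigrams(text):
    """Lowercased alphanumeric word tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_dice(a, b):
    """Plain unigram Dice: 2 * shared words / (words in a + words in b)."""
    ta, tb = unigrams(a), unigrams(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def best_match(generif, candidates, threshold=0.8):
    """Return (best candidate, its score, accepted?) for one GeneRIF."""
    best = max(candidates, key=lambda c: unigram_dice(generif, c), default=None)
    if best is None:
        return None, 0.0, False
    score = unigram_dice(generif, best)
    return best, score, score > threshold  # accepted only above the threshold
```

Running this over every abstract with a GeneRIF would give the kind of statistics reported on the baseline slide, such as how often the title is the best match.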

Baseline
- Baseline: pick the whole title verbatim
  - Motivation
    - the best match was a substring of the title: 46.30%
    - the whole title was the best match in 65.10%
  - Baseline: Modified Unigram Dice score 53.39%
- Choose: title vs. last sentence
  - Observation: the best match is the title OR the last sentence: 73.40%
  - If we choose a whole sentence: title vs. last sentence
    - Upper bound (best choice each time): 66.33%
    - Lower bound (worst choice each time): 22.62%

Features
We experimented with the following features:
- Nominal features
  - words/stems
  - verbs (most frequent: e.g. bind, block, accept etc.; nominalized)
  - genes
  - gene_freq (number of gene names mentioned)
  - MeSH_unique_ID (e.g. D005796)
  - MeSH_codes (level 1: G14, or level 2: G14.330)
  - MeSH_semantic_type (e.g. cell, human, biological function)
  - journal
  - publication_date (month and year, e.g. 10_2003)
- Boolean features
  - target_gene (is the target gene mentioned?)
  - is_title (is the current sentence the title?)
  - is_last_sentence (is this the last sentence?)

Best Features
- Standard feature set
  - verbs (most frequent: e.g. bind, block, accept etc.; nominalized)
  - genes_freq (number of gene names mentioned)
  - MeSH_code (cut at level 2, e.g. G )
  - target_gene (is the target gene mentioned?)
  - is_title (is the current sentence the title?)
  - is_last_sentence (is this the last sentence?)
- The last two were not used in the final tests.
- Weighted using TF.IDF (except the Boolean features)
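TF.IDF weighting comes in several variants and the slides do not say which was used. The sketch below assumes raw term frequency times log(N / document frequency); the function name and the token-list input are illustrative. Boolean features would be kept out of this step and appended unweighted.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Return one {token: weight} dict per document.

    weight = term frequency in the document * log(N / document frequency)
    """
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(t for d in docs for t in set(d))
    weighted = []
    for d in docs:
        tf = Counter(d)
        weighted.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weighted
```

With this variant, a feature that occurs in every document gets weight zero, so very common verbs or MeSH codes stop influencing the classifier.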

Title vs. Last Sentence
- Text classification
  - Choose: title (class A) vs. last sentence (class B)
  - Naïve Bayes classifier (WEKA ML tools)
  - The standard features
- Training and testing
  - Each document represents one example
  - Features: extracted from the title and the last sentence only
    - Features for the title and the last sentence are not distinguished from each other.
    - Distinguishing them lowers the accuracy.
  - Training set: Modified Unigram Dice overlap with the GeneRIF
  - Stratified 10-fold cross-validation
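Stratified k-fold cross-validation keeps the class proportions (here title vs. last sentence) roughly equal in every fold. WEKA provides this out of the box; the standalone sketch below only shows the idea, and the round-robin assignment, the fixed seed, and the function names are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each example is used for testing exactly once, and accuracy is averaged over the k held-out folds.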

Task 2: Evaluation
- Training
  - Document collections: 1000, 2000, 10000, 20000; finally, limited the set to the 5 target journals
  - Classification algorithm selection
    - tried: decision tree, boosting, kNN, logistic regression etc.
  - Feature selection
    - tuning, for a fixed feature set
    - tuned the best minimum frequency thresholds for verbs and MeSH_codes: 12 and 5, respectively
- TREC run
  - Training: 5 journals, except the 139 abstracts from the TREC test
  - Feature frequency thresholds as found during training: 12 and 5
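Applying minimum frequency thresholds to features could look like the sketch below. The thresholds 12 (verbs) and 5 (MeSH codes) come from the slide; counting total occurrences in the training set (rather than document frequency), the `(feature_type, value)` pair representation, and the function names are assumptions.

```python
from collections import Counter

# Thresholds from the slide: verbs need at least 12 occurrences, MeSH codes 5.
MIN_FREQ = {"verb": 12, "mesh_code": 5}


def frequent_features(docs, min_freq=MIN_FREQ):
    """docs: list of lists of (feature_type, value) pairs.

    Return the set of features whose training-set count reaches the
    threshold for their type; types without a threshold are always kept.
    """
    counts = Counter(f for d in docs for f in d)
    return {f for f, c in counts.items() if c >= min_freq.get(f[0], 1)}


def filter_doc(doc, keep):
    """Drop features that did not pass the frequency thresholds."""
    return [f for f in doc if f in keep]
```

The set of kept features is computed on training data only and then reused unchanged on the test abstracts, as the slide describes for the TREC run.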

Task 2: Results

Discussion
- Test sets are small and much harder than training sets
- Task 1
  - Organism filter was very helpful
  - Noisy GeneRIF assignment limits the help given by the classifier
  - Initial runs supplied by other research groups were very helpful
- Task 2
  - Sentence truncation could improve the results
  - Need ranking, rather than classification, algorithms
  - Better feature selection needed
    - sensitivity to frequency thresholds
    - MeSH ambiguity
    - verb nominalization

Thank you!