UIC at TREC 2006: Genomics Track Wei Zhou, Clement T. Yu University of Illinois at Chicago Nov. 16, 2006.



3 stages
Stage 1: Conversion - Greek letters → English words
Stage 2: Paragraph retrieval - retrieve 2,000 most relevant paragraphs
Stage 3: Passage extraction and ranking - extract and retrieve 1,000 most relevant passages

Stage 1: conversion
Convert the Greek letters into English words, for example, TGF β1 → TGF beta1 (β, in the HTML documents, may be represented by “&#223” or “beta.gif”).
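
A minimal sketch of the conversion step, assuming the documents are available as HTML strings. The function name `convert_greek`, the alpha and gamma rows of the mapping, and the regular expressions are illustrative assumptions; only the β example and its two HTML representations come from the slide.

```python
import re

# Hypothetical mapping table; only the beta row is taken from the slide.
GREEK_TO_ENGLISH = {"α": "alpha", "β": "beta", "γ": "gamma"}

# Per the slide, beta may appear in the HTML as "&#223" or as the image "beta.gif".
ENTITY_BETA = re.compile(r"&#223(?!\d);?")
IMG_BETA = re.compile(r"<img[^>]*\bsrc=[\"'][^\"']*beta\.gif[\"'][^>]*>", re.IGNORECASE)


def convert_greek(text):
    """Replace Greek letters (and the two HTML encodings of beta) with English words."""
    text = ENTITY_BETA.sub("beta", text)
    text = IMG_BETA.sub("beta", text)
    for letter, word in GREEK_TO_ENGLISH.items():
        text = text.replace(letter, word)
    return text


print(convert_greek("TGF β1"))
```

Running the conversion before indexing means that a query term such as "TGF beta1" and a document containing "TGF β1" end up with the same surface form.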

Stage 2: paragraph retrieval
The goal of this stage is to retrieve the 2,000 most relevant paragraphs. Several techniques are used:
1. conditional Porter stemming
2. handling of lexical variants of gene symbols
3. concept retrieval IR model
4. query expansion
5. abbreviation correction

Stage 2: paragraph retrieval - conditional Porter stemming
Potential errors of the Porter stemmer
Type 1: gene symbol → non-gene word, e.g., “Pes” → “Pe”, “IDE” → “ID”
Type 2: non-gene word → gene symbol, e.g., “IDEE” → “IDE”
Solution: a table (Entrez gene database) containing all the gene symbols is maintained.
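
One way the gene symbol table could guard the stemmer is sketched below. The slide does not spell out the exact rule, so the two checks here (leave a word unstemmed if it is a gene symbol, or if its stem would be one) are an assumption chosen to block the Type 1 and Type 2 errors. `toy_stem` is a hypothetical stand-in that only reproduces the slide's examples; it is not the Porter algorithm, and a real Porter implementation could be passed as `stem` instead.

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer: drop one trailing 's' or 'e' (not Porter)."""
    if len(word) > 2 and word[-1].lower() in ("s", "e"):
        return word[:-1]
    return word


def conditional_stem(word, gene_symbols, stem=toy_stem):
    """Stem a word only when stemming neither destroys nor creates a gene symbol."""
    if word in gene_symbols:
        return word  # Type 1 guard: keep gene symbols intact
    stemmed = stem(word)
    if stemmed in gene_symbols:
        return word  # Type 2 guard: do not turn a non-gene word into a gene symbol
    return stemmed


# Small illustrative table; the real table would come from the Entrez gene database.
GENE_SYMBOLS = {"Pes", "IDE"}

for token in ["Pes", "IDE", "IDEE", "genes"]:
    print(token, "->", conditional_stem(token, GENE_SYMBOLS))
```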

Stage 2: paragraph retrieval - handling lexical variants of gene symbols
2 strategies:
Strategy 1: automatically generate lexical variants (Buttcher, 2004; Huang, 2005), e.g., PLA2 → PLA 2, PLAII, and PLA II
Strategy 2: retrieve additional lexical variants from a term database of MEDLINE (Zhou, 2006), e.g., PLA2 → PL-A2
Note: PLA2: Phospholipase A2
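
Strategy 1 can be illustrated with a small generator that reproduces the PLA2 example. The rules used here (insert a space at the letter/digit boundary, and swap a single Arabic digit for its Roman numeral) are assumptions inferred from that one example; the rule sets in the cited papers may differ. The names `lexical_variants` and `ROMAN` are hypothetical.

```python
import re

# Roman numerals for single non-zero digits.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V",
         "6": "VI", "7": "VII", "8": "VIII", "9": "IX"}


def lexical_variants(symbol):
    """Generate variants of a symbol shaped like letters followed by one digit."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9])", symbol)
    if match is None:
        return set()  # other symbol shapes are outside this sketch
    head, digit = match.groups()
    roman = ROMAN[digit]
    return {f"{head} {digit}", f"{head}{roman}", f"{head} {roman}"}


print(sorted(lexical_variants("PLA2")))
```

Variants such as PL-A2 are not produced by these rules, which is why the slide pairs generation with a lookup in a term database (Strategy 2).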

Stage 2: paragraph retrieval - concept retrieval (IR model)
Definition: A concept is a biomedical meaning or sense.
1) a gene and its synonym set refer to the same concept;
2) a MeSH term and its synonym set refer to the same concept.

Stage 2: paragraph retrieval - concept retrieval (IR model)
Assumption: Okapi does not work well if the query contains multiple concepts.
For example, q: “role of gene PRNP in mad cow disease”, where “gene PRNP” is concept 1 and “mad cow disease” is concept 2.
d1: has many occurrences of concept 2
d2: has a small number of occurrences of both concepts
Okapi: sim(q,d1) > sim(q,d2), but intuitively d2 is more relevant than d1.

Stage 2: paragraph retrieval - concept retrieval (IR model)
According to our model (Liu, 2004; UIC Robust track, 2005), we have sim(q,d2) > sim(q,d1), because d2 includes both concept 1 and concept 2, although d1 has more occurrences of concept 2.
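
The d1/d2 example can be sketched as a two-level ranking. The formulas on this slide did not survive in the transcript, so everything below is an assumption: each document carries a concept similarity (here simply the number of query concepts it contains) and a word similarity (any word-level score, such as Okapi), and documents are ordered by concept similarity first, with word similarity breaking ties. The scores are made-up illustrative values.

```python
def rank_by_concept_then_word(docs):
    """Order documents by concept similarity, breaking ties by word similarity."""
    return sorted(docs, key=lambda d: (d["concept_sim"], d["word_sim"]), reverse=True)


# Illustrative values only: d1 mentions concept 2 many times (high word score),
# d2 mentions both concepts a few times (lower word score).
docs = [
    {"id": "d1", "concept_sim": 1, "word_sim": 9.0},
    {"id": "d2", "concept_sim": 2, "word_sim": 3.0},
]

ranking = [d["id"] for d in rank_by_concept_then_word(docs)]
print(ranking)
```

Under this ordering a document that covers both query concepts cannot be outranked by one that only repeats a single concept, which is the behaviour the example asks for.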

Stage 2: paragraph retrieval - query expansion
Synonyms
Hyponyms (more specific terms)
Pseudo-feedback
Related terms

Stage 2: paragraph retrieval - query expansion using biomedical knowledge
Related terms (co-occur frequently and are semantically related)
q: “How do interactions between HNF4 and COUP-TF1 suppress liver function?”
There exist relationships between the semantic type of a related term and the semantic type of each query concept in the UMLS semantic network.
[Figure: terms shown on the slide are Liver, Hepatocytes, Hepatoblastoma, Gluconeogenesis, and Hepatitis B virus; labels are “HNF4 and COUP-TF I” and “related terms”.]

Stage 2: paragraph retrieval - avoid incorrect match of abbreviations
Given a query with both an abbreviation of a gene symbol and its full form, a document will match the term only if both its abbreviation and its full form are matched.
For example, q: role of APC (adenomatous polyposis coli) in colon cancer?
d: “…Much work has been undertaken in recent decades with the aim of producing projections of future cancer incidence and mortality rates from observed rates by using age-period-cohort (APC) models…”
Here d contains “APC” but not the full form “adenomatous polyposis coli”, so it does not match the gene term.
Notice that gene symbols are usually abbreviations, which are very ambiguous in the biomedical literature.
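
The matching rule above can be sketched as a simple check. The function name `matches_gene_term` and the matching details (whole-word, case-sensitive match for the abbreviation; case-insensitive substring match for the full form) are assumptions made for illustration, not details given on the slide.

```python
import re


def matches_gene_term(document, abbreviation, full_form):
    """Return True only if both the abbreviation and its full form occur in the document."""
    abbr_pattern = r"\b" + re.escape(abbreviation) + r"\b"
    has_abbreviation = re.search(abbr_pattern, document) is not None
    has_full_form = full_form.lower() in document.lower()
    return has_abbreviation and has_full_form


cohort_doc = ("Much work has been undertaken in recent decades with the aim of "
              "producing projections of future cancer incidence and mortality rates "
              "from observed rates by using age-period-cohort (APC) models")
gene_doc = "Mutations in adenomatous polyposis coli (APC) are common in colon cancer."

print(matches_gene_term(cohort_doc, "APC", "adenomatous polyposis coli"))
print(matches_gene_term(gene_doc, "APC", "adenomatous polyposis coli"))
```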

Stage 3: passage extraction and ranking
The goal of this stage is to take the output of stage 2 (i.e., the 2,000 most relevant paragraphs) and identify the 1,000 most relevant passages (i.e., one or more consecutive sentences within a paragraph).

Stage 3: passage extraction and ranking - extraction The criterion for the optimal passage in a paragraph is given by: “Given various windows of different sizes, choose the one which has the maximum number of query concepts and the smallest size.”
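
The criterion can be sketched as an exhaustive search over windows of consecutive sentences. Two assumptions are made here that the slide does not state: window size is measured in sentences, and a concept counts as present when its string occurs in the lowercased window text (a real system would match the concept's whole synonym set). The function name `best_passage` is hypothetical.

```python
def best_passage(sentences, query_concepts):
    """Pick the window with the most query concepts; among those, the smallest one."""
    best_key, best_window = None, []
    for start in range(len(sentences)):
        for end in range(start + 1, len(sentences) + 1):
            window_text = " ".join(sentences[start:end]).lower()
            concept_count = sum(1 for c in query_concepts if c.lower() in window_text)
            # Larger concept count wins; for equal counts the shorter window wins.
            key = (concept_count, -(end - start))
            if best_key is None or key > best_key:
                best_key, best_window = key, sentences[start:end]
    return best_window


paragraph = [
    "PRNP encodes the prion protein.",
    "The protein is expressed in many tissues.",
    "Mad cow disease affects cattle.",
    "PRNP variants alter mad cow disease risk.",
]
print(best_passage(paragraph, ["PRNP", "mad cow disease"]))
```

The search is quadratic in the number of sentences, which is small for a single paragraph.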

Stage 3: passage extraction and ranking - ranking The ranking of passages is similar to the ranking of paragraphs. For each passage, we computed its concept similarity and word similarity with the query. Then the concept retrieval model is applied for the ranking.

Experiment results
3 runs:
UICgen1: the top 1,000 most relevant paragraphs were returned as the passages.
UICgen2: the top 1,000 optimal passages according to the criterion were returned (some bugs).
UICgen3: same as UICgen2, except the bugs were removed.

Experiment results

References
Buttcher S, Clarke CLA, Cormack GV. Domain-specific synonym expansion and validation for bio-medical information retrieval (MultiText experiments for TREC 2004). The Thirteenth Text REtrieval Conference (TREC 2004) Proceedings, 2004, Gaithersburg, MD.
Huang X, Zhong M, Si L. York University at TREC 2005: Genomics Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.
Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database of Abbreviations in MEDLINE. Bioinformatics 2006; 22(22):
Liu S, Liu F, Yu C, Meng WY. An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. Proceedings of the 27th Annual International ACM SIGIR Conference, pp , Sheffield, UK, July
Liu S, Yu C. UIC at TREC2005: Robust Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.

Questions Thanks!