Information Retrieval and its Application in Biomedicine
Hong Yu 1,2, PhD
Susan McRoy 1, PhD
1 Department of Computer Science
2 Department of Health Sciences
University of Wisconsin-Milwaukee
Sept 4
Introduction
What is Information Retrieval?
The field concerned with the acquisition, organization, and searching of knowledge-based information. (Hersh, 2003)
Speed Up Communication
Information
World Wide Web
Company documentation
Drug descriptions
Medical records
Books
Everything that is text, image, video, or sound and that can be represented digitally
Information in Biomedicine
Literature (over 17 million publications)
WWW
Electronic medical records
Genomics data
–DNA sequences, etc.
Knowledge representation
–Gene Ontology
Company databases
–Micromedex drug database
IR in Biomedicine
Index Medicus (Billings 1879)
MEDLARS (NLM 1966)
SAPHIRE (Hersh 1990)
PubMed (NLM 1996)
Arrowsmith (Smalheiser 1998)
BioText (Hearst 2003)
BioMedQA (Yu 2006)
Electronic and Open Publishing
The Internet and the Web have had a profound impact on the publishing of knowledge-based information
Most of the literature can be made available electronically
Open access
–The Bethesda Statement on Open Access Publishing (April 11, 2003)
–The Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities (berlin/berlindeclaration.html) (2003)
–PubMed Central (NLM 2004)
Quality of Information
A lack of quality control
–Anyone can publish online
–A wealth of studies have concluded that healthcare information on the Web is of poor quality
Readability
–Hard to read
Information Needs and Seeking
Unrecognized needs
–Clinicians unaware of information needs or knowledge deficit
Recognized needs
–Clinicians aware of needs but may or may not pursue them
Pursued needs
–Information seeking occurs but may or may not be successful
Satisfied needs
–Information seeking successful
Evidence-Based Medicine
What You Will Learn
IR algorithms
–Indexing
–Query and Retrieval
–Evaluation
–Text Classification
–XML retrieval
–Web retrieval
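To make the first three topics concrete, here is a minimal sketch of indexing, Boolean AND retrieval, and precision/recall evaluation. It is an illustration only, not the course's reference implementation: the three toy documents, the helper names (`tokenize`, `build_index`, `search`, `precision_recall`), and the set of "relevant" documents are all assumptions made up for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def build_index(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Retrieval: Boolean AND, i.e. documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def precision_recall(retrieved, relevant):
    """Evaluation: precision and recall against a set of relevant ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection (not from the course materials).
docs = {
    1: "Aspirin reduces the risk of heart attack.",
    2: "Gene expression in heart tissue.",
    3: "Aspirin and gene regulation.",
}
index = build_index(docs)
retrieved = search(index, "aspirin heart")

# Assume, for illustration, that a human judged documents 1 and 3 relevant.
precision, recall = precision_recall(retrieved, {1, 3})
print(retrieved, precision, recall)
```

With these assumed inputs, only document 1 contains both query terms, so precision is high but recall is not: strict AND matching misses document 3, which is the kind of trade-off the evaluation topic examines.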
What You Will Learn (Cont.)
Open-source IR tools
–What open-source IR tools are available
  Indexing/retrieval
  Part-of-speech and syntactic parsing
  Semantic parsing
  Discourse relations
  Machine-learning classifiers
How to use the tools?
What You Will Learn (Cont.)
State-of-the-art IR systems
–Baruch 1965 [BLIMP]
–SAPHIRE (Hersh 1990): Retrieval
–MedLEE (Friedman 1994): Extraction
–PubMed (NLM 1997)
–Arrowsmith Systems (Smalheiser 1998): Hidden Relation Discovery Tool
–GENIES (Friedman 2001): Extraction
BioNLP Systems
BioText (Hearst)
–Retrieval+Categorization
GeneWays (Rzhetsky)
–Extraction+Visualization
TextPresso (Muller)
–Retrieval+Extraction
iHOP (Hoffman and Valencia, net.org/UniPub/iHOP/)
–Retrieval
BioMedQA (Yu)
–Question Answering
Advanced NLP applications
Beyond Text: Image and Video
Image classification
–Finding concepts in captions and annotations
–Machine learning on textual & visual features
–Determining salient features in text and image separately and merging the results
Extracting text from image
–Understanding and correcting OCR (handwriting, equations)
–Finding text in images
Finding document text related to illustrations
Video retrieval
Beyond Extraction: Experimental Tools
Resources
Annotated collections (GENIA, Medstract, Yapex …)
Ontologies, tools, knowledge bases …
Publications, Conferences, Evaluations …
Centres and web portals
What We Provide
Textbook
–Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2007 (retrieval-book.html)
Office hour:
–Tuesdays, 3-4 pm EMS 710 and by appointment
–Hong Yu
–Susan McRoy
What We Expect
Undergraduate:
–30% Homework, 35% Midterm exam, 35% Final exam or project
Graduate:
–20% Midterm exam, 40% Homework, 40% Project: The project may be done individually or in a team of 2-3 people. The final project will include a software system, a 2-3 page written project report, and an oral presentation. The report should describe the problem, the approach, and the evaluation, and should cite related work where appropriate.