The Jikitou Biomedical Question Answering System: Using High-Performance Computing to Preprocess Possible Answers

Michael A. Bauer 1,2, Daniel Berleant 1, Robert E. Belford 1, and Roger A. Hall 1
1 University of Arkansas at Little Rock
2 University of Arkansas for Medical Sciences

Introduction

We live in an age in which researchers have access to unprecedented amounts of biological information. This access cuts both ways: it makes for a more informed and empowered researcher, but the deluge of information can become overwhelming when using traditional search engines. There is a need for intelligent information retrieval systems that can summarize relevant textual information while also drawing on multiple reliable sources to satisfy a user's query.

Question Answering

Question answering (QA) is a specialized type of information retrieval with the aim of returning precise short answers to queries posed as natural language questions. We have developed a QA system, named Jikitou, which answers natural language questions with sentences taken from Medline abstracts.

Fig. 1 Basic Elements of a QA System: Information Need, Question Understanding, Selecting Sources, Information Retrieval and Extraction, Answer Determination, Answer Presentation.

A Quick Tour of Jikitou

Synonym suggestions: The WordNet and Gene Ontology databases are queried to find synonyms for the terms entered.

Autocomplete: Aspell, a spell-checking program, is used to suggest spellings for terms and to simulate autocomplete functionality. A medical term dictionary has been added to account for terminology specific to the biomedical domain.

Associations: The Gene Ontology database is used to find associations with biological terms, such as biological process, cellular component, or function.

Fig. 6 As a user types a question, the system suggests additional terms that can be added to refine the initial query.

HyperGlossary

HyperGlossary (HG) is a literacy tool that we developed and integrated with Jikitou. It automates the insertion of hyperlinks into the text answers and connects them to textual definitions, multimedia content, and, in the case of many molecules, 2D and 3D representations. It takes advantage of authoritative knowledge sources on the Internet.

Fig. 7 Connection to the ChemEd Digital Library returns Jmol views, which allow you to interact with molecules in multiple ways beyond simple measurements, such as connecting vibrations to IR spectra. Panels: glossary definition; Jmol view of an HG-generated structure file.

Natural Language Parsers

A natural language parser is a program used to determine the grammatical structure of a sentence. In this project we use this structure to match answers based on the semantics of sentences instead of just matching terms. Two parsers are used in this project, the Link Grammar Parser and the Stanford Parser:

Link Grammar Parser
  Description: The system is based on link grammar. It takes a sentence and assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words.
  Language: C

Stanford Parser
  Description: A probabilistic natural language parser that uses knowledge of language gained from hand-parsed sentences to assign the most probable structure to new sentences.
  Language: Java

Fig. 2 Overview of the two parsers used in this project

High-Performance Computing

Because sentences in the biomedical literature are long and complex, natural language parsing can be a very CPU-intensive process. Preprocessing the sentence knowledge base on powerful multi-core servers eliminates the need to parse sentences on the fly and therefore increases the responsiveness of the system.

Perl scripts were written for each parser. Each script started the desired number of child processes and partitioned the approximately 4.5 million sentences into chunks of data to be processed in parallel. Initial preprocessing was done on a Dell PowerEdge server using 8 of the available cores and took over 150 hours for the Link Grammar Parser and over 200 hours for the Stanford Parser. Using the HP ProLiant DL980 server (Figures 3 and 4) reduced these times to 20 hours and 37 hours, respectively, a reduction of more than sevenfold and more than fivefold. Figure 5 shows the difference in processing time between the two systems for each parser.

HP ProLiant DL980 Server
  CPU sockets        8
  Cores per socket   10
  Threads per core   2
  CPUs               160

Fig. 3 HP ProLiant DL980 server specifications
Fig. 4 Image of an HP ProLiant DL980 server
Fig. 5 Graph showing the preprocessing times (hours) of the Dell PowerEdge vs. the HP ProLiant server

Discussion

Once potential answers are returned to the system, we rank them based on a combination of the semantic distance of key terms, which we determine using the Link Grammar Parser, and the similarity of grammatical structure found using the Stanford Parser. The system brings together traditional textual information and dedicated, vetted biological databases to present a concise answer to the user. The use of HPC enabled the preprocessing of the sentences in days instead of weeks, which will be invaluable when the system is scaled up to include more sentences.

This work is supported by the NSF Division of Undergraduate Education Award number and the IDeA Networks of Biomedical Research Excellence (INBRE) Program of the National Center for Research Resources.

References

Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB. Frontiers of biomedical text mining: Current progress. Brief. Bioinform 8(5):
Sleator D and Temperley D. Parsing English with a Link Grammar. Carnegie Mellon University Computer Science tech. report CMU-CS , October
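The parallel preprocessing step described above (start a number of child processes, split the sentences into chunks, parse each chunk independently) can be sketched as follows. This is only an illustrative Python sketch, not the Perl scripts used in the project. The names `partition`, `parse_sentence`, `parse_chunk`, and `preprocess` are hypothetical, and `parse_sentence` is a stand-in that merely tokenizes, where the real system would invoke the Link Grammar Parser or the Stanford Parser.

```python
from concurrent.futures import ProcessPoolExecutor


def partition(items, n_chunks):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parse_sentence(sentence):
    """Hypothetical placeholder for a real parser call; it only splits on whitespace."""
    return sentence.split()


def parse_chunk(chunk):
    """Parse every sentence in one chunk; this is the unit of work given to a child process."""
    return [parse_sentence(sentence) for sentence in chunk]


def preprocess(sentences, n_workers):
    """Parse all sentences using n_workers child processes, one chunk per worker."""
    chunks = partition(sentences, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so results line up with the input sentences.
        results = pool.map(parse_chunk, chunks)
        return [parsed for chunk_result in results for parsed in chunk_result]
```

A script would call, for example, `preprocess(sentences, n_workers=8)` from under an `if __name__ == "__main__":` guard. Because each sentence is parsed independently of the others, the work divides cleanly across processes, which is why adding cores shortens the total preprocessing time.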