Evidence-Based Information Retrieval in Bioinformatics

Timothy B. Patrick, PhD
Healthcare Administration and Informatics, University of Wisconsin-Milwaukee

Goal of the Project
The overall, long-term goal of this research project is to contribute to evidence-based information retrieval in post-genomic medicine. That a biomedical endeavor is evidence-based, whether it is focused on patient care or on the discovery of gene function, implies that both its decision making and its retrieval of information are evidence-based. Evidence-based decision making addresses the need to base decisions on the results of prior scientific study. Evidence-based information retrieval, on the other hand, addresses the necessary prior step of having proof of the effectiveness of the way particular information resources are used and combined in order to retrieve that evidence.

Aims
Specific Aim 1: Determine existing pitfalls in accessing literature on gene function
Specific Aim 2: Based on user warrant, determine the current state of evidence-based functional genomic retrieval
Specific Aim 3: Based on literary warrant, determine the current state of evidence-based functional genomic retrieval

“Determine existing pitfalls in accessing literature on gene function”
The first aim, to determine existing pitfalls in accessing literature on gene function, is the topic of my talk later today, “Asymmetries in Retrieval of Gene Function Information”.

The Study
In that study we investigated an example of different paths to the literature that might look equivalent to a user but are not, owing to various features of the resources involved. Knowing that they are not equivalent requires knowledge of metadata about the resources. We compared three different paths to literature on gene function that might appear equivalent to a user who lacks that knowledge.

Three Paths
[Diagram: three paths from an Affymetrix microarray experiment to the literature, each starting from a Genbank Accession number. Path 1: Accession number -> Pubmed -> Pubmed ID. Path 2: Accession number -> Nucleotide -> Pubmed links -> Pubmed ID. Path 3: Accession number -> Gene -> Pubmed links -> Pubmed ID.]
Each path starts from a microarray experiment, so I want to first talk a little about microarrays.

Microarrays can be used to determine gene expression under experimental and control conditions. Each cell in a microarray holds copies of short strands of DNA called probes. The probes are used to identify particular genes. The microarray is used “… to help researchers identify what RNA sequences are present in [an experimental or control] sample, and this then tells them how strongly those genes are being expressed by that cell.”* The microarray is washed with RNA that has been treated to fluoresce when stained. The expression level of the genes is indicated by the brightness of the resulting glow. The results of the microarray experiment are statistically analyzed to determine which genes are significantly expressed. In the workflows we consider, there is a representative DNA sequence related to the probes for a gene, and that sequence is what is used to search for primary literature about the gene function.

Three Paths
Each path starts from a microarray experiment, gets the Genbank Accession numbers of the representative sequences of expressed genes, and uses those Accession numbers to search for primary literature that may shed some light on the function of the genes.

Methods
To compare the three paths, we first collected the representative DNA Accession numbers associated with genes expressed in a microarray experiment (NIH grant AG18881) designed to identify changes in gene expression associated with skeletal muscle recovery from immobilization-induced sarcopenia.

Methods
Next, we retrieved the Unique Identifiers (UIs) of Entrez Pubmed citations that were associated with the Accession numbers by each of the three Entrez resources: directly in the case of Entrez Pubmed, and indirectly, via Pubmed links, in the case of Entrez Nucleotide and Entrez Gene. We then compared the number of Pubmed IDs retrieved by the three resources for each of the Accession numbers.

Summary of Pubmed IDs by Accession Number
We collected for each Accession number the Pubmed IDs retrieved by each path. This slide shows a summary of the numbers of Pubmed IDs retrieved by Accession numbers for each path.
[Three tables, labeled Pubmed, Nucleotide, and Gene, each with the columns "# of Pubmed IDs" and "Accession numbers" and a total of 251 Accession numbers. Cell values as extracted, one table per group: 198 1 36 2 10 3 4 5; 132 1 112 2 5 3 4; 216 1 34 2 3 4 5.]
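
The per-path summaries on this slide amount to a frequency table: for each path, how many Accession numbers retrieved 0, 1, 2, and so on Pubmed IDs. A minimal Python sketch of that tabulation follows. The accession strings and Pubmed IDs are made-up placeholders, not data from the study, and `tabulate_counts` is a hypothetical helper name.

```python
from collections import Counter

def tabulate_counts(results):
    """Map each path to {number of Pubmed IDs: number of Accession numbers}.

    `results` maps a path name to {accession: set of Pubmed IDs}.
    """
    table = {}
    for path, by_accession in results.items():
        freq = Counter(len(pmids) for pmids in by_accession.values())
        table[path] = dict(sorted(freq.items()))
    return table

# Placeholder data only; not taken from the study.
toy_results = {
    "Pubmed":     {"ACC001": {"111"}, "ACC002": set(), "ACC003": {"222", "333"}},
    "Nucleotide": {"ACC001": {"111"}, "ACC002": {"444"}, "ACC003": {"222"}},
    "Gene":       {"ACC001": set(), "ACC002": set(), "ACC003": {"222"}},
}

for path, freq in tabulate_counts(toy_results).items():
    # Every path is tabulated over the same Accession numbers,
    # so the per-path totals should agree.
    print(path, freq, "total:", sum(freq.values()))
```

Keeping the sets of Pubmed IDs, rather than only their sizes, also makes it possible to check later whether two paths that return the same number of citations return the same citations.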

Methods
We then compared the number of Pubmed IDs retrieved for each Accession number by each path. We analyzed the data with a non-parametric test, Kendall’s W (Pubmed versus Nucleotide versus Gene). The result sets produced by the three paths were significantly different at p < .05.
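
For readers who want to see what the test computes, here is a minimal Python sketch of Kendall's W with the usual correction for ties. It assumes the data were arranged so that each Accession number ranks the three paths by the number of Pubmed IDs retrieved; the slides do not say how the analysis was set up, and the counts below are invented for illustration. With m Accession numbers and n paths, W = 12S / (m^2(n^3 - n) - mT), where S is the sum of squared deviations of the per-path rank sums from their mean and T adds up t^3 - t over every group of t tied values within a row. The quantity m(n - 1)W is then referred to a chi-square distribution with n - 1 degrees of freedom.

```python
import math

def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kendalls_w(rows):
    """rows: one list per Accession number, holding the count for each path.

    Returns (W, chi_square), using the tie-corrected formula.
    """
    m = len(rows)      # Accession numbers, acting as the "judges"
    n = len(rows[0])   # paths being ranked
    rank_sums = [0.0] * n
    tie_term = 0.0
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
        for t in (row.count(v) for v in set(row)):
            tie_term += t ** 3 - t
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    denom = m ** 2 * (n ** 3 - n) - m * tie_term
    if denom == 0:
        return 0.0, 0.0  # every row is fully tied, so no ranking information
    w = 12 * s / denom
    return w, m * (n - 1) * w

# Invented counts, columns = (Pubmed, Nucleotide, Gene); not study data.
toy_counts = [[0, 1, 0], [0, 2, 0], [1, 1, 0], [0, 1, 1], [0, 2, 1]]
w, chi2 = kendalls_w(toy_counts)
# With three paths there are 2 degrees of freedom, and the chi-square
# upper-tail probability for 2 degrees of freedom is exactly exp(-x / 2).
p = math.exp(-chi2 / 2)
print(f"W = {w:.3f}, chi-square = {chi2:.3f}, p = {p:.4f}")
```

The closed-form p-value only holds for three paths; with a different number of columns a statistics library would be needed for the chi-square tail.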

The Three Paths Are Not Equivalent
[Diagram: the three paths from Genbank Accession number to Pubmed ID, marked as unequal: Pubmed ≠ Nucleotide ≠ Gene.]
In other words, these three different paths are not equivalent, in that they do not produce the same results.

The point is that a user lacking knowledge of the metadata about the resources (i.e., indexing and other structural features) might have considered the paths equivalent. Here, for example, is documentation about the SI field that strongly suggests that the “direct to Pubmed” path and the Nucleotide path would not be equivalent:
The SI field identifies secondary source databanks and accession numbers of outside resources discussed in MEDLINE articles. The field is composed of the source followed by a slash followed by an accession number and can be searched with one or both components, e.g., genbank [si], AF [si], genbank/AF [si].
The SI field and the Entrez sequence database links are not linked. The PubMed links to these databases are created from the reference field of the GenBank or GenPept flat file. These references include citations that discuss the specific sequence presented in these flat files.
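
To make the difference between the two routes concrete, the following Python sketch builds the requests each route would issue if it were run through the NCBI E-utilities web service. This is an assumption made for illustration: the slides do not say which interface was used in the study. The accession `AF000000` and the record UID `12345` are placeholders, `nuccore` is the E-utilities database name for Entrez Nucleotide, and the helper names are hypothetical. The code only constructs URLs; it does not contact NCBI.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def si_query(source, accession):
    """Build an SI-field query in the 'source/accession [si]' form quoted above."""
    return f"{source}/{accession} [si]"

def esearch_url(db, term):
    """URL for an ESearch request against one Entrez database."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})

def elink_url(dbfrom, uid, db="pubmed"):
    """URL for an ELink request from one record to its linked Pubmed records."""
    return f"{EUTILS}/elink.fcgi?" + urlencode({"dbfrom": dbfrom, "db": db, "id": uid})

accession = "AF000000"  # placeholder, not an accession from the study

# Direct route: search Pubmed itself, here through the SI field.
print(esearch_url("pubmed", si_query("genbank", accession)))

# Indirect route: find the Nucleotide record, then follow its Pubmed links,
# which come from the reference field of the sequence record.
print(esearch_url("nuccore", accession))
print(elink_url("nuccore", "12345"))  # "12345" stands in for the UID found above
```

The two routes consult differently constructed indexes, which is why identical-looking starting points need not lead to the same citations.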

“Based on user warrant, determine the current state of evidence-based functional genomic retrieval”
The work on the second aim is in progress. In this project we interview biologists who use microarrays to study gene expression levels. The questions concern what methods they use for IR, why they consider those methods effective, what their criteria of success and failure are, and how they see the role of biomedical librarians in the process.

Interviews in Progress
We currently have five interviews scheduled at the University of Missouri-Columbia, we are scheduling interviews at the University of Wisconsin-Milwaukee, and in March we interviewed two subjects at the National Institute of Genetics (NIG) in Japan. We also have ten interviews that we did previously at the University of Missouri-Columbia and elsewhere.

“Based on literary warrant, determine the current state of evidence-based functional genomic retrieval”
For our third aim, we wanted to investigate how and to what extent biological science researchers reported their information retrieval methods, including details of why they used the methods they did.

Methods
We searched OVID Medline on October 1, 2004 for the period 1966 to September Week with the query “Oligonucleotide Array Sequence Analysis/”, producing results. We then limited the results to English (10374), excluded “review articles” (9049), and limited to the years 2003 – 2004 (4798). We next ranked journals in the results by number of articles, and selected a population of all of the articles from the 13 top journals (n=1373). We randomly sampled 150 articles from that population.
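
The last two steps, ranking journals by article count and sampling from the top journals, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are toy placeholders, `sample_from_top_journals` is a hypothetical helper, and the fixed seed is there so the draw can be repeated, not because the study reported one.

```python
import random
from collections import Counter

def sample_from_top_journals(records, n_journals, sample_size, seed=0):
    """records: (article_id, journal) pairs remaining after the search limits.

    Returns (top journals, population from those journals, random sample).
    """
    counts = Counter(journal for _, journal in records)
    top = [journal for journal, _ in counts.most_common(n_journals)]
    population = [rec for rec in records if rec[1] in top]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = rng.sample(population, min(sample_size, len(population)))
    return top, population, sample

# Toy records only: 30 articles spread unevenly over five journals.
journals = ["A"] * 10 + ["B"] * 8 + ["C"] * 6 + ["D"] * 4 + ["E"] * 2
toy_records = [(f"article-{i}", j) for i, j in enumerate(journals)]

top, population, sample = sample_from_top_journals(toy_records, 3, 5)
print(top, len(population), len(sample))
```

In the study's terms the call would use 13 journals and a sample size of 150, applied to the 4798 records left after the limits.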

Methods
If the authors of the paper did report gene function, we wanted to know which information sources and retrieval methods they used, as well as the reasons they had for using them. So we classified the relevant articles with respect to the categories “Functional Attribution Reported”, “Sources of Information Reported”, “Retrieval Strategy Reported”, “Grounds for Choice of Sources Reported”, and “Grounds for Retrieval Strategy Reported”.

Methods
How were details of sources and retrieval methods reported? We were also interested in where in the paper those details appeared. Thus, when the information sources and retrieval methods used were discussed, we noted the sections of the paper in which they were discussed: the Methods or Procedures section, the Results section, or the Discussion section.
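
One way to picture the coding scheme is as a small record per article: which of the five categories were reported, and in which sections retrieval details appeared. The Python sketch below is a hypothetical representation for illustration, not the instrument used in the study, and the coded articles are invented.

```python
from collections import Counter
from dataclasses import dataclass, field

# The five categories named in the talk, as short identifiers.
CATEGORIES = (
    "functional_attribution",
    "sources_of_information",
    "retrieval_strategy",
    "grounds_for_sources",
    "grounds_for_strategy",
)

@dataclass
class ArticleCoding:
    article_id: str
    reported: set = field(default_factory=set)  # subset of CATEGORIES
    sections: set = field(default_factory=set)  # e.g. {"Methods", "Results"}

def tally(codings):
    """Count how many articles report each category and use each section."""
    by_category, by_section = Counter(), Counter()
    for coding in codings:
        by_category.update(coding.reported)
        by_section.update(coding.sections)
    return by_category, by_section

# Invented codings, for illustration only.
toy_codings = [
    ArticleCoding("a1", {"functional_attribution", "sources_of_information"}, {"Results"}),
    ArticleCoding("a2", {"functional_attribution"}, set()),
    ArticleCoding("a3", {"functional_attribution", "retrieval_strategy"}, {"Results", "Discussion"}),
]

by_category, by_section = tally(toy_codings)
for name in CATEGORIES:
    print(name, by_category[name])
print(dict(by_section))
```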

Results
Typical evidence for attribution of gene function consists of literature citations. When a literature search (e.g., a Pubmed search) or a search of other knowledge sources (e.g., NCBI databases) is cited as the source of evidence to support attribution of function, details of the search are rarely reported, and certainly not at a level of detail that would allow repeatability. It is also the case that reasons for using particular sources and retrieval methods are not reported.

Results
Interestingly, when information retrieval methods are described in the paper, even in detail, they are typically mentioned only in the “Results” or “Discussion” sections, and not in the “Methods” section. Perhaps even more interesting is that wet bench methods are reported in much more detail than dry bench methods.

Implications for Information Practice

I will mention three implications for information practice suggested by our studies.
There is a need to embrace a workflow concept
There is a need to develop standards for documentation in e-science
There is a need to use multidisciplinary teams to develop workflows

“There is a need to embrace a workflow concept”
A scenario of information retrieval and processing that uses multiple information resources, databases, and analysis tools in combination (like the three paths to the literature that we examined earlier) is called a workflow. I put this implication first because workflows are increasingly important for information retrieval and processing in the Life Sciences.

“There is a need to develop standards for documentation in e-science”
The second implication is that there is a need to develop standards for documentation in e-science. It is commonly suggested, and presumably it is true, that we are witnessing the ongoing digitization of science (or e-science), with computer-based information retrieval and processing methods increasingly being incorporated into the day-to-day doing of traditional science.

Life Science Information Retrieval and Processing Workflows
Presumably, embracing information processing and retrieval workflows in the Life Sciences requires that we have clear constraints on the quality (e.g. peer review) of those workflows, as well as assurance of repeatability of methods and results.

For this we need documentation of the details of the workflow.

In order to achieve the level of documentation that is required for quality and repeatability of methods and results with any very complicated resource composition or workflow, we need to develop technology to facilitate the documentation.

But in addition to the technology for capturing and managing provenance records, there must also be policy drivers, particularly editorial policy drivers, to ensure the level of documentation of methods and results required for quality and repeatability.
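
As a rough idea of what a captured provenance record could contain, the Python sketch below describes one retrieval step and serializes it to JSON. The field names and the `WorkflowStep` class are hypothetical choices for illustration, not a proposed standard. The example values reuse the Medline search described earlier in the talk, and the rationale text is an invented placeholder.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class WorkflowStep:
    resource: str    # database or tool that was used
    query: str       # exact query or parameters, as entered
    date_run: str    # ISO date, since resource contents change over time
    n_results: int   # size of the result set at that date
    rationale: str   # why this resource and strategy were chosen

def to_provenance_json(steps):
    """Serialize the ordered steps of a workflow to a JSON document."""
    return json.dumps([asdict(step) for step in steps], indent=2)

steps = [
    WorkflowStep(
        resource="OVID Medline",
        query="Oligonucleotide Array Sequence Analysis/ "
              "(English; review articles excluded; 2003-2004)",
        date_run="2004-10-01",
        n_results=4798,
        rationale="placeholder: reason for choosing this source and query",
    ),
]

print(to_provenance_json(steps))
```

A record like this holds the search details and the grounds for the choices, which are the two things the literature review found to be rarely reported.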

“There is a need to use multidisciplinary teams to develop workflows”
[Diagram: Knowledge-Enabled Workflows, drawing on Information Items, Tools, and Metadata. The domain expert (scientist) is associated with the information items and tools, and the domain metadata expert (information specialist) with the metadata.]
The third implication for information practice is that there is a need to use multidisciplinary teams to develop workflows. I think a typical situation in which workflows might be developed would be one in which we have primary information resources and items, various tools for accessing or manipulating that information, metadata describing the primary information and tools, and then workflows that use the primary information and tools. The design of the workflows is based in part on knowledge of the primary information, which comes from the domain expert (scientist), and in part on knowledge of the metadata, which comes from the domain metadata expert (information specialist).

In order to construct workflows, both kinds of expertise are required. The domain expert is needed to provide the scientific bases for the workflow design, and the information specialist is needed for his or her expertise in the metadata of the domain. The information specialists (e.g., librarians) do not need to become experts in biology, that is, experts in the primary information and tools. But they do need to be experts in the metadata of biology.
