Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001 James.


Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001 James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries CECS, HMI, Statistics, and SISLT

Abstract Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared to that of the articles they represent. Even though many sources are now available in full-text, many still rely on the abstract for retrieval … … In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent

Background Many retrieval systems still use abstracts as surrogates for full text. Abstracts are often indexed with respect to word occurrence by employing Zipf’s Law. –The product of a word’s occurrence frequency and the rank of that frequency is approximately constant. –The most frequently and least frequently occurring words contribute little to article content.
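As a minimal sketch of the rank-frequency relationship described above, the following Python snippet ranks words by frequency and multiplies each rank by its frequency. The function name and the toy word list are assumptions made for illustration; they do not come from the study, and the toy data is deliberately constructed so that the product comes out constant.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(words)
    # most_common() sorts by descending frequency.
    ranked = counts.most_common()
    return [(w, rank, freq, rank * freq)
            for rank, (w, freq) in enumerate(ranked, start=1)]

# Hypothetical toy corpus built so that frequency is proportional to 1/rank.
toy_words = ["the"] * 12 + ["of"] * 6 + ["patient"] * 4 + ["dose"] * 3

for row in rank_frequency_products(toy_words):
    print(row)
```

On real text the products are only roughly constant, which is why the relationship is usually stated as an approximation.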

Background (cont.) Previous studies have shown that abstracts are sometimes inconsistent with their corresponding articles. However, no study has previously shown that abstracts and articles are inconsistent in a statistical sense.

Methods Four medical journals (BMJ, JAMA, Lancet, and NEJM) –Two different countries –Many medical subdisciplines –Regarded as top journals –Available in electronic format Studied all articles which contained an abstract and were 2 pages or longer during –1,138 articles, less 35 with parsing problems, left 1,103 articles

Methods (cont.) The text of each article and abstract was downloaded and stored as HTML. The HTML was parsed into separate abstract and article files by a custom C++ parsing program. References and figures were removed.

Methods (cont.) “Content-bearing words” were extracted from abstracts and articles. –Numerical values, special characters, and captions were excluded and used as word delimiters. Words contained in a home-grown “stop word list” (words with little or no medical meaning) were removed.

Methods (cont.) The remaining words were conflated using NLM’s LVG tools. –E.g., “reading” -> “read” Frequencies of all conflated words were calculated for abstracts and articles.
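The extraction steps described on the Methods slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's implementation: the original parser was written in C++, the stop word list here is a hypothetical placeholder for the home-grown list, and `conflate` is a toy suffix stripper standing in for NLM's LVG tools, whose real behaviour is far richer.

```python
import re
from collections import Counter

# Hypothetical stop word list; the study used its own home-grown list.
STOP_WORDS = {"the", "a", "of", "and", "in", "was", "were", "to", "is"}

def conflate(word):
    """Toy stand-in for LVG: strip a plural "s", then a trailing "ing"."""
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
    return word

def content_word_frequencies(text):
    """Split on non-letters, drop stop words, conflate, and count."""
    # Digits and special characters act as word delimiters.
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    kept = [conflate(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

freqs = content_word_frequencies(
    "Reading the 3 readings: patients read, and the patient was reading."
)
print(freqs)
```

Running the same function over an abstract and over its article body yields the two frequency tables that the analysis compares.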

Analysis We used a chi-squared test to determine whether discrepancies between the observed occurrences of a word in the abstract and its occurrences in the article were due to sampling or were truly indicative of a difference in content.

Analysis (cont.) Example: Rosing (Lancet) –The abstract contained 140 content-bearing words. –“contraceptive” appeared 6 times in the abstract and 35 times in the text of the article. –Since the text contained 1081 content-bearing words, we expect 140/1081 * 35 = 3.35 occurrences of this term in the abstract.

Analysis (cont.) Example: Rosing (Lancet) –The actual number of occurrences was 6, so the square of the error divided by the expected count was added to the chi-squared statistic for this particular word (i.e., ((6-3.35)^2)/3.35 = 2.10). –Every other content-bearing word in the article was compared to the abstract in this way, and the sum of these terms was the total chi-squared statistic for the given article.
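The per-word calculation in this example can be sketched in Python as below. The function names are assumptions introduced for illustration. One caution when checking the arithmetic: evaluated with the counts exactly as printed, 140/1081 * 35 comes to about 4.53 rather than 3.35, so the sketch feeds the slide's stated expected value (3.35) into the per-word term instead of trying to reproduce it.

```python
def expected_in_abstract(abstract_total, article_total, article_count):
    """Expected abstract occurrences if the abstract mirrors the article's word distribution."""
    return abstract_total / article_total * article_count

def chi_squared_term(observed, expected):
    """One word's contribution: squared error divided by the expected count."""
    return (observed - expected) ** 2 / expected

def chi_squared_statistic(abstract_counts, article_counts):
    """Sum the per-word terms over every content-bearing word in the article."""
    abstract_total = sum(abstract_counts.values())
    article_total = sum(article_counts.values())
    total = 0.0
    for word, article_count in article_counts.items():
        expected = expected_in_abstract(abstract_total, article_total, article_count)
        # Words missing from the abstract count as zero observed occurrences.
        total += chi_squared_term(abstract_counts.get(word, 0), expected)
    return total

# The slide's own figures for "contraceptive": observed 6, expected 3.35.
print(round(chi_squared_term(6, 3.35), 2))
```

`chi_squared_statistic` takes two word-to-count mappings, such as the frequency tables built for an abstract and its article, and returns the article-level statistic.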

Analysis (cont.) We reran our analysis using the Bonferroni inequality to ensure that significant results did not arise simply by virtue of our large sample size.
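For readers unfamiliar with the Bonferroni approach, a minimal sketch follows. The slides do not say which significance level was used or exactly how the correction was applied, so the 0.05 level, the function names, and the p-values below are all assumptions chosen for illustration.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that remains significant under the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

# Hypothetical p-values for four tests; the corrected threshold is 0.05 / 4 = 0.0125.
p_values = [0.001, 0.02, 0.04, 0.30]
print(significant_after_bonferroni(p_values))
```

The correction is deliberately strict: with many tests the per-test threshold becomes very small, so fewer pairs are flagged as disagreeing.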

Cumulative Results w/o Bonferroni

Cumulative Results w/ Bonferroni

Future Work Utilize a smaller, more standard stop word list (see Su K, et al., “Comparing Frequency of Word Occurrences in Abstracts and Texts Using Two Stop Word Lists” in Fall 2001 AMIA Proceedings). Explore “over agreement”.

Future Work (cont.) Compare phrases (terms) rather than words. Utilize the UMLS to compare Concept Unique Identifiers (CUIs) via MetaMap rather than words or phrases. –Changes in agreement/disagreement may indicate the use of synonyms, which might still negatively affect retrieval.

Conclusion In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent. Our test was “conservative” in the sense that we can only strongly state that a small number of abstract/article pairs do “disagree”. However, the remaining articles can only be said to not conclusively disagree.

Acknowledgements This research was supported in part by grant T LM from the National Library of Medicine, United States of America.

Questions