Article Review Study Fulltext vs Metadata Searching Brad Hemminger School of Information and Library Science University of North Carolina.

1 Article Review Study Fulltext vs Metadata Searching Brad Hemminger bmh@ils.unc.edu School of Information and Library Science University of North Carolina at Chapel Hill

2 Background Traditionally, most researchers have searched for scholarly information through bibliographic databases, which match search keywords against the metadata that describes the content; journal articles are the most common form of content [Hersh, 2006]. Examples of commonly used bibliographic databases include PubMed and the ISI Web of Knowledge. The metadata description serves as a surrogate for the complete article itself. As electronic (digital) versions of articles have become available, interest has grown in searching the complete, or “full-text”, article itself. Many publishers are beginning to support full-text searching of their online content.

3 Background The Pew survey for OCLC in 2003 [Online Computer Library Center, 2005] found that the vast majority of people (89%) turn to search engines to initiate their searches for information, while few use library web pages (2%) or online databases (2%). Even academic research scientists prefer search engines over library web pages when searching for information for research purposes [Hemminger, 2005] and are increasingly turning to meta-search interfaces like Google Scholar to perform full-text searches.

4 Research Question While it is clear that full-text matches of search strings yield more matches than just searching for matches within the metadata of articles, it is not evident how many more matches or previously undiscovered articles are found on average, or how relevant they are. It is often simply assumed that finding additional articles will automatically be of greater value to the searcher. However, as users have discovered when faced with millions of search engine hits to sort through, more is not always better.

5

| | Arabidopsis | Schizophrenia |
|---|---|---|
| Article Discovery Set | Plant Cell; Plant Physiology; Genes & Development; Journal of Experimental Biology; PNAS (13,991 total articles) | PNAS; The American Journal of Human Genetics; American Journal of Psychiatry; Archives of General Psychiatry (12,314 total articles) |
| Article Review Base Set (three major journals selected in research area, covering 1994-2005) | Plant Cell; Plant Physiology; Genes & Development | American Journal of Psychiatry; American Journal of Human Genetics; PNAS |
| Gene Names | Candidates (5175); Article Review Subset (10) | Candidates (26597); Article Review Subset (15) |
| Article Review Study Set | Metadata Articles (18); Full-Text Articles (82); Total (100) | Metadata Articles (19); Full-Text Articles (83); Total (102) |
| Article Review Training Set | Metadata Articles (3); Full-Text Articles (17) | Metadata Articles (3); Full-Text Articles (9) |

6 Article Discovery

| | Schizophrenia + Schizophrenia Gene | Schizophrenia Gene | Arabidopsis Gene |
|---|---|---|---|
| Genes Found in Metadata Only | 172 (8.58%) | 3541 (20.63%) | 2712 (8.83%) |
| Genes Found in Full-text Only | 1671 (83.38%) | 10125 (58.99%) | 5705 (18.57%) |
| Genes Found in Metadata and Full-text | 161 (8.03%) | 3498 (20.38%) | 22305 (72.60%) |
| Totals for Found Genes | 2004 | 17164 | 30722 |
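Each percentage on this slide is a category's share of the found total for its column. A minimal Python sketch that recomputes those shares from the counts on the slide; the dictionary layout and the helper name `shares` are illustrative choices, not part of the study:

```python
# Counts copied from the Article Discovery slide.
# Tuple order: (metadata only, full-text only, metadata and full-text).
found = {
    "Schizophrenia + Schizophrenia Gene": (172, 1671, 161),
    "Schizophrenia Gene": (3541, 10125, 3498),
    "Arabidopsis Gene": (2712, 5705, 22305),
}


def shares(counts):
    """Return each count as a percentage of the column total, rounded to 2 places."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


for label, counts in found.items():
    print(label, sum(counts), shares(counts))
```

Read this way, the full-text-only share is largest for the combined schizophrenia search and smallest for Arabidopsis, where most matches already appear in both metadata and full text.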

7 Article Review Study
Two literature cohorts:
– Schizophrenia (Pat Sullivan)
– Arabidopsis (Todd Vision)
Each cohort had three readers.
Readers were asked to “review the article and judge its relevance to them as someone new to the gene in this biological setting, trying to build an understanding of the state of knowledge in that research area.”

8 Rating Scale for Reviewing Articles

| Rating | Rating Name | Rating Usage |
|---|---|---|
| 1 | Definitely Useful | Right on topic, very helpful, primary initial study, excellent review, etc. |
| 2 | Probably Useful | On topic and potentially important material |
| 3 | Possibly Useful | Has some material or references that are likely useful, but not certain without further checking |
| 4 | Probably Not Useful | Unlikely, but may have some use, for instance references to check out |
| 5 | Definitely Not Useful | Not on topic; nothing of direct value, not worth keeping. |

9 Metadata Articles More Valuable In both cohorts and for all observers, mean quality ratings were lower (more useful) for the metadata-discovered articles. The differences between the mean quality ratings of the metadata-discovered and the full-text-discovered articles were statistically significant at the p < 0.05 level for both the Arabidopsis and Schizophrenia sets.

10 Precision and Recall

| | Schizophrenia Recall | Schizophrenia Precision | Arabidopsis Recall | Arabidopsis Precision |
|---|---|---|---|---|
| Metadata discovered | 15.7% (16.6%) | 94.7% | 84.1% (84.1%) | 100% |
| Full-text only discovered | 100% | 63.7% | 100% | 69% |
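For readers less familiar with the two measures: precision is the fraction of retrieved articles that are relevant, and recall is the fraction of relevant articles that are retrieved. The slide does not state how the five-point ratings were mapped to "relevant", so the sketch below uses hypothetical article IDs and a hypothetical relevant set purely to show the arithmetic; `precision_recall` is an illustrative name.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one result set.

    retrieved: set of article IDs returned by a search
    relevant:  set of article IDs judged useful by reviewers
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data, not the study's articles.
relevant = {"a1", "a2", "a3", "a4"}
metadata_results = {"a1", "a2"}
fulltext_results = {"a1", "a2", "a3", "a4", "a5", "a6"}

print(precision_recall(metadata_results, relevant))
print(precision_recall(fulltext_results, relevant))
```

The toy data mirrors the pattern on the slide: the smaller metadata result set has higher precision, while the larger full-text result set has higher recall.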

11 Article Features that correlate with Value: Number of Hits The number of hits, or matches of the search term within the returned document, is a commonly used feature for ranking returned articles. To test the value of this feature, the number of hits was correlated with the mean quality rating for each article (averaged across all observers). The results clearly show a relationship in which articles with many matches of the search term tend to be much more highly valued.
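The slide does not say which correlation measure was used. As one possible illustration, the sketch below computes a Pearson correlation between hit counts and mean ratings; the data values are hypothetical and the function name `pearson` is an illustrative choice. Because a lower rating means a more useful article, a negative coefficient corresponds to the relationship described above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-article data, not taken from the study.
hit_counts = [1, 3, 6, 12, 20, 35]
mean_ratings = [3.7, 3.3, 3.0, 2.7, 1.7, 1.3]  # 1 = most useful, 5 = least useful

r = pearson(hit_counts, mean_ratings)
print(f"r = {r:.2f}")  # negative: more hits go with lower (better) ratings
```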

12 Improving Relevance for Metadata Searching Repeating the calculations on the schizophrenia and Arabidopsis article review sets, but limited to matches with high hit counts (Schizophrenia ≥ 20 hits and Arabidopsis ≥ 15 hits), shows that precision for the full text is now the same as (100% in Arabidopsis) or slightly better than (95% versus 94.4% in schizophrenia) that of the metadata-retrieved articles. However, the gain in cases discovered by full-text searching is now modest: 5% more cases in schizophrenia and 28% more in Arabidopsis.

13 Conclusions This suggests that rather than accepting metadata searching as a surrogate for full-text searching, it may be time to make the transition to direct full-text searching as the standard. This could be accomplished by using certain features of the full-text article, such as the number of hits of the search string or whether the search string is found in the metadata (i.e. our current metadata search), as filters that increase the precision of our results and put the user in control of the filtering.

15 Schizophrenia

| Observer | A | B | C | Mean |
|---|---|---|---|---|
| Mean Ratings | 3.29 | 2.51 | 3.05 | 2.95 |
| Mean Ratings (Fulltext) | 3.58 | 2.71 | 3.27 | 3.19 |
| Mean Ratings (Metadata) | 2.05 | 1.63 | 2.11 | 1.93 |
| Difference in Mean Rating (Fulltext - Metadata) | 1.53 | 1.08 | 1.16 | 1.26 |

16 Arabidopsis

| Observer | D | E | F | Mean |
|---|---|---|---|---|
| Mean Ratings | 3.09 | 2.83 | 2.85 | 2.92 |
| Mean Ratings (Fulltext) | 3.43 | 3.00 | 3.07 | 3.17 |
| Mean Ratings (Metadata) | 1.56 | 2.06 | 1.83 | 1.82 |
| Difference in Mean Rating (Fulltext - Metadata) | 1.87 | 0.94 | 1.24 | 1.35 |

17 Schizophrenia Gene

| Group | Range | Mean Rating Value | Different from Groups |
|---|---|---|---|
| A | 1-4 hits | 3.24 | C |
| B | 5-19 hits | 2.88 | C |
| C | 20 or more hits | 1.62 | A, B |

18 Arabidopsis

| Group | Range | Mean Rating Value | Different from Groups |
|---|---|---|---|
| A | 1-4 hits | 3.41 | C |
| B | 5-14 hits | 2.94 | C |
| C | 15 or more hits | 1.69 | A, B |

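19

| Search Term | Schizophrenia: Number of Matches | Schizophrenia: Percentage of Articles Matched | Schizophrenia: Mean Reviewer Rating for Article Class | Arabidopsis: Number of Matches | Arabidopsis: Percentage of Articles Matched | Arabidopsis: Mean Reviewer Rating for Article Class |
|---|---|---|---|---|---|---|
| SPDG | 8 | 7.84 | 3.04 | 39 | | 3.13 |
| SGDO | 2 | 1.96 | 1.67 | 20 | 20.00 | 2.27 |
| SGDD | 25 | 24.51 | 3.39 | 0 | 0.00 | |
| DGDD | 10 | 9.80 | 3.90 | 0 | 0.00 | |
| MUTANT | 11 | 10.78 | 1.55 | 21 | 21.00 | 2.05 |
| FAMILY | 2 | 1.96 | 2.00 | 42 | 42.00 | 2.71 |
| SEQUENCE | 26 | 25.49 | 1.94 | 17 | 17.00 | 1.92 |
| INTERACTION | 4 | 3.92 | 3.33 | 25 | 25.00 | 2.23 |
| PROCESS | 28 | 27.45 | 2.46 | 44 | 44.00 | 2.32 |
| STRUCTURE | 7 | 6.86 | 2.10 | 7 | 7.00 | 1.76 |
| UP | 0 | 0.00 | | 26 | 26.00 | 2.56 |
| DOWN | 0 | 0.00 | | 2 | 2.00 | 2.33 |
| REVIEW | 18 | 17.65 | 2.59 | 10 | 10.00 | 2.63 |
| MARKER | 1 | 0.98 | 4.00 | 17 | 17.00 | 3.31 |
| FP | 2 | 1.96 | 4.00 | 5 | 5.00 | 3.87 |
| REFERENCE | 36 | 35.29 | 3.22 | 13 | 13.00 | 4.03 |
| TABLE | 3 | 2.94 | 3.44 | 8 | 8.00 | 3.29 |
| MIP | 38 | 37.25 | 3.46 | 36 | 36.00 | 3.55 |
| IMG | 15 | 14.71 | 3.38 | 1 | 1.00 | 1.33 |
| Text | 67 | 0.66 | 2.80 | 90 | 0.90 | 2.79 |
| ReferencesOnly | 33 | 0.32 | 3.31 | 9 | 0.09 | 4.19 |
| Letter | 3 | 0.03 | 3.22 | 1 | 0.01 | 4.00 |
| Errata | 1 | 0.01 | 2.33 | 0 | 0.00 | |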

20 Results First, full-text searching can perform as well as or better than metadata searching in precision and recall. Second, the best solution might be to provide a dynamic interface that lets the user trade off precision against recall by controlling the hit-count threshold by which the results are filtered.
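The trade-off such an interface would expose can be sketched by sweeping a minimum hit count over a result list and recomputing precision and recall at each setting. Everything in this example is an assumption made for illustration: the result tuples, the relevance flags, the thresholds, and the function name `precision_recall_at`. Recall is measured relative to the unfiltered full-text result set, following the convention on the Precision and Recall slide, where full-text recall is 100%.

```python
# Hypothetical results: (article ID, hit count, judged relevant by reviewers).
results = [
    ("a1", 42, True),
    ("a2", 25, True),
    ("a3", 18, True),
    ("a4", 9, False),
    ("a5", 6, True),
    ("a6", 3, False),
    ("a7", 1, False),
    ("a8", 1, True),
]


def precision_recall_at(results, min_hits):
    """Precision and recall when only articles with >= min_hits matches are kept."""
    kept = [r for r in results if r[1] >= min_hits]
    total_relevant = sum(1 for r in results if r[2])
    kept_relevant = sum(1 for r in kept if r[2])
    precision = kept_relevant / len(kept) if kept else 0.0
    recall = kept_relevant / total_relevant if total_relevant else 0.0
    return precision, recall


# Raising the threshold trades recall for precision.
for threshold in (1, 5, 20):
    p, r = precision_recall_at(results, threshold)
    print(f"min_hits={threshold:>2}  precision={p:.2f}  recall={r:.2f}")
```

With this toy data, a threshold of 1 keeps everything (full recall, lowest precision), while a threshold of 20 keeps only the two highest-hit articles (full precision, lowest recall).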

21

| | Schizophrenia + Schizophrenia Gene | Schizophrenia Gene | Arabidopsis Gene |
|---|---|---|---|
| Genes Found in Metadata Only | 172 (8.58%) | 3541 (20.63%) | 2712 (8.83%) |
| Genes Found in Full-text Only | 1671 (83.38%) | 10125 (58.99%) | 5705 (18.57%) |
| Genes Found in Metadata and Full-text | 161 (8.03%) | 3498 (20.38%) | 22305 (72.60%) |
| Totals for Found Genes | 2004 | 17164 | 30722 |
| Genes not found | 327513454 | 327498294 | 72372703 |
| Overall Total | 327515458 | 327515458 | 72403425 |

