Combining Full-text Analysis & Bibliometric Indicators: a pilot study. Patrick Glenisson, Wolfgang Glänzel, Olle Persson

1 Combining Full-text Analysis & Bibliometric Indicators: a pilot study
  Patrick Glenisson (1), Wolfgang Glänzel (1,2), Olle Persson (3)
  (1) Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium)
  (2) Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary)
  (3) Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

2 Introduction
  Goal: mapping of scientific processes
  Map of scientific papers
  Characterization of emerging clusters
  Extraction of new search keys
  Using bibliometric as well as lexical indicators of ‘relatedness’
  Full-text analysis

3 Overview
  Data sources and questions asked
  Text mining ingredients
  Text-based relational analysis of documents
  Contrasts with bibliometric analysis
  Term extraction from full-text
  Conclusion

5 Data source
  19 full-text papers from Scientometrics, Vol. 60, Issue 3 (2004): special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)
  Validation setup: manual assignment of the papers to classes

6 Data source
Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdridge (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)

7 Research questions
  Comparison of text-based mapping vs. expert classification
  Extracted keywords
  Comparison with bibliometric mapping

11 Methodology
  Given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them

12 Methodology: document processing
  Remove punctuation and grammatical structure (‘bag of words’)
  Define a vocabulary
  Identify multi-word terms, e.g. tumor suppressor (phrases)
  Eliminate low-content words, e.g. and, thus (stopwords)
  Map words with the same meaning (synonyms)
  Strip plurals, conjugations, etc. (stemming)
  Define a weighting scheme and/or transformations (tf-idf, SVD, ...)
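The processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the suffix-stripping rule and the phrase list are hypothetical stand-ins, not the resources used in the study, and synonym mapping is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, not from the slides).
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "are", "to"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, drop punctuation, remove stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]


def with_bigrams(tokens, phrases):
    """Join adjacent tokens that form a known phrase into a single term."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = tokenize("Tumor suppressors and the citations of papers.")
terms = with_bigrams(tokens, {("tumor", "suppressor")})
print(terms)  # ['tumor_suppressor', 'citation', 'paper']
```

Removing stopwords before phrase detection is a simplification; a fuller pipeline would probably detect phrases on the original token sequence.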

13 Methodology
  Compute an index of the textual resources (figure: document vectors in a space spanned by vocabulary terms T1, T2, T3)
  Similarity between documents: Salton’s cosine
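The formula shown on this slide is missing from the transcript. For reference, the standard definition of Salton's cosine is given below; the notation is an assumption made here, with w_{ik} the weight of vocabulary term k in document i and m the vocabulary size.

```latex
\[
\mathrm{sim}(d_i, d_j)
  = \frac{\sum_{k=1}^{m} w_{ik}\, w_{jk}}
         {\sqrt{\sum_{k=1}^{m} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{m} w_{jk}^{2}}}
\]
```

With non-negative weights the value lies between 0 (no shared terms) and 1 (identical direction).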

15 Results – Term statistics
  19 papers
  3610 retained terms (including ~400 bigrams)
  Distance matrix (19x19)
  Apply MDS
  Apply clustering
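As a rough sketch of the distance-matrix step, the snippet below turns sparse term-weight vectors into a symmetric matrix. The slides do not say how similarity was converted to distance; using 1 minus the cosine is an assumption, and the three toy documents are made up.

```python
import math


def cosine(u, v):
    """Salton's cosine between two sparse {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def distance_matrix(docs):
    """Symmetric matrix of 1 - cosine (assumed distance, see lead-in)."""
    n = len(docs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine(docs[i], docs[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Toy, made-up term weights for three documents.
docs = [
    {"citat": 1.0, "collabor": 1.0},
    {"citat": 1.0, "collabor": 1.0, "web": 1.0},
    {"web": 1.0, "inlink": 1.0},
]
D = distance_matrix(docs)
```

For the 19 papers this would give the 19x19 matrix that MDS and clustering take as input.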

16 Results – MDS

17 (Figure: annotated map with labelled regions Policy, Mathematical approaches, Webometrics)

18 Results – Clustering
  Hierarchical clustering, Ward method, cut-off k=4
  Optimal parameters? ‘Stability-based method’
  Quantified correspondence with expert assignments? ‘Rand index’

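19 Results – Peer evaluation
Cluster | I | II | III | IV | V
1 | 3 | 4 | 1 | 0 | 0
2 | 0 | 0 | 0 | 3 | 0
3 | 0 | 0 | 1 | 0 | 3
4 | 1 | 0 | 2 | 1 | 0
(rows: clusters; columns: expert classes I to V)
Labels on the slide: Policy, Mathematical approaches, Webometrics
Rand index = 0.778; p-value (w.r.t. permuted data) < 10^-3; significant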
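The reported Rand index can be checked from the cluster-by-class counts on this slide. The sketch below reads the flattened digits as a 4x5 table (clusters 1 to 4 by classes I to V), which is an interpretation of the transcript, and uses the usual pair-counting definition of the Rand index. It does not reproduce the permutation test, whose details the slides do not give.

```python
from math import comb

# Rows: clusters 1-4; columns: expert classes I-V (counts read from the slide).
table = [
    [3, 4, 1, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 1, 0, 3],
    [1, 0, 2, 1, 0],
]


def rand_index(table):
    """Fraction of document pairs on which the two partitions agree."""
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    # Pairs placed together by both partitions.
    same_both = sum(comb(c, 2) for row in table for c in row)
    # Pairs placed together by the clustering / by the expert classes.
    same_cluster = sum(comb(sum(row), 2) for row in table)
    same_class = sum(comb(sum(col), 2) for col in zip(*table))
    # Agreements = together in both + apart in both.
    agreements = total_pairs + 2 * same_both - same_cluster - same_class
    return agreements / total_pairs


ri = rand_index(table)
print(round(ri, 3))  # 0.778
```

The counts sum to 19 papers and give 133 agreeing pairs out of 171, which matches the value on the slide.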

21 Results – Reference age Histograms per paper

22 Results – Reference age Histograms aggregated by expert class

23 Results – Ref Age vs. % Serial Scatter plot of Expert classes: Mean Reference Age vs. Percentage of Serials

25 Results – Term extraction
  Calculation of seminal keywords for each article
  Using the TF-IDF weighting scheme
  Vectors normalized to norm 1 to account for document length
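A minimal sketch of this keyword step follows. The slides do not specify the TF-IDF variant; raw term frequency times log(N/df) and an L2 norm of 1 are assumptions, and the toy token lists are made up.

```python
import math
from collections import Counter


def tfidf_unit(docs):
    """docs: list of token lists -> list of {term: weight} with L2 norm 1."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def top_terms(vec, k=10):
    """The k highest-weighted terms of one document."""
    return sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy, made-up token lists for three documents.
docs = [
    ["co_author", "co_author", "collabor", "citat"],
    ["synchronous", "age", "citat"],
    ["department", "inlink", "citat"],
]
vectors = tfidf_unit(docs)
best = top_terms(vectors[0], k=2)
```

A term that occurs in every document, such as citat here, gets weight 0 and never ranks as a keyword. Unit-norm scaling keeps weights comparable between long and short papers.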

26 Results – Term extraction: top-weighted terms per article

Author(s): Persson et al.
Title: Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies
  co_author | 0.417794
  collabor* | 0.287652
  domest* | 0.208460
  self_citat* | 0.185298
  explan* | 0.170916
  Growth | 0.154099
  reference_list | 0.151925
  intern*_collabor* | 0.151925
  reference_behaviour | 0.151468
  inflationari | 0.151468

Author(s): Glänzel
Title: Towards a model for diachronous and synchronous citation analyses
  diachronous_prospect | 0.492265
  synchronous | 0.377403
  synchronous_retrospect | 0.360994
  age | 0.250921
  diachronous_prospect | 0.238375
  technic*_reliabl* | 0.180497
  citat*_process | 0.150553
  life_time | 0.147679
  impact_measur* | 0.125460
  random_select* | 0.114862

Author(s): Moed and Garfield
Title: In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter
  research_field | 0.358836
  authorit*_docum* | 0.281942
  authorit* | 0.241017
  docum* | 0.197558
  referenc* | 0.179418
  percent_most | 0.179418
  refer*_list | 0.176746
  refer* | 0.165171
  frequent*_cite | 0.156779
  persuasion | 0.153787

Author(s): Shelton and Holdridge
Title: The US-EU race for leadership of science and technology, Qualitative and quantitative indicators
  EU | 0.638957
  WTEC | 0.346503
  panel | 0.224208
  output_indic* | 0.142678
  NAS | 0.142678
  leadership | 0.142678
  world | 0.119689
  input | 0.114998
  row | 0.102220
  panelist | 0.101913

Author(s): Tang and Thelwall
Class: IV
  department | 0.420497
  intern*_inlink | 0.315920
  gTLD | 0.273798
  public_impact | 0.189552
  disciplin* | 0.148494
  psychologi | 0.145234
  command | 0.145234
  region | 0.135706
  histori | 0.123676
  disciplinari_differ* | 0.105307

27 Results – Full-text vs Abstract
  Is a full-text analysis warranted for term extraction? For mapping purposes?

28 Results – Full-text vs Abstract
  Abstracts only: less structure
  Less overlap with expert classes: Rand index = 0.6257; p-value = 0.464; not significant
  Full-text is an interesting source for additional keywords and improved mapping

29 Conclusion
  Keyword approach may be naïve
  But applied in a systematic framework, in combination with the ‘right’ algorithms, it provides interesting clues
  Complementary to bibliometric approaches
  Weak indications of benefits from using full-text articles
  Future: extension of this pilot to larger samples

30 References
  Bibliometrics; homepage Wolfgang Glänzel: http://www.steunpuntoos.be/wg.html
  Bibliometrics; homepage Olle Persson: http://www.umu.se/inforsk/Staff/olle.htm
  Text & Data mining; PhD thesis Patrick Glenisson: ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf
  Optimal k in clustering; Stability method

