Combining Full-Text Analysis & Bibliometric Indicators: a pilot study
Patrick Glenisson (1), Wolfgang Glänzel (1,2), Olle Persson (3)
(1) Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium)
(2) Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary)
(3) Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)
Introduction. Goal: mapping of scientific processes: map of scientific papers; characterization of emerging clusters; extraction of new search keys. Using bibliometric as well as lexical indicators of ‘relatedness’. Full-text analysis.
Overview: Data sources and questions asked; Text mining ingredients; Text-based relational analysis of documents; Contrasts with bibliometric analysis; Term extraction from full-text; Conclusion
Data source: 19 full-text papers from Scientometrics, Vol. 60, Issue 3 (2004), the special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China). Validation setup: manual assignment of the papers to classes.
Data source
Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdridge (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)
Research questions: comparison of text-based mapping vs. expert classification; extracted keywords; comparison with bibliometric mapping.
Methodology: given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them.
Methodology: document processing. Remove punctuation and grammatical structure (‘bag of words’); define a vocabulary; identify multi-word terms, e.g. tumor suppressor (phrases); eliminate low-content words, e.g. and, thus (stopwords); map words with the same meaning (synonyms); strip plurals, conjugations, etc. (stemming); define a weighting scheme and/or transformations (tf-idf, SVD, etc.).
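The processing steps listed on this slide can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the stopword list, phrase list, synonym map and the `naive_stem` rule are all hypothetical stand-ins for curated resources and a real stemmer, and the order of the steps is one plausible choice.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration only).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "thus", "to"}
PHRASES = {("tumor", "suppressor")}   # multi-word terms to keep together
SYNONYMS = {"tumour": "tumor"}        # map spelling variants to one form


def naive_stem(word):
    """Strip a plural 's'; a crude stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Turn raw text into a list of index terms (bag of words)."""
    # Lowercase and keep alphabetic runs only: drops punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens]          # synonyms
    tokens = [t for t in tokens if t not in STOPWORDS]     # stopwords
    tokens = [naive_stem(t) for t in tokens]               # stemming
    # Merge known two-word phrases into a single term joined by '_'.
    terms, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            terms.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return terms
```

For example, `preprocess("The tumour suppressors, and thus the genes.")` yields the terms `tumor_suppressor` and `gene`. Removing stopwords before phrase detection can create adjacencies that were not in the text, so a more careful pipeline would detect phrases first.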
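Methodology: compute an index of the textual resources, i.e. a vector of term weights per document [Figure: index over vocabulary terms T1, T2, T3]. Similarity between documents is measured by Salton’s cosine: $\cos(d_i, d_j) = \frac{\sum_k w_{ik} w_{jk}}{\sqrt{\sum_k w_{ik}^2}\,\sqrt{\sum_k w_{jk}^2}}$, where $w_{ik}$ is the weight of term $k$ in document $i$.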
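To make the index and Salton's cosine concrete, here is a minimal sketch that builds tf-idf weighted term vectors and compares two documents. The tf-idf variant used (raw term frequency times the natural log of N/df) is an assumption; the slides do not say which variant was applied. The function names are illustrative.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    index = []
    for doc in docs:
        tf = Counter(doc)
        index.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return index


def cosine(u, v):
    """Salton's cosine between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)
```

Two documents with no terms in common get a cosine of 0, and a document compared with itself gets 1, which is why the measure is convenient as a length-independent similarity.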
Results – Term statistics: 19 papers; 3610 retained terms (including ~400 bigrams); distance matrix (19x19); apply MDS; apply clustering.
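The step from a similarity matrix to a two-dimensional map can be sketched as follows. The slides do not say which MDS variant or which distance transform was used, so this sketch assumes classical (Torgerson) MDS and the dissimilarity d = 1 - cosine. It uses only the standard library, with a simple power iteration in place of a full eigendecomposition, and it expects a symmetric distance matrix.

```python
import math


def similarity_to_distance(sim):
    """Assumed transform: dissimilarity = 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]


def classical_mds(dist, dims=2, iters=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the (deflated) B.
        vec = [math.sin(i + 1.0 + k) for i in range(n)]  # deterministic start
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iters):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        # The Rayleigh quotient gives the signed eigenvalue for this vector.
        bv = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigval = sum(vec[i] * bv[i] for i in range(n))
        if eigval > 0:
            for i in range(n):
                coords[i][k] = math.sqrt(eigval) * vec[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * vec[i] * vec[j]
    return coords
```

For distances that come from points in a plane, the returned coordinates reproduce those distances up to rotation and reflection. With 1 - cosine dissimilarities the matrix need not be exactly Euclidean; a dimension whose dominant eigenvalue is not positive is left at zero in this sketch.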
Results – MDS [Figure: MDS map of the 19 papers; groups labelled in the plot: Policy, Mathematical approaches, Webometrics]
Results – Clustering: hierarchical clustering, Ward method, cut-off k=4. Optimal parameters? ‘Stability-based method’. Quantified correspondence with expert assignments? ‘Rand index’.
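Ward's method with a cut at k clusters can be sketched in pure Python using the Lance-Williams update on squared distances. This is an illustrative sketch, not the software used in the study; the function name `ward_clusters` is hypothetical, the input is assumed to be a symmetric distance matrix, and k is assumed to be at least 1.

```python
def ward_clusters(dist, k):
    """Agglomerate with Ward's criterion until k clusters remain.

    Returns a list with one cluster label per item.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared distances between current clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        # Merge the pair of clusters with the smallest Ward distance.
        (a, b), dab = min(d2.items(), key=lambda kv: kv[1])
        na, nb = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        new_d = {}
        for c in members:
            nc = len(members[c])
            dac = d2[(min(a, c), max(a, c))]
            dbc = d2[(min(b, c), max(b, c))]
            # Lance-Williams update for Ward's method.
            new_d[c] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        d2 = {p: v for p, v in d2.items() if a not in p and b not in p}
        for c, v in new_d.items():
            d2[(c, next_id)] = v  # next_id is larger than every existing id
        members[next_id] = merged
        next_id += 1
    labels = [0] * n
    for label, cluster in enumerate(members.values()):
        for i in cluster:
            labels[i] = label
    return labels
```

Calling `ward_clusters(dist, 4)` on the 19x19 distance matrix would correspond to the k=4 cut mentioned on the slide. Choosing k by a stability-based method would mean repeating the clustering on resampled data and comparing the solutions, which is not shown here.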
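Results – Peer evaluation [Table: expert classes I to V cross-tabulated against text-based clusters (labels shown: Policy, Mathematical approaches, Webometrics); cell counts not recoverable]. Rand index = [value missing]; p-value (w.r.t. permuted data) < [value missing]; significant.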
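The agreement between a clustering and the expert classes, and its significance against permuted data, can be sketched as follows. Whether the study used the plain or the adjusted Rand index, and how many permutations were drawn, is not stated in the slides; this sketch assumes the plain Rand index and a label-shuffling permutation test, and the function names are illustrative.

```python
import random
from itertools import combinations


def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def permutation_p_value(expert, cluster, n_perm=1000, seed=0):
    """Share of random relabellings that agree with the expert classes
    at least as well as the observed clustering does."""
    rng = random.Random(seed)
    observed = rand_index(expert, cluster)
    shuffled = list(cluster)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rand_index(expert, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Shuffling the cluster labels keeps the cluster sizes fixed while breaking any link to the expert classes, so the resulting p-value answers whether the observed overlap exceeds what cluster sizes alone would produce.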
Results – Reference age Histograms per paper
Results – Reference age Histograms aggregated by expert class
Results – Ref Age vs. % Serial Scatter plot of Expert classes: Mean Reference Age vs. Percentage of Serials
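The two indicators on this scatter plot can be computed from a paper's reference list as sketched below. The definitions are assumptions, since the slides do not spell them out: reference age is taken as the citing paper's publication year minus the cited item's year, and the percentage of serials as the share of references that point to journals. The input format is hypothetical.

```python
from statistics import mean


def reference_profile(pub_year, references):
    """references: list of (cited_year, is_serial) tuples for one paper.

    Returns (mean reference age in years, percentage of serial references).
    """
    ages = [pub_year - year for year, _ in references]
    serial_share = 100.0 * sum(1 for _, is_serial in references if is_serial) / len(references)
    return mean(ages), serial_share


def class_profile(papers):
    """Pool the reference lists of all papers in one expert class.

    papers: list of (pub_year, references) tuples.
    """
    ages, serial_flags = [], []
    for pub_year, references in papers:
        for year, is_serial in references:
            ages.append(pub_year - year)
            serial_flags.append(is_serial)
    return mean(ages), 100.0 * sum(serial_flags) / len(serial_flags)
```

Pooling references per class, as `class_profile` does, weights long bibliographies more heavily than averaging per-paper values would; which of the two the study used is not stated.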
Results – Term extraction: calculation of characteristic keywords for each article, using the TF-IDF weighting scheme, normalized to norm 1 to account for document length.
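Keyword extraction along these lines can be sketched as ranking each paper's terms by tf-idf weight after scaling the paper's weight vector to unit length. The tf-idf variant (raw term frequency times the natural log of N/df) and the cut-off `top_n` are assumptions for illustration; the function name is hypothetical.

```python
import math
from collections import Counter


def top_keywords(docs, top_n=10):
    """docs: list of token lists. Returns, per document, the top_n
    (term, weight) pairs, with weights scaled so each document has norm 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        # Unit-length scaling removes the effect of document length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append([(t, w / norm) for t, w in ranked[:top_n]])
    return result
```

A term that occurs in every paper gets an idf of zero and therefore never ranks as a keyword, which matches the intent of picking terms that set one paper apart from the rest of the collection.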
Author(s): Persson et al. Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies
Keywords: co_author, collabor*, domest*, self_citat*, explan*, Growth, reference_list, intern*_collabor*, reference_behaviour, inflationari
Author(s): Glänzel. Towards a model for diachronous and synchronous citation analyses
Keywords: diachronous_prospect, synchronous, synchronous_retrospect, age, diachronous_prospect, technic*_reliabl*, citat*_process, life_time, impact_measur*, random_select*
Author(s): Moed and Garfield. In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter
Keywords: research_field, authorit*_docum*, authorit*, docum*, referenc*, percent_most, refer*_list, refer*, frequent*_cite, persuasion
Author(s): Shelton and Holdridge. The US-EU race for leadership of science and technology, Qualitative and quantitative indicators
Keywords: EU, WTEC, panel, output_indic*, NAS, leadership, world, input, row, panelist
Author(s): Tang and Thelwall. Class: IV
Keywords: department, intern*_inlink, gTLD, public_impact, disciplin*, psychologi, command, region, histori, disciplinari_differ*
Results – Full-text vs Abstract: Is a full-text analysis warranted for term extraction? For mapping purposes?
Results – Full-text vs Abstract: with abstracts only, less structure and less overlap with expert classes: Rand index = [value missing]; p-value = [value missing]; not significant. Full-text is an interesting source for additional keywords and improved mapping.
Conclusion: The keyword approach may be naïve, but applied in a systematic framework in combination with the ‘right’ algorithms, it provides interesting clues. Complementary to bibliometric approaches. Weak indications of benefits from using full-text articles. Future: extension of this pilot to larger samples.
References
Bibliometrics; homepage Wolfgang Glänzel
Bibliometrics; homepage Olle Persson
Text & Data mining; PhD thesis Patrick Glenisson, ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf
Optimal k in clustering; Stability method