Combining Full-text analysis & Bibliometric Indicators a pilot study Patrick Glenisson 1 Wolfgang Glänzel 1,2 Olle Persson 3 1.Steunpunt O&O Statistieken,

Presentation transcript:

Combining Full-text Analysis & Bibliometric Indicators: a pilot study
Patrick Glenisson (1), Wolfgang Glänzel (1,2), Olle Persson (3)
1. Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium)
2. Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary)
3. Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

Introduction
Goal: mapping of scientific processes
- Map of scientific papers
- Characterization of emerging clusters
- Extraction of new search keys
Using bibliometric as well as lexical indicators of 'relatedness'
Full-text analysis

Overview
- Data sources and questions asked
- Text mining ingredients
- Text-based relational analysis of documents
- Contrasts with bibliometric analysis
- Term extraction from full-text
- Conclusion

Data source
19 full-text papers from Scientometrics, Vol 60, Issue 3 (2004): special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China)
Validation setup: manual assignment of the papers to various classes

Data source
Section code | Section name | Papers
I | Advances in Scientometrics | Havemann et al. (2004); Moed and Garfield (2004); Small (2004); Yue and Wilson (2004)
II | Policy relevant issues | Negishi et al. (2004); Shelton and Holdridge (2004); Markusova et al. (2004); Wu et al. (2004)
III | Bibliometric approaches to collaboration in science | Beaver (2004); Kretschmer (2004); Persson et al. (2004); Yoshikane and Kageura (2004)
IV | Advances in Informetrics and Webometrics | Lamirel et al. (2004); Qiu and Chen (2004); Tang and Thelwall (2004); Vaughan and Wu (2004)
V | Mathematical models in Informetrics and Scientometrics | Egghe (2004); Glänzel (2004); Shan et al. (2004)

Research questions
- Comparison of text-based mapping vs. expert classification
- Extracted keywords
- Comparison with bibliometric mapping

Methodology
Given a set of documents, compute a representation, called an index, to retrieve, summarize, classify or cluster them

Methodology
Document processing
- Remove punctuation & grammatical structure ('bag of words')
- Define a vocabulary
- Identify multi-word terms, e.g., tumor suppressor (phrases)
- Eliminate low-content words, e.g., and, thus, ... (stopwords)
- Map words with the same meaning (synonyms)
- Strip plurals, conjugations, ... (stemming)
- Define a weighting scheme and/or transformations (tf-idf, svd, ...)
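The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the study: the stopword list, the `crude_stem` helper and the `preprocess` function are all hypothetical, the stemmer is a crude stand-in for a real one (e.g. Porter), and synonym mapping is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real study would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "thus", "to", "is", "are", "for"}

def crude_stem(word):
    """Strip a few common English endings; a stand-in for a real stemmer."""
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return bag-of-words counts over stemmed unigrams and bigrams."""
    # Lowercasing and keeping only letter runs removes punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopwords become None so that bigrams never span across them.
    stems = [None if t in STOPWORDS else crude_stem(t) for t in tokens]
    unigrams = [s for s in stems if s is not None]
    bigrams = [f"{a}_{b}" for a, b in zip(stems, stems[1:]) if a and b]
    return Counter(unigrams + bigrams)

counts = preprocess("Citations and the citation process")
print(counts)
```

Joining bigrams with an underscore mirrors the notation of the extracted terms later in the talk (e.g. `citat*_process`), but the exact phrase-detection rule used by the authors is not stated in the slides.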

Methodology
Compute index of textual resources: each document is a vector in the space spanned by the vocabulary terms (T1, T2, T3, ...)
Similarity between documents: Salton's cosine
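The formula itself did not survive in the transcript. In its standard form, Salton's cosine between two documents with term weights w_ik and w_jk is sum_k(w_ik * w_jk) / (sqrt(sum_k w_ik^2) * sqrt(sum_k w_jk^2)). A minimal sketch, assuming each document is stored as a sparse dictionary of term weights (the function name and data layout are illustrative choices, not taken from the slides):

```python
import math

def cosine(u, v):
    """Salton's cosine between two sparse term-weight vectors (dicts)."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents (assumption)
    return dot / (norm_u * norm_v)

doc_a = {"citation": 2.0, "age": 1.0}
doc_b = {"citation": 1.0, "web": 3.0}
print(cosine(doc_a, doc_b))
```

A distance for mapping or clustering is often taken as 1 minus the cosine; whether the study used exactly that transformation is not stated in the slides.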

Results – Term statistics
19 papers
3610 retained terms (including ~400 bigrams)
Distance matrix (19x19)
Apply MDS
Apply clustering

Results – MDS
[Figure: MDS map of the 19 papers, with labelled regions: Policy, Mathematical approaches, Webometrics]

Results – Clustering
Hierarchical clustering: Ward method, cut-off k=4
Optimal parameters? 'Stability-based method'
Quantified correspondence with expert assignments? 'Rand index'
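As a rough illustration of Ward clustering with a cut-off at k clusters, the sketch below merges clusters using the Lance-Williams update on squared distances. It is a hypothetical pure-Python version (`ward_clusters` is not a name from the slides); it assumes the input is a symmetric distance matrix, and applying Ward's criterion to non-Euclidean distances such as 1 minus cosine is a common but approximate practice.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns one cluster label per item.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # The Lance-Williams form of Ward's method operates on squared distances.
    d = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        # Merge the pair of clusters with the smallest Ward distance.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        na, nb = len(members[a]), len(members[b])
        new = next_id
        next_id += 1
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            total = na + nb + nc
            # New ids are always larger than old ones, so (c, new) is ordered.
            d[(c, new)] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / total
        members[new] = members.pop(a) + members.pop(b)
        d = {p: val for p, val in d.items() if a not in p and b not in p}
    labels = [0] * n
    for label, items in enumerate(members.values()):
        for i in items:
            labels[i] = label
    return labels

# Toy example (invented data): four points on a line, two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
print(ward_clusters(toy_dist, 2))
```

In the setting of the slides, `dist` would be the 19x19 document distance matrix and `k` would be 4; the stability-based choice of k is not covered by this sketch.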

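Results – Peer evaluation
Cross-tabulation of clusters against expert classes I, II, III, IV, V (cell counts not preserved in this transcript); labelled clusters: Policy, Mathematical approaches, Webometrics
Rand index = …; p-value (with respect to permuted data) < …; significant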
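The Rand index counts the share of item pairs on which two partitions agree (both place the pair together, or both place it apart). A minimal sketch with a permutation-based p-value follows; the function names, the number of permutations and the fixed seed are assumptions for illustration, and the exact permutation scheme used in the study is not given in the slides.

```python
import random
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def permutation_p_value(labels_a, labels_b, n_perm=1000, seed=0):
    """Share of random relabelings scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = rand_index(labels_a, labels_b)
    shuffled = list(labels_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rand_index(labels_a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Expert classes follow the section table (4, 4, 4, 4 and 3 papers).
expert = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 3
print(rand_index(expert, expert))
```

The cluster assignments of the 19 papers are not recoverable from the transcript, so the usage line only compares the expert partition with itself.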

Results – Reference age Histograms per paper

Results – Reference age Histograms aggregated by expert class

Results – Ref Age vs. % Serial Scatter plot of Expert classes: Mean Reference Age vs. Percentage of Serials
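The two indicators in the scatter plot can be computed per paper roughly as below. This sketch assumes that reference age is the citing paper's publication year minus the cited item's year and that each reference carries a flag saying whether it is a serial; the function name and the sample data are invented for illustration.

```python
def reference_profile(citing_year, references):
    """Return (mean reference age, percentage of serials) for one paper.

    references: list of (cited_year, is_serial) tuples.
    """
    ages = [citing_year - year for year, _ in references]
    mean_age = sum(ages) / len(ages)
    n_serial = sum(1 for _, is_serial in references if is_serial)
    pct_serial = 100.0 * n_serial / len(references)
    return mean_age, pct_serial

# Invented reference list, not data from the study.
refs = [(2000, True), (1998, True), (2002, False), (1990, True)]
print(reference_profile(2004, refs))
```

Averaging these per-paper values within each expert class would give one point per class, as in the plot described above.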

Results – Term extraction
Calculation of seminal keywords for each article
Using the TF-IDF weighting scheme
Each document vector normalized to norm 1 to account for document length
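A minimal sketch of this kind of keyword extraction: TF-IDF weights, each document vector scaled to unit length, then the highest-weighted terms kept. The variant used here (raw term frequency times the natural log of N/df, Euclidean norm) is one common choice and an assumption; the slides do not state which TF-IDF variant or norm was used, and the function names and toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf_unit(docs):
    """docs: list of token lists. Returns one unit-length {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Scaling to norm 1 removes the effect of document length.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def top_terms(vec, k=10):
    """The k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

toy_docs = [
    ["citation", "citation", "age"],
    ["citation", "web", "link"],
    ["web", "age", "model"],
]
for vec in tfidf_unit(toy_docs):
    print(top_terms(vec, k=2))
```

Terms that occur in every document receive weight zero under this variant, which is why a term like `citation` can rank low in a collection where most papers mention it.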

Author(s): Persson et al.
Title: Inflationary bibliometric values: the role of scientific collaboration and the need for relative indicators in evaluative studies
Terms: co_author, collabor*, domest*, self_citat*, explan*, Growth, reference_list, intern*_collabor*, reference_behaviour, inflationari

Author(s): Glänzel
Title: Towards a model for diachronous and synchronous citation analyses
Terms: diachronous_prospect, synchronous, synchronous_retrospect, age, diachronous_prospect, technic*_reliabl*, citat*_process, life_time, impact_measur*, random_select*

Author(s): Moed and Garfield
Title: In basic science the percentage of 'authoritative' references decreases as bibliographies become shorter
Terms: research_field, authorit*_docum*, authorit*, docum*, referenc*, percent_most, refer*_list, refer*, frequent*_cite, persuasion

Author(s): Shelton and Holdridge
Title: The US-EU race for leadership of science and technology: Qualitative and quantitative indicators
Terms: EU, WTEC, panel, output_indic*, NAS, leadership, world, input, row, panelist

Author(s): Tang and Thelwall
Class: IV
Terms: department, intern*_inlink, gTLD, public_impact, disciplin*, psychologi, command, region, histori, disciplinari_differ*

Results – Full-text vs Abstract
Is a full-text analysis warranted for term extraction? For mapping purposes?

Results – Full-text vs Abstract
Using abstracts only: less structure, and less overlap with expert classes (Rand index = …; p-value = …; not significant)
Full-text is an interesting source for additional keywords and improved mapping

Conclusion
- The keyword approach may be naïve, but applied in a systematic framework in combination with the 'right' algorithms, it provides interesting clues
- Complementary to bibliometric approaches
- Weak indications of benefits from using full-text articles
- Future: extension of this pilot to larger samples

References
Bibliometrics: homepage of Wolfgang Glänzel
Bibliometrics: homepage of Olle Persson
Text & data mining: PhD thesis of Patrick Glenisson, ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf
Optimal k in clustering: stability method