Bibliometrics and preference modelling Thierry Marchant Ghent University
Some academic rankings
Top 5% Authors, as of April 2008 Average Rank Score
Outline Why rank ? Which attributes? Some popular rankings. How can we motivate a ranking ? The axiomatic approach. Comparing peers and apples
Why rank ?
Why rank universities ? To choose one for studying (bachelor student). To attract good students (good university). To obtain subsidies (good university). To allocate subsidies (government). To allocate students to various universities according to their exam scores (government)....
Why rank departments ? To choose one for studying (doctoral student). To attract good students (good department). To obtain subsidies (good department). To allocate subsidies (government). To allocate students to various departments according to their exam scores (government)....
Why rank scientists ? To determine the salary (university). To award a scientific distinction (scientific society). To hire a new scientist (university). To choose a thesis director (student). To evaluate a department or university (...). To evaluate a journal (...). To allocate subsidies (government)....
Why rank journals ? To choose one for publishing (scientist). To maximize the dissemination of one’s results. To maximize one’s value. To evaluate a scientist (...). To evaluate a department (...). To evaluate a university (...). To improve one’s image (good publisher)....
Why rank articles ? To select articles (scientist). To evaluate a scientist (...). To evaluate a department (...). To evaluate a university (...). To evaluate a journal (...)....
Focus in this talk Rankings of scientists Rankings of departments Rankings of universities Rankings of journals Rankings of articles
Which attributes ?
Many relevant attributes
Quality –Evaluation by peers –Quality of the journals –Citations (#, authors, journals, +/-) –Coauthors –Patents –Awards –Budget
Quantity –Number of papers –Number of books –Number of pages –Coauthors (#) –Number of patents –Citations (#) –Awards –Budget –Number of thesis students
Various –Age –Career length –Country –Nationality –Discipline –Century –University
Bibliometric attributes
Quality –Evaluation by peers –Quality of the journals –Citations (#, authors, journals, +/-) –Coauthors –Patents –Awards –Budget
Quantity –Number of papers –Number of books –Number of pages –Coauthors –Number of patents –Citations (#) –Awards –Budget –Number of thesis students
Various –Age –Career length –Country –Nationality –Discipline –Century –University
Bibliometric attributes Why use bibliometric attributes ? Cheap Objective ? Reliable ?
Some popular rankings of scientists
Some popular rankings
–Number of publications
–Total number of citations
–Maximal number of citations
–Number of publications with at least a given number of citations
–Average number of citations
–The same ones weighted by number of authors, number of pages or impact factor
–The same ones corrected for age
–h-index, g-index, hc-index, hI-index, R-index, A-index, …
The h-index Published in 2005 by physicist J. E. Hirsch. 462 citations in March 2009 (1267 in May 2013). Adopted by Web of Science (ISI, Thomson). The h-index of a scientist is the largest natural number x such that at least x of his/her papers have at least x citations each. [Figure: example of a scientist with h-index = 6]
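The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `h_index` and the citation counts in `example` are assumptions made for the example, not data from the talk.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the first `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts, chosen for illustration only.
example = [25, 18, 12, 9, 7, 6, 3, 1, 0]
print(h_index(example))  # 6
```

Six papers have at least 6 citations each, but there are not seven papers with at least 7 citations, so the index is 6.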
How to justify a ranking ? THE true and universal ranking does not exist.
How to justify a ranking ? Example with two departments: one has 50 scientists and 2000 citations, the other has 3 scientists and 180 citations.
How to justify a ranking ? THE true and universal ranking does not exist. If one knows the true ranking, one may compute some correlation between the true one and another one.
How to justify a ranking ? Example of a comparison with a reference ranking: “Assessing the Accuracy of the h- and g-Indexes for Measuring Researchers’ Productivity”, Journal of the American Society for Information Science and Technology, 64(6):1224–1234. “The analysis quantifies the shifts in ranks that occur when researchers’ productivity rankings by simple indicators such as the h- or g-indexes are compared with those by more accurate FSS.”
How to justify a ranking ? Assume a law linking the numbers of papers and citations to the quality of the scientist (an unobserved variable) and his/her age. This law may be probabilistic. Then derive an estimate of the quality of a scientist from his/her data (papers and citations).
How to justify a ranking ? Analyze the mathematical properties of rankings.
Characterization of scoring rules
Definitions
Set of journals: J = {j, k, l, …}.
Paper: a paper in journal j with x citations and a coauthors is represented by the triplet (j, x, a).
Scientist: mapping f from J × N × N to N. The number f(j, x, a) represents the number of publications of scientist f in journal j with x citations and a coauthors.
Set of scientists: set X of all mappings f from J × N × N to N such that Σ_{j ∈ J} Σ_{x ∈ N} Σ_{a ∈ N} f(j, x, a) is finite.
Bibliometric ranking: weak order ≥ on X (complete and transitive relation).
Scoring rules
Scoring rule: a bibliometric ranking is a scoring rule if there exists a real-valued mapping u defined on J × N × N such that f ≥ g iff Σ_j Σ_x Σ_a f(j, x, a) u(j, x, a) ≥ Σ_j Σ_x Σ_a g(j, x, a) u(j, x, a).
Examples:
–u(j, x, a) = 1: number of papers
–u(j, x, a) = x: number of citations
–u(j, x, a) = x/(a+1): number of citations weighted by the number of authors
–u(j, x, a) = IF(j): number of papers weighted by impact factor
–…
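A minimal sketch of scoring rules in Python, assuming a scientist is stored as a `Counter` that maps each triplet (j, x, a) to a number of papers. The journal names and the two scientists `f` and `g` are hypothetical, and the impact-factor example is left out because it would need a table of IF(j) values.

```python
from collections import Counter


def score(scientist, u):
    """Sum of f(j, x, a) * u(j, x, a) over all triplets (j, x, a)."""
    return sum(n * u(j, x, a) for (j, x, a), n in scientist.items())


def u_papers(j, x, a):
    return 1  # every paper counts for one


def u_citations(j, x, a):
    return x  # every paper counts for its citations


def u_shared(j, x, a):
    return x / (a + 1)  # citations divided by the number of authors


# Hypothetical scientists: (journal, citations, coauthors) -> number of papers.
f = Counter({("J1", 10, 1): 2, ("J2", 4, 0): 1})
g = Counter({("J1", 3, 0): 5})

for name, u in [("papers", u_papers), ("citations", u_citations), ("shared", u_shared)]:
    print(name, score(f, u), score(g, u))
```

With these made-up data, g is ahead of f on the number of papers (5 vs 3) and on shared citations (15 vs 14), while f is ahead on total citations (24 vs 15): the choice of u determines the ranking.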
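Axioms Independence: for all f, g in X, all j in J, all x, a in N, we have f ≥ g iff f + 1_{j,x,a} ≥ g + 1_{j,x,a}, where 1_{j,x,a} denotes the scientist with exactly one paper, published in journal j with x citations and a coauthors.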
Axioms Independence, illustration: if f > g and both scientists add one paper in journal j with x citations and a coauthors, then the ranking is preserved, i.e. f + 1_{j,x,a} > g + 1_{j,x,a}.
Axioms Archimedeanness: for all f, g, h, e in X with f > g, there is a natural number n such that e + nf ≥ h + ng.
Axioms Archimedeanness, illustration: suppose e < h, f consists of 10 papers with 20 citations and g of 1 paper with 1 citation. Adding f repeatedly to e and g repeatedly to h eventually yields e + nf ≥ h + ng.
Axioms
Independence: for all f, g in X, all j in J, all x, a in N, we have f ≥ g iff f + 1_{j,x,a} ≥ g + 1_{j,x,a}. Not satisfied by the maximal number of citations or by the h-index. With the h-index, the ranking of two scientists can be reversed when the same 2 papers are added to both.
Archimedeanness: for all f, g, h, e in X with f > g, there is a natural number n such that e + nf ≥ h + ng. Not satisfied by the maximal number of citations, the h-index or a lexicographic ranking.
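The reversal with the h-index can be checked on a small example. The sketch below is an illustration only: the citation lists `f`, `g` and `extra` are hypothetical, and journals and coauthors are ignored because the h-index does not use them.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # In a descending list, "citations >= rank" holds exactly for the first h papers.
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


# Hypothetical scientists, described by the citation counts of their papers.
f = [3, 3, 3]
g = [5, 5]
extra = [5, 5]  # the same two papers added to both scientists

print(h_index(f), h_index(g))                  # 3 2 -> f is ranked above g
print(h_index(f + extra), h_index(g + extra))  # 3 4 -> g is now ranked above f
```

Independence would require the ranking of f and g to survive the addition of identical papers, so this pair is a counterexample for the h-index.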
Result Theorem: a bibliometric ranking satisfies Independence and Archimedeanness iff it is a scoring rule. Furthermore u is unique up to multiplication by a positive constant (adding a constant to u would add a multiple of the number of papers to every score and could change the ranking). Proof: (X, +, ≥) is an extensive measurement structure as in [Luce, 2000]. (X, +) is a cancellative (f + g = f + h ⇒ g = h) monoid. It can be extended to a group (X’, +) by the Grothendieck construction. (X’, +, ≥) is an Abelian and Archimedean linearly ordered group. It is isomorphic to a subgroup of the ordered group of real numbers (Hölder).
Special case: u(j, x, a) = x/(a+1).
Transfer: for all j in J, all x, y, a in N, we have 1_{j,x,a} + 1_{j,y+1,a} ~ 1_{j,x+1,a} + 1_{j,y,a} (u affine in # citations).
Condition Zero: for all j in J, all a in N, there is f in X such that f + 1_{j,0,a} ~ f (u linear in # citations).
Journals Do Not Matter: for all j, j’ in J, all a, x in N, 1_{j,x,a} ~ 1_{j’,x,a} (u independent of journal).
No Reward for Association: for all j in J, all m, x in N with m > 1, 1_{j,x,0} ~ m·1_{j,x,m-1} (u inversely proportional to # authors).
Characterization of conjugate scoring rules for scientists and departments
Introduction Consider two departments each consisting of two scientists. The scientists in department A both have 4 papers, each one cited 4 times. The scientists in department B both have 3 papers, each one cited 6 times. Both scientists in department A have an h-index of 4 and are therefore better than both scientists in department B, with an h-index of 3. Yet, department A has an h-index of 4 and is therefore worse than department B with an h-index of 6. Hence, the “best” department contains the “worst” scientists.
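The numbers in this example can be verified with a short script. It is a sketch under one assumption that the slide leaves implicit: the h-index of a department is computed on the pooled papers of its members.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


def department_h_index(department):
    """h-index of the pooled papers of all scientists (assumed definition)."""
    pooled = [cites for scientist in department for cites in scientist]
    return h_index(pooled)


dept_a = [[4] * 4, [4] * 4]  # two scientists, 4 papers each, 4 citations per paper
dept_b = [[6] * 3, [6] * 3]  # two scientists, 3 papers each, 6 citations per paper

print([h_index(s) for s in dept_a], department_h_index(dept_a))  # [4, 4] 4
print([h_index(s) for s in dept_b], department_h_index(dept_b))  # [3, 3] 6
```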
Definitions
Scientist: mapping f from N to N. The number f(x) represents the number of publications of scientist f with x citations.
Set of scientists: set X of all mappings f from N to N such that Σ_{x ∈ N} f(x) is finite.
Ranking of scientists: weak order ≥_s on X.
Department: vector of scientists. The set of all departments is denoted by Y.
Ranking of departments: weak order ≥_d on Y.
Scoring rules
Scoring rule: a ranking of scientists is a scoring rule if there exists a real-valued mapping u defined on N such that f ≥_s g iff Σ_x f(x) u(x) ≥ Σ_x g(x) u(x).
Scoring rule: a ranking of departments is a scoring rule if there exists a real-valued mapping v defined on N such that (f_1, …, f_k) ≥_d (g_1, …, g_l) iff Σ_i Σ_x f_i(x) v(x) ≥ Σ_j Σ_x g_j(x) v(x).
Conjugate scoring rules: ≥_s and ≥_d are conjugate scoring rules if u = v.
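A minimal sketch of conjugate scoring rules, assuming u(x) = v(x) = x (total number of citations) and representing a scientist as a `Counter` from citation counts to numbers of papers. The two departments are those of the introductory example; the function names are illustrative.

```python
from collections import Counter


def scientist_score(f, u):
    """Sum over x of f(x) * u(x)."""
    return sum(count * u(x) for x, count in f.items())


def department_score(department, v):
    """Sum over the members f_i and over x of f_i(x) * v(x)."""
    return sum(scientist_score(f, v) for f in department)


def u(x):
    return x  # assumed common valuation: a paper is worth its citations


a = Counter({4: 4})  # 4 papers with 4 citations each
b = Counter({6: 3})  # 3 papers with 6 citations each
dept_a, dept_b = [a, a], [b, b]

print(scientist_score(a, u), scientist_score(b, u))        # 16 18
print(department_score(dept_a, u), department_score(dept_b, u))  # 32 36
```

Because both levels use the same valuation, the department with the better scientists is also the better department, unlike with the h-index.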
Axioms
Consistency: if f_i ≥_s g_i for i = 1, …, k, then (f_1, …, f_k) ≥_d (g_1, …, g_k). In addition, if f_i >_s g_i for some i, then (f_1, …, f_k) >_d (g_1, …, g_k).
Totality: if (f_1, …, f_k) and (g_1, …, g_l) are such that Σ_i f_i = Σ_j g_j, then (f_1, …, f_k) ~_d (g_1, …, g_l).
Dummy: (f_1, …, f_k) ~_d (f_1, …, f_k, 0).
Result Theorem: ≥_s and ≥_d satisfy Consistency, Totality, Dummy and Archimedeanness of ≥_s iff they are conjugate scoring rules. Furthermore u is unique up to multiplication by a positive constant.
Discussion
Axiomatic analysis of more rankings is needed. Axiomatic analysis of indices is different but also relevant. Consistency is important (e.g. h-index for scientists and IF for journals).
Literature Scientometrics Journal of Informetrics Journal of the American Society for Information Science and Technology
Comparing peers and apples
Comparing scientists of different ages [Figure: h-indices of two scientists; values not recoverable]
Comparing scientists of different ages
Instead of the h-index, use an index that is independent of time, for instance the average number of citations per paper, i.e. Σ_{x ∈ N} x f(x) / Σ_{x ∈ N} f(x). Problem: suppose f has one paper with 50 citations and g has 10 papers with 40 citations. Then f has the higher average although g has published ten times as many papers.
Divide the h-index by the length of the career. Problem: the h-index is not a linear function of time.
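The problem with the average can be made concrete. The sketch below assumes that each of g's 10 papers has 40 citations (one reading of the example) and stores a scientist as a `Counter` from citation counts to numbers of papers.

```python
from collections import Counter


def average_citations(f):
    """Sum of x * f(x) divided by the sum of f(x)."""
    papers = sum(f.values())
    if papers == 0:
        raise ValueError("average is undefined for a scientist without papers")
    return sum(x * count for x, count in f.items()) / papers


f = Counter({50: 1})   # one paper with 50 citations
g = Counter({40: 10})  # assumption: 10 papers with 40 citations each

print(average_citations(f), average_citations(g))  # 50.0 40.0
```

The average ranks f above g even though g has ten highly cited papers and f only one.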
Comparing across disciplines
The average number of citations per paper is 80 times larger in medicine than in mathematics. Any comparison of scientists across disciplines using an index based on citations is therefore flawed.
Field normalization: for a given index, compute the distribution of the index in each field (medicine, physics, economics, mathematics, literature, …). Then define the normalized index of a scientist as his/her percentile in that distribution.
Problem: the definition of a field is arbitrary. The average number of citations per paper is 20 times larger in physics than in mathematics, but only 2-3 times larger in theoretical physics.
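Field normalization can be sketched as a percentile computation. Everything specific here is an assumption made for illustration: the index values per field are invented, and the percentile is taken as the share of scientists in the field whose index does not exceed the given value.

```python
from bisect import bisect_right


def percentile(value, field_values):
    """Percentage of scientists in the field with an index <= value."""
    ranked = sorted(field_values)
    return 100.0 * bisect_right(ranked, value) / len(ranked)


# Hypothetical index values (e.g. h-indices) observed in two fields.
fields = {
    "mathematics": [1, 2, 3, 5, 8],
    "medicine": [10, 20, 40, 60, 90],
}

print(percentile(5, fields["mathematics"]))  # 80.0
print(percentile(40, fields["medicine"]))    # 60.0
```

After normalization the mathematician with a raw index of 5 is ranked above the medical researcher with a raw index of 40. The result still depends on how the fields are delimited, which is the problem noted above.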
Source field normalization Papers in medicine are often cited. This implies that they have long reference lists. Papers in mathematics have short reference lists. Instead of defining disciplines or fields, use the length of the reference list to normalize. Thus, divide the number of citations received by a paper by the length of the reference list.
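A minimal sketch of this normalization, assuming that "the reference list" is the paper's own reference list, as the wording of the slide suggests. The two papers are hypothetical.

```python
def source_normalized(citations, reference_list_length):
    """Citations received by a paper divided by the length of its reference list."""
    if reference_list_length <= 0:
        raise ValueError("the reference list must contain at least one entry")
    return citations / reference_list_length


# Hypothetical papers: (citations received, length of the reference list).
medicine_paper = (60, 60)
mathematics_paper = (12, 10)

print(source_normalized(*medicine_paper))     # 1.0
print(source_normalized(*mathematics_paper))  # 1.2
```

No field has to be defined: the citing habits of the paper's environment are read off its reference list.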
Distributions
Lotka’s law Proportion of scientists with n papers: F(n) = C/n^a, with a ≃ 2 and C depending on the field.
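The shape of the law can be explored numerically. In the sketch below the parameter values C = 0.6 and a = 2 are illustrative assumptions, not estimates for any particular field.

```python
def lotka(n, C, a):
    """Proportion of scientists with n papers under F(n) = C / n**a."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return C / n ** a


C, a = 0.6, 2.0  # illustrative values only

for n in (1, 2, 3, 10):
    # The ratio to F(1) is 1 / n**a and does not depend on C.
    print(n, lotka(n, C, a), lotka(n, C, a) / lotka(1, C, a))
```

With an exponent of 2, scientists with 2 papers are a quarter as numerous as scientists with 1 paper, and scientists with 10 papers a hundredth as numerous.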
Non-universal power law Peterson, Pressé and Dill, Proceedings of the National Academy of Sciences, 107.
Direct citations: the probability that a new paper will randomly cite paper A is P_direct = 1/N, with N the total number of published papers.
Indirect citations: the author of the new paper may first find a paper B and learn of paper A via B’s reference list. P_indirect = k/(Nn), with k the number of existing citations to A and n the average length of the reference list.
Non universal power law (ctd) Fraction of the N papers with k citations :