Organizing current awareness in a large volunteer-based digital library Thomas Krichel
outline Background to work that we did –RePEc (Research Papers in Economics) –NEP: New Economics Papers The research –Theory –Method –Results Other work done for NEP.
This talk has three parts Some background Two papers –chablis paper, with Nisa Bakkalbasi (Yale) – shibuya paper
RePEc Digital library for academic Economics. It collects descriptions of –economics documents (working papers, articles etc) –collections of those documents –economists –collections of economists
RePEc principle Many archives –Archives offer metadata about digital objects, or data about authors and institutions. One database Many services –Users can access the data through many interfaces. –Providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.
it's the incentives, stupid RePEc applies the ideas of open source to the construction of a bibliographic dataset. It provides an open library. The entire system is constructed in such a way as to be sustainable without monetary exchange between participants.
some history Thomas Krichel in the early 1990s dreamed about a current awareness service for working papers. It would later have electronic papers. In 1993 he made the first economics working paper available online. In 1997 he wrote the key protocols that “govern” RePEc.
RePEc is based on 550+ archives WoPEc EconWPA DEGREE S-WoPEc NBER CEPR Elsevier US Fed in Print IMF OECD MIT University of Surrey CO PAH Blackwell
to form a 362+k item dataset –171,000 working papers –187,000 journal articles –1,300 software components –2,100 book and chapter listings –9,000 author contact & publication listings –9,300 institutional contact listings more records than
RePEc is used in many services EconPapers NEP: New Economics Papers Inomics RePEc author service Z39.50 service by the DEGREE partners IDEAS RuPEc EDIRC LogEc CitEc
NEP: New Economics Papers This is a set of current awareness reports on new additions to the working paper stock only. Journal articles would be too old. Founded by Thomas Krichel in Supported by the Economics department at WUStL. Initial software was written by Jose Manuel Barrueco Cruz. First general editor was John S. Irons.
why NEP Public aim: Current awareness, if well done, can be an important service in its own right. It is sheltered from the competition of general search engines. Private aim: It is useful to have some classification information, even if limited. –for performance measures –for general research purposes
modus operandi: stage 1 The general editor uses a computer program that gathers all the new additions to the working paper stock. This is usually done weekly. S/he filters out new descriptions of old papers –date field –handle heuristics The result is an issue of the nep-all report.
modus operandi: stage 2 Editors consider the papers in the nep-all report and pick out the papers that belong to their subject. This forms an issue of a subject report nep-???. nep-all and the subject reports are circulated via . A special arrangement makes the data of NEP available to other RePEc services.
some numbers There are now 60+ NEP lists. Over 39k subscriptions. Over 16k subscribers. Over 50k papers announced. Over 100k announcements. Homepage at All this is a fantastic success!!
problem with the private aim All papers would have to be classified, not only the working papers. We would need to have 100% coverage of NEP. This means every paper in nep-all appears in at least one subject report.
coverage ratio We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report. We can define this ratio –for each nep-all issue –for a subset of nep-all issues –for NEP as a whole
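The definition above can be sketched in a few lines of Python. This is a minimal illustration only: the paper handles and the report names nep-mac and nep-fin are made up for the example, and the function name is hypothetical.

```python
def coverage_ratio(nep_all, subject_reports):
    """Share of nep-all papers announced in at least one subject report."""
    announced = set()
    for papers in subject_reports.values():
        announced.update(papers)
    covered = [paper for paper in nep_all if paper in announced]
    return len(covered) / len(nep_all)

# Hypothetical nep-all issue and two hypothetical subject reports.
issue = ["p1", "p2", "p3", "p4", "p5"]
reports = {"nep-mac": ["p1", "p3"], "nep-fin": ["p3", "p4"]}
ratio = coverage_ratio(issue, reports)  # p1, p3 and p4 are covered
```

The same function works for a subset of issues or for NEP as a whole if the lists of papers are concatenated first.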
coverage ratio theory & evidence Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase. However, the evidence, from research by Barrueco Cruz, Krichel and Trinidad is –The coverage ratio of different nep-all issues varies a great deal. –Overall, it remains at around 70%. We need some theory as to why. This is where the chablis paper comes in.
two theories Target-size theory Quality theory –descriptive quality –substantive quality
theory 1: target size theory When editors compose a report issue, they have a size of the issue in mind. If the nep-all issue is large, editors will take a narrow interpretation of the report subject. If the nep-all issue is small, editors will take a wide interpretation of the report subject.
target size theory & static coverage There are two things going on –The opening of new subject reports improves the coverage ratio. –The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that the coverage ratio deteriorates. The static coverage ratio that we observe is the result of both effects canceling out.
theory 2: quality theory George W. Bush version of quality theory –Some papers are rubbish. They will not get announced. –The amount of rubbish in RePEc remains constant. –This implies constant coverage. Reality is slightly more subtle.
two versions of quality theory Descriptive quality theory: papers that are badly described –misleading titles –no abstract –languages other than English Substantive quality theory: papers that are well described, but not good –from unknown authors –issued by institutions with unenviable research reputation
practical importance We do care whether one or the other theory is true. –Target size theory implies that NEP should open more reports to achieve perfect coverage. –Quality theory suggests that opening more reports will have little to no impact on coverage. Since operating more reports is costly, there should be an optimal number of reports.
overall model We need an overall model that explains subject editors' behavior. We can feed this model with variables that represent theoretical determinants of behavior. We can then assess the strength of various factors empirically.
method The dependent variable is announced. It is one if the paper has been announced, 0 otherwise. Since we are explaining a binary variable, we can use binary logistic regression analysis (BLRA). This is a fairly flexible technique, useful when the probability distributions governing the independent variables are not well known. That's why BLRA is popular in the life sciences.
independent variables: size size is the size of the nep-all issue in which the paper appeared. This is the critical indicator of target size theory. We expect it to have a negative impact on announced.
independent variables: position position is the position of the paper in the nep-all issue. The presence of this variable can be justified by the combined assumption of target size and editor myopia. If editors are myopic, they will be more liberal at the start of nep-all than at the end of nep-all.
independent variables: title title is the length of the title of the paper, measured by the number of characters. This variable is motivated by descriptive quality theory. A longer title will say more about the paper than a short title. This makes it less likely that a paper is overlooked.
independent variables: abstract abstract is the presence/absence of an abstract to the paper. This is also motivated by descriptive quality theory. Note that we do not use the length of the abstract because that would be a highly skewed variable.
independent variables: language language indicates whether the metadata is in English or not. This variable is motivated by descriptive quality theory and the idea that English is the most commonly understood language. While there are a lot of multilingual editors, customizing this variable would have been rather hard.
independent variables: series series is the size of the series in which a paper appears. This variable is motivated by substantive quality theory. The larger a series is, the higher, usually, is its reputation. We can roughly qualify by size and quality –multi-institution series (NBER, CEPR) –large departments –small departments
independent variables: author author is the prolificacy of the authors of the paper. It is justified by substantive quality theory. This is the most difficult variable to measure. We use the number of papers written by the registered author with the highest number. Since about 50% of the papers have no registered author, a lot of them are excluded. But there should be no bias by the exclusion.
create categorical variables
size_1 [179, 326)
size_2 [326, 835]
title_1 [55, 77)
title_2 [77, 1945]
position_1 [0.357, 0.704)
position_2 [0.704, 1.000]
series_1 [98, 231)
series_2 [231, 3654]
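A minimal sketch of how such interval indicators could be built from a raw value. The helper name is hypothetical, and it assumes that values below the lower cut point (for example a nep-all issue with fewer than 179 papers) form the reference category where both indicators are 0.

```python
def dummies(value, cut_1, cut_2):
    """Return (var_1, var_2) for the intervals [cut_1, cut_2) and [cut_2, max].

    Assumption: values below cut_1 are the reference category (0, 0).
    """
    var_1 = int(cut_1 <= value < cut_2)
    var_2 = int(value >= cut_2)
    return var_1, var_2

# A nep-all issue with 400 papers, using the size cut points from the slide.
size_1, size_2 = dummies(400, 179, 326)
```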
results P(announced=1 | x) = exp(g(x)) / (1 + exp(g(x))), where g(x) = b_0 + b_1*size_1 + b_2*size_2 + b_3*title_1 + b_4*title_2 + b_5*abstract + b_6*author + b_7*language + b_8*series_1 + b_9*series_2. position is not significant. author just makes the cut.
odds ratio
size_1    1.32  [1.22, 1.44]
size_2    0.83  [0.76, 0.90]
title_1   1.16  [1.07, 1.26]
title_2   1.28  [1.18, 1.39]
abstract  1.47  [1.34, 1.62]
language  2.15  [1.85, 2.51]
series_1  1.11  [1.02, 1.20]
series_2  1.37  [1.26, 1.49]
author    1.05  [1.01, 1.09]
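To read these numbers: in a logistic regression the odds ratio of a variable is the exponential of its coefficient, so odds ratios multiply when several indicators are switched on. The sketch below only illustrates that arithmetic with the values from the slide. The function name is hypothetical, and since no intercept is available here, no probabilities are computed.

```python
import math

# Odds ratios of the indicator variables, copied from the slide.
odds_ratios = {
    "size_1": 1.32, "size_2": 0.83,
    "title_1": 1.16, "title_2": 1.28,
    "abstract": 1.47, "language": 2.15,
    "series_1": 1.11, "series_2": 1.37,
}

# The coefficient of each variable in g(x) is the log of its odds ratio.
coefficients = {name: math.log(value) for name, value in odds_ratios.items()}

def odds_multiplier(active):
    """Factor by which the odds of being announced change, relative to the
    reference paper, when the indicators named in `active` equal 1."""
    return math.exp(sum(coefficients[name] for name in active))

# A paper with an abstract and English metadata versus one with neither.
m = odds_multiplier(["abstract", "language"])
```

An odds ratio below 1, as for size_2, corresponds to a negative coefficient: papers in very large nep-all issues are less likely to be announced.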
scandal! Substantive quality theory cannot be rejected. That means that the editors are selecting for quality as well as for the subject. The editors have rejected our findings. Almost all protest that there is no quality filtering. This is where the chablis paper ends.
consequences There has been no program to expand the lists. There has to be a concerted effort to help editors really find the subject-specific papers. This can be done by –the use of a more efficient interface –the use of automated resource discovery methods.
ernad editing reports on new academic documents. It is a purpose-built software system for current awareness reports. It has been designed by Thomas Krichel. The design is complicated, but the system is quite easy to use. The system was written by Roman D. Shapiro.
statistical learning The idea is that a computer may be able to make decisions on the current nep-all reports based on the observation of earlier editorial decisions. ernad now works using support vector machines (SVM), with titles, abstracts, author names, classification values and series as features.
SVM performance If we use average search length, we can do performance evaluations. It turns out that reports have very different forecastability. Some are almost perfect, others are weak. Again, this raises a few eyebrows!
what is the value of an editor? If the forecast is perfect, we don't need the editor. If the forecast is very weak the editor may be a prankster.
pre-sorting reconceived We should not think of pre-sorting via SVM as something to replace the editor. We should not think of it as encouraging editors to be lazy. Instead, we should think of it as an invitation to examine some papers more closely than others.
headline vs. bottomline data The editors really have a three-stage decision process. –They read title, author names. –They read the abstract. –They read the full text. A lot of papers fail at the first hurdle. SVM can read the abstract and prioritize papers for abstract reading. Editors are happy with the pre-sorting system.
performance evaluation This is really where the shibuya paper starts. How should the success or failure of a sorting algorithm be quantified? Classic information retrieval suggests precision and recall.
precision and recall precision is the number of retrieved and relevant documents divided by the number of retrieved documents. recall is the number of retrieved and relevant documents divided by the number of relevant documents. Both numbers are used together but recall is often difficult to measure.
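The two definitions can be written down directly. A minimal sketch with made-up document identifiers; the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved and relevant
    return hits / len(retrieved), hits / len(relevant)

# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["a", "b", "c", "d"], ["b", "d", "e"])
```

Computing recall requires the full set of relevant documents, which is exactly what is hard to know in a large collection.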
precision and recall problem Precision and recall really apply to "large" IR problems, where the set of documents is too large to be examined "by hand". Users only see the set of retrieved papers. Here we have a "small" information retrieval problem.
P&R interpretation 1 We can argue that when we sort nep-all, recall is always constant at 100%. Precision is the number of relevant papers in the issue, divided by the size of nep-all. This does not depend on the sorting process.
P&R interpretation 2 We can look at the precision achieved at the last retrieved paper. This is a measure that is equivalent to one measure I will present later, that essentially looks at how low the last paper has fallen. But recall is still useless.
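A small sketch contrasting interpretations 1 and 2 on two orderings of the same issue. An outcome is written as a list of 0/1 flags in sorted order, 1 meaning relevant; the function names are hypothetical.

```python
def precision_whole_issue(x):
    """Interpretation 1: relevant papers divided by the size of the issue."""
    return sum(x) / len(x)

def precision_at_last_relevant(x):
    """Interpretation 2: precision when cutting off at the last relevant paper."""
    last = max(i for i, rel in enumerate(x, start=1) if rel)
    return sum(x) / last

good = [1, 1, 0, 0, 0]  # both relevant papers sorted to the top
poor = [0, 0, 1, 0, 1]  # last relevant paper at the very bottom
```

Interpretation 1 gives 0.4 for both orderings, so it cannot tell them apart. Interpretation 2 gives 1.0 for the first and 0.4 for the second, because it depends only on the position of the last relevant paper.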
P&R interpretation 3 We could reduce the vector coming out of the sorting process to a set. We can then compare –the set of predicted useful documents –the set of actually used documents But this would mean deliberately throwing away information. And under this criterion different orders, which should differ widely in value for editors, can get the same evaluation.
we need some different theory! We will look at some simple theory of editor behavior. This theory is a bit like an economic theory in the sense that it has been made under ridiculously simplifying assumptions. The hope is that the theory sheds light on basic features of the problem that remain operational under more realistic assumptions.
key assumption 1: binary decision An editor faces a list of documents. Each document describes a working paper that has been added to RePEc recently. The editor examines the document. An editor may spend a varying amount of effort examining a document. This would be a very complex decision to model. We assume it away. Thus we assume a document is examined or not.
key assumption 2: no learning The decision whether a document is relevant or not is assumed to depend only on the contents of that document. It is assumed not to depend on the contents of any other document. This assumes away learning.
introducing cost-based reasoning Editors face an optimal stopping problem. There are two types of costs that editors are facing. –the cost of examining a new paper c_1. We can safely assume that c_1 is constant. –the cost associated with losing papers c_2. It will depend on the number of papers lost. If c_2>0, it will be unknown.
c_1 and c_2 c_1 and c_2 seem to dictate editor behavior –If c_1 >> c_2 the editor will not examine any documents. –If c_1 << c_2 the editor will examine all documents. Let us assume that the editor is conscientious. That is, c_1 and c_2 are such that, while there is a chance that there are some more relevant documents left, the editor will continue to examine the list.
the traffic light We still have a complicated problem. Only a totally unrealistic assumption can save us. Basically, let us assume that there is no uncertainty about c_2. This is the traffic light assumption: –A traffic light shows green as long as there are more relevant documents to be discovered. –The traffic light shows red
conscientious editor & traffic light Under the traffic light scenario the conscientious editor will examine papers until the light shows red. Therefore –c_2=0 –examination cost is c_1 i*, where i* is the position of the last relevant paper in x.
what have we learned? When presented with a series of outcomes, the editor will prefer the one where the last relevant document appears earliest in the list, that is, where i* is smallest. This defines a weak ordering over all outcomes.
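A minimal sketch of the cost under the traffic light assumption, taking the examination cost to be c_1 per paper examined. An outcome is a list of 0/1 flags in sorted order; the function name and the value of c_1 are hypothetical.

```python
def examination_cost(x, c_1):
    """Cost for a conscientious editor who stops when the light turns red.

    The editor examines every paper up to the last relevant one, so the
    cost is c_1 times the position of that paper (0 if nothing is relevant).
    """
    last = max((i for i, rel in enumerate(x, start=1) if rel), default=0)
    return c_1 * last

cost = examination_cost([0, 1, 0, 1, 0], c_1=2.0)  # last relevant paper at 4
```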
relaxing the traffic light Assume that there could be some uncertainty about the traffic light at the end of the examination process. Assume that it is so small that the behavior of the editor would be unchanged. Contrast –ranking A: …1010…0 –ranking B:…0110…0 Then A should be preferred over B.
the natural order Repeating the previous argument, we can find a full ordering over all outcomes that a rational and conscientious editor will have. I am sure the optimality of that order could be confirmed for more general scenarios. But that is a matter of conviction.
notation Consider a nep-all report that has n papers. r of the papers are relevant. x is an outcome vector. –x_i=0 if the paper at position i is not relevant. –x_i=1 if the paper at position i is relevant.
natural order when n=5, r= (read column first)
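The order can be enumerated by a short program. This is a sketch under one reading of the argument: outcomes are compared by the position of the last relevant paper, ties are broken by the next-to-last relevant paper, and so on. The choice r=2 is for illustration only, and the function names are hypothetical.

```python
from itertools import combinations

def natural_key(x):
    """Positions of the relevant papers, last one first. Smaller is better."""
    return sorted((i for i, rel in enumerate(x, start=1) if rel), reverse=True)

def natural_order(n, r):
    """All outcomes with n papers and r relevant ones, best outcome first."""
    outcomes = []
    for positions in combinations(range(1, n + 1), r):
        outcomes.append(tuple(1 if i in positions else 0 for i in range(1, n + 1)))
    return sorted(outcomes, key=natural_key)

order = natural_order(5, 2)  # r=2 is an illustrative choice
for outcome in order:
    print("".join(str(flag) for flag in outcome))
```

The list starts with 11000 and ends with 00011. Ranking A (…1010…0) comes before ranking B (…0110…0), as required.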
measuring success Let f(x) be a measure of the goodness of an outcome. It appears natural to require [A] f(x) > f(x') if x is better than x' [B] f(1,…,1,0,…,0) = 1 [C] E f(x) = 0, where E is the expected value operator over the entire set of outcomes [D] respect for the natural order. [C] calls for a closed form of the expected value.
Brookes & Swets measures Brookes and Swets measure on z, the internal ranking variable. They measure the true discriminating value of z. It is difficult to build a measure by transformation that satisfies [B] and [C]. It will not satisfy [D].
the average search length This is the average position of a relevant document, divided by n. This can be transformed to satisfy [B] and [C]. The problem remains that it does not satisfy [D]. Using a simple change such as taking the logarithm of the position does not help.
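A sketch of the average search length and of one possible linear transformation that satisfies [B] and [C]. The transformation is an assumption made for this example, not necessarily the one used in the paper. It relies on the fact that under a random order the expected position of a relevant paper is (n+1)/2, and it assumes 0 < r < n.

```python
from itertools import combinations

def average_search_length(x):
    """Average position of a relevant paper, divided by n."""
    n = len(x)
    positions = [i for i, rel in enumerate(x, start=1) if rel]
    return sum(positions) / len(positions) / n

def normalized_asl(x):
    """Linear rescaling: 1 for the best outcome, 0 on average (assumes 0 < r < n)."""
    n, r = len(x), sum(x)
    best = (r + 1) / (2 * n)      # relevant papers occupy positions 1..r
    expected = (n + 1) / (2 * n)  # mean over all equally likely outcomes
    return (expected - average_search_length(x)) / (expected - best)

# Check [C] by brute force for n=5, r=2 (illustrative sizes).
all_outcomes = [[1 if i in pos else 0 for i in range(5)]
                for pos in combinations(range(5), 2)]
mean_value = sum(normalized_asl(x) for x in all_outcomes) / len(all_outcomes)
```

The failure of [D] is easy to see here: 10010 and 01100 get the same value, although the last relevant paper sits at position 4 in the first and at position 3 in the second.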
Cooper's expected search length This (roughly) is the number of non-relevant documents found until a target number of relevant documents has been found. This can be transformed to satisfy [B] and [C]. It can weakly impose [D]. But all outcomes where the same document is at the last position are considered equivalent. This is a problem.
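A simplified sketch of the search length for a strictly ranked list, following the rough description above. It leaves out the treatment of ties that gives Cooper's measure its "expected" part, and the function name is hypothetical.

```python
def search_length(x, target):
    """Non-relevant papers examined before `target` relevant ones are found."""
    found = 0
    skipped = 0
    for rel in x:
        if rel:
            found += 1
            if found == target:
                return skipped
        else:
            skipped += 1
    raise ValueError("fewer than `target` relevant papers in x")

a = search_length([1, 0, 1, 0, 0], target=2)
b = search_length([0, 1, 1, 0, 0], target=2)
```

With the target set to the number of relevant papers, both outcomes get a search length of 1, because the value depends only on where the last relevant paper sits. This is the equivalence that the slide calls a problem.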
natural order implementation I One way is to use powers. Construct a penalty x_n*y**n + x_(n-1)*y**(n-1) + … + x_1*y**1 where y>1. It is possible to find the expected value of this expression and construct a measure that satisfies [B], [C], and [D]. Exact values depend on y.
natural order implementation II Another way is to count the items in the natural order, starting at zero say. Finding the expected value is trivial in this case. But we need an algorithm that quickly finds the position of an outcome in the order. Such an algorithm is described in the paper.
test We extract author names, titles, abstracts, series id, and classification codes. We do a straight feature count, then normalize by the Euclidean norm. We set aside 300 observations for testing and use the rest for learning. We use SVM_light. We conduct 100 tests per report.
results Cooper's measure does worse than the linear measures such as the average search length. The direct imposition measures show very high values many times. This is the case when they have been able to lift the last observation, say, into the first half.
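One way to find the position without enumerating the order is the combinatorial number system, sketched below. This is an assumption for illustration and not necessarily the algorithm described in the paper. It takes the natural order to compare the last relevant position first, then the next-to-last, and so on. The rescaling in rank_measure is likewise only one possible choice and assumes 0 < r < n.

```python
from math import comb

def natural_rank(x):
    """Zero-based position of outcome x in the natural order.

    With relevant positions p_1 < ... < p_r (counted from 1), the rank is
    the sum over k of C(p_k - 1, k).
    """
    positions = [i for i, rel in enumerate(x, start=1) if rel]
    return sum(comb(p - 1, k) for k, p in enumerate(positions, start=1))

def rank_measure(x):
    """1 for the best outcome, 0 on average, -1 for the worst (0 < r < n)."""
    n, r = len(x), sum(x)
    mean_rank = (comb(n, r) - 1) / 2  # ranks 0..C(n,r)-1 are equally likely
    return 1 - natural_rank(x) / mean_rank

best = rank_measure([1, 1, 0, 0, 0])
worst = rank_measure([0, 0, 0, 1, 1])
```

The rank needs only r binomial coefficients, so it stays cheap even when the number of possible outcomes is very large.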
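The feature preparation can be sketched as follows. This is a minimal illustration assuming whitespace tokenization of a single text field; the actual feature extraction and the SVM_light input format are not shown, and the function name is hypothetical.

```python
import math
from collections import Counter

def feature_vector(tokens):
    """Straight feature counts, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # no features at all
    return {token: c / norm for token, c in counts.items()}

# Hypothetical title, split on whitespace.
vec = feature_vector("monetary policy and monetary union".split())
```

After normalization every paper has a vector of length 1, so long abstracts do not dominate short ones merely because they contain more words.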
conclusion Since –Cooper's measure and the direct imposition measure essentially measure the same order, –Cooper's measure gives relatively low values, –direct imposition measures give high values I conclude that a linear combination of Cooper's measure and direct imposition measure II seems the way forward to measure performance.
to do list Answer the question: Why did I ever get into this rather convoluted topic? But now that we have a criterion, we can see if we can improve by other methods –bigrams and RePEc keyword values –different SVM settings –different algorithms
Thank you for your attention!