ETD 2005 International Accesses to a Digital Library of ETDs
Ana Pavani, Departamento de Engenharia Elétrica, Pontifícia Universidade Católica do Rio de Janeiro
Presentation outline
- Profile of the digital library
- Generation of data
- Combination and analysis of data – interesting results
- Next steps
Profile of the digital library
- Beginning of the collection – 2nd semester of 1995
- Items to start the collection – courseware (texts, exercises, technical manuals, tests, etc.)
The digital library is part of a system that:
- Is an LMS (Learning Management System)
- Has administrative functions that allow data exchange with the university’s administrative system
- Is linked in both directions to CNPq’s Lattes Platform (a curricula database with more than 595,000 CVs)
- Allows the control of series collections
- Is multilingual and has interfaces in 3 languages
Evolution of the collection:
- Administrative documents
- Preprints, published papers & online articles
- Interactive courseware
- ETDs (2000)
- Online journals (2003)
- Senior projects (2003)
- Online bulletins – distributed through mailing lists, archived and published automatically (2004)
- Books (Oct. 2005)
Numbers of titles in the collection:
- Courseware (many types) – 2,700+
- Administrative documents – 33
- Technical documents – 94
- ETDs – 1,873 (PUC-Rio) + 31 (UNICAP)
- Preprints, published papers & online articles – 280
- Senior projects – 305
- Online journals – 3 (+ 1 in Oct in Dec. 2005)
- Online bulletins – 2
- Books – 1 (to be published in Oct. 2005)
Total number of digital objects (DOs): 16,400+
Technological characteristics:
- Machine – IBM RS/6000
- Operating system – IBM AIX
- Web server – Apache
- DBMS – IBM DB2
The Apache log contains information on accesses to ALL digital contents on the system, as well as every other transaction that users perform (clicking buttons, reading posts, reading help pages, etc.). Data on transactions with contents must therefore be extracted from the server log to generate the numbers to be analyzed.
Generation of data
The data are of two kinds: production and accesses.
Production data come from functions of the system that are not related to the Apache server but only to the DB.
Example: [production data table]
(*) PUC-Rio started requiring ETDs in Aug. 2002; (*) UNICAP does not require ETDs.
Access data are obtained from both the Apache Server log and the DB:
- Logs are mined (according to the following definitions) and the results are stored on the DB
- Mined data are combined with production data (metadata) already in the database (types of contents, authors, programs, areas of knowledge, dates, countries, etc.) to yield results
Definitions for mining the log
When access statistics came under discussion, it was necessary to define how data should be mined from the log and how they should be combined afterwards.
The definitions follow: (M) mining definitions and (C) combining definitions.
(M) Visits and complete visits
An ETD can have one or many digital objects. The number of visits to an ETD is the sum of all accesses to all of its digital objects in a given month. A complete visit is a set of visits to all of its digital objects from one country in a given month.
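The two counts can be sketched as follows. This is only an illustration: the data layout and function names are hypothetical, and reading the number of complete visits as the smallest access count among all digital objects of the ETD (per country) is an assumption, not a rule stated in the presentation.

```python
def visits(accesses):
    """Visits to one ETD in one month: sum of accesses to all its objects.

    accesses maps (object_id, country) -> access count.
    """
    return sum(accesses.values())


def complete_visits(accesses, object_ids):
    """Assumed reading: per country, the smallest access count found
    among ALL digital objects of the ETD (0 if any object was never accessed)."""
    countries = {country for (_, country) in accesses}
    return {
        c: min(accesses.get((obj, c), 0) for obj in object_ids)
        for c in countries
    }


# Hypothetical ETD made of three digital objects, one month of accesses
objects = ["ch1.pdf", "ch2.pdf", "ch3.pdf"]
month_accesses = {
    ("ch1.pdf", "BR"): 5, ("ch2.pdf", "BR"): 3, ("ch3.pdf", "BR"): 4,
    ("ch1.pdf", "PT"): 2, ("ch2.pdf", "PT"): 1,
}

total = visits(month_accesses)
complete = complete_visits(month_accesses, objects)
```

Under this reading, Portugal has visits but no complete visit, because the third object was never accessed from there.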
(M) Country vs. IP address
The decision to use the country and not the IP address to establish a visit was based on the fact that the visits to an ETD can be made at different times (and reconnecting may assign a new IP address) and from different locations (with fixed IP addresses).
(M) Counting visits from the same IP address
Visits from the same IP address are counted individually, because a network with many machines can be identified by the single IP address of its firewall.
(M) Counting visits to restricted digital objects
Some ETDs are totally or partially restricted: approximately 30% have some type of permanent or temporary restriction. Metadata, abstracts included, are publicly available for all of them. It was decided that attempts followed by denials of access would be counted as accesses. This is stated in the help pages of the system, which suggest that authors consider making their contents public if many attempts occur.
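A minimal sketch of this rule, assuming that the server answers a denied request with HTTP status 401 or 403. The presentation does not say how the system signals a denial, so the status codes, paths and names below are illustrative only.

```python
from collections import Counter

# Assumption: successful requests return 200; denied ones return 401 or 403.
COUNTED_STATUSES = {200, 401, 403}


def counts_as_access(status):
    """Denied attempts are counted just like successful accesses."""
    return status in COUNTED_STATUSES


# Hypothetical (path, status) pairs taken from a log
requests = [
    ("/etd/1001_1.pdf", 200),   # public object, served
    ("/etd/2002_1.pdf", 403),   # restricted object, denied
    ("/etd/2002_1.pdf", 401),   # restricted object, denied
    ("/etd/3003_1.pdf", 404),   # not found, ignored
]

accesses = Counter(path for path, status in requests if counts_as_access(status))
```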
(C) Lines to mine
Since the interest was in accesses to digital objects, the decision was to take the log lines whose requests have the extensions .dcr, .doc, .htm, .pdf, etc. All extensions that exist in the database are considered, as long as the corresponding item is cataloged in the digital library, so that an occasional static HTML system page is not counted.
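The selection rule can be sketched as a two-part test: the extension must be a known one and the item must be cataloged. The extension set below contains only the four extensions named on the slide, and `CATALOG` is a hypothetical stand-in for a lookup in the digital library database.

```python
import os

# Only the extensions named on the slide; the real list comes from the database.
KNOWN_EXTENSIONS = {".dcr", ".doc", ".htm", ".pdf"}

# Hypothetical stand-in for the catalog of digital objects
CATALOG = {"/docs/etd_1001_1.pdf", "/docs/etd_1001_2.pdf"}


def should_mine(path):
    """Keep a request only if it points to a cataloged digital object."""
    path = path.split("?", 1)[0]                 # drop any query string
    extension = os.path.splitext(path)[1].lower()
    return extension in KNOWN_EXTENSIONS and path in CATALOG


mined = [p for p in ["/docs/etd_1001_1.pdf", "/help/index.htm", "/img/logo.gif"]
         if should_mine(p)]
```

The static help page has a valid extension but is not cataloged, so it is left out.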
Observations
(1) Statistics were planned on a monthly basis. The model treats data as sequences of points at discrete time intervals of one month. Data for past months are unchanged and the current month is updated according to the Update definition.
(2) IP addresses are resolved to countries using a plug-in called GeoIP Free that is available with AWStats.
(C) Information to get from a log line
The month and the year are extracted, along with the identification of the digital object and the country of the IP address that accessed the digital object.
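A sketch of this extraction for a line in the Apache Common Log Format. The actual log format used by the system is not given in the presentation, so the pattern is an assumption; `country_of` is a hypothetical stand-in for the GeoIP lookup, and the sample address comes from a range reserved for documentation.

```python
import re
from datetime import datetime

# Assumes the Apache Common Log Format:
# host ident user [time] "METHOD path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def country_of(ip):
    """Hypothetical stand-in for the GeoIP lookup."""
    return {"192.0.2.10": "BR"}.get(ip, "unknown")


def extract(line):
    """Return (year, month, digital object path, country) or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # %b expects English month abbreviations (default C locale)
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return when.year, when.month, match.group("path"), country_of(match.group("ip"))


line = ('192.0.2.10 - - [28/Sep/2005:10:15:02 -0300] '
        '"GET /docs/etd_1001_1.pdf HTTP/1.1" 200 48213')
record = extract(line)
```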
(C) Update of the DB
The log is read every hour, on the hour (00:00, 01:00, etc.); only the lines added since the previous reading are mined. Accesses are summed for each month-year-DO-country combination, so the table is not very big: in the first 6 months of 2005 the average number of table rows per month was 10,000.
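The incremental update can be sketched as follows. An in-memory list stands in for the log file and a `Counter` for the database table; the class name and the sample records are hypothetical.

```python
from collections import Counter


class AccessTable:
    """Running totals keyed by (year, month, digital object, country)."""

    def __init__(self):
        self.counts = Counter()
        self.offset = 0          # number of log records already mined

    def update(self, log_records):
        """Mine only the records added since the previous run."""
        new_records = log_records[self.offset:]
        self.counts.update(new_records)      # adds 1 per record to its key
        self.offset = len(log_records)
        return len(new_records)


table = AccessTable()
log = [
    (2005, 9, "etd_1001_1.pdf", "BR"),
    (2005, 9, "etd_1001_1.pdf", "BR"),
    (2005, 9, "etd_1001_1.pdf", "US"),
]
first_run = table.update(log)                  # mines 3 records

log.append((2005, 9, "etd_1001_1.pdf", "BR"))  # one more line an hour later
second_run = table.update(log)                 # mines only the new record
```

Because repeated accesses collapse into one row per key, four log lines here produce only two table rows.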
(C) When to start computing
The Apache Server log started being saved on Jun 01, so either this date was used or a later one, for example Jan 01. The decision was to use all available monthly logs. When the process started, some days of offline processing were required. Afterwards, updating became automatic according to the Update definition.
Observations
(1) Maybe these were not the best definitions – we are willing to discuss alternatives!!
(2) The (original) logs are stored and saved offline in case some change in the mining strategy is decided (we have not sunk the ships!!).
Definitions for computing statistics
- By author
- Visited ETDs by year, month and country
- Visited ETDs by country, month and year
- 25 most visited ETDs (on the system = PUC-Rio + UNICAP)
- 20 most visited ETDs by institution
- 10 most visited ETDs by graduate program
- Visited ETDs by institution, program, year and month
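The "most visited" rankings in this list can be sketched as a sum per ETD followed by a top-N selection. The row layout, identifiers and visit counts below are hypothetical; in the system these values would come from the mined table combined with the metadata.

```python
from collections import Counter


def most_visited(rows, n, institution=None):
    """Return the n ETDs with the most visits, optionally for one institution.

    rows holds (etd_id, institution, visits) tuples, one per month or country.
    """
    totals = Counter()
    for etd_id, inst, visits in rows:
        if institution is None or inst == institution:
            totals[etd_id] += visits
    return totals.most_common(n)


# Hypothetical rows
rows = [
    ("etd_A", "PUC-Rio", 120),
    ("etd_B", "PUC-Rio", 300),
    ("etd_C", "UNICAP", 80),
    ("etd_A", "PUC-Rio", 50),
]

top_system = most_visited(rows, 2)             # whole system
top_unicap = most_visited(rows, 20, "UNICAP")  # one institution
```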
Initial Results
Access to ETDs is increasing (Sep 28, 2005)
# ETDs May/Sep – 13%; # accesses May/Sep – 54.6%
Number of total visits is increasing (Sep 28, 2005)
Accumulated average of total visits is increasing (Sep 28, 2005)
But… (Sep 28, 2005)
- Brazil accounts for 55% of the accesses since Jun 01, 2004
- Brazil + Portuguese-speaking + Spanish-speaking countries = 75%
- Brazil + US + Portuguese-speaking + Spanish-speaking countries = 87%
On Jun 15, 2007 the numbers of ETDs in Iberian languages on the NDLTD DB were as listed in this table. Brazilian ETDs were 83% of all ETDs in Iberian languages (total number 13,369).

Institution                 Country             Language(s)                      Number
National Library            Portugal            Portuguese                          185
IBICT (includes PUC-Rio)    Brazil              Portuguese                       11,118
UAB                         Spain (Catalunya)   Catalan or English or Spanish     1,011
UIB                         Spain (Catalunya)   Catalan or English or Spanish        22
UJI                         Spain (Catalunya)   Catalan or English or Spanish        42
UOC                         Spain (Catalunya)   Catalan                               1
UPC                         Spain (Catalunya)   Catalan or English or Spanish       415
UPF                         Spain (Catalunya)   Catalan or English or Spanish        67
URL                         Spain (Catalunya)   Spanish                               1
URV                         Spain (Catalunya)   Catalan or English or Spanish       106
UdG                         Spain (Catalunya)   Catalan or English or Spanish       131
UdL                         Spain (Catalunya)   Catalan or English or Spanish        70
UV                          Spain (Catalunya)   Catalan or English or Spanish       200
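The total and the Brazilian share quoted for this slide can be checked directly from the per-institution numbers. The short sketch below uses only those figures; the variable names are illustrative.

```python
# (institution, country, number of ETDs) as given on the slide
etds = [
    ("National Library", "Portugal", 185),
    ("IBICT", "Brazil", 11118),
    ("UAB", "Spain", 1011), ("UIB", "Spain", 22), ("UJI", "Spain", 42),
    ("UOC", "Spain", 1), ("UPC", "Spain", 415), ("UPF", "Spain", 67),
    ("URL", "Spain", 1), ("URV", "Spain", 106), ("UdG", "Spain", 131),
    ("UdL", "Spain", 70), ("UV", "Spain", 200),
]

total = sum(number for _, _, number in etds)
brazil = sum(number for _, country, number in etds if country == "Brazil")
brazil_share = 100 * brazil / total   # percentage of all listed ETDs

print(f"total = {total}, Brazil = {brazil_share:.1f}%")
```

The sum reproduces the total of 13,369, and the Brazilian share rounds to the 83% stated on the slide.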
Percentage of visits from Brazil is decreasing (Sep 28, 2005)
Accumulated percentage averages of visits from Brazil (Sep 28, 2005)
Total accesses top 10 countries (Sep 28, 2005)
# identified countries; unidentified countries + satellite access host

Country      Visits
Brazil       12,845
USA           2,795
Portugal      1,489
Spain           679
Peru            652
Mexico          432
Chile           364
France          245
Colombia        225
Argentina       224
Some interesting results
- Some ETDs are permanent ‘best sellers’
  - They are on specific subjects (examples: a specific philosopher and history of modern architecture in Brazil)
  - They are linked from sites on the subjects (examples: the first from the US & Brazil and the second from Germany)
  - They are accessed from different countries
- Some topics are permanent ‘best sellers’ (example: energy)
- Some ETDs are temporary ‘best sellers’ – this seems to happen when they are displayed by the ‘last published ETDs’ functions (system and graduate program)
- Some graduate programs are permanent ‘best sellers’
  - They research topics that are very specific to the country (examples: education and history of culture)
  - They are indexed in other sites and/or digital libraries (example: Universia in Spain for social sciences and humanities)
  - They are accessed from different countries
The 25 most visited ETDs have a large number of visits: no average is lower than 100 visits per month.
Next steps
- Find out how readers got to ETDs (BDTD, NDLTD, SCIRUS, etc.) – an online survey is planned
- Interview faculty to check if some ETDs are recommended reading in courses
- Gather more data and analyze in a ‘more scientific’ manner (must find a student!!)
- Develop additional functions comparing accesses with production
- Extend to other digital contents (at the moment only ETDs and online journals have access statistics)
Thank you! Muito obrigada!