Slide 1: Writeslike.us
Em Tonkin (e.tonkin@ukoln.ac.uk), Andrew Hewson (a.hewson@ukoln.ac.uk)

Slide 2: Background
Relevant research themes:
- Metadata harvesting and reuse
- Automatic metadata extraction
- Text analysis
- Social network analysis
- Scholarly communication, particularly informal communication

Slide 3: Aim
Helping people to find each other:
- Finding other researchers with interests similar to your own, in your geographic area or in your area of research
- Not everybody with similar interests attends the same conferences!
- Helping students find potential research supervisors
- Encouraging serendipity

Slide 4: Relevant technologies
In fact there are an awful lot of these.
Social network analysis:
- Generally requires a very large dataset
- Solvable either by (a) being Facebook or similar (but adoption rates are far from 100%), or (b) automated analysis of relevant data
- Solution (b) is cheap, simple, and very fallible
- Not a new approach: it is at the core of bibliometrics

Slide 5: Data extraction

Slide 6: Relevant technical problems
Author identity disambiguation:
- Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which)
- Needs to be solved to an acceptable level; how good 'acceptable' is also needs defining
- Formal solutions usually depend on unique identifiers plus registries
- Cheap, moderately effective solution: disambiguate via textual characteristics plus metadata
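The cheap disambiguation route mentioned above can be sketched very simply: reduce each raw author string to a crude (surname, initial) key, then group records whose keys match so that shared metadata (co-authors, source repository) can arbitrate between identities. This is a minimal illustration, not the project's actual algorithm; the function names are my own.

```python
from collections import defaultdict

def name_key(raw):
    """Reduce a raw author string to a (surname, first-initial) key.

    Handles 'Smith, John', 'John Smith' and 'Smith, J.' variants,
    which is the kind of inconsistent formatting OAI records exhibit.
    """
    raw = raw.strip()
    if "," in raw:
        surname, rest = [p.strip() for p in raw.split(",", 1)]
    else:
        parts = raw.split()
        surname, rest = parts[-1], " ".join(parts[:-1])
    initial = rest[:1].upper() if rest else ""
    return (surname.lower(), initial)

def cluster_authors(records):
    """Group (author, metadata) pairs by key; identical keys plus
    overlapping metadata suggest (but do not prove) one identity."""
    clusters = defaultdict(list)
    for author, meta in records:
        clusters[name_key(author)].append((author, meta))
    return dict(clusters)
```

A key collision is only a candidate match; as the slide says, this approach is moderately effective rather than exact, and textual characteristics of the associated documents would be needed to split genuinely distinct 'J. Smith's.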

Slide 7: Methodology
Harvest OAI metadata, capturing a large list of:
- Author names (somewhat randomly formatted)
- Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
- Citations (sometimes)
Spider digital objects and analyse them for formal metadata: retrieve email addresses, etc.
Retain the OAI source: a useful clue to author affiliations (sometimes)
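The harvesting step above amounts to parsing OAI-PMH ListRecords responses carrying oai_dc payloads, where every Dublin Core field is optional (hence all the 'sometimes'). A minimal stdlib sketch of that parse, assuming standard OAI-PMH 2.0 and oai_dc namespaces:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_listrecords(xml_text):
    """Extract creators, titles and identifiers from an OAI-PMH
    ListRecords response; missing elements simply yield empty lists,
    matching the 'sometimes' nature of the fields."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(OAI + "record"):
        md = rec.find(OAI + "metadata")
        if md is None:  # deleted or metadata-less record
            continue
        def dc_values(tag):
            return [e.text for e in md.iter(DC + tag) if e.text]
        out.append({
            "creators": dc_values("creator"),
            "titles": dc_values("title"),
            "identifiers": dc_values("identifier"),
        })
    return out
```

A real harvester would also loop over resumptionTokens and fetch each page over HTTP; that plumbing is omitted here.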

Slide 8: Links from OAI records

Slide 9: Links from OAI records (2)
- Just under half of the pages retrieved by crawling links in DC records contained one or more accessible documents
- Around 15% of linked pages resolved to journal endpoints ('paywalls'); these sometimes contain additional useful metadata about the document, though it is not necessarily appropriate to harvest it. However, the copyright ownership is in itself a useful data point
- Around 40% of institutional repository links were found to contain no accessible data

Slide 10: Links from OAI records (3)
- 240,000 records were harvested
- Of the 62,000 records containing an actionable http dc:identifier, 35,000 contained a handle.net (15,500) or dx.doi.org (20,000) actionable persistent identifier
- DOIs and handles appear to have a similar prevalence in UK institutional repositories
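The handle/DOI split reported above comes down to classifying each actionable dc:identifier URL by hostname. A minimal sketch of that classification (the function names are mine, not the project's):

```python
from collections import Counter
from urllib.parse import urlparse

def classify_identifier(url):
    """Label an actionable http dc:identifier as a handle, a DOI,
    or some other link, by its hostname."""
    host = urlparse(url).netloc.lower()
    if host.endswith("handle.net"):
        return "handle"
    if host in ("dx.doi.org", "doi.org"):
        return "doi"
    return "other"

def tally(urls):
    """Count identifier types across a batch of dc:identifier URLs."""
    return Counter(classify_identifier(u) for u in urls)
```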

Slide 11: Methodology (II)
- Analyse text for noun-phrase-like structures: a useful clue to theme
- Background information required, such as institution names and the domains/URLs associated with each institution
  - Retrieved by harvesting from Wikipedia; much of this information is not well structured, so it is unavailable via DBpedia
- Poorly structured information needs filtering: for example, author names are not consistently structured between repositories (a machine learning problem)
- Search with a contextual network graph algorithm
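A crude approximation of the noun-phrase-like extraction mentioned above can be had without a POS tagger: split the text on stopwords and punctuation and keep maximal runs of content words as candidate phrases. This is only an illustrative sketch, assuming a hand-picked stopword list; the slides do not say which chunking method the project actually used.

```python
import re

# Small illustrative stopword list; a real system would use a fuller one.
STOP = {"the", "a", "an", "of", "and", "or", "in", "for", "to", "is",
        "are", "with", "on", "by", "from", "we", "this", "that", "it"}

def candidate_phrases(text, min_len=2):
    """Keep maximal runs of two or more non-stopword tokens as
    candidate noun-phrase-like structures (theme clues)."""
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", text.lower())
    phrases, run = [], []
    for tok in tokens:
        if tok in STOP:
            if len(run) >= min_len:
                phrases.append(" ".join(run))
            run = []
        else:
            run.append(tok)
    if len(run) >= min_len:
        phrases.append(" ".join(run))
    return phrases
```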

Slide 12: Contextual network graph algorithm
Like spilling a little ink on one node of the graph: it spreads a predefined distance through the graph of relations between authors, objects, roughly calculated identities, classifications and other metadata, in a manner defined by the way the implementation is tuned. The result is a ranked list of matching nodes and their types, which can then be presented to the user.
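The ink-spilling metaphor above describes spreading activation: energy poured onto a start node flows outward along edges, losing a fixed fraction at each hop (the tuning the slide mentions), until it falls below a threshold. A minimal sketch, with decay and threshold values chosen purely for illustration:

```python
from collections import defaultdict

def spread(graph, start, energy=1.0, decay=0.5, threshold=0.01):
    """Spreading activation over an adjacency-list graph.

    Energy placed on `start` spreads to neighbours, attenuated by
    `decay` and split across edges, until contributions drop below
    `threshold`. Returns node -> accumulated activation, which can
    be sorted to give the ranked list of matching nodes."""
    activation = defaultdict(float)
    frontier = [(start, energy)]
    while frontier:
        node, e = frontier.pop()
        if e < threshold:
            continue  # the 'predefined distance': energy has run out
        activation[node] += e
        nbrs = graph.get(node, [])
        if not nbrs:
            continue
        share = e * decay / len(nbrs)
        for nbr in nbrs:
            frontier.append((nbr, share))
    return dict(activation)
```

Sorting the returned dictionary by activation yields the ranked result list; in the real system the nodes would be typed (author, object, identity, classification), so each result carries its type.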

Slide 13: 'Sometimes' and 'usually'
Statistics are:
- Cheap
- Imperfect
- Available
Rapid innovation philosophy:
- Cheap is good
- Simple is good
- Solutions requiring novel/additional uptake of infrastructure are out of reach

Slide 14: Results
- The basic concept worked well
- Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
- Interface development actually required more time than dataset development, and exceeded the project length...
- But the resulting dataset can be released as linked data and reused for various purposes

Slide 15: Results (2)
A random sample of authors shows that authors with few publications have little visibility in the formal indexes. Low-quality publication, or early-career researcher?

Slide 16: Caveat (emptor?)
- Collecting data has legal implications
- Displaying data has legal implications, especially when the site is presented as able to perform specific functions, such as "analysing research impact"
- Realistic solution, a disclaimer: "[Nobody] makes any warranty whatsoever that the operation of the Site will be [...] error-free; that defects will be corrected; [...] as to the results that may be obtained from [...] the Site; or as to the accuracy, completeness, reliability, availability, suitability, quality, non-infringement or operation of any Content, product or service provided on or accessible from [...] the Site."

Slide 17: Future work
- Exploring the legal issues
- Alternative uses of data
- Targeted interface development
- Integration of additional tools/search methods

Slide 18: Walkthrough: basic search (the harder method!)

Slide 19: Advanced search

Slide 24: Walkthrough

Slide 25: Conclusion
- OAI-DC (and Wikipedia!) is a good source of 'semi-structured' data
- There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social-network-like graphs
- Applying this type of data to encourage informal academic communication and collaboration is an interesting research field with many potential applications

