Text Mining: Opportunities and Barriers John McNaught Deputy Director National Centre for Text Mining


1 Text Mining: Opportunities and Barriers
John McNaught
Deputy Director, National Centre for Text Mining
John.McNaught@manchester.ac.uk

2 Topics
What is text mining? (briefly)
What can it offer? (selectively)
What are the obstacles? (mostly)

3 NaCTeM
First publicly funded (JISC) national text mining centre in the world
Remit: provide services to the research community
Initial focus on biology, then social sciences, medicine, chemistry, …
Processing on a large scale, e.g. for UKPMC (Wellcome Trust + 17 other funders)
www.nactem.ac.uk

4 What is text mining?
Goal: Discover new knowledge from old
How:
  – Process very large amounts of text
    Millions of documents, the more the better
  – Identify and extract information
  – (Link extracted information to already curated knowledge)
  – Mine to discover implicit significant associations
  – Flag (unknown) associations for researcher to investigate further
  – Spin-off on the way: render information explicit
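The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NaCTeM's method: the documents, the term dictionary (`TERMS`), the curated set (`CURATED`) and all function names are hypothetical placeholders, and simple document-level co-occurrence counting stands in for real information extraction and association mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; a real system would process millions of documents.
DOCUMENTS = [
    "GENE_A is linked to DISEASE_X in several cohorts.",
    "Variants of GENE_A raise DISEASE_X risk.",
    "GENE_B is frequently altered in DISEASE_X.",
    "GENE_B changes were observed alongside GENE_A changes.",
]

# Hypothetical dictionary standing in for a real named-entity recogniser.
TERMS = {"gene_a": "gene", "gene_b": "gene", "disease_x": "disease"}

# Hypothetical curated knowledge: associations that are already known.
CURATED = {frozenset({"gene_a", "disease_x"})}


def extract_terms(text):
    """Identify and extract known terms by case-insensitive substring match."""
    lowered = text.lower()
    return {term for term in TERMS if term in lowered}


def mine_associations(documents, min_count=1):
    """Count how many documents mention each pair of terms together."""
    counts = Counter()
    for doc in documents:
        for pair in combinations(sorted(extract_terms(doc)), 2):
            counts[frozenset(pair)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


def flag_unknown(associations, curated):
    """Keep only associations absent from curated knowledge."""
    return {pair: n for pair, n in associations.items() if pair not in curated}


associations = mine_associations(DOCUMENTS)
flagged = flag_unknown(associations, CURATED)
for pair, n in flagged.items():
    print(sorted(pair), n)
```

The flagged pairs are candidates for a researcher to investigate, not findings; co-occurrence alone says nothing about whether an association is positive, negative or spurious.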

5 From text to new knowledge

6 What does it offer?
Finds unsuspected knowledge
  – E.g. disease-gene associations
Enables discoveries human effort could not achieve (information overload/overlook)
Enables better search/navigation of literature
  – Semantic search via extracted semantic metadata
Reduces time spent searching
  – 15-48% of researcher time spent on classic search, 20-50% of classic searches unsatisfied
    E.g. systematic reviews: months to weeks

7 What does it offer?
Text mining boosts research
  – Makes research possible that would otherwise be impossible or unfeasible
Research drives growth and innovation
Research produces more information
More information is available for text mining
Text mining boosts research …

8 Barriers
Access to the literature
Format issues (tied to next point…)
  – “PDF is evil” (Lynch)
Main blocks: copyright and licensing issues
  – <8% of scientific claims found in full article appear in its abstract (Blake)
  – Abstracts deficient on argumentation, discussion, methods, background, …
  – Full texts needed to realise full benefits of TM

9 Barriers
Need to copy documents to analyse them
Licences typically not favourable to TM
Licences established on a per-institution basis
  – Prevents community-oriented services
    Results only for internal use by institutional users
  – Hinders mining over collections of content from different providers
Inconsistency: a human can search and manually analyse, but cannot use a machine to do the same job on the same data already subscribed to

10 Barriers
Problem even with liberal OA licences
  – Author attribution required
Author attribution in a data mining environment is impossible/unfeasible
  – Association finding: cannot track positive, negative, neutral individual author contributions
Derived works in a TM environment
  – Every author of every text processed to produce new derived knowledge may have a claim…
  – Rights clearance thus an effective barrier

11 Barriers
Laudable effort 1: NESLi2 model licence (JISC Collections) allows TM
  – Publisher <> single institution
  – But how many publishers retain TM provisions?
  – But cannot display annotations produced by TM on document itself
Laudable effort 2: NPG licence for self-archived content allows TM
  – But “content must be destroyed when experiment complete” is vague. So services for community?

12 Conclusion
Copyright and licensing restrictions block full realisation of TM benefits
  – Economic savings and potential for growth are stifled
Japan has introduced an information analysis exception to copyright law
  – National Diet Library (Japan's counterpart to the British Library) has recently changed its motto to: “Through knowledge we prosper”
  – Can we say the same in the UK?

13 Extras

14 Info = degree of surprise
Finding unknown associations: reproducing a discovery reported 5 days ago in Nature Medicine
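One common way to make "info = degree of surprise" concrete is Shannon self-information, where an event with probability p carries -log2(p) bits. Whether the slide has exactly this measure in mind is an assumption; the function name below is hypothetical and the sketch only shows that rarer events score higher.

```python
import math


def surprisal_bits(probability):
    """Self-information in bits: the lower the probability, the higher the score."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# A rarely seen association is more "surprising" than a commonplace one.
print(surprisal_bits(0.5))
print(surprisal_bits(0.125))
```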

15 UKPMC EvidenceFinder by NaCTeM: Questions generated by deep analysis, with known answers

16 Click on a question to see relevant extracted evidence (from OA subset of the archive)

