Natural Language Processing for LODLAM Presented at IGeLU 2014 by Corey A Harper 2014-09-16 A brief intro to machine learning & data science for Libraries.


1 Natural Language Processing for LODLAM Presented at IGeLU 2014 by Corey A Harper 2014-09-16 A brief intro to machine learning & data science for Libraries

2 Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014 Context. Narrative. Storytelling. The Library's story, and the Archives' story, but also…

3 Users’ stories. Scholars' stories. Adding context through recombinant metadata.

4 Scholars' & Users' Stories – Tim Sherratt (@wragge). Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/

5 Library Authority Data: “Include links to other URIs. so that they can discover more things.” Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.

6 Linked data is about context. Authorities provide context, and yet our controlled vocabs are nearly gone because the interfaces to them were broken.


9 The Death of Browse
Next-Gen Discovery Systems don't make use of Authority Control
“Browse” was/is broken as a UI Design
Rich data in Authorities, disconnected from narrative, context, search
Richer “Authority” type data outside libraries...
“Next Gen Next Gen Discovery…


14 FuzzyWuzzy – Awesome Library from SeatGeek
https://github.com/seatgeek/fuzzywuzzy
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
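The idea behind this kind of fuzzy string matching can be sketched with Python's standard-library difflib. This is a rough approximation under stated assumptions, not FuzzyWuzzy's own code: the token_sort_ratio function below only imitates the idea of FuzzyWuzzy's function of the same name (sort the tokens, then compare), and the authority headings are made-up sample data.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and sort the tokens alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def token_sort_ratio(a, b):
    """Similarity from 0 to 100 that ignores word order and punctuation."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b))
    return round(100 * matcher.ratio())


def best_match(query, headings):
    """Return the (heading, score) pair most similar to the query string."""
    scored = ((heading, token_sort_ratio(query, heading)) for heading in headings)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical sample headings, for illustration only.
AUTHORITY_HEADINGS = ["Armstrong, Louis", "Ellington, Duke", "Holiday, Billie"]

heading, score = best_match("Louis Armstrong", AUTHORITY_HEADINGS)
print(heading, score)  # Armstrong, Louis 100
```

Sorting the tokens before comparing is what lets a direct-order name match an inverted heading. A real workflow would still need a score threshold and human review for near-ties.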

15 Slide courtesy of Doug Oard, Univ. of Maryland

16 Tools - Natural Language Processing
DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Zemanta: http://www.zemanta.com/?wpst=1
Open Calais: http://www.opencalais.com/
Open Refine: http://openrefine.org/
DataTXT: https://dandelion.eu/products/datatxt/
AlchemyAPI: http://www.alchemyapi.com/
FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy


19 Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts.

20 Linked Jazz Back End

21 Primo PNX and Authorities
Indexing Cross References
New Browse Functionality
Authority Control from Aleph / Alma
What about non-MARC, or non-Aleph Data?
Matching Strings to Authorities

22 Enter Open Refine: http://freeyourmetadata.org/

23 Match strings to vocabularies…

24 Like LCNAF…

25 Or Wikipedia

26 Automated Authority Control?


28 Open Refine RDF Skeleton


30 Proposed System Architecture

31 Hydra Modeling & Architecture
Approaches to Provenance
  PROV-O
  Named Graphs
  Named Datastreams
“n” nyucore “records”
  Same properties defined for each
  Keep data sources separate
  Merge for display in Blacklight & export to Primo

32 Separate Metadata Datastreams
source_metadata, enrich_metadata
  Reload one or both without affecting other or native metadata
native_metadata
  Edited only through Hydra UI
  Partitioned from external sources
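The separation described here (one record per source, same properties in each, merged only for display) can be illustrated in plain Python. This is a sketch under assumptions, not Hydra or Fedora code: only the three datastream names come from the slide, while the field names, values, merge order, and the rule that a duplicate value keeps the provenance of the first datastream that supplied it are all hypothetical.

```python
# Only the datastream names come from the slide; the order is an assumption.
DISPLAY_ORDER = ("native_metadata", "source_metadata", "enrich_metadata")


def merge_for_display(datastreams):
    """Merge per-source records into one view that remembers provenance.

    `datastreams` maps a datastream name to a dict of field -> list of values.
    The result maps field -> list of (value, datastream name) pairs.
    """
    merged = {}
    for name in DISPLAY_ORDER:
        for field, values in datastreams.get(name, {}).items():
            pairs = merged.setdefault(field, [])
            seen = {value for value, _ in pairs}
            for value in values:
                if value not in seen:
                    pairs.append((value, name))
                    seen.add(value)
    return merged


# Hypothetical record, for illustration only.
record = {
    "native_metadata": {"title": ["Oral history interview"]},
    "source_metadata": {"title": ["Oral history interview"],
                        "creator": ["Smith, J."]},
    "enrich_metadata": {"subject": ["http://example.org/authority/jazz"]},
}

# Reloading one external datastream leaves the other two untouched.
record["enrich_metadata"] = {
    "subject": ["http://example.org/authority/jazz-musicians"]
}

view = merge_for_display(record)
print(view["subject"])
```

Because each value carries the name of its datastream, a display layer can show or hide enrichment data without ever writing it into the natively edited record.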

33 Metadata Provenance

34 Fedora Datastreams

35 Blacklight User Interface

36 Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts.

37 A Role for Ex Libris
Alma &/or Primo
  Named Entity Recognition
  Vocabulary Reconciliation
  Provenance Management
Primo Central
  Named Entity Recognition on Full Text
  Auto Classification

38 A bit louder... We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger, for knowledge organization experts.

39 Simplified Workflow Proposal

40 More Tools – At Programming Level
Open NLP: https://opennlp.apache.org/
Stanford Natural Language Toolkit: http://nlp.stanford.edu/software/index.shtml
Python Tools: scikit-learn, pandas, NLTK, SciPy, NumPy
https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
http://pandas.pydata.org/
http://www.nltk.org/

41 More Data Science-ey Tools: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html

42 Data Science Techniques
Feature Extraction / Feature Engineering
Predictive Modeling
  Probabilistic Classification – Large Multi-Class Problems
Text Analytics
  Vectorization
  Bags & Sets of Words
  TF/IDF
  N-Grams
  Sparse Matrices
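Several of these text-analytics terms fit into one small example: turning documents into bags of word n-grams, weighting them by TF-IDF, and storing only the non-zero entries. The sketch below uses only the Python standard library and a made-up three-document corpus. The weighting is the plain textbook form (count times log of inverse document frequency); toolkits such as scikit-learn typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return word n-grams of length 1..n_max for a whitespace-tokenized text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf(documents, n_max=2):
    """Return one sparse vector (dict of term -> weight) per document.

    Weight = count of the term in the document
             * log(number of documents / documents containing the term).
    Terms absent from a document are not stored, which keeps vectors sparse.
    """
    bags = [Counter(ngrams(doc, n_max)) for doc in documents]
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(documents)
    return [{term: count * math.log(n_docs / doc_freq[term])
             for term, count in bag.items()}
            for bag in bags]


# Hypothetical toy corpus, for illustration only.
docs = ["jazz oral history", "jazz discography", "library oral history"]
vectors = tfidf(docs)

for doc, vector in zip(docs, vectors):
    print(doc, "->", sorted(vector, key=vector.get, reverse=True)[:2])
```

Terms that appear in only one document ("discography") get the highest weight, while terms shared across documents ("jazz", "oral history") are discounted.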

43 Simple Example – Predict Yelp Star Ratings

44 Fitting a Model – Naïve Bayes
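The code shown on these two slides is not in the transcript, so the following is a stand-in rather than the presenter's example: a minimal multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. The reviews and star ratings are invented toy data, not real Yelp data, and the NaiveBayes class is a hypothetical helper for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bags of words, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy reviews with star ratings; not real Yelp data.
reviews = ["great food friendly staff", "terrible service cold food",
           "great service great coffee", "cold coffee terrible staff"]
stars = [5, 1, 5, 1]

model = NaiveBayes().fit(reviews, stars)
print(model.predict("friendly staff and great coffee"))  # 5
```

With real review data the same approach would normally be run through a library such as scikit-learn or NLTK, both named in the tools slide, and evaluated on held-out reviews.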

45 Data Science Venn Diagram: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

46 http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

47 Where can we go from here?
NER is just the beginning
Feature Engineering
Hiring Statisticians
Clustering & Classification
Vocabulary Pruning and Engineering
Manageable 10-20k Class Text Classification Problems
Domain Specific
Ex Libris’ Activity in this space

48 Thanks! corey.harper@nyu.edu 212.998.2479 @chrpr

