Published by Jameson Lucey. Modified over 10 years ago.
1
Natural Language Processing for LODLAM
A brief intro to machine learning & data science for Libraries
Presented at IGeLU 2014 by Corey A. Harper, 2014-09-16
2
Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014
Context. Narrative. Storytelling.
The Library's story, and the Archives' story, but also…
3
Users’ stories
Scholars' stories
Adding context through recombinant metadata
4
Scholars' & Users' Stories – Tim Sherratt (@wragge)
Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/
5
Library Authority Data
“Include links to other URIs. so that they can discover more things.”
Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.
6
Linked data is about context. Authorities provide context, and yet our controlled vocabs are nearly gone because the interfaces to them were broken.
9
The Death of Browse
Next-Gen Discovery Systems don't make use of Authority Control
“Browse” was/is broken as a UI design
Rich data in Authorities, disconnected from narrative, context, search
Richer “Authority” type data outside libraries...
“Next Gen Next Gen Discovery…
14
FuzzyWuzzy – SeatGeek
FuzzyWuzzy: an awesome fuzzy string matching library from SeatGeek
https://github.com/seatgeek/fuzzywuzzy
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
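To give a feel for what fuzzy string matching does with name strings, here is a minimal sketch. It does not call FuzzyWuzzy itself; it uses Python's standard-library difflib as a stand-in, and the function names (normalize, simple_ratio, token_sort_ratio) are illustrative choices for this example, loosely modeled on the kinds of scores such libraries report.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def simple_ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100 on the normalized strings, order-sensitive."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Same score after sorting the tokens, so word order stops mattering."""
    sorted_a = " ".join(sorted(normalize(a).split()))
    sorted_b = " ".join(sorted(normalize(b).split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


if __name__ == "__main__":
    # Inverted and direct-order forms of the same name.
    print(simple_ratio("Davis, Miles", "Miles Davis"))
    print(token_sort_ratio("Davis, Miles", "Miles Davis"))
```

Sorting tokens before comparing is what lets an inverted heading ("Davis, Miles") line up with a direct-order form ("Miles Davis"), which is the common case when matching catalog strings to names found elsewhere.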
15
Slide courtesy of Doug Oard, Univ. of Maryland
16
Tools - Natural Language Processing
DBPedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
Zemanta: http://www.zemanta.com/?wpst=1
Open Calais: http://www.opencalais.com/
Open Refine: http://openrefine.org/
DataTXT: https://dandelion.eu/products/datatxt/
AlchemyAPI: http://www.alchemyapi.com/
FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
19
Where does this lead?
We need new interfaces and new tools for a new kind of cataloger: for knowledge organization experts.
20
Linked Jazz Back End
21
Primo PNX and Authorities
Indexing Cross References
New Browse Functionality
Authority Control from Aleph / Alma
What about non-MARC, or non-Aleph Data?
Matching Strings to Authorities
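Matching strings to authorities can be sketched roughly as follows: index every preferred heading and cross reference, try an exact lookup first, then fall back to a fuzzy match. This is only an illustration under stated assumptions. The AUTHORITIES dictionary is a hand-made sample rather than real LCNAF output, the helper names are hypothetical, and the 0.85 cutoff is an arbitrary starting point that would need tuning against real data.

```python
from difflib import get_close_matches

# Hypothetical sample records: preferred heading -> variant forms (cross references).
AUTHORITIES = {
    "Armstrong, Louis, 1901-1971": ["Louis Armstrong", "Satchmo", "Armstrong, Satchmo"],
    "Holiday, Billie, 1915-1959": ["Billie Holiday", "Lady Day"],
}


def build_index(authorities):
    """Map every case-folded heading and variant to its preferred heading."""
    index = {}
    for preferred, variants in authorities.items():
        for form in [preferred, *variants]:
            index[form.casefold().strip()] = preferred
    return index


def reconcile(name, index, cutoff=0.85):
    """Return (preferred heading, match type), or (None, 'unmatched')."""
    key = name.casefold().strip()
    if key in index:
        return index[key], "exact"
    # Fall back to the single closest indexed form above the cutoff (an assumed threshold).
    close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "unmatched"


if __name__ == "__main__":
    idx = build_index(AUTHORITIES)
    for raw in ["Satchmo", "Billy Holiday", "Zzz Qqq"]:
        print(raw, "->", reconcile(raw, idx))
```

Returning the match type alongside the heading keeps fuzzy hits distinguishable from exact ones, so a cataloger can review the uncertain matches instead of trusting them blindly.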
22
Enter Open Refine http://freeyourmetadata.org/
23
Match strings to vocabularies…
24
Like LCNAF…
25
Or Wikipedia
26
Automated Authority Control?
28
Open Refine RDF Skeleton
30
Proposed System Architecture
31
Hydra Modeling & Architecture
Approaches to Provenance: Prov-O; Named Graphs; Named Datastreams
“n” nyucore “records”: Same properties defined for each; Keep data sources separate; Merge for display in Blacklight & export to Primo
32
Separate Metadata Datastreams
source_metadata, enrich_metadata: Reload one or both without affecting the other or native metadata
native_metadata: Edited only through Hydra UI; Partitioned from external sources
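The separation of datastreams can be illustrated with a small sketch. The three datastream names come from the slide; everything else (the Record class, its reload and merged methods, and the display order in PRECEDENCE) is a hypothetical simplification and not the Hydra or Fedora API.

```python
from dataclasses import dataclass, field

# Assumed display order: local edits first, then source data, then enrichments.
PRECEDENCE = ("native_metadata", "source_metadata", "enrich_metadata")


@dataclass
class Record:
    """One object with three independent metadata datastreams (property -> values)."""
    native_metadata: dict = field(default_factory=dict)
    source_metadata: dict = field(default_factory=dict)
    enrich_metadata: dict = field(default_factory=dict)

    def reload(self, stream: str, values: dict) -> None:
        """Replace one external datastream wholesale; native_metadata is off limits."""
        if stream not in ("source_metadata", "enrich_metadata"):
            raise ValueError(f"{stream} cannot be reloaded from an external source")
        setattr(self, stream, dict(values))

    def merged(self) -> dict:
        """Combine all streams for display, tagging each value with its provenance."""
        view = {}
        for stream in PRECEDENCE:
            for prop, values in getattr(self, stream).items():
                for value in values:
                    view.setdefault(prop, []).append((value, stream))
        return view


if __name__ == "__main__":
    rec = Record(
        native_metadata={"title": ["Locally corrected title"]},
        source_metadata={"creator": ["Armstrong, Louis"]},
    )
    # The URI below is a placeholder, not a real identifier.
    rec.reload("enrich_metadata", {"creator": ["http://example.org/authority/armstrong"]})
    print(rec.merged())
```

Because each value in the merged view carries the name of the datastream it came from, a re-harvest or a re-run of enrichment replaces only its own stream, and the display layer can still show where every statement originated.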
33
Metadata Provenance
34
Fedora Datastreams
35
Blacklight User Interface
36
Where does this lead?
We need new interfaces and new tools for a new kind of cataloger: for knowledge organization experts.
37
A Role for Ex Libris
Alma &/or Primo: Named Entity Recognition; Vocabulary Reconciliation; Provenance Management
Primo Central: Named Entity Recognition on Full Text; Auto Classification
38
A bit louder...
We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger: for knowledge organization experts.
39
Simplified Workflow Proposal
40
More Tools – At Programming Level
Open NLP: https://opennlp.apache.org/
Stanford NLP Group software: http://nlp.stanford.edu/software/index.shtml
Python Tools: scikit-learn, pandas, NLTK, SciPy, NumPy
https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
http://pandas.pydata.org/
http://www.nltk.org/
41
More Data Science-ey Tools http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html
42
Data Science Techniques
Feature Extraction / Feature Engineering
Predictive Modeling
Probabilistic Classification – Large Multi-Class Problems
Text Analytics
Vectorization
Bags & Sets of Words
TF/IDF
N-Grams
Sparse Matrices
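Several of the text analytics terms above fit together in one small pipeline: tokens become a bag of n-grams, counts are weighted by TF/IDF, and each document ends up as a sparse vector. The sketch below uses only the Python standard library, with a dict standing in for a sparse matrix row. The function names and sample documents are made up for illustration, and the weighting shown (relative term frequency times log of N over document frequency) is just one common variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, max_n=2):
    """Bag of unigrams up to max_n-grams as a term -> count mapping."""
    tokens = text.lower().split()
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return Counter(terms)


def tfidf(docs, max_n=2):
    """Return one sparse vector (dict of term -> weight) per document."""
    bags = [featurize(doc, max_n) for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


if __name__ == "__main__":
    sample = ["jazz trumpet solo", "jazz piano trio", "jazz vocal ballad"]
    for vector in tfidf(sample):
        print(vector)
```

A term that occurs in every document ("jazz" in the sample) gets weight zero, while terms and bigrams unique to one document carry the signal. Only terms that actually occur in a document are stored, which is the point of a sparse representation.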
43
Simple Example – Predict Yelp Star Ratings
44
Fitting a Model – Naïve Bayes
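For readers who have not seen one, a Naïve Bayes text classifier can be written in a few dozen lines. The sketch below is a plain-Python multinomial version with add-one smoothing, and it is not the code shown in the talk. The four training reviews and their star labels are invented for illustration and are not Yelp data. In practice a library such as scikit-learn would be the usual choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Unnormalized log posterior for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented toy reviews with 1-star and 5-star labels.
    reviews = [
        "great food friendly staff",
        "amazing great service",
        "terrible food rude staff",
        "awful slow service",
    ]
    stars = [5, 5, 1, 1]
    model = NaiveBayes().fit(reviews, stars)
    print(model.predict("great friendly service"))
    print(model.predict("rude slow staff"))
```

The same structure carries over to library data: swap star ratings for subject classes and review text for titles or abstracts, and the problem becomes the large multi-class classification task mentioned earlier.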
45
Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
46
http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
47
Where can we go from here?
NER is just the beginning
Feature Engineering
Hiring Statisticians
Clustering & Classification
Vocabulary Pruning and Engineering
Manageable 10-20k Class Text Classification Problems
Domain Specific
Ex Libris’ Activity in this space
48
Thanks! corey.harper@nyu.edu 212.998.2479 @chrpr