Natural Language Processing for LODLAM. Presented at IGeLU 2014 by Corey A Harper. A brief intro to machine learning & data science for libraries.
Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014 Context. Narrative. Storytelling. The Library's story, and the Archives' story, but also…
Users’ stories. Scholars' stories. Adding context through recombinant metadata.
Scholars' & Users' Stories – Tim Sherratt. Also:
Library Authority Data: “Include links to other URIs. so that they can discover more things.” Short of providing and linking to URIs, this *is* authority data. This is what our authority files are for.
Linked data is about context. Authorities provide context, and yet our controlled vocabs are nearly gone because the interfaces to them were broken.
The Death of Browse: Next-Gen Discovery Systems don't make use of Authority Control; “Browse” was/is broken as a UI Design; Rich data in Authorities, disconnected from narrative, context, search; Richer “Authority” type data outside libraries... “Next Gen Next Gen Discovery…
FuzzyWuzzy – Awesome library from SeatGeek
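To give a feel for what fuzzy string matching does with name headings, here is a minimal sketch. It uses Python's standard-library difflib as a stand-in rather than FuzzyWuzzy itself, and the function names `simple_ratio` and `token_sort_ratio` are illustrative only (FuzzyWuzzy offers comparable scorers in its `fuzz` module). The sample names are assumptions, not data from the talk.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Like simple_ratio, but ignores punctuation and word order."""
    def normalize(s: str) -> str:
        # Replace punctuation with spaces, then sort the words.
        cleaned = "".join(ch if ch.isalnum() else " " for ch in s.lower())
        return " ".join(sorted(cleaned.split()))
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


inverted = "Ellington, Duke"   # catalog-style inverted form
direct = "Duke Ellington"      # form found in running text

print(simple_ratio(inverted, direct))      # penalized by word order
print(token_sort_ratio(inverted, direct))  # word order no longer matters
```

Sorting tokens before comparing is one simple way to make inverted and direct-order name forms comparable; a real workflow would likely combine several scorers.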
Slide courtesy of Doug Oard, Univ. of Maryland
Tools - Natural Language Processing: DBPedia Spotlight; Zemanta; Open Calais; Open Refine; DataTXT; AlchemyAPI; FuzzyWuzzy
Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts.
Linked Jazz Back End
Primo PNX and Authorities: Indexing Cross References; New Browse Functionality; Authority Control from Aleph / Alma; What about non-MARC, or non-Aleph Data? Matching Strings to Authorities
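The idea of matching strings to authorities, including their cross references, can be sketched roughly as follows. This is an assumption-laden illustration, not the Primo or Aleph mechanism: the records in `AUTHORITIES` are toy data, the helper names are hypothetical, and the 0.85 cutoff is an arbitrary choice.

```python
from difflib import get_close_matches

# Toy authority records (illustrative only): preferred heading -> variant forms.
AUTHORITIES = {
    "Ellington, Duke": ["Ellington, Edward Kennedy", "Duke Ellington"],
    "Holiday, Billie": ["Fagan, Eleanora", "Lady Day"],
}


def normalize(s: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in s.lower())
    return " ".join(cleaned.split())


def build_index(authorities):
    """Map every normalized heading and cross reference to its preferred heading."""
    index = {}
    for preferred, variants in authorities.items():
        for form in [preferred, *variants]:
            index[normalize(form)] = preferred
    return index


def match(name, index, cutoff=0.85):
    """Return (preferred heading, match type), or (None, 'unmatched')."""
    key = normalize(name)
    if key in index:
        return index[key], "exact"
    close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "unmatched"


INDEX = build_index(AUTHORITIES)
for raw in ["LADY DAY", "Elington, Duke", "Coltrane, John"]:
    print(raw, "->", match(raw, INDEX))
```

Indexing the cross references alongside the preferred forms lets a variant such as a nickname resolve directly, while the fuzzy fallback catches typos; unmatched strings are left for human review.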
Enter Open Refine
Match strings to vocabularies…
Like LCNAF…
Or Wikipedia
Automated Authority Control?
Open Refine RDF Skeleton
Proposed System Architecture
Hydra Modeling & Architecture: Approaches to Provenance: Prov-O, Named Graphs, Named Datastreams; “n” nyucore “records”; Same properties defined for each; Keep data sources separate; Merge for display in Blacklight & export to Primo
Separate Metadata Datastreams: source_metadata, enrich_metadata: reload one or both without affecting the other or native metadata; native_metadata: edited only through Hydra UI, partitioned from external sources
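The separation of datastreams might look something like the following in miniature. This is a plain-Python sketch of the idea, not Hydra or Fedora code: the datastream names come from the slide, but the field names, sample values, and the functions `reload_datastream` and `merge_for_display` are assumptions made for illustration.

```python
# Sample record: one dict per datastream, each using the same property names.
record = {
    "source_metadata": {"title": ["Oral history interview"], "subject": ["Jazz"]},
    "enrich_metadata": {"subject": ["Jazz musicians"]},
    "native_metadata": {"title": ["Oral history interview with a jazz pianist"]},
}


def reload_datastream(record, name, new_values):
    """Replace one external datastream wholesale, leaving the others untouched."""
    if name == "native_metadata":
        raise ValueError("native_metadata is edited only through the UI")
    updated = dict(record)
    updated[name] = new_values
    return updated


def merge_for_display(record, order=("native_metadata", "source_metadata",
                                     "enrich_metadata")):
    """Combine all datastreams, tagging each value with its datastream of origin."""
    merged = {}
    for name in order:
        for field, values in record.get(name, {}).items():
            for value in values:
                merged.setdefault(field, []).append(
                    {"value": value, "provenance": name})
    return merged


reloaded = reload_datastream(
    record, "enrich_metadata", {"subject": ["Jazz musicians", "Pianists"]})
for field, values in merge_for_display(reloaded).items():
    print(field, values)
```

Because each value carries the name of the datastream it came from, an enrichment run can be replaced or discarded without touching locally edited metadata.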
Metadata Provenance
Fedora Datastreams
Blacklight User Interface
Where does this lead? We need new interfaces, new tools, for a new kind of cataloger, for knowledge organization experts.
A Role for Ex Libris: Alma &/or Primo: Named Entity Recognition, Vocabulary Reconciliation, Provenance Management; Primo Central: Named Entity Recognition on Full Text, Auto Classification
A bit louder... We need new interfaces. We need enterprise tools, integrated into our metadata management systems, for a new kind of cataloger, for knowledge organization experts.
Simplified Workflow Proposal
More Tools – At Programming Level: Open NLP; Stanford Natural Language Toolkit; Python Tools: scikit-learn, pandas, NLTK, SciPy, NumPy
More Data Science-ey Tools
Data Science Techniques: Feature Extraction / Feature Engineering; Predictive Modeling: Probabilistic Classification – Large Multi-Class Problems; Text Analytics: Vectorization, Bags & Sets of Words, TF/IDF, N-Grams, Sparse Matrices
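Several of the text analytics terms above (vectorization, bags of words, TF/IDF, n-grams, sparse representations) fit together in a few lines. The sketch below is a pure-Python illustration using the unsmoothed textbook weight tf * log(N / df); libraries often use smoothed variants. The function names and the three sample documents are assumptions.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercased word n-grams for n = 1..n_max."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf(docs, n_max=2):
    """Return one sparse vector (dict: term -> weight) per document."""
    bags = [Counter(ngrams(d, n_max)) for d in docs]     # bag of n-grams per doc
    df = Counter(term for bag in bags for term in bag)   # document frequency
    n_docs = len(docs)
    # Only terms present in a document are stored, so each vector is sparse.
    return [{t: count * math.log(n_docs / df[t]) for t, count in bag.items()}
            for bag in bags]


docs = ["jazz piano trio", "jazz vocal", "piano sonata"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in few documents ("trio") receive more weight than terms shared across many ("jazz"), which is what makes the vectors useful as features for classification.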
Simple Example – Predict Yelp Star Ratings
Fitting a Model – Naïve Bayes
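As a rough idea of what fitting a Naïve Bayes model involves, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. It is a sketch under stated assumptions, not the code from the talk: the four reviews and their star labels are invented toy data rather than Yelp data, and in practice a library such as scikit-learn would do this work.

```python
import math
from collections import Counter, defaultdict


def fit(texts, labels, alpha=1.0):
    """Estimate class counts and per-class word counts from labeled texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha, "n": len(labels)}


def predict(model, text):
    """Return the class with the highest log posterior for the text."""
    v = len(model["vocab"])
    scores = {}
    for label, n_c in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_c / model["n"])            # log prior
        for w in text.lower().split():
            if w in model["vocab"]:                   # skip unseen words
                score += math.log((counts[w] + model["alpha"])
                                  / (total + model["alpha"] * v))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (invented): review text and star rating.
texts = ["great food friendly staff", "terrible service cold food",
         "great service great food", "cold rude terrible"]
labels = [5, 1, 5, 1]

model = fit(texts, labels)
print(predict(model, "great friendly service"))
print(predict(model, "cold rude"))
```

The same approach extends to many classes, which is why it is a common baseline for large multi-class text classification.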
Data Science Venn Diagram
Where can we go from here? NER is just the beginning; Feature Engineering; Hiring Statisticians; Clustering & Classification; Vocabulary Pruning and Engineering; Manageable 10-20k Class Text Classification Problems; Domain Specific; Ex Libris’ Activity in this space
Thanks!