Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance
David P. Shorthouse, Université de Montréal / Canadensys
Dmitry Mozzherin, Marine Biological Laboratory /
Biota of Canada
We want to find & then organize data from printed materials, but search is exasperatingly limited
15,000 OCR articles & their scanned images (9GB)
Key Players
Global Names
Named Entity Extraction: people, companies, organizations, cities, geographic features
elasticsearch
Search Characteristics
Tokenizers: path hierarchy
Filters: edge n-gram, pattern replace (abbreviated genera), stemmer (English), elisions (French)
Analyzers: lowercase, ASCII folding, autocomplete
Full text
Thanks to: Christian Gendreau (Canadensys)
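The search characteristics listed above could be expressed as Elasticsearch index settings roughly as follows. This is a minimal sketch, not the configuration actually used in the appliance: the analyzer and filter names (`article_path`, `autocomplete_edge`, `genus_abbreviation`, `full_text`, and so on), the n-gram sizes, the regular expression for abbreviated genera, and the French article list are all assumptions chosen for illustration. The helper `edge_ngrams` is a pure-Python stand-in that shows which prefixes an edge n-gram filter would index for autocomplete.

```python
import json


def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes an edge n-gram filter would emit for one token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical index settings; every custom name below is illustrative only.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Splits "journal/volume/issue" into each ancestor path.
                "article_path": {"type": "path_hierarchy", "delimiter": "/"}
            },
            "filter": {
                # Prefixes for search-as-you-type (sizes are assumptions).
                "autocomplete_edge": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                },
                # Assumed pattern: turn an abbreviated genus "P." into "P".
                "genus_abbreviation": {
                    "type": "pattern_replace",
                    "pattern": "^([A-Z])\\.$",
                    "replacement": "$1",
                },
                "english_stemmer": {"type": "stemmer", "language": "english"},
                # Strips French elided articles such as l' and d'.
                "french_elision": {
                    "type": "elision",
                    "articles": ["l", "d", "qu", "j", "n", "s", "m", "t", "c"],
                },
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "autocomplete_edge"],
                },
                "full_text": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "genus_abbreviation",
                        "lowercase",
                        "french_elision",
                        "asciifolding",
                        "english_stemmer",
                    ],
                },
                "path": {"type": "custom", "tokenizer": "article_path"},
            },
        }
    }
}

if __name__ == "__main__":
    # The JSON body could be sent when creating an index.
    print(json.dumps(INDEX_SETTINGS, indent=2))
    print(edge_ngrams("puma"))  # ['pu', 'pum', 'puma']
```

In this sketch the genus filter runs before `lowercase` because its pattern depends on the capital letter, and `lowercase` runs before the elision filter so that capitalized articles such as "L'" are also stripped.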
Possible Next Steps
Generalize the design to better support other content types (e.g., specimen labels)
Better recognition of other entities, text blocks
Scientific name plugin for elasticsearch (hackathon?)
Share with Journal Map and Mining Biodiversity
Engage scientific societies, journals