Published by Josephine Lee. Modified over 9 years ago.
1
Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance
David P. Shorthouse, Université de Montréal / Canadensys
Dmitry Mozzherin, Marine Biological Laboratory / Global Names
@dpsSpiders, @dimus
2
Biota of Canada http://biologicalsurvey.ca
3
We want to find & then organize data from printed materials but search is exasperatingly limited
4
15,000 OCR articles & their scanned images (9GB)
5
Key Players
6
Global Names http://gnrd.globalnames.org http://resolver.globalnames.org
8
Named Entity Extraction: people, companies, organizations, cities, geographic features
9
elasticsearch
10
http://canent.shorthouse.net
11
https://github.com/dshorthouse/article_semanticizer
12
Search Characteristics
Tokenizers: path hierarchy
Filters: edge Ngram, pattern replace (abbreviated genera), stemmer (English), elisions (French)
Analyzers: lowercase, ascii folding, autocomplete
Full text
Thanks to: Christian Gendreau (Canadensys)
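The tokenizer, filter, and analyzer choices listed under Search Characteristics could be expressed as an elasticsearch index settings body roughly like the sketch below. This is not taken from the article_semanticizer repository. The analyzer and filter names (`taxon_path`, `autocomplete_edge`, `genus_abbreviation`, and so on), the gram sizes, the regular expression for abbreviated genera, and the list of French articles are all assumptions made for illustration. Only the component types (`path_hierarchy`, `edge_ngram`, `pattern_replace`, `stemmer`, `elision`, `lowercase`, `asciifolding`) are standard elasticsearch analysis components. The small `edge_ngrams` helper is a pure-Python illustration of what an edge n-gram filter emits for one token, not part of elasticsearch.

```python
import json


def build_index_settings(min_gram=2, max_gram=20):
    """Build an illustrative elasticsearch settings body.

    All custom names and parameter values here are hypothetical.
    """
    return {
        "analysis": {
            "tokenizer": {
                # Splits "Animalia/Arthropoda/Araneae" into each path prefix.
                "taxon_path": {"type": "path_hierarchy", "delimiter": "/"},
            },
            "filter": {
                # Emits leading prefixes of each token for autocomplete.
                "autocomplete_edge": {
                    "type": "edge_ngram",
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                },
                # Assumed handling of abbreviated genera: "p." becomes "p".
                "genus_abbreviation": {
                    "type": "pattern_replace",
                    "pattern": "^([a-z])\\.$",
                    "replacement": "$1",
                },
                "english_stemmer": {"type": "stemmer", "language": "english"},
                # Strips leading French articles such as "l'" and "d'".
                "french_elision": {
                    "type": "elision",
                    "articles": ["l", "d", "qu", "n", "s", "j", "m", "t", "c"],
                },
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "autocomplete_edge"],
                },
                "full_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "french_elision",
                        "asciifolding",
                        "english_stemmer",
                    ],
                },
                "scientific_name": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "genus_abbreviation"],
                },
                "path": {
                    "type": "custom",
                    "tokenizer": "taxon_path",
                    "filter": ["lowercase"],
                },
            },
        }
    }


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the leading prefixes of a token, as an edge n-gram filter would."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]
```

Serializing the result with `json.dumps(build_index_settings(), indent=2)` gives a body that could be supplied under `settings` when creating an index. Calling `edge_ngrams("pardosa", 2, 4)` shows why a partial query such as "par" can match the indexed term.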
13
Possible Next Steps
Generalize the design to best support content types (e.g. specimen labels)
Better recognition of other entities, text blocks
Scientific name plugin for elasticsearch (hackathon?)
Share with Journal Map and Mining Biodiversity
Engage scientific societies, journals