
1 Article Semanticizer – Stitching Data Mining Services Into a Standalone Search Appliance
David P. Shorthouse, Université de Montréal / Canadensys (@dpsSpiders)
Dmitry Mozzherin, Marine Biological Laboratory / Global Names (@dimus)

2 Biota of Canada http://biologicalsurvey.ca

3 We want to find & then organize data from printed materials but search is exasperatingly limited

4 15,000 OCR articles & their scanned images (9GB)

5 Key Players

6 Global Names http://gnrd.globalnames.org http://resolver.globalnames.org

7

8 Named Entity Extraction people, companies, organizations, cities, geographic features

9 elasticsearch

10 http://canent.shorthouse.net

11 https://github.com/dshorthouse/article_semanticizer

12 Search Characteristics
Tokenizers: path hierarchy
Filters: edge Ngram, pattern replace (abbreviated genera), stemmer (English), elisions (French)
Analyzers: lowercase, ascii folding, autocomplete
Full text
Thanks to: Christian Gendreau (Canadensys)
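
The search characteristics listed on this slide map onto Elasticsearch's index analysis settings. The slides do not show the actual configuration, so the following is only a minimal sketch of how such settings might be assembled. The analyzer and filter names (`autocomplete`, `full_text`, `path`, `genus_abbreviation`, and so on), the gram sizes, the elision article list, and the regex for abbreviated genera are all assumptions made for illustration. Only the component types (`path_hierarchy`, `edge_ngram`, `pattern_replace`, `stemmer`, `elision`, `lowercase`, `asciifolding`) are standard Elasticsearch analysis components.

```python
import json


def build_index_settings(min_gram=2, max_gram=20):
    """Return a hypothetical Elasticsearch index `settings` body.

    All custom names and parameter values here are illustrative
    assumptions; they are not taken from the article_semanticizer project.
    """
    return {
        "analysis": {
            "tokenizer": {
                # Emits every ancestor of a delimited path, e.g. a/b/c -> a, a/b, a/b/c
                "path_tokenizer": {"type": "path_hierarchy", "delimiter": "/"},
            },
            "filter": {
                # Prefix grams that drive search-as-you-type
                "autocomplete_edge": {
                    "type": "edge_ngram",
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                },
                # Placeholder regex: the slides do not give the real pattern
                # used for abbreviated genera.
                "genus_abbreviation": {
                    "type": "pattern_replace",
                    "pattern": r"^([a-z])\.$",
                    "replacement": "$1",
                },
                "english_stemmer": {"type": "stemmer", "language": "english"},
                # Strips leading French elided articles such as l' and d'
                "french_elision": {
                    "type": "elision",
                    "articles": ["l", "d", "qu", "j", "m", "t", "s", "n", "c"],
                },
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "autocomplete_edge"],
                },
                "full_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "french_elision",
                        "lowercase",
                        "asciifolding",
                        "genus_abbreviation",
                        "english_stemmer",
                    ],
                },
                "path": {"type": "custom", "tokenizer": "path_tokenizer"},
            },
        }
    }


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Pure-Python imitation of the prefixes an edge n-gram filter emits."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
    print(edge_ngrams("canadensys", 2, 4))
```

The `edge_ngrams` helper is not part of Elasticsearch. It only imitates the prefixes an edge n-gram filter would index for a single token, which is what lets a partially typed name match during autocomplete.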

13 Possible Next Steps
Generalize the design to best support content types (e.g. specimen labels)
Better recognition of other entities, text blocks
Scientific name plugin for elasticsearch (hackathon?)
Share with Journal Map and Mining Biodiversity
Engage scientific societies, journals

