TextCrowd – Collaborative semantic enrichment of text-based datasets

Kathrin Beck (MPG/MPCDF)
EOSC Service Provider Workshop, 2017/09/13, Pisa

Agenda
  Introduction & aims of TextCrowd
  Progress report and current status
  Challenges
  Further work
  Success criteria
EOSCpilot workshop, Pisa 13/09/2017

TextCrowd team
  Franco Niccolucci, Achille Felicetti (PIN / Florence Univ.)
  Kathrin Beck, Thomas Zastrow (MPG/MPCDF; shepherds)

Introduction
  Fragmented research landscape in the Social Sciences and Humanities (SSH)
  EOSC: structure and integrate initiatives such as the CLARIN, DARIAH and E-RIHS ERICs, and Digital Humanities organizations (e.g. their association, ADHO)
  Offer advanced text-based services addressing common research needs
  Benefit for many scientists in the long tail, even though delivering such a service presents real challenges around interoperability and multilingualism

Introduction TextCrowd
Cultural heritage and humanities datasets are largely based on texts:
  Reports
    Archaeology: excavations, surveys
    Conservation: diagnosis, restoration (often mixed with numeric results)
  Grey literature
  Literary/historical sources
  Research articles
  Monographs
Aim:
  Linguistically annotated texts
  Machine learning models for natural language processing (NLP) tool chains
  Automatic annotation and information extraction via NLP tools
Size: the demonstrator will work with data in the range of megabytes, later extensible up to 2 million files

Progress report & current status
Named Entity (NE) categories:
  Artefact
  Colour
  Material
  Time period
  Person
  Place
  Site
  Timespan
  Technique
Target output formats / ontologies:
  RDF (Resource Description Framework, by W3C)
  CIDOC CRM (Conceptual Reference Model of CIDOC, ICOM's International Committee for Documentation; by the world museum community)
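
As a rough illustration of how a recognised entity could be serialised towards the RDF / CIDOC CRM targets, the sketch below emits N-Triples. The category-to-class mapping, the example.org base URI and the function names are assumptions made for this example, not the mapping used by TextCrowd. The class identifiers follow the CIDOC CRM RDFS naming in use at the time (e.g. E22_Man-Made_Object); Colour and Technique are mapped to the generic E55_Type only for lack of a more specific class.

```python
from urllib.parse import quote

CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/textcrowd/entity/"  # hypothetical base URI

# Assumed mapping from the NE categories to CIDOC CRM classes (illustrative only).
CATEGORY_TO_CRM = {
    "Artefact": "E22_Man-Made_Object",
    "Colour": "E55_Type",
    "Material": "E57_Material",
    "Time period": "E4_Period",
    "Person": "E21_Person",
    "Place": "E53_Place",
    "Site": "E27_Site",
    "Timespan": "E52_Time-Span",
    "Technique": "E55_Type",
}


def escape_literal(text):
    """Escape characters that are not allowed raw inside an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def entity_to_ntriples(category, surface, lang="it"):
    """Return two N-Triples lines: the entity's CRM class and its text label."""
    subject = BASE + quote(category.replace(" ", "_")) + "/" + quote(surface.replace(" ", "_"))
    crm_class = CRM + CATEGORY_TO_CRM[category]
    return [
        f"<{subject}> <{RDF_TYPE}> <{crm_class}> .",
        f'<{subject}> <{RDFS_LABEL}> "{escape_literal(surface)}"@{lang} .',
    ]


if __name__ == "__main__":
    for line in entity_to_ntriples("Material", "bronzo"):
        print(line)
```

A production mapping would more likely use an RDF library and link entities to the source document and to each other through CRM properties; this sketch only shows the typing step.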

GATE pipeline (desktop)

NLP Tools
GATE toolchain (https://gate.ac.uk/)
GATE pipeline has been refined and further developed:
  importing vocabularies and some pre-processing of their content
  replacing the Italian OpenNLP with FP7 project OpeNER components via web service calls from GATE, with resulting improvement in NER discovery
    OpeNER: neural network instead of OpenNLP maximum entropy model
  checking OpeNER outcomes
  refining stemming/lemmatization components
  developing part-of-speech (POS) rules for filtering on nouns when annotating
  specialised timespan and period component with pattern-based rules
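
The two rule-based steps above (pattern-based timespan detection and POS filtering on nouns) can be sketched as follows. In GATE such rules would typically be written in JAPE; Python regular expressions are used here only for illustration. The two patterns, the Universal POS tag names and the function names are assumptions, not the rules used in the TextCrowd pipeline.

```python
import re

# Assumed surface patterns for Italian timespans; real rules would cover many more forms.
CENTURY = re.compile(r"\b(?P<numeral>[IVX]+)\s+secolo(?:\s+(?P<era>a\.C\.|d\.C\.))?")
YEAR_RANGE = re.compile(r"\b(?P<start>\d{3,4})\s*-\s*(?P<end>\d{3,4})\b")


def find_timespans(text):
    """Return (start, end, kind, matched_text) tuples sorted by character offset."""
    spans = []
    for kind, pattern in (("century", CENTURY), ("year_range", YEAR_RANGE)):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), kind, match.group(0)))
    return sorted(spans)


NOUN_TAGS = {"NOUN", "PROPN"}  # assumes Universal POS tags from the tagger


def noun_candidates(tagged_tokens, vocabulary):
    """Keep tokens tagged as nouns whose lowercase form occurs in the vocabulary."""
    return [
        token
        for token, pos in tagged_tokens
        if pos in NOUN_TAGS and token.lower() in vocabulary
    ]


if __name__ == "__main__":
    sentence = "Ceramica del III secolo a.C. rinvenuta negli scavi 1998-2001."
    for span in find_timespans(sentence):
        print(span)
```

Filtering on nouns before the vocabulary lookup avoids annotating, for example, an adjective that happens to share its form with a vocabulary term.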

GATE tools (desktop)

D4Science VRE
  Operated and maintained by CNR-ISTI on the D4Science VRE
  Workflow engine with GATE pipeline, operated as REST-style web services (running in Sheffield)
  Intuitive, web-based user interface
  User management
  Storage (private and shared files)
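
To give an idea of what calling a REST-style annotation service looks like from a client, here is a minimal sketch using only the Python standard library. The endpoint URL, the JSON payload fields and the assumption of a JSON response are all hypothetical; the slides do not describe the actual interface of the GATE services behind the VRE.

```python
import json
import urllib.request

# Hypothetical endpoint: the real service URL and payload format are not given in the slides.
ENDPOINT = "https://example.org/textcrowd/gate/annotate"


def build_annotation_request(text, language="it"):
    """Build (but do not send) a POST request carrying the text to annotate."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def annotate(text, timeout=30):
    """Send the request and decode a JSON response (needs a live service)."""
    request = build_annotation_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect without network access.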

D4Science Dashboard

Storage

Pipeline

Challenges
  No annotated text corpora as training data for machine learning algorithms available
    Manual annotation of 400 pages of Italian archaeology reports in progress (current status: 200 pages of annotation)
  No user-friendly Cloud-based environment available
    Desktop GATE pipeline migrated into D4Science
  AAI issues
    The pilot focuses on freely available texts
    User management within D4Science

Further Work
  Finishing the manual annotation of training text corpora
  Training of machine-learning based NE applications
  Integration of the improved OpeNER recogniser into the D4Science GATE pipeline

Success criteria
  Creation of a text corpus for annotation
    When is it big enough? When is the output good enough, and which text types are most relevant?
    -> focus on reasonable quality for most common text types in contemporary Italian
  Interoperability of tool pipeline
    GATE offers all necessary tools, except for Italian NER
    -> interoperability provided by GATE developers
  Interoperability between TextCrowd's toolchain and other SSH workflow systems like WebLicht
    -> not a focus now
  Performance enhancements
  User-friendly Cloud-based named entity recognition (NER) workflow for Italian archaeologists
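
One common way to make "good enough" measurable is to score the automatic annotations against the manually annotated pages using precision, recall and F1. The slides do not say which metric the team uses, so the sketch below is an assumption: it applies strict exact-match scoring, and the function name and span representation are illustrative.

```python
def ner_scores(gold, predicted):
    """Exact-match precision, recall and F1.

    gold, predicted: iterables of (start, end, category) tuples. A prediction
    counts as correct only if offsets and category both match a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 6, "Artefact"), (10, 16, "Material"), (20, 35, "Timespan")}
    predicted = {(0, 6, "Artefact"), (10, 16, "Colour"), (20, 35, "Timespan"), (40, 44, "Place")}
    print(ner_scores(gold, predicted))
```

Scores computed per category and per text type would show where additional training data is most needed.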

Thank you for your attention! Any questions?