PANACEA - Y2 After the 2nd Annual Review, 28th February 2012, Barcelona
Objectives
Join together a number of advanced interoperable tools to build a platform/factory/production line that automates the stages involved in acquiring, processing and producing the Language Resources required by MT and other Language Technologies.
Partners
WP1 – Management (UPF)
WP3 – The Platform (UPF)
WP4 – Corpus Acquisition & Annotation (ILSP)
WP5 – Parallel corpus & derivatives (DCU)
WP6 – Lexical Acquisition (UCAM)
WP7 – Integration & resource evaluation (ILC)
WP8 – Evaluation in industrial environment (LT)
WP2 – Dissemination and Exploitation (ELDA)
Platform
The PANACEA platform is an interoperability space based on tools, guidelines, a Common Interface definition, and a “Travelling Object” specification.
– Tools: Taverna, BioCatalogue, myExperiment, Soaplab
– Common Interface: WS interoperability
– Travelling Object: XCES and GrAF
– Documentation: video tutorials, how-tos, deliverables, etc.
Tools
SOAPLAB 2 (SOAP): Web Services
– Web application for deploying command line tools as WS
– No coding needed! Metadata only
– Services deployed by ILSP
TAVERNA: Workflow editor
– Open source desktop application
– Imports Soaplab and other types of WS
– Allows for combination of WS in workflows
BioCatalogue: Registry
– Web application for registering and documenting WSs
– Search function
– Auto-checks web services status
– Annotations: tags, categories, etc.
myExperiment: Social network
– Share workflows, files, data, etc.
– Share opinions and comments, create work groups, etc.
Interoperability
Three levels of interoperability:
– Communication protocols: SOAP, REST
– Data
– Parameters
[Figure, data level: when Tool A, Tool B and Tool C each use their own format (N, M, L), Tool B does not “understand” format N; when every tool uses the same format, all tools understand the output of the previous one.]
[Figure, parameter level: Tool A and Tool B sharing the same parameter names (ABCD, ABCD) versus using different ones (YTQZ, ABCD).]
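The data-level problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "native" formats (`word/TAG` pairs for Tool A, tab-separated lines for Tool B) and all function names are hypothetical and are not PANACEA formats or services. The point it shows is that each tool needs only a converter to and from one shared representation, instead of one converter per pair of tools.

```python
# Hypothetical native formats for two tools; neither is a real PANACEA format.

def parse_slash_format(text):
    """Tool A's (assumed) output: 'word/TAG' items separated by spaces."""
    tokens = []
    for item in text.split():
        # Split on the last '/', so words that contain '/' keep their prefix.
        word, _, tag = item.rpartition("/")
        tokens.append({"word": word, "tag": tag})
    return tokens


def render_tab_format(tokens):
    """Tool B's (assumed) input: one 'word<TAB>tag' pair per line."""
    return "\n".join(f"{t['word']}\t{t['tag']}" for t in tokens)


def run_pipeline(tool_a_output):
    """Chain Tool A into Tool B through the shared token representation."""
    common = parse_slash_format(tool_a_output)  # format N -> common format
    return render_tab_format(common)            # common format -> format M


if __name__ == "__main__":
    print(run_pipeline("The/DT cat/NN sleeps/VBZ"))
```

With a shared pivot format, adding a new tool means writing two converters for that tool only; without one, every existing tool would need a converter for the newcomer's format.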
Travelling Object
The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other (syntactic interoperability).
First TO, for annotations up to tagging and lemmatization:
– Based on XCES (XML files with p, s, and t elements)
– Tools: formatConverters and stylesheets
Second TO, for everything else (NER, dependency parsing, etc.):
– Based on GrAF (standoff annotation)
– One file for primary data
– One file for each annotation layer
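The difference between the two Travelling Objects (inline annotation versus standoff annotation) can be illustrated with a minimal Python sketch. Only the element names p, s and t come from the slide; the root element name, the attribute names (`word`, `pos`, `lemma`) and the offset dictionaries are assumptions made for this example, and the standoff part shows the general idea only, not the actual GrAF serialization.

```python
import xml.etree.ElementTree as ET


def build_inline_document(paragraphs):
    """Inline style: nested p (paragraph), s (sentence), t (token) elements.

    `paragraphs` is a list of paragraphs, each a list of sentences, each a
    list of (word, pos, lemma) tuples. Attribute names are assumptions.
    """
    root = ET.Element("text")  # hypothetical root element name
    for sentences in paragraphs:
        p = ET.SubElement(root, "p")
        for tokens in sentences:
            s = ET.SubElement(p, "s")
            for word, pos, lemma in tokens:
                ET.SubElement(s, "t", {"word": word, "pos": pos, "lemma": lemma})
    return root


def build_standoff_layer(primary_text, words):
    """Standoff style: the primary text stays untouched and the layer only
    stores (start, end) character offsets that point into it."""
    spans, cursor = [], 0
    for word in words:
        start = primary_text.index(word, cursor)  # raises ValueError if absent
        end = start + len(word)
        spans.append({"start": start, "end": end})
        cursor = end
    return spans


if __name__ == "__main__":
    doc = build_inline_document([[[("Cats", "NNS", "cat"), ("sleep", "VBP", "sleep")]]])
    print(ET.tostring(doc, encoding="unicode"))
    print(build_standoff_layer("Cats sleep", ["Cats", "sleep"]))
```

In the standoff case, each further layer (for example named entities) would be another list of spans kept in its own file, all referring to the same primary text.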
Common Interface
A Common Interface (CI) defines the mandatory parameters for every type of WS.
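As a rough sketch of what a Common Interface check could look like, the Python below keeps a table of mandatory parameters per service type and reports which ones a call is missing. The service types and every parameter name here are hypothetical placeholders chosen for illustration; they are not the parameters defined by the PANACEA CI.

```python
# Hypothetical Common Interface table: service type -> mandatory parameter names.
COMMON_INTERFACE = {
    "tokenizer": {"input", "language"},
    "aligner": {"source_input", "target_input", "source_language", "target_language"},
}


def missing_parameters(service_type, params):
    """Return the mandatory parameters absent from a call, sorted by name."""
    required = COMMON_INTERFACE[service_type]
    return sorted(required - set(params))


def adapt(params, name_map):
    """Rename tool-specific parameter names to the shared CI names.

    Names that do not appear in `name_map` are passed through unchanged.
    """
    return {name_map.get(name, name): value for name, value in params.items()}


if __name__ == "__main__":
    # A tool that calls its parameters 'text' and 'lang' (assumed names).
    native_call = {"text": "Cats sleep", "lang": "en"}
    ci_call = adapt(native_call, {"text": "input", "lang": "language"})
    print(missing_parameters("tokenizer", native_call))
    print(missing_parameters("tokenizer", ci_call))
```

The `adapt` step corresponds to the parameter level of interoperability: once every service of a given type accepts the same parameter names, a workflow can swap one service for another without rewiring its inputs.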
Soaplab Web Services
28 Corpus Acquisition and Annotation Web Services
NLP WSs focusing on sentence splitting, tokenization, tagging, lemmatization and parsing, e.g.:
– EN, FR: Berkeley tagger and parser (DCU)
– ES: UPF tools, Freeling; IT: ILC’s DESR, Freeling
– DE and EL: LT’s and ILSP’s in-house tools
WSs for conversion from and to PANACEA’s Travelling Object (... and ILC)
WSs for alignment of parallel data
Corpus Acquisition WS
Focused Bilingual Crawler (FBC)
– Documentation
– Test
– Sample topic definition for crawling EN-FR pages in the Environment (ENV) domain
– Seed URL for crawling EN-FR ENV data
Focused Monolingual Crawler (FMC)
– Documentation
– Test
– Topic definition for crawling EN ENV data
– List of seed URLs for crawling EN ENV data
Taverna Workflow Demo
How can I align crawled data?
Search for a DCU-hosted alignment service at http://myexperiment.elda.org/workflows?query=align
Corpus Annotation WS
ILSP
– Documentation
– Test
– Sample input
ILC DESR (dependency parser)
– Workflow