Download presentation
Presentation is loading. Please wait.
Published byMilo Parker Modified over 8 years ago
1
Web Service Exchange Protocols Preliminary Proposal ISO TC37 SC4 WG1 2 September 2013 Pisa, Italy
2
Goal define a Web Service Exchange Protocol (WSEP) that specifies a terminology for a core of linguistic objects and features exchanged among NLP tools could be generalized to apply to data exchange among tools in general rather than web services alone
3
Considerations What are the linguistic objects and their features that constitute the input and output of language processing tools? How should they be represented for the purposes of interchange among web services?
4
Assumptions 1.The objects and features defined in the WSEP will be based as much as possible on current practice. 2.The WSEP will provide a core set of objects and features, on the principle that “simpler is better”. 3.The WSEP will provide for (principled) definition of additional objects and features beyond the core to represent more specialized tool input and output. 4.The design of the WSEP will not be governed by the processing requirements or preferences of particular tools, systems, or frameworks.
5
Assumptions (con’t) 5.The WSEP will be required only for interchange among web services performing NLP tasks. As such it can serve as a “pivot” format to which user and tool-specific formats can be mapped. 6.The WSEP will be designed so as to allow for easy, one-to-one mapping from terms designating linguistic objects and features commonly produced and consumed by NLP tools that are wrapped as web services. – not necessary for the mapping to be object-to-object or feature-to- feature. 7.The web service provider is responsible for providing wrappers that perform the mapping from internally-used formats to and/or from the WSEP.
6
Assumptions (con’t) 8.The WSEP format should be compact to facilitate the transfer of large datasets. 9.The WSEP format will be chosen to take advantage, to the extent possible, of existing technological infrastructures and standards.
7
Linguistic objects We can distinguish the following: – labels associated with regions of data (either directly or indirectly) that identify them as representing specific linguistic objects – features associated with these objects that provide additional information – links between objects that identify a relation between them
8
distinction between labels and features not fixed – “token” as the label (object) and “lemma” and “POStag” as features vs.“lemma” and/or “POStag” as annotation labels (objects) in their own right Labels and links can be interchanged – syntactic role in a dependency parse typically treated as a link label, but is an object label in a constituency parse
9
Current practice Early stages of processing (e.g., sentence splitting, tokenization, part of speech tagging) have a more or less stable set of objects associated with them – subset used varies from tool to tool Create some basis for development of the protocols by focusing on a simple pipeline that calls a service for data source delivery, sentence splitting, tokenization, part of speech tagging
10
I/O Formats
11
More complex descriptions UIMA type systems Several available at http://www.anc.org/LAPPS/EP/materials/
12
JULIE morpho-syntactic specs Annotation Token – Lemma Value – msd (list; subtypes: Pennmsd, Geniamsd, etc.) tagsetId Value – stemmedForm Value – feats (filled with corresponding GrammaticalFeats) – Orthogr Simplex (a morpheme of a compound) – Lemma Abbreviation (subtype: Acronym) – Expan – textReference – definedHere (for locally introduced abbreviations/acronyms) Sentence GrammaticalFeats subtypes: – NounFeats (gender, case, number) – VerbFeats (tense, person, number, voice, aspect) – AdjectiveFeats (degree) – PronounFeats (gender, case, number, person)
13
Straw man proposal
14
Data source types TextDoc AudioDoc VideoDoc WebDoc MetaData for data sources – DocID – Source – Language – Encoding – others, depending on type–see e.g., GaleDocTypes.xml
15
Annotation document types SentenceAnnotationDoc TokenAnnotationDoc posTaggedDoc...
16
Representation Proposal: – Allow only standoff – Use JSON-based Serialization for Linked Data (JSON-LD) Lightweight, text-based, language-independent data interchange format Defines a small set of formatting rules for representing of structured data JSON-LD provides extension to JSON to interpret as Linked Data A large number of JSON parsers and libraries available
17
JSON-LD @context property provides definitions of names used in the JSON structure
18
JSON-LD Pre-defined data type for anchors
19
JSON-LD Provides a definition of tokenType
20
JSON-LD Metadata for the annotation document
22
Short version using @vocab
23
Links (edges)
24
JSON-LD Key property: ability to refer to online repositories of term definitions WSEP can specify a core set of terms designating objects and features less common objects/features returned by a producer can be suitably defined at a given IRI expect that the requirement to identify and clearly define objects and features would lead to an increasingly large set of pre-defined terms with IRIs – repositories such as ISOCat will house many of these definitions – tool-specific names and definitions that must be referenced as well
25
Implementation Each web service publishes two pieces of information – Provides: output objects/features – Requires: input objects/features – Methods provided by the service Results may be in the form of a discriminator vector of values that uniquely identify each object and feature included in the input or output Compatibility between service1 and service2 – established when the results of the requires method call to service2 is a subset of the results of the produces method call for service1 If a tool does not directly produce its results or consume its input in the form of JSON, the encapsulating service must provide a mapping to and from the input and output JSON realizations that can be used internally by the tool
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.