CCCT-04
Semantic Extensions to Domain-Specific Markup Languages
  Aparna Varde, Elke Rundensteiner, Murali Mani, Mohammed Maniruzzaman and Richard D. Sisson Jr.
  Worcester Polytechnic Institute (WPI), Worcester, Massachusetts, USA

Introduction
  XML, the eXtensible Markup Language: widespread standard for storing and publishing data.
  Domain-specific markup languages are designed with XML tag sets.
  Standardization bodies extend these languages to include additional semantics.
  Aspects such as domain knowledge and XML constraints are important.
  Focus of paper: generic issues in extending markup languages.

Domain-specific markup language
  Medium of communication for potential users of the domain.
  Users: industries, consumers, universities, research organizations, publishers etc.
  Follows XML syntax.
  Encompasses the semantics of the domain.
  Examples
    – MML: Medical Markup Language
    – MatML: Materials Science Markup Language
  [Figure: markup language as the medium linking industries, consumers, universities, research organizations and publishers]

MML: Medical Markup Language
  Creates standards for medical data to be stored and accessed worldwide.
  MML module contents, e.g., “basic clinic information”, “surgery record information”.
  Used by primary care physicians, general surgeons etc.
  Specific information in sub-areas such as “ophthalmology” cannot be stored with these modules.
  Thus there is a need for more semantics in MML.

Motivation for extension to markup languages
  Analogous to ophthalmology in the medical domain, other domains have their own specifics.
  Why not define a new markup language for each aspect?
    – Typically the generic language holds basic information that needs cross-referencing, e.g., basic surgical details in ophthalmology.
    – Common information should not be stored twice.
  Hence it is advisable to extend the existing markup language with additional semantics.

Extending the Materials Science Markup Language, MatML
  MatML: Materials Science Markup Language. XML for materials property data.
  Heat Treating: controlled heating and cooling of materials to achieve desired mechanical and thermal properties.
  Need to include semantics of Heat Treating in MatML.
  At WPI, a Heat Treating extension to MatML is proposed.
  Several issues, domain-specific and XML-related, are crucial here.

General issues in extending any markup language
  Steps essential in markup language extension.
  Desired language features.
  XML schema constraints.
  Retrieval using XQuery.

Steps essential in markup language extension
  1. Understand domain semantics.
  2. Model the data.
  3. Conduct interviews.
  4. Define the ontology.
  5. Reiterate the ontology.
  6. Outline the initial schema.
  7. Revise the schema based on critical reviews.

Understand domain semantics
  Acquire domain knowledge: terminology, processes, entities etc.
  This helps determine the essential tags for storing data in the domain.
  Study the existing markup language in detail.
  This is to understand exactly where it needs extension.

Model the data
  Build a data model after studying the domain.
  Use techniques such as Entity-Relationship diagrams.
  Thus represent domain entities, their properties and relationships.
  [Figure: subset of E-R diagram for Heat Treating]

Conduct interviews
  Needs of potential users are important.
  These help determine the entities and attributes in the extension.
  Users: industries, universities, research organizations, publishers etc.
  Domain experts can identify the needs of users.
  Hence, interview the domain experts.

Define the ontology
  Ontology serves as established lingo for the domain.
  Hence defining the ontology is important to proceed with design.
  Issues
    – Synonyms: two or more words with the same meaning, e.g., in the financial domain, “salary” and “income”.
    – Homographs: one word with multiple meanings, e.g., “share” in the financial domain could refer to “sharing of assets” or “shares in the stock market”.
  Clarify such terms with reference to context through the ontology.

Reiterate the ontology
  Once the ontology is established, it is useful to have another round of discussions with domain experts.
  These discussions may lead to further clarifications.
    – Example: remove existing entities or create new ones, based on terminology.
  The ontology needs to be altered accordingly.
  Use this ontology for schema design.
  [Figure: high-level ontology for Heat Treating]

Outline the initial schema
  Schema provides structure, i.e., defines the grammar for the markup language.
  Once the data model and ontology are approved by domain experts, outline the initial schema.
  Adhere to the syntax of the original markup language so that the schema can be accommodated as an extension.
  [Figure: partial snapshot of schema for Heat Treating extension to MatML]

Revise the schema based on critical reviews
  Initial schema serves as a medium of communication between designers and users.
  It is subject to further changes until domain experts are satisfied.
  Schema revision may involve several iterations.
  Some of these include discussions with standards bodies.
  For the proposed extension to be accepted as a worldwide standard, it must be approved by experts and standards bodies.

Desired language features
  1. Avoid redundancy.
  2. Make information non-ambiguous.
  3. Provide easy interpretability of data.
  4. Capture domain constraints in the schema.

Avoid redundancy
  Markup language extension should avoid duplication of storage.
  Data stored in the original markup language should be cross-referenced in the extension.
  Example
    – In the medical domain, there should be cross-referencing between “basic clinic information” in the original language and “ophthalmological details” in the extension.
  Schema should be structured accordingly.
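
One way to express such a cross-reference in XML Schema is with the built-in types xsd:ID and xsd:IDREF. The fragment below is only a sketch under stated assumptions: the element names “PatientRecord”, “BasicClinicInformation”, “OphthalmologyDetails”, “Summary” and “Findings”, and the attribute names “id” and “clinicInfoRef”, are hypothetical and are not taken from the actual MML schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical record holding both base-language data and extension data -->
  <xsd:element name="PatientRecord">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Data of the original language: stored once, identified by an ID -->
        <xsd:element name="BasicClinicInformation">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Summary" type="xsd:string"/>
            </xsd:sequence>
            <xsd:attribute name="id" type="xsd:ID" use="required"/>
          </xsd:complexType>
        </xsd:element>
        <!-- Extension data: points back to the basic information instead of copying it -->
        <xsd:element name="OphthalmologyDetails">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="Findings" type="xsd:string"/>
            </xsd:sequence>
            <xsd:attribute name="clinicInfoRef" type="xsd:IDREF" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

An xsd:IDREF value must match some xsd:ID value in the same document, so a validating parser would reject an extension entry that points at non-existent basic information.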

Make information non-ambiguous
  Domain terminology, its semantics, and aspects such as synonyms / homographs are significant.
  The schema design should adhere to the ontology to avoid ambiguity.
  Annotations should be included within the schema to enhance clarity.
  Example
    – For spectacle prescriptions in ophthalmology, include the meanings of the terms “myope” and “hypermetrope” in the schema as annotations.
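
A minimal sketch of such annotations, using the standard xsd:annotation and xsd:documentation elements. The element name “PatientRefraction” and the choice to model the two terms as enumeration values are assumptions made for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element for a spectacle prescription -->
  <xsd:element name="PatientRefraction">
    <xsd:annotation>
      <xsd:documentation>Refractive category of the patient.</xsd:documentation>
    </xsd:annotation>
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="myope">
          <xsd:annotation>
            <xsd:documentation>A near-sighted person (myopia).</xsd:documentation>
          </xsd:annotation>
        </xsd:enumeration>
        <xsd:enumeration value="hypermetrope">
          <xsd:annotation>
            <xsd:documentation>A far-sighted person (hypermetropia).</xsd:documentation>
          </xsd:annotation>
        </xsd:enumeration>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
```

Annotations do not affect validation; they only carry the ontology's definitions alongside the declarations they explain.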

Provide easy interpretability of data
  Data is stored using markup language tags.
  Readers should be able to interpret this data without much reference to the literature.
  Thus the schema design should be organized accordingly.
  Example
    – In science and engineering domains, experimental conditions should be stored close to results to enhance readability.

Capture domain constraints in the schema
  Certain requirements imposed by the domain need to be captured in the schema.
  This is done through XML schema constraints.
  Some constraints
    – Primary key: to uniquely identify an entity.
    – Choice: to declare mutually exclusive elements.
  Example: in the financial domain, a person could be either “insolvent” (bankrupt) or “asset-holder” but not both.

XML schema constraints
  1. Sequence constraint.
  2. Disjunction constraint.
  3. Key constraint.
  4. Occurrence constraint.

Sequence constraint
  To declare a list of elements in order.
  Enclose elements in <xsd:sequence> tags.
  Example
    – In the Heat Treating extension, element “QuenchConditions” must occur before “Results”.
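
As a minimal sketch of this constraint, the fragment below declares the two elements from the example inside xsd:sequence. The wrapper element “HeatTreatingExperiment” and the xsd:string content types are assumptions for illustration; they are not taken from the actual Heat Treating extension.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical wrapper element; only the two child names come from the slides -->
  <xsd:element name="HeatTreatingExperiment">
    <xsd:complexType>
      <!-- Children must appear in exactly this order -->
      <xsd:sequence>
        <xsd:element name="QuenchConditions" type="xsd:string"/>
        <xsd:element name="Results" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With this declaration, an instance that lists “Results” before “QuenchConditions” would be rejected by a validating parser.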

Disjunction constraint
  To declare mutually exclusive elements, i.e., only one of them can exist.
  Enclose elements in <xsd:choice> tags.
  Example
    – In Heat Treating, a part can be made by “Casting” OR “Powder Metallurgy”, not both.
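
A minimal sketch of the disjunction, using xsd:choice. The wrapper element “PartManufacturing”, the spelling “PowderMetallurgy” (element names cannot contain spaces) and the xsd:string content types are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical wrapper element for the manufacturing route of a part -->
  <xsd:element name="PartManufacturing">
    <xsd:complexType>
      <!-- Exactly one of the alternatives may appear -->
      <xsd:choice>
        <xsd:element name="Casting" type="xsd:string"/>
        <xsd:element name="PowderMetallurgy" type="xsd:string"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Because the choice has the default occurrence of exactly one, an instance containing both children, or neither, would fail validation.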

Key constraint
  To declare an attribute to be a primary key, i.e., it must be unique and non-null.
  Indicate the attribute as type “xsd:ID” and its use as “required”.
  Example
    – In Heat Treating, the name of the cooling medium (quenchant) is declared as a key; it is crucial because the purpose of the experiments is to categorize the quenchants.
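
A minimal sketch of the key declaration described above. The element name “Quenchant”, the attribute name “name” and the optional child “Properties” are hypothetical; only the type xsd:ID and the use “required” come from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical quenchant element keyed by its name -->
  <xsd:element name="Quenchant">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Properties" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
      <!-- xsd:ID gives uniqueness; use="required" gives the non-null property -->
      <xsd:attribute name="name" type="xsd:ID" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Two caveats follow from the definition of xsd:ID: values must be unique across all ID attributes in the document, not just among quenchants, and they must be valid XML names without spaces, so a quenchant name containing spaces would need to be normalized before use as a key.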

Occurrence constraint
  To declare minimum and maximum permissible occurrences of an element.
  Indicate “minOccurs = x” and “maxOccurs = y”, where “x” and “y” denote the minimum and maximum occurrences respectively.
  Value “maxOccurs = unbounded” means no upper bound on the number of occurrences.
  Value “minOccurs = 0” means that the element need not be stored even once.
  Example
    – In Heat Treating, Cooling Rate must be recorded at a minimum of 8 points in an experiment and there is no upper bound for it.
    – The maximum number of graphs stored per experiment is 3, and it is not necessary that at least one graph be stored.
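
The two examples above can be sketched as follows. The wrapper element “Experiment”, the element names “CoolingRate” and “Graph”, and the content types are assumptions for illustration; the occurrence bounds are the ones stated in the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical experiment element -->
  <xsd:element name="Experiment">
    <xsd:complexType>
      <xsd:sequence>
        <!-- At least 8 cooling rate points, no upper bound -->
        <xsd:element name="CoolingRate" type="xsd:decimal"
                     minOccurs="8" maxOccurs="unbounded"/>
        <!-- Graphs are optional, at most 3 per experiment -->
        <xsd:element name="Graph" type="xsd:string"
                     minOccurs="0" maxOccurs="3"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

When either attribute is omitted, XML Schema defaults it to 1, so an element declared without them must occur exactly once.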

Retrieval using XQuery
  1. Encourage users to store data in a case-sensitive manner.
  2. Use tags to enhance querying efficiency.

Encourage users to store data in a case-sensitive manner
  XQuery is case-sensitive.
  Hence it is useful to place emphasis on case when storing data using the markup language.
  This facilitates retrieval using XQuery.

Use tags to enhance querying efficiency
  It is possible to anticipate a typical user query in a domain.
  Thus it is advisable to add a level of abstraction for faster retrieval of information.
  Example
    – In Heat Treating, a user is likely to retrieve the name details of a quenchant without its property details.
    – Hence place a dedicated pair of tags around the quenchant name information.
    – Thus the entire path of the quenchant need not be traversed for name details.
    – This enhances querying efficiency.
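
A sketch of what such a layout might look like in an instance document. All tag names here (“Quenchant”, “NameDetails”, “PropertyDetails” and their children) and the text values are hypothetical placeholders, chosen only to illustrate the extra level of abstraction.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: name information and property information
     sit under separate wrapper tags -->
<Quenchant>
  <NameDetails>
    <CommonName>example quenchant</CommonName>
    <Supplier>example supplier</Supplier>
  </NameDetails>
  <PropertyDetails>
    <Viscosity>placeholder</Viscosity>
    <Density>placeholder</Density>
  </PropertyDetails>
</Quenchant>
```

A query can then address the path /Quenchant/NameDetails directly and leave the property subtree untouched.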

Conclusions
  Aspects of extending domain-specific markup languages discussed here.
  These include motivation for extension, steps in extension, language features, XML constraints and retrieval considerations.
  Extension to MatML proposed at CHTE, WPI to include Heat Treating semantics.
  Paper summarizes general issues in extending domain-specific markup languages.

Acknowledgments
  Database Systems Research Group in Department of Computer Science at WPI.
  Quenching Research Team in Department of Materials Science at WPI.
  Center for Heat Treating Excellence and its member companies.