Published by Shannon Anthony. Modified over 9 years ago.
1
Unit no. 4: Mark-up
Adolf Knoll, National Library of the Czech Republic
adolf.knoll@nkp.cz
2
Learning objectives
After completing this unit, the learner will be able to:
- understand what to do with the digital output for further use
- understand the basics of mark-up languages, especially XML
- have a basic orientation in their application, enough to make correct decisions when building a digitization project
3
Production of a digital document
[Diagram; labels: Original document, Digitization, Description, Data, Metadata, Digital document]
4
What do we produce?
Data
- the direct product of digitization: digital images, full text, video and audio files
- usually a set of files that represent the original document
Metadata
- added value through textual information
- they express: identification with the original; structure and links to data files; technical information about data; accessibility; administrative matters; etc.
5
Mark-up
Created because of a need to store additional (hidden) information in text in order to:
- format it better when displayed and/or printed = prescriptive mark-up
- classify parts of it as objects relevant to various rules of description, such as cataloguing rules, rules for providing technical parameters, various good practices, rules for associating them with their visual representation, etc. = descriptive mark-up
6
Mark-up
For example, in MS Word the paragraph is marked with ¶: Paragraph¶
In HTML code the paragraph is marked with <p>paragraph</p>
In HTML, bold text and a line break are marked with <b>…</b> and <br>, as follows: This is an <b>HTML document</b>, which consists<br>of elements.
All this is procedural (prescriptive) mark-up. Note the use of <> brackets to start and end the marked-up element.
7
Objects
The mark-up marks OBJECTS.
Which objects? Those which we define as objects.
On what basis do we define them? On the basis of CERTAIN RULES.
How are the rules established? On the basis of an agreement; they are usually a written (even published) document specifying the objects that should be followed and described.
Examples: AACR2 cataloguing rules in libraries, ISBD rules, CDWA or AMICO description rules for museum objects, Data Dictionary for Still Digital Images, etc.
The description rules do not define how the objects are marked up; this is done via a formal mark-up language.
The most sophisticated mark-up approach is SGML.
8
General markup language
SGML, the Standard Generalized Markup Language (ISO standard from 1986), is the base for other derived approaches that may be called mark-up languages of the 2nd generation:
- HTML (prescriptive)
- TEI
- …
- XML (descriptive)
The markup language marks the object without assigning any kind of behaviour to it. Its behaviour is prescribed by an independent rule.
9
How does it work?
- The main construction unit of an SGML-based mark-up approach is called an ELEMENT.
- Each element must be defined by an external content descriptive rule; e.g. a cataloguing rule (AACR2 or another one) defines the element Title; it may also define sub-elements such as Main Title, Parallel Title, or Sub-Title, etc.
- As a result, there may be hierarchical relationships between elements (parents with children).
10
How to define the metadata standard?
We need formal rules to express the content descriptive standards.
In the SGML environment, this is done in the Document Type Definition (DTD).
A DTD can, among other things, do the following:
- list all the elements and set up their properties (mandatory, non-mandatory, repeatable, etc.)
- define relations between elements
- refine their attributes, e.g. through a list of permitted values
- point from them to external entities, i.e. other definitions or binary data, e.g. digital images
11
If we take as an example that we need a description element author, then:
- the content definition of the element author is given by description rules, e.g. AACR2
- the formal definition of the element author is given by rules for formal definition, e.g. a DTD
- the formal rule for display of the element author is given by rules of transformation for display, e.g. XSLT for XML
This is how we work in XML.
12
XML
eXtensible Markup Language
XML file (*.xml)
- It contains the reference to the DTD that controls it.
- It can contain the reference to the transformation rule that formats it for display, e.g. an XSLT file (*.xslt).
DTD (*.dtd)
- A DTD for XML is still written in SGML syntax; therefore, the W3C XML Schema has been introduced to replace it. A document can thus be controlled either by a DTD (*.dtd) or by a Schema (*.xsd).
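As a minimal sketch of these two references, the Python snippet below parses a small XML document whose prolog names a DTD and an XSLT stylesheet, then reads both references back. The file names Postcard.dtd and Postcard.xslt, the element content, and the helper name read_references are assumptions chosen for illustration; the parser only records the references and does not load either file.

```python
from xml.dom import minidom

# Hypothetical document: file names and content are illustrative only.
XML_TEXT = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="Postcard.xslt"?>
<!DOCTYPE PostcardDescription SYSTEM "Postcard.dtd">
<PostcardDescription>
  <title>Hronov</title>
</PostcardDescription>
"""


def read_references(xml_text):
    """Return (DTD system identifier, stylesheet instruction data)."""
    doc = minidom.parseString(xml_text)
    # The DOCTYPE declaration carries the reference to the controlling DTD.
    dtd = doc.doctype.systemId if doc.doctype is not None else None
    stylesheet = None
    # Processing instructions in the prolog are children of the document node.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            stylesheet = node.data
    return dtd, stylesheet


dtd_ref, stylesheet_ref = read_references(XML_TEXT)
print("DTD:", dtd_ref)
print("Stylesheet:", stylesheet_ref)
```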
13
DTD = Document Type Definition
- The basic construction piece is the ELEMENT.
- An ELEMENT can have content or it can be EMPTY.
- ELEMENTS can consist of other elements.
14
Here the element Title consists of a group of three elements (MainTitle, SubTitle, and ParallelTitle); of these, only MainTitle is mandatory; SubTitle and ParallelTitle are optional, and ParallelTitle can be repeated. In a DTD it is written like this:
<!ELEMENT Title (MainTitle, SubTitle?, ParallelTitle*)>
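One way to see what such a content model allows is to treat it as a regular expression over the sequence of child element names. The sketch below does this in Python for the Title model described above (MainTitle, then an optional SubTitle, then any number of ParallelTitle elements). The helper name matches_title_model and the sample titles are hypothetical; a real project would rely on a validating XML parser rather than this hand-written check.

```python
import re
import xml.etree.ElementTree as ET

# Content model "MainTitle, SubTitle?, ParallelTitle*" expressed as a regex
# over child element names, each name followed by a single space.
TITLE_MODEL = re.compile(r"(MainTitle )(SubTitle )?(ParallelTitle )*")


def matches_title_model(title_xml):
    """Check the children of a <Title> element against the content model."""
    title = ET.fromstring(title_xml)
    child_names = "".join(child.tag + " " for child in title)
    return TITLE_MODEL.fullmatch(child_names) is not None


# Illustrative data only.
ok = matches_title_model(
    "<Title><MainTitle>Il nome della rosa</MainTitle>"
    "<ParallelTitle>The Name of the Rose</ParallelTitle></Title>"
)
missing_main = matches_title_model(
    "<Title><SubTitle>A novel</SubTitle></Title>"
)
print(ok, missing_main)
```

The first call satisfies the model because SubTitle may be left out; the second fails because the mandatory MainTitle is missing.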
15
The element PageRepresentation links a particular page with the image or full text that represents it.
<!ATTLIST MonographPage Type (Advertisement | BackCover | BackEndSheet | Blank | FlyLeaf | FrontCover | FrontEndSheet | Index | ListOfIllustrations | ListOfMaps | ListOfTables | NormalPage | Spine | Table | TableOfContents | TitlePage) "NormalPage" >
<!ATTLIST PageImage href CDATA #REQUIRED >
<!ATTLIST PageText href CDATA #REQUIRED >
Note: we can also set up a list of attributes; here these are Type of the MonographPage and href, i.e. a reference to an external data entity.
16
<!ATTLIST MonographPage Type (Advertisement | BackCover | BackEndSheet | Blank | FlyLeaf | FrontCover | FrontEndSheet | Index | ListOfIllustrations | ListOfMaps | ListOfTables | NormalPage | Spine | Table | TableOfContents | TitlePage) "NormalPage" >
The above part of a DTD means this: the element MonographPage consists of the elements PageNumber, Notes and PageRepresentation. We classify the MonographPage according to its content into Types such as Advertisement, BackCover, …, TableOfContents, and TitlePage. We have set the default value to NormalPage, because we expect this will be the most frequent choice.
The meaning of the qualifying signs is as follows:
- Element (no sign) = the element is mandatory and occurs exactly once
- Element+ (the sign +) = the element is mandatory and occurs at least once
- Element? (the sign ?) = the element is not mandatory and can occur at most once
- Element* (the sign *) = the element is not mandatory and can occur any number of times (zero or more)
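The qualifying signs can be summarized as minimum and maximum occurrence counts; in standard DTD syntax the sign * permits zero or more occurrences. The following Python sketch encodes that table. The names OCCURRENCE and count_allowed are hypothetical and serve only to illustrate the rule.

```python
# Occurrence indicators of DTD content models as (minimum, maximum) counts.
# A maximum of None means "no upper limit".
OCCURRENCE = {
    "": (1, 1),      # no sign: mandatory, exactly once
    "+": (1, None),  # mandatory, one or more times
    "?": (0, 1),     # optional, at most once
    "*": (0, None),  # optional, zero or more times
}


def count_allowed(indicator, count):
    """Return True if an element may occur `count` times under `indicator`."""
    low, high = OCCURRENCE[indicator]
    return count >= low and (high is None or count <= high)


# Print which counts from 0 to 3 each indicator accepts.
for sign in ("", "+", "?", "*"):
    accepted = [n for n in range(4) if count_allowed(sign, n)]
    print(repr(sign), accepted)
```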
17
<!ATTLIST PageImage href CDATA #REQUIRED >
<!ATTLIST PageText href CDATA #REQUIRED >
Each element that does not consist of any further elements must be defined, too. The expression (#PCDATA) announces that, in the XML files written on the basis of this DTD, a parsable string of metadata is expected; here, for example, a page number like this: <PageNumber>221</PageNumber>
The sign | in (PageImage | PageText) indicates that only one of the two elements is applied for a particular PageRepresentation. The philosophy of this DTD is that, when a page is represented by both image and text, each of them is attached to a new PageRepresentation.
The ATTLIST (list of attributes) sets up the href attribute, declared as CDATA (a plain character string that is not parsed for mark-up), as a reference/navigation link to external data. The elements PageImage and PageText are empty, as they serve only to link the page to the image or full-text files.
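To illustrate how an application could follow these href links, the Python sketch below reads a page record shaped like the DTD fragments above. The record itself, the file name under Data/, and the helper name page_links are assumptions; Python's ElementTree does not read the DTD, so the default Type value is supplied by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical page record; the image file name is made up for illustration.
PAGE_XML = """
<MonographPage Type="FlyLeaf">
  <PageNumber>2</PageNumber>
  <Notes>List of publications</Notes>
  <PageRepresentation>
    <PageImage href="Data/page0002.gif"/>
  </PageRepresentation>
</MonographPage>
"""


def page_links(page_xml):
    """Return (page number, page type, list of linked files)."""
    page = ET.fromstring(page_xml)
    number = page.findtext("PageNumber")
    # ElementTree does not apply DTD defaults, so the default declared in
    # the ATTLIST ("NormalPage") is supplied here explicitly.
    page_type = page.get("Type", "NormalPage")
    # Each PageRepresentation holds one empty PageImage or PageText element.
    links = [rep.get("href") for rep in page.findall("PageRepresentation/*")]
    return number, page_type, links


number, page_type, links = page_links(PAGE_XML)
print(number, page_type, links)
```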
18
<MonographPage Type="FlyLeaf">
<PageNumber>2</PageNumber>
<Notes>List of publications of U. Eco at Bompiani</Notes>
<PageRepresentation><PageImage href="Data/….gif"/></PageRepresentation>
</MonographPage>
This is a concrete section from an XML file, where we can see that the reference is made to an image in GIF format located in the Data subdirectory. We can also see that it is page no. 2, of the Type FlyLeaf.
For better understanding, we will now carry out a simple project whose aim is to write a DTD for a document we may need in a project digitizing old postcards. The steps are: analysis of the document, establishment of the needed elements and their relationships, setup of the element linking to digitized images, writing the DTD, writing an XML file based on the DTD, and its display. The aim is to show how it is done, not to teach everything, as that requires a more thorough XML training course.
19
How to write a simple DTD?
1. Analyze carefully the object you wish to describe and represent.
2. Try to establish the elements necessary for description and their basic properties (mandatory yes/no, repeatable yes/no).
3. Try to define whether these elements will consist of other elements.
4. Establish from which elements the image files will be referenced.
20
Digitized postcard
The necessary elements and hierarchies for a DTD of a Digitized Postcard
Root element: PostcardDescription
Elements of the 2nd level:
- author (consists of surname and name elements)
- title
- theme
- publisher (consists of PlaceOfPublication, NameOfPublisher, DateOfPublication)
- PhysicalDescription (consists of Size and Technique elements)
- TypeOfDocument
- VisualRepresentation (consists of ImageOfRectoPart and ImageOfVersoPart elements)
- language
- annotation
21
They can be represented by this graph
22
Postcard.dtd
<!ATTLIST ImageOfRectoPart (preview | normal | excellent) #REQUIRED CDATA #REQUIRED >
<!ATTLIST ImageOfVersoPart (preview | normal | excellent) #REQUIRED CDATA #REQUIRED >
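As a rough sketch of how the analysis of the postcard turns into declarations, the Python snippet below generates a possible DTD from the element hierarchy. Several details are assumptions rather than the author's Postcard.dtd: the occurrence indicators (?, *), the attribute names quality and href, and the helper name build_dtd are all hypothetical placeholders.

```python
# Element hierarchy of the digitized postcard. The occurrence indicators
# (?, *) are assumptions; the analysis does not fix them.
CHILDREN = {
    "PostcardDescription": [
        "author?", "title", "theme*", "publisher?", "PhysicalDescription?",
        "TypeOfDocument?", "VisualRepresentation", "language?", "annotation?",
    ],
    "author": ["surname", "name"],
    "publisher": ["PlaceOfPublication", "NameOfPublisher", "DateOfPublication"],
    "PhysicalDescription": ["Size", "Technique"],
    "VisualRepresentation": ["ImageOfRectoPart", "ImageOfVersoPart"],
}
# Elements that only link to image files and therefore have no content.
EMPTY = {"ImageOfRectoPart", "ImageOfVersoPart"}
# Hypothetical attribute names: "quality" and "href" are placeholders.
ATTLIST = ("<!ATTLIST %s quality (preview | normal | excellent) #REQUIRED "
           "href CDATA #REQUIRED>")


def build_dtd(root):
    """Return DTD declarations for `root` and all of its descendants."""
    lines, seen, queue = [], set(), [root]
    while queue:
        name = queue.pop(0)
        if name in seen:
            continue
        seen.add(name)
        if name in CHILDREN:
            model = ", ".join(CHILDREN[name])
            lines.append("<!ELEMENT %s (%s)>" % (name, model))
            # Strip the occurrence indicator to get the child element name.
            queue.extend(child.rstrip("?*+") for child in CHILDREN[name])
        elif name in EMPTY:
            lines.append("<!ELEMENT %s EMPTY>" % name)
            lines.append(ATTLIST % name)
        else:
            lines.append("<!ELEMENT %s (#PCDATA)>" % name)
    return lines


dtd_lines = build_dtd("PostcardDescription")
print("\n".join(dtd_lines))
```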
23
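Postcard.xml
- author: Lyer (surname), Antonín (name)
- title: Hronov
- theme: views of streets; Nádražní ulice; Dvorská ulice; Jiráskova ulice; Náměstí
- publisher: Hronov (PlaceOfPublication), Karel Šefelín (NameOfPublisher), [1910] (DateOfPublication)
- PhysicalDescription: 9x13 cm (Size), colour printing (Technique)
- TypeOfDocument: postcard
- language: cz
- annotation: The postcard was sent by my great-grandmother to her husband, who was in military service in the first years of World War I.
Callouts on the slide: reference to a formatting stylesheet; reference to image files.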
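A minimal sketch of how such a file could be read by a program follows. The element values are taken from the postcard description; the exact tagging, the attribute names quality and href, the image file names, and the helper name describe are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Values follow the postcard description; the attribute names (quality, href)
# and the image file names are hypothetical.
POSTCARD_XML = """<PostcardDescription>
  <author><surname>Lyer</surname><name>Antonín</name></author>
  <title>Hronov</title>
  <theme>views of streets</theme>
  <publisher>
    <PlaceOfPublication>Hronov</PlaceOfPublication>
    <NameOfPublisher>Karel Šefelín</NameOfPublisher>
    <DateOfPublication>[1910]</DateOfPublication>
  </publisher>
  <VisualRepresentation>
    <ImageOfRectoPart quality="normal" href="recto.jpg"/>
    <ImageOfVersoPart quality="normal" href="verso.jpg"/>
  </VisualRepresentation>
  <language>cz</language>
</PostcardDescription>"""


def describe(postcard_xml):
    """Collect a few descriptive fields and the links to the image files."""
    root = ET.fromstring(postcard_xml)
    author = "%s, %s" % (root.findtext("author/surname"),
                         root.findtext("author/name"))
    images = {img.tag: img.get("href")
              for img in root.find("VisualRepresentation")}
    return {
        "author": author,
        "title": root.findtext("title"),
        "date": root.findtext("publisher/DateOfPublication"),
        "images": images,
    }


record = describe(POSTCARD_XML)
print(record)
```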
24
How does it work in a web browser?
When we click on the XML file:
- The browser will look for the formatting file (the stylesheet, i.e. the *.xslt file) and will call it.
- It will display the file following the prescribed rules.
- We can click on the links leading to the images that represent the postcard visually and we will be taken to them.
So, let's try it and click on the file Postcard.xml
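The display step can be imitated outside the browser. The Python sketch below is a stand-in for the XSLT stylesheet, not XSLT itself: it turns a small postcard record into HTML with clickable links to the image files. The attribute name href, the file names, and the function name render_html are assumptions.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical postcard record; attribute and file names are illustrative.
POSTCARD_XML = """<PostcardDescription>
  <title>Hronov</title>
  <VisualRepresentation>
    <ImageOfRectoPart href="recto.jpg"/>
    <ImageOfVersoPart href="verso.jpg"/>
  </VisualRepresentation>
</PostcardDescription>"""


def render_html(postcard_xml):
    """Format the record for display: a heading plus one link per image."""
    root = ET.fromstring(postcard_xml)
    title = html.escape(root.findtext("title", default=""))
    parts = ["<h1>%s</h1>" % title]
    visual = root.find("VisualRepresentation")
    for image in (visual if visual is not None else []):
        href = html.escape(image.get("href", ""), quote=True)
        parts.append('<a href="%s">%s</a>' % (href, image.tag))
    return "\n".join(parts)


page = render_html(POSTCARD_XML)
print(page)
```

A real stylesheet expresses the same mapping declaratively, so the formatting rules stay separate from both the data and the program that displays them.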
25
XML Conclusions
- The language makes it possible to define and control any type of description.
- It can relate descriptions to external data.
- It makes the structure of the digitized documents clear and readable for the long term.
- It ensures that the output of our work (production of XML files and digitized documents) corresponds to what we defined as our goal.
- This means that, for example, our Digital Library can be fed with correct and standardized documents, which also enables, among other things, their long-term digital preservation.
26
Work with XML
From the user perspective, a good digitization project develops XML editors that:
- make the work easy (filling in forms)
- check the validity against the applied DTD
- output only correct XML structures
If you wish to test your skills, download the free M-TOOL from the Manuscriptorium Digital Library free tools at http://manuscriptorium.com/Site/ENG/mtool_eng.asp and try to work with it.
27
Where to find more?
General
- http://www.w3.org/XML/ (XML Home)
- http://www.xml.com/pub/a/98/10/guide0.html (Technical Introduction to XML)
- http://www.altova.com/ (XMLSpy editor)
Applied
- http://digit.nkp.cz/techstandards.html (several DTDs implemented in functioning digital libraries)
- http://www.loc.gov/standards/mets/ (METS format for containerization of XML-based digital documents)
- http://www.tei-c.org/ (TEI, Text Encoding Initiative)