Published by Gary Hampton. Modified over 9 years ago.
1
XML and Digital Libraries
M. Zubair
Department of Computer Science, Old Dominion University
2
Background
A huge collection of technical reports exists in electronic form. Where are all the reports authored by John?
Problem: locating specific documents in a large collection. For example, locating all technical reports authored by “John” in a huge collection.
3
Issues
What is the problem?
Reports are stored on a computer storage device such as a hard disk. Each report could have been created with a word processor such as Microsoft Word.
We want the computer to retrieve documents based on a user query (in our example, all documents authored by “John”). In other words, can we write a program that goes through all the reports and identifies the ones written by John?
To start with, we need a program that can extract the text from a Word file. Suppose we can do that: how do we tell which string in the extracted text represents the author's name?
Can humans extract the author's name by looking at the technical report? Yes. Why? Because computers are dumb and humans have intelligence? The visual cues in the formatting of the report, along with accumulated knowledge, help a human identify a string as a name.
4
Sample Document
[Figure: first page of a sample technical report, with callouts labeling the title, authors, authors' affiliation, abstract, and introduction.]
5
Approach
How about putting explicit markups in the document to indicate what a particular string of text is? For example, store the sample document in a plain text file with markups rather than in a Word file.
Sample document as a plain text file with markups:
<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>
<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>
“<title>” is a markup to indicate the start of the title string
“</title>” is a markup to indicate the end of the title string
“<authors>” is a markup to indicate the start of the authors string
“</authors>” is a markup to indicate the end of the authors string
What happened to the presentation format (for printing, etc.)? We will address this issue shortly; for the time being we will ignore the presentation format.
6
Approach
Once we have all documents as plain text with markups, it is (relatively!) easy to write a program that goes through all the documents and identifies those satisfying a user query.
For example, to find all documents authored by “John”: go over all the documents and check for the string “John” between the start and end of the Authors markup (that is, between <authors> and </authors>).
Summary
By inserting markups in a document, we identify its various components, also referred to as elements.
The number and types of elements in a document are determined by the kinds of searches one needs to support. For example, if a user is going to search for documents by specific authors, the document should have the element type “authors”, and so on.
Note that although this partitioning of the document into components was driven by the need for smart searches, it also creates an abstraction of the document from which other versions, such as one suitable for display, can be created.
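The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names (<report>, <title>, <authors>), the file names, and the second and third sample reports are made up for the example, and each document is assumed to be well-formed so that a standard XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: file name -> document text with markups.
# Tag names, file names, and the last two reports are illustrative assumptions.
REPORTS = {
    "tr-001.xml": (
        "<report><title>Issues with Designing Large Scale Libraries "
        "Based on NCSTRL</title>"
        "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>"
        "</report>"
    ),
    "tr-002.xml": (
        "<report><title>A Report That Mentions John in Its Title</title>"
        "<authors>A. Smith</authors></report>"
    ),
    "tr-003.xml": (
        "<report><title>Another Report</title>"
        "<authors>John Doe and A. Smith</authors></report>"
    ),
}


def find_by_author(reports, name):
    """Return the names of documents whose <authors> element contains `name`."""
    matches = []
    for filename, text in reports.items():
        root = ET.fromstring(text)
        # Only the text between <authors> and </authors> is examined.
        authors = root.findtext("authors", default="")
        if name in authors:
            matches.append(filename)
    return matches


if __name__ == "__main__":
    print(find_by_author(REPORTS, "John"))
```

Because only the authors element is inspected, the report that merely mentions “John” in its title is not returned, which is the point of marking up the text.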
7
Presentation
Sample document as a plain text file with markups:
<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>
<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>
…………………….
For formatting purposes, a separate set of instructions, collectively known as a stylesheet, tells the computer how to render an abstract document with markups. For example, in the formatted presentation we can make the content of the Title element bold, large, and centered.
The computer converts the document from the abstraction (a structured document with markups) to a formatted rendition based on a given stylesheet.
It is possible to associate different stylesheets with a document to obtain different formatted presentations.
By abstracting a document, we represent it in a form from which we can generate different rendered versions by applying different stylesheets.
NOTE: A rendered version of the document may itself use markups for formatting purposes. Markup languages can represent both abstractions and renditions.
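To make the separation of abstraction and presentation concrete, here is a minimal Python sketch. The "stylesheets" are plain dictionaries invented for this illustration (they are not XSL or CSS), and the tag names and the <report> wrapper are assumptions. The same abstract document is rendered twice, once as plain text and once as HTML, which also shows a rendition that itself uses markups.

```python
import html
import xml.etree.ElementTree as ET

WIDTH = 72  # assumed line width for the plain-text rendition

# Abstract document; the tag names and <report> wrapper are assumptions.
DOCUMENT = (
    "<report>"
    "<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>"
    "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>"
    "</report>"
)

# Each toy stylesheet maps an element name to a formatting function.
PLAIN_STYLESHEET = {
    "title": lambda text: text.upper().center(WIDTH),
    "authors": lambda text: text.center(WIDTH),
}
HTML_STYLESHEET = {
    "title": lambda text: '<h1 style="text-align:center">' + html.escape(text) + "</h1>",
    "authors": lambda text: "<p><i>" + html.escape(text) + "</i></p>",
}


def render(document, stylesheet):
    """Render each top-level element using the rule for its element type."""
    root = ET.fromstring(document)
    lines = []
    for element in root:
        # Elements without a rule are emitted as unformatted text.
        formatter = stylesheet.get(element.tag, lambda text: text)
        lines.append(formatter(element.text or ""))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(DOCUMENT, PLAIN_STYLESHEET))
    print(render(DOCUMENT, HTML_STYLESHEET))
```

The document text never changes; only the stylesheet passed to render decides how the title and authors appear.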
8
Need for Rules
Assume that an organization that manages a huge collection of technical reports in electronic form has decided, based on its needs, to use the following element types: Title, Authors, Affiliation, and Abstract.
A typical abstraction of such a document will look like this:
<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>
<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>
<affiliation>Department of Computer Science, Old Dominion University</affiliation>
<abstract>NCSTRL is an unified …..</abstract>
What happens when a user creates a report that does not have all the elements properly filled in? For example, the user leaves out the authors by mistake. In that case the document will not be retrieved for queries that it should have matched.
There should be a means of validating a document before it is accepted by the computer.
How does the computer decide whether a document is valid? For this we need a formal definition of the rules that describe the document type (in our example, a technical report).
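A hand-written check gives a feel for what such validation involves. The sketch below is a simplified stand-in for real validation against a formal definition: it only verifies that the four element types named above are present and non-empty. The lowercase tag names and the <report> wrapper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Element types the organization requires (tag spelling is an assumption).
REQUIRED = ("title", "authors", "affiliation", "abstract")

COMPLETE = (
    "<report>"
    "<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>"
    "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>"
    "<affiliation>Department of Computer Science, Old Dominion University</affiliation>"
    "<abstract>NCSTRL is an unified …..</abstract>"
    "</report>"
)

# Same report with the authors element left out by mistake.
NO_AUTHORS = COMPLETE.replace(
    "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>", ""
)


def missing_elements(text):
    """Return the required element types that are absent or empty."""
    root = ET.fromstring(text)
    missing = []
    for name in REQUIRED:
        value = root.findtext(name)
        if value is None or not value.strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for label, doc in (("complete", COMPLETE), ("no authors", NO_AUTHORS)):
        problems = missing_elements(doc)
        print(label, "->", "valid" if not problems else "missing: " + ", ".join(problems))
```

A report is accepted only when missing_elements returns an empty list, so a document without authors is rejected before it can silently drop out of author queries.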
9
Document Type
Different communities have different types of documents with different abstractions. For example, lawyers deal with documents that are different from the ones scientists work on.
Sample markups used in a court document (http://www.courtxml.org):
Attorney for the Plaintiff Margret Marly Jefferson Margret Marly Jefferson 8200291 New Mexico 1982 Active
10
Document Type
There is a need to define what it means for each type of document to be valid.
The rules describing the validity of a court-type document will be different from the rules describing the validity of a technical abstract type document.
The formal definition that describes these rules for each type is called a document model, or Document Type Definition (DTD).
Summary
Different communities will develop their own vocabulary (a set of element types) and will describe constraints on document formation in terms of these element types.
11
Summary
Inserting markups helps in abstracting the document, that is, giving it an abstract structure, which makes it feasible to process documents electronically to handle user queries.
An element is one of the components of a document.
Separating abstract structure from presentation makes it possible to support several different presentation formats for the same document.
Different communities require different sets of markups to describe the elements of their document types.
12
What is XML?
Extensible Markup Language (XML) allows a community to define its own markups to describe the various elements of a document, along with constraints on the use of these markups.
A document with markups that follows the XML guidelines is referred to as an XML document.
A markup (tag) describes an element of a document.
A Document Type Definition (DTD) is a file that contains a set of rules for using XML to represent documents of a particular type.
A stylesheet is a file that contains a set of rules for generating formatted renditions from the XML abstractions.
XML is a subset of SGML (a much older standard). The main difference is that a DTD is not required for an XML document, but you can use one if you have it.
13
Java and XML
Together they offer a complete portable solution.
XML: data portability
Java: code portability
Example: a business-to-business application that provides Internet support for the suppliers of a large retailer.
Data movement between the supplier and the retailer can be handled using XML, which enables machine processing and easy validation of the data. XML also allows the use of a standard XML parser, which would not be possible with proprietary data formats.
The code developed in Java to manage XML data can be moved to a different platform, is Internet enabled, and supports the development of internationalized applications (Unicode support).
14
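Simple Example
Original document:
Hello World
Original document with semantic markups:
<greeting> <salutation>Hello</salutation> <addressee>World</addressee> </greeting>
DTD describing the rules for the markups used:
<!ELEMENT greeting (salutation, addressee)> // element greeting contains salutation and addressee
<!ELEMENT salutation (#PCDATA)> // element salutation contains character data
<!ELEMENT addressee (#PCDATA)> // element addressee contains character data
NOTE: Syntax for describing DTDs will be covered later.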
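The three rules stated in the comments above (greeting contains salutation and addressee; salutation and addressee contain character data) can be checked by hand in Python. This is a simplified sketch and not a DTD validator: it assumes the element names greeting, salutation, and addressee, assumes the two children must appear in that order, and ignores any stray text placed directly inside greeting.

```python
import xml.etree.ElementTree as ET


def conforms_to_greeting_rules(text):
    """Hand-written check of the three greeting rules (not real DTD validation)."""
    root = ET.fromstring(text)
    if root.tag != "greeting":
        return False
    children = list(root)
    # Rule 1: greeting contains salutation followed by addressee.
    if [child.tag for child in children] != ["salutation", "addressee"]:
        return False
    # Rules 2 and 3: both children hold character data only (no child elements).
    return all(len(child) == 0 for child in children)


if __name__ == "__main__":
    good = "<greeting><salutation>Hello</salutation><addressee>World</addressee></greeting>"
    bad = "<greeting><addressee>World</addressee></greeting>"
    print(conforms_to_greeting_rules(good), conforms_to_greeting_rules(bad))
```

A validating XML parser performs this kind of check automatically from the DTD, so the rules do not have to be hard-coded as they are here.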
15
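XML Document for the Simple Example
<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "greeting.dtd">
<greeting> <salutation>Hello</salutation> <addressee>World</addressee> </greeting>
Contents of greeting.dtd file:
<!ELEMENT greeting (salutation, addressee)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
The XML document is an instance of the document class identified by the DTD (greeting.dtd).
The prolog identifies the version of XML and the document type to which the XML document conforms.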
16
Another Example
XML DOCUMENT
Issues with Designing Large Scale Libraries Based on NCSTRL
K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang
Department of Computer Science, Old Dominion University
NCSTRL is an unified …..
Contents of report.dtd file
17
Well-Formed XML Documents
Every start tag has a matching end tag, and the elements are properly nested so that together they form a tree.
Example (use IE to view the file):
0000002150 SMITH AUBREY HILLSIDE DR PO BOX 2134 75034 Old Town
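One way to test well-formedness is simply to hand the text to an XML parser and see whether it is accepted. The Python sketch below does that with the standard library parser; the <record> and <name> tag names are hypothetical and are not taken from the example file.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a standard XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        # Hypothetical tag names, used only for illustration.
        "properly nested": "<record><name>SMITH AUBREY</name></record>",
        "missing end tag": "<record><name>SMITH AUBREY</record>",
        "overlapping tags": "<record><name>SMITH AUBREY</record></name>",
    }
    for label, text in samples.items():
        print(label, "->", is_well_formed(text))
```

Note that well-formedness says nothing about which elements are allowed; that is the job of validation against a DTD.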
18
Conclusion
XML technology is playing a central role in building digital libraries.