XML and Digital Libraries M. Zubair Department of Computer Science Old Dominion University.

Background
We have a huge collection of technical reports in electronic form. Where are all the reports authored by John?
Problem: locating specific documents in a large collection, for example, locating all technical reports authored by “John” in a huge collection.

Issues
What is the problem? Reports are stored on a computer storage device such as a hard disk, and each report could have been created by a word processor like Microsoft Word. We want the computer to retrieve documents based on a user query (in our example, all documents authored by “John”). In other words, can we write a program that goes through all the reports and identifies the ones written by John?
To start with, we need a program that can extract the text from a Word file. Suppose we can do that: how do we locate which string in the extracted text represents the author's name?
Can humans extract the author's name by looking at the technical report? Yes. Why? Is it because computers are dumb and humans have intelligence? The visual cues in the formatting of the report, along with accumulated knowledge, help a human reader identify a string as a name.

Sample Document
[Figure: first page of a sample technical report, with its title, authors, authors' affiliation, abstract, and introduction labeled]

Approach
How about putting explicit markups in the document to indicate what a particular string of text is? For example, store the sample document in a plain text file with markups rather than in a Word file.
Sample document as a plain text file with markups:
<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>
<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>
“<title>” is a markup to indicate the start of the title string.
“</title>” is a markup to indicate the end of the title string.
“<authors>” is a markup to indicate the start of the authors string.
“</authors>” is a markup to indicate the end of the authors string.
What happened to the presentation format (for printing, etc.)? We will address this issue shortly; for the time being we will ignore the presentation format.

Approach
Once we have all documents as plain text with markups, it is (relatively!) easy to write a program that goes through all the documents and identifies those satisfying a user query. For example, to find all documents authored by “John”: go over all the documents and check for the string “John” between the start and end of the authors markup (that is, between <authors> and </authors>).
Summary
By inserting markups in a document, we identify its various components, also referred to as elements. The number and types of elements in a document are determined by the kinds of searches one needs to support. For example, if users are going to search for documents by specific authors, the document should have an element of type “authors”, and so on.
Please note that although this partitioning of the document into components was driven by the need for smart searches, it also creates an abstraction of the document from which other versions, such as one suitable for display, can be created.
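The search described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the enclosing <report> element, the file names, and the second report are assumptions made up for the example, and the standard-library xml.etree.ElementTree parser stands in for whatever program a real library would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection. The first entry reuses the sample report from the
# slides; the second is invented purely for illustration.
REPORTS = {
    "report1.xml": (
        "<report>"
        "<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>"
        "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>"
        "</report>"
    ),
    "report2.xml": (
        "<report>"
        "<title>A Report That Mentions Zubair</title>"
        "<authors>John Doe</authors>"
        "</report>"
    ),
}


def find_by_author(reports, name):
    """Return the ids of reports whose authors element contains `name`."""
    matches = []
    for doc_id, text in reports.items():
        root = ET.fromstring(text)
        # Look only between the start and end of the authors markup.
        authors = root.findtext("authors", default="")
        if name.lower() in authors.lower():
            matches.append(doc_id)
    return sorted(matches)


print(find_by_author(REPORTS, "John"))
print(find_by_author(REPORTS, "Zubair"))
```

Because the search looks only inside the authors element, the second report is not returned for “Zubair” even though that name appears in its title.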

Presentation
Sample document as a plain text file with markups:
<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>
<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>
…………………….
For formatting purposes, a separate set of instructions, collectively known as a stylesheet, tells the computer how to render an abstract document with markups. For example, in the formatted presentation we can make the content of the title element bold, large, and centered.
The computer converts the document from the abstraction (a structured document with markups) to a formatted rendition based on a given stylesheet. A document can have several stylesheets associated with it, one for each formatted presentation. By abstracting the document, we represent it in a form from which different rendered versions can be generated by applying different stylesheets.
NOTE: A rendered version of the document may itself use markups for formatting purposes. Markup languages can represent both abstractions and renditions.
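To make the idea of one abstraction with several renditions concrete, here is a minimal Python sketch. It is not a real stylesheet language such as XSLT or CSS: the two “stylesheets” are plain dictionaries mapping an element type to a formatting rule, and the names PLAIN_STYLE, HTML_STYLE, render, and the enclosing <report> element are all assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abstract document (the <report> wrapper is an assumption of this sketch).
DOC = (
    "<report>"
    "<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>"
    "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>"
    "</report>"
)

# Two hypothetical stylesheets: element type -> formatting rule.
PLAIN_STYLE = {
    "title": lambda text: text.upper(),
    "authors": lambda text: "by " + text,
}
HTML_STYLE = {
    "title": lambda text: "<h1>" + escape(text) + "</h1>",
    "authors": lambda text: "<p><i>" + escape(text) + "</i></p>",
}


def render(xml_text, stylesheet):
    """Produce one rendition of the abstract document using a stylesheet."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        rule = stylesheet.get(child.tag)
        if rule is not None:  # elements without a rule are not rendered
            lines.append(rule(child.text or ""))
    return "\n".join(lines)


print(render(DOC, PLAIN_STYLE))
print(render(DOC, HTML_STYLE))
```

The HTML rendition also illustrates the note above: the rendered version is itself marked up, but its markups describe formatting rather than the meaning of the text.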

Need for Rules
Assume that an organization that manages a huge collection of technical reports in electronic form has decided, based on its needs, to use the following element types: Title, Authors, Affiliation, and Abstract.
A typical abstraction of such a document will look like this:
<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>
<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>
<affiliation>Department of Computer Science, Old Dominion University</affiliation>
<abstract>NCSTRL is an unified …..</abstract>
What happens when a user creates a report that does not have all the elements properly filled in? For example, the user leaves out the authors by mistake. In that case the document will not be retrieved for queries that should have matched it.
There should be a means of validating the document before it is accepted by the computer. How does the computer decide whether a document is valid? For this we need a formal definition of the rules that describe the document type (in our example, a technical report).
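A rough Python sketch of such a check follows. It is a hand-written stand-in for real validation against a formal definition, not a DTD validator: the rule set is just a tuple of required element names, and the names REQUIRED, missing_elements, and the enclosing <report> element are assumptions of the example.

```python
import xml.etree.ElementTree as ET

# Element types the organization decided every technical report must carry.
REQUIRED = ("title", "authors", "affiliation", "abstract")

COMPLETE = (
    "<report>"
    "<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>"
    "<authors>K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang</authors>"
    "<affiliation>Department of Computer Science, Old Dominion University</affiliation>"
    "<abstract>NCSTRL is an unified ...</abstract>"
    "</report>"
)

# Same report, but the user left out the authors by mistake.
NO_AUTHORS = (
    "<report>"
    "<title>Issues with Designing Large Scale Libraries Based on NCSTRL</title>"
    "<affiliation>Department of Computer Science, Old Dominion University</affiliation>"
    "<abstract>NCSTRL is an unified ...</abstract>"
    "</report>"
)


def missing_elements(xml_text, required=REQUIRED):
    """Return the required element types that are absent or empty."""
    root = ET.fromstring(xml_text)
    missing = []
    for name in required:
        text = root.findtext(name)
        if text is None or not text.strip():
            missing.append(name)
    return missing


print(missing_elements(COMPLETE))
print(missing_elements(NO_AUTHORS))
```

A document would be accepted into the collection only when the returned list is empty.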

Document Type
Different communities have different types of documents with different abstractions. For example, lawyers deal with documents that are different from the ones scientists work on.
Sample markups used in a court document: Attorney for the Plaintiff Margret Marly Jefferson Margret Marly Jefferson New Mexico 1982 Active

Document Type
There is a need to define what it means for each type of document to be valid. The rules describing the validity of a court document will be different from the rules describing the validity of a technical report. The formal definition that describes these rules for each type is called a document model, or Document Type Definition (DTD).
Summary
Different communities will develop their own vocabulary (set of element types) and will describe constraints on document formation in terms of these element types.

Summary
Inserting markups helps in abstracting the document, that is, giving it an abstract structure, which makes it feasible to process documents electronically when handling user queries.
An element is one of the various components of a document.
Separating abstract structure from presentation makes it possible to support several different presentation formats for the same document.
Different communities require different sets of markups to describe the elements of their type of document.

What is XML?
Extensible Markup Language (XML) allows a community to define its own markups to describe the various elements of a document, along with constraints on the use of these markups.
A document with markups that follows the XML guidelines is referred to as an XML document.
A markup (tag) describes an element of a document.
A Document Type Definition (DTD) is a file that contains a set of rules for using XML to represent documents of a particular type.
A stylesheet is a file that contains a set of rules for generating formatted renditions from the XML abstractions.
XML is a subset of SGML (a much older standard); the main difference is that a DTD is not required for an XML document, but you can use one if you have it.

Java and XML
Together they form a complete portable solution. XML: data portability. Java: code portability.
Example: a business-to-business application that provides Internet support for the suppliers of a large retailer.
Data movement between the supplier and the retailer can be handled using XML, which makes the data machine-processable and easy to validate. XML also allows the use of a standard XML parser, which would not be possible with proprietary data formats.
The code developed in Java to manage XML data can be moved to different platforms, is Internet enabled, and supports development of internationalized applications (Unicode support).

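Simple Example
Original document:
Hello World
Original document with semantic markups:
<greeting> <salutation>Hello</salutation> <addressee>World</addressee> </greeting>
DTD describing the rules for the markups used:
<!ELEMENT greeting (salutation, addressee)> // element greeting contains salutation and addressee
<!ELEMENT salutation (#PCDATA)> // element salutation contains character data
<!ELEMENT addressee (#PCDATA)> // element addressee contains character data
NOTE: The syntax for describing DTDs will be covered later.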
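As a small illustration of machine processing, the greeting document can be read with Python's standard-library parser. This sketch assumes the three DTD rules are carried inside the document as an internal subset; note that xml.etree.ElementTree is a non-validating parser, so it reads past the declarations without enforcing them.

```python
import xml.etree.ElementTree as ET

# Greeting document with the three DTD rules embedded as an internal subset
# (embedding them this way is an assumption of the sketch).
GREETING = """<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (salutation, addressee)>
  <!ELEMENT salutation (#PCDATA)>
  <!ELEMENT addressee (#PCDATA)>
]>
<greeting>
  <salutation>Hello</salutation>
  <addressee>World</addressee>
</greeting>"""

root = ET.fromstring(GREETING)

# Each child element comes back with its tag name and character data.
children = [(child.tag, child.text) for child in root]
print(root.tag, children)
```

Enforcing the rules themselves requires a validating parser, which the Python standard library does not provide.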

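XML Document for the Simple Example
<greeting> <salutation>Hello</salutation> <addressee>World</addressee> </greeting>
The greeting.dtd file contains the rules for the markups. The XML document is an instance of the document class identified by the DTD (greeting.dtd). The prolog identifies the version of XML and the document type to which the XML document conforms.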
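The role of the prolog can be inspected programmatically. The sketch below assumes a prolog of the usual form (an XML declaration followed by a DOCTYPE declaration that points at greeting.dtd); the exact prolog on the original slide was not preserved, so treat the text of DOCUMENT as an illustration. It uses xml.dom.minidom from the Python standard library, which records the document type declaration but neither fetches the DTD file nor validates against it.

```python
from xml.dom import minidom

# Assumed prolog: XML declaration plus a DOCTYPE naming the external DTD.
DOCUMENT = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE greeting SYSTEM "greeting.dtd">'
    "<greeting>"
    "<salutation>Hello</salutation>"
    "<addressee>World</addressee>"
    "</greeting>"
)

doc = minidom.parseString(DOCUMENT)
doctype = doc.doctype

# The document type named in the prolog and the DTD file it refers to.
print(doctype.name, doctype.systemId)

# The root element of the instance matches the declared document type.
print(doc.documentElement.tagName)
```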

Another Example
XML document:
Issues with Designing Large Scale Libraries Based on NCSTRL
K. Maly, M. Zubair, H. Anan, D. Tan, and Y. Zchang
Department of Computer Science, Old Dominion University
NCSTRL is an unified …..
Contents of report.dtd file

Well Formed XML Documents
Every start tag has a counterpart end tag, and all elements compose a tree.
Example (use IE to view the file):
SMITH AUBREY HILLSIDE DR PO BOX Old Town
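Both conditions can be checked by handing the text to any XML parser, since a parser must reject input that is not well formed. The Python sketch below uses the standard library; the element names (address, name, street, city, b) are hypothetical, because the tags of the original example were not preserved.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Every start tag has an end tag and the elements nest as a tree.
GOOD = (
    "<address><name>SMITH AUBREY</name>"
    "<street>HILLSIDE DR</street><city>Old Town</city></address>"
)
# The name element is never closed.
MISSING_END = "<address><name>SMITH AUBREY<street>HILLSIDE DR</street></address>"
# The name and b elements overlap, so they do not form a tree.
OVERLAP = "<address><name>SMITH <b>AUBREY</name></b></address>"

for sample in (GOOD, MISSING_END, OVERLAP):
    print(is_well_formed(sample))
```

Well-formedness is a weaker property than validity: a well-formed document satisfies the general XML syntax rules, while a valid document additionally obeys the rules of its DTD.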

Conclusion XML technology is playing a central role in building digital libraries.