1
Markup for Statisticians: An Introduction to Alphabet Soup
2
WWW: At the end of the 1980’s the world wide web (WWW) came into being. For documentation on most projects that have an impact on the WWW, look at www.w3.org. A major factor in its success was the notion of a markup language.
3
WWW: The technology hurdle that this overcame was the separation of content from presentation. A web browser is responsible for understanding and rendering the content in a web page.
4
WWW: That content is marked up using HTML (or a relative). In IE5, under the View menu, you will find an option for Source; select a web page and view the source.
5
WWW: This is very nice: now everyone can view your page using any browser. All the browser has to do is understand and implement a number of HTML directives. Notions such as linking (directing people to another place by a click) are easily implemented in this framework.
6
WWW: NCSA puts out a document entitled A Beginner’s Guide to HTML; this is one of the better guides I have seen. There are many books (most much longer than they need to be). O’Reilly’s HTML Pocket Reference seems pretty useful.
7
WWW: How does it work? Your web browser opens a special type of connection (usually an http connection) to another computer and, through that protocol, asks for the information on a particular web page. Other types of connections, such as ftp, are also generally supported.
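To make "asks for the information on a particular web page" concrete, here is a minimal sketch in Python of the text a client sends over an http connection. The helper name build_get_request is made up for illustration, and real browsers send many more headers than this.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the text of a minimal HTTP/1.0 GET request for url.

    Hypothetical helper for illustration; it only formats the request
    and does not open a network connection.
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path means the site's root page
    # Request line, then the Host header, then a blank line ending the headers.
    return f"GET {path} HTTP/1.0\r\nHost: {parts.netloc}\r\n\r\n"


request = build_get_request("http://www.w3.org/")
print(request)
```

The server answers with a status line, some headers, and then the page itself, which is what the browser goes on to render.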
8
WWW: Now we have solved the problem of how to put content onto your computer. How do we solve the problem of providing programs or applications to perform some computations? This is where Java came in: Java is a language that has a strong security model.
9
WWW: Java. Java applets can be secured, in the sense that you can determine before you run them that they will do nothing harmful to your computer. If you could not ensure that, you would be ill advised to run an applet. This is why there are no C or C++ applets: they can be written, but no one should be silly enough to run one.
10
WWW: While all web browsers use http as their basic means of transferring data, other programs can also use http. Now the web is full of information about all sorts of topics; how do we begin to make sense of that information?
11
WWW: HTML has a severe limitation, which became apparent when search engines were first being developed. The problem is that there is no way to indicate the meaning of any of the information. For example, consider the tags that you have available for a table.
12
WWW: Table tags. The table tags are <table>, <caption>, <tr>, <th> and <td>, plus a few more in HTML 4.0. Except by convention, there is no way to indicate the content of the table. But tables often contain data, and data that we want to use; without information on content it is hard to use the data programmatically.
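A small Python sketch of what "by convention" means in practice. The page text and the TableReader class are invented for illustration; the program can recover rows and cells, but nothing in the markup says that the first row holds column names or what units the numbers are in.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of each td/th cell, one list per tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up page: the markup describes layout only, not meaning.
page = """
<table>
  <tr><th>height</th><th>weight</th></tr>
  <tr><td>24</td><td>7</td></tr>
</table>
"""

reader = TableReader()
reader.feed(page)
print(reader.rows)
```

Treating the first row as a header is a guess that the reader of the page makes, not something the HTML states.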
13
WWW: We want to have smart programs. There is no sense in having people find and manipulate data; if it is on the web, it would be nice if it were in a format that a program could deal with. The more we can automate, the more we can do.
14
WWW and R: Open R and look at the manual page for connections; look at URL connections. We want to open a connection to Leo Breiman’s home page: bhp <- url("http://oz.berkeley.edu/users/breiman/", open="r"); bhp.content <- readLines(bhp)
15
WWW and R: Now look at what bhp.content contains. Dr. Breiman has also put up a data set at http://oz.berkeley.edu/users/breiman/glass6.dat. Open a url connection to this page and read the data. What does it look like? What would we like to do with it?
16
WWW and R: We would probably like to put it into a data frame. We would also like to know what the data mean. There is no way to do that with HTML except by convention, and even then we have to parse the data; writing parsers is complicated.
17
WWW and XML: The eXtensible Markup Language (XML) is intended to provide the missing functionality. It comes with a number of additional tools: XSL, XSLT, XPointer, XLink and XPath. XML is a simplified form of SGML.
18
XML: XML is becoming the standard for data transfer. It is also becoming popular for tasks like remote procedure calls and for communicating between cooperating computing languages via SOAP.
19
XML: With XML you can define your own tags. To give them meaning you use a Document Type Definition (DTD). The DTD specifies which tags are valid, which attributes a tag can have, and the order (or nesting) requirements.
20
XML: In XML every opening tag must have a corresponding closing tag, and any other tags opened after an opening tag must be closed before its closing tag. This ensures proper nesting of the XML tags and makes it possible to parse the documents easily.
21
XML: An element consists of two tags, an opening tag and a closing tag; the word orange enclosed between a matching pair of tags is an element. Any text between the tags is considered to be part of the element and is formatted according to the rules for that element.
22
XML: Elements can have attributes, for example a height element whose content is 24 and whose units attribute names the unit of measurement. Notice that under these circumstances it is reasonably easy to extract all the heights from an XML document (and to get the units right!). Attribute values must be contained inside quotation marks, either double or single.
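As a rough illustration of how easy that extraction becomes, here is a Python sketch using the standard library's ElementTree. The document, the element names people, person and name, and the values other than 24 are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document: each height carries its own units as an attribute.
doc = """<people>
  <person><name>A</name><height units="inches">24</height></person>
  <person><name>B</name><height units='cm'>70</height></person>
</people>"""

root = ET.fromstring(doc)

# Walk every height element, wherever it sits in the tree,
# and keep the number together with its units.
heights = [(float(h.text), h.get("units")) for h in root.iter("height")]
print(heights)
```

Note that the second height quotes its attribute value with single quotes, which is just as acceptable as double quotes.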
23
XML: A non-empty element must have both an opening and a closing tag. An empty element might be there as a placeholder or to provide its attributes. For an empty element a separate closing tag is not required, but a / must be placed before the closing >.
24
XML: Tags must be nested correctly, so the following is not allowed: <foo><bar>that’s all folks</foo></bar>. Since bar is the second tag opened, it must be the first one to close. An XML document that adheres to these rules is said to be well-formed.
25
XML: Well-formed XML documents can be parsed using standard methods. A second concept that can be applied to XML documents is validity: an XML document is said to be valid if it conforms to its DTD. XML documents can be well-formed but not valid.
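A minimal Python sketch of the well-formedness check, using the standard library parser. The tag names foo and bar and the helper is_well_formed are illustrative only. ElementTree does not read a DTD, so this tests well-formedness and says nothing about validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<foo><bar>that's all folks</bar></foo>"  # bar opened last, closed first
bad = "<foo><bar>that's all folks</foo></bar>"   # closing tags in the wrong order

print(is_well_formed(good), is_well_formed(bad))
```

A validating parser would go one step further and compare the element structure against the DTD.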
26
XML: XML documents can be useful even when there is no DTD. In other situations (e.g. my system for documenting clinical trials) the use of a DTD to ensure validity is necessary. Recently the DTD specification has been extended; the new method is called XML Schema and is more flexible than a DTD.
27
XML: PI (processing instructions). A PI tells an application to carry out a specific task. A PI is not part of the rendered document; rather, it is an instruction to either the XML parser or an application that uses the resultant document.
28
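XML: PI’s are of the form <?target instruction?>. An example of a PI: <?xml version="1.0" standalone="no"?>. This PI is included as the first line in almost all XML documents; it indicates the version, and standalone="no" indicates that the document relies on declarations held outside it, such as an external DTD.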
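Here is a hedged Python sketch of how an application might pick processing instructions out of a parsed document, using the standard library's minidom. The PI target report-tool and its content page-break are invented for the example. The parser consumes the leading xml line itself, so only the second instruction shows up as a node.

```python
from xml.dom import minidom

# Made-up document with one application-level processing instruction.
doc = minidom.parseString(
    '<?xml version="1.0" standalone="no"?>'
    '<?report-tool page-break?>'
    '<report>text</report>'
)

# Keep the (target, data) pair of every PI that sits at the top level.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(pis)
```

The instruction is not part of the report element's content; what to do with it is left entirely to the application.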
29
XML: Namespaces. We need some means of limiting the scope of the definition of a tag. Suppose we have combined two DTD’s in a single XML document (this is both legal and useful), and suppose that both DTD’s define a tag named leg, except that in one it stands for a person’s leg
30
XML: and in the other the leg of a chair. We wouldn’t want to mix those up. Namespaces can be used to ensure that tags from one DTD do not get confused with tags from another. Namespaces really don’t do anything, though; they are simply macro substitutions.
31
XML: Namespaces should be unique; it is common to use a URI (which need not exist), such as www.rgentleman.org. Once a prefix has been bound to that URI, tags can use the prefix, and this is the equivalent of prepending the namespace string to the tag.
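The "prepending" description can be seen directly in Python's ElementTree, which reports each tag with its namespace URI attached. In this sketch the prefixes p and f, the element scene, and the two URIs under www.rgentleman.org are all made up; as noted, the URIs need not exist.

```python
import xml.etree.ElementTree as ET

# Made-up document mixing two vocabularies that both define a leg tag.
doc = """<scene xmlns:p="http://www.rgentleman.org/person"
       xmlns:f="http://www.rgentleman.org/furniture">
  <p:leg>left</p:leg>
  <f:leg>front right</f:leg>
</scene>"""

root = ET.fromstring(doc)

# The parser replaces each prefix with the full namespace string.
tags = [child.tag for child in root]
print(tags)
```

The two leg tags now have different full names, so a program cannot confuse a person's leg with a chair's.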
32
XSL: eXtensible Stylesheet Language. This has not yet been completely formed (but should be soon). A style sheet describes how the XML document should be transformed to provide the rendered output. You can have multiple style sheets for any XML document.
33
XSL: This means that you can have different versions of the document depending on whether the output is a Web page, a pdf document, input for another processing step, and so on. XSL (through XSLT) provides a means of rendering the data in an XML document.
34
XPath: An XML document has a tree structure: there is a root node, and below that there can be many more nodes. For XSLT (and XPointer) to work well they need to be able to reference different elements within the document; they do this via XPath.
35
XPath: A simple example: *[not(self::FOO:Bar)] is an XPath statement that refers to all children of the current node whose name (the tag) is not FOO:Bar. You can refer to parent nodes, children, grandparents and so on.
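To show what that statement selects, here is a Python sketch. ElementTree in the standard library implements only a subset of XPath and does not handle not(self::...), so the same selection is written as an ordinary filter over the children. The document, the element names Root, Baz and Qux, and the URI bound to the FOO prefix are invented.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/foo"  # hypothetical URI bound to the FOO prefix

doc = f"""<FOO:Root xmlns:FOO="{NS}">
  <FOO:Bar/>
  <FOO:Baz/>
  <FOO:Qux/>
</FOO:Root>"""

root = ET.fromstring(doc)  # the root plays the role of the current node

# Same selection as *[not(self::FOO:Bar)]:
# every child element whose full name is not FOO:Bar.
others = [child for child in root if child.tag != f"{{{NS}}}Bar"]

# Strip the namespace part to show the local names.
names = [child.tag.split("}")[1] for child in others]
print(names)
```

A full XPath engine would accept the expression as written; the filter here only mimics its result.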
36
XLL: eXtensible Linking Language. Another part of the XML family is the set of mechanisms for linking different documents and portions of documents. XLink and XPointer are the two mechanisms used to carry out the linking (similar to what goes on in a web page, but with more control).
37
XLink: A link is only an assertion of a relationship between pieces of a document (or documents). How that link is presented to the user depends on many things and can be quite different in different settings. XML ID’s are used to provide unique labels for XLink to link to.
38
XPointer: ID’s give you a flexible way to link to parts of the same document. When you want to link to other documents, you need XPointer. The syntax is pretty complex.
39
Literate Programming: Literate programming is an idea that originated with Don Knuth. He wanted a system that allowed him to mix text and code in a more natural way, so that documentation could be read easily by humans.
40
Literate Programming: To make the code runnable, the code segments are extracted and placed in a separate file. The development version of R includes (and a separate library will soon provide) a version of literate programming for the R language; it is called Sweave.
41
Sweave: The idea is to produce a LaTeX-like document that has a mix of LaTeX and R code. This document is passed through an S engine, and the code may be replaced by the output that it generates (including graphics).
42
Sweave: This allows you to easily update reports when the data change. It also allows you to document the code together with the report that the code is used to write. See the Sweave User Manual that is also provided for today’s lecture.
43
Sweave: A second but important use for Sweave is to document R packages. Using Sweave we can produce files that contain examples of analyses. The Tangle facility allows us to extract the code segments into separate files and to run them.
44
STangle: Tangling is sort of the opposite of weaving: it separates the components. For R/S packages the text portion is generally not of interest. The code portions allow us to ensure that the program is still functioning as we expect. It allows us to put much more complex examples into our code.
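The idea behind tangling can be sketched in a few lines of Python. This is not how Sweave's own Tangle facility is implemented; it is a toy that assumes the noweb convention Sweave uses, where a line of the form <<name>>= opens a code chunk and a line starting with @ returns to text. The function name tangle and the sample document are made up.

```python
import re

# A chunk header looks like <<name>>= on a line of its own.
CHUNK_START = re.compile(r"^<<(.*)>>=\s*$")


def tangle(lines):
    """Return only the code lines from a noweb-style document."""
    code, in_code = [], False
    for line in lines:
        if CHUNK_START.match(line):
            in_code = True        # a code chunk begins after this line
        elif line.startswith("@"):
            in_code = False       # back to the text portion
        elif in_code:
            code.append(line)
    return code


# Made-up source mixing LaTeX text with one R code chunk.
source = [
    r"\section{Summary}",
    "The mean of the data is computed below.",
    "<<mean>>=",
    "x <- c(1, 2, 3)",
    "mean(x)",
    "@",
    "That concludes the report.",
]

print(tangle(source))
```

Writing the returned lines to a file gives a plain R script that can be run on its own, which is what lets the examples double as a check that the code still works.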
45
Sweave: Once this becomes a stable part of R, I anticipate that most of you will find it a very useful device for doing homework assignments and data analyses.