Web basics
  HTTP
  URI/L/Ns
  HTML
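HTTP operation
  [Figure: basic operation (top) vs. operation with intermediaries. Basic: the User Agent sends a request to the Origin Server and receives a response. With intermediaries: a request chain runs from the User Agent to the Origin Server and a response chain runs back.]
  Intermediaries: proxies, gateways, tunnels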
HTTP Terminology
  User Agent (UA): program acting on behalf of a user.
  Resource: data object or service identified by a URI.
  Origin server (OS): server on which a resource originates.
  Connection: transport session initiated by the UA (but not always directly to the OS). Typically TCP or SSL.
HTTP Terminology
  Message: formatted sequence of bytes
    – Request: from client to server
    – Response: from server to client
  Message = start line + headers + body
Request and response messages
  Request:
    GET /index.html HTTP/1.1
    Host:
    User-Agent: Mozilla
  Response:
    HTTP/ OK
    Content-Length: 45
    Content-Language: en-us
    Content-Type: text/html

    Hello world
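
The "start line + headers + body" layout can be illustrated with a small parser. This is only a sketch: `parse_message` is a hypothetical helper, and the sample response (status line, lengths, body) is made up for illustration rather than taken from the slide. It ignores details such as folded headers and repeated header names.

```python
def parse_message(raw: bytes):
    """Split an HTTP message into (start_line, headers, body).

    Simplified sketch: no folded headers, and a repeated header name
    overwrites the earlier value.
    """
    # The blank line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return start_line, headers, body


# Illustrative response; every value here is an assumption.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 11\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"Hello world"
)

start_line, headers, body = parse_message(raw_response)
print(start_line)
print(headers)
print(body)
```

The same function works for requests, since both message types share the layout and differ only in the start line.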
Requests
  GET, HEAD, POST
  PUT, DELETE
  OPTIONS, TRACE, CONNECT
Common request headers
  Host (required), User-Agent
  Referer
  Authorization
  If-Modified-Since, Cache-Control
  Accept[-Language/-Charset/-Encoding]
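
As a rough illustration of how several of these headers combine, the sketch below builds the header set for a conditional GET. The function name, the crawler's User-Agent string, and the host are all hypothetical; only the header names come from the slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_get_headers(host, last_fetch, user_agent="example-crawler/0.1"):
    """Build headers for a GET that asks the server to skip unchanged content.

    `last_fetch` must be a timezone-aware datetime in UTC.
    """
    return {
        "Host": host,                      # required in HTTP/1.1
        "User-Agent": user_agent,          # assumed crawler name
        # HTTP dates use the fixed "Sun, 06 Nov 1994 08:49:37 GMT" form.
        "If-Modified-Since": format_datetime(last_fetch, usegmt=True),
        "Accept-Encoding": "gzip",
    }


headers = conditional_get_headers(
    "example.com",
    datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc),
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

A server that honors If-Modified-Since can answer 304 Not Modified with no body, which saves a crawler from refetching unchanged pages.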
Common response codes
  200 OK
  301 Moved Permanently, 307 Temporary Redirect
  400 Bad Request
  401 Unauthorized, 403 Forbidden
  404 Not Found
  500 Internal Server Error
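
The first digit of a status code gives its class, which is often all a client needs to decide what to do next. A minimal sketch using Python's standard `http.HTTPStatus` table (`status_class` is a hypothetical helper name):

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a status code to its class using the leading digit."""
    return CLASSES.get(code // 100, "unknown")


for code in (200, 301, 307, 400, 401, 403, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```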
Common response headers
  Content-Type, Content-Length, Content-Language
  Date, Last-Modified, Expires
  Location [for 3xx responses]
  Server
Response generation
  Theory (top) vs. practice
  Theory: Resource → Variant → Instance → Entity → Message
    Steps along the chain, in order:
    – Selection (negotiation, UA optimization)
    – Content encoding (gzip)
    – Instance manipulations (range, delta)
    – Transfer encoding (chunking, encryption)
  Practice: Resource → Variant/Instance → Message
    – Selection (UA optimization)
  Understanding the full model is necessary for a good understanding of caching, but we are going to ignore caching.
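
Two of the steps above, content encoding and transfer encoding, can be sketched in a few lines. The helper names are hypothetical, and the chunk codec is deliberately simplified: it ignores chunk extensions and trailers. The point is only that gzip changes the bytes of the content itself, while chunking just frames whatever bytes it is handed for the wire.

```python
import gzip


def chunk_encode(data: bytes, chunk_size: int = 16) -> bytes:
    """Frame data as chunks: hex size, CRLF, data, CRLF; a zero-size chunk ends it."""
    out = bytearray()
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def chunk_decode(wire: bytes) -> bytes:
    """Inverse of chunk_encode (no chunk extensions, no trailers)."""
    out = bytearray()
    pos = 0
    while True:
        eol = wire.index(b"\r\n", pos)
        size = int(wire[pos:eol], 16)
        if size == 0:
            break
        start = eol + 2
        out += wire[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(out)


content = b"<html><body>Hello world</body></html>"  # illustrative content
encoded = gzip.compress(content)   # content encoding
wire = chunk_encode(encoded)       # transfer encoding
assert gzip.decompress(chunk_decode(wire)) == content
```

The receiver undoes the steps in reverse order: remove the transfer encoding first, then the content encoding.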
Cookies
  Not part of official HTTP spec, but see: – –
  Adding state to “stateless” protocol
  OS adds Set-Cookie header to response:
    – Set-Cookie: sid=113a8fbc;version=1;path=/
  UA adds Cookie header to future requests:
    – Cookie: sid=113a8fbc;$version=1;$path=/
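
The round trip can be sketched with Python's standard `http.cookies` module. This is an illustration only: `cookie_header` is a hypothetical helper that echoes the `$version`/`$path` style shown on the slide, and it does not decide whether a cookie applies to a given request (domain and path matching are left out).

```python
from http.cookies import SimpleCookie

# Value of the Set-Cookie header sent by the origin server (from the slide).
set_cookie_value = "sid=113a8fbc;version=1;path=/"

jar = SimpleCookie()
jar.load(set_cookie_value)
morsel = jar["sid"]  # one cookie plus its attributes


def cookie_header(morsel) -> str:
    """Build the Cookie header value the UA sends on later requests."""
    return (
        f"{morsel.key}={morsel.value}"
        f";$version={morsel['version']}"
        f";$path={morsel['path']}"
    )


print("Cookie:", cookie_header(morsel))
```

The server never has to remember anything between requests itself: the state travels in the `sid` value that the UA returns each time.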
URI/L/N
  Universal Resource…
    – Name: a persistent identifier (under development)
    – Locator: (perhaps transient) locator information. Typically: address plus access method
    – Identifier: either a URN or URL
  RFC 2396 provides syntactic rules that all URIs must obey
HTTP URLs
    – “Fragments” are not strictly part of URLs
    – Relative URIs
  Canonicalization
    – Aggressively avoid false distinctions
    – But always keep a working URL
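
A conservative canonicalizer along these lines might look like the sketch below. The specific rules are assumptions chosen for illustration, not a prescribed list: resolve relative references, lowercase the scheme and host, drop the default port and the fragment, and use "/" for an empty path. It leaves path case and the query string alone, since changing those could produce a URL that no longer works, and it discards any userinfo.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url, base=None):
    """Return a canonical form of url, resolving it against base if given."""
    if base is not None:
        url = urljoin(base, url)           # resolve relative URIs
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # host names are case-insensitive
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host                      # default port is a false distinction
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Fragments are handled by the UA, not sent to the server, so drop them.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("HTTP://Example.COM:80/a/b.html#sec"))
print(canonicalize("../c.html", base="http://example.com/a/b.html"))
```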
HTML
  Do a bit of review on the way frames and JavaScript work
Problems for Archiving
  Links obscured by increasing use of Flash, JavaScript, DHTML, PDF, Word, …
  Soft 404s, 30x’s (Big pain!!)
    – Great example of non-cooperation
  Browser-specific content
  Servers lie about content
    – E.g., incorrect or missing Content-Type
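
One common heuristic for spotting soft 404s (pages that say "not found" but return 200) is to probe the same host with a path that almost certainly does not exist and compare the result. The sketch below is a rough illustration under several assumptions: `fetch` is a hypothetical callable supplied by the caller that returns `(status_code, body_text)`, and the 0.9 similarity threshold is an arbitrary choice.

```python
import uuid
from difflib import SequenceMatcher
from urllib.parse import urljoin


def looks_like_soft_404(url, fetch, threshold=0.9):
    """Guess whether a 200 response for url is really an error page.

    fetch(url) -> (status_code, body_text) is provided by the caller.
    """
    status, body = fetch(url)
    if status != 200:
        return False  # an honest error or redirect code, not a soft 404
    # Ask the same host for a random path that should not exist.
    probe_url = urljoin(url, "/" + uuid.uuid4().hex)
    probe_status, probe_body = fetch(probe_url)
    if probe_status != 200:
        return False  # the server reports missing pages properly
    # If the page closely resembles the error page, treat it as a soft 404.
    return SequenceMatcher(None, body, probe_body).ratio() >= threshold
```

The cost is one extra request per host (the probe result can be cached), and the heuristic can misfire on sites whose pages all look alike.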
Problems for Archiving
  Aliasing
    – Material is copied
    – Host has multiple names ( and foo.com typically the same)
    – Resource has multiple names (e.g., case-insensitivity)
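
A simple way to surface all three kinds of aliasing after the fact is to group fetched pages by a digest of their content. The sketch below assumes exact byte-for-byte copies; near-duplicates (for example, pages that differ only in a timestamp) would need a fuzzier comparison. `group_aliases` and the sample URLs are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_aliases(pages):
    """Group URLs whose bodies are byte-identical.

    pages: iterable of (url, body_bytes) pairs.
    Returns a list of URL lists, one per body seen more than once.
    """
    by_digest = defaultdict(list)
    for url, body in pages:
        by_digest[hashlib.sha1(body).hexdigest()].append(url)
    return [urls for urls in by_digest.values() if len(urls) > 1]


# Illustrative data; the hosts and bodies are made up.
pages = [
    ("http://foo.example/index.html", b"<p>same page</p>"),
    ("http://foo.example/INDEX.HTML", b"<p>same page</p>"),
    ("http://bar.example/other.html", b"<p>different page</p>"),
]
print(group_aliases(pages))
```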
Problems for Archiving
  And this ignores spamming!