Basic WWW Technologies 2.1 Web Documents. 2.2 Resource Identifiers: URI, URL, and URN. 2.3 Protocols. 2.4 Log Files. 2.5 Search Engines.


3 2 What Is the World Wide Web? The World Wide Web (the Web) is a network of information resources. The Web relies on three mechanisms to make these resources readily available to the widest possible audience: 1. A uniform naming scheme for locating resources on the Web (e.g., URIs). 2. Protocols for accessing named resources over the Web (e.g., HTTP). 3. Hypertext, for easy navigation among resources (e.g., HTML).

4 3 Internet vs. Web Internet: the more general term; includes the physical aspects of the underlying networks as well as mechanisms such as email, FTP, and HTTP. Web: associated with the information stored on the Internet; also refers to a broader class of networks, e.g., the web of English literature. Both the Internet and the Web are networks.

5 4 Essential Components of WWW Resources: conceptual mappings to concrete or abstract entities that do not change in the short term, e.g., the ICS website (web pages and other kinds of files). Resource identifiers (hyperlinks): strings of characters representing generalized addresses that may contain instructions for accessing the identified resource; e.g., http://www.ics.uci.edu identifies the ICS homepage. Transfer protocols: conventions that regulate the communication between a browser (web user agent) and a server.

6 5 Standard Generalized Markup Language (SGML) Based on GML (Generalized Markup Language), developed by IBM in the 1960s. An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document. Gave birth to the Extensible Markup Language (XML), a W3C Recommendation in 1998.

7 6 SGML Components SGML documents have three parts: Declaration: specifies which characters and delimiters may appear in the application. DTD / style sheet: defines the syntax of markup constructs. Document instance: the actual text (with the tags) of the document. More information can be found at http://www.w3.org/MarkUp/SGML

8 7 DTD Example One ELEMENT is a keyword that introduces a new element type, here the unordered list (UL); the slide's declaration was presumably something like <!ELEMENT UL - - (LI)+>. The two hyphens indicate that both the start tag and the end tag for this element type are required. Any text between the two tags is treated as a list item (LI).

9 8 DTD Example Two The element type being declared is IMG; the slide's declaration was presumably something like <!ELEMENT IMG - O EMPTY>. The hyphen and the following "O" indicate that the end tag can be omitted. Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted (no closing tag).

10 9 HTML Background HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA. The Web depends on Web page authors and vendors sharing the same conventions for HTML; this has motivated joint work on specifications for HTML. HTML standards are maintained by the W3C: http://www.w3.org/MarkUp/

11 10 HTML Functionalities HTML gives authors the means to: Publish online documents with headings, text, tables, lists, photos, etc., and include spreadsheets, video clips, sound clips, and other applications directly in their documents. Link information via hypertext links, at the click of a button. Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.

12 11 HTML Versions HTML 4.01 is a revision of the HTML 4.0 Recommendation. HTML 4.01 Specification: http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt HTML 4.0 was first released as a W3C Recommendation on 18 December 1997. HTML 3.2 was W3C's first Recommendation for HTML, representing the consensus on HTML features in 1996. HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994.

13 12 Sample Webpage

14 13 Sample Webpage HTML Structure (figure: the sample page's HTML source, annotated with the title of the webpage and the body of the webpage)

15 14 HTML Structure An HTML document is divided into a head section (between <head> and </head>) and a body (between <body> and </body>). The title of the document appears in the head (along with other information about the document). The content of the document appears in the body; the body in this example contains just one paragraph, marked up with <p>.

16 15 HTML Hyperlink A link is a connection from one Web resource to another (the slide's example is an anchor whose link text is "alumni"). It has two ends, called anchors, and a direction: it starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document).

17 16 Resource Identifiers URI: Uniform Resource Identifiers URL: Uniform Resource Locators URN: Uniform Resource Names

18 17 Introduction to URIs Every resource available on the Web has an address that may be encoded by a URI URIs typically consist of three pieces: The naming scheme of the mechanism used to access the resource. (HTTP, FTP) The name of the machine hosting the resource The name of the resource itself, given as a path

19 18 URI Example http://www.w3.org/TR There is a document available via the HTTP protocol, residing on the machine hosting www.w3.org, accessible via the path "/TR".
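The three pieces can be pulled apart programmatically; a minimal sketch using Python's standard urllib.parse on the slide's URL:

```python
from urllib.parse import urlparse

# Split a URI into its three pieces: scheme, host, and path.
parts = urlparse("http://www.w3.org/TR")
print(parts.scheme)  # 'http'       - the access mechanism
print(parts.netloc)  # 'www.w3.org' - the machine hosting the resource
print(parts.path)    # '/TR'        - the name of the resource itself
```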

20 19 Protocols Describe how messages are encoded and exchanged Different Layering Architectures ISO OSI 7-Layer Architecture TCP/IP 4-Layer Architecture

21 20 ISO OSI Layering Architecture

22 21 ISO’s Design Principles A layer should be created where a different level of abstraction is needed Each layer should perform a well-defined function The layer boundaries should be chosen to minimize information flow across the interfaces The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy

23 22 TCP/IP Layering Architecture

24 23 TCP/IP Layering Architecture A simplified model that provides the end-to-end reliable connection. The network layer: hosts drop packets into this layer, and the layer routes each packet towards its destination; it promises only best-effort ("try my best") delivery. The transport layer: a reliable byte-oriented stream.

25 24 Hypertext Transfer Protocol (HTTP) A connection-oriented protocol, carried over TCP, used to carry WWW traffic between a browser and a server. An application-level protocol of the Internet protocol suite (it runs on top of the transport layer rather than being one itself). HTTP communication is established via a TCP connection to server port 80.

26 25 GET Method in HTTP
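The transcript does not reproduce this slide's figure. As a hedged sketch, a GET request sent over the TCP connection is just a few lines of ASCII text (the host and path below are illustrative placeholders, not taken from the figure):

```python
# Raw HTTP/1.0 GET request bytes as they travel over the TCP connection
# to server port 80. Host and path are illustrative placeholders.
host, path = "www.ics.uci.edu", "/"
request = (
    f"GET {path} HTTP/1.0\r\n"
    f"Host: {host}\r\n"
    "\r\n"  # blank line ends the request headers
)
# To actually send it (kept commented out to avoid network access):
# import socket
# sock = socket.create_connection((host, 80))
# sock.sendall(request.encode("ascii"))
```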

27 26 Domain Name System DNS (Domain Name System): the mapping from domain names to IP addresses. IPv4: initially deployed January 1, 1983, and still the most commonly used version; a 32-bit address, written as four decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255. IPv6: a revision of IPv4 with 128-bit addresses.
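A dotted-quad address is just a rendering of one 32-bit number; a small sketch of the conversion (my own helper, not from the slides):

```python
def ipv4_to_int(addr):
    """Pack a dotted-quad IPv4 address into the underlying 32-bit integer."""
    parts = [int(p) for p in addr.split(".")]
    if len(parts) != 4 or not all(0 <= p <= 255 for p in parts):
        raise ValueError("not a dotted-quad IPv4 address")
    n = 0
    for p in parts:
        n = (n << 8) | p  # each of the four numbers contributes 8 bits
    return n

print(ipv4_to_int("0.0.0.0"))          # 0, the smallest address
print(ipv4_to_int("255.255.255.255"))  # 4294967295 = 2**32 - 1, the largest
```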

28 27 Top Level Domains (TLD) Top-level domain names: .com, .edu, .gov, and the ISO 3166 country codes. There are three types of top-level domains: Generic domains were created for use by the Internet public. Country code domains were created to be used by individual countries. The .arpa (Address and Routing Parameter Area) domain is designated to be used exclusively for Internet-infrastructure purposes.

29 28 Registrars Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another. InterNIC: http://internic.net Registrars Directory: http://www.internic.net/regist.html

30 29 Server Log Files Server Transfer Log: transactions between a browser and server are logged: the IP address, the time of the request, the method of the request (GET, HEAD, POST…), the status code (a response from the server), and the size in bytes of the transaction. Referrer Log: where the request originated. Agent Log: the browser software making the request (or spider). Error Log: requests that resulted in errors (e.g., 404).
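A transfer-log line in the widely used Common Log Format carries exactly the fields listed above; a sketch of parsing one (the sample line is invented for illustration):

```python
import re

# Common Log Format: host ident authuser [time] "request" status bytes
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
m = LOG_RE.match(line)
print(m.group("method"), m.group("status"), m.group("size"))  # GET 200 2326
```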

31 30 Server Log Analysis Most and least visited web pages Entry and exit pages Referrals from other sites or search engines What are the searched keywords How many clicks/page views a page received Error reports, like broken links

32 31 Server Log Analysis

33 32 Search Engines According to the Pew Internet Project Report (2002), search engines are the most popular way to locate information online. About 33 million U.S. Internet users query search engines on a typical day, and more than 80% have used search engines. Search engines are measured by coverage and recency.

34 33 Coverage Overlap analysis used for estimating the size of the indexable web W: set of webpages Wa, Wb: pages crawled by two independent engines a and b P(Wa), P(Wb): probabilities that a page was crawled by a or b P(Wa)=|Wa| / |W| P(Wb)=|Wb| / |W|

35 34 Overlap Analysis P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb) = |Wa ∩ Wb| / |Wb| If a and b are independent: P(Wa ∩ Wb) = P(Wa) * P(Wb), so P(Wa ∩ Wb | Wb) = P(Wa) * P(Wb) / P(Wb) = P(Wa) = |Wa| / |W|

36 35 Overlap Analysis Using |W| = |Wa|/ P(Wa), the researchers found: Web had at least 320 million pages in 1997 60% of web was covered by six major engines Maximum coverage of a single engine was 1/3 of the web
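The probabilities above combine into the capture-recapture estimator |W| ≈ |Wa| · |Wb| / |Wa ∩ Wb|; a toy sketch with synthetic crawls (all numbers made up):

```python
import random

def estimate_web_size(crawl_a, crawl_b):
    """Capture-recapture estimate: P(Wa) ~ |Wa n Wb| / |Wb|,
    so |W| = |Wa| / P(Wa) = |Wa| * |Wb| / |Wa n Wb|."""
    return len(crawl_a) * len(crawl_b) / len(crawl_a & crawl_b)

# Deterministic check: crawls of 4 pages each sharing 2 pages -> 4*4/2 = 8.
assert estimate_web_size({1, 2, 3, 4}, {3, 4, 5, 6}) == 8.0

# Toy universe of 1000 pages, two independent random crawls.
rng = random.Random(42)
web = list(range(1000))
wa, wb = set(rng.sample(web, 400)), set(rng.sample(web, 300))
print(estimate_web_size(wa, wb))  # close to 1000 in expectation
```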

37 36 How to Improve the Coverage? Meta-search engine: dispatch the user query to several engines at the same time, then collect and merge the results into one list for the user. Any suggestions?

38 37 Web Crawler A crawler is a program that fetches a page and follows all the links on that page. Crawler = spider. Types of crawler: breadth-first and depth-first.

39 38 Breadth First Crawlers Use the breadth-first search (BFS) algorithm: Get all links from the starting page and add them to a queue. Pick the first link from the queue, get all links on that page, and add them to the queue. Repeat the above step until the queue is empty.
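The steps above can be sketched with a queue over a toy in-memory link graph (the dictionary stands in for fetching a page and extracting its links; the page names are made up):

```python
from collections import deque

# Toy web: page -> list of outgoing links (a stand-in for fetching a page
# and extracting its hrefs).
links = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}

def bfs_crawl(start):
    """Visit pages in breadth-first order, skipping already-seen links."""
    order, seen = [], {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for url in links[page]:
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return order

print(bfs_crawl("a"))  # ['a', 'b', 'c', 'd', 'e']
```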

40 39 Breadth First Crawlers

41 40 Depth First Crawlers Use the depth-first search (DFS) algorithm: Get the first unvisited link from the start page. Visit the link and get its first unvisited link. Repeat the above step until there are no unvisited links. Then go to the next unvisited link in the previous level and repeat the second step.

42 41 Depth First Crawlers

43 WEB GRAPHS

44 43 Internet/Web as Graphs A graph of the physical layer, with routers, computers, etc. as nodes and physical connections as edges, is limited: it does not capture the logical connections associated with the information on the Internet. The Web graph: nodes represent web pages and edges are associated with hyperlinks.

45 44 Web Graph http://www.touchgraph.com/TGGoogleBrowser.html

46 45 Web Graph Considerations Edges can be directed or undirected. The graph is highly dynamic: nodes and edges are added/deleted often, the content of existing nodes is also subject to change, and pages and hyperlinks are created on the fly. Apart from the primary connected component, there are also smaller disconnected components.

47 46 Why the Web Graph? An example of a large, dynamic, and distributed graph. Possibly similar to other complex graphs in social, biological, and other systems. Reflects how humans organize information (relevance, ranking) and their societies. Enables efficient navigation algorithms. Allows studying the behavior of users as they traverse the web graph (e-commerce).

48 47 Statistics of Interest Size and connectivity of the graph Number of connected components Distribution of pages per site Distribution of incoming and outgoing connections per site Average and maximal length of the shortest path between any two vertices (diameter)

49 48 Properties of Web Graphs Connectivity follows a power-law distribution. The graph is sparse: |E| = O(n), or at least o(n^2); the average number of hyperlinks per page is roughly a constant. A small-world graph.

50 49 Power Law Size Simple estimates suggest over a billion nodes. The distribution of site sizes, measured by the number of pages, follows a power-law distribution, observed over several orders of magnitude with an exponent γ in the 1.6-1.9 range.

51 50 Power Law Connectivity The distribution of the number of connections per node follows a power-law distribution. A study at Notre Dame University reported γ = 2.45 for the outdegree distribution and γ = 2.1 for the indegree distribution. Random graphs, in contrast, have a Poisson degree distribution, which decays exponentially fast to 0 as k increases towards its maximum value n-1.

52 51 Power Law Distribution - Examples http://www.pnas.org/cgi/reprint/99/8/5207.pdf

53 52 Examples of networks with Power Law Distribution The Internet at the router and interdomain level. Citation networks. The collaboration network of actors. Networks associated with metabolic pathways. Networks formed by interacting genes and proteins. The network of nervous-system connections in C. elegans.

54 53 Small World Networks It is a 'small world': millions of people, yet separated by "six degrees" of acquaintance relationships; popularized by Milgram's famous experiment. Mathematically: the diameter of the graph is small (log N) compared to the overall size. The property seems interesting given the 'sparse' nature of the graph, but it is 'natural' in 'pure' random graphs.

55 54 The small world of WWW An empirical study of the Web graph reveals the small-world property. Average distance (d) in the simulated web: d = 0.35 + 2.06 log10(n); e.g., for n = 10^9, d ≈ 19. The graph was generated using the power-law model. Diameter properties were inferred from sampling, since calculating the maximal diameter is computationally demanding for large values of n.
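The fitted formula is easy to check numerically (assuming, as the n = 10^9 example implies, a base-10 logarithm):

```python
import math

def avg_distance(n):
    """Fitted average distance between two pages: d = 0.35 + 2.06*log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

print(round(avg_distance(1e9)))   # 19 'clicks' for a billion pages
print(round(avg_distance(1e10)))  # a 10-fold increase adds only ~2 clicks
```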

56 55 Implications for Web Logarithmic scaling of the diameter makes future growth of the web manageable: a 10-fold increase in web pages results in only about 2 additional 'clicks'. But users may not take the shortest path; they may use bookmarks or just get distracted on the way. Therefore search engines play a crucial role.

57 56 Some theoretical considerations Classes of small-world networks Scale-free: Power-law distribution of connectivity over entire range Broad-scale: Power-law over “broad range” + abrupt cut-off Single-scale: Connectivity distribution decays exponentially

58 57 Power Law of PageRank Assess the importance of a page relative to a query and rank pages accordingly. Importance measured by indegree alone is not reliable, since it is entirely local. PageRank: the proportion of time a random surfer would spend on that page at steady state, where the random surfer is a first-order Markov chain that at each time step travels from one page to another.

59 58 PageRank contd The PageRank r(v) of page v is the steady-state distribution obtained by solving the system of linear equations r(v) = Σ_{u ∈ pa[v]} r(u) / ch[u], where pa[v] is the set of parent nodes of v and ch[u] is the outdegree of u.
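The steady state of the system of linear equations described above can be computed by power iteration; a minimal sketch of the plain (undamped) recurrence on a tiny made-up graph that is strongly connected, so the iteration converges:

```python
def pagerank(out_links, iters=100):
    """Power iteration for the plain recurrence r(v) = sum_{u in pa[v]} r(u)/ch[u].
    Assumes every page has at least one out-link (no dangling nodes)."""
    nodes = list(out_links)
    r = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, 0.0)
        for u, targets in out_links.items():
            share = r[u] / len(targets)  # u's rank split over its out-links
            for v in targets:
                nxt[v] += share
        r = nxt
    return r

# Tiny strongly connected web: a -> b,c ; b -> c ; c -> a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Solving the equations by hand for this graph gives r(a) = 0.4, r(b) = 0.2, r(c) = 0.4, which the iteration approaches. Real search engines add a damping term to handle dangling nodes and disconnected parts of the graph.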

60 59 Examples Log Plot of the PageRank Distribution of the Brown Domain (*.brown.edu) G. Pandurangan, P. Raghavan, E. Upfal, "Using PageRank to Characterize Web Structure", COCOON 2002

61 60 Bow-tie Structure of Web A large-scale study (AltaVista crawls) of 200 million nodes and 1.5 billion links reveals interesting properties of the web. The small-world property is not applicable to the entire web: some parts are unreachable, and others have long paths. Power-law connectivity holds though: page indegree (γ = 2.1) and outdegree (γ = 2.72).

62 61 Bow-tie Components Strongly Connected Component (SCC): the core, with the small-world property. Upstream (IN): the core can't reach IN. Downstream (OUT): OUT can't reach the core. Disconnected (Tendrils).

63 62 Component Properties Each component is roughly the same size, ~50 million nodes. Tendrils are not connected to the SCC, but are reachable from IN and can reach OUT. Tubes: directed paths IN -> Tendrils -> OUT. Because of the disconnected components, the maximal and average diameter of the whole graph is infinite.

64 63 Empirical Numbers for Bow-tie Maximal (finite) minimal diameter: 28 for the SCC, over 500 for the entire graph. The probability that a directed path exists between any 2 nodes is about a quarter (0.24). Average path length: 16 when a directed path exists, 7 undirected. The shortest directed path between 2 nodes in the SCC is 16-20 links on average.

65 64 Models for the Web Graph Stochastic models that can explain, or at least partially reproduce, properties of the web graph. The model should follow the power-law distribution properties, represent the connectivity of the web, and maintain the small-world property.

66 65 Web Page Growth Empirical studies observe a power-law distribution of site sizes. "Size" here includes the size of the Web, the number of IP addresses, the number of servers, the average size of a page, etc. A generative model has been proposed to account for this distribution.

67 66 Component One of the Generative Model The first component of this model is that sites have short-term size fluctuations, up or down, that are proportional to the size of the site. A site with 100,000 pages may gain or lose a few hundred pages in a day, whereas such an effect is rare for a site with only 100 pages.

68 67 Component Two of the Generative Model There is an overall growth rate α so that the size S(t) satisfies S(t+1) = α(1 + ε_t β) S(t), where ε_t is the realization of a ±1 Bernoulli random variable at time t with probability 0.5, and β is the absolute rate of the daily fluctuations.

69 68 Component Two of the Generative Model contd After T steps, S(T) = α^T S(0) Π_{t=1..T} (1 + ε_t β), so that log S(T) = T log α + log S(0) + Σ_{t=1..T} log(1 + ε_t β).

70 69 Theoretical Considerations Assuming the ε_t are independent, by the central limit theorem it is clear that for large values of T, log S(T) is normally distributed. The central limit theorem states that given a distribution with mean μ and variance σ^2, the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ^2/N as N, the sample size, increases. http://davidmlane.com/hyperstat/A14043.html
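Because log S(T) is a sum of independent increments, its exact distribution under the model is binomial and can be enumerated directly; a sketch (the symbols α, β, ε_t follow my reading of the garbled transcript, and the parameter values are made up):

```python
import math

# Exact distribution of log S(T) for S(t+1) = alpha*(1 + eps_t*beta)*S(t),
# where eps_t = +1 or -1, each with probability 0.5.
def log_size_distribution(T, alpha, beta, s0=1.0):
    """Return (log_size, probability) pairs; k counts the +1 fluctuations."""
    dist = []
    for k in range(T + 1):
        p = math.comb(T, k) / 2 ** T  # binomial weight of k upward steps
        log_s = (math.log(s0) + T * math.log(alpha)
                 + k * math.log(1 + beta) + (T - k) * math.log(1 - beta))
        dist.append((log_s, p))
    return dist

# For large T this binomial mixture approaches a normal in log S(T),
# i.e. S(T) itself is approximately log-normal.
dist = log_size_distribution(20, 1.01, 0.05)
mean_log_size = sum(ls * p for ls, p in dist)
```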

71 70 Theoretical Considerations contd Log S(T) can also be associated with a binomial distribution counting the number of times ε_t = +1. Hence S(T) has a log-normal distribution. (Figure: the probability density and cumulative distribution functions of the log-normal distribution.)

72 71 Modified Model The model can be modified to obey a power-law distribution by including: a wide distribution of growth rates across different sites, and/or the fact that sites have different ages.

73 72 Capturing Power Law Property In order to capture the power-law property, it is sufficient to assume that: web sites are being continuously created; web sites grow at a constant rate γ during a growth period, after which their size remains approximately constant; and the periods of growth follow an exponential distribution. This gives a relation λ = 0.8γ between the rate λ of the exponential distribution and the growth rate γ when the power-law exponent is 1.08.

74 73 Lattice Perturbation (LP) Models Some Terms "Organized networks" (a.k.a. Mafia): each node has the same degree k and neighborhoods are entirely local. Probability of edge (a,b): 1 if dist(a,b) = 1, 0 otherwise. Note: we are talking about graphs that can be mapped to a Cartesian plane.

75 74 Terms (Cont'd) Organized networks are 'cliquish' (the local neighborhood is a subgraph that is fully connected); the probability of edges across neighborhoods is almost nonexistent (p = 0 for fully organized). "Disorganized" networks: 'long-range' edges exist. Completely disorganized: fully random (Erdős model), p = 1.

76 75 Semi-organized (SO) Networks The probability of a long-range edge is between zero and one: clustered at the local level (cliquish) but with long-range links as well. This leads to networks that are locally cliquish and have short path lengths.

77 76 Creating SO Networks Step 1: take a regular network (e.g., a lattice). Step 2: shake it up (perturbation). Step 2 in detail: for each vertex, pick a local edge and 'rewire' it into a long-range edge with probability p (p = 0: organized, p = 1: disorganized).
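The two steps can be sketched as a Watts-Strogatz-style perturbation of a ring lattice (my own minimal implementation; the parameters are made up):

```python
import random

def ring_lattice(n, k):
    """Organized network: each node linked to its k nearest neighbors (k even)."""
    return {(v, (v + j) % n) for v in range(n) for j in range(1, k // 2 + 1)}

def perturb(n, k, p, seed=0):
    """For each local edge, rewire it into a random long-range edge
    with probability p (p=0: organized lattice, p=1: disorganized)."""
    rng = random.Random(seed)
    edges = list(ring_lattice(n, k))
    for i, (u, v) in enumerate(edges):
        if rng.random() < p:
            w = rng.randrange(n)
            # Avoid self-loops and duplicate edges.
            while w == u or (u, w) in edges or (w, u) in edges:
                w = rng.randrange(n)
            edges[i] = (u, w)
    return edges
```

With p = 0 the lattice is returned unchanged; small p already shortens average distances while leaving local cliquishness almost intact.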

78 77 Statistics of SO Networks Average diameter (d): the average distance between two nodes. Average clique fraction (c): given a vertex v with k(v) neighbors, the maximum number of edges among those neighbors is k(k-1)/2; the clique fraction c_v is (edges present) / (maximum). The average clique fraction, averaged over all nodes, measures the degree to which "my friends are friends of each other".
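The clique fraction defined above is straightforward to compute; a sketch on a tiny made-up graph:

```python
def clique_fraction(adj):
    """Average clique fraction c of an undirected graph {node: set(neighbors)}:
    for each vertex, the fraction of possible edges among its neighbors."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # contributes 0 with fewer than two neighbors
        present = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += present / (k * (k - 1) / 2)
    return total / len(adj)

# A triangle {0,1,2} with a pendant node 3 hanging off vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c = clique_fraction(adj)  # (1 + 1 + 1/3 + 0) / 4 = 7/12
```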

79 78 Statistics (Cont'd) Statistics of common networks (n = nodes, k = average degree, d = average diameter, c = average clique fraction): Actors: n = 225,226, k = 61, d = 3.65, c = 0.79. Power grid: n = 4,941, k = 2.67, d = 18.7, c = 0.08. C. elegans: n = 282, k = 14, d = 2.65, c = 0.28. Large k = large c? Small c = large d?

80 79 Other Properties For the graph to be sparse but connected: n >> k >> log(n) >> 1. As p --> 0 (organized): d ~= n/2k >> 1, c ~= 3/4; highly clustered, and d grows linearly with n. As p --> 1 (disorganized): d ~= log(n)/log(k), c ~= k/n << 1; poorly clustered, and d grows logarithmically with n.

81 80 Effect of 'Shaking it up' Small shake (p close to zero): high cliquishness AND short path lengths. Larger shake (p increased further from 0): d drops rapidly (increased small-world phenomenon) while c remains constant (the transition to a small world is almost undetectable at the local level). Effect of a long-range link: its addition gives a non-linear decrease of d; its removal gives a small linear decrease of c.

82 81 LP and The Web LP has severe limitations: there is no concept of short or long links on the Web (a page in the USA and another in Europe can be joined by one hyperlink), and edge rewiring doesn't produce power-law connectivity (the degree distribution stays bounded and strongly concentrated around its mean value). Therefore, we need other models …

