Basic WWW Technologies & Mathematic Background (Chap 2 & 1, Baldi) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National.


2 Basic WWW Technologies & Mathematic Background (Chap 2 & 1, Baldi) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer Science and Information Engineering, National Cheng Kung University 2006/10/5

3 2 World Wide Web The World Wide Web (Web) is a network of information resources. The Web relies on three mechanisms to make these resources available: 1. A uniform naming scheme for locating resources on the Web (e.g., URIs). 2. Protocols for access to named resources over the Web (e.g., HTTP). 3. Hypertext for easy navigation among resources (e.g., HTML).

4 3 Internet vs. Web Internet: –The more general term –Includes the physical aspects of the underlying networks and mechanisms such as email, FTP, HTTP… Web: –Associated with information stored on the Internet –Can also refer to a broader class of networks, e.g., the Web of English Literature Both the Internet and the Web are networks

5 4 Essential Components of WWW Resources (HTML, HyperText Markup Language) –Conceptual mappings to concrete or abstract entities, which do not change in the short term –Tagging support for structuring and laying out documents Resource identifiers (hyperlinks): –Strings of characters that represent generalized addresses and may contain instructions for accessing the identified resource –http://www.google.com/ is used to identify the Google homepage Transfer protocols (HTTP, Hypertext Transfer Protocol) –Conventions that regulate the communication between a browser (web user agent) and a server

6 5 Standard Generalized Markup Language (SGML) Based on GML (generalized markup language), developed by IBM in the 1960s An international standard (ISO 8879:1986) defines how descriptive markup should be embedded in a document –Markup: extra information characterizing structure of a document Gave birth to the extensible markup language (XML), W3C recommendation in 1998

7 6 SGML Components SGML documents have three parts: –Declaration: specifies which characters and delimiters may appear in the application –DTD (Document Type Definition)/style sheet: defines the syntax of markup constructs –Document instance: the actual text (with the tags) of the document More information: http://www.W3.Org/markup/SGML

8 7 HTML Background HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA. The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML. HTML standards are organized by W3C: http://www.w3.org/MarkUp/

9 8 HTML Functionalities HTML gives authors the means to: –Publish online documents with headings, text, tables, lists, photos, etc. –Include spreadsheets, video clips, sound clips, and other applications directly in their documents –Link information via hypertext links, at the click of a button –Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.

10 9 Sample Webpage

11 10 Sample Webpage: HTML Structure <html> <head> <title>The title of the webpage</title> </head> <body> <p>Body of the webpage</p> </body> </html>

12 11 HTML Structure An HTML document is divided into a head section (here, between <head> and </head>) and a body (here, between <body> and </body>) The title of the document appears in the head (along with other information about the document) The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <p>
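The head/body split described above can be checked mechanically. The following is a minimal sketch using Python's standard `html.parser` module. The `SAMPLE` document is an assumed reconstruction of the slide's sample page, and `StructureParser` and `parse_structure` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# Assumed reconstruction of the sample page: a title in the head,
# one paragraph in the body.
SAMPLE = """<html>
<head><title>The title of the webpage</title></head>
<body><p>Body of the webpage</p></body>
</html>"""


class StructureParser(HTMLParser):
    """Collects the title text and the body text of an HTML document."""

    def __init__(self):
        super().__init__()
        self._stack = []      # currently open tags, outermost first
        self.title = ""
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop open tags up to and including the one being closed.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._stack:
            self.title += text
        elif "body" in self._stack:
            self.body_text.append(text)


def parse_structure(html):
    """Return (title, body text) for the given HTML string."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.body_text)


print(parse_structure(SAMPLE))
```

The parser tracks open tags on a stack, so text is attributed to the head or the body according to where it appears, which mirrors the structure the slide describes.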

13 12 HTML Hyperlink (example link text: alumni) A link is a connection from one Web resource to another It has two ends, called anchors, and a direction It starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)

14 13 Resource Identifiers Uniform Resource Identifiers (URI): include two overlapping subsets of identifiers –URL: Uniform Resource Locators –URN: Uniform Resource Names

15 14 Introduction to URIs Every resource available on the Web has an address that may be encoded by a URI URIs typically consist of three pieces: –The naming scheme of the mechanism used to access the resource. (HTTP, FTP) –The name of the machine hosting the resource –The name of the resource itself, given as a path

16 15 URI Example http://www.w3.org/TR –There is a document available via the HTTP protocol –Residing on the machine hosting www.w3.org –Accessible via the path "/TR"
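The three pieces of the example URI can be pulled apart with Python's standard `urllib.parse.urlparse`. This is only a sketch; the helper name `split_uri` and the dictionary keys are chosen here for illustration.

```python
from urllib.parse import urlparse


def split_uri(uri):
    """Split a URI into the three pieces named on the slide."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,  # naming scheme of the access mechanism
        "host": parts.netloc,    # machine hosting the resource
        "path": parts.path,      # name of the resource, given as a path
    }


print(split_uri("http://www.w3.org/TR"))
# {'scheme': 'http', 'host': 'www.w3.org', 'path': '/TR'}
```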

17 16 Protocols Describe how messages are encoded and exchanged Different Layering Architectures ISO OSI 7-Layer Architecture TCP/IP 4-Layer Architecture

18 17 ISO OSI Layering Architecture

19 18 TCP/IP Layering Architecture

20 19 TCP/IP Layering Architecture A simplified model that provides an end-to-end reliable connection The network layer –Hosts drop packets into this layer, which routes them toward the destination –Only promises best-effort delivery The transport layer –Reliable byte-oriented stream

21 20 Hypertext Transfer Protocol (HTTP) An application-layer protocol that runs over a connection-oriented transport (TCP) and carries WWW traffic between a browser and a server One of the application-layer protocols supported by the Internet HTTP communication is established via a TCP connection, by default to server port 80

22 21 GET Method in HTTP
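A GET request is plain text sent over the TCP connection. The sketch below builds a minimal HTTP/1.1 request and, optionally, sends it with a raw socket. The function names `build_get_request` and `fetch` are hypothetical, the header set is the bare minimum rather than what a real browser sends, and `fetch` is defined but not called because it needs network access.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, version
        f"Host: {host}",         # the Host header is required in HTTP/1.1
        "Connection: close",     # ask the server to close after responding
        "",                      # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path="/", port=80, timeout=10.0):
    """Send the request over TCP and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_get_request("www.w3.org", "/TR").decode("ascii"))
```

Calling `fetch("www.w3.org", "/TR")` would return the status line, the response headers, and the body as one byte string.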

23 22 Form

24 23 Form [1] Median Eminence (multiple selections allowed): 1. Secretion 2. General 3. 王錫崗. Pituitary Other:

25 24 CGI processing

26 25 CGI (Common Gateway Interface) [Figure: Web Browser, Web Server, CGI, and Database, connected by four steps: service request, service processing, output, service response]
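The request/processing/output/response cycle can be sketched as a tiny CGI-style program. Under the CGI convention the server passes the query string in the `QUERY_STRING` environment variable and relays whatever the program writes to standard output. The function name `cgi_response` and the form field `q` are assumptions made for this example, and no database is involved.

```python
import os
from html import escape
from urllib.parse import parse_qs


def cgi_response(query_string):
    """Turn a query string into a complete CGI response (headers + body)."""
    params = parse_qs(query_string)       # service request: decode form fields
    query = params.get("q", [""])[0]      # hypothetical field name "q"
    body = (                              # service processing and output
        "<html><body><p>You searched for: "
        f"{escape(query)}</p></body></html>"
    )
    # A CGI program writes headers, a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The web server sets QUERY_STRING before running the program.
    print(cgi_response(os.environ.get("QUERY_STRING", "")), end="")
```

Escaping the user's input before echoing it keeps markup in the query from being interpreted by the browser.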

27 26 HTTP Request Processing

28 27 GNU Wget

29 28 CGI: Get query search-results from Google using Wget

30 29 Homework (1) Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user. Homework: Develop a meta-search engine that responds to a user query with combined search results from a few search engines.
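The merging step of a meta-search engine can be sketched as follows. This is one possible strategy (round-robin interleaving by rank, with duplicates removed), not a prescribed solution. The function name `merge_results` and the example URLs are hypothetical, and fetching results from real engines is left out.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists, keeping the first copy of each URL."""
    merged = []
    seen = set()
    # zip_longest walks all lists rank by rank, padding short ones with None.
    for rank_group in zip_longest(*result_lists):
        for url in rank_group:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


engine_a = ["a.example/1", "shared.example", "a.example/2"]
engine_b = ["shared.example", "b.example/1"]
print(merge_results([engine_a, engine_b]))
# ['a.example/1', 'shared.example', 'b.example/1', 'a.example/2']
```

Interleaving treats every engine's ranking as equally trustworthy. A score-based merge would be a reasonable alternative when engines report relevance scores.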

31 30 Domain Name System DNS (Domain Name System): maps domain names to IP addresses IPv4: initially deployed on January 1, 1983, and still the most commonly used version. 32-bit address, written as four decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255. IPv6: revision of IPv4 with 128-bit addresses
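The dotted notation is just a readable form of a single 32-bit number. The sketch below converts between the two by hand and cross-checks against Python's standard `ipaddress` module. The function name `dotted_to_int` and the sample addresses (taken from the documentation ranges) are chosen for illustration.

```python
import ipaddress


def dotted_to_int(dotted):
    """Convert a dotted IPv4 address to its 32-bit integer value."""
    octets = [int(part) for part in dotted.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {dotted!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # each number fills the next 8 bits
    return value


print(dotted_to_int("0.0.0.0"))          # 0
print(dotted_to_int("255.255.255.255"))  # 4294967295, i.e. 2**32 - 1

# The standard library agrees, and reports the IPv6 address width.
assert dotted_to_int("192.0.2.1") == int(ipaddress.IPv4Address("192.0.2.1"))
print(ipaddress.ip_address("2001:db8::1").max_prefixlen)  # 128
```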

32 31 Top Level Domains (TLD) Top-level domain names: .com, .edu, .gov, and ISO 3166 country codes such as .de, .fr, .it There are three types of top-level domains: –Generic domains were created for use by the Internet public –Country code domains were created to be used by individual countries –The .arpa (Address and Routing Parameter Area) domain is designated to be used exclusively for Internet-infrastructure purposes

33 32 Server Log Files Server Transfer Log: transactions between a browser and server are logged –IP address and the time of the request –Method of the request (GET, HEAD, POST…) –Status code, a response from the server –Size in bytes of the transaction Referrer Log: where the request originated Agent Log: browser software making the request (e.g., a spider) Error Log: requests that resulted in errors (e.g., 404)

34 33 Server Log Analysis Most and least visited web pages Entry and exit pages Referrals from other sites or search engines What are the searched keywords How many clicks/page views a page received Error reports, like broken links
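A few of these analyses can be sketched with a regular expression and counters. The example assumes the widely used Combined Log Format, which the slides do not specify. The log lines in `SAMPLE_LOG` are invented, and `LOG_PATTERN` and `summarize` are hypothetical names.

```python
import re
from collections import Counter

# Assumed Combined Log Format: ip, identity, user, [time], "request",
# status, size, and optionally "referrer" and "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Invented entries for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [05/Oct/2006:10:00:00 +0800] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "ExampleBrowser/1.0"',
    '192.0.2.2 - - [05/Oct/2006:10:00:05 +0800] "GET /index.html HTTP/1.1" '
    '200 1024 "http://search.example/?q=web" "ExampleBrowser/1.0"',
    '192.0.2.3 - - [05/Oct/2006:10:00:09 +0800] "GET /missing.html HTTP/1.1" '
    '404 0 "-" "ExampleSpider/0.1"',
]


def summarize(lines):
    """Count page views, error paths, and external referrers."""
    page_views, errors, referrers = Counter(), Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["status"].startswith(("4", "5")):
            errors[match["path"]] += 1      # e.g. broken links (404)
        else:
            page_views[match["path"]] += 1
        referrer = match["referrer"]
        if referrer and referrer != "-":
            referrers[referrer] += 1
    return page_views, errors, referrers


views, errors, referrers = summarize(SAMPLE_LOG)
print(views.most_common(1))  # most visited page
print(dict(errors))          # error report
print(dict(referrers))       # referrals from other sites
```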

35 34 Server Log Analysis

36 35 Search Engines According to the Pew Internet & American Life Project report (2002), search engines are the most popular way to locate information online About 33 million U.S. Internet users query search engines on a typical day More than 80% have used search engines Search engines are measured by coverage and recency

37 36 Web Crawler A crawler is a program that picks up a page and follows all the links on that page Crawler = Spider Types of crawler: –Breadth First –Depth First

38 37 Breadth First Crawlers Use the breadth-first search (BFS) algorithm –Get all links from the starting page and add them to a queue –Pick the first link from the queue, get all links on that page, and add them to the queue –Repeat the above step until the queue is empty
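The queue-based procedure can be sketched on a small in-memory link graph, so that no network access is needed. The graph `LINKS` and the function name `bfs_crawl` are hypothetical. The `seen` set is an addition not mentioned on the slide: without it, a cycle of links would keep the queue from ever emptying.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def bfs_crawl(start, links):
    """Return pages in the order a breadth-first crawler visits them."""
    order = []
    seen = {start}           # pages already queued, to avoid revisits
    queue = deque([start])
    while queue:
        page = queue.popleft()            # pick the first link in the queue
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)      # add newly found links to the back
    return order


print(bfs_crawl("A", LINKS))
# ['A', 'B', 'C', 'D', 'E']
```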

39 38 Breadth First Crawlers

40 39 Depth First Crawlers Use the depth-first search (DFS) algorithm –Get the first unvisited link from the start page –Visit that link and get its first unvisited link –Repeat the above step until there are no unvisited links –Go to the next unvisited link in the previous level and repeat the second step
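The depth-first procedure can be sketched on a small in-memory link graph. The graph `LINKS` and the function name `dfs_crawl` are hypothetical. Recursion is used for brevity; a crawler working on a large site would more likely keep an explicit stack.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def dfs_crawl(start, links):
    """Return pages in the order a depth-first crawler visits them."""
    order = []
    seen = set()

    def visit(page):
        seen.add(page)
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                visit(target)  # follow the first unvisited link all the way
        # returning here goes back to the previous level

    visit(start)
    return order


print(dfs_crawl("A", LINKS))
# ['A', 'B', 'D', 'C', 'E']
```

On this graph the breadth-first order would be A, B, C, D, E, so the two strategies differ in when page D is reached.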

41 40 Depth First Crawlers

42 41 Coverage Overlap analysis used for estimating the size of the indexable web W: set of webpages Wa, Wb: pages crawled by two independent engines a and b P(Wa), P(Wb): probabilities that a page was crawled by a or b –P(Wa)=|Wa| / |W| –P(Wb)=|Wb| / |W|

43 42 Overlap Analysis P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb) = |Wa ∩ Wb| / |Wb| If a and b are independent: –P(Wa ∩ Wb) = P(Wa) * P(Wb) –P(Wa ∩ Wb | Wb) = P(Wa) * P(Wb) / P(Wb) = (|Wa| / |W|) * (|Wb| / |W|) / (|Wb| / |W|) = |Wa| / |W| = P(Wa)

44 43 Overlap Analysis Using |W| = |Wa| / P(Wa), with P(Wa) estimated by the overlap fraction |Wa ∩ Wb| / |Wb|, the researchers found: –The Web had at least 320 million pages in 1997 –60% of the Web was covered by six major engines –Maximum coverage of a single engine was 1/3 of the Web
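The estimate can be worked through on toy data. The sketch below assumes the two crawls are independent, as the derivation does, and uses invented page sets. The function name `estimate_web_size` is hypothetical.

```python
def estimate_web_size(crawl_a, crawl_b):
    """Estimate |W| from two independent crawls given as sets of pages."""
    overlap = len(crawl_a & crawl_b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    # P(Wa) is estimated by the share of b's pages that a also crawled.
    p_a = overlap / len(crawl_b)
    return len(crawl_a) / p_a  # |W| = |Wa| / P(Wa)


# Invented crawls: a holds pages 0..299, b holds pages 200..399.
crawl_a = set(range(0, 300))
crawl_b = set(range(200, 400))
print(estimate_web_size(crawl_a, crawl_b))
# 600.0
```

Here engine a crawled half of b's pages, so a's 300 pages are taken to be half of the Web, giving an estimate of 600 pages.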

45 44 How to Improve the Coverage? Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user. Any suggestions? Homework: Develop a meta-search engine that responds to a user query with combined search results from a few search engines.

46 45 Probability Model uncertainty: make inferences about events given observed data An event e: proposition or statement about the world at large –“the number of Web pages in existence on 1 January 2003 was greater than five billion” A probability P(e): can be viewed as a number that reflects our uncertainty about whether e is true or false in the real world, given whatever information we have available.

47 46 Learning from a Bayesian Perspective A conditional probability P(e | D): represent the degree of belief (Bayesian interpretation of probability), where D is the background information (data) on which our belief is based. Bayesian approach: probability as being a dynamic entity updated when more data arrive –Prior probability: P(e) is your belief in the event e before you see any data –Posterior probability: P(e | D) reflects your updated belief in event e given the observed data D –Likelihood: P(D | e) is the probability of the data under the assumption that e is true How to model P(D | e)?
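Prior, likelihood, and posterior are tied together by Bayes' theorem, P(e | D) = P(D | e) P(e) / P(D). A small numeric sketch for a true/false event follows. The function name `posterior` and the numbers are invented for illustration.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Return P(e | D) for a binary event e using Bayes' theorem."""
    # P(D) sums over both cases: e true and e false.
    evidence = (likelihood_if_true * prior
                + likelihood_if_false * (1.0 - prior))
    return likelihood_if_true * prior / evidence


# Invented numbers: P(e) = 0.5, P(D | e) = 0.8, P(D | not e) = 0.2.
belief = posterior(0.5, 0.8, 0.2)
print(round(belief, 3))
# 0.8
```

The data are four times as likely when e is true, so the belief in e rises from 0.5 to 0.8.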

48 47 Standard Probabilistic Distribution Discrete distributions: Geometric, Poisson Continuous distributions: Exponential, Gamma

49 48 Learning from a Bayesian Perspective (cont.) Take logarithms for easier operations Obtain more data D2 (second data set)
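The formulas for this slide did not survive in the transcript. The block below is a sketch of the standard identities the two steps refer to, written with the symbols used on the previous slide; the label D_1 for the first data set is an assumption.

```latex
% Bayes' theorem
P(e \mid D) = \frac{P(D \mid e)\, P(e)}{P(D)}

% Taking logarithms turns the product into a sum
\log P(e \mid D) = \log P(D \mid e) + \log P(e) - \log P(D)

% With a second data set D_2, the posterior after D_1 acts as the new prior
P(e \mid D_1, D_2) = \frac{P(D_2 \mid e, D_1)\, P(e \mid D_1)}{P(D_2 \mid D_1)}
```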

50 49 Parameter Estimation from Data Maximum a posteriori (MAP) –The objective of parameter estimation is to find or approximate the best set of parameters for a model, i.e., to find the set of parameters θ maximizing the posterior P(θ|D), or log P(θ|D). This is called maximum a posteriori (MAP) estimation. –To deal with positive quantities, we can minimize -log P(θ|D) = -log P(D|θ) - log P(θ) + log P(D) –P(D) plays the role of a normalizing constant and is thus irrelevant for the optimization, i.e., the minimization of -log P(D|θ) - log P(θ) –If the prior P(θ) is uniform over the sample space, then the problem reduces to finding the maximum of P(D|θ), or log P(D|θ). This is known as maximum likelihood (ML) estimation. –This gives the simpler ML estimation procedure, i.e., the minimization of -log P(D|θ)
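The difference between ML and MAP estimation can be seen on a small example. The sketch below assumes Poisson-distributed counts with a Gamma(alpha, beta) prior on the rate (shape alpha, rate beta), a pairing chosen here because both distributions appear earlier in the deck. The data and all function names are invented for illustration.

```python
import math


def neg_log_likelihood(rate, counts):
    """-log P(D | rate) for independent Poisson counts."""
    return sum(rate - x * math.log(rate) + math.lgamma(x + 1)
               for x in counts)


def ml_estimate(counts):
    """The Poisson ML estimate of the rate is the sample mean."""
    return sum(counts) / len(counts)


def map_estimate(counts, alpha, beta):
    """Mode of the Gamma posterior under a Gamma(alpha, beta) prior.

    Valid when alpha + sum(counts) >= 1.
    """
    return (sum(counts) + alpha - 1.0) / (len(counts) + beta)


counts = [2, 3, 1, 4, 0]  # invented data
rate_ml = ml_estimate(counts)
print(rate_ml)  # 2.0

# The ML estimate minimizes the negative log-likelihood.
assert neg_log_likelihood(rate_ml, counts) <= neg_log_likelihood(1.9, counts)
assert neg_log_likelihood(rate_ml, counts) <= neg_log_likelihood(2.1, counts)

# With a flat prior (alpha = 1, beta = 0), MAP reduces to ML.
print(map_estimate(counts, 1.0, 0.0))  # 2.0
```

An informative prior pulls the estimate away from the sample mean: `map_estimate(counts, 2.0, 1.0)` gives 11/6, slightly below 2.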

51 Basic Formula

