1 Harvard University CSCI E-2a Life, Liberty, and Happiness After the Digital Explosion 4: Search.

1 1 Harvard University CSCI E-2a Life, Liberty, and Happiness After the Digital Explosion 4: Search

2 2 google.com google.cn baidu.cn

3 3

4 4 ARPAnet, 1971

5 5 Clients and Servers
- Client computers
- Web server: www.google.com
- E-mail server: mail.yahoo.com
- E-mail server: smtp.fas.harvard.edu
- Download, upload
- THE INTERNET

6 6 IP = Internet Protocol
Store and Forward Switch = Router
- Router in network core receives incoming packets and stores them in a “buffer” (temporary storage)
- Routes packets on outgoing links
- May throw packets away if buffer is full
Diagram label: Routing Table
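The store-and-forward behavior described on this slide can be sketched in a few lines of Python. This is a toy model, not how real routers are built: the `Router` class, the dictionary-based packets, and the link name `link-A` are all assumptions made up for illustration.

```python
from collections import deque


class Router:
    """Toy store-and-forward router with a bounded buffer (illustrative only)."""

    def __init__(self, routing_table, buffer_size):
        self.routing_table = routing_table  # destination address -> outgoing link
        self.buffer = deque()               # temporary storage for packets
        self.buffer_size = buffer_size
        self.dropped = 0

    def receive(self, packet):
        """Store an incoming packet, or throw it away if the buffer is full."""
        if len(self.buffer) >= self.buffer_size:
            self.dropped += 1
            return False
        self.buffer.append(packet)
        return True

    def forward(self):
        """Take the oldest buffered packet and pick its outgoing link."""
        if not self.buffer:
            return None
        packet = self.buffer.popleft()
        link = self.routing_table.get(packet["dest"], "default")
        return link, packet


# Hypothetical routing table with a single entry.
router = Router({"141.211.125.22": "link-A"}, buffer_size=2)
results = [router.receive({"dest": "141.211.125.22", "payload": n}) for n in range(3)]
```

With a buffer of two slots, the third packet is dropped. The router does not tell anyone about the loss, which is why the edge of the network has to compensate.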

7 7 “End to End”: Intelligence at Edge of Network
- Routers are relatively dumb and rely on intelligence at the edge to compensate
Sending side: packetize; add serial #s; add fingerprint; add destination address; insert into network
Network: BEST EFFORT
Receiving side: reassemble packets; (maybe) report missing packets; (maybe) report damaged packets; deliver to application
Client application: email, web browser, iTunes
Server application
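The edge-of-network steps listed on this slide might look roughly like the sketch below. It assumes a CRC-32 checksum as the "fingerprint" and a 1000-byte payload. The slide specifies neither, and the function names `packetize` and `reassemble` are hypothetical.

```python
import zlib


def packetize(message: bytes, dest: str, payload_size: int = 1000):
    """Split a message into packets with serial number, fingerprint, and address."""
    packets = []
    for serial, start in enumerate(range(0, len(message), payload_size)):
        payload = message[start:start + payload_size]
        packets.append({
            "dest": dest,                        # destination address
            "serial": serial,                    # serial number for reordering
            "fingerprint": zlib.crc32(payload),  # assumed checksum choice
            "payload": payload,
        })
    return packets


def reassemble(packets, expected_count):
    """Return (message, missing, damaged); message is None if anything is wrong."""
    good, bad = {}, set()
    for p in packets:
        if zlib.crc32(p["payload"]) == p["fingerprint"]:
            good[p["serial"]] = p["payload"]
        else:
            bad.add(p["serial"])
    damaged = sorted(s for s in bad if s not in good)
    missing = [s for s in range(expected_count) if s not in good and s not in bad]
    if missing or damaged:
        return None, missing, damaged
    return b"".join(good[s] for s in range(expected_count)), [], []


message = bytes(range(256)) * 10  # 2560 bytes, so three packets
packets = packetize(message, "141.211.125.22")
```

Packets may arrive in any order. The serial numbers let the receiver put them back together, and the fingerprints let it notice damage.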

8 8 Packets
- Packet size (1.5 KB max) is a compromise
- Small enough that packets can be “handled” quickly and with relatively low odds of being damaged
- Large enough that packaging does not outweigh the contents or “payload”
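The "packaging versus payload" trade-off is simple arithmetic. The sketch below assumes 40 bytes of header per packet (a 20-byte IPv4 header plus a 20-byte TCP header, both without options). That figure is not on the slide, and the function name is hypothetical.

```python
HEADER_BYTES = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def overhead_fraction(payload_bytes, header_bytes=HEADER_BYTES):
    """Fraction of each packet that is packaging rather than payload."""
    return header_bytes / (payload_bytes + header_bytes)


tiny = overhead_fraction(10)     # 10-byte payload: mostly packaging
full = overhead_fraction(1460)   # payload that fills a 1500-byte packet
```

Under these assumptions a 10-byte payload is mostly header, while a full-sized packet spends under 3% of its bytes on packaging.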

9 9 IP Addresses
- IPv4: 32 bits written as 4 decimal numerals less than 256, e.g. 141.211.125.22 (UMich)
  - 4 billion not enough
  - IPv6: 128 bits written as 8 blocks of 4 hex digits each, e.g. AF43:23BC:CAA1:0045:A5B2:90AC:FFEE:8080
- At edge, translate the domain names in URLs --> IP addresses, e.g. umich.edu --> 141.211.125.22
  - Authoritative sites for address translation = “Domain Name System” (DNS) servers
- In the network core, IP addresses are used to route packets using routing tables
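The address formats on this slide can be checked with Python's standard `ipaddress` module. The sketch uses the two example addresses from the slide; the variable names are arbitrary.

```python
import ipaddress

# IPv4: four decimal numerals, each below 256, packed into 32 bits.
v4 = ipaddress.IPv4Address("141.211.125.22")
as_int = int(v4)

# The same packing done by hand: each numeral fills 8 of the 32 bits.
octets = [int(part) for part in "141.211.125.22".split(".")]
manual = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]

total_v4 = 2 ** 32  # about 4 billion possible IPv4 addresses

# IPv6: eight blocks of four hex digits, 128 bits in total.
v6 = ipaddress.IPv6Address("AF43:23BC:CAA1:0045:A5B2:90AC:FFEE:8080")
total_v6 = 2 ** v6.max_prefixlen
```

Name-to-address translation at the edge is available through `socket.gethostbyname("umich.edu")`. It needs a live network, and the address it returns today may differ from the one on the slide.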

10 10

11 11 But who controls the names and numbers?
- ICANN = Internet Corporation for Assigned Names and Numbers
- A US nonprofit … but it’s a long story.

12 12

13 13 The Internet is IP
- Routers do not know what the bits in the packets represent
  - Do not know if they are email, streaming video, html web pages
  - Do not know if they are encrypted or unencrypted
- You can invent your own new service adhering to IP standards
  - Gain Internet’s best-effort service
  - and possibility of undelivered packets

14-24 Striping (animation frames; the text is the same on every frame)
- Smallish packets also make better use of the network since later packets can leave before earlier packets arrive

25-30 Striping Utilizes the Network (animation frames; the text is the same on every frame)
- Store-and-forward delays would add up if the entire message had to be buffered at every router
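The delay argument on the striping slides can be made concrete with a small calculation. The sketch assumes identical links, ignores headers and propagation delay, and uses made-up numbers (a 4-megabit message, 1 Mbit/s links, 3 links, 4 packets). The function names are hypothetical.

```python
def whole_message_delay(message_bits, link_bps, hops):
    """Every router must receive the whole message before sending it on."""
    return hops * message_bits / link_bps


def packetized_delay(message_bits, link_bps, hops, num_packets):
    """Packets are pipelined: later packets leave while earlier ones travel on."""
    packet_time = (message_bits / num_packets) / link_bps
    # The first packet needs `hops` transmissions; each remaining packet
    # arrives one packet_time after the previous one.
    return (hops + num_packets - 1) * packet_time


unsplit = whole_message_delay(4_000_000, 1_000_000, hops=3)
striped = packetized_delay(4_000_000, 1_000_000, hops=3, num_packets=4)
```

Under these assumptions the unsplit message takes 12 seconds to cross three links, while four packets take 6 seconds.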

31 31 TCP = Transmission Control Protocol
- Creates a logical connection between two machines on the edge of the network
- Connected machines seem to have a circuit connecting them even though they do not tie up the network
- Provides reliable, perfect transport of messages, even though IP may drop packets
- Regulates the rate at which packets are inserted into the network

32-51 TCP, Basic Idea (animation frames)
- Diagram labels: 1, 2, “3-Way Handshaking”

52 52 TCP, Basic Idea
- A “Virtual Circuit” is now established between the two hosts, though the routers in between are not aware of it and the same path need not be followed by all packets
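The slides name the "3-Way Handshaking" step without spelling out the messages. In standard TCP these are SYN, SYN-ACK, and ACK, which the sketch below models with plain dictionaries. The function name and the initial sequence numbers are arbitrary choices for illustration.

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three messages that set up a TCP connection."""
    # 1. Client proposes a connection with its initial sequence number (ISN).
    syn = {"flags": "SYN", "seq": client_isn}
    # 2. Server acknowledges the client's ISN and sends its own.
    syn_ack = {"flags": "SYN-ACK", "seq": server_isn, "ack": syn["seq"] + 1}
    # 3. Client acknowledges the server's ISN; both sides are now connected.
    ack = {"flags": "ACK", "seq": syn_ack["ack"], "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]


messages = three_way_handshake(client_isn=100, server_isn=500)
```

Only the two hosts keep this state. The routers in between simply forward each message as an ordinary IP packet.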

53-68 TCP, Basic Idea (animation frames)
- Diagram labels: 1, 2, ACK1, ACK2

69-77 Dropped Packets and Retransmission (animation frames)
- Diagram labels: 1, 2, TIMEOUT
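Retransmission after a timeout can be sketched with a deliberately simplified stop-and-wait sender. Real TCP keeps several packets in flight at once, and the `drops` set used here to script which transmissions get lost is purely an illustrative device.

```python
def send_reliably(packets, drops):
    """Send packets one at a time, retransmitting whenever one is lost.

    `drops` is a set of (serial, attempt) pairs that the network loses.
    """
    delivered, log = [], []
    for serial, payload in enumerate(packets, start=1):
        attempt = 1
        while (serial, attempt) in drops:
            # No ACK arrives, so the sender's timer expires and it tries again.
            log.append(f"packet {serial} attempt {attempt}: TIMEOUT, retransmit")
            attempt += 1
        delivered.append(payload)
        log.append(f"packet {serial} attempt {attempt}: ACK{serial}")
    return delivered, log


# The first transmission of packet 2 is dropped.
delivered, log = send_reliably(["first", "second"], drops={(2, 1)})
```

The receiver still ends up with every payload in order, even though IP made no delivery promise.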

78 78 The World Wide Web
- One of the facilities or services provided by certain of the computers on the Internet
- A logical network of web pages that need not be on physically connected computers

79 79 http://www.ksg.harvard.edu/ http://www.president.harvard.edu/ http://www.news.harvard.edu/gazette/… http://www.harvard.edu http://www.brighamandwomens.org/PressReleases/… http://www.harvard.edu

80 80 URL = Uniform Resource Locator
- Your computer requests “www.google.com” over the Internet from Google’s computer
- Your computer receives html code in return

81 81 Searching the Web
- Finding pages referring to the search terms
- Deciding which pages are the most “relevant”

82 82 Finding Relevant Pages
1. Build an index ahead of time
   Eddington: URL, URL, …
   Edison: URL, URL, …
   Edmonton: URL, URL, …
2. When queried, look up in the index
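The two steps on this slide correspond to what is usually called an inverted index. A minimal sketch follows, assuming whitespace-separated words and a tiny hand-made set of pages. The `example.com` URLs and the page texts are invented for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Step 1: map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, term):
    """Step 2: answer a query by looking the term up in the index."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical pages standing in for the Web.
pages = {
    "http://example.com/a": "Eddington observed the eclipse",
    "http://example.com/b": "Edison and Eddington",
    "http://example.com/c": "Edmonton weather",
}
index = build_index(pages)
hits = lookup(index, "Eddington")
```

The expensive work happens once, ahead of time. Each query is then a single dictionary lookup, with no scan of the pages themselves.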

83 83 Building the Index
- Google “crawls” the entire Web, following links and loading the pages they point to
- Every time it retrieves a page, it
  - indexes everything on the page
  - maybe keeps a “cached” copy of the page
- A complete crawl probably takes a week or two
- Opt-out
- Caching and copyrights?
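Crawling by following links amounts to a graph traversal. The sketch below runs a breadth-first crawl over a small in-memory dictionary instead of the real Web. The `web` structure, the page names, and the `excluded` set modelling opt-out are all assumptions, and nothing here reflects how Google's crawler actually works.

```python
from collections import deque


def crawl(web, seed, excluded=frozenset()):
    """Visit pages reachable from `seed`, skipping pages that opted out."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        if url in excluded:
            continue  # opted out: neither loaded nor indexed
        order.append(url)  # a real crawler would index the page here
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link structure: page -> pages it links to.
web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl(web, "a")
visited_with_opt_out = crawl(web, "a", excluded={"b"})
```

In this toy graph, page `d` is reachable only through `b`, so excluding `b` also hides `d` from the crawler.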

84 84 Search = lookup + ranking
- Look a term up in a huge index to retrieve a set of URLs
  - Or several terms …
- Rank the results in order of “usefulness” or “desirability”
  - Or political correctness!
  - Try “falun gong” on google.com and google.cn
  - Or profitability??
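"Lookup + ranking" for several terms can be sketched as a set intersection followed by a sort. The index contents, the URLs, and the `scores` table are invented for illustration. How a real engine assigns scores is exactly the open question this slide raises.

```python
def search(index, scores, query):
    """Look up every query term, keep pages matching all, then rank them."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())  # page must contain every term
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(results, key=lambda url: (-scores.get(url, 0.0), url))


# Hypothetical index (term -> URLs) and per-page scores.
index = {
    "thomas": {"u1", "u2", "u3"},
    "edison": {"u2", "u3", "u4"},
}
scores = {"u1": 0.9, "u2": 0.2, "u3": 0.7, "u4": 0.5}
ranked = search(index, scores, "Thomas Edison")
```

Changing the `scores` table changes what the user sees first without touching the index, so whoever controls the ranking controls the result order.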

85 85 Basic Structure of the Index
- The Lexicon (primary memory): Eddington, Edison, Edmonton
- The Lists of Pages (secondary memory):
  Eddington: URL, URL, …
  Edison: URL, URL, …
  Edmonton: URL, URL, …
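The split between an in-memory lexicon and page lists kept on slower storage can be sketched as follows. An `io.BytesIO` buffer stands in for secondary memory, and the newline-separated encoding and function names are assumptions. Real search engines use far more compact formats.

```python
import io


def write_postings(index):
    """Write each term's URL list to storage; keep only offsets in memory."""
    storage = io.BytesIO()  # stand-in for a disk file
    lexicon = {}            # term -> (offset, length), small enough for RAM
    for term in sorted(index):
        data = "\n".join(sorted(index[term])).encode("utf-8")
        lexicon[term] = (storage.tell(), len(data))
        storage.write(data)
    return lexicon, storage


def read_postings(lexicon, storage, term):
    """Find the term in the lexicon, then read just that list from storage."""
    if term not in lexicon:
        return []
    offset, length = lexicon[term]
    if length == 0:
        return []
    storage.seek(offset)
    return storage.read(length).decode("utf-8").split("\n")


# Hypothetical index with placeholder URLs.
lexicon, storage = write_postings({
    "eddington": {"u1"},
    "edison": {"u2", "u1"},
    "edmonton": {"u3"},
})
edison_pages = read_postings(lexicon, storage, "edison")
```

A query then needs one in-memory lookup plus a single read of the relevant list.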

86 86 Page Ranking
- Hugely important commercially
- Page rank is really a new kind of capital
- People try to “spoof” ranking algorithms
- Search engineers try to detect and discount spoofing
- Endless game of cat and mouse …

87 87 “A page is important if a lot of pages point to it”
- Probably wrong. Also easy to spoof.

88 88 “A page is important if a lot of important pages point to it”
- Circular?
- Not really. Can calculate a consistent meaning of “importance” where every page’s importance is the sum of the importance of the pages pointing to it
- Like scholarly citations of scholarly papers
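One way to compute such a self-consistent "importance" is repeated averaging until the numbers settle. The sketch below follows the commonly published PageRank formulation, in which each page splits its importance equally among its outgoing links and a damping factor of 0.85 is applied. Both details are assumptions beyond what the slide states, and the three-page graph is invented.

```python
def importance(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from the link structure."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on, split among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its importance evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: a and b both point to c, and c points back to a.
rank = importance({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Page `c` ends up most important, and `a` outranks `b` because its single incoming link comes from an important page.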

89 89 Did we mention that searches are logged?
- Google Analytics: for marketing
- To help tune the search engine
- But many searches are personally identifiable!

90 90 The AOL search data release

91 91 What should happen?

