Published by Percival McCoy. Modified over 9 years ago.
1 Harvard University, CSCI E-2a: Life, Liberty, and Happiness After the Digital Explosion. 4: Search
2 google.com, google.cn, baidu.cn
4 ARPAnet, 1971
5 Clients and Servers
- Client computers (download, upload)
- Web server: www.google.com
- E-mail server: Mail.yahoo.com
- E-mail server: smtp.fas.harvard.edu
- All connected through THE INTERNET
6 IP = Internet Protocol
Store and Forward Switch = Router
- Router in network core receives incoming packets and stores them in a “buffer” (temporary storage)
- Routes packets on outgoing links, using a routing table
- May throw packets away if buffer is full
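The router behavior above can be sketched in a few lines of Python. This is a toy model, not how real router software is written: the `Router` class, the packet dictionaries, and the link name `link-A` are all assumptions made for illustration.

```python
from collections import deque


class Router:
    """Toy store-and-forward router with a bounded buffer (illustrative only)."""

    def __init__(self, routing_table, buffer_size):
        self.routing_table = routing_table  # destination address -> outgoing link
        self.buffer = deque()               # temporary storage for packets
        self.buffer_size = buffer_size
        self.dropped = 0

    def receive(self, packet):
        """Store an incoming packet, or throw it away if the buffer is full."""
        if len(self.buffer) >= self.buffer_size:
            self.dropped += 1
            return False
        self.buffer.append(packet)
        return True

    def forward(self):
        """Send the oldest buffered packet out on the link the table selects."""
        if not self.buffer:
            return None
        packet = self.buffer.popleft()
        link = self.routing_table.get(packet["dst"], "default")
        return link, packet


router = Router({"141.211.125.22": "link-A"}, buffer_size=2)
# Three packets arrive back to back, but the buffer only holds two.
results = [router.receive({"dst": "141.211.125.22", "payload": n}) for n in range(3)]
first = router.forward()
print(results, router.dropped, first)
```

The third packet is discarded silently; nothing in the router tells the sender. That is the gap the edge of the network has to cover.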
7 “End to End”: Intelligence at Edge of Network
- Routers are relatively dumb and rely on intelligence at the edge to compensate
- Sending edge:
  - Packetize
  - Add serial #s
  - Add fingerprint
  - Add destination address
  - Insert into network
- Network: BEST EFFORT
- Receiving edge:
  - Reassemble packets
  - (Maybe) report missing packets
  - (Maybe) report damaged packets
  - Deliver to application
- Applications at the edge: client application (email, web browser, iTunes), server application
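The edge steps listed above can be sketched as two small functions. This is a minimal illustration under stated assumptions: the "fingerprint" is a CRC-32 from Python's `zlib` (real Internet protocols use their own checksums), and the 8-byte packet size and the dictionary layout are invented to keep the example short.

```python
import zlib


def packetize(message: bytes, dst: str, size: int = 8):
    """Split a message into packets with serial number, fingerprint, and address."""
    packets = []
    for serial, start in enumerate(range(0, len(message), size)):
        payload = message[start:start + size]
        packets.append({
            "dst": dst,                          # destination address
            "serial": serial,                    # serial number for reordering
            "fingerprint": zlib.crc32(payload),  # assumed fingerprint: CRC-32
            "payload": payload,
        })
    return packets


def reassemble(packets, expected_count):
    """Return (message, missing, damaged); message is None if anything is wrong."""
    good, damaged = {}, []
    for p in packets:
        if zlib.crc32(p["payload"]) != p["fingerprint"]:
            damaged.append(p["serial"])
        else:
            good[p["serial"]] = p["payload"]
    missing = [s for s in range(expected_count) if s not in good and s not in damaged]
    if missing or damaged:
        return None, missing, damaged
    return b"".join(good[s] for s in range(expected_count)), [], []


message = b"best effort delivery"
sent = packetize(message, "141.211.125.22")

# Packets may arrive out of order; serial numbers let the receiver fix that.
in_order = reassemble(list(reversed(sent)), len(sent))
# A packet dropped by the network shows up as a missing serial number.
with_loss = reassemble(sent[:1] + sent[2:], len(sent))
# A damaged payload no longer matches its fingerprint.
with_damage = reassemble(sent[:2] + [dict(sent[2], payload=b"XXXX")], len(sent))
print(in_order, with_loss, with_damage)
```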
8 Packets
- Packet size (1.5 KB max) is a compromise
- Small enough that packets can be “handled” quickly and with relatively low odds of being damaged
- Large enough that packaging does not outweigh the contents or “payload”
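A rough calculation shows the "packaging versus payload" side of this compromise. The 40-byte header below is an assumption (a 20-byte IPv4 header plus a 20-byte TCP header, both without options), and the packet sizes tried are arbitrary examples.

```python
import math

HEADER_BYTES = 40        # assumed: 20-byte IPv4 header + 20-byte TCP header
MAX_PACKET_BYTES = 1500  # the "1.5 KB max" from the slide


def efficiency(packet_bytes: int) -> float:
    """Fraction of a packet that is payload rather than packaging."""
    return (packet_bytes - HEADER_BYTES) / packet_bytes


def packets_needed(message_bytes: int, packet_bytes: int = MAX_PACKET_BYTES) -> int:
    """Number of packets needed to carry a message of the given size."""
    return math.ceil(message_bytes / (packet_bytes - HEADER_BYTES))


for size in (64, 576, 1500):
    print(size, round(efficiency(size), 3))
print(packets_needed(1_000_000))
```

Under these assumptions a 64-byte packet is mostly packaging, while a full-size packet is almost all payload.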
9 IP Addresses
- IPv4: 32 bits written as 4 decimal numerals less than 256, e.g. 141.211.125.22 (UMich)
  - 4 billion not enough
- IPv6: 128 bits written as 8 blocks of 4 hex digits each, e.g. AF43:23BC:CAA1:0045:A5B2:90AC:FFEE:8080
- At edge, translate domain names (the host part of a URL) --> IP addresses, e.g. umich.edu --> 141.211.125.22
  - Authoritative servers for name translation are part of the Domain Name System (DNS)
- In the network core, IP addresses are used to route packets using routing tables
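The two address formats can be checked with Python's standard `ipaddress` module. This sketch only parses the example addresses from the slide; it does not contact DNS or the network.

```python
import ipaddress

# IPv4: four decimal numerals, each less than 256, packed into 32 bits.
v4 = ipaddress.IPv4Address("141.211.125.22")
v4_as_int = int(v4)
by_hand = 141 * 2**24 + 211 * 2**16 + 125 * 2**8 + 22

# IPv6: eight blocks of four hex digits, packed into 128 bits.
v6 = ipaddress.IPv6Address("AF43:23BC:CAA1:0045:A5B2:90AC:FFEE:8080")

print(v4_as_int == by_hand)   # the dotted form is just a 32-bit number
print(2**32)                  # size of the IPv4 address space ("4 billion")
print(v6.exploded)            # full 8-block form, in lowercase
print(v4.max_prefixlen, v6.max_prefixlen)
```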
11 But who controls the names and numbers?
- ICANN = Internet Corporation for Assigned Names and Numbers
- A US nonprofit … but it’s a long story.
13 The Internet is IP
- Routers do not know what the bits in the packets represent
- Do not know if they are email, streaming video, html web pages
- Do not know if they are encrypted or unencrypted
- You can invent your own new service adhering to IP standards
- Gain Internet’s best-effort service, and the possibility of undelivered packets
14 Striping
- Smallish packets also make better use of the network since later packets can leave before earlier packets arrive
[Animation, slides 14-24: packets 1, 2, 3, 4 enter the network one after another]
25 Striping Utilizes the Network
- Store and Forward delays would add up if entire message had to be buffered at every router
[Animation, slides 25-30: packets 1, 2, 3, 4 move through the routers]
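A back-of-the-envelope model makes the store-and-forward point concrete. It is a simplification with assumptions stated up front: every link has the same speed, propagation time and header overhead are ignored, the message divides evenly into packets, and the numbers in the example are invented.

```python
def whole_message_delay(message_bits: int, link_bps: float, hops: int) -> float:
    """Delay if each router must buffer the entire message before forwarding it."""
    return hops * message_bits / link_bps


def packetized_delay(message_bits: int, packet_bits: int, link_bps: float, hops: int) -> float:
    """Delay when packets are pipelined: later packets leave while earlier ones travel."""
    packets = message_bits // packet_bits  # assumes an exact multiple
    # The first packet needs `hops` transmissions; each later packet adds one more.
    return (hops + packets - 1) * packet_bits / link_bps


# Four 12,000-bit (1.5 KB) packets over three links of 1 Mbit/s each.
whole = whole_message_delay(48_000, 1_000_000, hops=3)
striped = packetized_delay(48_000, 12_000, 1_000_000, hops=3)
print(whole, striped)
```

With these numbers the packetized transfer takes half as long, and the advantage grows with the number of hops.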
31 TCP: Transmission Control Protocol
- Creates logical connection between two machines on the edge of the network
- Connected machines seem to have a circuit connecting them even though they do not tie up the network
- Provides reliable, perfect transport of messages, even though IP may drop packets
- Regulates the rate at which packets are inserted into the network
32 TCP, Basic Idea
- “3-Way Handshaking” between the two hosts
[Animation, slides 32-51: the handshake messages travel back and forth]
52 TCP, Basic Idea
- “Virtual Circuit” now established between two hosts, though the routers in between are not aware of it and the same path need not be followed by all packets
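The "3-Way Handshaking" that sets up the virtual circuit can be written out as three messages. The sketch below only builds the messages as dictionaries; the function name, the dictionary layout, and the starting sequence numbers 100 and 300 are illustrative assumptions, and no real connection is opened.

```python
def three_way_handshake(client_isn: int, server_isn: int):
    """Return the three messages of a TCP-style handshake, in order."""
    # 1. Client asks to connect and announces its initial sequence number.
    syn = {"from": "client", "flags": "SYN", "seq": client_isn}
    # 2. Server agrees, announces its own number, and acknowledges the client's.
    syn_ack = {"from": "server", "flags": "SYN+ACK",
               "seq": server_isn, "ack": syn["seq"] + 1}
    # 3. Client acknowledges the server's number; both sides now share state.
    ack = {"from": "client", "flags": "ACK",
           "seq": syn["seq"] + 1, "ack": syn_ack["seq"] + 1}
    return [syn, syn_ack, ack]


messages = three_way_handshake(client_isn=100, server_isn=300)
for m in messages:
    print(m)
```

Only the two hosts keep this state; the routers in between just see three ordinary packets.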
53 TCP, Basic Idea
[Animation, slides 53-68: packets 1 and 2 cross the network and are acknowledged with ACK1 and ACK2]
69 Dropped Packets and Retransmission
[Animation, slides 69-77: a packet is dropped in the network and is sent again after a TIMEOUT]
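Retransmission after a timeout can be sketched with a stop-and-wait sender. This is a deliberate simplification: real TCP keeps many packets in flight and adapts its timers, whereas this toy sends one packet at a time. The `drop_plan` argument, which tells the simulated network which attempts to lose, is an assumption made so the run is repeatable.

```python
def send_reliably(packets, drop_plan, max_tries=5):
    """Send packets one at a time, resending any that time out.

    drop_plan is a set of (serial, attempt) pairs the toy network will drop.
    Returns (delivered_payloads, log_lines).
    """
    delivered, log = [], []
    for serial, payload in enumerate(packets, start=1):
        for attempt in range(1, max_tries + 1):
            if (serial, attempt) in drop_plan:
                # No ACK comes back, so the sender's timer expires.
                log.append(f"packet {serial} attempt {attempt}: dropped, TIMEOUT")
                continue
            delivered.append(payload)
            log.append(f"packet {serial} attempt {attempt}: delivered, ACK{serial}")
            break
        else:
            raise RuntimeError(f"packet {serial} lost {max_tries} times")
    return delivered, log


# The network drops the first copy of packet 1; the retransmission gets through.
delivered, log = send_reliably(["one", "two"], drop_plan={(1, 1)})
print("\n".join(log))
```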
78 The World Wide Web
- One of the facilities or services provided by certain of the computers on the Internet
- A logical network of web pages that need not be on physically connected computers
79 Example web pages linked to one another:
- http://www.harvard.edu
- http://www.ksg.harvard.edu/
- http://www.president.harvard.edu/
- http://www.news.harvard.edu/gazette/…
- http://www.brighamandwomens.org/PressReleases/…
80 URL = Uniform Resource Locator
- Your computer sends a request for “www.google.com” across the Internet to Google’s computer
- Your computer receives html code in return
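The request half of this exchange can be sketched without touching the network. The helper below, `build_get_request`, is a hypothetical name; it uses the standard `urllib.parse.urlsplit` to pull the host out of a URL and formats a minimal HTTP/1.1 GET request as text. A real browser sends many more headers.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Format a minimal HTTP GET request for the given URL (nothing is sent)."""
    parts = urlsplit(url)
    path = parts.path or "/"        # an empty path means the site's front page
    if parts.query:
        path += "?" + parts.query
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {parts.netloc}\r\n"
            "Connection: close\r\n"
            "\r\n")


request = build_get_request("http://www.google.com")
print(request)
```

The host name goes to DNS to find an IP address, and the request text travels to that address as the payload of TCP packets.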
81 Searching the Web
- Finding pages referring to the search terms
- Deciding which pages are the most “relevant”
82 Finding Relevant Pages
1. Build an index ahead of time
   Eddington: URL, URL, …
   Edison: URL, URL, …
   Edmonton: URL, URL, …
2. When queried, look up in the index
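The two steps above describe what is usually called an inverted index. A minimal sketch follows, with an invented three-page "web" under `example.com`; real search engines also handle punctuation, word forms, and word positions, which are left out here.

```python
from collections import defaultdict


def build_index(pages):
    """Step 1: map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, term):
    """Step 2: answer a query by looking the term up, not by rereading pages."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical pages, invented for the example.
pages = {
    "http://example.com/a": "Eddington observed the eclipse",
    "http://example.com/b": "Edison and Eddington",
    "http://example.com/c": "Edmonton winters",
}
index = build_index(pages)
print(lookup(index, "Eddington"))
print(lookup(index, "Edmonton"))
```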
83 Building the Index
- Google “crawls” the entire Web, following links and loading the pages they point to
- Every time it retrieves a page, it
  - indexes everything on the page
  - maybe keeps a “cached” copy of the page
- A complete crawl probably takes a week or two
- Opt-out
- Caching and copyrights?
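Crawling is a graph traversal: start from a known page, load it, and queue every link not yet seen. The sketch below runs over a tiny in-memory "web" rather than the real one; the `TOY_WEB` pages and the `OPTED_OUT` set are invented, and the opt-out check is a stand-in for whatever mechanism a site actually uses.

```python
from collections import deque

# Hypothetical web: URL -> (page text, links found on the page).
TOY_WEB = {
    "http://example.com/": ("home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b", "http://example.com/private"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
    "http://example.com/private": ("do not index", []),
}
OPTED_OUT = {"http://example.com/private"}


def crawl(seed, web, opted_out):
    """Follow links breadth-first from the seed; return a cache of URL -> text."""
    cache = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        if url in opted_out or url not in web:
            continue                     # respect opt-out, skip dead links
        text, links = web[url]
        cache[url] = text                # the "cached" copy, ready for indexing
        for link in links:
            if link not in seen:         # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return cache


cache = crawl("http://example.com/", TOY_WEB, OPTED_OUT)
print(list(cache))
```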
84 Search = lookup + ranking
- Look a term up in a huge index to retrieve a set of URLs
  - Or several terms …
- Rank the results in order of “usefulness” or “desirability”
  - Or political correctness!
  - Try “falun gong” on google.com and google.cn
  - Or profitability??
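Lookup and ranking are separate steps, which the following sketch keeps apart. The index contents and the `scores` table are invented; how a score is computed is exactly the policy question the slide raises, so here it is just a number supplied from outside.

```python
def search(index, scores, terms):
    """Return URLs containing every term, best score first."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return []
    # Lookup: a page must appear in the list for every query term.
    matches = set.intersection(*postings)
    # Ranking: order by score, highest first; ties broken by URL.
    return sorted(matches, key=lambda url: (-scores.get(url, 0.0), url))


# Hypothetical index and scores, invented for the example.
index = {
    "edison": {"http://example.com/a", "http://example.com/b", "http://example.com/c"},
    "lamp": {"http://example.com/b", "http://example.com/c"},
}
scores = {"http://example.com/a": 0.9, "http://example.com/b": 0.2, "http://example.com/c": 0.7}

hits = search(index, scores, ["Edison", "lamp"])
print(hits)
```

Changing the `scores` table changes what the user sees first without changing what the index contains.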
85 Basic Structure of the Index
- The Lexicon (primary memory): Eddington, Edison, Edmonton
- The Lists of Pages (secondary memory):
  Eddington: URL, URL, …
  Edison: URL, URL, …
  Edmonton: URL, URL, …
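The split between a small lexicon in primary memory and large page lists in secondary memory can be sketched as follows. An `io.BytesIO` object stands in for a disk file, and the storage layout (newline-separated URLs, located by offset and length) is an assumption chosen for simplicity, not a description of any real search engine.

```python
import io


def write_postings(index):
    """Write each term's URL list to 'disk'; return (disk, lexicon)."""
    disk = io.BytesIO()   # stands in for a file on secondary memory
    lexicon = {}          # term -> (offset, length), small enough for primary memory
    for term in sorted(index):
        data = "\n".join(sorted(index[term])).encode("utf-8")
        lexicon[term] = (disk.tell(), len(data))
        disk.write(data)
    return disk, lexicon


def read_postings(disk, lexicon, term):
    """Find the term in the lexicon, then make one targeted read from 'disk'."""
    if term not in lexicon:
        return []
    offset, length = lexicon[term]
    disk.seek(offset)
    return disk.read(length).decode("utf-8").split("\n")


# Hypothetical index contents, invented for the example.
disk, lexicon = write_postings({
    "eddington": ["http://example.com/a", "http://example.com/b"],
    "edison": ["http://example.com/b"],
})
print(lexicon)
print(read_postings(disk, lexicon, "eddington"))
```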
86 Page Ranking
- Hugely important commercially
- Page rank is really a new kind of capital
- People try to “spoof” ranking algorithms
- Search engineers try to detect and discount spoofing
- Endless game of cat and mouse …
87 “A page is important if a lot of pages point to it”
- Probably wrong. Also easy to spoof.
88 “A page is important if a lot of important pages point to it”
- Circular?
- Not really. Can calculate a consistent meaning of “importance” where every page’s importance is the sum of the importance of the pages pointing to it
- Like scholarly citations of scholarly papers
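One way to see that the definition is not circular is to compute it by repetition: give every page equal importance, then keep recomputing each page's importance from the pages pointing to it until the numbers settle. The sketch below follows the commonly published PageRank-style variant, which goes beyond the slide in two labeled ways: a page splits its importance evenly among its outgoing links, and a `damping` factor (0.85 here, an assumed value) mixes in a small uniform share. The four-page link graph is invented.

```python
def importance(links, damping=0.85, iterations=100):
    """Iteratively compute a PageRank-style importance score for each page."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on, split among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no links spreads its importance over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web: A, B and D all point to C; C points back to A.
rank = importance({"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]})
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page A has only one incoming link, the same as nothing special, yet it outranks B and D because that one link comes from the most important page.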
89 Did we mention that searches are logged?
- Google Analytics: for marketing
- To help tune the search engine
- But many searches are personally identifiable!
90 The AOL search data release
91 What should happen?