February 17, 2011
There is no practical obstacle whatever now to the creation of an efficient index to all human knowledge, ideas and achievements, to the creation, that is, of a complete planetary memory for all mankind. And not simply an index; the direct reproduction of the thing itself can be summoned to any properly prepared spot. … This in itself is a fact of tremendous significance. It foreshadows a real intellectual unification of our race. The whole human memory can be, and probably in a short time will be, made accessible to every individual.
H. G. Wells (1937)
One of the facilities or services provided by certain of the computers on the Internet
A logical network of web pages that need not be on physically connected computers
[Figure: your computer sends a request across the Internet to Harvard’s computer and receives HTML code in return. URL = Uniform Resource Locator]
We know where you are!
… search companies log your searches …
Finding pages referring to the search terms
Deciding which pages are the most “relevant”
1. Build an index ahead of time
   Eddington: URL, URL, …
   Edison: URL, URL, …
   Edmonton: URL, URL, …
2. When queried, look up in the index
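The two steps can be sketched in a few lines of Python. This is a minimal illustration, not how any real search engine stores its index: the page texts and the example.org URLs are made up, and the helper names `build_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

def build_index(pages):
    """Step 1: map each lowercased word to the sorted list of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

def lookup(index, term):
    """Step 2: answer a query from the prebuilt index, without rereading any page."""
    return index.get(term.lower(), [])

# Hypothetical pages, invented for this example.
pages = {
    "http://example.org/a": "Eddington measured starlight",
    "http://example.org/b": "Edison and Eddington",
    "http://example.org/c": "Edmonton weather",
}
index = build_index(pages)
print(lookup(index, "Eddington"))
```

All of the expensive work happens in `build_index`; answering a query is a single dictionary lookup.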
Google “crawls” the entire Web, following links and loading the pages they point to
Every time it retrieves a page, it indexes everything on the page
Maybe keep a “cached” copy of the page
A complete crawl probably takes a week or two
Opt-out
Caching and copyrights?
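The link-following idea can be sketched as a breadth-first traversal. This is only a toy under stated assumptions: the "web" here is an in-memory dictionary invented for the example rather than real HTTP fetches, and `crawl` is a hypothetical helper, not Google's crawler.

```python
from collections import deque

def crawl(web, start):
    """Visit every page reachable from `start` by following links, each page once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        if url not in web:        # dead link: nothing to load
            continue
        text, links = web[url]    # "load" the page
        visited.append(url)       # a real crawler would index `text` here
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy web: URL -> (page text, outgoing links). "d" is a dead link; "e" has no inbound links.
toy_web = {
    "a": ("home page", ["b", "c"]),
    "b": ("second page", ["c", "a"]),
    "c": ("third page", ["d"]),
    "e": ("island page", []),
}
print(crawl(toy_web, "a"))
```

A page that no crawled page links to, such as `"e"` above, is never discovered by this kind of traversal.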
Primary storage:
Silicon memory chips
Up to a gigabit or more
Random-access: same time for any datum
Seek delay
Rotational latency
Primary: approaching 1 ns = $10^{-9}$ sec
Secondary: seek time 5 ms = $5 \cdot 10^{-3}$ sec
Secondary is $(5 \cdot 10^{-3}) / 10^{-9} = 5 \cdot 10^{6}$, i.e. 5 million times slower
Imagine a bookshelf is primary memory and getting a book takes 10 sec
Getting a book from secondary storage would take more than a year and a half
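The comparison can be checked with a few lines of arithmetic. The sketch below works in whole nanoseconds to avoid floating-point rounding and assumes a 365-day year; the variable names are illustrative.

```python
# Access times from the slide, expressed in nanoseconds.
primary_ns = 1              # primary memory: about 1 ns
seek_ns = 5_000_000         # disk seek: 5 ms = 5,000,000 ns

ratio = seek_ns // primary_ns            # how many times slower a seek is

# Bookshelf analogy: if a primary access took 10 seconds...
bookshelf_s = 10
secondary_s = bookshelf_s * ratio        # ...a secondary access would take this long
seconds_per_year = 365 * 24 * 3600       # assumption: 365-day year
years = secondary_s / seconds_per_year

print(ratio, round(years, 2))
```

The result is a little under 1.6 years, which matches the "more than a year and a half" on the slide.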
Works only if
items are in order
same amount of time to access any item
Then it takes at most lg n steps to find an item in a table of length n.
E.g. n = 1 billion => lg n steps = 30 steps
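The conditions and the step count above describe what is usually called binary search. A minimal sketch, assuming the table is a sorted Python list (the function name and the step counter are added for illustration):

```python
def binary_search(items, target):
    """Return (position, steps) for `target` in sorted `items`; position is None if absent."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2          # probe the middle of the remaining range
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1              # target can only be in the upper half
        else:
            hi = mid - 1              # target can only be in the lower half
    return None, steps

lexicon = ["Eddington", "Edison", "Edmonton"]   # already in alphabetical order
print(binary_search(lexicon, "Edmonton"))

# For n = 1 billion, floor(lg n) + 1 probes suffice; bit_length() computes exactly that.
print((10**9).bit_length())
```

Each probe halves the range still under consideration, which is where the lg n bound comes from.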
[Figure: The Lexicon (Eddington, Edison, Edmonton, …) is held in primary memory; each entry points to its entry in The Lists of Pages (URL, URL, …), held in secondary memory]
Many, many tricks to compress both the index and the lists of URLs
Notes show how a lexicon with 25 million entries might fit in 16GB of primary storage
The lists of URLs might be vastly greater but OK as long as it takes only one disk access to get back a lot of URLs
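The split between an in-memory lexicon and on-disk URL lists can be sketched as follows. This is an illustrative layout only, with no compression: the file format (one line of URLs per word), the helper names, and the example.org URLs are all assumptions made for the example.

```python
import bisect
import os
import tempfile

def write_postings(index, path):
    """Write each word's URL list as one line on disk.
    Returns the sorted words and the byte offset of each word's line."""
    words, offsets = [], []
    with open(path, "wb") as f:
        for word in sorted(index):
            words.append(word)
            offsets.append(f.tell())
            f.write((" ".join(index[word]) + "\n").encode("utf-8"))
    return words, offsets

def lookup(words, offsets, path, term):
    """Binary-search the in-memory lexicon, then do one seek and one read on disk."""
    i = bisect.bisect_left(words, term)
    if i == len(words) or words[i] != term:
        return []
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("utf-8").split()

# Hypothetical index contents.
index = {
    "Edison": ["http://example.org/b"],
    "Eddington": ["http://example.org/a", "http://example.org/b"],
    "Edmonton": ["http://example.org/c"],
}
path = os.path.join(tempfile.mkdtemp(), "postings.txt")
words, offsets = write_postings(index, path)
print(lookup(words, offsets, path, "Eddington"))
```

Only `words` and `offsets` need to stay in primary memory; however long a URL list is, retrieving it costs a single seek followed by a sequential read.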
Hugely important commercially
Page rank is really a new kind of capital
People try to “spoof” ranking algorithms
Search engineers try to detect and discount spoofing
Endless game of cat and mouse …
Probably wrong. Also easy to spoof
Circular? Not really.
Can calculate a consistent meaning of “importance” where every page’s importance is the sum of the importance of the pages pointing to it
Like scholarly citations of scholarly papers
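One way to see that the definition is not circular is to compute it by repeated substitution until the values stop changing. The sketch below adds an assumption the slide does not state: each page splits its importance evenly among its outgoing links, so that the total stays fixed at 1. The three-page graph and the function name are made up for illustration.

```python
def importance(links, iterations=100):
    """Repeatedly recompute each page's importance from the pages pointing to it.
    Assumes every page has at least one outgoing link."""
    rank = {page: 1 / len(links) for page in links}   # start with equal importance
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)        # assumption: split evenly
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, value in importance(links).items():
    print(page, round(value, 3))
```

On this graph the values settle at 0.4 for A, 0.2 for B, and 0.4 for C, and they satisfy the definition: C's importance equals half of A's plus all of B's.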
Web surfing metric
If you wander the web at random, how likely are you to wind up at a given page?
Page A is ranked higher than page B if you are more likely to wind up at A during a completely random meandering through the web
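The random-wandering metric can be estimated by simulation. This is a rough sketch under assumptions: the surfer always follows a uniformly random outgoing link (no jumps to unrelated pages), every page has at least one link, and the three-page graph is invented for the example.

```python
import random

def random_surf(links, steps, seed=0):
    """Follow random links for `steps` clicks; return the fraction of visits per page."""
    rng = random.Random(seed)              # fixed seed so runs are repeatable
    page = rng.choice(sorted(links))       # start somewhere at random
    visits = {p: 0 for p in links}
    for _ in range(steps):
        page = rng.choice(links[page])     # click a random link on the current page
        visits[page] += 1
    return {p: visits[p] / steps for p in links}

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
fractions = random_surf(links, steps=100_000)
print(fractions)
```

On this graph the surfer spends roughly 40% of the time on A, 20% on B, and 40% on C, so A and C would rank above B.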
Mission: “to organize the world's information and make it universally accessible and useful.”
Brin: “The perfect search engine would understand exactly what you mean and give back exactly what you want”