Published by Earl Brown. Modified over 9 years ago.
1 How Does Google Work? The Technology behind Google's Great Results
Emre Altug Yavuz, Ph.D. candidate
Data Communications Lab, Electrical & Computer Engineering
University of British Columbia (UBC), Vancouver, BC, Canada
2004 © Emre A. Yavuz. EECE, UBC
2 What is Google?
A fully automated search engine that employs robots known as “spiders” to crawl the web frequently and find sites for inclusion in the Google database, or index.
3 Some Google Factoids
Named for the mathematical term “googol”, or 10^100: the number represented by the numeral 1 followed by 100 zeros.
Global unique users per month: 81.9 million.
Selected by Yahoo (2000) and AOL (2002) as search engine partner.
Indexes the largest number of Internet-accessible documents.
Designed to scale well to extremely large data sets.
Uses storage space efficiently to store the index.
Uses data structures optimized for fast and efficient access.
4 Who invented it, when and why?
In the early 1990s, search engines started springing out of academic projects.
The low quality of their results and the prevalence of poorly designed search engines set the stage for the birth of Google.
Google was designed and created by Sergey Brin and Larry Page.
On September 7, 1998, Google Inc. opened its doors in a garage in Menlo Park, California.
5 How does Google work?
When you perform a Google search, you are not actually searching the web but an index of a copy of the web stored on Google’s servers.
The index is compiled from all the pages returned by a multitude of spiders, called Googlebot, that crawl the web.
When a user types in a query, the search terms are looked up in the index, and the results are then returned from a separate set of document servers along with advertisements.
All of these pieces are assembled, with the help of Google's PageRank technology, into the page of search results.
6 What is PageRank?
A method of measuring a page’s “importance”.
It applies the academic citation literature to the web.
It extends the idea of counting citations, or backlinks, to a given page by not counting links from all pages equally and by normalizing by the number of links on a page.
Assume pages t1, …, tn point to page A. The PageRank of page A is then:
PR(A) = (1 - d) + d × (PR(t1)/C(t1) + … + PR(tn)/C(tn))
where C(t) is the number of links going out of page t and d is a damping factor between 0 and 1.
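The formula on this slide can be evaluated by repeated substitution. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the three-page example graph, the damping value d = 0.85, and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over all pages t linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    d = 0.85 and iterations = 50 are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}

    # C(t): number of distinct outgoing links of page t.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    # backlinks[A]: the pages t1..tn that point to A.
    backlinks = {p: [] for p in pages}
    for src, targets in links.items():
        for t in set(targets):
            backlinks[t].append(src)

    # Start every page at 1.0 and apply the formula repeatedly.
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in backlinks[p])
            for p in pages
        }
    return pr


if __name__ == "__main__":
    # Toy graph: A links to B and C, B links to C, C links back to A.
    example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 3))
```

In this toy graph, page C receives links from both A and B, so it ends up with a higher score than B, which is linked only from A. Pages with no outgoing links simply pass nothing on in this sketch.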
7 How to tell what the PageRank of a page is
Download the toolbar from http://toolbar.google.com.
Once it is installed, a bar graph at the top of the browser shows a version of PageRank for the page being browsed.
Hold the mouse over the bar to see a number from 0 to 10.
This only gives a rough idea and is not very accurate: if the page is not in the index but a closely related page is, the toolbar sometimes guesses.
It is just a representation of the actual PageRank. Whilst PageRank is linear, Google uses a non-linear graph to portray it.
8 How significant is PageRank?
The significance of any factor in search engine algorithms depends on the quality of the information it supplies.
A factor’s importance is known as its weight.
Originally, when the Meta keyword tag was new, it could be used as an indicator of what the page was about.
However, its weight quickly approached zero because Webmasters could manipulate it so easily.
Although PageRank is harder to manipulate, manipulating it is not impossible.
9 Is PageRank enough to determine the quality of a page? (1)
“People only link to pages they think are good.”
However, there may be other reasons, such as:
Reciprocal links: “Link to me and I’ll link to you.”
Link requirements: “Using our script requires you to put a link to our website.” or “We’ll give you an award in return for a link to our website.”
Friends and family: “This is my friend Pete’s site.”
Free page add-ons: “This counter is provided by …”
10 Is PageRank enough to determine the quality of a page? (2)
If a Webmaster picks outbound links by searching on Google, then PageRank itself influences the number of links to a page, in a circular way.
The links are then no longer based solely on human judgement, and a page gains links not solely because it is a good page but because its PageRank is already high.
Therefore, PageRank alone is not enough to produce high-precision results.
11 Other System Features
Title tag: the most important factor, since most engines and directories place a high level of importance on it.
Proximity of search terms: how often do they appear? How close together are they?
Text characteristics: font size and type. Search terms in a larger or bolder font are weighted higher than others.
Anchor text: anchors often provide more accurate descriptions of web pages than the pages themselves. They may also exist for documents that cannot be indexed by a text-based search engine, such as images, programs and databases.
12 The difference between PageRank and other factors
Title tag: Can only be listed once.
Keywords in body text: Each successive repetition is less important. Proximity is important.
Anchor text: Highly weighted, but like keywords in body text, there is a cutoff point where further anchor text is no longer worthwhile.
PageRank: Potentially infinite. You are always capable of increasing your PageRank significantly, but it takes work.
13 How does Google rank pages?
Find all pages matching the keywords of the search.
Rank them using “on the page” factors, such as whether keywords are bolded or set in a relatively larger font.
Factor in the inbound anchor text.
Adjust the results by PageRank scores.
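The four steps on this slide can be sketched as a toy scoring function. This only illustrates the order of operations: the page record layout (`url`, `title`, `body`, `bold`, `anchors`, `pagerank`), the function name `rank`, and every numeric weight are hypothetical and are not taken from Google or from the slides.

```python
def rank(query, pages):
    """Toy version of the four ranking steps; all weights are illustrative assumptions.

    pages is a list of dicts with hypothetical keys:
      'url', 'title', 'body' (strings), 'bold' (list of lowercase words),
      'anchors' (list of inbound anchor texts), 'pagerank' (float).
    """
    terms = query.lower().split()
    scored = []
    for page in pages:
        title_words = page["title"].lower().split()
        body_words = page["body"].lower().split()

        # Step 1: keep only pages that match every keyword.
        if not all(t in title_words or t in body_words for t in terms):
            continue

        # Step 2: on-the-page factors (keyword in title, keyword bolded).
        on_page = sum(
            2.0 * (t in title_words) + 1.0 * (t in page["bold"]) for t in terms
        )

        # Step 3: count keyword occurrences in inbound anchor text.
        anchor = sum(
            text.lower().split().count(t) for text in page["anchors"] for t in terms
        )

        # Step 4: adjust the combined score by the page's PageRank.
        score = (1.0 + on_page + anchor) * page["pagerank"]
        scored.append((score, page["url"]))

    # Highest score first.
    return [url for score, url in sorted(scored, reverse=True)]
```

A real engine would combine many more signals. The sketch only shows that matching comes first, on-page and anchor-text evidence are scored next, and PageRank is applied as a final adjustment.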
14 System Anatomy (1)
Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.
The URLserver sends lists of URLs to be fetched to the crawlers.
The fetched web pages are sent to the storeserver, which compresses them and stores them in a repository.
Every web page has an associated ID number called a docID.
The indexer reads the repository, uncompresses the documents and parses them into a set of word occurrences called hits.
15 High-Level Google Architecture (diagram slide; the figure is not reproduced in this transcript)
16 System Anatomy (2)
The hits record the word, its position, font size and capitalization.
The indexer distributes these hits into a set of barrels. It also parses out all the links in every web page and stores important information about them in an anchors file.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and then into docIDs.
The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID. It also produces a list of wordIDs and offsets into the inverted index.
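The re-sorting step described on this slide, from hits ordered by docID to hits ordered by wordID, is what turns a forward index into an inverted index. The sketch below illustrates that idea on in-memory lists. The function names, the tuple layout of a hit, and the whitespace tokenizer are assumptions for illustration; font size is omitted because plain strings carry none.

```python
from collections import defaultdict


def build_forward_hits(docs):
    """Parse documents into hits ordered by docID.

    docs maps docID -> text (hypothetical input). Each hit is a tuple
    (docID, wordID, position, capitalized). Returns (lexicon, hits),
    where lexicon maps each lowercase word to its wordID.
    """
    lexicon = {}
    hits = []
    for doc_id in sorted(docs):
        for position, token in enumerate(docs[doc_id].split()):
            # Assign the next free wordID the first time a word is seen.
            word_id = lexicon.setdefault(token.lower(), len(lexicon))
            hits.append((doc_id, word_id, position, token[0].isupper()))
    return lexicon, hits


def invert(hits):
    """Re-sort hits by wordID to obtain wordID -> [(docID, position, capitalized)]."""
    inverted = defaultdict(list)
    for doc_id, word_id, position, cap in sorted(
        hits, key=lambda h: (h[1], h[0], h[2])
    ):
        inverted[word_id].append((doc_id, position, cap))
    return dict(inverted)


def lookup(word, lexicon, inverted):
    """Return the posting list for a word, or an empty list if it is unknown."""
    word_id = lexicon.get(word.lower())
    return inverted.get(word_id, []) if word_id is not None else []
```

With the inverted form, answering a query only requires looking up each query word's posting list instead of scanning every document.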
17 System Anatomy (3)
A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon.
The searcher is run by a web server and uses the lexicon, together with the inverted index and the PageRanks, to answer queries.
18 How does Google make money?
Initially, it sold targeted banner advertisements and provided search services to other websites, including Yahoo.
Later, it launched AdWords, a system for automatically selling and displaying advertisements alongside search results. The ads are also ranked according to their popularity.
Using the base created by AdWords, it launched a context-targeted advertisement system, AdSense.
Google “next generation corporate software”: query and document update software, released on June 2, 2004.
19 How do you maximize your place on Google? (1)
Make sure that all your pages are indexed in the first place.
Pay a great deal of attention to your web page titles.
Have keywords well represented in the body of the web page.
Add content to your pages and to your website: Google likes sites with lots of content.
Use keywords as hyperlink names.
20 How do you maximize your place on Google? (2)
Have a good system of navigation between your web pages: PageRank gets passed among the internal links of a website.
Get external links to as many pages on your site as you can. If you have good site navigation, each external link adds to the PageRank not only of the linked page but also of every web page on your site.
Do not submit a redirection web page. Most search engines will skip your website completely in that case.
Try to avoid using frames in your website.
21 References
“The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page.
“PageRank Uncovered”, Chris Ridings and Mike Shishigin.
“Google! Everything you always wanted to know, but didn’t have time to find out”, Judy Broom, Betsy Chessler and Katherine Foster.
And, not surprisingly, http://www.google.com
22 Thanks. Questions?
23 Some Features of Google (1)
daterange: limits your search to a particular date, or range of dates, on which a page was indexed by Google.
It only works with Julian dates, so you’ll need to find a Julian date converter online. The Julian date must be an integer (no decimals).
Usage: daterange:start-stop
e.g. stjohns daterange:2452401-2452766
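Instead of an online converter, the integer Julian date can be computed locally. The sketch below uses the standard library's `date.toordinal()` and the fixed offset between that ordinal and the Julian Day Number. The helper names `julian_day` and `daterange_query` are hypothetical, and whether the `daterange:` operator is still honoured by Google is not something this sketch can confirm.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (date.toordinal())
# and the integer Julian Day Number; for example, 2000-01-01 is JDN 2451545.
_JDN_OFFSET = 1721425


def julian_day(d: date) -> int:
    """Return the integer Julian Day Number for a calendar date."""
    return d.toordinal() + _JDN_OFFSET


def daterange_query(terms: str, start: date, stop: date) -> str:
    """Build a query string in the 'terms daterange:start-stop' form shown on the slide."""
    return f"{terms} daterange:{julian_day(start)}-{julian_day(stop)}"


if __name__ == "__main__":
    print(daterange_query("stjohns", date(2002, 5, 6), date(2003, 5, 6)))
```

Under this conversion, the slide's example range 2452401-2452766 corresponds to 6 May 2002 through 6 May 2003.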
24 Some Features of Google (2)
filetype: restricts your results to files ending in ".doc" (or .xls, .ppt, etc.) and shows you only files created with the corresponding program.
The “dot” in the file extension (.doc) is optional.
Usage: filetype:extension
e.g. stjohns -filetype:pdf
25 Some Features of Google (3)
inanchor: restricts the results to text in a page’s link anchors.
Usage: inanchor:terms
e.g. stjohns -inanchor:"ubc"
intext: ignores link text, URLs and titles, and searches only body text. It helps you find query words that are too common in URLs and links.
Usage: intext:terms
e.g. stjohns -intext:"ubc.ca"
26 Some Features of Google (4)
intitle: restricts the results to documents containing a particular word in their titles.
inurl: restricts the results to documents containing a particular word in their URLs.
site: restricts the results to websites in a given domain.
cache: shows the version of a web page that Google has in its cache.
27 Some Features of Google (5)
link: restricts the results to web pages that have links to the specified URL.
related: lists web pages that are "similar" to a specified web page.
info: presents some information that Google has about a particular web page.
28 Some Features of Google (6)
There are actually three different Google phonebook operators.
phonebook: searches the entire Google phonebook.
rphonebook: searches residential listings only.
bphonebook: searches business listings only.
29 Some Features of Google (7)
stocks: if you begin a query with stocks:, Google treats the rest of the query terms as stock ticker symbols and links to a Yahoo Finance page showing stock information for those symbols.
define: if you begin a query with define:, Google displays definitions for the word or phrase that follows, if definitions are available.