Chapter 2: How Search Engines Work
Chapter Objectives
– Describe the PageRank formula for calculating a webpage's popularity.
– Determine how a search engine would calculate the relevance of a webpage to a keyword.
– Describe the kinds of websites that were rewarded and penalized by the Google Panda and Google Penguin updates.
Yahoo Lists
Larry Page
Larry Page is an American computer scientist and internet entrepreneur who co-founded Google Inc. with Sergey Brin and is CEO of Google's parent company, Alphabet Inc. After stepping aside as CEO in August 2001 in favour of Eric Schmidt, Page re-assumed the role in April. He announced his intention to step aside a second time in July 2015 to become CEO of Alphabet, under which Google's assets would be reorganized. Under Page, Alphabet is seeking to deliver major advancements in a variety of industries. [3] Page is the inventor of PageRank, Google's best-known search ranking algorithm.
Google makes up almost 70% of search engine market share.
Search Engine Parts
– From Google's white paper
– Indexer-Barrels-Sorter portion is key
– PageRank is no longer used, but this structure is still relatively accurate
– Black-hat search engine optimization attempts to artificially inflate a page's ranking
Crawling
Crawling = browsing the World Wide Web, typically for web indexing
Find new and updated web content
– URL Server tracks pages
– Crawler explores all links to find new pages (no need to submit pages; discovery happens automatically)
URL Server must prioritize crawling
– Crawlers are fast, but with limits (usually once/week)
– Frequently updated content (news sites) will be crawled more often
– Can be problematic
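The link-following step can be sketched in a few lines of Python. This is a minimal illustration, not Google's crawler: the `TOY_WEB` dictionary and its URLs are hypothetical stand-ins for real pages, nothing is fetched over the network, and the crawl-frequency prioritization done by the URL Server is left out.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seed, web, max_pages=100):
    """Breadth-first crawl: follow every link to discover new pages."""
    frontier = deque([seed])  # URLs waiting to be visited
    seen = {seed}             # URLs already queued, so none is visited twice
    order = []                # URLs in the order they were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

discovered = crawl("a.example/", TOY_WEB)
print(discovered)
```

Starting from a single seed page, every page reachable through links is found without anyone submitting it, which is the point made on the slide.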
Caching
HTML code of each webpage is sent to the repository
– Google has a cached copy of the entire World Wide Web
– Cache = temporary storage (kept in Google's storage, so if a website is down, Google still has a snapshot of what is or was there)
Indexing
Recodes each web page as a "hit list"
– A "hit" is a word occurrence (not to be confused with a web hit, when someone views a web page)
– Each page indexed as a series of words

docID:
  wordID: 21548   nhits: 5   hit1 hit2 hit3 hit4 hit5
  wordID: 18975   nhits: 5   hit1 hit2 hit3 hit4 hit5
  wordID: 87916   nhits: 3   hit1 hit2 hit3
  ...
  wordID: 48985   nhits: 1   hit1

Example hit: Cap: 0, font: 3, position: 173
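As a rough sketch of how one page might be recoded into a hit list, the Python below records a capitalization flag and a word position for each occurrence. The function name, the sample sentence, and the use of words instead of numeric wordIDs are assumptions made for illustration; font size is omitted because plain text carries no font information.

```python
from collections import defaultdict

def build_hit_list(text):
    """Return {word: [hit, ...]} for one page.

    Each hit stores 'cap' (1 if the occurrence is capitalized) and
    'position' (the word's index on the page).
    """
    hits = defaultdict(list)
    for position, token in enumerate(text.split()):
        word = token.strip(".,;:!?\"'()").lower()
        if not word:
            continue  # token was punctuation only
        hits[word].append({"cap": int(token[:1].isupper()), "position": position})
    return dict(hits)

# Hypothetical page text.
page = "Metamorphosis is a change of form. Butterfly metamorphosis has four stages."
hit_list = build_hit_list(page)

for word, word_hits in hit_list.items():
    print(word, "nhits:", len(word_hits), word_hits)
```

The number of hits per word (nhits) is simply the length of that word's list.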
Storing Hit Lists
Partially sorts hits
– docID sent to barrel corresponding to wordID
– Some duplication of docIDs
– Prepares docIDs for re-sorting by wordID
Sorting
Hit lists sorted by docID are not searchable
– Must sort by wordID
– Search engine must find all docIDs that use the searched-for word

wordID: 21548
  docID:   nhits: 5   hit1 hit2 hit3 hit4 hit5
  docID:   nhits: 2   hit1 hit2
  docID:   nhits: 6   hit1 hit2 hit3 hit4 hit5 hit6
  ...
  docID:   nhits: 4   hit1 hit2 hit3 hit4
wordID: 18975
  docID:   nhits: 5   hit1 hit2 hit3 hit4 hit5
  ...
  docID:   nhits: 3   hit1 hit2 hit3
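The re-sort from docID order to wordID order amounts to building an inverted index. A minimal sketch, assuming the index fits in memory as nested dictionaries: the docIDs 1 and 2 and the hit positions are made up, while the two wordIDs are the ones shown on the slide.

```python
def invert(forward_index):
    """Turn {docID: {wordID: hits}} into {wordID: {docID: hits}}."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word_id, hits in words.items():
            inverted.setdefault(word_id, {})[doc_id] = hits
    return inverted

# Forward index, keyed by docID; hits are reduced to word positions.
forward = {
    1: {21548: [3, 17, 40], 18975: [8]},
    2: {21548: [5]},
}

inverted = invert(forward)

# Every document that uses wordID 21548 is now one lookup away.
print(inverted[21548])
```

With the forward index, answering a query would mean scanning every document; with the inverted index, the engine looks up the wordID and reads off the matching docIDs.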
Analyzing Links
Links used for multiple purposes
– Crawling
– Creating list of webpages (docIDs)
– Calculating relevance
– Calculating PageRank (no longer used; many link metrics are still used)
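For reference, the formula published in Brin and Page's original paper is PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages linking to A, C(T) is the number of links going out of page T, and d is a damping factor (the paper suggests 0.85). The sketch below applies that formula repeatedly to a hypothetical three-page web; the page names, link structure, and iteration count are illustrative assumptions.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # start every page at 1.0
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page passes on its PageRank split across its out-links.
            inbound = sum(pr[src] / len(links[src])
                          for src in pages if page in links[src])
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up highest because it receives links from both other pages, and A benefits from being the only page C links to.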
Searching on Google
Searcher types "metamorphosis" into Google
– All docIDs containing the wordID are found
– Relevance score for each docID is calculated
– PageRank of each webpage (docID) is found
– Relevance and PageRank are combined to determine final rankings
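The four steps can be strung together as below. The slide does not say how relevance and PageRank are combined, so multiplying them is purely an assumption for illustration, as are the docIDs, scores, and function name.

```python
def rank_results(word_id, inverted_index, relevance, pagerank):
    """Return docIDs for one searched word, best first."""
    doc_ids = inverted_index.get(word_id, {})            # step 1: find docIDs
    scored = []
    for doc_id in doc_ids:
        rel = relevance[(word_id, doc_id)]               # step 2: relevance
        pr = pagerank[doc_id]                            # step 3: PageRank
        scored.append((rel * pr, doc_id))                # step 4: combine (assumed product)
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical data for the wordID of "metamorphosis".
inverted_index = {21548: {1: [3, 17], 2: [5]}}
relevance = {(21548, 1): 5520, (21548, 2): 3810}
pagerank = {1: 2.0, 2: 4.0}

results = rank_results(21548, inverted_index, relevance, pagerank)
print(results)
```

In this toy data, document 2 has the lower relevance score but the higher PageRank, and it outranks document 1 once the two are combined.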
Calculating Relevance

Hit Type                  Type Weight
URL                       100
Anchor Text               90
Title Tag                 100
Plain text large font     60
Plain text medium font    30
Plain text small font     10

Note: When looking just at relevance, some sites with little useful content can earn good rankings if set up properly.
Calculating Relevance

Hit Type                  Type Weight   No. of Hits
URL                       100           1
Anchor Text               90            52
Title Tag                 100           1
Plain text large font     60            1
Plain text medium font    30            7
Plain text small font     10            37

Relevance = 100*1 + 90*52 + 100*1 + 60*1 + 30*7 + 10*37 = 5520
Calculating Relevance

Hit Type                  Type Weight   No. of Hits
URL                       100           1
Anchor Text               90            36
Title Tag                 100           1
Plain text large font     60            1
Plain text medium font    30            2
Plain text small font     10            25

Relevance = 100*1 + 90*36 + 100*1 + 60*1 + 30*2 + 10*25 = 3810
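The two worked examples are weighted sums: each hit type's weight multiplied by the number of hits of that type. A short Python sketch that reproduces both totals; the dictionary keys and the `page_a` / `page_b` names are labels invented for this example.

```python
# Type weights from the slides.
TYPE_WEIGHTS = {
    "url": 100,
    "anchor_text": 90,
    "title_tag": 100,
    "large_font": 60,
    "medium_font": 30,
    "small_font": 10,
}

def relevance(hit_counts):
    """Sum of (type weight * number of hits) over all hit types."""
    return sum(TYPE_WEIGHTS[hit_type] * n for hit_type, n in hit_counts.items())

# Hit counts from the two example pages.
page_a = {"url": 1, "anchor_text": 52, "title_tag": 1,
          "large_font": 1, "medium_font": 7, "small_font": 37}
page_b = {"url": 1, "anchor_text": 36, "title_tag": 1,
          "large_font": 1, "medium_font": 2, "small_font": 25}

print(relevance(page_a))  # 5520
print(relevance(page_b))  # 3810
```

Anchor text dominates both scores: 52 anchor-text hits contribute 4680 of the first page's 5520 points.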
Count-Weights
To inflate the score, a webmaster could repeat "metamorphosis" 100 times at the bottom of the page (in white font to make it invisible to users; this is keyword stuffing)
Count-weights prevent high scores from repeated use

Count     Hit 1   Hit 2   Hit 3   Hit 4   Hit 5   Hit 6   Hit 7   Hit 8   Hit 9+
Weight

Count-Weight Adjusted Relevance Score
Metamorphosis        820
The Metamorphosis    751
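The idea can be sketched as follows. The actual count-weight values are not given here, so the tapering list below is hypothetical, as is the assumption that the ninth and later hits count for nothing; the sketch is not meant to reproduce the 820 and 751 scores.

```python
# Hypothetical count-weights for hits 1 through 8; values are assumptions.
COUNT_WEIGHTS = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
TAIL_WEIGHT = 0.0  # assumed weight for hit 9 and beyond

def count_weighted_score(type_weight, n_hits):
    """Score for one hit type, with each successive hit worth less."""
    total = 0.0
    for i in range(n_hits):
        weight = COUNT_WEIGHTS[i] if i < len(COUNT_WEIGHTS) else TAIL_WEIGHT
        total += type_weight * weight
    return total

raw = 10 * 100                           # unadjusted: 100 small-font hits
stuffed = count_weighted_score(10, 100)  # same 100 hits with count-weights
honest = count_weighted_score(10, 8)     # only 8 hits

print(raw, stuffed, honest)
```

Under these assumed weights, a page that repeats the word 100 times scores no higher than a page that uses it 8 times, so keyword stuffing stops paying off.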
Multi-Word Searches
butterfly metamorphosis
– "butterfly"
– "metamorphosis"
– "butterfly metamorphosis"
Much easier to earn good rankings for multiple-word searches
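Because each hit stores a word position, an engine can tell a page that merely contains both words from one that contains the exact phrase. A minimal sketch, assuming hits are reduced to word positions; the docIDs and positions are made up.

```python
def phrase_positions(positions_a, positions_b):
    """Positions where word A is immediately followed by word B."""
    following = set(positions_b)
    return [p for p in positions_a if p + 1 in following]

# Hypothetical word positions in two documents.
docs = {
    1: {"butterfly": [6], "metamorphosis": [0, 7]},
    2: {"butterfly": [2], "metamorphosis": [9]},
}

matches = {
    doc_id: phrase_positions(words["butterfly"], words["metamorphosis"])
    for doc_id, words in docs.items()
}
print(matches)
```

Both documents match "butterfly" and "metamorphosis" individually, but only document 1 has the two words next to each other.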
Perform a Google Search
Examine top 3 organic results
– Analyze usage of the words you searched in each webpage (relevance)
– Analyze PageRank of each webpage
– Determine what actions the #3 ranked site should take to become ranked #1