Published by Noel Pope. Modified over 8 years ago.
1
Chapter 2: How Search Engines Work
2
Chapter Objectives
– Describe the PageRank formula for calculating a webpage’s popularity.
– Determine how a search engine would calculate the relevance of a webpage to a keyword.
– Describe the kinds of websites that were rewarded and penalized by the Google Panda and Google Penguin updates.
3
Yahoo Lists
4
Larry Page
American computer scientist and internet entrepreneur who co-founded Google Inc. with Sergey Brin. CEO of Google's parent company, Alphabet Inc. After stepping aside as CEO in August 2001 in favour of Eric Schmidt, Page re-assumed the role in April 2011. He announced his intention to step aside a second time in July 2015 to become CEO of Alphabet, under which Google's assets would be reorganized. Under Page, Alphabet is seeking to deliver major advancements in a variety of industries. [3]
Page is the inventor of PageRank, Google's best-known search ranking algorithm.
Google makes up almost 70% of search engine market share.
5
Search Engine Parts
– From Google’s white paper
– The Indexer-Barrels-Sorter portion is key
– PageRank is no longer used, but this structure is still relatively accurate
– Black-hat search engine optimization attempts to artificially inflate a page’s ranking
6
Crawling
Crawling = systematically browsing the World Wide Web, typically for web indexing
Find new and updated web content
– URL Server tracks pages
– Crawler explores all links to find new pages (no need to submit a site, as discovery happens automatically)
URL Server must prioritize crawling
– Crawlers are fast, but with limits (a page is usually crawled about once a week)
– Frequently updated content, such as news sites, will be crawled more often
– Can be problematic
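The prioritization idea can be sketched as a small scheduler. This is only an illustration, not Google's URL Server: the class name, the example URLs, and the recrawl intervals (1 day for a news site, 7 days otherwise) are assumptions chosen to match "usually once/week" and "crawled more often".

```python
import heapq

# Assumed default: the slide only says pages are "usually" crawled once a week.
DEFAULT_INTERVAL_DAYS = 7.0


class CrawlScheduler:
    """Toy URL server: always hands out the page whose recrawl is due soonest."""

    def __init__(self):
        self._queue = []       # min-heap of (next_due_day, url)
        self._interval = {}    # url -> recrawl interval in days

    def add(self, url, interval_days=DEFAULT_INTERVAL_DAYS, now=0.0):
        # A page found through several links is only registered once.
        if url in self._interval:
            return
        self._interval[url] = interval_days
        heapq.heappush(self._queue, (now, url))

    def next_url(self):
        # Pop the page due soonest, then schedule its next visit.
        due, url = heapq.heappop(self._queue)
        heapq.heappush(self._queue, (due + self._interval[url], url))
        return due, url


scheduler = CrawlScheduler()
scheduler.add("https://news.example/", interval_days=1.0)    # hypothetical news site
scheduler.add("https://static.example/", interval_days=7.0)  # hypothetical static site
visits = [scheduler.next_url() for _ in range(5)]
for day, url in visits:
    print(day, url)
```

In the first five visits the news site is fetched four times and the static site once, which is the behavior the slide describes.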
7
Caching
HTML code of each webpage is sent to the repository
– Google has a cached copy of the entire World Wide Web
– Cache = temporary storage (the copy sits in Google's storage, so if a website is down, Google still has a snapshot of what is or was there)
8
Indexing
Recodes each web page as a “hit list”
– A “hit” is a word occurrence (not to be confused with a web hit, when someone views a web page)
– Each page is indexed as a series of words

docID: 2058795
  wordID: 21548   nhits: 5   hit1 hit2 hit3 hit4 hit5
  wordID: 18975   nhits: 5   hit1 hit2 hit3 hit4 hit5
  wordID: 87916   nhits: 3   hit1 hit2 hit3
  ...
  wordID: 48985   nhits: 1   hit1

Example hit: Cap: 0, font: 3, position: 173
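One way to picture this per-page record is as a small data structure. The sketch below is an assumption about layout, not Google's actual encoding: the class and field names are made up, the reading of "Cap" as a capitalization flag is a guess, and the five positions for wordID 21548 are invented for illustration. Only the docID, the wordIDs, the hit counts, and the one example hit (Cap 0, font 3, position 173) come from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hit:
    cap: int       # the slide's "Cap" value; read here as a capitalization flag
    font: int      # font size code
    position: int  # word offset within the page


@dataclass
class ForwardEntry:
    """All hits for one page (docID), grouped by wordID."""
    doc_id: int
    hits_by_word: dict = field(default_factory=dict)  # wordID -> list of Hit

    def add_hit(self, word_id, hit):
        self.hits_by_word.setdefault(word_id, []).append(hit)

    def nhits(self, word_id):
        return len(self.hits_by_word.get(word_id, []))


entry = ForwardEntry(doc_id=2058795)
entry.add_hit(48985, Hit(cap=0, font=3, position=173))  # the example hit on the slide
for pos in (4, 19, 57, 88, 140):                        # invented positions
    entry.add_hit(21548, Hit(cap=0, font=3, position=pos))

print(entry.nhits(21548), entry.nhits(48985))
```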
9
Storing Hit Lists
Partially sorts hits
– docID sent to the barrel corresponding to each wordID
– Some duplication of docIDs
– Prepares docIDs for re-sorting by wordID
10
Sorting
Hit lists sorted by docID are not searchable
– Must sort by wordID
– The search engine must find all docIDs that use the searched-for word

wordID: 21548
  docID: 2058795    nhits: 5   hit1 hit2 hit3 hit4 hit5
  docID: 4856187    nhits: 2   hit1 hit2
  docID: 4894872    nhits: 6   hit1 hit2 hit3 hit4 hit5 hit6
  ...
  docID: 12487561   nhits: 4   hit1 hit2 hit3 hit4
wordID: 18975
  docID: 2058795    nhits: 5   hit1 hit2 hit3 hit4 hit5
  ...
  docID: 14879531   nhits: 3   hit1 hit2 hit3
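The re-sort from docID order to wordID order can be sketched with plain dictionaries. This is a minimal illustration under assumptions: the function name `invert` is hypothetical, hits are kept as opaque labels, and only a few of the slide's docIDs are included.

```python
from collections import defaultdict

# Forward index: docID -> {wordID: [hit, ...]}
forward = {
    2058795: {21548: ["hit1", "hit2", "hit3", "hit4", "hit5"],
              18975: ["hit1", "hit2", "hit3", "hit4", "hit5"]},
    4856187: {21548: ["hit1", "hit2"]},
}


def invert(forward_index):
    """Regroup hits by wordID so one lookup returns every docID using a word."""
    inverted = defaultdict(dict)
    for doc_id, words in forward_index.items():
        for word_id, hits in words.items():
            inverted[word_id][doc_id] = hits
    # Sort wordIDs, and docIDs within each wordID, for a stable layout.
    return {w: dict(sorted(docs.items())) for w, docs in sorted(inverted.items())}


inverted = invert(forward)
print(list(inverted[21548]))  # every docID containing wordID 21548
```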
11
Analyzing Links
Links used for multiple purposes
– Crawling
– Creating list of webpages (docIDs)
– Calculating relevance
– Calculating PageRank (PageRank is no longer used, but many link metrics are still used)
12
Searching on Google
Searcher types “metamorphosis” into Google
– All docIDs containing wordID 21548 found
– Relevance score for each docID calculated
– PageRank of each webpage (docID) found
– Relevance and PageRank combined to determine final rankings
13
Calculating Relevance

Hit Type                  Type Weight
URL                       100
Anchor Text                90
Title Tag                 100
Plain text large font      60
Plain text medium font     30
Plain text small font      10

Note: When looking just at Relevance, some sites with little useful content can earn good rankings if set up properly.
14
Calculating Relevance – http://en.wikipedia.org/wiki/Metamorphosis

Hit Type                  Type Weight   No. of Hits
URL                       100            1
Anchor Text                90           52
Title Tag                 100            1
Plain text large font      60            1
Plain text medium font     30            7
Plain text small font      10           37

100*1 + 90*52 + 100*1 + 60*1 + 30*7 + 10*37 = 5520
15
Calculating Relevance – http://en.wikipedia.org/wiki/The_Metamorphosis

Hit Type                  Type Weight   No. of Hits
URL                       100            1
Anchor Text                90           36
Title Tag                 100            1
Plain text large font      60            1
Plain text medium font     30            2
Plain text small font      10           25

100*1 + 90*36 + 100*1 + 60*1 + 30*2 + 10*25 = 3810
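The two relevance calculations can be checked with a few lines of code. The weights and hit counts are taken from the slides; the dictionary keys and the function name `relevance` are just labels chosen for this sketch.

```python
# Type weights from the "Calculating Relevance" slide.
TYPE_WEIGHTS = {
    "url": 100,
    "anchor": 90,
    "title": 100,
    "large": 60,
    "medium": 30,
    "small": 10,
}


def relevance(hit_counts):
    """Unadjusted relevance: each hit counts fully at its type weight."""
    return sum(TYPE_WEIGHTS[kind] * n for kind, n in hit_counts.items())


metamorphosis = {"url": 1, "anchor": 52, "title": 1,
                 "large": 1, "medium": 7, "small": 37}
the_metamorphosis = {"url": 1, "anchor": 36, "title": 1,
                     "large": 1, "medium": 2, "small": 25}

print(relevance(metamorphosis))      # 5520
print(relevance(the_metamorphosis))  # 3810
```

Anchor text dominates both totals (4680 of 5520, and 3240 of 3810), which is why raw counts are easy to inflate.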
16
Count-Weights
To inflate its score, a webmaster could repeat “metamorphosis” 100 times at the bottom of the page, in white font to make it invisible to users (keyword stuffing)
Count-weights prevent high scores from repeated use

Count    Hit 1   Hit 2   Hit 3   Hit 4   Hit 5   Hit 6   Hit 7   Hit 8   Hit 9+
Weight   1       1       .9      .7      .45     .2      .05     .01     0

Count-Weight Adjusted Relevance Score
Metamorphosis        820
The Metamorphosis    751
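The slide does not spell out how count-weights are applied. A minimal sketch, assuming the weights are applied separately within each hit type (the first hit of a type counts 1, the second 1, the third .9, and so on, with the ninth and later hits counting 0), reproduces both adjusted scores. The function names are hypothetical; the type weights and hit counts are those of the two Wikipedia pages used in the relevance examples.

```python
TYPE_WEIGHTS = {"url": 100, "anchor": 90, "title": 100,
                "large": 60, "medium": 30, "small": 10}

# Weights for the 1st through 8th hit of a type; the 9th and later hits add 0.
COUNT_WEIGHTS = [1, 1, 0.9, 0.7, 0.45, 0.2, 0.05, 0.01]


def effective_count(n_hits):
    """Discounted number of hits: sum of the count-weights for the first n hits."""
    return sum(COUNT_WEIGHTS[:n_hits])


def adjusted_relevance(hit_counts):
    # Assumption: discounting is done per hit type, then scaled by type weight.
    return sum(TYPE_WEIGHTS[kind] * effective_count(n)
               for kind, n in hit_counts.items())


metamorphosis = {"url": 1, "anchor": 52, "title": 1,
                 "large": 1, "medium": 7, "small": 37}
the_metamorphosis = {"url": 1, "anchor": 36, "title": 1,
                     "large": 1, "medium": 2, "small": 25}

print(round(adjusted_relevance(metamorphosis)))      # 820
print(round(adjusted_relevance(the_metamorphosis)))  # 751
```

Under this scheme no hit type can count for more than 4.31 effective hits, so 100 repeated keywords add nothing beyond the first eight.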
17
Multi-Word Searches
butterfly metamorphosis
– “butterfly”
– “metamorphosis”
– “butterfly metamorphosis”
Much easier to earn good rankings for multiple-word searches
18
Perform a Google Search
Examine top 3 organic results
– Analyze usage of the words you searched in each webpage (relevance)
– Analyze PageRank of each webpage using http://ahrefs.com or http://www.opensiteexplorer.org
– https://serps.com/tools/rank-checker/
– Determine what actions the #3 ranked site should take to become ranked #1