Published by David Cox. Modified over 9 years ago.
1
Google Architecture
Instructor: Eng. Hosseinpour
Presenter: Ehsan Javanmard
2
How Google Works
Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations are performed simultaneously, significantly speeding up data processing.
Google search has three distinct parts:
1. Googlebot, a web crawler that finds and fetches web pages.
2. The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
3. The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
3
1. Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It functions much like your web browser: it sends a request to a web server for a web page, downloads the entire page, then hands it off to Google’s indexer.
Googlebot finds pages in two ways:
1. through an add URL form, www.google.com/addurl.html
2. through finding links by crawling the web.
4
2. Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
Stop words: Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters), some punctuation, and multiple spaces.
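The index described on this slide is commonly called an inverted index. The sketch below is a minimal illustration, not Google's implementation: the stop-word list, the tokenizer, and the names `build_index` and `documents` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed stop-word list for illustration; the slide only gives examples.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: [word positions]}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            # Skip stop words and single letters or digits.
            if term in STOP_WORDS or len(term) == 1:
                continue
            index[term].setdefault(doc_id, []).append(position)
    # Keep the entries sorted alphabetically by term, as the slide describes.
    return dict(sorted(index.items()))


# Hypothetical two-document collection.
documents = {
    "doc1": "The crawler fetches the page",
    "doc2": "The indexer sorts every word on the page",
}
index = build_index(documents)
print(index["page"])  # documents containing "page" and where it occurs
```

Looking up a term is a single dictionary access, which is why this structure gives rapid access to the documents that contain a query term.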
5
3. Google’s Query Processor
The query processor has several parts, including:
1. the user interface (search box)
2. the “engine” that evaluates queries and matches them to relevant documents
3. the results formatter
PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including:
- the popularity of the page
- the position and size of the search terms within the page
- the proximity of the search terms to one another on the page
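To make the matching step concrete, here is a minimal sketch that finds documents containing every query term and orders them by one of the factors listed above, the proximity of the terms. It is not Google's ranking formula; the tiny `INDEX` and the functions `matching_docs`, `proximity`, and `rank` are hypothetical.

```python
from itertools import product

# Hypothetical positional index: term -> {doc_id: [word positions]}.
INDEX = {
    "google": {"a": [0, 9], "b": [3]},
    "crawler": {"a": [1], "b": [20]},
}


def matching_docs(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def proximity(index, terms, doc_id):
    """Smallest span, in words, covering one occurrence of each term."""
    position_lists = [index[term][doc_id] for term in terms]
    return min(max(combo) - min(combo) for combo in product(*position_lists))


def rank(index, query):
    """Order matching documents so that closer query terms come first."""
    terms = query.lower().split()
    docs = matching_docs(index, terms)
    return sorted(docs, key=lambda d: (proximity(index, terms, d), d))


print(rank(INDEX, "google crawler"))
```

A real engine would combine many such signals; proximity alone is used here only because it is easy to compute from the positions stored in the index.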
6
Googlebot Techniques
Deep crawling: When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Because of their massive scale, deep crawls can reach almost every page on the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
Fresh crawls: To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change.
The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
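The deep-crawl loop (fetch a page, collect its links, queue the ones not yet seen) can be sketched as follows. This is an illustration under stated assumptions, not Googlebot's code: `PAGES` is a hypothetical in-memory stand-in for the web, and `LinkExtractor` and `deep_crawl` are names invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web": URL -> HTML source.
PAGES = {
    "/home": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a>',
    "/news": '<a href="/about">About</a> <a href="/archive">Archive</a>',
    "/archive": "",
}


def deep_crawl(seed, fetch):
    """Crawl breadth-first from seed; fetch(url) returns HTML or None."""
    queue = deque([seed])
    seen = {seed}
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # queue each URL at most once
                seen.add(link)
                queue.append(link)
    return order


crawl_order = deep_crawl("/home", PAGES.get)
print(crawl_order)
```

The `seen` set is what keeps the crawl from looping forever on pages that link back to each other, such as `/home` and `/about` here.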
7
Deceptive Tactics
Google rejects URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as:
- including hidden text or links on a page
- stuffing a page with irrelevant words (keyword stuffing)
- meta tag stuffing
- cloaking
- using sneaky redirects
- creating doorways, domains, or sub-domains with substantially similar content
- sending automated queries to Google
- linking to bad neighbors
Cloaking: any of several means of serving the search-engine spider a page that is different from the one seen by human users.
Code swapping: optimizing a page for top ranking and then swapping another page in its place once a top ranking is achieved.
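The definition of cloaking can be illustrated with a toy check that requests the same page as a crawler and as a browser and compares the responses. This is only a sketch of the idea, not how Google detects cloaking; both servers, the user-agent strings passed to them, and `looks_cloaked` are hypothetical.

```python
def cloaking_server(user_agent: str) -> str:
    """Hypothetical deceptive server: one page for crawlers, another for people."""
    if "Googlebot" in user_agent:
        return "cheap flights cheap hotels cheap cars"
    return "Welcome to our casino!"


def honest_server(user_agent: str) -> str:
    """Hypothetical honest server: the same page for every visitor."""
    return "Welcome to our travel blog."


def looks_cloaked(serve) -> bool:
    """Flag a server whose response depends on who appears to be asking."""
    as_crawler = serve("Googlebot/2.1")
    as_browser = serve("Mozilla/5.0")
    return as_crawler != as_browser


print(looks_cloaked(cloaking_server), looks_cloaked(honest_server))
```

An exact comparison is too strict for real pages, which legitimately vary between requests (timestamps, advertisements), so a practical check would need a looser similarity measure.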
8
Gateway or Doorway Pages
Doorway pages are web pages designed and built specifically to draw search engine visitors to your website. They are standalone pages designed only to act as doorways to your site.
9
Google’s Query diagram
10
Results Page
For the sake of efficiency, Google searches only the first 101 kilobytes (approximately 17,000 words) of a web page and the first 120 kilobytes of a PDF file.
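As a rough illustration of these limits, the sketch below cuts a document down to the portion that would be searched. It assumes 1 kilobyte = 1,024 bytes, which the slide does not specify, and the names `searchable_part`, `HTML_LIMIT_BYTES`, and `PDF_LIMIT_BYTES` are invented for the example.

```python
# Limits from the slide; 1 KB = 1,024 bytes is an assumption.
HTML_LIMIT_BYTES = 101 * 1024
PDF_LIMIT_BYTES = 120 * 1024


def searchable_part(content: bytes, is_pdf: bool = False) -> bytes:
    """Return only the leading bytes of a document that would be searched."""
    limit = PDF_LIMIT_BYTES if is_pdf else HTML_LIMIT_BYTES
    return content[:limit]


# Hypothetical 150,000-byte page made of repeated five-byte "words".
page = b"word " * 30_000
print(len(page), len(searchable_part(page)))

# Average bytes per word implied by "101 KB is about 17,000 words".
print(round(HTML_LIMIT_BYTES / 17_000, 1))
```

Under that assumption the slide's figures imply about six bytes per word, including the space that follows it.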
11
Cached Pages
Google takes a snapshot of each page it examines and caches (stores) that version as a back-up. The cached version is what Google uses to judge if a page is a good match for your query. This is useful if the original page is unavailable because of:
- Internet congestion
- a down, overloaded, or just slow website
- the owner’s recently removing the page from the web
Note: Since Google’s servers are typically faster than many web servers, you can often access a page’s cached version faster than the page itself.
12
Cached Pages
Note: Google indexes a page (adds it to its index and caches it) frequently if the page is popular (has a high PageRank) and if the page is updated regularly. The new cached version replaces any previous cached versions of the page.
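The caching behavior described on these two slides (one snapshot per page, replaced on each recrawl, and served when the live page is unavailable) can be sketched as below. The class `PageCache`, the function `fetch_with_fallback`, and the failing fetcher are hypothetical names for illustration only.

```python
class PageCache:
    """Keep one snapshot per URL; a new snapshot replaces the previous one."""

    def __init__(self):
        self._snapshots = {}

    def store(self, url, content):
        self._snapshots[url] = content  # overwrites any earlier version

    def get(self, url):
        return self._snapshots.get(url)


def fetch_with_fallback(url, fetch_live, cache):
    """Return the live page, or the cached snapshot if the site is unreachable."""
    try:
        return fetch_live(url)
    except ConnectionError:
        return cache.get(url)


def site_is_down(url):
    """Hypothetical fetcher that simulates an overloaded or offline website."""
    raise ConnectionError(f"cannot reach {url}")


cache = PageCache()
cache.store("example.com/news", "Monday's headlines")
cache.store("example.com/news", "Tuesday's headlines")  # recrawl replaces Monday's copy

print(fetch_with_fallback("example.com/news", site_is_down, cache))
```

Because only the latest snapshot is kept, a reader who falls back to the cache sees the page as it was at the most recent crawl, not any earlier version.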
13
News Headlines
When Google finds current news relating to your query, it includes up to three headlines that link to news stories above your search results.
14
Thank you for your attention.
Faculty of Engineering, University of Birjand
Winter 1387 (2008-2009)