
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Based on the research of Lawrence Page and Sergey Brin
Presented by: Vanshika Sharma, 04-08-2015
Search Engines

Outline
1. Introduction
2. Design Goals
3. Terms and Definitions
4. How Search Engines Work
5. System Architecture
6. Crawling and Indexing the Web
7. System Features
8. Conclusions
9. Final Exam Questions

1. Introduction
As the volume of information available to the public increases exponentially, it is crucial that data storage, management, classification, ranking, and reporting techniques improve as well. The purpose of this paper is to discuss how search engines work and what modifications can be made so that they work more quickly and accurately. Finally, we want to ensure that the optimizations we introduce are scalable, affordable, maintainable, and reasonable to implement.

Introduction, Cont'd
1997: Google.com is registered as a domain on September 15. The name, a play on the word "googol" (the number written as the numeral 1 followed by 100 zeros), reflects Larry and Sergey's mission to organize a seemingly infinite amount of information on the web.

Web Search Engines -- Scaling Up: 1994-2015

Year           | Engine                         | Indexed web pages                                 | Queries per day
1994           | WWWW                           | 110,000                                           | 1,500
1997           | Top engines (e.g. AltaVista)   | 2 million to 100 million                          | 20 million
1998/2000/2015 | Google                         | 26 million / 1 billion / 30 trillion (~1,000 TB)  | 3.5 billion

2. Search Engine Design Goals
- Scalability with web growth.
- Improved search quality:
  - Decrease the number of irrelevant results.
  - Incorporate feedback systems to account for user approval.
  - There are too many pages for people to view, so some heuristic must be used to rank sites' importance for users.
- Improved search speed, even as the domain space rapidly increases.
- Take into consideration the types of documents hosted.
(Speaker notes: the number of web pages is constantly growing, but users are still only willing to look at the first ~10 results; allow for effective large-scale academic research; less is more when it comes to search queries.)

The Significance of Search Engine Optimization
- There are too many sites for humans to maintain rankings by hand.
- Humans are biased: everyone has different ideas of what is "good/interesting" and "bad/boring."
- With a search space as large as the web, the choice of data structures and order of operations has huge consequences.
- Concise, well-developed heuristics lead to quicker and more accurate results.
- Different methods and algorithms can be combined to increase overall efficiency.

What Makes Ranking Optimization Hard
- Link spamming
- Keyword spamming
- Page hijacking and URL redirection
- Intentionally inaccurate or misleading anchor text
- Accurately targeting people's expectations

3. Terms and Definitions

Terms and Definitions, Cont'd

4. How Search Engines Work
First, the user inputs a query for data. This search is submitted to a back-end server.

How Search Engines Work, Cont'd
- The server uses regular expressions to parse the user's query. The submitted strings can be permuted and rearranged to test for spelling errors. (Specifics of Google's query handling are shown later.)
- The search engine searches its database for documents that closely relate to the user's input.
- To generate meaningful results, the engine uses a variety of algorithms that work together to describe the relative importance of each result.
- Finally, the engine returns the results to the user.
For scale: the indexed Web contained at least 4.72 billion pages as of Monday, 06 April, 2015.
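As a rough illustration of the parsing step, a minimal regular-expression tokenizer might look like the sketch below (hypothetical code; the paper does not publish Google's actual query parser):

```python
import re

def parse_query(query):
    """Split a raw query string into lowercase word tokens.

    A single regular expression pulls out alphanumeric runs,
    discarding punctuation and extra whitespace.
    """
    return re.findall(r"[a-z0-9]+", query.lower())

tokens = parse_query("Large-Scale  Web Search!")
# tokens == ["large", "scale", "web", "search"]
```

A real engine would go further, e.g. testing permutations of the tokens against a dictionary to catch spelling errors.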

5. Google's Infrastructure Overview
Google's architecture includes the following major components: a URL Server, multiple web crawlers, a Store Server, a hypertextual document Repository, an Anchors database, a URL Resolver, a hypertextual document Indexer, a Lexicon, multiple short and long Barrels, a Sorter service, a Searcher service, and the PageRank computation. These systems were implemented in C and C++ on Linux and Solaris systems.

Infrastructure Part I

Infrastructure Part II

Infrastructure Part III

6. URL Resolving and Web Crawling
Before a search engine can respond to user queries, it must first build a database of URLs (Uniform Resource Locators), which describe where web servers and their files are located. The URL Server's job is to keep track of the URLs that have been crawled and those that still need to be. To obtain a current mapping of web servers and their file trees, Google's URL Server routinely invokes a set of web crawling agents called Googlebots. Web users can also manually request that their URLs be added to Google's URL Server.

URL Resolving and Web Crawling, Cont'd
Web crawlers: when a web page has been 'crawled,' it has effectively been downloaded. Googlebots are Google's web crawling agents/scripts (originally written in Python), which open hundreds of connections (roughly 300 in parallel) to different well-connected servers in order to "build a searchable index for Google's search engine" (Wikipedia). Brin and Page noted that DNS (Domain Name System) lookups were an expensive process, so they gave the crawling agents DNS caching abilities. Googlebot is known as a well-behaved spider: sites can opt out of crawling by adding <meta name="googlebot" content="nofollow"> to the head of a document, or by adding a robots.txt file.

Indexing
Indexing the Web involves three main steps:
1. Parsing: any parser designed to run on the entire Web must handle a huge array of possible errors, e.g. non-ASCII characters and typos in HTML tags.
2. Indexing documents into barrels: after each document is parsed, every word is assigned a wordID. These word/wordID pairs are used to construct an in-memory hash table, the lexicon. Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels.
3. Sorting: the sorter takes each forward barrel and sorts it by wordID to produce an inverted barrel for title and anchor hits, plus a full-text inverted barrel. This happens one barrel at a time, so little temporary storage is required.
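The lexicon and barrel construction in step 2 can be sketched as a toy in-memory index (hypothetical structures; Google's real barrels are compact on-disk encodings with richer hit information such as font size and capitalization):

```python
from collections import defaultdict

def build_index(docs):
    """Toy lexicon + inverted index over a list of document strings.

    - lexicon: word -> wordID, assigned via an in-memory hash table
    - inverted: wordID -> list of (docID, position) hits
    """
    lexicon = {}                   # the in-memory hash table of word -> wordID
    inverted = defaultdict(list)   # wordID -> hit list
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))  # new words get the next ID
            inverted[word_id].append((doc_id, pos))           # record a hit
    return lexicon, inverted

docs = ["web search engine", "search the web"]
lexicon, inverted = build_index(docs)
# lexicon["search"] is a small integer wordID; its hit list spans both documents
```

In Google's pipeline the per-document hit lists first land in forward barrels keyed by docID, and the sorter later inverts them by wordID; the sketch above jumps straight to the inverted form for brevity.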

Searching
At the time the paper was written, Google capped query evaluation at roughly the first 40,000 matching documents in order to bound response time.

Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
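The steps above can be sketched as a toy evaluation loop. This simplified version uses one doclist per word and a precomputed rank per document; the short/full-barrel fallback and the real ranking function are omitted:

```python
def evaluate_query(words, doclists, rank):
    """Intersect the doclists of all query words, then sort matches by rank.

    words    -- parsed query terms
    doclists -- word -> list of docIDs containing that word
    rank     -- docID -> precomputed score (standing in for the real ranker)
    """
    if not words:
        return []
    matching = set(doclists.get(words[0], []))
    for w in words[1:]:
        matching &= set(doclists.get(w, []))  # keep docs matching every term
    # Step 8: sort matched documents by rank, best first.
    return sorted(matching, key=lambda d: rank.get(d, 0.0), reverse=True)

doclists = {"web": [1, 2, 3], "search": [2, 3]}
rank = {1: 0.5, 2: 0.2, 3: 0.9}
top = evaluate_query(["web", "search"], doclists, rank)
# top == [3, 2]
```

The real system scans sorted doclists in lockstep rather than materializing sets, which keeps memory use constant per query.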

7. System Features: 1. Link Analysis and Anchors
Hypertext links are convenient for users and act as citations on the Web.
Anchor text analysis: <a href="http://www.google.com">Anchor Text</a>
- Anchor text can describe the target site more accurately than the target site's own text.
- Anchors can point at non-HTTP or non-text resources: images, videos, databases, PDFs, PostScript files, etc.
- Anchors also make it possible to discover pages that have not been crawled.
(Speaker note: a video of Michael Jordan will usually have something related as its anchor text.)

7. System Features: 2. PageRank
- The rights belong to Google; the patent belongs to Stanford University.
- Named one of the top 10 data mining algorithms by IEEE ICDM.
- An algorithm used to rank the relative importance of pages within a network.
- The idea is based on elements of democratic voting and citations.
- In the probabilistic formulation, the total PageRank of a network sums to 1 (the publicly displayed toolbar PageRank used a logarithmic scale).

Introduction to PageRank
PageRank is a link analysis algorithm that ranks the relative importance of all web pages within a network. It does this by looking at three features of a page:
1. Outgoing links: the number of links found in the page.
2. Incoming links: the number of times other pages have cited the page.
3. Rank: a value representing the page's relative importance in the network.
(Speaker note: in the example figure, E has the second-most incoming links, but they come from less important sites, so E's PageRank is much lower than B's.)

Introduction to PageRank: Simplified PageRank
Initialize every page to PR = 1/N, giving all pages the same initial rank in a network of N pages. The page rank of any page u can then be computed by:

    PR(u) = sum over v in B_u of PR(v) / L(v)

where B_u is the set of all pages linking to page u, and L(v) is the number of outgoing links from page v.
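A minimal sketch of this simplified iteration, assuming every page has at least one outgoing link (dangling pages would leak rank and need special handling):

```python
def simplified_pagerank(links, iterations=50):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v), starting from 1/N.

    links -- dict mapping each page to the list of pages it links to;
             every page must appear as a key and have >= 1 outgoing link.
    """
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}       # uniform initial rank
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            share = pr[v] / len(outs)               # v splits its rank over L(v) links
            for u in outs:
                new[u] += share                     # u collects shares from its backlinks B_u
        pr = new
    return pr

# A -> B, B -> C, C -> A: a symmetric cycle, so every page keeps rank 1/3.
pr = simplified_pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
```

Because each page redistributes all of its rank, the total stays at 1 across iterations, matching the probability-distribution view of PageRank.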

Calculating Naive PageRank

    PR(A) = (1 - d) + d * ( PR(B)/L(B) + PR(C)/L(C) + ... )

- PR(A) = the PageRank of page A.
- C(A) or L(A) = the total number of outgoing links from page A.
- d = the damping factor, usually set to d = 0.85: the probability that the random surfer keeps following links at any given step. Even an imaginary randomly clicking surfer will stop eventually; with probability (1 - d) the surfer jumps to a random page instead.

Calculating Naive PageRank, Cont'd
The PageRank of a page A, denoted PR(A), is decided by the quality and quantity of the pages linking to (citing) it. Every page Ti that links to page A is essentially casting a vote, deeming page A important; by doing so, Ti propagates some of its PR to page A. How can we determine how much importance an individual page Ti gives to A? Ti may contain many links, not just a single link to page A, and it must propagate its page rank equally across all of its citations. Thus we only give page A a fraction of PR(Ti): the amount of PR that Ti gives to A can be expressed as the damping factor times PR(Ti) divided by the total number of outgoing links from Ti.
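The damped formula can be iterated directly. Here is a small sketch; note that in this per-page variant the ranks converge so that they sum to N, the number of pages, rather than to 1:

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ti) / L(Ti)) over pages Ti linking to A.

    links -- dict mapping each page to the list of pages it links to;
             every page must appear as a key and have >= 1 outgoing link.
    d     -- damping factor, conventionally 0.85.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for a in pages:
            # Each backlink Ti casts a vote worth PR(Ti) / L(Ti).
            incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new[a] = (1 - d) + d * incoming
        pr = new
    return pr

pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# C receives votes from both A and B, so PR(C) is the largest.
```

Scanning all pages for backlinks is O(N^2) per iteration; a production implementation stores the reverse link structure explicitly, as Google's Links database does.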

Naive Example

Calculating PageRank using Linear Algebra
Typically, PageRank is computed by finding the principal eigenvector of the Markov chain transition matrix; the vector is solved using the iterative power method. The slide shows a simple naive PageRank setup that expresses the network as a link matrix. More examples can be found at:
- http://www.math.uwaterloo.ca/~hdesterc/websiteW/Data/presentations/pres2008/ChileApr2008.pdf (fun linear algebra!)
- http://www.webworkshop.net/pagerank.html
- http://www.sirgroane.net/google-page-rank/
(Speaker note: write out the link matrix B and solve B v = lambda v.)
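A minimal power-method sketch on a hand-built column-stochastic link matrix (no damping, for clarity; the example graph is hypothetical):

```python
def power_method(M, iterations=100):
    """Approximate the principal eigenvector of a column-stochastic matrix M
    (the Markov-chain transition matrix of the link graph) by repeatedly
    multiplying a probability vector by M.
    """
    n = len(M)
    v = [1.0 / n] * n                                   # uniform starting distribution
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(n))      # v <- M v
             for i in range(n)]
    return v

# Graph: A -> B; B -> C; C -> A and C -> B.
# Column j holds 1/outdegree(j) in the rows j links to, so columns sum to 1.
M = [[0.0, 0.0, 0.5],   # A receives half of C's rank
     [1.0, 0.0, 0.5],   # B receives all of A's rank and half of C's
     [0.0, 1.0, 0.0]]   # C receives all of B's rank
v = power_method(M)
# v approximates the stationary distribution (0.2, 0.4, 0.4); entries sum to 1.
```

Since M is column-stochastic, each multiplication preserves the vector's total mass, and the iterate converges to the eigenvector for eigenvalue 1 at a rate set by the second-largest eigenvalue magnitude.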

Calculating PageRank using Linear Algebra, Cont'd
For those interested in the actual PageRank calculation and implementation process (involving heavier linear algebra), please see the "Additional Resources" slide.

Disadvantages and Problems
- Rank sinks: occur when pages get into infinite link cycles.
- Spider traps: a group of pages is a spider trap if there are no links from within the group to outside the group.
- Dangling links: a page contains a dangling link if the hypertext points to a page with no outgoing links.
- Dead ends: simply pages with no outgoing links.
Solution to all of the above: by introducing a damping factor, the figurative random surfer stops trying to traverse the sunk page(s) and will either follow a link at random or teleport to a random node in the network.

Curious Facts
- In 1999, it took Google one month to crawl and build an index of about 50 million pages. In 2012, the same task was accomplished in less than one minute.
- 16% to 20% of the queries asked each day have never been asked before.
- Every query travels, on average, 1,500 miles to a data center and back to return its answer to the user.
- A single Google query uses 1,000 computers to retrieve an answer in 0.2 seconds.

8. Conclusion
- High-quality search: information can be found easily. PageRank allows Google to evaluate the quality of web pages.
- Scalable architecture: the engine must be efficient in both space and time. Google's major data structures make efficient use of available storage, and its crawling, indexing, and sorting operations were efficient enough to build an index of 24 million pages in less than one week.
- Research tool: the data Google has collected has already resulted in many papers submitted to conferences, with many more on the way.

9. Final Exam Questions (1)
Please state the PageRank formula and describe its components.

    PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

- PR(A) = the PageRank of page A.
- C(A) or L(A) = the total number of outgoing links from page A.
- d = the damping factor.

Final Exam Questions (2)
What are the disadvantages and problems of PageRank?
- Rank sinks: occur when pages get into infinite link cycles.
- Spider traps: a group of pages is a spider trap if there are no links from within the group to outside the group.
- Dangling links: a page contains a dangling link if the hypertext points to a page with no outgoing links.
- Dead ends: simply pages with no outgoing links.

Final Exam Questions (3)
What makes ranking optimization hard?
- Link spamming
- Keyword spamming
- Page hijacking and URL redirection
- Intentionally inaccurate or misleading anchor text
- Accurately targeting people's expectations
Additional information: http://en.wikipedia.org/wiki/Spamdexing

Additional Resources
- http://cis.poly.edu/suel/papers/pagerank.pdf - PR via the Split-Accumulate algorithm, merge-sort, etc.
- http://nlp.stanford.edu/ manning/papers/PowerExtrapolation.pdf - PR via power extrapolation; includes benchmarking
- http://www.webworkshop.net/pagerank_calculator.php - a neat little tool for PR calculation with a matrix
- http://www.miislita.com/information-retrieval-tutorial/ [...] matrix-tutorial-3-eigenvalues-eigenvectors.html

Bibliography
- http://www.math.uwaterloo.ca/ hdesterc/websiteW/Data/presentations/pres2008/ChileApr2008.pdf
- Infrastructure diagram and explanations from last year's slides
- Google query steps from last year's slides
- http://portal.acm.org/citation.cfm?id=1099705
- http://www.springerlink.com/content/60u6j88743wr5460/fulltext.pdf?page=1
- http://www.ianrogers.net/google-page-rank/
- http://www.seobook.com/microsoft-search-browserank-research-reviewed
- http://www.webworkshop.net/pagerank.html
- http://en.wikipedia.org/wiki/PageRank
- http://pr.efactory.de/e-pagerank-distribution.shtml
- http://www.cs.helsinki.fi/u/linden/teaching/irr06/drafts/petteri huuhka google draft.pdf
- http://www-db.stanford.edu/ backrub/pageranksub.ps