Search Xin Liu

Searching the Web for Information
How a Search Engine Works. Basic parts:
Crawler: visits sites on the Internet, discovering Web pages
Indexer: builds an index to the Web's content
Query processor: looks up user-submitted keywords in the index and reports back which Web pages the crawler has found containing those words
Popular search engines: Google, Yahoo!, Bing, Wolfram|Alpha, etc.
Wolfram|Alpha makes systematic knowledge immediately computable by anyone: enter a question or calculation, and Wolfram|Alpha uses its built-in algorithms and a growing collection of data to compute the answer. It is based on a new kind of knowledge-based computing. Example queries:
sin(x)*sin(x) + cos(x)*cos(x)
int cos(x)
What is the GDP of France?
Internet users of Europe
LDL 150
Life expectancy of male, 25, American
http://www.wolframalpha.com/ has a good introduction video.

Text-based vs. knowledge-based
Keywords vs. natural language
Automated search vs. human intervention

Source
The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://infolab.stanford.edu/~backrub/google.html
How Internet Search Engines Work, http://www.howstuffworks.com/search-engine.htm

System Anatomy
Source: "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page

Architecture Overview
A URLServer sends lists of URLs to be fetched by a set of crawlers.
Fetched pages are given a docID and sent to a StoreServer, which compresses and stores them in a repository.
The indexer extracts pages from the repository and parses them, classifying their words into hits. A hit records the word, its position in the document, and an approximation of its font size and capitalization. These hits are distributed into "barrels", or partially sorted indexes.
The indexer also builds the anchors file from the links on each page, recording where each link points to and from.
The URLResolver reads the anchors file, converting relative URLs into absolute ones.

Architecture Overview (cont.)
The forward index is updated with the docIDs the links point to.
The links database is also created, as pairs of docIDs; this is used to calculate PageRank.
The sorter takes the barrels and sorts them by wordID, producing the inverted index. A list of wordIDs points to different offsets in the inverted index; this list is converted into a lexicon (vocabulary).
The searcher is run by a Web server and uses the lexicon, inverted index, and PageRank to answer queries.

Repository
Contains the actual HTML, compressed about 3:1 using the open-source zlib library.
Documents are stored as variable-length records, one after another.
Independent of the other data structures; the other data structures can be rebuilt from the repository.
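The repository's compression step can be sketched with Python's standard `zlib` module; the sample page below is made up, and real compression ratios depend on the page content (the paper reports roughly 3:1 on HTML).

```python
import zlib

# Round-trip a small "page" through zlib, as the repository does for
# each fetched document. The HTML here is an illustrative stand-in.
html = b"<html><body>" + b"hello web " * 100 + b"</body></html>"
packed = zlib.compress(html)

assert zlib.decompress(packed) == html  # lossless round trip
print(len(html), "->", len(packed))     # compressed is much smaller
```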

Index
Document index: an indexed sequential file (ISAM: Indexed Sequential Access Method) with status information about each document. Each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics.
Forward index: sorted by docID; stores docIDs pointing to hits; kept in the partially sorted indexes called "barrels".
Inverted index: stores wordIDs and references to the documents containing them.
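The forward/inverted relationship can be shown with a toy example. The documents and words below are invented; the real system maps words to wordIDs and stores hit lists rather than bare docIDs.

```python
# Toy forward index (docID -> words) and the inverted index
# (word -> docIDs) derived from it by a single pass.
forward_index = {
    1: ["web", "search", "engine"],
    2: ["web", "crawler"],
    3: ["search", "ranking"],
}

inverted_index = {}
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index.setdefault(word, []).append(doc_id)

print(inverted_index["web"])     # [1, 2]
print(inverted_index["search"])  # [1, 3]
```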

Lexicon
The list of words, kept in 256 MB of main memory, holding 14 million words plus a hash table of pointers. At this size, the lexicon can fit in memory for a reasonable price.

Hit Lists
Record the occurrences of a word in a document, plus details: position, font, and capitalization information.
Account for most of the space used.
Fancy hits: occurrences in a URL, title, anchor text, or <meta> tag.
Plain hits: everything else.
The details are packed into compact bit fields:
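A plain hit's bit layout can be sketched as follows. The field widths (1 bit capitalization, 3 bits font size, 12 bits position, 16 bits total) follow the description in the Brin & Page paper; the helper function names are our own.

```python
# Pack a plain hit into 16 bits: 1 bit capitalization,
# 3 bits font size, 12 bits word position.
def pack_hit(capitalized: bool, font_size: int, position: int) -> int:
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit: int):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_hit(True, 3, 42)
print(unpack_hit(hit))  # (True, 3, 42)
```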

Crawlers
When a crawler visits a website, it:
First identifies all the links to other Web pages on that page
Checks its records to see if it has visited those pages recently
If not, adds them to the list of pages to be crawled
Records in an index the keywords used on the page
Different pages need to be visited at different frequencies.
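The crawl loop above can be sketched as a frontier queue plus a visited set. The hard-coded link graph stands in for fetching and parsing real pages; all page names are illustrative.

```python
from collections import deque

# Stand-in for "fetch page, extract links": a fixed link graph.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["a.html"],
    "c.html": ["d.html"],
    "d.html": [],
}

def crawl(seed):
    frontier, visited = deque([seed]), set()
    while frontier:
        page = frontier.popleft()
        if page in visited:          # already visited recently: skip
            continue
        visited.add(page)            # a real crawler would index keywords here
        for url in links.get(page, []):
            if url not in visited:   # new page: add to the crawl list
                frontier.append(url)
    return visited

print(sorted(crawl("a.html")))  # all four pages reached
```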

Crawler
How do you find/update all pages? How fast should you update?
Current classification (Google):
News, blogs, regional news (real-time)
Wikis (semi-real-time)
Core websites (e.g., the UCD homepage)
Personal homepages (the long tail: lots of them, each individually less important)
More document types: images, videos, maps, offline material
Google can crawl very fast; that is not the problem. The crawler must hold back so that it does not affect a website's traffic too much (e.g., if 50% of a site's traffic comes from the search engine's crawler, that is too much).

Building the index
A search engine index is much like the index listings in the back of a book: for every word, the system must keep a list of the URLs that word appears in.
Relevance ranking: location is important, e.g., whether a word appears in titles, subtitles, the body, meta tags, or in anchor text. Some engines ignore insignificant words (e.g., "a", "an"); some try to be complete. Meta tags are important.
Storing these lists efficiently is quite hard. For example, the word "flamingo" appears in about 492,000 of the roughly 2 billion pages known to Google. At an average of 1,000 words per page, engines have to be very careful to use techniques such as storing information in RAM: it would take 8 months to check for that word if everything were on disk. Google knows about 3 billion web documents, including images, PDFs and other file formats, Usenet newsgroups, and news.
Once Google has matched a word in the index, it wants to put the best document first. It chooses the best document using a number of techniques:
Text analysis: evaluating documents based on matching words, font size, proximity, and over 100 other factors
Links & link text: external links are a somewhat independent guide to what's on a page
PageRank: a query-independent measure of the quality of both pages and sites; an equation that tries to indicate how often a truly random surfer, following links without any thought, would end up at a particular page
Google has a "wall" between search relevance ranking and advertising: no one can pay to be the top listing in the results, but sponsored link spots are available.
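The idea of combining a query-dependent text score with query-independent PageRank can be illustrated with a minimal sketch; the weights, term frequencies, and PageRank values below are made up, and real engines combine far more factors.

```python
# Toy ranking: blend a crude text-match score with a precomputed
# PageRank. The 0.7/0.3 weights are purely illustrative.
def score(doc, query_terms):
    text = sum(doc["tf"].get(t, 0) for t in query_terms)
    return 0.7 * text + 0.3 * doc["pagerank"]

docs = [
    {"id": 1, "tf": {"flamingo": 3}, "pagerank": 0.2},
    {"id": 2, "tf": {"flamingo": 1}, "pagerank": 0.9},
]
ranked = sorted(docs, key=lambda d: score(d, ["flamingo"]), reverse=True)
print([d["id"] for d in ranked])  # [1, 2]: the stronger text match wins
```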

Indexing the Web
Parser: must be validated to expect and deal with a huge number of special situations: typos in HTML tags, kilobytes of zeros, non-ASCII characters.
Indexing into barrels: word → wordID → update lexicon → hit lists → forward barrels. There is high contention for the lexicon.
Sorting: the inverted index is generated from the forward barrels.

Query Processor
Ranking the results:
PageRank: query-independent
Relevance: query-dependent; font, position, and capitalization are used
Give users what they want, not what they type:
Spelling correction
Delivering localized results globally
All done in 0.25 s!

Query Processor
Gets keywords from the user and looks them up in its index.
Even if a page has not yet been crawled, it might be reported, because a crawled page links to it and the keywords appear in the anchor text on the crawled page.

Page Ranking
PageRank is a weighted voting system: a link from page A to page B is a vote, by page A, for page B, weighted by the importance of the voting page.
You can see the PageRank in the browser. Try cnn.com and yahoo.com; compare ucdavis.edu and cs.ucdavis.edu.
PageRank relies on the uniquely democratic nature of the web, using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important". Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search.
Of course, important pages mean nothing to you if they don't match your query, so Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine whether it is a good match for your query.

Page Ranking
Google's idea: PageRank
Orders links by relevance to the user.
Relevance is computed by counting the links to a page: the more pages link to a page, the more relevant that page must be.
Each page that links to another page is considered a "vote" for that page.
Google also considers whether the "voting page" is itself highly ranked.

YouTube is more popular than Google as a portal.

Backlinks
Link structure of the Web: every page has some number of forward links and backlinks.
Pages with many backlinks are more important than those with few.
PageRank is an approximation of importance/quality.

PageRank
Pages with lots of backlinks are important.
Backlinks coming from important pages convey more importance to a page.
Example: B, C, D → A. In the simplest model, PR(A) = PR(B) + PR(C) + PR(D). But if B also links to C (B → A, B → C), B's rank is divided among its outlinks, so B contributes only PR(B)/2 to A.
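The rank-splitting rule in the example above can be checked numerically. The starting rank values here are arbitrary illustrative numbers.

```python
# B -> {A, C}, C -> A, D -> A: each page divides its rank among its
# outlinks, so A receives PR(B)/2 + PR(C) + PR(D).
pr = {"B": 0.4, "C": 0.2, "D": 0.2}
outdeg = {"B": 2, "C": 1, "D": 1}

pr_a = pr["B"] / outdeg["B"] + pr["C"] / outdeg["C"] + pr["D"] / outdeg["D"]
print(round(pr_a, 2))  # 0.6
```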

Rank Sink
A cycle of pages pointed to by some incoming link. Problem: the loop accumulates rank but never distributes any rank outside.
Dangling nodes (pages with no outgoing links) pose a similar problem.

Damping factor
From the original paper, with damping factor d:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where T1, ..., Tn are the pages linking to A and C(T) is the number of links going out of page T.

Random Surfer Model
PageRank corresponds to the probability distribution of a random walk on the web graph.
The (1 - d) term can be rephrased: the random surfer periodically gets bored and jumps to a different page, so it is not kept in a loop forever.

How to solve it? By iteration in practice. d = 0.85 in the original paper: a tradeoff between the speed of convergence and the impact of the random surfer.
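The iteration can be sketched directly. The update rule below uses the common normalized form PR(p) = (1-d)/N + d·Σ PR(q)/C(q) over pages q linking to p, so the ranks sum to 1; the three-page graph is an invented example.

```python
# Power iteration for PageRank with damping factor d = 0.85.
def pagerank(links, d=0.85, iters=50):
    n = len(links)
    pr = {p: 1.0 / n for p in links}          # start uniform
    for _ in range(iters):
        new = {}
        for p in links:
            # rank flowing into p from every page q that links to it
            incoming = sum(pr[q] / len(links[q])
                           for q in links if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # A collects the most rank here
```

With no dangling nodes, each iteration preserves the total rank of 1, and 50 iterations are far more than needed for a graph this small.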

Convergence
Calculated through iterations. Does it converge? Is the result unique? How fast is the convergence, if it happens?

Search
How a Search Engine Works. Basic parts: crawler, indexer, query processor.
Which part is the most challenging? In my opinion, the query processor: how to return the most relevant results. Defining "relevant" is difficult, and it differs between people and between situations. What is relevant? You know it when you see it.

Challenges
Personalized search
Search engine optimization; Google bombs. Is PageRank out of date?
Data fusion
Scalability and reliability; parallelization
Security and privacy
Complexity of the network and materials
It is estimated that Google has a few million servers; the monthly utility bill is over several million dollars. The focus is on good processing power per dollar instead of per-machine processing power. Google is the largest content distribution network.
In Feb. 2009, there was about an hour when all links from Google search were marked "this link may cause harm to your computer."
The terms "Google bomb" and "Googlewashing" refer to practices intended to influence the ranking of particular pages in results returned by the Google search engine, often with humorous or political intentions. Before 2007, Google's search-rank algorithm could rank a page higher if enough other sites linked to that page using similar anchor text (linking text such as "miserable failure").
Ad matching can go wrong: e.g., for a news story on a murder by a roommate, an ad for finding a roommate is a "bad" match.
Search in other languages has different types of challenges, e.g., Chinese.

Look forward
Organize the world's information and make it universally accessible and useful (Google).
Compute information/knowledge (Wolfram).
Mobile search
Voice search

Practical search tips
Choosing the right terms and knowing how the search engine will use them.
Words or phrases? Search engines generally consider each word separately; ask for an exact phrase by placing quotation marks around it.
Proximity is used in deciding relevance.

Logical Operators: AND, OR, NOT
AND: tells the search engine to return only pages containing both terms, e.g., Thai AND restaurants
OR: tells the search engine to find pages containing either word, including pages where both appear
NOT: excludes pages with the given word
AND and OR are infix operators; they go between the terms. NOT is a prefix operator; it precedes the term to be excluded.
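Over an inverted index, these operators reduce to set operations on posting lists. The posting lists below are invented for illustration.

```python
# AND / OR / NOT as set operations on posting lists (sets of docIDs).
postings = {
    "thai": {1, 2, 5},
    "restaurants": {2, 3, 5},
    "music": {3, 4},
}

thai_and_rest = postings["thai"] & postings["restaurants"]    # AND
thai_or_rest = postings["thai"] | postings["restaurants"]     # OR
rest_not_music = postings["restaurants"] - postings["music"]  # NOT

print(sorted(thai_and_rest))   # [2, 5]
print(sorted(thai_or_rest))    # [1, 2, 3, 5]
print(sorted(rest_not_music))  # [2, 5]
```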

Google search tips
Case does not matter: Washington = WASHINGtOn (but OR must be upper case)
Automatic "and": Davis and restaurant = Davis restaurant
Exclusion of common words: where/how/and, etc. do not count; enforce with a + sign (e.g., star war episode +1) or with a phrase search (e.g., "star war episode I")
Phrase search: "star war episode I"
Exclude: Bass -music (no space between - and music)
OR: restaurant Davis Thai OR Chinese
Domain: ECS15 site:ucdavis.edu
Synonym search: ~food ~facts (try it)
Number-range search: DVD player $50..$100
Links: http://www.google.com/help/basics.html and http://www.google.com/help/refinesearch.html

What is your favorite tip?