Presentation transcript:

Search Engines 1: Searching the Web
The Web is vast. Information is scattered and changes fast, and anyone can publish on the Web. Two issues Web users often have to deal with:
– How to locate information on the Web?
– What is the quality of the information located?

Search Engines 2: Searching the Web
The Web is different from traditional information sources:
– Size: about a billion pages.
– Average page size: 5-10 KB.
– Textual data: tens of terabytes.
– Around 2000, the size of the Web was doubling every 2 years.
– 40% of the pages in the .com domain change daily.
– In 10 days, half the pages are gone.
Traditional information retrieval methods cannot be used.

Search Engines 3: Introduction
Two main approaches to searching for information on the Web have evolved, directories and search engines, with meta-search engines built on top of search engines.
– Directories organize information on the Web into a hierarchy of topics and subtopics.
– Search engines allow users to submit a query and use it to search their databases.
– Meta-search engines submit a query to more than one search engine.

Search Engines 4: Web Directories
– Hyperlinks to web pages, organized as a hierarchy of topics and sub-topics.
– Directories can be general or specialized.
– Very easy to use: the user does not need to know exactly what he is searching for.
– Specialized directories are built by experts in the subject.

Search Engines 5
A search engine is a computer program that does the following:
– Accepts a query from the user.
– Searches its database to match the query.
– Collects and returns the URLs of pages containing information that matches the query.
– Permits the user to revise and resubmit a query.
A search engine can be general or specialized.

Search Engines 6: Meta-Search Engines
– A meta-search engine calls more than one search engine to do the searching.
– Search results are collated into one list or presented separately.
– Advantage: many engines can be accessed with a single query.
– Disadvantage: many non-interesting pages are returned.
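As an illustration of collating results into one list, here is a minimal sketch that interleaves the ranked lists returned by several engines and drops duplicates. The function name `collate` and the round-robin policy are assumptions made for the example; the slides do not say how any particular meta-search engine merges results.

```python
from itertools import zip_longest

def collate(result_lists):
    """Interleave several ranked URL lists into one list without duplicates.

    Takes the first hit of every engine, then the second of every engine,
    and so on, keeping only the first occurrence of each URL.
    """
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Two hypothetical engines returning overlapping results.
merged = collate([["a", "b", "c"], ["b", "d"]])
```

Presenting the results separately would simply skip this step and show each list under the name of its engine.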

Search Engines 7: Querying Search Engine...
– Pattern-matching query: given a keyword or a group of keywords, the engine returns URLs of pages “containing these words”.
– Example: ice hockey versus “ice hockey” (quoted to match the exact phrase).
– Common words such as a, an, the, of are ignored; prefixed with a plus sign (+a, +an, etc.) they are not.
– Phrases: “University of New Hampshire” and “men’s ice hockey”.
– Stemming.
– Wildcards.
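The query conventions above (quoted phrases, ignored common words, the + prefix) can be sketched as a small parser. This is only an illustration: the stop-word list and the name `parse_query` are hypothetical, straight double quotes stand in for the curly quotes on the slide, and stemming and wildcards are left out.

```python
import re

# Hypothetical stop-word list; real engines maintain their own.
STOP_WORDS = {"a", "an", "the", "of"}

def parse_query(query):
    """Split a query into quoted phrases and individual keywords.

    Stop words are dropped unless the user forces them with a leading '+'.
    """
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]+"', " ", query)  # remove the phrases already taken
    keywords = []
    for token in rest.split():
        if token.startswith("+") and len(token) > 1:
            keywords.append(token[1:].lower())   # forced word, kept as-is
        elif token.lower() not in STOP_WORDS:
            keywords.append(token.lower())
    return phrases, keywords
```

For example, `parse_query('"ice hockey" +the rules')` yields one phrase and two keywords, while `parse_query('the ice hockey')` drops `the`.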

Search Engines 8: Search Engine: Working
Search engines perform the following basic tasks:
– They search the internet based on important words.
– They keep an index of the words they find and where they find them.
– They allow users to look for words or combinations of words found in the index.
Engines index billions of pages and respond to tens of millions of user queries per day.
Before the Web, programs like Gopher and Archie kept indexes of files stored on servers.

Search Engines 9: Search Engine: Working
A search engine consists of the following components:
– User interface: users type in a query and the search results are displayed.
– Searcher: searches the database.
– Page ranker: assigns relevancy scores to the information retrieved from the database.
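The searcher and page ranker can be pictured with a toy lookup over an index. The sketch below assumes a hypothetical index shape, `{word: {url: score}}`, and a simple policy (a page must contain every query word, and scores are summed); neither is taken from the slides.

```python
def search(index, query_words):
    """Return URLs that contain all query words, best combined score first.

    index: {word: {url: score}}; ties are broken alphabetically by URL.
    """
    if not query_words:
        return []
    postings = [index.get(word, {}) for word in query_words]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)          # keep pages containing every word
    scored = {url: sum(p[url] for p in postings) for url in matches}
    return sorted(scored, key=lambda url: (-scored[url], url))

toy_index = {
    "ice": {"u1": 3, "u2": 1},
    "hockey": {"u1": 5, "u3": 2},
}
```

With this toy index, a query for both words returns only `u1`, while a query for `ice` alone returns `u1` before `u2`.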

Search Engines 10: Search Engine: Working
A search engine’s database is built with the following components:
– Gatherer (also called spider, worm, crawler): traverses the web to collect information.
– Indexer: classifies the data gathered by the gatherer and creates an index.

Search Engines 12: Gatherer or Spider
– Multiple spiders (3 or 4 or more) browse the web, downloading pages into the page repository.
– For example, a very early version of Google, using 4 spiders, would crawl 100 pages per second, generating about 600 KB of data per second.
– Spiders start with a set of URLs to visit and download pages from.
– Spiders extract the URLs in the downloaded pages and pass them on to a control module.

Search Engines 13: Gatherer or Spider
– The control module determines which URLs the spider should visit next, using breadth-first or depth-first search.
– Sometimes a web site does not want a spider to access and index its pages; this is indicated in the meta tags.
– A spider’s task is never complete: spiders go on crawling.
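The interplay between spider and control module can be sketched as a breadth-first traversal. To keep the example runnable without network access, a small dictionary (`TOY_WEB`, a made-up link graph) stands in for downloading pages and extracting their URLs; the names `crawl` and `fetch_links` are likewise assumptions for illustration.

```python
from collections import deque

# Toy link graph standing in for the web: URL -> URLs found on that page.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["d.example"],
    "c.example": ["d.example", "a.example"],
    "d.example": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    frontier = deque(seeds)   # the control module's queue of URLs to visit
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()   # frontier.pop() would explore depth-first
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
```

A real gatherer would also honor each site's crawling restrictions and limit the request rate per site, which this sketch leaves out.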

Search Engines 14: Gatherer or Spider
There are a number of issues to be considered:
– Which pages to crawl and download?
– Which pages to refresh after downloading them, and at what frequency?
– How should the load on a web site be minimized?
– How should the crawling process be parallelized?

Search Engines 15: Indexer
– Pages collected must be indexed.
– Simplest way to index: (word, URLs where the word was found).
– Problem with this approach: there is no way to tell how the word was used on the page, importantly or trivially, just once or many times, so ranking pages becomes difficult.
– In practice, more information is stored with a word, e.g. the number of times it occurs and a weight depending on where it was found (a word in the title is given a higher weight).
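A minimal sketch of an index that stores more than bare (word, URL) pairs: each entry keeps a weighted count, with title words counting more than body words. The weight value and the input shape `{url: (title, body)}` are assumptions chosen for the example.

```python
from collections import defaultdict

TITLE_WEIGHT = 3  # hypothetical weight for a word that appears in the title

def build_index(pages):
    """pages: {url: (title, body)} -> {word: {url: weighted count}}."""
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        for word in title.lower().split():
            index[word][url] = index[word].get(url, 0) + TITLE_WEIGHT
        for word in body.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

pages = {
    "u1": ("Ice Hockey", "hockey scores and hockey news"),
    "u2": ("Weather", "ice on the roads"),
}
index = build_index(pages)
```

Here `hockey` scores 5 for `u1` (once in the title, twice in the body), so a ranker can tell an important use of a word from a passing mention.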

Search Engines 16: Indexer
– Indexers note the words on the page and where the words were found.
– Some spiders may ignore common words like ‘a’, ‘an’, ‘the’ (Google); some do not ignore the common words (AltaVista).
– Some keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text.

Search Engines 17: Indexer
Words occurring in the title, subtitles, meta tags and other important positions are given special consideration.

Search Engines 18: Indexer
Indexers build at least (i) a text index and (ii) a structure index or link index, keeping in view the key problem: “to find pages most relevant to the query.”
– Text index (or inverted index): (index term, sorted list of locations).
– Location: page id + location on page + other information about the occurrence (e.g. font, heading, anchor text), called the payload.
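The text index described above can be sketched as a mapping from each term to a sorted list of locations. In this illustration a location is a `(page_id, position, context)` tuple, where `context` plays the role of the payload; the tuple layout and the input format are assumptions, not a description of any real engine.

```python
from collections import defaultdict

def build_text_index(pages):
    """Build {term: sorted list of (page_id, position, context)}.

    pages: {page_id: [(word, context), ...]} with words in document order.
    """
    index = defaultdict(list)
    for page_id, tokens in pages.items():
        for position, (word, context) in enumerate(tokens):
            index[word.lower()].append((page_id, position, context))
    for locations in index.values():
        locations.sort()   # by page id, then by position on the page
    return index

text_index = build_text_index({
    2: [("Search", "title"), ("engine", "body")],
    1: [("engine", "anchor")],
})
```

Keeping the location lists sorted lets a searcher intersect the lists for several terms in a single pass.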

Search Engines 19: Indexer
– Structure index (or link index): built by viewing the web as a directed graph, with pages as nodes and hyperlinks as edges.
– For each page, incoming and outgoing links are stored; this is the neighborhood information.
– This information is also used by ranking algorithms.
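A link index of this kind can be sketched with two adjacency maps, one per direction. The function name and the edge-list input are assumptions for illustration.

```python
from collections import defaultdict

def build_link_index(edges):
    """edges: iterable of (source, target) hyperlinks.

    Returns (outgoing, incoming), each mapping a page to a set of pages.
    """
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for source, target in edges:
        outgoing[source].add(target)
        incoming[target].add(source)
    return outgoing, incoming

outgoing, incoming = build_link_index([("A", "B"), ("A", "C"), ("B", "C")])
```

With both maps available, a ranking algorithm can ask for the pages that point to a page as cheaply as for the pages it points to.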

Search Engines 20: Indexer
– In the text index, the index term plus the additional information is encoded into a bit string to save space. The same is done in other indexes.
– An early version of Google used a 2-byte code.
– Challenges: how to handle an index of a billion pages? How to deal with index rebuilds?
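To make the idea of a 2-byte code concrete, the sketch below packs one word occurrence into 16 bits. The field layout (1 capitalization bit, 3 font-size bits, 12 position bits) is an assumption chosen for illustration and should not be read as the exact format of any engine.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one occurrence into 16 bits: 1 + 3 + 12 (illustrative layout)."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)   # positions past 12 bits are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(code):
    """Inverse of pack_hit: returns (capitalized, font_size, position)."""
    return bool(code >> 15), (code >> 12) & 0b111, code & 0xFFF

code = pack_hit(True, 5, 300)
as_bytes = code.to_bytes(2, "big")   # the two bytes as stored on disk
```

The trade-off is visible in the clamp: a fixed-width code saves space but cannot distinguish positions beyond what its bits can hold.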

Search Engines 21: Ranking
Ranking is needed because:
– The Web is vast: pages found containing the query words may be of poor quality and not relevant.
– Pages are not self-descriptive. E.g. the query words “search engine” do not yield the home pages of common search engines, because those pages do not contain the words “search engine”.
– Spamming.

Search Engines 22: Ranking
– The link structure of the web contains important information that can be used to filter or rank web pages.
– A link from page A to page B can be considered a recommendation of page B by the author of A, so a page with many links pointing to it should get a higher ranking.
– Two algorithms are based on this: (i) PageRank, (ii) HITS.

Search Engines 23: Ranking: PageRank
– The importance of a page is based on the “number of pages on the web pointing to it.”
– For example, the Yahoo! homepage is more important than the KSU homepage.
– The idea was proposed by Lawrence Page and Sergey Brin, the creators of Google.

Search Engines 24: Ranking: PageRank
– Therefore, Rank(page P) = number of other pages pointing to it.
– This is too simple: spamming is a problem, since any number of pages can be created artificially to point to a page.
– PageRank extends the basic idea by considering the importance of the pages pointing to a given page. Thus a page is more important if Yahoo! points to it.

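Search Engines 25: Ranking: PageRank
Simple PageRank: let the pages on the web be 1, 2, …, m.
– N(i): number of outgoing links from page i.
– B(i): set of pages that point to page i (the incoming links).
– r(i): PageRank of page i.
PageRank of page i:
$$r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$$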
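A minimal sketch of simple PageRank, assuming the usual definition in which each page j shares its rank r(j) equally among its N(j) outgoing links and r(i) is the sum of the shares arriving at page i. The sketch iterates that rule from a uniform start; it assumes every page has at least one outgoing link and omits the refinements practical versions add for pages without links.

```python
def simple_pagerank(links, iterations=100):
    """links: {page: list of pages it links to}.

    Assumes every page appears as a key and has at least one outgoing link.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}   # uniform start
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for j, targets in links.items():
            share = rank[j] / len(targets)   # r(j) / N(j)
            for i in targets:
                new_rank[i] += share         # accumulate over j in B(i)
        rank = new_rank
    return rank

rank = simple_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

On this three-page example the ranks settle near 0.4 for A and C and 0.2 for B: C collects links from both other pages and passes all of its rank on to A.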

Search Engines 26: Ranking: HITS
– HITS: hypertext induced topic search.
– Does not assign a global rank to every page: the HITS algorithm is query dependent.
– Distinguishes authority pages and hub pages, and assigns two scores: an authority score and a hub score.
– The basic idea is to identify a small sub-graph of the web (depending on the user query) and apply link analysis to it to locate the authorities and hubs.

Search Engines 27: Ranking: HITS
The algorithm has two parts:
– Identifying the focused sub-graph.
– Performing link analysis on it.
The focused sub-graph is generated by forming a root set R, obtained from the text index.
R = {a random set of pages containing the given query string}
Focused set = {R + pages in the neighbourhood of R}

Search Engines 28: Ranking: HITS
Algorithm HITS:
1. R ← set of t pages that contain the query terms.
2. S ← R.
3. For each page p ∈ R:
   (a) Include a maximum of d pages that p points to in S.
4. The graph induced by S is the focused sub-graph.

Search Engines 29: Ranking: HITS
This algorithm takes the query string, t and d as input parameters.
– t: limits the size of the root set.
– d: limits the number of pages added to the sub-graph.
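The construction of the focused sub-graph can be sketched as follows. The code follows the steps as they are listed in the slides (only the pages that p points to are added, at most d per root page); the function name and the `out_links` input are assumptions, and the root set of size t is taken as given rather than looked up in a text index.

```python
def focused_subgraph(root_set, out_links, d):
    """Expand a root set into the focused sub-graph.

    root_set: list of pages matching the query (at most t pages).
    out_links: {page: list of pages it points to}.
    Returns (S, edges), where edges are the links with both ends in S.
    """
    S = set(root_set)                              # step 2: S <- R
    for p in root_set:                             # step 3
        S.update(out_links.get(p, [])[:d])         # at most d pages p points to
    edges = {(u, v) for u in S for v in out_links.get(u, []) if v in S}
    return S, edges                                # step 4: induced graph

toy_links = {"r1": ["x", "y", "z"], "r2": ["x"], "x": ["r1"], "y": []}
S, edges = focused_subgraph(["r1", "r2"], toy_links, d=2)
```

With d = 2, page z is left out because r1 already contributed x and y, which shows how d keeps the sub-graph small.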

Search Engines 30: Ranking: HITS
Link analysis identifies the hubs and authorities from the expanded set S.
– Let the pages in the focused sub-graph S be 1, 2, …, n.
– B(i): set of pages that point to page i.
– F(i): set of pages that page i points to.
The algorithm produces an authority score $a_i$ and a hub score $h_i$ for each page in S, starting from arbitrary initial values of $a_i$ and $h_i$.

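Search Engines 31: Ranking: HITS
Each iteration performs two steps, I and O.
I step:
$$a_i = \sum_{j \in B(i)} h_j$$
O step:
$$h_i = \sum_{j \in F(i)} a_j$$
The scores are normalized with:
$$\sum_{i} a_i^2 = 1, \qquad \sum_{i} h_i^2 = 1$$
The algorithm repeats until the values of $a_i$ and $h_i$ converge.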
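A minimal sketch of the iteration, assuming the standard form of the two steps: in the I step a page's authority score becomes the sum of the hub scores of the pages in B(i), in the O step its hub score becomes the sum of the authority scores of the pages in F(i), and both score vectors are then scaled so that their squares sum to 1. A fixed number of iterations stands in for a convergence test.

```python
import math

def hits(edges, iterations=50):
    """edges: list of (source, target) links of the focused sub-graph.

    Returns (authority, hub) score dictionaries.
    """
    nodes = {node for edge in edges for node in edge}
    B = {node: [] for node in nodes}   # B[i]: pages that point to i
    F = {node: [] for node in nodes}   # F[i]: pages that i points to
    for source, target in edges:
        F[source].append(target)
        B[target].append(source)
    a = {node: 1.0 for node in nodes}  # arbitrary initial values
    h = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        a = {i: sum(h[j] for j in B[i]) for i in nodes}   # I step
        h = {i: sum(a[j] for j in F[i]) for i in nodes}   # O step
        a_norm = math.sqrt(sum(x * x for x in a.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {i: x / a_norm for i, x in a.items()}
        h = {i: x / h_norm for i, x in h.items()}
    return a, h

authority, hub = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

In this small graph a1 ends up as the strongest authority because both hubs point to it, and h1 as the strongest hub because it points to both authorities.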