Computer Science 1000 Information Searching II Redistribution of these slides without permission is strictly prohibited.

Search Engine a collection of computer programs designed to help us find information on the Web typically served through a website different search providers exist, but basic functionality is consistent type keywords into a text box page returns links to other pages

Search Engine why is a search engine like an index? recall that an index maps keywords to a location in some medium (like a page number in a book) a search engine does a very similar thing takes keywords of interest from a user maps these keywords to relevant web pages in fact, one of the key components of a search engine is its index

Search Engine what differentiates a search engine from other indexes (like a book index)? the ability to quickly combine keywords in searches e.g. search for information on ducks and foxes result ranking personalization among others …

Search Engine – How it Works different search engines employ different technologies the full details of commercial search engines are typically not public however, some of the basics are consistent crawling indexing query processing

Crawling for a search engine to be able to link to a web page, it must know about its existence search engines find pages by crawling the web programs called crawlers or spiders e.g. Googlebot a crawler visits web pages, in much the same way that you do as each page is visited, information is remembered about the page (indexing)

Crawling – Todo List the todo list is the list of pages that are still to be visited by the crawler the crawling process starts with an initial todo list, populated with sites from previous crawls however, the list is updated as the crawl takes place hyperlinks on visited sites are added to the list

Crawling – Example suppose that this page was being processed by a crawler Kev's Page Favorite Stuff: New York Islanders Saskatchewan Roughriders John Deere as a consequence of this page being crawled, its links would be added to the todo list (if they aren't already there) those pages would subsequently be checked by the crawler at some point

The "Invisible Web" not all information is crawled, which means it is not visible to search engines some pages are new, and haven't yet had a chance to be crawled however, there are other reasons that certain information does not get crawled

The "Invisible Web" 1) No hyperlinks to that page recall that in order for a page to be crawled, it must either: be on the todo list, or be linked to from a page that appears on the todo list without a hyperlink, that page will never be found [Figure: a todo list (Page 1, Page 2, Page 3) beside a set of web pages (Page 1 to Page 6) and the links between them] Page 5 will not be crawled, as it is not on the todo list, and no other pages link to it.
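The todo-list process and the unlinked-page case can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine works: the link graph `LINKS`, the page names, and the `crawl` function are all hypothetical, and the links do not reproduce the slide's figure.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
# Page names and links are illustrative assumptions only.
LINKS = {
    "page1": ["page4"],
    "page2": ["page3", "page6"],
    "page3": ["page6"],
    "page4": [],
    "page5": ["page1"],  # links out, but no page links to it
    "page6": ["page2"],
}

def crawl(seed_pages, links):
    """Visit pages in todo-list order, adding newly found links as we go."""
    todo = deque(seed_pages)   # the todo list
    seen = set(seed_pages)     # pages that are (or were) on the todo list
    visited = []
    while todo:
        page = todo.popleft()
        visited.append(page)   # a real crawler would fetch and index here
        for target in links.get(page, []):
            if target not in seen:   # add only if not already listed
                seen.add(target)
                todo.append(target)
    return visited

visited = crawl(["page1", "page2", "page3"], LINKS)
unreached = sorted(set(LINKS) - set(visited))
print(visited)
print(unreached)
```

In this sketch `page5` is never visited: it is not in the initial todo list and no crawled page links to it, which is the situation the slide describes.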

The "Invisible Web" 2) The Page is synthetic a synthetic page is created on demand, depending on user input e.g. the results of a search on another search engine My personal search for "New York Islanders" on Bing results in an on-demand page that is not stored. Hence, it will not be crawled.

The "Invisible Web" 3) The content is unreadable to the crawler search engines are primarily text-based certain data, such as movie content, is not crawlable The webpage containing the movie might be crawled, but not the movie itself.

The "Invisible Web" 4) The content is password-protected if you require a password to access a page, then so does a search engine*

The "Invisible Web" 5) You ask the search engine to ignore your site the presence of certain files stored with your website can request that your site not be crawled e.g. the Robots Exclusion Protocol: a file called robots.txt can be stored that requests that your site (or just certain pages) not be indexed unlike the previous four examples, this does not prevent search engines from crawling your site; they can choose to ignore robots.txt
Example:
    User-agent: Google
    Disallow:
    User-agent: *
    Disallow: /
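As a rough illustration of how a well-behaved crawler might consult these rules, the sketch below feeds the slide's example file to Python's standard `urllib.robotparser`. The URL and the `ExampleBot` name are made up for the example, and real crawlers may interpret the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide: the "Google" agent may fetch anything
# (an empty Disallow), every other agent is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URL and crawler name, used only for illustration.
url = "https://example.com/index.html"
google_allowed = parser.can_fetch("Google", url)
other_allowed = parser.can_fetch("ExampleBot", url)
print(google_allowed, other_allowed)
```

Nothing here enforces the rules: the parser only reports what the site has requested, and it is up to the crawler to honour the answer.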

Indexing the primary role of the crawler is to build an index an index is a list of tokens words phrases (not considered here) each token is associated with a list of URLs in other words, like a book index, but with page URLs instead of page numbers other information might be stored with URLs (e.g. page location of token) these indexes are saved by the search provider search queries use information from the indexes (fast), rather than crawling the web for each query (slow)
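A minimal sketch of such an index, assuming tokens are simply whitespace-separated lowercase words. The pages in `PAGES` and the `build_index` function are hypothetical; a real index would also store extra information such as token positions.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (illustrative only).
PAGES = {
    "example.com/pets": "cat and dog care",
    "example.com/cats": "cat breeds",
    "example.com/fish": "red fish",
}

def build_index(pages):
    """Map each token to an alphabetized list of the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)
    return {token: sorted(urls) for token, urls in index.items()}

index = build_index(PAGES)
print(index["cat"])
```

Answering a query then becomes a dictionary lookup on the saved index rather than a fresh crawl of the web.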

Index Lists – Example (figure from the text; figure number might be different)

Indexing – What Makes a Token? page text a common approach search providers differ on which text is selected some may use all text others may only use certain text, such as: titles and headings frequently occurring words words occurring early in a page sometimes, stop words (a, an, the) are ignored hyperlink text the term from a hyperlink on another page may be used to describe the page that it links to
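One way these choices might look in code is sketched below. Everything here is an assumption for illustration: the `tokens` and `page_tokens` helpers, the `titles_only` switch, and the sample page are hypothetical, and the stop-word set contains only the three words named on the slide.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # the stop words named on the slide

def tokens(text, skip_stop_words=True):
    """Split text into lowercase word tokens, optionally dropping stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if skip_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

def page_tokens(title, body, anchor_texts, titles_only=False):
    """Collect tokens from a page's title, the text of hyperlinks that
    point to it, and (unless titles_only is set) its body text."""
    sources = [title] + list(anchor_texts)
    if not titles_only:
        sources.append(body)
    found = set()
    for source in sources:
        found.update(tokens(source))
    return sorted(found)

# Hypothetical page, plus the text of one link pointing at it.
all_text = page_tokens("The Guppy", "A guppy is a small fish.", ["red fish"])
titles = page_tokens("The Guppy", "A guppy is a small fish.", ["red fish"],
                     titles_only=True)
print(all_text)
print(titles)
```

Note that `red` becomes a token for this page only because another page's link text used it, which is the hyperlink-text idea from the slide.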

Query Processing the part of the search engine that we see the query processor: reads words/phrases from the user interface returns pages that are relevant to that query modern query processors: are extremely fast are very accurate allow a considerable variety in their capabilities how does this all work?

Query Processing – How it works let's start simple: suppose we search for a single word (e.g. cat) in a nutshell: the search engine finds the list for the token 'cat' contains list of pages that contain 'cat' in the appropriate text (e.g. title) this list is ranked according to perceived relevance the ranked list is returned as an ordered set of hyperlinks

Query Processing – How it works Step 1: the search engine finds the list for the token 'cat'

Query Processing – How it works Step 2: this list is ranked according to perceived relevance en.wikipedia.org/wiki/Cat

Query Processing – How it works Step 3: the ranked list is returned as an ordered set of hyperlinks en.wikipedia.org/wiki/Cat
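The three steps can be sketched as a lookup followed by a sort. The index contents, the relevance scores, and the `search` function below are invented for illustration; how a real engine scores relevance is not described in these slides at this point.

```python
# Hypothetical index list for one token, and hypothetical relevance scores.
INDEX = {
    "cat": [
        "example.com/cat-food",
        "en.wikipedia.org/wiki/Cat",
        "example.org/blog",
    ],
}
SCORES = {
    "en.wikipedia.org/wiki/Cat": 0.9,
    "example.com/cat-food": 0.4,
    "example.org/blog": 0.1,
}

def search(word, index, scores):
    """Step 1: find the token's list. Step 2: rank it by score.
    Step 3: return the ranked URLs."""
    urls = index.get(word.lower(), [])
    return sorted(urls, key=lambda url: scores.get(url, 0.0), reverse=True)

results = search("cat", INDEX, SCORES)
print(results)
```

A word with no index list simply produces an empty result.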

Query Processing what about multi-word searching? as mentioned, some search engines index phrases as well however, what if a particular phrase is not indexed? e.g. (text) red fish guppy solution: intersecting queries the webpages that are common to all of the search words are returned

Intersecting Queries example (text): suppose the query was “red fish guppy” further suppose that the indexes for each word were as follows:
red: en.wikipedia.org/wiki/red newsroom.urc.edu
fish: en.wikipedia.org/wiki/fish newsroom.urc.edu
guppy: en.wikipedia.org/wiki/guppy
result is the set of sites that contain all of the keywords in other words, the sites that are found on all three lists
Result:
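Conceptually, an intersecting query is a set intersection over the index lists of the query words. The sketch below shows the idea with Python sets; the lists are illustrative assumptions (the `example.com/red-guppy` entry is made up), and it does not address efficiency.

```python
# Hypothetical index lists; example.com/red-guppy is an invented page
# that happens to contain all three words.
INDEX = {
    "red": {"en.wikipedia.org/wiki/red", "newsroom.urc.edu",
            "example.com/red-guppy"},
    "fish": {"en.wikipedia.org/wiki/fish", "newsroom.urc.edu",
             "example.com/red-guppy"},
    "guppy": {"en.wikipedia.org/wiki/guppy", "example.com/red-guppy"},
}

def intersect_query(query, index):
    """Return the URLs that appear on the index list of every query word."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= set(index.get(word, ()))
    return sorted(result)

print(intersect_query("red fish guppy", INDEX))
```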

Intersecting Queries - Efficiency the size of index lists can be large 'cat' returns over 2.3 billion results modern search engines are fast hence, clever algorithms must be developed for optimizing queries example: intersecting queries

Intersecting Queries - Efficiency suppose you had two search terms e.g. red and fish the query processor has a list for each token suppose each list contained 1 billion URLs let's consider a method for performing the intersecting query that is, how do we find all pages that occur on both lists?

The Naive Approach for each entry in the 'red' list search through the entire 'fish' list if we find the entry from the red list, then add that to our result red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result:

The Naive Approach First search: do we find it in second list? yes – add it to result red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result:

The Naive Approach Second search: en.wikipedia.org/wiki/red do we find it in second list? no red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result:

The Naive Approach Third search: newsroom.urc.edu do we find it in second list? yes, add it to list red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Naive Approach Fourth search: do we find it in second list? no red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Naive Approach Fifth search: do we find it in second list? yes – add it to list red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Naive Approach problems? slow!! for each URL in left list, we potentially had to compare it to every URL in right list under our previous assumption (billion size lists), we have to do 1 billion x 1 billion comparisons even for a powerful computer, this would require a considerable amount of time
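A small sketch of the naive approach, with a counter added to make the cost visible. The function name `naive_intersect` and the sample URLs are hypothetical.

```python
def naive_intersect(left, right):
    """For each URL in the left list, scan the right list for a match.
    Returns the common URLs and the number of comparisons performed."""
    result = []
    comparisons = 0
    for url in left:
        for candidate in right:
            comparisons += 1
            if url == candidate:
                result.append(url)
                break  # found it, no need to scan further
    return result, comparisons

# Hypothetical three-entry lists.
left = ["a.example", "b.example", "c.example"]
right = ["c.example", "d.example", "e.example"]
common, comparisons = naive_intersect(left, right)
print(common, comparisons)
```

A URL that is absent from the right list forces a scan of that whole list, so in the worst case the number of comparisons is the product of the two list lengths.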

Alphabetized Lists suppose that each list was maintained alphabetically then we could employ the following approach place a marker at start of each list if markers point to same URL: add URL to result list move both markers down otherwise, move the marker whose URL is lexicographically smaller stop when at least one marker goes off the end of the list
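The marker procedure above can be sketched as follows, assuming both lists are already in alphabetical order. The function name `sorted_intersect`, the step counter, and the sample lists are illustrative additions.

```python
def sorted_intersect(left, right):
    """Intersect two alphabetized lists by walking one marker down each.
    Returns the common URLs and the number of steps taken."""
    i = j = 0            # the two markers
    result = []
    steps = 0
    while i < len(left) and j < len(right):   # stop when a marker runs off
        steps += 1
        if left[i] == right[j]:
            result.append(left[i])            # same URL: keep it,
            i += 1                            # move both markers down
            j += 1
        elif left[i] < right[j]:
            i += 1                            # left URL is smaller
        else:
            j += 1                            # right URL is smaller
    return result, steps

# Hypothetical alphabetized lists.
left = ["a.example", "c.example", "e.example", "g.example"]
right = ["b.example", "c.example", "g.example", "h.example"]
common, steps = sorted_intersect(left, right)
print(common, steps)
```

Each step moves at least one marker forward, so the number of steps can never exceed the combined length of the two lists.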

The Sorted Approach place markers at the start of each list red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result:

The Sorted Approach do markers point to same URL? no since right marker's URL is less than left marker's URL, move right marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result:

The Sorted Approach do markers point to same URL? no since left marker's URL is less than right marker's URL, move left marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result:

The Sorted Approach do markers point to same URL? yes add URL to result move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Sorted Approach do markers point to same URL? no since right marker's URL is less than left marker's URL, move right marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Sorted Approach do markers point to same URL? yes add URL to result move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Sorted Approach do markers point to same URL? no since left marker's URL is less than right marker's URL, move left marker down red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Sorted Approach do markers point to same URL? yes add URL to result move both markers red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Sorted Approach at least one marker has completed its list, so we can stop notice that our result contains correct values red: en.wikipedia.org/wiki/red newsroom.urc.edu fish: en.wikipedia.org/wiki/fish newsroom.urc.edu result: newsroom.urc.edu

The Sorted Approach how many comparisons are done? note that every step involves moving at least one marker hence, with two lists of 1 billion URLs each, the maximum number of steps is 2 billion this is considerably less than (1 billion) squared result: a massive speedup
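To put the two counts side by side, the arithmetic below assumes a machine that performs one billion comparisons per second. That rate is purely an assumption for illustration, not a figure from the slides.

```python
LIST_SIZE = 10**9                 # URLs per list, as assumed in the slides
COMPARISONS_PER_SECOND = 10**9    # assumed machine speed (illustrative)
SECONDS_PER_YEAR = 365 * 24 * 3600

naive_comparisons = LIST_SIZE * LIST_SIZE   # 1 billion x 1 billion
sorted_max_steps = 2 * LIST_SIZE            # at most one marker move per step

naive_years = naive_comparisons / COMPARISONS_PER_SECOND / SECONDS_PER_YEAR
sorted_seconds = sorted_max_steps / COMPARISONS_PER_SECOND
speedup = naive_comparisons // sorted_max_steps

print(f"naive:  about {naive_years:.1f} years")
print(f"sorted: about {sorted_seconds:.0f} seconds")
print(f"speedup factor: {speedup:,}")
```

Under that assumption the naive approach would need decades in the worst case, while the sorted approach needs a couple of seconds.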

The Sorted Approach – Notes remember: commercial search engines don't fully publicize strategies hence, some search engines may use alternate approaches for efficient intersections the previous strategy applies to more than two lists simultaneously hence, we can search for multiple tokens, rather than just two
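One possible way to extend the marker idea to several alphabetized lists at once is sketched below. This is only an illustration of the note above, not a description of what any search engine does; the function name `intersect_all` and the sample lists are hypothetical.

```python
def intersect_all(lists):
    """Intersect any number of alphabetized lists, one marker per list."""
    if not lists:
        return []
    markers = [0] * len(lists)
    result = []
    # Stop as soon as any marker runs off the end of its list.
    while all(m < len(lst) for m, lst in zip(markers, lists)):
        current = [lst[m] for m, lst in zip(markers, lists)]
        largest = max(current)
        if all(url == largest for url in current):
            result.append(largest)               # every marker agrees
            markers = [m + 1 for m in markers]   # move all markers down
        else:
            # Move down every marker whose URL is behind the largest one.
            markers = [m + 1 if url < largest else m
                       for m, url in zip(markers, current)]
    return result

# Hypothetical alphabetized lists for three tokens.
lists = [
    ["a.example", "c.example", "d.example", "f.example"],
    ["b.example", "c.example", "f.example"],
    ["c.example", "e.example", "f.example"],
]
common = intersect_all(lists)
print(common)
```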

Example (from text):

Ranking Results a typical search can produce millions of results however, we often find what we are looking for in the first few results according to Optify, first returned result from Google gets clicked 36.4% of time first page gets clicked through 90% of the time how does this occur? via a page ranking system

Ranking Results search providers have different ways of ranking the results of the search Google: PageRank proprietary (not all details available) some details are public (considered next) the higher the PageRank score, the closer to the top of the search results a page will be

PageRank a scoring system links from other pages add to a page's score [Figure: web pages and their links: Page 1 links to Page 4 and Page 5; Page 2 links to Page 5 and Page 6; Page 3 links to Page 5 and Page 6] the link from Page 1 adds to Page 4's score the links from Pages 1,2,3 add to Page 5's score the links from Page 2 and 3 add to Page 6's score

PageRank the contributions from linking pages are not weighted equally the higher a linking page's PageRank, the more important its contribution is suppose that Page 3 has one incoming link (from Page 1), and Page 4 has one incoming link (from Page 2) since Page 2's rank is higher than Page 1's, Page 4's rank will be higher than Page 3's
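The slides give no formula, so the sketch below uses the commonly published simplified form of PageRank, with a damping factor of 0.85 and a fixed number of iterations (both assumptions). It is not Google's actual ranking code. The link graph follows the earlier example in which Pages 1, 2 and 3 link to Pages 4, 5 and 6; the page identifiers and the `pagerank` function are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated score passing.
    links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages with no outgoing links share their score with everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n
                    for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # A page splits its own score among the pages it links to,
                # so links from higher-ranked pages contribute more.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

LINKS = {
    "page1": ["page4", "page5"],
    "page2": ["page5", "page6"],
    "page3": ["page5", "page6"],
}
rank = pagerank(LINKS)
for page, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

In this sketch Page 5 ends up with the highest score because three pages link to it, followed by Page 6 (two links) and Page 4 (one link).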

PageRank – Notes since a page is not necessarily aware of other pages that point to it, its PageRank must be computed by the crawler PageRank is only part of the ranking process that you see Google uses over 200 factors to determine page relevancy PageRank is one of those factors others include location, language, personalization, etc.