
Presentation transcript:

Understanding Search Engines

Basic Definitions: Search Engine
Search engines are information retrieval (IR) systems designed to help find specific information stored in digital server and database systems. They are meant to minimize both the time required to find information and the amount of information that must be searched.

Our focus is on Web Information Retrieval, not traditional IR
· Web IR means “search within the world’s largest and linked document collection.”
· This document collection is growing at a rate that is almost impossible to measure.
· Links arise and disappear at an unknown rate.

Methods of IR and Search
· Boolean Search
· Vector Space Model Search
· Probabilistic Model Search
· Meta Search

Boolean Search
· One of the earliest and simplest computerized IR methods.
· Applies Boolean algebraic operations (AND, OR, NOT) to user keywords.
· AND = both x and y must match (intersection of the two result sets)
· OR = either x or y may match (union of the two result sets)
· NOT = x matches but y does not (a specific subset of the x results)
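The three operators map directly onto set operations. The following is a minimal sketch, assuming a tiny made-up corpus and a hypothetical helper named `postings`; it is an illustration of the idea, not how any particular engine implements it.

```python
# Toy corpus: document id -> text (hypothetical example data).
docs = {
    1: "car maintenance schedule",
    2: "auto care and repair",
    3: "car rental prices",
    4: "garden maintenance tips",
}

def postings(term):
    """Return the set of document ids whose text contains the exact term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

car = postings("car")
maintenance = postings("maintenance")

both = car & maintenance      # AND: intersection
either = car | maintenance    # OR: union
only_car = car - maintenance  # NOT: car, but not maintenance

print(both, either, only_car)
```

Note that document 2 ("auto care and repair") is never returned for the keyword "car", which is the synonymy weakness of exact keyword matching.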

Boolean Search 2
+’s
· Simple. Fast. Manageable.
-’s
· Simplistic; a search for car+maintenance will not find “auto care” (synonymy), and a single word can carry several meanings (polysemy).
· Assumes the user has strong familiarity with the topic domain.
· Limited; best used for specific topics with a small vocabulary.

Vector Space Model Search
· Developed in the early 1960s by Gerard Salton.
· Transforms text into numeric vectors and matrices, then uses matrix analysis techniques to discern features and semantic relationships. (!)

Vector Space Model 2
+’s
· Incredibly powerful tool for keeping track of evolving meanings and shifting vocabularies.
· Automatically produces relevance scores, and so returns ranked search results. (!)
-’s
· Computationally intense; requires massive computing power and cannot scale up to deal with massive (web-sized) document sets.
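To make the "text into numeric vectors" idea concrete, here is a minimal sketch that uses raw term counts and cosine similarity. The documents, the query, and the function names `tf_vector` and `cosine` are assumptions for illustration; real systems typically add term weighting and matrix factorization on top of this.

```python
import math
from collections import Counter

def tf_vector(text, vocabulary):
    """Represent a text as a list of raw term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical example data.
documents = [
    "car maintenance and car repair",
    "auto care tips",
    "garden maintenance",
]
query = "car maintenance"

vocabulary = sorted({w for text in documents + [query] for w in text.lower().split()})
doc_vectors = [tf_vector(text, vocabulary) for text in documents]
query_vector = tf_vector(query, vocabulary)

# Each document gets a relevance score, so the results come back ranked.
ranked = sorted(
    ((cosine(query_vector, vec), text) for vec, text in zip(doc_vectors, documents)),
    reverse=True,
)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Unlike a Boolean match, every document receives a graded score, which is where the ranked result list comes from.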

Probabilistic Model Search
· Uses a probability model to guess which documents a user will find relevant.
· The key to this model’s effectiveness is the set of initial conditions.
· One of the most powerful initial conditions is an index of a user’s search history/search tendency.
· Another initial condition is the search term. Some powerful search algorithms begin by broadening the search terms to include conceptually related documents.
· Most appropriate for enterprises where complete understanding of an evolving topic domain or wordspace is mission critical.
Grapeshot

Probabilistic Model 2
+’s
· Very powerful tool. Uses evolving meanings and shifting vocabularies to expand the search vectors.
· Cutting edge. This is the area of greatest research interest and greatest value generation. In other words, this is where the money is.
-’s
· When there is no history, you have to start with assumptions; that can be devastating to relevance.
· Very hard to build and therefore very expensive. Like, unbelievably expensive. Megabucks.

Meta Search
· If one search engine is good (but has drawbacks), why not combine several? That’s a MetaSearch engine.
· Queries are sent to multiple engines, or multiple processors.
· As you would expect, this can be very accurate, but very slow.
· When they’re wrong, they’re monumentally wrong.
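The slides do not say how a meta search engine merges the lists it gets back. One common approach is reciprocal rank fusion, sketched below purely as an illustration; the engine names, page ids, and the constant `k=60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Hypothetical ranked results from three underlying engines.
engine_a = ["page1", "page2", "page3"]
engine_b = ["page2", "page1", "page4"]
engine_c = ["page2", "page5", "page1"]

merged = reciprocal_rank_fusion([engine_a, engine_b, engine_c])
print(merged)
```

Pages that several engines agree on rise to the top, while a page returned by only one engine stays near the bottom. The latency cost is also visible in the design: the merge cannot finish until the slowest engine has answered.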

To make the perfect Web Search Engine, you must deal with the web’s externalities:
1. You will have to search through the largest document set in the known universe.
2. That document set is changing.
3. The set is self-organizing; or, more accurately, the set is completely disorganized.
4. It is hyperlinked.

The perfect web search engine: A Huge Document Set
· The web is, in fact, too big to measure accurately.
· JAN 2004: 10,000,000,000+ pages
· FEB 2007: 25,000,000,000+ pages
· These are surface web counts; they do not include the Deep Web.

The perfect web search engine: A Changing Document Set
· Cho and Garcia-Molina, “The evolution of the Web and implications for an incremental crawler,” Proceedings of the 26th International Conference on Very Large Data Bases:
· 40% of pages in the sample changed within 7 days
· 23% changed within 24 hours
· Growth rate is unknown, but significant

The perfect web search engine: A Self-Organizing Set
· There are no standards for content, minimal control over structure, and no rules for formats.
· The data are volatile: subject to error, dishonesty, link-rot, and file disappearance.
· Data exist in multiple formats, in duplicate, or not at all until a specific request generates them.
· Data are re-created for many different uses and conditions (shopping, research, entertainment, way-finding).

The perfect web search engine: A Hyperlinked Set
· Thank God.
· The availability of hyperlinks creates an additional layer of meaning.
· This also places the web document set into a relational framework that can be described very accurately using graph theory (the mathematics of network topology).
· Hyperlinks (the only new form of punctuation created in the last 500 years) allow us to do ranked searches.

Designing a precise search mechanism.
1. Crawler Module
2. Page Repository
3. Indexing Module
4. Indexes
5. Query Module
6. Ranking Module

The Pieces

The Crawler Module
· A distributed system of software robots (bots, spiders) designed to examine and record the content and structure of pages within a site within a defined domain.
· The crawler module (CM) gives the bots root URLs to start from.
· Spiders consume resources! (bandwidth, quotas)
· Should conform to ethical crawling (robots.txt).
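A polite spider checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, the `example.com` URLs, and the helper name `allowed` are all assumptions, and the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse rules from memory, no network access

def allowed(url, user_agent="ExampleBot"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
```

In a real crawler the rules would be fetched once per host and cached, so that checking them does not itself add to the bandwidth cost.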

The Page Repository
· Temporary storage for full page contents and link structure.
· Valuable and popular pages can be stored for the longer term.

Indexing Module
· A software processor that applies a compression algorithm.
· For content, the algorithm generates an inverted file index.
· Also yields Structure Indexes, and Special-purpose Indexes (for PDFs and video).
Example inverted file (term: word positions in the three bullets above):
Software: 2
Processor: 3
Compression: 7
Algorithm: 8, 12
Index(es): 17, 21, 25
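The term-and-position listing on this slide appears to be an inverted file built from the slide's own wording. A minimal sketch that reproduces those positions follows; the function name `build_inverted_index` and the one-entry `STEMS` table (a stand-in for a real stemmer) are assumptions, and no compression is applied.

```python
import re

# The slide's own wording, used as the document to index.
text = (
    "A software processor that applies a compression algorithm. "
    "For content, the algorithm generates an inverted file index. "
    "Also yields Structure Indexes, and Special-purpose Indexes"
)

# Hypothetical stand-in for a stemmer: fold the plural into the singular.
STEMS = {"indexes": "index"}

def build_inverted_index(text):
    """Map each normalized term to the list of 1-based word positions where it occurs."""
    index = {}
    words = re.findall(r"[A-Za-z]+", text)  # "Special-purpose" counts as two words
    for position, word in enumerate(words, start=1):
        term = word.lower()
        term = STEMS.get(term, term)
        index.setdefault(term, []).append(position)
    return index

index = build_inverted_index(text)
for term in ("software", "processor", "compression", "algorithm", "index"):
    print(term, index[term])
```

Looking up a term is now a single dictionary access rather than a scan of the whole text, which is the point of inverting the file.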

Indexes
· Storage area for inverted files and other processed page results.
· These are the valuable assets of an Internet search company.

The Query Module
· The software that handles user queries.
· Interacts with the ranking module, the indexes, and the page repository.
· Must be fast! In Feb 2003, Google reported serving 250,000,000 searches per day (about 2,894 queries per second on average).
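The per-second figure follows from dividing the daily total by the number of seconds in a day. A quick check, assuming the load is spread evenly (real traffic would peak well above this average):

```python
searches_per_day = 250_000_000
seconds_per_day = 24 * 60 * 60  # 86,400 seconds

# Average rate only; this assumes queries arrive evenly around the clock.
queries_per_second = searches_per_day / seconds_per_day
print(round(queries_per_second))
```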

The Ranking Module
· The software that examines the hyperlink structure and calculates a page’s value.
· The source of all Google’s income.
· The set of rules that generated a US$ 2 billion business in three years.
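One well-known way to calculate a page's value from hyperlink structure alone is a PageRank-style power iteration. The sketch below is a simplified textbook version, not Google's production ranking; the four-page link graph, the function name `pagerank`, and the iteration count are assumptions, and 0.85 is the commonly cited damping factor.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page C, which three pages link to, ends up with the highest score, while page D, which nothing links to, keeps only the baseline share.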

2 guys and 2 theses
· Sergey Brin, Larry Page
· HITS and PageRank™ (PageRank is Brin and Page’s algorithm; HITS is Jon Kleinberg’s)

The Google PageRank Algorithm

Questions & Discussion