General Architecture of Retrieval Systems. Adrienn Skrop.

Presentation transcript:

General Architecture of a Retrieval System

REPOSITORY
The entities (documents) to be searched are stored in a central REPOSITORY (on computer disks). They are collected and entered into the REPOSITORY either manually or by specialised computer programs.

INDEXING MODULE
Using the documents stored in the REPOSITORY, the INDEXING MODULE creates the INDEXES in the form of inverted file structures. The QUERY MODULE uses these structures to find documents that match the user’s query.
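As a rough illustration of an inverted file structure, the sketch below maps each term to the documents that contain it. The function name `build_inverted_index`, the toy `repository` and whitespace tokenisation are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict


def build_inverted_index(repository):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in repository.items():
        # Assumption: terms are lower-cased, whitespace-separated words.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Toy repository (hypothetical data): document id -> document text.
repository = {
    1: "retrieval systems index documents",
    2: "web search engines crawl documents",
}
index = build_inverted_index(repository)
```

A real indexing module would also record positions or frequencies and compress the postings; this sketch keeps only document ids.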

QUERY MODULE
The QUERY MODULE reads in the user’s query and, using the INDEXES, finds the documents that match it (typically, the documents that contain the query terms). It then passes the located documents to the RANKING MODULE.
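One possible reading of "documents that contain the query terms" is a conjunctive match: a document qualifies only if it contains every term. The sketch below assumes that interpretation; the function `match` and the toy `index` are hypothetical.

```python
def match(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term and intersect with the rest.
    result = set(index.get(terms[0], ()))
    for term in terms[1:]:
        result &= set(index.get(term, ()))
    return result


# Toy inverted index (hypothetical data): term -> document ids.
index = {"web": [1, 3], "search": [1, 2, 3], "engine": [3]}
located = match(index, "web search")
```

The resulting set is what would be handed on to the RANKING MODULE.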

RANKING MODULE
The RANKING MODULE computes similarity scores (using the INDEXES) for the documents located by the QUERY MODULE. The documents are then ranked (sorted in descending order) by similarity score and presented to the user in this order; this list is called the hit list. Several methods can be used to compute the similarity scores.
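The slides leave the similarity method open. As a minimal sketch, the example below assumes one of the simplest choices, the summed frequency of the query terms in each document, and sorts in descending order to form the hit list. The names `rank` and `term_freqs` are illustrative only.

```python
def rank(candidates, term_freqs, query_terms):
    """Return (doc_id, score) pairs sorted by descending score."""
    scored = []
    for doc_id in candidates:
        freqs = term_freqs.get(doc_id, {})
        # Assumed similarity: total occurrences of the query terms.
        score = sum(freqs.get(term, 0) for term in query_terms)
        scored.append((doc_id, score))
    # Ties are broken by document id so the order is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical per-document term frequencies taken from the indexes.
term_freqs = {1: {"web": 2, "search": 1}, 2: {"search": 4}, 3: {"web": 1}}
hit_list = rank({1, 2, 3}, term_freqs, ["web", "search"])
```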

Elements of Web Retrieval Technology

World Wide Web
The World Wide Web is a network of electronic documents stored on dedicated computers (servers) around the world. Documents can contain different types of data, such as text, image, or sound. They are stored in units referred to as Web pages. Each page has a unique address, called a URL (Uniform Resource Locator), which identifies its location on a server. The number of Web pages is referred to as the size of the Web (more than 12 billion pages at the time of writing).

Major Characteristics of the Web
Most Web documents are in HTML (HyperText Markup Language) format and contain many tags, which provide important information about the page. For example, the <b> tag, which marks bold typeface, usually increases the importance of the term it refers to. Web pages can be less structured than traditional documents: there is no generally recommended or prescribed format.

Web pages are diverse: they can be written in many languages, and several languages may be used within the same page; the grammar of the text in a page may not always be checked very carefully; the style varies to a great extent; and the length of a page is virtually unlimited (any limits are imposed by, e.g., disk capacity or memory).

Web pages can contain a variety of data types, including text, image, sound, video, and executable code.

Many different formats are used: HTML, XML, PDF, MS Word, mp3, avi, mpeg, etc.

Major Characteristics of the Web
While most documents in classical Information Retrieval are considered static, Web pages are dynamic, i.e., they can be updated frequently, deleted or added, or dynamically generated.

Major Characteristics of the Web
Web pages can be hyperlinked, which generates a linked network of Web pages. Factors such as a link (URL) from one Web page to another and its anchor text (the underlined, clickable text) can provide additional information about the importance of the target page.

General Architecture of a Web Search Engine

CRAWLER MODULE
In a traditional retrieval system, the documents are stored in a centralised repository, i.e., on computer disks in a particular institution (university library, computing department in a bank, etc.). By contrast, Web pages are stored in a decentralised manner, on computers around the world. While this has advantages (e.g., there are no geographic boundaries between documents), it also means that search engines need to collect documents from around the world.

CRAWLER MODULE
This task is performed by specialised computer programs which together make up the CRAWLER MODULE. They need to run all the time, day and night. Virtual robots, named spiders, ‘walk’ the Web from page to page, download the pages and send them to the REPOSITORY.
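The page-to-page walk of a spider can be sketched as a breadth-first traversal. To stay self-contained, the example below makes no network requests: `toy_web` is a hypothetical in-memory stand-in for the Web that maps each address to its text and outgoing links, and `crawl` and `max_pages` are names chosen for this sketch.

```python
from collections import deque


def crawl(web, seeds, max_pages=100):
    """Walk the link graph breadth-first and collect page texts."""
    repository = {}
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = web.get(url)  # stands in for downloading the page
        if page is None:
            continue  # unreachable or missing page
        text, links = page
        repository[url] = text  # send the page to the repository
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


# Hypothetical toy web: address -> (page text, outgoing links).
toy_web = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about page", ["a.example/"]),
    "b.example/": ("other site", ["c.example/missing"]),
}
repository = crawl(toy_web, ["a.example/"])
```

A production crawler would additionally handle politeness rules, revisit schedules and failures, none of which are modelled here.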

REPOSITORY
The Web pages downloaded by spiders are stored in the REPOSITORY (physically, computer disks mounted on computers belonging to the company which runs the search engine). Pages are sent from the REPOSITORY to the INDEXING MODULE for further processing. Important or popular pages can be stored for a longer (even a very long) period of time.

INDEXING MODULE
The Web pages from the REPOSITORY are processed by the programs of the INDEXING MODULE (HTML tags are filtered, terms are extracted, etc.). In other words, a compressed representation of each page is obtained by recognising and extracting important information.

INDEXES
The INDEXES are logically organised as an inverted file structure (physically implemented in compressed form in order to save memory). They are typically divided into several substructures. The content structure is an inverted structure which stores, for example, terms, anchor text, etc. for pages. The link structure stores connection information between pages (i.e., which page links to which page). The spider may access the link structure to find addresses of uncrawled pages.
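To make the two substructures concrete, the sketch below represents each as a plain dictionary and shows how a spider could use the link structure to find addresses that have not been crawled yet. The page ids, data and function names are assumptions made for illustration.

```python
def uncrawled_addresses(link_structure):
    """Return link targets that do not yet appear as crawled pages."""
    crawled = set(link_structure)
    targets = {t for outlinks in link_structure.values() for t in outlinks}
    return sorted(targets - crawled)


def inlinks(link_structure, page):
    """Return the crawled pages that link to the given page."""
    return sorted(src for src, outlinks in link_structure.items()
                  if page in outlinks)


# Content structure (hypothetical data): term -> pages containing it.
content_structure = {"retrieval": ["p1"], "web": ["p1", "p2"]}
# Link structure (hypothetical data): crawled page -> pages it links to.
link_structure = {"p1": ["p2", "p3"], "p2": ["p1", "p4"]}

to_visit = uncrawled_addresses(link_structure)
```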

QUERY MODULE
Step 1. The QUERY MODULE reads in what the user has typed into the query line, analyses it and transforms it into an appropriate format (for example, numeric codes). Step 2. The QUERY MODULE consults the INDEXES in order to find pages which match the user’s query (for example, pages containing the query terms). Step 3. It then sends the matching pages to the RANKING MODULE.
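Steps 1 and 2 can be sketched as follows, assuming that the "numeric code" format means replacing each term with an integer id from a vocabulary. The `vocabulary`, the id-keyed `index` and both function names are hypothetical.

```python
def encode_query(query, vocabulary):
    """Step 1: turn the typed query into a list of known term ids."""
    return [vocabulary[t] for t in query.lower().split() if t in vocabulary]


def find_pages(term_ids, index):
    """Step 2: return pages whose postings contain every encoded term."""
    if not term_ids:
        return set()
    postings = [set(index.get(term_id, ())) for term_id in term_ids]
    return set.intersection(*postings)


# Hypothetical vocabulary (term -> id) and index (term id -> pages).
vocabulary = {"web": 0, "search": 1, "engine": 2}
index = {0: ["p1", "p2"], 1: ["p2", "p3"], 2: ["p2"]}

term_ids = encode_query("Web Search", vocabulary)
matches = find_pages(term_ids, index)  # Step 3 would pass these on
```

In this sketch, terms missing from the vocabulary are silently dropped; a real query module might instead report them or return no results.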

RANKING MODULE
The pages sent by the QUERY MODULE are ranked (sorted in descending order) according to a similarity score. The list obtained is called the hit list, and it is presented to the user on the computer screen in the form of a list of URLs. The user can access the entire page by clicking on its URL.

RANKING MODULE
The similarity score is computed from several criteria, combining methods from traditional Information Retrieval with Web-specific factors. Typical factors are: page content factors (e.g., the term frequency, tf, in the page); on-page factors (e.g., the position of the term in the page, the character size of the term); link information (which pages link to the page under focus, and which pages it links to); etc.
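One simple way to combine such factors is a weighted sum. The slides do not specify a formula, so the weights, the factor names (`tf`, `in_title`, `inlinks`) and the page data below are all assumptions chosen only to show the idea.

```python
# Assumed weights; a real engine would tune or learn these.
WEIGHTS = {"tf": 1.0, "in_title": 0.5, "inlinks": 0.25}


def similarity(factors):
    """Weighted sum of the factors; missing factors count as zero."""
    return sum(WEIGHTS[name] * factors.get(name, 0) for name in WEIGHTS)


# Hypothetical factor values for three pages and one query term.
pages = {
    "A": {"tf": 3, "in_title": 1, "inlinks": 4},
    "B": {"tf": 5},
    "C": {"tf": 1, "in_title": 1, "inlinks": 20},
}

# Descending score, ties broken by page name.
hit_list = sorted(pages, key=lambda p: (-similarity(pages[p]), p))
```

Here page C ranks first despite its low term frequency, because the assumed link weight rewards its many incoming links.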

General Architecture of a Web Meta Search Engine

Meta Search Engine
Typically, a meta search engine reads in the user’s request, sends it to several search engines, downloads some of the pages they return in response to the query, and then produces its own hit list using those pages.
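The flow just described can be sketched with stand-ins for the external parts. In the example below the two "search engines" are hypothetical stub functions, downloading is replaced by a lookup in a toy `pages` dictionary, and the re-ranking score is an assumed term count; none of these come from the slides.

```python
def meta_search(query, engines, n, score):
    """Collect the top-n hits of each engine and re-rank them locally."""
    urls = []
    for engine in engines:
        for url in engine(query)[:n]:
            if url not in urls:  # drop duplicates across engines
                urls.append(url)
    # Own hit list: descending local score, ties broken by URL.
    return sorted(urls, key=lambda u: (-score(query, u), u))


# Hypothetical downloaded pages: URL -> page text.
pages = {"u1": "web search", "u2": "cooking",
         "u3": "web search web", "u4": "search"}


def term_count(query, url):
    """Assumed score: occurrences of the query terms in the page text."""
    words = pages[url].split()
    return sum(words.count(term) for term in query.split())


def engine_a(query):
    """Stub search engine returning a fixed hit list."""
    return ["u1", "u2", "u3"]


def engine_b(query):
    """Second stub search engine returning a fixed hit list."""
    return ["u3", "u4"]


hit_list = meta_search("web search", [engine_a, engine_b], 2, term_count)
```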

INTERFACE MODULE
The INTERFACE MODULE is responsible for getting the user’s query input. The query is entered as a set of terms (separated by commas); the terms are Porter-stemmed and then sent to commercial spider-based Web search engines as HTTP requests.

Meta Search Engine
The first n elements from the hit list of each Web search engine are considered, and the corresponding Web pages are downloaded in parallel (Parallel User Agent) for speed. Each Web page undergoes the following processing: tags are removed, and terms are identified, stoplisted and Porter-stemmed. The result is a repository of these pages on the server disk. This repository is processed by the RANKING MODULE.
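The per-page processing can be sketched as a small pipeline. Note the assumptions: the tag removal is a crude regular expression, `STOPLIST` is a tiny illustrative stoplist, and `crude_stem` is a simplified suffix stripper standing in for the Porter stemmer, which applies a much larger set of rules.

```python
import re

# Tiny illustrative stoplist (assumption, not the one used in the slides).
STOPLIST = {"the", "a", "of", "and", "is", "are", "in", "to"}


def crude_stem(term):
    """Strip a few suffixes; a stand-in for, NOT an implementation of, Porter."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def process_page(html):
    """Remove tags, identify terms, apply the stoplist, then stem."""
    text = re.sub(r"<[^>]+>", " ", html)        # remove tags
    terms = re.findall(r"[a-z]+", text.lower())  # identify terms
    return [crude_stem(t) for t in terms if t not in STOPLIST]


terms = process_page("<p>The <b>spiders</b> are crawling pages</p>")
```

In practice an established stemmer implementation would replace `crude_stem`, and an HTML parser would replace the regular expression.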

REPOSITORY MODULE
It stores on the server disk the data sent by the INTERFACE MODULE, i.e., the transformed Web pages that the INTERFACE MODULE has downloaded. This file is created "on the fly", while the query is being answered.

RANKING MODULE
This module works online. It is responsible for ranking the results according to some predefined evaluation method.