1 Web Search Introduction

2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet. Combined idea of documents available by FTP with the idea of hypertext to link documents. Developed initial HTTP network protocol, URLs, HTML, and first “web server.”

3 Web Pre-History Ted Nelson developed the idea of hypertext in 1965. Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960’s at SRI. ARPANET was developed in the early 1970’s. The basic technology was in place in the 1970’s, but it took the PC revolution and widespread networking to inspire the web and make it practical.

4 Web Browser History Early browsers were developed in 1992 (Erwise, ViolaWWW). In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed the Mosaic browser and distributed it widely. Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC). Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.

5 Search Engine Early History By the late 1980’s many files were available by anonymous FTP. In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”). –Assembled lists of files available on many FTP servers. –Allowed regex search of these file names. In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

6 Web Search History In 1993, early web robots (spiders) were built to collect URL’s: –Wanderer –ALIWEB (Archie-Like Index of the WEB) –WWW Worm (indexed URL’s and titles for regex search) In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

7 Web Search History (cont) In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (eventually became part of Excite and AOL). A few months later, Fuzzy Mauldin, a grad student at CMU, developed Lycos. First to use a standard IR system as developed for the DARPA Tipster project. First to index a large set of pages. In late 1995, DEC developed AltaVista. Used a large farm of Alpha machines to quickly process large numbers of queries. Supported boolean operators, phrases, and “reverse pointer” queries.

8 Web Search Recent History In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. Main advance is use of link analysis to rank results partially based on authority.

9 Web Challenges for IR Distributed Data: Documents spread over millions of different web servers. Volatile Data: Many documents change or disappear rapidly (e.g. dead links). Large Volume: Billions of separate documents. Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents. Quality of Data: No editorial control, false information, poor quality writing, typos, etc. Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

10 Growth of Web Pages Indexed [Chart from SearchEngineWatch (note from Jan. 2004): billions of pages indexed over time by Google, Inktomi, AllTheWeb, Teoma, and AltaVista.] Assuming 20KB per page, 1 billion pages is about 20 terabytes of data.
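The arithmetic behind that estimate, spelled out as a quick check:

    10^9 \text{ pages} \times 2 \times 10^4 \text{ bytes/page} = 2 \times 10^{13} \text{ bytes} = 20 \text{ TB}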

11 Graph Structure in the Web

12 Zipf’s Law on the Web Number of in-links/out-links to/from a page has a Zipfian distribution. Length of web pages has a Zipfian distribution. Number of hits to a web page has a Zipfian distribution.
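In symbols, a Zipfian distribution means the size of the r-th largest item falls off as a power of its rank (this standard form is added for reference; the exponent \alpha, often near 1, is not from the slide):

    f(r) \propto r^{-\alpha}

So with \alpha = 1, the page with the second-most in-links has roughly half as many as the most-linked page.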

13 Zipf’s Law and Web Page Popularity

14 “Small World” (Scale-Free) Graphs Social networks and six degrees of separation. Power law distribution of in and out degrees. Distinct from purely random graphs. “Rich get richer” generation of graphs. Kevin Bacon game. Erdos number. Networks in biochemistry, roads, telecommunications, the Internet, etc. are “small world”.

15 Manual Hierarchical Web Taxonomies Yahoo approach of using human editors to assemble a large hierarchically structured directory of web pages. –Open Directory Project is a similar approach based on the distributed labor of volunteer editors (“net-citizens provide the collective brain”). Used by most other search engines. Started by Netscape.

16 Business Models for Web Search Advertisers pay for banner ads on the site that do not depend on a user’s query. –CPM: Cost Per Mille (thousand impressions). Pay for each ad display. –CPC: Cost Per Click. Pay only when user clicks on ad. –CTR: Click Through Rate. Fraction of ad impressions that result in click-throughs. CPC = CPM / (CTR * 1000) –CPA: Cost Per Action (Acquisition). Pay only when user actually makes a purchase on target site. Advertisers bid for “keywords”. Ads for highest bidders displayed when user query contains a purchased keyword. –PPC: Pay Per Click. CPC for bid-word ads (e.g. Google AdWords).
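To make the CPC = CPM / (CTR * 1000) relation concrete, a tiny worked example in Python (the figures are invented for illustration):

    cpm = 10.0   # advertiser pays $10 per 1000 impressions
    ctr = 0.02   # 2% of impressions result in a click

    # 1000 impressions cost $10 and yield 1000 * 0.02 = 20 clicks,
    # so the equivalent cost per click is 10 / 20 = $0.50.
    cpc = cpm / (ctr * 1000)
    print(f"equivalent CPC: ${cpc:.2f}")   # equivalent CPC: $0.50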

17 History of Business Models Initially, banner ads paid through CPM were the norm. GoTo Inc. forms in 1997 and originates and patents the bidding and PPC business model. Google introduces AdWords in fall 2000. GoTo is renamed Overture in Oct. 2001. Overture sues Google for use of PPC in Apr. 2002. Overture is acquired by Yahoo in Oct. 2003. Google settles with Overture/Yahoo for 2.7 million shares of Class A common stock in Aug. 2004.

18 Affiliates Programs If you have a website, you can generate income by becoming an affiliate by agreeing to post ads relevant to the topic of your site. If users click on your impression of an ad, you get some percentage of the CPC or PPC income that is generated. Google introduces AdSense affiliates program in 2003.

19 Automatic Document Classification Manual classification into a given hierarchy is labor intensive, subjective, and error-prone. Text categorization methods provide a way to automatically classify documents. Best methods based on training a machine learning (pattern recognition) system on a labeled set of examples (supervised learning). Text categorization is a topic we will discuss later in the course.
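As a concrete illustration of the supervised approach, a minimal sketch using scikit-learn (one possible choice of tools; the slide names no library, and the tiny labeled corpus below is invented for the example):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # A tiny invented labeled corpus: supervised learning needs labeled examples.
    docs = [
        "stocks fell sharply in early trading",
        "the bond market rallied after the rate news",
        "astronomers observed a distant supernova",
        "the telescope captured images of the nebula",
    ]
    labels = ["finance", "finance", "science", "science"]

    # Train on the labeled examples, then classify new documents automatically.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["stocks rallied after the rate news"]))  # -> ['finance']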

20 Automatic Document Hierarchies Manual hierarchy development is labor intensive, subjective, and error-prone. It would be nice to automatically construct a meaningful hierarchical taxonomy from a corpus of documents. This is possible with hierarchical text clustering (unsupervised learning). –Hierarchical Agglomerative Clustering (HAC) Text clustering is another topic we will discuss later in the course.
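A minimal sketch of HAC over a toy corpus (scipy's average-link clustering with cosine distance is one common instantiation; the documents are invented for illustration):

    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "stock markets fell today",
        "stock markets rallied on rate news",
        "the telescope observed a distant supernova",
        "the telescope captured a distant nebula",
    ]

    # Vectorize, then greedily merge the closest clusters (agglomerative).
    X = TfidfVectorizer().fit_transform(docs).toarray()
    Z = linkage(X, method="average", metric="cosine")  # the merge tree (dendrogram)

    # Cutting the dendrogram at two clusters groups the similar documents.
    print(fcluster(Z, t=2, criterion="maxclust"))      # -> e.g. [1 1 2 2]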

21 Web Search Using IR [Diagram: a Web Spider harvests the document corpus from the Web; the IR System takes a user's query string and returns ranked documents (1. Page1, 2. Page2, 3. Page3).]

WEB CRAWLERS, SPIDERS, ROBOTS

A Typical Web Search Engine [Diagram: Users issue queries through the Interface to the Query Engine, which answers them from the Index; the Indexer builds the Index from pages the Crawler fetches from the Web.]

Components of Web Search Service –Components: web crawler, indexing system, search system. –Considerations: economics, scalability, legal issues.

What is a Web Crawler? A program for downloading web pages. –Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set. –A focused web crawler downloads only those pages whose content satisfies some criterion. –Also known as a web spider, bot, harvester.

Crawlers vs Browsers vs Scrapers Crawlers automatically harvest all files on the web. Browsers are manual crawlers. Scrapers automatically harvest the visual files for a web site and are limited crawlers.

Crawling the web [Diagram: starting from seed pages, the crawler expands a region of crawled-and-parsed URLs; the URL frontier separates that region from the unseen Web.]

More detail [Diagram: the same picture, with crawling threads pulling URLs from the URL frontier at the edge of the unseen Web.]

URL frontier The next node to crawl. Can include multiple pages from the same host. Must avoid trying to fetch them all at the same time. Must try to keep all crawling threads busy.
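One common way to meet those constraints is a per-host FIFO queue plus a politeness delay. A simplified Python sketch (the class name URLFrontier and all parameters are made up here, not any particular engine's implementation):

    import collections
    import time

    class URLFrontier:
        """Toy URL frontier: one FIFO queue per host plus a politeness delay,
        so no host is hit too often while idle threads can still find work."""
        def __init__(self, delay=2.0):
            self.delay = delay                   # min seconds between hits to a host
            self.queues = collections.defaultdict(collections.deque)
            self.next_ok = {}                    # host -> earliest allowed fetch time

        def add(self, host, url):
            self.queues[host].append(url)

        def next_url(self):
            now = time.monotonic()
            # Serve any host that has work and is past its politeness window.
            for host, queue in self.queues.items():
                if queue and self.next_ok.get(host, 0.0) <= now:
                    self.next_ok[host] = now + self.delay
                    return queue.popleft()
            return None                          # nothing fetchable right now

A production frontier would also prioritize URLs (by freshness or importance) rather than scanning hosts in arbitrary order, but the two constraints from the slide, per-host rate limiting and keeping threads busy, are the core of it.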

Spider Algorithm 1 (depth first)
PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION
    While STACK is not empty,
        URL_curr := pop(STACK)
        PAGE := look-up(URL_curr)
        STORE(<URL_curr, PAGE>, COLLECTION)
        For every URL_i in PAGE,
            push(URL_i, STACK)
    Return COLLECTION
What is wrong with the above algorithm?

Depth-first Search [Diagram: an example graph with nodes numbered 1-7 in the order depth-first search visits them.]

Graph-Search Algorithms (2a) SPIDER1 is Incorrect! What about loops in the web graph? => Algorithm will not halt. What about convergent DAG structures? => Pages will be replicated in the collection => Inefficiently large index => Duplicates annoy the user

Graph-Search Algorithms (2b) SPIDER1 is Incomplete! The web graph has k-connected subgraphs. SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.

A Better Spidering Algorithm (breadth first)
PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION
    While STACK is not empty,
|       Do URL_curr := pop(STACK)
|       Until URL_curr is not in COLLECTION
        PAGE := look-up(URL_curr)
        STORE(<URL_curr, PAGE>, COLLECTION)
        For every URL_i in PAGE,
            push(URL_i, STACK)
    Return COLLECTION

A More Efficient BF Spidering Algorithm
PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION
|   Initialize VISITED
    While STACK is not empty,
|       Do URL_curr := pop(STACK)
|       Until URL_curr is not in VISITED
|       insert-hash(URL_curr, VISITED)
        PAGE := look-up(URL_curr)
        STORE(<URL_curr, PAGE>, COLLECTION)
        For every URL_i in PAGE,
            push(URL_i, STACK)
    Return COLLECTION

A More Complete Correct BF Spidering Algorithm
PROCEDURE SPIDER4(G, {SEEDS})
|   Initialize COLLECTION
|   Initialize VISITED
|   For every ROOT in SEEDS
|       Initialize STACK
|       Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URL_curr := pop(STACK)
            Until URL_curr is not in VISITED
            insert-hash(URL_curr, VISITED)
            PAGE := look-up(URL_curr)
            STORE(<URL_curr, PAGE>, COLLECTION)
            For every URL_i in PAGE,
                push(URL_i, STACK)
    Return COLLECTION
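For concreteness, a runnable Python rendering of SPIDER4's logic (a sketch only: link extraction is reduced to a regex, the seed URL is a stand-in, and the politeness, robots.txt handling, and error recovery of a real crawler are omitted):

    import re
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def spider(seeds, max_pages=50):
        """SPIDER4 in miniature: seed set, VISITED hash, COLLECTION store."""
        stack, visited, collection = list(seeds), set(), {}
        while stack and len(collection) < max_pages:
            url = stack.pop()
            if url in visited:          # the Do ... Until loop from the slide
                continue
            visited.add(url)
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                # dead or unreachable link: skip it
            collection[url] = page      # STORE(<URL_curr, PAGE>, COLLECTION)
            for link in re.findall(r'href="(http[^"]+)"', page):
                stack.append(urljoin(url, link))
        return collection

    pages = spider(["https://example.com/"])   # example.com is a stand-in seed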

Completeness Observations Completeness is not guaranteed. In a k-connected web G, we do not know k. It is impossible to guarantee that each connected subgraph is sampled. Better: more seeds, more diverse seeds.

Completeness Observations: Search Engine Practice Wish to maximize the subset of the web indexed. Maintain a (secret) set of diverse seeds (grow this set opportunistically, e.g. when X complains his/her page is not indexed). Register new web sites on demand; new registrations are seed candidates.

To Spider or not to Spider? (1) User Perceptions Most annoying: Engine finds nothing (too small an index, but not an issue since 1997 or so). Somewhat annoying: Obsolete links. => Refresh collection by deleting dead links. => Done every 1-2 weeks in the best engines. Mildly annoying: Failure to find a new site. => Re-spider entire web. => Done every 2-4 weeks in the best engines.

To Spider or not to Spider? (2) Cost of Spidering Semi-parallel algorithmic decomposition. Spider can (and does) run on hundreds of servers simultaneously. Very high network connectivity. Servers can migrate from spidering to query processing depending on time-of-day load. Running a full web spider takes days even with hundreds of dedicated servers.

Current Status of Web Spiders: Historical Notes WebCrawler: first documented spider. Lycos: first large-scale spider. Top honors for most web pages spidered: first Lycos, then AltaVista, then Google...

Current Status of Web Spiders: Enhanced Spidering In-link counts to pages can be established during spidering (if many pages point to page P, then P is presumably a good page). Hint: in SPIDER4, store <URL, in-link count> pairs in the VISITED hash table. In-link counts are the basis for Google’s PageRank method.
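A sketch of that hint in Python (hypothetical helper names; it assumes the crawler recorded, for each fetched page, the list of URLs that page links to):

    from collections import Counter

    def inlink_counts(out_links):
        """out_links: dict mapping each crawled URL to the URLs it points to.
        Returns the <URL, in-link count> tally the slide suggests keeping
        alongside VISITED."""
        counts = Counter()
        for source, targets in out_links.items():
            for target in targets:
                counts[target] += 1    # one more page points to `target`
        return counts

Pages with high counts are presumed good; PageRank refines this by recursively weighting each in-link by the importance of the page it comes from.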