1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.

Slides:



Advertisements
Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Advertisements

Natural Language Processing WEB SEARCH ENGINES August, 2002.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
CpSc 881: Information Retrieval. 2 How hard can crawling be?  Web search engines must crawl their documents.  Getting the content of the documents is.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
WEB CRAWLERs Ms. Poonam Sinai Kenkre.
Overview of Search Engines
1 Internet History Internet made up of thousands of networks worldwide No one in charge of Internet - No governing body Internet backbone owned by private.
1 Web Developer Foundations: Using XHTML Chapter 11 Web Page Promotion Concepts.
FALL 2005CSI 4118 – UNIVERSITY OF OTTAWA1 Part 4 Web technologies: HTTP, CGI, PHP,Java applets)
HTTP; The World Wide Web Protocol
1 Web Developer & Design Foundations with XHTML Chapter 13 Key Concepts.
Web Crawlers.
Lecturer: Ghadah Aldehim
Web Search Created by Ejaj Ahamed. What is web?  The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
Chapter 6 The World Wide Web. Web Pages Each page is an interactive multimedia publication It can include: text, graphics, music and videos Pages are.
Crawlers and Spiders The Web Web crawler Indexer Search User Indexes Query Engine 1.
Search Engines. Internet protocol (IP) Two major functions: Addresses that identify hosts, locations and identify destination Connectionless protocol.
XHTML Introductory1 Linking and Publishing Basic Web Pages Chapter 3.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.
Crawling Slides adapted from
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
1/28: The Internet & Website Design What is the Internet? –Parts of the Internet –Internet & WWW basics –Searching the WWW Website design considerations.
Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.
Information Retrieval and Web Search Web search. Spidering Instructor: Rada Mihalcea Class web page: (some of these.
Web Search Algorithms By Matt Richard and Kyle Krueger.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Retrieving Information on the Web Presented by Md. Zaheed Iftekhar Course : Information Retrieval (IFT6255) Professor : Jian E. Nie DIRO, University of.
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
CS315-Web Search & Data Mining. A Semester in 50 minutes or less The Web History Key technologies and developments Its future Information Retrieval (IR)
Intelligent Web Topics Search Using Early Detection and Data Analysis by Yixin Yang Presented by Yixin Yang (Advisor Dr. C.C. Lee) Presented by Yixin Yang.
CRAWLER DESIGN YÜCEL SAYGIN These slides are based on the book “Mining the Web” by Soumen Chakrabarti Refer to “Crawling the Web” Chapter for more information.
1 UNIT 13 The World Wide Web Lecturer: Kholood Baselm.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
World Wide Web “WWW”, "Web" or "W3". World Wide Web “WWW”, "Web" or "W3"
1 University of Qom Information Retrieval Course Web Search (Spidering) Based on:
What is Web Information retrieval from web Search Engine Web Crawler Web crawler policies Conclusion How does a web crawler work Synchronization Algorithms.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.
Week 1 Introduction to Search Engine Optimization.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
Lesson 10—Networking BASICS1 Networking BASICS The Internet and Its Tools Unit 3 Lesson 10.
1 Crawling Slides adapted from – Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan.
1 CS 430: Information Discovery Lecture 17 Web Crawlers.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
1 UNIT 13 The World Wide Web. Introduction 2 Agenda The World Wide Web Search Engines Video Streaming 3.
1 UNIT 13 The World Wide Web. Introduction 2 The World Wide Web: ▫ Commonly referred to as WWW or the Web. ▫ Is a service on the Internet. It consists.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
Seminar on seminar on Presented By L.Nageswara Rao 09MA1A0546. Under the guidance of Ms.Y.Sushma(M.Tech) asst.prof.
(Big) data accessing Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall
SEARCH ENGINES & WEB CRAWLER Akshay Ghadge Roll No: 107.
Information Retrieval
Web Search Introduction.
Web Search Introduction.
Web Search by Ray Mooney
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Web Searching Everything, now..
Web Search Introduction.
Information Retrieval and Web Search
Presentation transcript:

1 Web Search Introduction

2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet. Combined idea of documents available by FTP with the idea of hypertext to link documents. Developed initial HTTP network protocol, URLs, HTML, and first “web server.”

3 Web Pre-History Ted Nelson developed idea of hypertext in Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960’s at SRI. ARPANET was developed in the early 1970’s. The basic technology was in place in the 1970’s; but it took the PC revolution and widespread networking to inspire the web and make it practical.

4 Web Browser History Early browsers were developed in 1992 (Erwise, ViolaWWW). In 1993, Marc Andreessen and Eric Bina at UIUC NCSA (University of Illinois) developed the Mosaic browser and distributed it widely. Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC). Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.

5 Search Engine Early History By late 1980’s many files were available by anonymous FTP. In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”) – Assembled lists of files available on many FTP servers. –Allowed regex (regular expression) search of these file names. In 1993, Veronica and Jughead were developed to search names of text files available through Gopher (network protocol) servers.

6 Web Search History In 1993, early web robots (spiders) were built to collect URL’s: –Wanderer –ALIWEB (Archie-Like Index of the WEB) –WWW Worm (indexed URL’s and titles for regex search) In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo!.

7 Web Search History (cont) In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Washington. (eventually became part of Excite and AOL). A few months later, Fuzzy Maudlin, a grad student at CMU developed Lycos. First to use a standard IR system as developed for the DARPA Tipster project. First to index a large set of pages. In late 1995, DEC developed Altavista. Used a large farm of Alpha machines to quickly process large numbers of queries. Supported boolean operators, phrases, and “reverse pointer” queries.

8 Web Search Recent History In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. Main advance is use of link analysis to rank results partially based on authority of a web page (roughly for now: the number of incoming hyperlinks).

9 Web Challenges for IR Distributed Data: Documents spread over millions of different web servers. Volatile Data: Many documents change or disappear rapidly (e.g. dead links). Large Volume: Billions of separate documents. Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents. Quality of Data: No editorial control, false information, poor quality writing, typos, etc. Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

BIG, HOW MUCH BIG? 10

Indexed pages (Google, 2014) 11

Google, last 2 years 12

Market shares per web servers 13

14 Growth of Web Pages Indexed SearchEngineWatch

15 Zipf’s Law on the Web Number of in-links/out-links to/from a page has a Zipfian distribution. Length of web pages has a Zipfian distribution. Number of hits to a web page has a Zipfian distribution. –N be the number of elements; –k be their rank; –s be the value of the exponent characterizing the distribution Zipf's law predicts that out of a population of N elements, the frequency of elements of rank k, f(k;s,N), is:

16 Graph Structure in the Web

MAKING MONEY 17

18 Business Models for Web Search Advertisers pay for banner ads on the site that do not depend on a user’s query. –CPM: Cost Per Mille. Pay for each ad display. (CPM=(cost of ad) / (#of readers/1000)) –CPC: Cost Per Click. Pay only when user clicks on ad. –CTR: Click Through Rate. Fraction of ad impressions* that result in clicks throughs. CPC = CPM / (CTR * 1000) –CPA: Cost Per Action (Acquisition). Pay only when user actually makes a purchase on target site. Advertisers bid for “keywords”. Ads for highest bidders displayed when user query contains a purchased keyword. –PPC: Pay Per Click. CPC for bid word ads (e.g. Google AdWords). 1. Impressions: The number of times pages from “your site” appeared in search results, and the percentage increase/decrease in the daily average impressions compared to the previous period. The number of days per period defaults to 30, but you can change it at any time.

19 Affiliates Programs If you have a website, you can generate income by becoming an affiliate by agreeing to post ads relevant to the topic of your site. If users click on your impression of an ad, you get some percentage of the CPC or PPC income that is generated. Google introduces AdSense affiliates program in 2003.

HOW DOES IT WORK? 20

–Q: How does a search engine know that all these pages contain the query terms? –A: Because all of those pages have been crawled 21

Crawler: basic idea starting pages (seeds) 22

What is a Web Crawler? A program for downloading web pages. Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set. A focused web crawler downloads only those pages whose content satisfies some criterion.

Many names Crawler Spider Robot (or bot) Web agent Wanderer, worm, … And famous “instances”: googlebot, scooter, slurp, msnbot, … 24

Crawlers vs Browsers vs Scrapers Crawlers automatically harvest all files on the web Browsers are manual crawlers (user search through keywords) Scrapers takes pages that have been downloaded, or, data that are formatted for display, and extract data from it for manipulation

A crawler within a search engine 26 –Web –Text index–PageRank –Page repository –googlebot –Text & link analysis –Query –hits –Ranker

One taxonomy of crawlers Many other criteria could be used: –Incremental, Interactive, Concurrent, Etc. 27

Basic crawlers This is a sequential crawler Seeds can be any list of starting URLs Order of page visits is determined by frontier data structure Stop criterion can be anything

URL frontier The next node to crawl Can include multiple pages from the same host Must avoid trying to fetch them all at the same time

Basic crawler –Web –URLs crawled –and parsed –URLs frontier –Unseen Web –Seed –pages

Graph traversal (BFS or DFS?) Breadth First Search –Implemented with QUEUE (FIFO) –Finds pages along shortest paths –If we start with “good” pages, this keeps us close; maybe other good stuff… Depth First Search –Implemented with STACK (LIFO) –Wander away (“lost in cyberspace”) 31

Implementation issues Don’t want to fetch same page twice! –Keep lookup table (hash) of visited pages –What if not visited but in frontier already? The frontier grows very fast! –May need to prioritize for large crawls Fetcher must be robust! –Don’t crash if download fails –Timeout mechanism Determine file type to skip unwanted files –Can try using extensions, but not reliable –Can issue ‘HEAD’ HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests 32

More implementation issues Fetching –Get only the first KB per page –Take care to detect and break redirection loops –Soft fail for timeout, server not responding, file not found, and other errors 33

More implementation issues Spider traps –Misleading sites: indefinite number of pages dynamically generated by CGI scripts –Paths of arbitrary depth created using soft directory links and path rewriting features in HTTP server –Only heuristic defensive measures: Check URL length; assume spider trap above some threshold, for example 128 characters Watch for sites with very large number of URLs Eliminate URLs with non-textual data types May disable crawling of dynamic pages, if can detect 34

Concurrency A crawler incurs several delays: –Resolving the host name in the URL to an IP address using DNS –Connecting a socket to the server and sending the request –Receiving the requested page in response Solution: Overlap the above delays by fetching many pages concurrently 35

Architecture of a concurrent crawler 36

More detail –URLs crawled –and parsed –Unseen Web –Seed –Pages –URL frontier –Crawling thread

Concurrent crawlers Can use multi-processing or multi-threading Each process or thread works like a sequential crawler, except they share data structures: frontier and repository Shared data structures must be synchronized (locked for concurrent writes) Speedup of factor of 5-10 are easy this way 38

Universal crawlers Support universal search engines Large-scale Huge cost (network bandwidth) of crawl is amortized over many queries from users Incremental updates to existing index and other data repositories 39

Large-scale universal crawlers Two major issues: 1.Performance Need to scale up to billions of pages 2.Policy Need to trade-off coverage, freshness, and bias (e.g. toward “important” pages) 40

High-level architecture of a scalable universal crawler 41 –Several parallel queues to spread load across servers (keep connections alive) –DNS server using UDP (less overhead than TCP), large persistent in- memory cache, and prefetching –Optimize use of network bandwidth –Optimize disk I/O throughput –Huge farm of crawl machines –DNS domain name system: name  IP address –UDP User datagram Protocol (a connectionless protocol)

Universal crawlers: Policy Coverage –New pages get added all the time –Can the crawler find every page? Freshness –Pages change over time, get removed, etc. –How frequently can a crawler revisit ? Trade-off! –Focus on most “important” pages (crawler bias)? –“Importance” is subjective 42

Web coverage by search engine crawlers –This assumes we know the size of the entire the Web. Do we? Can you define “the size of the Web”?

Crawler ethics and conflicts Crawlers can cause trouble, even unwillingly, if not properly designed to be “polite” and “ethical” For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack! –Server administrator and users will be upset –Crawler developer/admin IP address may be blacklisted 44

45

Web page ranking: Manual/automatic classification Link analysis 46

Why we need to classify pages? Vector Space ranking is not enough Queries on the web return millions hits based only on content similarity (Vector Space or other ranking methods) Need additional criteria for selecting good pages: –Classification of web pages into pre-defined categories –Assigning relevance to pages depending upon their “position” in the web graph (link analysis) 47

48 Manual Hierarchical Web Taxonomies Yahoo (old) approach of using human editors to assemble a large hierarchically structured directory of web pages. – Open Directory Project is a similar approach based on the distributed labor of volunteer editors (“net-citizens provide the collective brain”). Used by most other search engines. Started by Netscape. –

Web page classification 53 Except for ODP, page categorization is “openly” used only by focused search engines (eg Ebay, Amazon..) The general problem of webpage classification can be divided into –Subject classification; subject or topic of webpage e.g. “Adult”, “Sport”, “Business”. –Function classification; the role that the webpage play e.g. “Personal homepage”, “Course page”, “Commercial page”.

54 Types of classification –Hard vrs. Soft (multi-class) classification

Web Page Classification 55 Constructing and expanding web directories (web hierarchies) –How are they doing?

56 Keyworder –By human effort July 2006, it was reported there are 73,354 editor in the dmoz ODP.

57 Automatic Document Classification Manual classification into a given hierarchy is labor intensive, subjective, and error-prone. Text categorization methods provide a way to automatically classify documents. Best methods based on training a machine learning (pattern recognition) system on a labeled set of examples (supervised learning).

Hierarchical Agglomerative Clustering 58 Strategies for hierarchical clustering generally fall into two types: – Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. – Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

59 Hierarchical Agglomerative Clustering (HAC) Algorithm –Start with all instances 1 in their own cluster. –Until there is only one cluster: – Among the current clusters, determine the two clusters, c i and c j, that are most similar. – Replace c i and c j with a single cluster c i  c j “instance” is a web page, represented by a vector (see later)

60 Dendrogram: Document Example As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts. d1 d2 d3 d4 d5 d1,d2 d4,d5 d3 d3,d4,d5

Feature selection in HAC Problem: how do we describe a page? Bag-of-words vector not appropriate in this case (= one position in the page vector for each word in the page) Lower number of more descriptive features, based on two criteria: –On-page (selected features in the page) –Neibourgh features (selected features in the pages “pointing” at that page

Features: On-page Textual content and tags –N-gram feature (n-gram= sequence of n consecutive words) Also called n-words, e.g. “New York” is a biword. In Yahoo!, They used 5-grams feature. –HTML tags or DOM (document object model) Title, Headings, Metadata and Main text –Assigned each of them an arbitrary weight. –Now a day most of website using Nested list ( ) which really help in web page classification.

Features: On-page Visual analysis –Each webpage has two representations 1.Text represented in HTML 2.The visual representation rendered by a web browser –visual information is useful as well Each webpage is represented as a hierarchical “Visual adjacency multi graph.” In the graph each node represents an HTML object and each edge represents the spatial relation in the visual representation.

Visual analysis

Features: Neighbors Features

Motivation –Often in-page features are missing or unrecognizable

Example webpage which has few useful on-page features

Features: Neighbors features Underlying Assumptions –When exploring the features of neighbors, some assumptions are implicitly made (like e.g. homophyly). –The presence of many “sports” pages in the neighborhood of Page-a increases the probability of Page-a being in “Sport”. –linked pages are more likely to have terms in common. Neighbor selection –Existing research mainly focuses on page within two steps of the page to be classified. (At the distance no greater than two). –There are six types of neighboring pages: parent, child, sibling, spouse, grandparent and grandchild.

Neighbors with in radius of two

Features: Neighbors features Neighbor selection cont. The text on the parent pages surrounding the link is used to train a classifier instead of text on the target page. Using page title and anchor text from parent pages can improve classification compared a pure text classifier.

Features: Neighbors features Neighbor selection cont. –Summary Using parent, child, sibling and spouse pages are all useful in classification, siblings are found to be the best source. Using information from neighboring pages may introduce extra noise, should be use carefully.

Features: Neighbors features Utilizing artificial links (implicit link) –The hyperlinks are not the only one choice to find neighbors. What is implicit link? –Connections between pages that appear in the results of the same query and are both clicked by users. Implicit link can help webpage classification as well as hyperlinks.

Next.. Link Analysis (page Rank, HITS) 74