1 CS/INFO 430 Information Retrieval Lecture 17 Web Search 3.

Presentation transcript:

1 CS/INFO 430 Information Retrieval Lecture 17 Web Search 3

2 Course Administration

3 Information Retrieval Using PageRank Simple Method: Rank by Popularity Consider all hits (i.e., all documents that match the query in the Boolean sense) as equal. Display the hits ranked by PageRank. The disadvantage of this method is that it gives no attention to how closely a document matches a query.

4 Combining Term Weighting with Reference Pattern Ranking Combined Method 1. Find all documents that contain the terms in the query vector. 2. Let $s_j$ be the similarity between the query and document $j$, calculated using tf.idf or a related method. 3. Let $p_j$ be the popularity of document $j$, calculated using PageRank or another measure of importance. 4. The combined rank $c_j = \lambda s_j + (1 - \lambda) p_j$, where $\lambda$ is a constant. 5. Display the hits ranked by $c_j$.
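The combined method can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular search engine: the function name `combined_rank`, the parameter `lam` (the constant that weights similarity against popularity), and the sample scores are all assumptions, and both inputs are assumed to be scaled to the same range before they are mixed.

```python
def combined_rank(similarity, popularity, lam=0.5):
    """Rank documents by lam * s_j + (1 - lam) * p_j.

    similarity: dict mapping document id -> s_j (e.g., a tf.idf score)
    popularity: dict mapping document id -> p_j (e.g., a PageRank score)
    lam:        assumed mixing constant in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Only documents that matched the query (have a similarity) are scored.
    scores = {
        doc: lam * s + (1.0 - lam) * popularity.get(doc, 0.0)
        for doc, s in similarity.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical scores for three hits, both already scaled to [0, 1].
similarity = {"a": 0.9, "b": 0.4, "c": 0.6}
popularity = {"a": 0.1, "b": 0.8, "c": 0.5}

ranking, scores = combined_rank(similarity, popularity, lam=0.5)
print(ranking)
```

Setting `lam=1.0` reduces this to pure similarity ranking and `lam=0.0` to the simple rank-by-popularity method, which makes the trade-off controlled by the constant easy to inspect.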

5 Questions about PageRank Most pages have very small PageRanks. For searches that return large numbers of hits, there are usually a reasonable number of pages with high PageRank. For searches that return smaller numbers of hits, e.g., highly specific queries, all the pages may have very small PageRanks, so that it is difficult to rank them in a sensible order. Example: A search by a customer for information about a product may rank a large number of mail order businesses that sell the product above the manufacturer's site that provides a specification for the product. Small numbers of links may make big changes to rank.

6 Advanced Graphical Methods: Carry out a search. Divide Web sites found by a search into clusters, known as communities. Calculate authority within communities. Calculate hubs within communities, known as experts. Note: Teoma does not publish the precise algorithms it uses.

7 Other Factors in Ranking Coefficients $s_j$ and $p_j$ may be varied by adding other evidence. Similarity ranking $s_j$ might weight: structural mark-up (e.g., headings, bold); meta-tags; anchor text and adjacent text in the linking page; file names. Popularity ranking $p_j$ might weight: usage data of the page; previous searches by the same user.

8 Anchor Text and Adjacent Text Document A provides information about document B Adjacent text Anchor text

9 Anchor Text and File Names The source of Document A contains marked-up text whose anchor reads "The Faculty of Computing and Information Science". This string provides the following index terms about Document B: Anchor text: faculty, computing, information, science. File name: cis, cornell. Note: A specific stop list is needed for each category of text.
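The extraction of index terms from a link can be sketched as follows. This is an illustration under stated assumptions: the markup string and its URL are hypothetical (the slide's original markup is not reproduced in this transcript), the two stop lists are invented for the example, and the names `AnchorCollector`, `tokens`, and `index_terms` are not from the lecture. Only Python standard-library calls are used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed stop lists: one per category of text, as the slide's note suggests.
ANCHOR_STOP = {"the", "of", "and"}
URL_STOP = {"http", "https", "www", "edu", "com", "org", "html", "index"}


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from HTML source."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def tokens(text, stop):
    """Lower-case, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop]


def index_terms(html):
    """Return anchor-text and file-name index terms for each link in html."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    result = []
    for href, text in parser.links:
        url = urlparse(href)
        result.append({
            "anchor": tokens(text, ANCHOR_STOP),
            "file_name": tokens(url.netloc + url.path, URL_STOP),
        })
    return result


# Hypothetical markup in Document A; the URL is an assumption for illustration.
source = ('<a href="http://www.cis.cornell.edu/">'
          'The Faculty of Computing and Information Science</a>')
terms = index_terms(source)
print(terms)
```

Keeping the two stop lists separate matters here: "www" and "edu" carry no information in a URL, while "the" and "of" are the uninformative words in anchor text.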

10 Indexing Non-Textual Materials Factors that can be used to index non-textual materials: anchor text, including tags text adjacent to an anchor file names PageRank This is the concept behind image searching on the Web.

11 Context: Image Searching HTML source From the Information Science web site Captions and other adjacent text on the web page

12 Evaluation: Web Searching Test corpus must be dynamic: the web is dynamic (10%-20% of URLs change every month); spam methods change continually. Queries are time sensitive: topics are hot and then not; need to have a sample of real queries. Languages: at least 90 different languages, reflected in cultural and technical differences. Amit Singhal, Google, 2004

13 Evaluation: Search + Browse Users give queries of 2 to 4 words. Most users click only on the first few results; few go beyond the fold on the first page. 80% of users use search engines to find sites: search to find the site, browse to find information. Amit Singhal, Google, 2004. Browsing is a major topic in the lectures on Usability.

14 Evaluation: The Human in the Loop Search index Return hits Browse documents Return objects

15 Scalability Question: How big is the Web and how fast is it growing? Answer: Nobody knows. Estimates of the Crawled Web: ,000 pages; 1997: 1,000,000 pages; 2000: 1,000,000,000 pages; 2005: 8,000,000,000 pages. Rough estimates of the Crawlable Web suggest at least 4x. Rough estimates of the Deep Web suggest at least 100x.

16 Scalability: Software and Hardware Replication Search service index server document server spell checking spell checker advertisement server

17 Scalability: Large-scale Clusters of Commodity Computers "Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...." Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October

18 Scalability: Performance Very large numbers of commodity computers Algorithms and data structures scale linearly Storage –Scale with the size of the Web –Compression/decompression System –Crawling, indexing, sorting simultaneously Searching –Bounded by disk I/O

19 Scalability of Staff: Growth of Google In 2000: 85 people 50% technical, 14 Ph.D. in Computer Science In 2000: Equipment 2,500 Linux machines 80 terabytes of spinning disks 30 new machines installed daily Reported by Larry Page, Google, March 2000 At that time, Google was handling 5.5 million searches per day Increase rate was 20% per month By fall 2002, Google had grown to over 400 people. By fall 2006, Google had over 9,000 people.

20 Scalability: Numbers of Computers Very rough calculation In March 2000, 5.5 million searches per day, required 2,500 computers In fall 2004, computers were about 8 times more powerful. Estimated number of computers for 250 million searches per day: = (250/5.5) x 2,500/8 = about 15,000 Some industry estimates (based on Google's capital expenditure) suggest that Google and Yahoo may have had as many as 250,000+ computers in fall 2006.
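The very rough calculation on this slide can be reproduced directly. The sketch below uses only the slide's figures; the function name `estimated_computers` and its parameter names are assumptions, and the model (computers scale linearly with search volume and inversely with per-machine power) is the slide's simplification, not a measured result.

```python
def estimated_computers(searches_per_day, base_searches_per_day,
                        base_computers, speedup):
    """Scale a known fleet size linearly with load, then divide by the
    assumed per-machine speedup since the baseline."""
    load_ratio = searches_per_day / base_searches_per_day
    return load_ratio * base_computers / speedup


# Slide figures: 5.5 million searches/day on 2,500 computers (March 2000),
# 250 million searches/day on machines about 8 times more powerful.
estimate = estimated_computers(250e6, 5.5e6, 2500, 8)
print(f"{estimate:,.0f} computers")
```

The result is a little over 14,000, which the slide rounds to about 15,000. The gap between this figure and the industry estimates quoted on the slide shows how sensitive the model is to its assumptions.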

21 Scalability: Staff Programming: As the number of programmers grows it becomes increasingly difficult to maintain the quality of software. Have very well trained staff. Isolate complex code. Most coding is single image. System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers). Customer service: Automate everything possible, but complaints, large collections, etc. still require staff.

22 Scalability of Staff: The Neptune Project The Neptune Clustering Software: Programming API and runtime support, which allows a network service to be programmed quickly for execution on a large-scale cluster in handling high-volume user traffic. The system shields application programmers from the complexities of replication, service discovery, failure detection and recovery, load balancing, resource monitoring and management. Tao Yang, University of California, Santa Barbara

23 Scalability: the Long Term Web search services are centralized systems. Over the past 12 years, Moore's Law has enabled Web search services to keep pace with the growth of the Web and the number of users, while adding extra function. Will this continue? Possible areas for concern are: staff costs, telecommunications costs, disk and memory access rates, equipment costs.

24 Growth of Web Searching In November 1997: AltaVista was handling 20 million searches/day. Google forecast for 2000 was 100s of millions of searches/day. In 2004, Google reported 250 million web searches/day, and estimated that the total number over all engines was 500 million searches/day. Moore's Law and Web searching: In 7 years, Moore's Law predicts computer power increased by a factor of at least $2^4 = 16$. It appears that computing power is growing at least as fast as web searching.
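The comparison between Moore's Law and search growth can be checked with a short calculation. This is a sketch under an explicit assumption: computing power doubles every 18 months (the slide does not state the doubling period; 18 months is the value that yields at least four doublings in 7 years). The function name `moore_factor` is hypothetical, and the search figures are the ones quoted on the slide.

```python
def moore_factor(years, doubling_months=18):
    """Growth factor in computing power after `years`, assuming one
    doubling every `doubling_months` months (an assumed period)."""
    return 2 ** (years * 12 / doubling_months)


# Figures quoted on the slide.
altavista_1997 = 20e6      # searches/day, AltaVista alone, November 1997
google_2004 = 250e6        # searches/day, Google, 2004
all_engines_2004 = 500e6   # searches/day, all engines, 2004 (Google estimate)

google_ratio = google_2004 / altavista_1997
total_ratio = all_engines_2004 / altavista_1997

print(f"Moore's Law factor over 7 years: {moore_factor(7):.1f}")
print(f"Google 2004 vs. AltaVista 1997:  {google_ratio:.1f}x")
print(f"All engines 2004 vs. AltaVista 1997: {total_ratio:.1f}x")
```

Because the 1997 figure covers AltaVista alone, the 25x ratio is an upper bound on the growth in total searches over the period, so the comparison is only indicative.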

25 Other Uses of Web Crawling and Associated Technology The technology developed for Web search services has many other applications. Conversely, technology developed for other Internet applications can be applied in Web searching. Related objects (e.g., Amazon's "Other people bought the following"). Recommender and reputation systems (e.g., Epinions' reputation system).