Search Engine Technology

Presentation transcript:

Search Engine Technology
Slides are a revised version of the ones taken from
Homework 1 returned. Stats: Total: 38; Min: 23; Max: 38; Avg: ; Stddev: 3.36
Homework 2 socket opened with 3 questions

Agenda
– Closer look at inverted indexes
– IR on the Web
  – Crawling
  – Using tags to improve retrieval
– Motivate the need to exploit link structure
  – Segue into social networks

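Efficient Retrieval
Document-term matrix:
\[
\begin{array}{c|cccccc|c}
       & t_1    & t_2    & \cdots & t_j    & \cdots & t_m    & nf \\ \hline
d_1    & w_{11} & w_{12} & \cdots & w_{1j} & \cdots & w_{1m} & 1/|d_1| \\
d_2    & w_{21} & w_{22} & \cdots & w_{2j} & \cdots & w_{2m} & 1/|d_2| \\
\vdots &        &        &        &        &        &        & \vdots \\
d_i    & w_{i1} & w_{i2} & \cdots & w_{ij} & \cdots & w_{im} & 1/|d_i| \\
\vdots &        &        &        &        &        &        & \vdots \\
d_n    & w_{n1} & w_{n2} & \cdots & w_{nj} & \cdots & w_{nm} & 1/|d_n|
\end{array}
\]
w_ij is the weight of term t_j in document d_i. Most w_ij's will be zero.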

Naïve retrieval
Consider query q = (q_1, q_2, …, q_j, …, q_m), nf = 1/|q|.
How to evaluate q (i.e., compute the similarity between q and every document)?
Method 1: Compare q with every document directly.
Document data structure:
  d_i : ((t_1, w_i1), (t_2, w_i2), ..., (t_j, w_ij), ..., (t_m, w_im), 1/|d_i|)
  – Only terms with positive weights are kept.
  – Terms are in alphabetic order.
Query data structure:
  q : ((t_1, q_1), (t_2, q_2), ..., (t_j, q_j), ..., (t_m, q_m), 1/|q|)

Naïve retrieval
Method 1: Compare q with documents directly (cont.)
Algorithm:
    initialize all sim(q, d_i) = 0;
    for each document d_i (i = 1, …, n) {
        for each term t_j (j = 1, …, m)
            if t_j appears in both q and d_i
                sim(q, d_i) += q_j × w_ij;
        sim(q, d_i) = sim(q, d_i) × (1/|q|) × (1/|d_i|);
    }
    sort documents in descending similarities and display the top k to the user;
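As a rough illustration of Method 1, here is a minimal Python sketch. The dictionary layout (term to weight) and the function name `naive_rank` are assumptions made for this example, not something the slides prescribe. The sample documents d1 and d2 and the query are borrowed from the worked example later in these slides.

```python
import math

def naive_rank(query, docs, k):
    """Method 1: score every document against the query, return the top k.

    query and each document are sparse dicts {term: weight} that hold
    only positive weights, so every vector has a non-zero length.
    """
    q_nf = 1.0 / math.sqrt(sum(w * w for w in query.values()))
    scores = {}
    for doc_id, terms in docs.items():
        # Dot product over the terms that appear in both q and d_i.
        dot = sum(q_w * terms[t] for t, q_w in query.items() if t in terms)
        d_nf = 1.0 / math.sqrt(sum(w * w for w in terms.values()))
        scores[doc_id] = dot * q_nf * d_nf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

docs = {
    "d1": {"t1": 2, "t2": 1, "t3": 1},
    "d2": {"t2": 2, "t3": 1, "t4": 1},
}
query = {"t1": 1, "t3": 1}
top = naive_rank(query, docs, k=2)
```

Note that the outer loop touches every document, even those that share no term with the query. That is the inefficiency the next slide points out.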

Observation
– Method 1 is not efficient: it needs to access most non-zero entries in the doc-term matrix.
– Solution: Inverted index
  – A data structure that permits fast searching, like the index in the back of a textbook.
  – Keywords → page numbers
  – E.g., precision: 40, 55, 60-63, 89, 220
  – Lexicon (the keywords)
  – Occurrences (the page numbers)

Search Processing (Overview)
– Lexicon search
  – E.g., looking in the index to find the entry
– Retrieval of occurrences
  – Seeing where the term occurs
– Manipulation of occurrences
  – Going to the right page

Inverted Files
A file is a list of words by position:
– First entry is the word in position 1 (first word)
– Entry 4562 is the word in position 4562 (4562nd word)
– Last entry is the last word
An inverted file is a list of positions by word!
INVERTED FILE (word: positions in the file)
  a: 1, 4, 40
  entry: 11, 20, 31
  file: 2, 38
  list: 5, 41
  position: 9, 16, 26
  positions: 44
  word: 14, 19, 24, 29, 35, 45
  words: 7
  4562: 21, 27
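The inversion itself is a single pass over the file. The sketch below is one possible way to do it in Python. The function name `invert` and the lowercasing step are assumptions for illustration, and the sample text is just the first sentence of the slide.

```python
from collections import defaultdict

def invert(words):
    """Map each word to the list of 1-based positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        positions[word.lower()].append(pos)
    return dict(positions)

text = "A file is a list of words by position"
inverted = invert(text.split())
```

For this sentence, `inverted["a"]` holds positions 1 and 4, which agrees with the start of the entry for "a" in the table above.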

Inverted Files for Multiple Documents
LEXICON → OCCURRENCE INDEX (columns: DOCID, OCCUR, POS 1, POS ...)
“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document …
This is one method. Alta Vista uses an alternative …

Many Variations Possible
– Address space (flat, hierarchical)
– Position
– TF/IDF info precalculated
– Header, font, tag info stored
– Compression strategies

Using Inverted Files
Several data structures:
1. For each term t_j, create a list (inverted file list) that contains all document ids that have t_j:
   I(t_j) = { (d_1, w_1j), (d_2, w_2j), …, (d_i, w_ij), …, (d_n, w_nj) }
   – d_i is the document id number of the i-th document.
   – Weights come from the frequency of the term in the document.
   – Only entries with non-zero weights should be kept.

Inverted files continued
More data structures:
2. Normalization factors of documents are pre-computed and stored in an array: nf[i] stores 1/|d_i|.
3. Lexicon: a hash table for all terms in the collection: t_j → pointer to I(t_j)
   – Inverted file lists are typically stored on disk.
   – The number of distinct terms is usually very large.

Retrieval using Inverted files
Algorithm:
    initialize all sim(q, d_i) = 0;
    for each term t_j in q {
        find I(t_j) using the hash table;
        for each (d_i, w_ij) in I(t_j)
            sim(q, d_i) += q_j × w_ij;
    }
    for each document d_i
        sim(q, d_i) = sim(q, d_i) × (1/|q|) × nf[i];
    sort documents in descending similarities and display the top k to the user;
Use something like this as part of your Project.
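A minimal Python sketch of Method 2 follows. The names `build_index` and `rank` and the in-memory dictionaries are assumptions for illustration. As the slides note, real inverted lists usually live on disk. The data is the five-document example used in these slides.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build the inverted lists I(t) and the normalization array nf."""
    index = defaultdict(list)   # term -> [(doc_id, weight), ...]
    nf = {}                     # doc_id -> 1/|d|
    for doc_id, terms in docs.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
        nf[doc_id] = 1.0 / math.sqrt(sum(w * w for w in terms.values()))
    return index, nf

def rank(query, index, nf, k):
    """Accumulate scores term by term; only matching documents are touched."""
    sim = defaultdict(float)
    for term, q_w in query.items():
        for doc_id, w in index.get(term, []):
            sim[doc_id] += q_w * w
    q_nf = 1.0 / math.sqrt(sum(w * w for w in query.values()))
    scored = [(doc_id, s * q_nf * nf[doc_id]) for doc_id, s in sim.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

docs = {
    "d1": {"t1": 2, "t2": 1, "t3": 1},
    "d2": {"t2": 2, "t3": 1, "t4": 1},
    "d3": {"t1": 1, "t3": 1, "t4": 1},
    "d4": {"t1": 2, "t2": 1, "t3": 2, "t4": 2},
    "d5": {"t2": 2, "t4": 1, "t5": 2},
}
index, nf = build_index(docs)
results = rank({"t1": 1, "t3": 1}, index, nf, k=3)
```

Document d5 never enters `sim` because it shares no term with the query, which is exactly the saving over Method 1.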

Observations about Method 2
– If a document d does not contain any term of a given query q, then d will not be involved in the evaluation of q.
– Only non-zero entries in the columns of the document-term matrix corresponding to the query terms are used to evaluate the query.
– Computes the similarities of multiple documents simultaneously (w.r.t. each query word).

Efficient Retrieval
Example (Method 2): Suppose
  q = { (t1, 1), (t3, 1) }, 1/|q| = 0.7071
  d1 = { (t1, 2), (t2, 1), (t3, 1) }, nf[1] = 0.4082
  d2 = { (t2, 2), (t3, 1), (t4, 1) }, nf[2] = 0.4082
  d3 = { (t1, 1), (t3, 1), (t4, 1) }, nf[3] = 0.5774
  d4 = { (t1, 2), (t2, 1), (t3, 2), (t4, 2) }, nf[4] = 0.2774
  d5 = { (t2, 2), (t4, 1), (t5, 2) }, nf[5] = 0.3333
  I(t1) = { (d1, 2), (d3, 1), (d4, 2) }
  I(t2) = { (d1, 1), (d2, 2), (d4, 1), (d5, 2) }
  I(t3) = { (d1, 1), (d2, 1), (d3, 1), (d4, 2) }
  I(t4) = { (d2, 1), (d3, 1), (d4, 2), (d5, 1) }
  I(t5) = { (d5, 2) }

Efficient Retrieval
(q, d1 to d5, the nf values, and the inverted lists are the same as in the example on the previous slide.)
After t1 is processed:
  sim(q, d1) = 2, sim(q, d2) = 0, sim(q, d3) = 1, sim(q, d4) = 2, sim(q, d5) = 0
After t3 is processed:
  sim(q, d1) = 3, sim(q, d2) = 1, sim(q, d3) = 2, sim(q, d4) = 4, sim(q, d5) = 0
After normalization:
  sim(q, d1) = .87, sim(q, d2) = .29, sim(q, d3) = .82, sim(q, d4) = .78, sim(q, d5) = 0
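The normalization step can be checked by hand. The arithmetic below uses only the weights given in the example, with each normalization factor computed as 1 over the Euclidean length of the vector.

\[
\frac{1}{|q|} = \frac{1}{\sqrt{1^2 + 1^2}} \approx 0.7071, \qquad
nf[1] = \frac{1}{\sqrt{2^2 + 1^2 + 1^2}} \approx 0.4082, \qquad
nf[4] = \frac{1}{\sqrt{2^2 + 1^2 + 2^2 + 2^2}} \approx 0.2774
\]
\[
sim(q, d_1) = 3 \times 0.7071 \times 0.4082 \approx 0.87, \qquad
sim(q, d_4) = 4 \times 0.7071 \times 0.2774 \approx 0.78
\]

So d4 has the largest raw dot product (4) but drops to third place after normalization, because it is the longest document vector.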

Efficiency versus Flexibility
– Storing computed document weights is good for efficiency but bad for flexibility.
  – Recomputation is needed if the tf and idf formulas change and/or the tf and df information changes.
– Flexibility is improved by storing raw tf and df information, but efficiency suffers.
– A compromise
  – Store pre-computed tf weights of documents.
  – Use idf weights with query term tf weights instead of document term tf weights.
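One way to read the compromise: the index stores only tf-based document weights, and idf is multiplied into the query side at query time, so a change in df never forces the stored documents to be rewritten. The sketch below assumes idf = log(N / df), which is a common choice but is not stated in the slides. The function names are hypothetical.

```python
import math

def idf(term, df, n_docs):
    """Assumed idf formula: log(N / df). Other variants work the same way."""
    return math.log(n_docs / df[term])

def score(query_tf, doc_tf, df, n_docs):
    """Dot product in which idf scales the query-side weight only."""
    total = 0.0
    for term, q_tf in query_tf.items():
        if term in doc_tf and term in df:
            total += (q_tf * idf(term, df, n_docs)) * doc_tf[term]
    return total

# Stored once at indexing time: tf weights only (document d1 of the example).
doc_tf = {"t1": 2, "t2": 1, "t3": 1}
# Kept separately and cheap to update: document frequencies.
df = {"t1": 3, "t3": 4}
s = score({"t1": 1, "t3": 1}, doc_tf, df, n_docs=5)
```

If the collection grows and `df` changes, only the small `df` table is updated; `doc_tf` stays as stored.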

Distributing indexes over hosts
At web scale, the entire inverted index can’t be held on a single host. How to distribute?
– Split the index by terms
– Split the index by documents
Preferred method is to split it by docs (!)
– Each index only points to docs in a specific barrel
– Different strategies for assigning docs to barrels
At retrieval time
– Compute top-k docs from each barrel
– Merge the top-k lists to generate the final top-k
  – Result merging can be tricky, so try to punt it
Idea
– Consider putting the most “important” docs in the top few barrels
  – This way, we can ignore the other barrels unless the top barrels don’t return enough results
Another idea
– Split the top 20% and bottom 80% of the doc occurrences into different indexes
  – Short vs. long barrels
  – Do search on the short ones first and then go to the long ones as needed
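The retrieval-time step can be sketched as follows. This is a simplified illustration that assumes scores from different barrels are directly comparable, which is the part the slide calls tricky (for example, idf would have to be computed globally). The function names and the two toy barrels are made up for the example.

```python
import heapq

def top_k(barrel_scores, k):
    """Top-k (doc_id, score) pairs within a single barrel."""
    return heapq.nlargest(k, barrel_scores.items(), key=lambda hit: hit[1])

def merge_top_k(per_barrel_hits, k):
    """Merge the per-barrel top-k lists into one global top-k list."""
    merged = [hit for hits in per_barrel_hits for hit in hits]
    return heapq.nlargest(k, merged, key=lambda hit: hit[1])

# Two document-partitioned barrels, each with locally computed scores.
barrel_a = {"d1": 0.87, "d2": 0.29}
barrel_b = {"d3": 0.82, "d4": 0.78, "d5": 0.0}
partial = [top_k(barrel, 2) for barrel in (barrel_a, barrel_b)]
final = merge_top_k(partial, 2)
```

Because every barrel returns its own top k, the global top k is guaranteed to be among the merged candidates.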

Barrels vs. Collections
– We talked about distributing a central index onto multiple machines by splitting it into barrels.
– A related scenario is one where, instead of a single central document base, we have a set of separate document collections, each with its own index. You can think of each collection as a “barrel”.
– Examples include querying multiple news sources (NYT, LA Times, etc.), or “meta search engines” like dogpile and metacrawler that outsource the query to other search engines.
  – We again need to do result retrieval from each collection, followed by result merging.
  – One additional issue in such cases is “collection selection”: if you can call only k collections, which k collections would you choose?
– A simple idea is to get a sample of documents from each collection and consider the sample as a super-document representing the collection. We now have n super-documents. We can compute tf/idf weights and do vector similarity ranking on top of the n super-documents to pick the top k super-documents nearest to the query, and then call those collections.
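The super-document idea can be sketched in a few lines. Everything concrete here is an assumption for illustration: the collection names, the sample texts, whitespace tokenization, idf = log(n / df), and query term weights of 1. The query norm is omitted because it is the same for every collection and does not affect the ranking.

```python
import math
from collections import Counter

def select_collections(samples, query_terms, k):
    """samples: {collection_name: [sampled document strings]}."""
    # One super-document (bag of words) per collection.
    super_docs = {name: Counter(" ".join(docs).lower().split())
                  for name, docs in samples.items()}
    n = len(super_docs)
    # df counts in how many super-documents each term occurs.
    df = Counter(term for tf in super_docs.values() for term in tf)
    scores = {}
    for name, tf in super_docs.items():
        vec = {t: count * math.log(n / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        dot = sum(vec.get(t, 0.0) for t in query_terms)
        scores[name] = dot / norm if norm else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]

samples = {
    "news": ["election results and senate vote", "stock market news"],
    "sports": ["football match results", "tennis final score"],
    "movies": ["new movie review", "film festival awards"],
}
chosen = select_collections(samples, ["football", "score"], k=1)
```

With these toy samples only the "sports" super-document contains the query terms, so it is the collection that would be called.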

Search Engine A search engine is essentially a text retrieval system for web pages plus a Web interface. So what’s new???

Some Characteristics of the Web
Web pages are
– very voluminous and diversified
– widely distributed on many servers
– extremely dynamic/volatile
Web pages
– have more structure (extensively tagged)
– are extensively linked
– may often have other associated metadata
Web search is
– noisy (pages with high similarity to the query may still differ in relevance)
– adversarial! A page can advertise itself falsely just so it will be retrieved
Web users are
– ordinary folks (“dolts”?) without special training; they tend to submit short queries
– a very large user community
Use the links and tags and metadata!
Use the social structure of the web
Need to crawl and maintain index
Easily impressed

Crawlers: Main issues
– General-purpose crawling
– Context-specific crawling
  – Building topic-specific search engines…

SPIDER CASE STUDY

Web Crawling (Search) Strategy
– Starting location(s)
– Traversal order
  – Depth first
  – Breadth first
  – Or ???
– Cycles?
– Coverage?
– Load?
[Figure: example link graph with nodes labeled b, c, d, e, f, g, h, i, j]
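The two traversal orders differ only in which end of the frontier the next URL is taken from. The sketch below runs on a small hypothetical link graph (not the one in the slide's figure) and is an illustration, not a real crawler: there is no fetching, and `links` stands in for the out-links found on each page.

```python
from collections import deque

def crawl_order(links, seeds, breadth_first=True, limit=None):
    """Return the order in which pages would be visited.

    links: {url: [out-link urls]}; seeds: starting locations.
    The seen set prevents revisiting pages, which handles cycles.
    limit caps coverage (and therefore load).
    """
    frontier = deque(seeds)
    seen = set(seeds)
    order = []
    while frontier and (limit is None or len(order) < limit):
        # Queue (FIFO) gives breadth-first, stack (LIFO) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order

# Hypothetical graph with a cycle: c links back to a.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
bfs = crawl_order(links, ["a"], breadth_first=True)
dfs = crawl_order(links, ["a"], breadth_first=False)
```

The "Or ???" option on the slide corresponds to replacing the deque with a priority queue ordered by some importance measure.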

Robot (2)
Some specific issues:
1. What initial URLs to use?
   – The choice depends on the type of search engine to be built.
   – For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.
   – For local search engines covering one or several organizations, use the URLs of the home pages of these organizations. In addition, use an appropriate domain constraint.

Robot (7)
Several research issues about robots:
– Fetching more important pages first with limited resources.
  – Can use measures of page importance
– Fetching web pages in a specified subject area, such as movies and sports, for creating domain-specific search engines.
  – Focused crawling
– Efficient re-fetch of web pages to keep the web page index up to date.
  – Keeping track of the change rate of a page

Storing Summaries
– Can’t store complete page text
  – Whole WWW doesn’t fit on any server
– Stop words
– Stemming
– What (compact) summary should be stored?
  – Per URL: title, snippet
  – Per word: URL, word number
But, look at Google’s “Cache” copy, and its “privacy violations”…

Mercator’s way of maintaining the URL frontier
– Extracted URLs enter the front queues.
– Each URL goes into a front queue based on its priority (priority is assigned based on page importance and change rate).
– URLs are shifted from front to back queues. Each back queue corresponds to a single host. Each queue has a time t_e at which the host can be hit again.
– URLs are removed from a back queue when the crawler wants a page to crawl.
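The following is a much simplified, single-threaded sketch of a frontier organized this way. It is not Mercator's actual implementation: the class name `Frontier`, the fixed politeness `delay`, and the policy of draining all front queues in strict priority order are assumptions made to keep the example short.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self, num_priorities=3, delay=2.0):
        self.front = [deque() for _ in range(num_priorities)]  # 0 = highest
        self.back = {}        # host -> deque of URLs for that host
        self.ready = []       # heap of (t_e, host): earliest allowed hit time
        self.next_time = {}   # host -> t_e, remembered across refills
        self.delay = delay

    def add(self, url, priority):
        """Extracted URLs enter a front queue chosen by priority."""
        self.front[priority].append(url)

    def _refill(self):
        """Shift URLs from front queues to per-host back queues."""
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    t_e = self.next_time.get(host, 0.0)
                    heapq.heappush(self.ready, (t_e, host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be hit at time now, else None."""
        self._refill()
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.back[host]
        url = queue.popleft()
        self.next_time[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (self.next_time[host], host))
        else:
            del self.back[host]
        return url

f = Frontier(delay=2.0)
f.add("http://a.example/1", 1)
f.add("http://a.example/2", 0)
f.add("http://b.example/1", 2)
first = f.next_url(now=0.0)
second = f.next_url(now=0.0)
third = f.next_url(now=1.0)
fourth = f.next_url(now=2.0)
```

In this run the higher-priority URL on a.example comes out first, b.example is served next because a.example is not yet allowed again, and the remaining a.example URL is released only once its t_e has passed.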

Robot (4)
2. How to extract URLs from a web page?
Need to identify all possible tags and attributes that hold URLs.
– Anchor tag: …
– Option tag: …
– Map:
– Frame:
– Link to an image:
– Relative path vs. absolute path:
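A sketch of URL extraction using only the Python standard library is shown below. The table of (tag, attribute) pairs is an assumed, incomplete selection chosen to mirror the categories on the slide; the example page and host names are made up. Relative paths are resolved against the page's own URL with `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed (tag, attribute) pairs that may hold URLs. Not exhaustive.
URL_ATTRS = {
    ("a", "href"),        # anchor
    ("area", "href"),     # image map
    ("frame", "src"),     # frame
    ("iframe", "src"),
    ("img", "src"),       # link to an image
    ("option", "value"),  # only a URL in navigation menus; validate in practice
}

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value and (tag, name) in URL_ATTRS:
                # urljoin turns relative paths into absolute URLs.
                self.urls.append(urljoin(self.base_url, value))

def extract_urls(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.urls

page = ('<a href="/about.html">About</a>'
        '<img src="logo.png">'
        '<a href="http://other.example/x">x</a>')
urls = extract_urls(page, "http://www.example.com/dir/index.html")
```

A production crawler would also honor a `<base>` tag, drop fragments, and filter non-HTTP schemes; those steps are left out here.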

Focused Crawling
– Classifier: Is crawled page P relevant to the topic?
  – Algorithm that maps a page to relevant/irrelevant
    – Semi-automatic
    – Based on page vicinity
– Distiller: Is crawled page P likely to lead to relevant pages?
  – Algorithm that maps a page to likely/unlikely
    – Could be just an authorities/hubs (A/H) computation, taking the hubs
  – Distiller determines the priority of following links off of P

IR for Web Pages

Use of Tag Information (1)
– Web pages are mostly HTML documents (for now).
– HTML tags allow the author of a web page to
  – Control the display of page contents on the Web.
  – Express their emphases on different parts of the page.
– HTML tags provide additional information about the contents of a web page.
– Can we make use of the tag information to improve the effectiveness of a search engine?

Use of Tag Information (2)
Two main ideas of using tags:
– Associate different importance to term occurrences in different tags.
– Use anchor text to index referenced documents.
[Figure: Page 1 contains a link with the anchor text “airplane ticket and hotel” that points to Page 2]
A document is indexed not just with its own contents, but with the contents of others’ descriptions of it.
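The second idea can be sketched as a small indexing routine. The page identifiers `page1` and `page2`, the body texts, and the flat term-to-URL index are hypothetical; only the anchor text comes from the slide.

```python
from collections import defaultdict

def index_with_anchor_text(pages, links):
    """pages: {url: body text}; links: [(source_url, anchor_text, target_url)]."""
    index = defaultdict(set)   # term -> set of URLs indexed under that term
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    for _source, anchor, target in links:
        # Anchor terms are credited to the page being pointed at.
        for term in anchor.lower().split():
            index[term].add(target)
    return index

pages = {"page1": "travel deals", "page2": "book now"}
links = [("page1", "airplane ticket and hotel", "page2")]
index = index_with_anchor_text(pages, links)
```

After indexing, a query for "hotel" retrieves `page2` even though the word never appears in its own text.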

Google Bombs: The other side of Anchor Text
– You can “tar” someone’s page just by linking to them with some damning anchor text
  – If the anchor text is unique enough, then even a few pages linking with that keyword will make sure the page comes up high
    – E.g., link your SO’s page with “my cuddlybubbly woogums”
    – “Shmoopie” unfortunately is already taken by Seinfeld
  – For more commonplace keywords (such as “unelectable” or “my sweet heart”) you need a lot more links
    – Which, in the case of the latter, may defeat the purpose
A document is indexed not just with its own contents, but with the contents of others’ descriptions of it.

Use of Tag Information (3)
– Many search engines are using tags to improve retrieval effectiveness.
– Associating different importance to term occurrences is used in Altavista, HotBot, Yahoo, Lycos, LASER, SIBRIS.
– WWWW and Google use terms in anchor tags to index a referenced page.
– Question: what should be the exact weights for different kinds of terms?

Use of Tag Information (4)
The Webor Method (Cutler 97, Cutler 99)
– Partition HTML tags into six ordered classes:
  – title, header, list, strong, anchor, plain
– Extend the term frequency value of a term in a document into a term frequency vector (TFV).
  – Suppose term t appears in the i-th class tf_i times, i = 1, ..., 6. Then TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6).
– Example: If for page p, the term “binghamton” appears 1 time in the title, 2 times in the headers and 8 times in the anchors of hyperlinks pointing to p, then for this term in p: TFV = (1, 2, 0, 0, 8, 0).

Use of Tag Information (6)
The Webor Method (Continued)
– Challenge: How to find the (optimal) CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6) such that the retrieval performance can be improved the most?
– One solution: Find the optimal CIV experimentally using a hill-climbing search in the space of CIV.
– Details skipped
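The slide that defines CIV is not part of this transcript, so the sketch below rests on two assumptions: that CIV holds one importance value per tag class, and that the effective term frequency is the dot product of TFV and CIV. The hill-climbing routine is a generic coordinate-wise search. The `evaluate` function is a made-up stand-in; in a real experiment it would run a set of test queries and return a retrieval-performance measure, and the `target` vector here is arbitrary.

```python
def weighted_tf(tfv, civ):
    """Assumed combination: effective tf = TFV . CIV (dot product)."""
    return sum(tf * c for tf, c in zip(tfv, civ))

def hill_climb(evaluate, start, step=1.0, max_rounds=100):
    """Greedy search: nudge one CIV component at a time, keep improvements."""
    best = list(start)
    best_score = evaluate(best)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[i] = max(0.0, candidate[i] + delta)
                score = evaluate(candidate)
                if score > best_score:
                    best, best_score, improved = candidate, score, True
        if not improved:
            break   # local optimum reached
    return best, best_score

# Toy objective for demonstration only: closeness to an arbitrary vector.
target = [4.0, 3.0, 1.0, 1.0, 6.0, 1.0]
def evaluate(civ):
    return -sum((c - t) ** 2 for c, t in zip(civ, target))

best_civ, best_score = hill_climb(evaluate, start=[1.0] * 6)
tf_binghamton = weighted_tf((1, 2, 0, 0, 8, 0), [1.0] * 6)
```

With all six importance values set to 1, the TFV from the “binghamton” example collapses to the ordinary term frequency of 11. Hill climbing can stop at a local optimum, so results depend on the starting point and step size.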

Use of LINK information: Why?
– Pure query similarity will be unable to pinpoint the right pages because of the sheer volume of pages
  – There may be too many pages that have the same keyword similarity with the query
    – The “even if you are one in a million, there are still 300 more like you” phenomenon
  – Web content creators are autonomous/uncontrolled
    – No one stops me from making a page and writing on it “this is the homepage of President Bush”
  – and… adversarial
    – I may intentionally create pages with keywords just to drive traffic to my page
    – I might even use spoofing techniques to show one face to the search engine and another to the user
– So we need some metrics about the trustworthiness/importance of the page
  – Let’s look at social networks, since these topics have been investigated there.

Connection to Citation Analysis
– Mirror mirror on the wall, who is the biggest Computer Scientist of them all?
  – The guy who wrote the most papers
    – That are considered important by most people
      – By citing them in their own papers
        – “Science Citation Index”
  – Should I write survey papers or original papers?
Informetrics; Bibliometrics

What Citation Index says About Rao’s papers

Scholar.google