1 Search Engines

2 Topics to be discussed During the next few weeks we will be discussing search engines Talk divided into 4 parts –Introduction: structure of the Web, search engines –Crawler –Indexer –Ranker

3 Goals Theoretical: To better understand Web search engines, i.e., –Fundamental concepts –Main challenges –Design issues –Implementation techniques and algorithms –Novel information system Practical: To understand the factors that influence the ranking of a site, in order to create a highly ranked site

4 Introduction: Structure of the Web

5 Studying Web Evolution A new research field is trying to “measure” aspects of the web Mostly accomplished by downloading large portions of the web and studying them (or their evolution)

6 Web Dynamics Size –~10 billion Public Indexable pages –10kB / page → 100 TB (1 TB = 1000 GB) –Doubles every 18 months Dynamics –33% change weekly –8% new pages every week –25% new links every week

7 Weekly change Fetterly, Manasse, Najork, Wiener 2003

8 Bowtie Structure SCC: Strongly connected component OUT: Reachable from core IN: Reach the core

9 Other Characteristics Distributed authorship: Many authors write in different languages, using different terminologies, and different formats Spamming: Deliberate misrepresentation of the contents of a page Duplication: 20%-40% (near) duplicates Linkage: Average of 8 links per page

10 Introduction: Search Engines

11 Number of Searches Per Day (January 2003)
Service Provider | Num. Searches (millions)
Google | 250
Overture | 167
Inktomi | 80
LookSmart | 45
FindWhat | 33
Ask Jeeves | 20
Alta Vista | 18
How do we know?

12 Types of Searches Informational: want to learn something (40%) –Example: children pacifier teeth Navigational: want to go to a page (35%) –Example: cnn Transactional: want to do something –Example: download free grep windows Others

13 Characteristics of Searches Queries –Short: Average (2001) of 2.54 terms. 80% with < 3 words –Imprecise –Simple queries (no use of operators): 80% User reactions –85% look over one result screen (mostly “above the fold”) –78% queries not modified

14 Building a Search Engine: Challenges What do you think?

15 Some of the Main Challenges Speed of answer: The web is huge. However, search engines generally return answers quickly. Is this easy? –Example: "Find File" in Windows is much slower, even though it has much less to search Ranking: How can the search engine make sure to return the "best" pages on the top?

16 More Challenges Coverage: How can a search engine be sure that it covers sufficiently large portions of the web? –IN part –Dynamic pages Storage: Data from web pages is stored locally at the search engine. How can so much information be stored using a reasonable amount of memory?

17 Search Engine Components Index Repository: Storage of web pages (and additional data) Crawler: Program that "crawls" the web to find web pages Indexer: Program that gets a web page (found by the crawler) and inserts the data from the page into the Index Repository Note that the Crawler and Indexer are constantly running in the "background". They are NOT run for specific user queries

18 Search Engine Components User Interface: This is the Web page in which the user enters his query

19 Search Engine Components Query processor: Gets the query from the user interface and finds satisfying documents from the index repository Ranker: Ranks the documents found according to how well they "match" the query (much more about this later on)

20 Index Repository Query Processor Crawler Indexer Web Ranker

21 Index Repository Query Processor Crawler Indexer Web Ranker Many users querying (and many crawlers crawling) running in parallel. Challenge: Coordinate between all these processes.

22 We Will Discuss Index Repository Crawler Ranker

23 Index Repository and Query Processor

24 The Problem We want to store (information about) a lot of pages Constraints: –Avoid using too much memory –Allow quick processing of queries Dimensions of the problem –Amount of memory available –Types of queries supported

25 Option 1: Store “As Is” Pages are stored "as is" as files in the file system Can find words in files using the tool: grep How do we answer the queries: –rain and Spain ? –rain and not Spain ? –not Spain ? Is this space efficient? Are queries processed quickly?

26 Option 2: Relational Database Two options. Which is better?
Option A: URL_INDEX(Url, Uid, ...), WORD_INDEX(Word, Wid, ...), APPEARS(Uid, Wid, ...)
Option B: APPEARS(Url, Word, ...)

27 Relational Database Example The rain in Spain falls mainly on the plain. Rain, rain go away. http://poems.com/suzie.html http://www.myfairlady.com/script.html

28 Relational Database Example URL_INDEX WORD_INDEX APPEARS Note the case folding

29 Is This a Good Idea? Does it save more space than saving as files? How are queries processed? Example query: rain –SELECT Url FROM URL_INDEX U, WORD_INDEX W, APPEARS A WHERE U.Uid = A.Uid and W.Wid=A.Wid and W.Word='rain' How can we answer the queries: –rain and go ? –rain and not Spain ?
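A minimal sketch of Option 2 using Python's built-in sqlite3 module, to make the three example queries concrete. The pairing of each URL with its text is an assumption (the slide shows the two texts and the two URLs without saying which belongs to which), and the helper names `MATCH` and `run` are illustrative only.

```python
import re
import sqlite3

# Assumed pairing of URL and page text (not stated explicitly on the slide).
pages = {
    "http://poems.com/suzie.html": "Rain, rain go away.",
    "http://www.myfairlady.com/script.html":
        "The rain in Spain falls mainly on the plain.",
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE URL_INDEX  (Uid INTEGER PRIMARY KEY, Url  TEXT UNIQUE);
    CREATE TABLE WORD_INDEX (Wid INTEGER PRIMARY KEY, Word TEXT UNIQUE);
    CREATE TABLE APPEARS    (Uid INTEGER, Wid INTEGER, UNIQUE (Uid, Wid));
""")

for url, text in pages.items():
    uid = conn.execute("INSERT INTO URL_INDEX (Url) VALUES (?)", (url,)).lastrowid
    # Case folding: every word is stored in lower case.
    for word in set(re.findall(r"[a-z]+", text.lower())):
        conn.execute("INSERT OR IGNORE INTO WORD_INDEX (Word) VALUES (?)", (word,))
        wid = conn.execute(
            "SELECT Wid FROM WORD_INDEX WHERE Word = ?", (word,)).fetchone()[0]
        conn.execute("INSERT INTO APPEARS VALUES (?, ?)", (uid, wid))

# The single-word query from the slide, with the word as a parameter.
MATCH = """SELECT U.Url FROM URL_INDEX U, WORD_INDEX W, APPEARS A
           WHERE U.Uid = A.Uid AND W.Wid = A.Wid AND W.Word = ?"""

def run(sql, *words):
    """Run a query and return the matching URLs in sorted order."""
    return sorted(row[0] for row in conn.execute(sql, words))

rain = run(MATCH, "rain")
rain_and_go = run(MATCH + " INTERSECT " + MATCH, "rain", "go")
rain_not_spain = run(MATCH + " EXCEPT " + MATCH, "rain", "spain")
```

Each extra query word costs another three-way join, which is one way to see why multi-word queries are awkward in this design.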

30 Is it good to use a relational database? If a word appears in a thousand documents, then its wid will be repeated 1000 times. Why waste the space? If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents Does not easily support queries that require multiple words

31 Option 3: Bitmaps There is a vector of 1s and 0s for each word. Queries are computed using bitwise operations on the vectors – efficiently implemented in the hardware.

32 Option 3: Bitmaps How would you compute: Q1 = rain Q2 = rain and Spain Q3 = rain or not Spain
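One possible way to compute Q1 to Q3, sketched with Python integers standing in for the bit vectors. The four-document collection and the names `build_bitmaps`, `doc_ids` and `ALL` are assumptions made for illustration.

```python
# Toy collection (an assumption for illustration): document i owns bit i.
docs = [
    "the rain in spain falls mainly on the plain",  # doc 0
    "rain rain go away",                            # doc 1
    "sunny spain",                                  # doc 2
    "go away",                                      # doc 3
]

def build_bitmaps(docs):
    """Map each word to an int whose bit i is 1 iff the word occurs in doc i."""
    bitmaps = {}
    for i, text in enumerate(docs):
        for word in set(text.lower().split()):
            bitmaps[word] = bitmaps.get(word, 0) | (1 << i)
    return bitmaps

def doc_ids(vector):
    """List the document numbers whose bit is set in the vector."""
    return [i for i in range(len(docs)) if (vector >> i) & 1]

bitmaps = build_bitmaps(docs)
ALL = (1 << len(docs)) - 1      # mask that keeps NOT inside the collection
rain, spain = bitmaps["rain"], bitmaps["spain"]

q1 = rain                       # rain
q2 = rain & spain               # rain AND spain
q3 = rain | (ALL & ~spain)      # rain OR NOT spain
```

The mask `ALL` is needed because a bare NOT would otherwise set bits for documents that do not exist.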

33 Bitmaps Tradeoffs Bitmaps can be efficiently processed However, they have high memory requirements. Example: –1M of documents, each with 1K of terms –500K distinct terms in total –What is the size of the matrix? –How many 1s will it have? Summary: A lot of wasted space for the 0s
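The two questions in the example can be worked out directly. The sketch below assumes one bit per (term, document) cell and treats "1K terms per document" as an upper bound on the distinct terms of a document; the variable names are illustrative.

```python
num_docs = 1_000_000        # 1M documents
terms_per_doc = 1_000       # 1K terms in each document
distinct_terms = 500_000    # 500K distinct terms in total

# One bit for every (term, document) pair.
matrix_bits = num_docs * distinct_terms
matrix_gigabytes = matrix_bits / 8 / 10**9    # with 1 GB = 10^9 bytes

# A document can set at most one bit per term it contains,
# so this is an upper bound on the number of 1s.
max_ones = num_docs * terms_per_doc
max_density = max_ones / matrix_bits
```

Under these assumptions the matrix needs 62.5 GB, and at most 0.2% of its bits are 1s.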

34 The Index Repository

35 Two Structures Lexicon: –list of all terms in the documents –For each term in the document, store a pointer to its list in the inverted file Inverted Index: –For each term in the lexicon, an inverted list that stores pointers to all occurrences of the term in the documents. –Usually, pointers = document numbers –Usually, pointers are sorted –Sometimes also store term locations within documents (Why?)

36 The Structures: Intuition
Lexicon: term (count) | Inverted Index: inverted list
rain (3) | 1, 3, 4
go (2) | 1, 5
away (2) | 2, 4
in (3) | 3, 7, 10
spain (3) | 1, 2, 5
falls (3) | 6, 7, 8
mainly (2) | 1, 9
on (2) | 6, 8
the (3) | 1, 3, 6
plain (1) | 10

37 Inverted Index: Structure and Compression

38 Inverted Index (1)
Doc | Text
1 | Pease porridge hot, pease porridge cold
2 | Pease porridge in the pot,
3 | Nine days old.
4 | Some like it hot, some like it cold
5 | Some like it in the pot
6 | Nine days old
Terms: cold, days, hot, in, it, like, nine, old, pease, porridge, pot, some, the
Posting Lists: Sorted, by document number

39 Inverted Index (2)
Doc | Text
1 | Pease porridge hot, pease porridge cold
2 | Pease porridge in the pot,
3 | Nine days old.
4 | Some like it hot, some like it cold
5 | Some like it in the pot
6 | Nine days old
Terms: cold, days, hot, in, it, like, nine, old, pease, porridge, pot, some, the
Index is bigger, for 2 reasons
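A minimal sketch of how the two indexes for the six documents could be built. It assumes that the second, larger index stores word positions inside each document, and the function name `build_index` is illustrative.

```python
import re
from collections import defaultdict

docs = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}

def build_index(docs, with_positions=False):
    """Return {term: posting list}, terms sorted alphabetically.

    Without positions a posting is a document number; with positions it is
    a pair (document number, [word positions, counted from 1]).
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):            # ascending document numbers
        words = re.findall(r"[a-z]+", docs[doc_id].lower())   # case folding
        if with_positions:
            positions = defaultdict(list)
            for pos, word in enumerate(words, start=1):
                positions[word].append(pos)
            for word, plist in positions.items():
                index[word].append((doc_id, plist))
        else:
            for word in set(words):
                index[word].append(doc_id)
    return dict(sorted(index.items()))

doc_level = build_index(docs)
word_level = build_index(docs, with_positions=True)
```

In the word-level index every posting carries extra numbers, and a term that repeats inside a document contributes one position per occurrence.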

40 Compressing the Inverted Index The inverted index is very large (even when exact locations are not stored) Many methods to compress the inverted index We discuss one method using γ-code

41 Storing Gaps Consider a posting list for some term – Instead of storing document numbers, we can store gaps between numbers – Does this save space? –Suppose there are N documents –What is the size of the largest possible gap? –How many bits are needed for this number?

42 Compressing the Gaps Gaps are smaller than document numbers, so they are good candidates for compression –What is the largest possible sum of all numbers in a posting list? Compress these using γ-codes –Can use the same method to compress document numbers, if we do not want to store gaps Main point: numbers will be stored in variable length sizes, instead of in a fixed number of bits
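Converting between document numbers and gaps can be sketched as follows; the function names are illustrative, and the first gap is taken relative to 0 so that it equals the first document number.

```python
from itertools import accumulate

def to_gaps(postings):
    """Ascending document numbers -> first number, then successive differences."""
    return [b - a for a, b in zip([0] + postings[:-1], postings)]

def from_gaps(gaps):
    """Inverse of to_gaps: running sums restore the document numbers."""
    return list(accumulate(gaps))

postings = [3, 7, 10, 11, 40]
gaps = to_gaps(postings)
```

The gaps always sum to the last document number in the list, which is why they cannot all be large at once.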

43 Unary Codes and Binary Codes Unary Code: The unary code for a number x consists of x-1 occurrences of 1 and then 1 occurrence of 0 –What is the unary code of 1? of 2? of 3? Binary Code: (standard way of coding in a computer). Number is coded using 0-s and 1-s. –Converting to binary: repeatedly divide the number by 2, writing the remainder from right to left, until nothing is left to divide. –need ⌊log x⌋ + 1 bits (log base 2) to write x in binary –Example on the blackboard: converting 118
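Both codes can be sketched in a few lines, with bit strings represented as Python strings of "0" and "1" for readability; the function names are illustrative.

```python
def unary(x):
    """x-1 ones followed by a single zero (defined for x >= 1)."""
    if x < 1:
        raise ValueError("unary code is defined for x >= 1")
    return "1" * (x - 1) + "0"

def binary(x):
    """Repeated division by 2, writing the remainders from right to left."""
    if x < 1:
        raise ValueError("expects x >= 1")
    bits = ""
    while x > 0:
        bits = str(x % 2) + bits   # the newest remainder goes on the left
        x //= 2
    return bits
```

For example, `binary(118)` takes 7 bits, which matches ⌊log 118⌋ + 1 = 6 + 1.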

44 γ-Codes γ-code: represents x as –a unary code for 1+⌊log x⌋, –followed by a binary code of length ⌊log x⌋ of x − 2^⌊log x⌋ Notes: –this code is of variable length, i.e., the length is dependent on x (Why is that good for compression?) –A series of γ-codes can be decoded by reading the unary part to know how many bits appear in the binary part Examples on the blackboard: compressing and decompressing 1, 5, 9
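A sketch of γ-encoding and of decoding a concatenated series of γ-codes, following the definition above (logarithms base 2). Bit strings are Python strings for readability, and the function names are illustrative.

```python
def gamma_encode(x):
    """Unary code for 1 + floor(log2 x), then x - 2^floor(log2 x) in floor(log2 x) bits."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    k = x.bit_length() - 1                 # floor(log2 x)
    prefix = "1" * k + "0"                 # unary code for k + 1
    suffix = format(x - (1 << k), "b").zfill(k) if k else ""
    return prefix + suffix

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of numbers."""
    values, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":              # the unary part gives the suffix length
            k += 1
            i += 1
        i += 1                             # skip the terminating 0
        rest = bits[i:i + k]
        values.append((1 << k) + (int(rest, 2) if k else 0))
        i += k
    return values

stream = "".join(gamma_encode(x) for x in (1, 5, 9))
```

No separators are needed in `stream`, because each unary prefix says exactly how many binary bits follow.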

45 δ-codes Similar to γ-codes, but –write prefix as γ-code, instead of unary code What is δ-code of 9? What is δ-code of 1000000? How do γ-codes and δ-codes compare?
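The comparison can be checked with a short sketch, assuming the prefix 1+⌊log x⌋ is written as a γ-code and the remaining bits are the same as in the γ-code. The function names are illustrative.

```python
def gamma(x):
    """Gamma code: unary prefix for the length, then the low-order bits of x."""
    k = x.bit_length() - 1                         # floor(log2 x)
    return "1" * k + "0" + (format(x - (1 << k), "b").zfill(k) if k else "")

def delta(x):
    """Delta code: like gamma, but the prefix 1 + floor(log2 x) is gamma-coded."""
    if x < 1:
        raise ValueError("delta code is defined for x >= 1")
    k = x.bit_length() - 1
    suffix = format(x - (1 << k), "b").zfill(k) if k else ""
    return gamma(k + 1) + suffix

lengths = {x: (len(gamma(x)), len(delta(x))) for x in (1, 9, 1_000_000)}
```

In this sketch the γ-code is one bit shorter for 9 (7 bits against 8), while for 1000000 the second code is clearly shorter (28 bits against 39).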

46 Summary The inverted index allows quick access to the documents “matching” a term Technically, the posting list of each term contains: –list of document numbers, in ascending order or –list of document gaps –(or list of document numbers with positions, written with/or without gaps) Each number is encoded using γ-codes –Other compression techniques also possible

47 The Lexicon: Structure and Compression

48 The Lexicon Array of terms, along with a pointer to the appropriate posting list Terms are sorted alphabetically, to allow search in log(n) time (n is the number of all terms) Goals: –To fit the entire lexicon in main memory, to limit the number of required disk accesses –To have logarithmic search We discuss several different alternatives for storing the lexicon

49 Fixed-length Strings Assume there is a maximum length for strings. Store triples of –term, –term frequency (number of documents with the term), –disk address of posting list

50 Fixed-length Strings Suppose: –1 million terms –each string takes 20 bytes –frequency values and disk addresses each take 4 bytes What is the size of the index?
Term t | freq | disk address
jezebel | 20 | ...
jezer | 3 | ...
jezerit | 1 | ...
jeziah | 1 | ...
jeziel | 1 | ...
jezliah | 1 | ...
jezoar | 1 | ...
jezrahiah | 1 | ...
jezreel | 39 | ...

51 Concatenated Strings Previous way of storing lexicon is wasteful, since all strings take the same number of bytes, even if they are shorter Instead, concatenate all strings into one long string, use pointers to show where string starts

52 Concatenated Strings Suppose that the average length of a word is 8 letters. What is the size of the index now? How can we know where a word ends (no separators)?
Figure: one row per term with freq (20, 3, 1, 1, 1, 1, 1, 1, 39), disk address, and address of term t; the term addresses point into the single string …jezebeljezerjezeritjeziahjeziel…
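The two size questions can be answered with simple arithmetic. The sketch assumes that the pointer into the concatenated string takes 4 bytes, a figure the slides do not state; all names are illustrative.

```python
num_terms = 1_000_000
freq_bytes = 4              # frequency value
disk_address_bytes = 4      # address of the posting list on disk

# Fixed-length strings: every term occupies 20 bytes, used or not.
fixed_string_bytes = 20
fixed_total = num_terms * (fixed_string_bytes + freq_bytes + disk_address_bytes)

# Concatenated strings: 8 letters on average, plus a pointer into the string.
average_term_bytes = 8
string_pointer_bytes = 4    # assumption, not given on the slides
concatenated_total = num_terms * (
    average_term_bytes + string_pointer_bytes + freq_bytes + disk_address_bytes
)
```

Under these assumptions the lexicon shrinks from 28 MB to 20 MB. A word ends where the next word's pointer begins, so no separators are needed.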

53 Front Coding Did you see how many times “jez” appeared in the string? Size of the string can be reduced if we take advantage of common prefixes With front coding two integers are stored with each word –first number indicates how many prefix characters are the same as the previous word –second number indicates how many characters are left after removing prefix

54 Front Coding: The idea Word before jezebel was jezaniah
Word | Front-coded form
jezebel | 3,4,ebel
jezer | 4,1,r
jezerit | 5,2,it
jeziah | 3,3,iah
jeziel | 4,2,el
jezliah | 3,4,liah
jezoar | 3,3,oar
jezrahiah | 3,6,rahiah
jezreel |
jezreelites |
jibsam |
jidlaph |
Can you fill these in?
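A short sketch of front coding as described above, which can also be used to check the entries left blank. The function names `front_code` and `front_decode` are illustrative.

```python
def front_code(words, previous=""):
    """Encode each word as (shared prefix length, suffix length, suffix)."""
    coded = []
    for word in words:
        shared = 0
        for a, b in zip(previous, word):
            if a != b:
                break
            shared += 1
        coded.append((shared, len(word) - shared, word[shared:]))
        previous = word
    return coded

def front_decode(coded, previous=""):
    """Rebuild the words; each one needs the word decoded just before it."""
    words = []
    for shared, _, suffix in coded:
        previous = previous[:shared] + suffix
        words.append(previous)
    return words

words = ["jezebel", "jezer", "jezerit", "jeziah", "jeziel", "jezliah",
         "jezoar", "jezrahiah", "jezreel", "jezreelites", "jibsam", "jidlaph"]
coded = front_code(words, previous="jezaniah")
```

Decoding has to run from the start of the list, because every entry depends on the word before it.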

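55 Front Coding Example
Word | Front-coded form | freq | prefix size | suffix size
jezebel | 3,4,ebel | 20 | 3 | 4
jezer | 4,1,r | 3 | 4 | 1
jezerit | 5,2,it | 1 | 5 | 2
jeziah | 3,3,iah | 1 | 3 | 3
jeziel | 4,2,el | 1 | 4 | 2
Stored string: …ebelritiahel… Each entry also holds a disk address and the address of term t within the string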

56 Front Coding Example Assuming 2.5 letters in common prefix (on average), what is the size of the lexicon? Is search still efficient?
Figure: the front-coded lexicon of the previous slide (stored string …ebelritiahel… with freq, disk address, address of term t, prefix size and suffix size for each term), annotated “Is this needed?”

57 3-in-4 Front Coding Front coding saves space, but binary search of the index is no longer possible To allow for binary search, “3-in-4” front coding should be used In this method, in every block of 4 words, the first is completely given, and all others are front-coded Binary search can be based on the complete words to find the correct block

58 3-in-4 Front Coding: The idea Word before jezebel was jezaniah
Word | 3-in-4 front-coded form
jezebel | ,7,jezebel
jezer | 4,1,r
jezerit | 5,2,it
jeziah | 3,,iah
jeziel | ,6,jeziel
jezliah | 3,4,liah
jezoar | 3,3,oar
jezrahiah | 3,,rahiah
jezreel |
jezreelites |
jibsam |
jidlaph |
Can you fill these in? What does the actual index “look like”?
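A sketch of lookup in a 3-in-4 front-coded lexicon: binary search over the complete first words of the blocks, then decoding inside a single block. The in-memory layout (a list of `(head, coded)` pairs) and the function names are assumptions made for illustration, not the actual storage format.

```python
from bisect import bisect_right

def build_blocks(sorted_words, block_size=4):
    """Store the first word of each block in full; front-code the others."""
    blocks = []
    for start in range(0, len(sorted_words), block_size):
        group = sorted_words[start:start + block_size]
        previous, coded = group[0], []
        for word in group[1:]:
            shared = 0
            while (shared < min(len(previous), len(word))
                   and previous[shared] == word[shared]):
                shared += 1
            coded.append((shared, word[shared:]))
            previous = word
        blocks.append((group[0], coded))
    return blocks

def lookup(heads, blocks, term):
    """Return (block number, offset in block) of term, or None if absent."""
    b = bisect_right(heads, term) - 1      # last block whose head is <= term
    if b < 0:
        return None
    word, coded = blocks[b]
    if word == term:
        return (b, 0)
    for offset, (shared, suffix) in enumerate(coded, start=1):
        word = word[:shared] + suffix      # decode the next word of the block
        if word == term:
            return (b, offset)
    return None

words = ["jezebel", "jezer", "jezerit", "jeziah", "jeziel", "jezliah",
         "jezoar", "jezrahiah", "jezreel", "jezreelites", "jibsam", "jidlaph"]
blocks = build_blocks(words)
heads = [head for head, _ in blocks]       # the complete words, kept for binary search
```

A lookup decodes at most three front-coded words, so search stays logarithmic in the number of blocks.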

59 Summary The lexicon allows quick search for a term After finding a term, we can find the posting list in the index file, since a pointer to this is given There are different ways to store the lexicon, which have time/space tradeoffs At this point, you should be able to guess how query processing for simple queries can be achieved Before discussing the query processing, we consider additional structures that can be used to answer more advanced queries

60 Partially Specified Query Terms

61 Finding Lexicon Entries Suppose that we are given a term t, we can find its entry in the lexicon using binary search What happens if we are given a term with a wild card? How can we find the following terms in the lexicon? –lab* –*or –lab*r

62 Indexing Using n-grams Decompose all terms into n-grams for some small value of n –n-grams are sequences of n letters in a word –use $ to mark beginning and end of word –digram = 2-gram Example: Digrams of labor are $l, la, ab, bo, or, r$ We store an additional structure that, for each digram, has a list of terms containing that digram –actually, store a list of pointers to entries in the lexicon

63 Example
Term Number | Term
1 | abhor
2 | bear
3 | laaber
4 | labor
5 | laborator
6 | labour
7 | lavacaber
8 | slab
Digram (alphabetically sorted) | Term numbers
$a | 1
$b | 2
$l | 3,4,5,6,7
$s | 8
aa | 3
ab | 1,3,4,5,6,7,8
bo | 4,5,6
la | 3,4,5,6,7,8
or | 1,4,5
ou | 6
ra | 5
ry | 5
r$ |
sl |
Can you fill this in?

64 Example (same term and digram tables as on the previous slide) To find lab*r, you would look for terms in common with $l, la, ab, r$ and then post-process to ensure that the term does match
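The lab*r lookup can be sketched as follows, using the eight terms of the example. The function names are illustrative, and the post-processing step is done here with a regular expression, which is just one possible choice.

```python
import re
from collections import defaultdict

terms = {1: "abhor", 2: "bear", 3: "laaber", 4: "labor",
         5: "laborator", 6: "labour", 7: "lavacaber", 8: "slab"}

def digrams(s):
    """All sequences of two adjacent characters in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def build_digram_index(terms):
    """Map each digram (with $ marking word boundaries) to term numbers."""
    index = defaultdict(set)
    for number, term in terms.items():
        for gram in digrams("$" + term + "$"):
            index[gram].add(number)
    return index

def wildcard_lookup(pattern, terms, index):
    """Intersect the lists of the pattern's digrams, then filter false matches."""
    pieces = ("$" + pattern + "$").split("*")
    grams = set().union(*(digrams(piece) for piece in pieces))
    candidates = set(terms)
    for gram in grams:
        candidates &= index.get(gram, set())
    # Post-processing: keep only candidates that really match the pattern.
    regex = re.compile(".*".join(re.escape(p) for p in pattern.split("*")))
    return sorted(n for n in candidates if regex.fullmatch(terms[n]))

index = build_digram_index(terms)
matches = wildcard_lookup("lab*r", terms, index)
```

Here the digram lists alone leave terms 3 to 7 as candidates; the post-processing removes laaber and lavacaber, which contain every digram but do not start with lab.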

65 Indexing Using Rotated Lexicons If wildcard queries are common, we can save on time at the cost of more space A rotated index can find the matches of any wildcard query with a single wildcard in one binary search An index entry is stored for each letter of each term –labor would have 6 pointers: one for each letter + one for the beginning of the word

66 Partial Example
Term Number | Term
1 | abhor
2 | bear
4 | labor
6 | labour
Rotated Form | Address
$abhor | (1,0)
$bear | (2,0)
$labor | (4,0)
$labour | (6,0)
abhor$ | (1,1)
abor$l | (4,2)
abour$l | (6,2)
r$abho | (1,5)
r$bea | (2,4)
r$labo | (4,5)
r$labou | (6,6)
Note: We do not actually store the rotated string in the rotated lexicon. The pair of numbers is enough for binary search

67 Partial Example (same terms and rotated lexicon as on the previous slide) How would you find the terms for: lab* *or *ab* l*r l*b*r
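A sketch of a rotated lexicon over the four terms of the partial example. For readability it stores the rotated strings themselves, whereas the slides note that the (term, offset) pair is enough; the function names are illustrative.

```python
from bisect import bisect_left

terms = {1: "abhor", 2: "bear", 4: "labor", 6: "labour"}

def build_rotated(terms):
    """All rotations of '$' + term, sorted, each with (term number, offset)."""
    entries = []
    for number, term in terms.items():
        marked = "$" + term
        for offset in range(len(marked)):
            entries.append((marked[offset:] + marked[:offset], number, offset))
    entries.sort()
    return entries

def prefix_search(entries, prefix):
    """One binary search, then a scan over the adjacent matching rotations."""
    i = bisect_left(entries, (prefix,))
    found = set()
    while i < len(entries) and entries[i][0].startswith(prefix):
        found.add(entries[i][1])
        i += 1
    return sorted(found)

def single_wildcard(entries, pattern):
    """X*Y is rotated to Y$X, which must be a prefix of some rotation."""
    head, tail = pattern.split("*")
    return prefix_search(entries, tail + "$" + head)

entries = build_rotated(terms)
```

With this layout, lab* becomes the prefix $lab, *or becomes or$, and l*r becomes r$l. The query *ab* has no fixed beginning or end, so it is simply the prefix ab. A query such as l*b*r has two wildcards in the middle and needs a lookup for l*r followed by post-processing.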

68 Summary We now know how to find terms that match a wildcard query Basic steps for query evaluation for a wildcard query –look up the wildcarded words in an auxiliary index, to find all possible matching terms –given these terms proceed with normal query processing (as if this was not a wildcard query) But, we have not yet explained how normal query processing should proceed! This is the next topic.

69 Query Processing

70 Intuition A query is a Boolean combination of terms, using AND, OR and NOT A conjunctive query only uses AND Suppose that you had a query –Hello AND World How would you answer such a query?

71 Conjunctive Queries
1. For each query term t
  a) Stem t
  b) Search the lexicon
  c) Record f_t (frequency) and address I_t
2. Identify term t with smallest f_t
3. Read I_t. Set C := I_t
4. For each remaining term t' in increasing order of f_t'
  a) Read the entry I_t'
  b) For each d in C, if d not in I_t', then remove d from C
5. For each d in C
  a) Look up the address of d and return document to the user

72 Conjunctive Queries Start with the lowest frequency term t, since the total number of documents in the answer is bounded by the length of I_t –Intermediate results will not take too much space Continue processing in frequency order so that lookups of terms are not too expensive
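The procedure for conjunctive queries can be sketched as below. Several simplifications are assumptions: the index is an in-memory dictionary rather than a lexicon plus inverted file, lower-casing stands in for stemming, and the answer is a list of document numbers instead of the documents themselves.

```python
from bisect import bisect_left

# Toy index (term -> sorted document numbers), an assumption for illustration.
index = {
    "cold": [1, 4], "hot": [1, 4], "in": [2, 5], "it": [4, 5],
    "like": [4, 5], "pease": [1, 2], "porridge": [1, 2],
    "pot": [2, 5], "some": [4, 5], "the": [2, 5],
}

def contains(sorted_list, d):
    """Binary search for d in a sorted posting list."""
    i = bisect_left(sorted_list, d)
    return i < len(sorted_list) and sorted_list[i] == d

def conjunctive_query(terms, index):
    entries = []
    for t in terms:
        t = t.lower()               # stand-in for stemming
        if t not in index:
            return []               # a missing term empties the conjunction
        entries.append(index[t])
    if not entries:
        return []
    entries.sort(key=len)           # increasing frequency f_t
    candidates = list(entries[0])   # C := I_t for the rarest term
    for postings in entries[1:]:
        candidates = [d for d in candidates if contains(postings, d)]
    return candidates
```

For example, `conjunctive_query(["some", "hot"], index)` keeps only document 4. The candidate list never grows beyond the shortest posting list.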

73 Two Strategies for Computing Intersection We are given two lists of documents C_1 and C_2 –Suppose that C_1 is smaller than C_2 If C_1 and C_2 are sorted, they can be merged in time |C_1| + |C_2| If C_2 allows for random access, we can look up each of the documents of C_1 in C_2, in total time |C_1| × log |C_2| Which is preferable?

74 Merge in Linear Time Suppose we want to answer the query –rain AND spain Always move ahead pointer with smaller value
rain: 2, 4, 8, 16, 32, 64
spain: 1, 2, 3, 5, 8, 13
result: 2, 8
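The linear-time merge can be sketched as follows. The two posting lists are an assumed reading of the slide's figure, and the function name is illustrative.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists in time len(a) + len(b)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])     # in both lists: keep it, advance both
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                  # advance the pointer with the smaller value
        else:
            j += 1
    return result

rain = [2, 4, 8, 16, 32, 64]        # assumed posting list for rain
spain = [1, 2, 3, 5, 8, 13]         # assumed posting list for spain
answer = intersect_sorted(rain, spain)
```

Each step advances at least one pointer, so the loop runs at most len(a) + len(b) times.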

75 Non-conjunctive Queries A non-conjunctive query also uses at least one of the operators: OR, NOT Conjunction of disjunctions, example: –(text OR data OR image) AND (compression OR compaction) AND (retrieval OR indexing OR archiving) Estimate the size of each disjunct as the sum of the frequencies of its terms. Process as before (by considering all terms of a single disjunct simultaneously)
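A sketch of evaluating a conjunction of disjunctions: each disjunct's size is estimated as the sum of its term frequencies, and the disjuncts are then intersected in increasing order of that estimate. The posting lists below are invented for illustration, and the function name is illustrative.

```python
# Invented posting lists, for illustration only.
index = {
    "text": [1, 2, 3], "data": [2, 4], "image": [5],
    "compression": [2, 3, 5], "compaction": [6],
    "retrieval": [2, 5], "indexing": [3],
}

def evaluate_cnf(groups, index):
    """groups is a list of disjuncts, each a list of terms joined by OR."""
    disjuncts = []
    for group in groups:
        estimate, docs = 0, set()
        for term in group:
            postings = index.get(term, [])
            estimate += len(postings)   # sum of term frequencies
            docs.update(postings)       # union of the posting lists
        disjuncts.append((estimate, docs))
    disjuncts.sort(key=lambda pair: pair[0])   # smallest estimate first
    result = set(disjuncts[0][1])
    for _, docs in disjuncts[1:]:
        result &= docs
    return sorted(result)

query = [["text", "data", "image"],
         ["compression", "compaction"],
         ["retrieval", "indexing", "archiving"]]
answer = evaluate_cnf(query, index)
```

The estimate is an upper bound on the size of the union, since a document that contains several terms of a disjunct is counted more than once.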

76 Question Sam claims that “if a query Q is a conjunctive query of terms containing wildcards, then Q is actually processed as a conjunction of disjunctions” Is Sam right?

77 More Issues: What data should be stored?

78 Choosing What Data To Store We would like the user to be able to get as many relevant answers to his query as possible Examples: –Query: computer science. Should it match Computer Science? –Query: data compression. Should it match compressing data? –Query: Amir Perez. Should it match Amir Peretz? The way we store the data in our lexicon will affect our answers to the queries

79 Case Folding Normally accepted to perform case folding, i.e., to reduce all words to lower-case form before storing in the lexicon User’s query is transformed to lower case before looking for the terms in the index What effect does this have on the lexicon size?

80 Stemming Suppose that a user is interested in finding pages about “running shoes” –We may also want to return pages with shoe –We may also want to return pages with run or runs Solution: Use a stemmer –Stemmer returns the stem (root) of the word Note: This means that more relevant answers will be returned, as well as more irrelevant answers! –Example: cleary AND witten => clear AND wit

81 Porter Stemmer A multi-step, longest-match stemmer. –Paper introducing this stemmer can be found online Notation –v: vowel(s) –c: consonant(s) –(vc)^m: vowel(s) followed by consonant(s), repeated m times Any word can be written: [c](vc)^m[v] –parts in brackets are optional –m is called the measure of the word We discuss only the first few rules of the stemmer

82 Porter Stemmer: Step 1a Follow first applicable rule
Suffix | Replacement | Examples
sses | ss | caresses => caress
ies | i | ponies => poni, ties => ti
ss | ss | caress => caress
s | null | cats => cat
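Step 1a can be sketched as a rule list that is scanned in order, stopping at the first suffix that matches. The names `STEP_1A_RULES` and `step_1a` are illustrative.

```python
# (suffix, replacement) pairs in the order in which they are tried.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    """Apply the first rule whose suffix matches; otherwise leave the word alone."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word
```

The rule that maps ss to ss looks redundant, but it stops caress from reaching the final rule and losing its last s.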

83 Porter Stemmer: Step 1b Follow first applicable rule
Conditions | Suffix | Replacement | Examples
(m > 0) | eed | ee | feed -> feed, agreed -> agree
(*v*) | ed | null | plastered -> plaster, bled -> bled
(*v*) | ing | null | motoring -> motor, sing -> sing
*v* - the stem contains a vowel
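A sketch of the three rules of step 1b, together with the measure m. Treating y as a vowel when it follows a consonant comes from Porter's definition rather than from these slides, and the clean-up rules that the full algorithm applies after removing ed or ing are left out. The function names are illustrative.

```python
def is_vowel(word, i):
    """a, e, i, o, u are vowels; y counts as one when it follows a consonant."""
    c = word[i]
    if c in "aeiou":
        return True
    return c == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """The m in [c](vc)^m[v]: count the vowel-to-consonant transitions."""
    m, previous_was_vowel = 0, False
    for i in range(len(stem)):
        vowel = is_vowel(stem, i)
        if previous_was_vowel and not vowel:
            m += 1
        previous_was_vowel = vowel
    return m

def contains_vowel(stem):
    return any(is_vowel(stem, i) for i in range(len(stem)))

def step_1b(word):
    """Only the rule whose suffix matches first is considered."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if contains_vowel(stem) else word
    return word
```

Note that feed stays feed: the eed rule matches, its condition fails because the stem f has m = 0, and the ed rule is not tried afterwards.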

84 Stop Words Stop words are very common words that generally are not of importance, e.g.: the, a, to Such words take up a lot of room in the index (why?) They slow down query processing (why?) They generally do not improve the results (why?) Some search engines do not store these words at all, and remove them from queries

85 Google's solution Problem: Can we answer the query to be or not to be? Store stopwords in index Remove stopwords from query, unless: –there is a + before the stop word, e.g., queen +of England –there are quotation marks around the word, e.g., "queen of England" Compare these two: no quotes vs. quotes

86 Spelling Corrections: Soundex Many different algorithms have been used for spelling corrections We discuss the Soundex algorithm. It was originally developed in 1918 for storing census information

87 The Soundex Algorithm 1. Keep the first letter of the word 2. Change all occurrences of the following letters to 0: A, E, I, O, U, H, W, Y 3. Change the following letters to the given digits: –B, F, P, V → 1 –C, G, J, K, Q, S, X, Z → 2 –D, T → 3 –L → 4 –M, N → 5 –R → 6

88 The Soundex Algorithm (cont.) 4. Remove one out of each pair of consecutive identical digits 5. Remove all 0s 6. Pad the resulting string with trailing zeros and return the first 4 positions Example: Soundex for Herman How can Soundex be used in query processing?
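The six steps can be sketched as below. One detail is an assumption: the slides do not say whether the first letter's own digit takes part in step 4, and this sketch assumes that it does. The names `CODES` and `soundex` are illustrative.

```python
# Steps 2 and 3: the letter-to-digit table.
CODES = {}
for letters, digit in (("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    letters = [c for c in word.upper() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters]
    # Step 4: collapse runs of identical digits (assumption: including the
    # digit of the first letter).
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Step 5: drop the 0s; step 1: the first position is the letter itself.
    tail = [d for d in collapsed[1:] if d != "0"]
    # Step 6: pad with trailing zeros and keep the first 4 positions.
    return (letters[0] + "".join(tail) + "000")[:4]
```

Here `soundex("Herman")` gives H655, and Hermann receives the same code. One way to use this in query processing is to index terms by their code, so that a misspelled query term can be mapped to terms that sound alike.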
