Search engines fdm 20c introduction to digital media lecture 06.06.2007 warren sack / film & digital media department / university of california, santa.


1 search engines fdm 20c introduction to digital media lecture 06.06.2007 warren sack / film & digital media department / university of california, santa cruz

2 outline
what’s the difference between a web directory and a search engine?
–ontologies
–clusters
what does a search engine do?
–crawling
–indexing
–retrieving
searching for difference rather than popularity
–an explanation of the search engine experiment

3 what do search engines do?
crawl the web
index the pages found
retrieve pages from the index in response to user queries

4 crawling/spidering the web

5 depth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html

6 breadth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html
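The difference between the two crawl orders can be sketched on a toy link graph. This is a minimal illustration, not a real crawler: the `LINKS` table and the `crawl` function are hypothetical names, and no pages are actually fetched. The only thing that changes between the two orders is whether the frontier is used as a stack or as a queue.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def crawl(start, links, depth_first=False):
    """Return the pages in the order a crawler would visit them."""
    frontier = deque([start])
    seen = set()
    visited = []
    while frontier:
        # Stack (take the newest page) gives depth-first order;
        # queue (take the oldest page) gives breadth-first order.
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        visited.append(page)
        out = links.get(page, [])
        # Reversed for depth-first so the first link on a page is followed first.
        frontier.extend(reversed(out) if depth_first else out)
    return visited

print(crawl("A", LINKS))                    # breadth-first order
print(crawl("A", LINKS, depth_first=True))  # depth-first order
```

Breadth-first visits every page one link away from the start before any page two links away; depth-first follows one chain of links to its end before backing up.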

7 indexing web pages found by the crawler
1. tokenization: break the document into words, punctuation, tags, and uris
2. stop words: remove any words that are used too frequently to be informative (these are called stop words)
3. stemming: find the stem of each word
4. inverted index: create an index linking each word found to one or more documents

8 tokenization
T’was a dark and stormy night in the country manor.
==> T single-quote was a dark and stormy night in the country manor period
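A tokenizer along these lines can be sketched with a single regular expression. This is an assumption about one simple way to do it, not the method any particular search engine uses; the `tokenize` name is hypothetical, and tags and uris are not handled.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("T'was a dark and stormy night in the country manor.")
print(tokens)
```

The apostrophe and the final period come out as tokens of their own, matching the "single-quote" and "period" items on the slide.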

9 stop words
T single-quote was a dark and stormy night in the country manor period
==> dark stormy night country manor
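Stop-word removal amounts to filtering the token list against a fixed set. The sketch below assumes a tiny, hypothetical `STOP_WORDS` set chosen only to fit this sentence (real lists are much longer), and it also drops punctuation tokens.

```python
# Hypothetical, deliberately tiny stop-word list for this one example.
STOP_WORDS = {"t", "was", "a", "and", "in", "the"}

def remove_stop_words(tokens):
    """Keep only alphanumeric tokens that are not stop words."""
    return [
        tok for tok in tokens
        if tok.isalnum() and tok.lower() not in STOP_WORDS
    ]

tokens = ["T", "'", "was", "a", "dark", "and", "stormy", "night",
          "in", "the", "country", "manor", "."]
print(remove_stop_words(tokens))
```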

10 stemming
dark stormy night country manor
==> dark storm night country manor
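The following is a rough sketch of suffix stripping, assuming a small vocabulary of known stems. It is not a real stemming algorithm (production systems typically use something like the Porter stemmer, whose output can differ); `KNOWN_STEMS`, `SUFFIXES`, and `stem` are hypothetical names chosen for this illustration.

```python
# Hypothetical vocabulary of acceptable stems and suffixes to strip.
KNOWN_STEMS = {"storm", "crawl", "index"}
SUFFIXES = ("ing", "ed", "y", "s")

def stem(word):
    """Strip a suffix only if what remains is a known stem."""
    for suffix in SUFFIXES:
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate in KNOWN_STEMS:
            return candidate
    return word

words = ["dark", "stormy", "night", "country", "manor"]
print([stem(w) for w in words])
```

Checking the result against a vocabulary is what keeps "country" intact while "stormy" is reduced to "storm"; a rule that stripped every trailing "y" would also produce "countr".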

11 document vectors
documents (i.e., web pages) are stored as vectors or lists; e.g.,
country dark manor night storm
note that syntactic structure is lost
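One simple reading of "vector" here is an alphabetized list of the distinct stems in a page. The sketch below assumes that reading; `document_vector` is a hypothetical name. It also shows why syntactic structure is lost: two different word orders produce the same vector.

```python
def document_vector(stems):
    """Return the distinct stems of a document in alphabetical order."""
    return sorted(set(stems))

original = ["dark", "storm", "night", "country", "manor"]
shuffled = ["manor", "country", "night", "storm", "dark"]

print(document_vector(original))
# Word order in the page no longer matters once the vector is built.
print(document_vector(original) == document_vector(shuffled))
```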

12 inverted index
dictionary: make a file listing all of the words found in all of the web pages crawled
postings: make a file listing each web page in which each of the words was found
encoding: remove any redundancies found in the dictionary and postings files so that they are as small as possible

13 inverted index / dictionary
create a document vector for each web page
combine all of the vectors together
remove any duplicate words
alphabetize the resulting list

14 inverted index / dictionary / example
webpage 1: T’was a dark and stormy night in the country manor. ==> country dark manor night storm
webpage 2: Now is the time for all good men to come to the aid of their country. ==> aid all come country good men time
dictionary: aid all come country dark good manor men night storm time

15 inverted index / postings / example
postings: aid (2) all (2) come (2) country (1&2) dark (1) good (2) manor (1) men (2) night (1) storm (1) time (2)
the postings file just annotates the dictionary file with the documents in which the words appear
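The dictionary and postings for the two example webpages can be built in a few lines. This is a minimal in-memory sketch, assuming the document vectors are already available; the names `DOCS` and `build_inverted_index` are hypothetical, and the encoding (compression) step is left out.

```python
# Document vectors for the two example webpages, keyed by page number.
DOCS = {
    1: ["country", "dark", "manor", "night", "storm"],
    2: ["aid", "all", "come", "country", "good", "men", "time"],
}

def build_inverted_index(docs):
    """Return (dictionary, postings) for a mapping of doc id -> terms."""
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            # Record that this term occurs in this document.
            postings.setdefault(term, set()).add(doc_id)
    dictionary = sorted(postings)  # alphabetized list of distinct terms
    return dictionary, postings

dictionary, postings = build_inverted_index(DOCS)
for term in dictionary:
    print(term, sorted(postings[term]))
```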

16 retrieval
the postings file can support boolean queries
for example, searching for webpages that contain the word “country” returns webpages 1 and 2
for example, searching for webpages that contain the words “country” and “manor” returns webpage 1
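Boolean queries over a postings table reduce to set operations. The sketch below assumes the postings are held as Python sets keyed by term; `POSTINGS`, `search_and`, and `search_or` are hypothetical names for this illustration.

```python
# Postings for the two example webpages: term -> set of page numbers.
POSTINGS = {
    "aid": {2}, "all": {2}, "come": {2}, "country": {1, 2},
    "dark": {1}, "good": {2}, "manor": {1}, "men": {2},
    "night": {1}, "storm": {1}, "time": {2},
}

def search_and(postings, *terms):
    """Pages that contain every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(postings, *terms):
    """Pages that contain at least one of the given terms."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result

print(search_and(POSTINGS, "country"))
print(search_and(POSTINGS, "country", "manor"))
```

A term that is missing from the postings yields an empty set, so an AND query containing it matches nothing.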

17 better retrieval
how should the webpages retrieved be ordered?
e.g., if webpage 1 and webpage 2 both contain the word “country”, should 1 be listed before or after 2?

18 social networks and centrality Diane is central; Jane is not. See www.orgnet.com/sna.html

19 google’s page rank algorithm
basic idea: if a webpage has a lot of other webpages that link to it, then it is central and probably an important page.
If page A has pages T1...Tn which link to it (i.e., are citations) and C(A) is defined as the number of links going out of page A, then the PageRank (PR) of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
–where the parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85.
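Because each page's rank depends on the ranks of the pages linking to it, one common way to evaluate the formula is to start every page at the same value and reapply it until the numbers settle. The sketch below does that on a made-up three-page graph; the `LINKS` table, the starting value of 1.0, and the fixed iteration count are assumptions of this illustration, not part of the slide.

```python
# Hypothetical toy web: page -> pages it links to (its outgoing links).
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    pr = {page: 1.0 for page in links}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # T1...Tn: the pages that link to this page.
            incoming = [t for t in links if page in links[t]]
            # C(T) is the number of links going out of page T.
            new_pr[page] = (1 - d) + d * sum(
                pr[t] / len(links[t]) for t in incoming
            )
        pr = new_pr
    return pr

ranks = pagerank(LINKS)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 3))
```

In this toy graph C is linked to by both A and B, so it ends up ranked highest; B, which receives only half of A's vote, ranks lowest.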

20 the difference is in the details
The Boolean approach to information retrieval is the most widely used. Links to documents are returned only if they contain exactly the same words as your query. Even if a site is related in some way, it would never be retrieved unless it contains the user's search string. In order to get better results from this, most search engines finesse things a little. The results from the inverted file are modified by factors such as: where the search term appears in the item - whether the string of characters appears in specially tagged areas such as titles or head, or whether it appears early in the body text of the html document; how close the search words are together, or whether they form an 'exact phrase'; the frequency of occurrence of search terms in the inverted file; whether the search terms appear as keywords in the metatags in the html file's head; how frequently the terms appear in relation to the document's length (greater frequency indicating a probable greater relevance). These techniques gradually move over into semantic analysis. For instance selection by the frequency of occurrence of words in general in which very uncommon words get heavier weight.
Matthew Fuller
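Two of the factors in the passage above, term frequency relative to document length and heavier weight for uncommon words, are often combined in a score known as tf-idf. The passage does not name this technique; the sketch below is one plausible formalization, with hypothetical names (`DOCS`, `tf_idf`) and the two example webpages as data.

```python
import math

# Terms of the two example webpages, keyed by page number.
DOCS = {
    1: ["country", "dark", "manor", "night", "storm"],
    2: ["aid", "all", "come", "country", "good", "men", "time"],
}

def tf_idf(term, doc_id, docs):
    """Score a term in a document: frequent here, rare elsewhere."""
    terms = docs[doc_id]
    # Term frequency, relative to the document's length.
    tf = terms.count(term) / len(terms)
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for doc_terms in docs.values() if term in doc_terms)
    if df == 0:
        return 0.0
    # Inverse document frequency: uncommon words get heavier weight.
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("country", 1, DOCS))  # appears in every page
print(tf_idf("manor", 1, DOCS))    # appears only in webpage 1
```

A word that appears in every page, such as "country" here, scores zero because it does nothing to tell the pages apart.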

