Searching the Web Mark Levene (Follow the links to learn more!)

Mechanics of a Typical Search Query submitted to Google

Mechanics of a Typical Search Google results for the query

Search Engines as Information Gatekeepers Search engines are becoming the primary entry point for discovering web pages. Ranking of web pages influences which pages users will view. Exclusion of a site from search engines will cut off the site from its intended audience. The privacy policy of a search engine is important.

Search Engine Wars The battle for domination of the web search space is heating up! The competition is good news for users! The way in which advertising is combined with search results is crucial! There are serious implications if one of the search engines manages to dominate the space!

Google The verb “google” has become synonymous with searching for information on the web. Has raised the bar on search quality. Has been the most popular search engine in the last few years. Had a very successful IPO in August 2004. Is innovative and dynamic.

Yahoo! Synonymous with the dot-com boom, probably the best known brand on the web. Started off as a web directory service. Has very strong advertising and e-commerce partnerships. Acquired leading search engine technology in 2003.

MSN Search Synonymous with PC software. Remember its victory in the browser wars with Netscape. Developed its own search engine technology only recently, officially launched in Feb. 2005. May link web search into its next version of Windows.

Others
Ask Jeeves
– Specialises in natural language question answering.
– Search driven by Teoma.
Looksmart
– Has its own directory service.
– Search driven by Wisenut.
…

Statistics from search engine logs
Statistic (Year)                     AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
average terms per query              2.35               2.30               2.60
average queries per session          2.02               2.80               2.30
average result pages viewed          1.39               1.55               1.70
usage of advanced search features    20.4%              1.0%               10.0%

Experiment with search engine query syntax Default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in all hits. “+chess” in a query means the user insists that “chess” be present in all hits. “computer OR chess” means that at least one of the keywords must be present in each hit. Enclosing the query in double quotes, "computer chess", means that the phrase “computer chess” must be present in all hits.
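
The three interpretations above can be sketched as simple predicates over a document's words. This is a minimal illustration only: the function names and the naive tokeniser are assumptions made for the example, not how any particular search engine parses queries.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_and(doc, keywords):
    """Default AND semantics: every keyword must occur in the document."""
    words = set(tokenize(doc))
    return all(k.lower() in words for k in keywords)

def matches_or(doc, keywords):
    """OR semantics: at least one keyword must occur in the document."""
    words = set(tokenize(doc))
    return any(k.lower() in words for k in keywords)

def matches_phrase(doc, phrase):
    """Phrase semantics: the words must occur consecutively and in order."""
    words = tokenize(doc)
    target = tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

doc = "Chess programs run on a computer"
print(matches_and(doc, ["computer", "chess"]))   # both words are present
print(matches_phrase(doc, "computer chess"))     # but not as a phrase
```

A document can satisfy the AND query without satisfying the phrase query, which is why quoting a query usually narrows the result set.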

The most popular search keywords
AltaVista (1998), AlltheWeb (2002), Excite (2001)
sex, free
applet, sex
porno, download, pictures
mp3, software, new
chat, uk, nude

Search Engine Architecture

Crawler Algorithm A crawler is a program that traverses web pages, downloads them for indexing and follows (or harvests) the hyperlinks on the downloaded pages. A crawler will typically start from a multitude of web pages and aims to cover as much of the indexable web as possible. The standard algorithm uses a breadth-first strategy. Focused crawlers use a best-first strategy.
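
A breadth-first crawl can be sketched with a FIFO queue as the frontier. To keep the example self-contained it walks a small in-memory link graph instead of fetching real pages; `fetch_links`, `toy_web` and `max_pages` are hypothetical names introduced for the illustration.

```python
from collections import deque

def crawl_breadth_first(seeds, fetch_links, max_pages=100):
    """Visit pages in breadth-first order starting from the seed URLs.

    fetch_links(url) stands in for downloading a page and harvesting
    the hyperlinks found on it.
    """
    seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    seen = set(seeds)                    # URLs already queued
    frontier = deque(seeds)              # FIFO queue gives breadth-first order
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)              # the page would be indexed here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# A toy "web": each page maps to the pages it links to.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
order = crawl_breadth_first(["a"], lambda url: toy_web.get(url, []))
print(order)  # ['a', 'b', 'c', 'd']
```

A focused (best-first) crawler would replace the FIFO queue with a priority queue ordered by an estimate of each link's relevance to the target topic.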

Search Index - Inverted File Also store position of word in web page and info. on HTML structure.
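
An inverted file that also records word positions can be sketched as a mapping from each word to the documents containing it, with a list of positions per document. This is a minimal sketch under simplifying assumptions: whitespace tokenisation, no HTML structure information, and a hypothetical `build_inverted_index` helper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

docs = {1: "chess computer programming", 2: "chess game"}
index = build_inverted_index(docs)
print(dict(index["chess"]))  # {1: [0], 2: [0]}
print(dict(index["game"]))   # {2: [1]}
```

Storing positions is what makes phrase queries possible: two words form a phrase in a document when their positions are consecutive.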

The query engine The interface between the search index, the user and the web. Algorithmic details of commercial search engines kept as trade secrets. First step is retrieval of potential results from the index. Second step is the ranking of the results based on their “relevance” to the query.

Vector Space Model Vector Space Model – Content Relevance

Term Frequency (TF) Count the number of occurrences of each term. Bag of words approach. Ignore stopwords such as is, a, of, the, … Stemming – computer is replaced by comput, as are its variants: computers, computing, computation, computer and computed. Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. Figure text: chess computer programming chess game chess game is a
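
The counting and normalisation steps can be sketched as follows. The stopword list is a tiny illustrative one, stemming is left out, and "doc length" is taken here to mean the number of words left after stopword removal; all three are assumptions of this example.

```python
from collections import Counter

STOPWORDS = {"is", "a", "of", "the"}  # tiny illustrative list

def term_frequencies(text, normalise="max"):
    """Return normalised term frequencies for a bag of words.

    normalise="max"    divides by the count of the most frequent word;
    normalise="length" divides by the number of (non-stopword) words.
    """
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    denom = len(words) if normalise == "length" else max(counts.values())
    return {word: count / denom for word, count in counts.items()}

tf = term_frequencies("chess game is a game of chess computer")
print(tf)  # {'chess': 1.0, 'game': 1.0, 'computer': 0.5}
```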

Inverse Document Frequency (IDF) N is the number of documents in the corpus. ni is the number of docs in which word i appears. The log dampens the effect of IDF. IDF is also the number of bits needed to represent the term.
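
The formula itself did not survive in this transcript. The usual definition that is consistent with the symbols above is given below as a reconstruction, not as a quotation from the slide:

```latex
\[
  \mathrm{IDF}_i = \log \frac{N}{n_i}
\]
```

With a base-2 logarithm this equals $-\log_2 (n_i / N)$, the information content in bits of an event with probability $n_i / N$, which matches the "number of bits" remark.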

Ranking with TF-IDF j – refers to document j. i – refers to word (or term) i in doc j. q – is the query, which is a sequence of terms. scorej – is the score for document j given q. Rank results according to the scoring function.
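
The scoring function is not reproduced in the transcript. A common form, assumed here, sums TF times IDF over the query terms that occur in the document. The sketch below uses max-normalised TF and a base-2 log; the function name and the sample documents are hypothetical.

```python
import math
from collections import Counter

def rank_tf_idf(docs, query):
    """Rank documents by the sum of TF * IDF over the query terms."""
    tokenised = {j: text.lower().split() for j, text in docs.items()}
    n_docs = len(tokenised)
    doc_freq = Counter()                 # n_i: number of docs containing word i
    for words in tokenised.values():
        doc_freq.update(set(words))
    scores = {}
    for j, words in tokenised.items():
        counts = Counter(words)
        max_count = max(counts.values()) if counts else 1
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / max_count
                idf = math.log2(n_docs / doc_freq[term])
                score += tf * idf
        scores[j] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": "chess computer programming",
    "d2": "chess game",
    "d3": "cooking recipes",
}
ranking = rank_tf_idf(docs, "computer chess")
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd2', 'd3']
```

The rarer term "computer" contributes more to the score of d1 than the more common term "chess", which is the intended effect of IDF.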

Content Relevance Phrase matching. Synonyms. URL analysis. Date last updated. Spell checking. Home page detection.

Link Text (Anchor Text) Include link text for a link pointing to a web page, say P, as part of the content of P. Link text is very useful in finding home pages. Link text behaves like user queries – They act as short summaries – They often match query terms

HTML Weighting
Class Name       HTML tags
1) Plain Text    None of the above
2) Strong        STRONG, B, EM, I, U
3) List          DL, OL, UL
4) Header        H1, H2, H3, H4, H5, H6
5) Anchor        A
6) Title         TITLE
Normal retrieval = (1 1 1 1 0 1), ranking with TF-IDF.
(1 8 1 8 8 2) – 39.6% improvement.
(1 8 1 7 8 2) – 48.3% improvement – C2, C4 and C5.
(1 8 1 5 8 2) – 43.5% improvement.
Meta tag text is mostly ignored by search engines.
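
One way to read the weight vectors is as per-class multipliers on a term's occurrence counts, in the class order 1 to 6 listed on the slide. The sketch below follows that reading, which is an assumption of this example; the class keys and function name are hypothetical.

```python
# Class order follows the slide: 1) Plain Text ... 6) Title.
CLASSES = ["plain", "strong", "list", "header", "anchor", "title"]

def weighted_term_frequency(class_counts, weights):
    """Sum a term's per-class occurrence counts, scaled by the class weights."""
    return sum(w * class_counts.get(c, 0) for c, w in zip(CLASSES, weights))

NORMAL = (1, 1, 1, 1, 0, 1)    # normal retrieval: anchor text is not counted
BOOSTED = (1, 8, 1, 7, 8, 2)   # the best-performing vector on the slide

# A term that occurs 3 times in plain text, once in a header
# and twice in anchor text pointing to the page.
counts = {"plain": 3, "header": 1, "anchor": 2}
print(weighted_term_frequency(counts, NORMAL))   # 4
print(weighted_term_frequency(counts, BOOSTED))  # 26
```

The weighted count would then take the place of the raw term count in the TF-IDF score.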

Factor in Link Metrics Multiply by the PageRank (PR) of the document (web page). We do not know exactly how Google factors in the PR; it may be that log(PR) is used.

Popularity Based Metrics Factor in users’ opinions as represented in the query logs. Document space modification adjusts the weights of keywords in popular pages. Clickthrough data can also be taken into account to improve the ranking of search engine query results.

Precision and Recall Precision is Overlap/Retrieved (first results page retrieved is most important). Recall is Overlap/Relevant (for web search recall is related to index coverage).
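
Both measures follow directly from the set of retrieved results and the set of relevant documents. A minimal sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = retrieved & relevant
    precision = len(overlap) / len(retrieved) if retrieved else 0.0
    recall = len(overlap) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 results retrieved, 3 documents relevant, 2 in the overlap.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(p)  # 0.5
print(r)  # 2 of the 3 relevant documents were found
```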

Typical Recall-Precision Curve Top-n precision – proportion of relevant results among the top n ranked results. Measure top-n precision at fixed recall points, for n ranging from 0% to 100% of the ranked results.
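
Top-n precision can be computed by cutting the ranked list at position n. A small sketch with a hypothetical ranking:

```python
def precision_at_n(ranked, relevant, n):
    """Proportion of the top n ranked results that are relevant."""
    top = ranked[:n]
    if not top:
        return 0.0
    relevant = set(relevant)
    return sum(1 for doc in top if doc in relevant) / len(top)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
for n in range(1, len(ranked) + 1):
    print(n, precision_at_n(ranked, relevant, n))
```

Evaluating this at increasing n while tracking how many relevant documents have been found so far gives the points of a recall-precision curve.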

Probabilistic IR Basic question: What is the probability that a document, D, is relevant to a query Q? Probability ranking assumption: If docs retrieved are ordered by decreasing probability of relevance then the overall effectiveness of the system is the best obtainable given the input documents.

Bayes Formulation of Relevance R – relevance of D with respect to a query Q D – document (web page) P(R|D) – probability that a page is relevant given its description (or representation) NR – D not relevant with respect to Q
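
The formula is missing from the transcript. With the symbols defined above, Bayes' rule gives the following (a reconstruction, not a quotation from the slide):

```latex
\[
  P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)},
  \qquad
  P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}
\]
```

Since $P(D)$ is the same in both expressions, it can be ignored when comparing the two classes.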

Naïve Bayes Independence Assumption n – the number of words in D wi – the word in position i in D Also assume that the probability of a word is independent of its position in the document.
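
Under this assumption the document probability factorises over its words. The formula is missing from the transcript; the standard form consistent with the symbols above is:

```latex
\[
  P(D \mid R) = \prod_{i=1}^{n} P(w_i \mid R),
  \qquad
  P(D \mid NR) = \prod_{i=1}^{n} P(w_i \mid NR)
\]
```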

Computing the probabilities ri – number of times wi occurs in relevant docs. RW – number of words in relevant docs (counting duplicate words multiple times, since docs are bags). nri – number of times wi occurs in non-relevant docs. NRW – number of words in non-relevant docs. P(R) is estimated from the number of relevant docs with respect to Q. P(NR) is estimated from the number of docs which are not relevant with respect to Q. c – is a smoothing constant greater than or equal to one.

Is a Document Relevant? Assume we have a set of training examples of relevant and non-relevant documents to compute P(wi|R) and P(wi|NR) for words wi. The user could mark docs as R or NR. Choose the class (R or NR) which has higher probability.
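
The train-then-choose procedure can be sketched as a small Naive Bayes classifier. The slide's exact estimator is not in the transcript, so the smoothed estimate used here, P(wi|R) = (ri + c) / (RW + c·V) with V the vocabulary size, is an assumption, as are the function names and the training data. Both classes are assumed to have at least one training document.

```python
import math
from collections import Counter

def train(relevant_docs, non_relevant_docs, c=1.0):
    """Collect word counts and class priors from labelled training docs."""
    rel = Counter(w for d in relevant_docs for w in d.lower().split())
    non = Counter(w for d in non_relevant_docs for w in d.lower().split())
    total_docs = len(relevant_docs) + len(non_relevant_docs)
    return {
        "rel": rel,                           # r_i for each word
        "non": non,                           # nr_i for each word
        "RW": sum(rel.values()),              # words in relevant docs
        "NRW": sum(non.values()),             # words in non-relevant docs
        "vocab_size": len(set(rel) | set(non)),
        "p_r": len(relevant_docs) / total_docs,
        "p_nr": len(non_relevant_docs) / total_docs,
        "c": c,                               # smoothing constant
    }

def log_posteriors(model, doc):
    """Return unnormalised (log P(R|D), log P(NR|D)), dropping the shared P(D)."""
    c, v = model["c"], model["vocab_size"]
    log_r = math.log(model["p_r"])
    log_nr = math.log(model["p_nr"])
    for w in doc.lower().split():
        log_r += math.log((model["rel"][w] + c) / (model["RW"] + c * v))
        log_nr += math.log((model["non"][w] + c) / (model["NRW"] + c * v))
    return log_r, log_nr

def classify(model, doc):
    """Choose the class (R or NR) with the higher posterior probability."""
    log_r, log_nr = log_posteriors(model, doc)
    return "R" if log_r > log_nr else "NR"

model = train(
    ["chess opening theory", "computer chess engine"],
    ["pasta recipe", "tomato soup recipe"],
)
print(classify(model, "chess engine"))  # R
print(classify(model, "soup recipe"))   # NR
```

Working with logarithms avoids numerical underflow when many small word probabilities are multiplied together.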

Ranking Documents D1 is more relevant than D2 given Q, where – nD1 is the number of words in D1 – nD2 is the number of words in D2. If we are ranking, then P(wi|R) can be approximated on the basis of a document with a weighting function such as TF-IDF.

Other types of Search Engine
Directory – e.g. Yahoo! (Open Directory)
MetaSearch – e.g. Dogpile (Mamma)
Clustering – e.g. Clusty
Question Answering – e.g. Ask Jeeves, WolframAlpha, True Knowledge, and Google, Yahoo and Bing
Visual – e.g. Quintura
Collaborative – e.g. Omgili, Sproose, AfterVote
Human Input – e.g. ChaCha
Social Tagging – e.g. Blekko, MrTaggy

Directions in Search
Mobile search – e.g. Google Mobile
Local search – e.g. Google Local
Video search – e.g. YouTube
Image search – e.g. Picsearch
Audio search – e.g. Yahoo Audio
Blog search – e.g. Technorati
Social bookmarking – e.g. Delicious
Some new ideas – e.g. Cuil

Paid Inclusion and Paid Placement Paid inclusion – payment to speed up inclusion in the search index. Pay-Per-Click (PPC) or Cost-Per-Click (CPC) – payment for being advertised on the search engine’s sponsored results list. The sponsored list should be separated from the organic list. PPC is a major revenue source for search engines. Click fraud is a problem!

Behavioural Targeting Contextual targeting is a weaker form based on a single user session. Personalised advertising in order to increase the effectiveness of advertising. Data collected from individual users, normally through cookies.

Pay-Per-Action (PPA) Charge the advertiser only when an action takes place such as a purchase, a download or any other trackable action. Ad network will require the advertiser to place a script in the web page triggering the action. Some level of trust is needed.
