
1 Search Technologies

2 Examples
Fast
Google Enterprise
– Google Search Solutions for business
– Page Rank
Lucene
– Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
Solr
– Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.

3 Search Engine Ranking Criteria

4 Yahoo!
– has been in the search game for many years
– is better than MSN but nowhere near as good as Google at determining whether a link is a natural citation
– has a ton of internal content and a paid inclusion program, both of which give them an incentive to bias search results toward commercial results
– things like cheesy off-topic reciprocal links still work great in Yahoo!

5 MSN (Bing)
– new to the search game
– is bad at determining whether a link is natural or artificial in nature
– due to sucking at link analysis, they place too much weight on the page content
– their poor relevancy algorithms cause a heavy bias toward commercial results
– likes bursty recent links
– new sites that are generally untrusted in other systems can rank quickly in MSN Search
– things like cheesy off-topic reciprocal links still work great in MSN Search

6 Google
– has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph
– is much better than the other engines at determining whether a link is a true editorial citation or an artificial link
– looks for natural link growth over time
– heavily biases search results toward informational resources
– trusts old sites way too much
– a page on a site or subdomain of a site with significant age or link-related trust can rank much better than it should, even with no external citations
– they have aggressive duplicate content filters that filter out many pages with similar content
– if a page is obviously focused on a term, they may filter the document out for that term; on-page variation and link anchor text variation are important
– a page with a single reference or a few references to a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier
– crawl depth is determined not only by link quantity, but also by link quality; excessive low-quality links may make your site less likely to be crawled deep or even included in the index
– things like cheesy off-topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost

7 Ask
– looks at topical communities
– due to their heavy emphasis on topical communities, they are slow to rank sites until they are heavily cited from within their topical community
– due to their limited market share, they probably are not worth paying much attention to unless you are in a vertical where they have a strong brand that drives significant search traffic

8 History
SMART – System for the Mechanical Analysis and Retrieval of Text (nicknamed “Salton’s Magic Automatic Retriever of Text”)
– Vector Space Model
– Relevance feedback algorithm (customization)
– Latent Semantic Indexing (LSI)

9 Basic Vector Space Algo
Vanilla Search Algo
Keyword search (ignore search modifiers and stop words, e.g. not, and, this, their, is, or, of)
Remove punctuation marks
Reduce words to their root form (stemming)
– Combination of suffix and prefix
– E.g.: students → student, swam → swim
– lemmatization → stochastic algorithm
– science, scientist??
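The three preprocessing steps above can be sketched in Python as follows. This is only an illustration: the stop-word list, the two suffix rules, and the names STOP_WORDS, SUFFIX_RULES, stem and preprocess are assumptions made for this example, not something the slides specify.

```python
import string

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "or", "not", "is", "this", "their", "a", "to", "for"}

# Assumed toy suffix rules, tried in order. This is not a real stemmer
# such as Porter's algorithm.
SUFFIX_RULES = [("ies", "y"), ("s", "")]


def stem(word):
    """Strip the first matching suffix, keeping at least three leading letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def preprocess(text):
    """Lowercase, remove ASCII punctuation, drop stop words, then stem."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(word) for word in cleaned.split() if word not in STOP_WORDS]
```

With these rules, preprocess("The promise of search technologies.") returns ['promise', 'search', 'technology']. An irregular form such as "swam" passes through unchanged, which is the kind of case that needs lemmatization rather than suffix stripping.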

10 Documents to be indexed
Document 1 – Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.

11 Document 2 – Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.

12 Document 3 – Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.

13 Stop words for removal
Document 1 – Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.
Document 2 – Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.
Document 3 – Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.

14 Stemming Changes Identified
Document 1 – search technology around forty years time user base expanded first science technology information professionals finally information professionals pretty much everyone
Document 2 – math physics students familiar challenge finding unambiguous right answer information retrieval finding right document much art science
Document 3 – many serial killers suffer psychosis appear normal search killers take years latest police technology results shocking

15 Unique words identified
Document 1 – search[1] technology[2] around[3] forty[4] year[5] time[6] user[7] base[8] expand[9] first[10] science[11] technology[2] information[12] professional[13] final[14] information[12] professional[13] pretty[15] much[16] everyone[17]
Document 2 – math[18] physics[19] student[20] familiar[21] challenge[22] find[23] unambiguous[24] right[25] answer[26] information[12] retrieval[27] find[23] right[25] document[28] much[16] art[29] science[11]
Document 3 – many[30] serial[31] killer[32] psychosis[33] appear[34] normal[35] search[1] killer[32] take[36] year[5] latest[37] police[38] technology[2] result[39] shock[40]

16 Search Dictionary
[1] search [2] technology [3] around [4] forty [5] year [6] time … [40] shock
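A dictionary like this can be built by numbering each distinct stemmed word in the order it is first seen. The sketch below assumes the stemmed token lists from slide 15 as input; the function name build_dictionary is made up for this example.

```python
def build_dictionary(token_lists):
    """Map each distinct term to a 1-based index in first-seen order."""
    dictionary = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault only assigns a new index when the term is unseen.
            dictionary.setdefault(token, len(dictionary) + 1)
    return dictionary


# Stemmed tokens of the three documents, copied from slide 15.
doc1 = ("search technology around forty year time user base expand first "
        "science technology information professional final information "
        "professional pretty much everyone").split()
doc2 = ("math physics student familiar challenge find unambiguous right answer "
        "information retrieval find right document much art science").split()
doc3 = ("many serial killer psychosis appear normal search killer take year "
        "latest police technology result shock").split()

dictionary = build_dictionary([doc1, doc2, doc3])
```

The result has 40 entries, with "search" at index 1 and "shock" at index 40, matching the numbering on the slides.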

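17 Representing documents as 40-dimensional vectors
Values are in the form index:count, where index is the word's position in the dictionary and count is the number of times the word occurs in the document.
Doc1(1:1, 2:2, 3:1, 4:1, 5:1, 6:1, 7:1, …, 13:2, 14:1, 15:1, …, 17:1, 18:0, 19:0, …, 40:0)
Doc2(1:0, 2:0, 3:0, …, 11:1, 12:1, …, 16:1, 17:0, 18:1, 19:1, 20:1, …, 29:1, 30:0, 31:0, …, 40:0)
Doc3(1:1, 2:1, 3:0, 4:0, 5:1, 6:0, 7:0, 8:0, …, 29:0, 30:1, 31:1, 32:2, 33:1, …, 40:1)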
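As a rough illustration, the Doc1 vector can be produced by counting how often each dictionary index occurs among the document's stemmed tokens. The helper names to_sparse_vector and to_dense_vector are assumptions made for this sketch.

```python
from collections import Counter

DIMENSIONS = 40  # size of the dictionary on slide 16

# Stemmed tokens of Document 1, copied from slide 15.
doc1_tokens = ("search technology around forty year time user base expand first "
               "science technology information professional final information "
               "professional pretty much everyone").split()

# Document 1 is indexed first, so numbering its distinct terms in first-seen
# order reproduces indices 1..17 from slide 15.
term_ids = {term: i for i, term in enumerate(dict.fromkeys(doc1_tokens), start=1)}


def to_sparse_vector(tokens, term_ids):
    """Return {index: count} for the tokens that appear in the dictionary."""
    counts = Counter(term_ids[t] for t in tokens if t in term_ids)
    return dict(sorted(counts.items()))


def to_dense_vector(sparse, dimensions):
    """Expand a sparse {index: count} mapping into a list with explicit zeros."""
    return [sparse.get(i, 0) for i in range(1, dimensions + 1)]


doc1_sparse = to_sparse_vector(doc1_tokens, term_ids)
doc1_dense = to_dense_vector(doc1_sparse, DIMENSIONS)
```

Storing only the non-zero entries (the sparse form) is a common design choice, since most of the 40 positions are zero for any single document.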

18 Handling the Query
Query: “the promise of search technologies”
– After stop-word removal and stemming: promise search technology
– “search” and “technology” are present in the dictionary, but “promise” is not, so it is dropped
– Hence the query becomes “search technology”, which is equivalent to the new vector (1:1, 2:1)
– Converting it to a 40-dimensional array gives (1:1, 2:1, 3:0, 4:0, …, 40:0)
– Finally, find the shortest distance (best match) between this query vector and the previously stored document vectors.
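The slide does not say which distance measure is used. The sketch below assumes cosine similarity, a common choice for the vector space model, where a higher similarity means a shorter distance. The document vectors are written in sparse index:count form with counts taken from the unique-word list on slide 15; the names cosine_similarity and rank are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {index: count} vectors."""
    dot = sum(weight * b.get(index, 0) for index, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (name, score) pairs sorted from best to worst match."""
    scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


documents = {
    "Doc1": {1: 1, 2: 2, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1,
             11: 1, 12: 2, 13: 2, 14: 1, 15: 1, 16: 1, 17: 1},
    "Doc2": {11: 1, 12: 1, 16: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 2,
             24: 1, 25: 2, 26: 1, 27: 1, 28: 1, 29: 1},
    "Doc3": {1: 1, 2: 1, 5: 1, 30: 1, 31: 1, 32: 2, 33: 1, 34: 1, 35: 1,
             36: 1, 37: 1, 38: 1, 39: 1, 40: 1},
}

query = {1: 1, 2: 1}  # "search technology"
ranking = rank(query, documents)
```

Under this assumption Doc1 ranks first, Doc3 second, and Doc2 last with a score of zero, because it contains neither query term.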

19 Enhancements
Weighting multiple occurrences
– (1:1000, 2:1000)
Weighting for phrases
– Search technology
– Police technology
– Information professional
– Information retrieval
Word clustering
– Search/retrieval/find
– Technology/science/math/physics
– First/final/latest
Custom biases
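One possible way to apply the phrase and clustering ideas is to rewrite the token stream before indexing: known two-word phrases become a single token that can carry its own weight, and clustered words are mapped to one representative. The sketch below makes several assumptions that the slide does not state: phrases are joined with an underscore, the first word of each cluster is used as its representative, and the names PHRASES, CLUSTERS, merge_phrases and apply_clusters are invented for this example.

```python
# Phrases and clusters listed on the slide, in stemmed lowercase form.
PHRASES = {
    ("search", "technology"),
    ("police", "technology"),
    ("information", "professional"),
    ("information", "retrieval"),
}
CLUSTERS = {
    "search": ["search", "retrieval", "find"],
    "technology": ["technology", "science", "math", "physics"],
    "first": ["first", "final", "latest"],
}
# Reverse lookup: cluster member -> representative term (an assumed choice).
REPRESENTATIVE = {member: head for head, members in CLUSTERS.items() for member in members}


def merge_phrases(tokens):
    """Replace each known adjacent word pair with a single phrase token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def apply_clusters(tokens):
    """Map clustered words to their representative; leave other tokens as-is."""
    return [REPRESENTATIVE.get(token, token) for token in tokens]
```

Phrases are merged first so that "information retrieval" survives as a phrase instead of being rewritten to "information search": apply_clusters(merge_phrases(["information", "retrieval", "find"])) gives ['information_retrieval', 'search'].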

20 Google Page ranking
PR(A) = (1 - d) + d (PR(T_1)/C(T_1) + … + PR(T_n)/C(T_n))
– A → the page in question
– T_1 … T_n → the documents that reference A
– PR → page rank
– C(T_i) → total number of links to outside resources on page T_i
– d → heuristic damping factor, usually set to 0.85
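The formula defines each page's rank in terms of the ranks of the pages linking to it, so in practice it is usually solved by repeated substitution. The following is a minimal sketch of that iteration on a made-up three-page graph; the function name pagerank, the starting value of 1.0, the fixed iteration count, and the example links are all assumptions for illustration, and pages without outgoing links are not given any special treatment.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of distinct pages it links to.
    """
    pages = set(links) | {target for targets in links.values() for target in targets}
    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this example C ends up with the highest rank because both other pages link to it, A comes next because it receives all of C's rank, and B is last because it only receives half of A's.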

21 Web Spiders
– Selection policy
– Re-visit policy

