Search Technologies

Examples
– Fast
– Google Enterprise
  – Google search solutions for business
  – PageRank
– Lucene
  – Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java
– Solr
  – Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project

Search Engine Ranking Criteria

Yahoo!
– has been in the search game for many years
– is better than MSN, but nowhere near as good as Google, at determining whether a link is a natural citation
– has a ton of internal content and a paid inclusion program, both of which give it an incentive to bias search results toward commercial results
– things like cheesy off-topic reciprocal links still work great in Yahoo!

MSN (Bing)
– is new to the search game
– is bad at determining whether a link is natural or artificial in nature
– because its link analysis is weak, it places too much weight on page content
– its poor relevancy algorithms cause a heavy bias toward commercial results
– likes bursty recent links
– new sites that are generally untrusted in other systems can rank quickly in MSN Search
– things like cheesy off-topic reciprocal links still work great in MSN Search

Google
– has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph
– is much better than the other engines at determining whether a link is a true editorial citation or an artificial link
– looks for natural link growth over time
– heavily biases search results toward informational resources
– trusts old sites way too much
– a page on a site, or on a subdomain of a site, with significant age or link-related trust can rank much better than it should, even with no external citations
– has aggressive duplicate content filters that filter out many pages with similar content
– if a page is obviously focused on a term, it may filter the document out for that term; on-page variation and link anchor text variation are important
– a page with a single reference or a few references to a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier
– crawl depth is determined not only by link quantity but also by link quality; excessive low-quality links may make your site less likely to be crawled deeply or even included in the index
– things like cheesy off-topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost

Ask
– looks at topical communities
– due to its heavy emphasis on topical communities, it is slow to rank sites until they are heavily cited from within their topical community
– due to its limited market share, it is probably not worth paying much attention to unless you are in a vertical where it has a strong brand that drives significant search traffic

History
– SMART (System for the Mechanical Analysis and Retrieval of Text, informally "Salton's Magic Automatic Retriever of Text")
– Vector Space Model
– Relevance feedback algorithm (customization)
– Latent Semantic Indexing (LSI)

Basic Vector Space Algo
Vanilla Search Algo
– Keyword search: ignore search modifiers and stop words (e.g. not, and, this, their, is, or, of)
– Remove punctuation marks
– Reduce words to their root form (stemming)
  – Strips a combination of suffixes and prefixes
  – E.g.: students -> student, swam -> swim
  – lemmatization -> stochastic algorithm
  – science, scientist??
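The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the suffix rules and the irregular-form table are made up for the example (the slides do not give theirs), and a real system would use a proper stemmer or lemmatizer.

```python
import string

# Assumed, illustrative stop-word list; the slides do not specify one.
STOP_WORDS = {"the", "of", "and", "or", "not", "is", "this", "their",
              "a", "to", "for", "have", "been", "over", "from"}

# Assumed lookup for irregular forms that suffix stripping cannot handle.
IRREGULAR = {"swam": "swim"}


def naive_stem(word):
    """Very crude stemmer: irregular lookup, then two plural rules."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # technologies -> technology
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # students -> student
    return word


def preprocess(text):
    """Lowercase, strip ASCII punctuation, drop stop words, stem the rest."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(w) for w in cleaned.split() if w not in STOP_WORDS]


print(preprocess("Search technologies have been around for over forty years."))
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as those in Document 2 would need extra handling.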

Documents to be indexed
Document 1 – Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.

Document 2 – Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.

Document 3 – Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.

Stop words for removal
Document 1 – Search technologies have been around for over forty years. Over this time, their user base expanded first from scientists and technologists to information professionals, and finally from information professionals to pretty much everyone.
Document 2 – Math and Physics students are familiar with the challenge of finding the unambiguous “right answer”. The same is not true for information retrieval. Finding the “right document” may be as much art as science.
Document 3 – Many serial killers do not suffer from psychosis and appear to be quite normal. Search for such killers can take years, even with the latest police technologies, and the results are often shocking.

Stemming Changes Identified
Document 1 – search technology around forty years time user base expanded first science technology information professionals finally information professionals pretty much everyone
Document 2 – math physics students familiar challenge finding unambiguous right answer information retrieval finding right document much art science
Document 3 – many serial killers suffer psychosis appear normal search killers take years latest police technology results shocking

Unique words identified
Document 1 – search[1] technology[2] around[3] forty[4] year[5] time[6] user[7] base[8] expand[9] first[10] science[11] technology[2] information[12] professional[13] final[14] information[12] professional[13] pretty[15] much[16] everyone[17]
Document 2 – math[18] physics[19] student[20] familiar[21] challenge[22] find[23] unambiguous[24] right[25] answer[26] information[12] retrieval[27] find[23] right[25] document[28] much[16] art[29] science[11]
Document 3 – many[30] serial[31] killer[32] psychosis[33] appear[34] normal[35] search[1] killer[32] take[36] year[5] latest[37] police[38] technology[2] result[39] shock[40]

Search Dictionary
[1] search [2] technology [3] around [4] forty [5] year [6] time … [40] shock

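Representing documents as 40-dimensional vectors
Each entry has the form index:count, where index is the word's position in the dictionary and count is the number of times the word occurs in the document.
Doc1(1:1, 2:2, 3:1, 4:1, 5:1, 6:1, 7:1, …, 13:2, 14:1, 15:1, …, 17:1, 18:0, 19:0, …, 40:0)
Doc2(1:0, 2:0, 3:0, …, 11:1, 12:1, …, 16:1, 17:0, 18:1, 19:1, 20:1, …, 29:1, 30:0, 31:0, …, 40:0)
Doc3(1:1, 2:1, 3:0, 4:0, 5:1, 6:0, 7:0, 8:0, …, 29:0, 30:1, 31:1, 32:2, 33:1, …, 40:1)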
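As a rough sketch of how the dictionary and the count vectors above could be built, the Python below numbers each new word in order of first appearance (1-based, like the slides) and counts occurrences per document. The function names and the three shortened token lists are assumptions for illustration, not the full 40-word example.

```python
def build_dictionary(token_lists):
    """Assign each distinct word a 1-based index in order of first appearance."""
    dictionary = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in dictionary:
                dictionary[token] = len(dictionary) + 1
    return dictionary


def to_vector(tokens, dictionary):
    """Count how often each dictionary word occurs; unknown words are ignored."""
    vector = [0] * len(dictionary)
    for token in tokens:
        index = dictionary.get(token)
        if index is not None:
            vector[index - 1] += 1
    return vector


# Shortened, already-stemmed stand-ins for the three documents (assumed data).
docs = [
    ["search", "technology", "year", "technology"],
    ["information", "retrieval", "science"],
    ["search", "killer", "killer", "year", "technology"],
]
dictionary = build_dictionary(docs)
vectors = [to_vector(tokens, dictionary) for tokens in docs]
print(dictionary)
print(vectors)
```

Storing only the non-zero index:count pairs, as the slide notation suggests, would be the natural choice once the dictionary grows large.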

Handling the Query
Query: “the promise of search technologies”
– Stemmed: the promise of search technology
– "search" and "technology" are present in the dictionary, but "promise" is not, so it is ignored
– Hence the query becomes search technology, which is equivalent to the new vector (1:1, 2:1)
– Converted to a 40-dimensional array: (1:1, 2:1, 3:0, 4:0, …, 40:0)
– Finally, find the shortest distance (best match) between the query vector and the previously stored document vectors
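The matching step can be sketched as follows. The slide speaks of the "shortest distance", which is taken here to mean Euclidean distance; cosine similarity is included as a commonly used alternative, not as something the slides prescribe. The five-word dictionary and the vectors are assumed toy data.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Assumed toy dictionary: [search, technology, year, information, killer]
doc_vectors = {
    "doc1": [1, 2, 1, 2, 0],
    "doc2": [0, 0, 0, 1, 0],
    "doc3": [1, 1, 1, 0, 2],
}
query = [1, 1, 0, 0, 0]  # "search technology"

by_distance = sorted(doc_vectors, key=lambda d: euclidean(query, doc_vectors[d]))
by_cosine = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]),
                   reverse=True)
print(by_distance)
print(by_cosine)
```

On this toy data the two measures disagree: Euclidean distance on raw counts puts the short doc2 first even though it shares no query word, while cosine similarity ranks doc1 first. This is one reason length-normalized measures are often preferred.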

Enhancements
– Weighting multiple occurrences
  – (1:1000, 2:1000)
– Weighting for phrases
  – Search technology
  – Police technology
  – Information professional
  – Information retrieval
– Word clustering
  – Search/retrieval/find
  – Technology/science/math/physics
  – First/final/latest
– Custom biases

Google Page ranking
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
– A: the page in question
– T_1 … T_n: the documents that reference A
– PR: page rank
– C(T_i): total number of links to outside resources on page T_i
– d: heuristic damping factor, usually set to 0.85
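Because each page's rank depends on the ranks of the pages linking to it, the formula is usually evaluated iteratively. The Python below is a minimal sketch of that idea using the formula exactly as written above; the function name, the fixed iteration count and the three-page link graph are assumptions for illustration, and pages without outgoing links are not given any special treatment.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages.

    links maps each page to the set of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages |= set(targets)

    ranks = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum PR(T) / C(T) over every page T that links to this page.
            incoming = sum(ranks[source] / len(targets)
                           for source, targets in links.items()
                           if page in targets)
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this small graph C ends up with the highest rank because it is referenced by both other pages, and A comes second because its single inbound link comes from the highly ranked C.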

Web Spiders
– Selection policy
– Re-visit policy
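To make the two policies concrete, here is a small sketch of a breadth-first spider run against an in-memory link graph instead of the real web. Everything in it is assumed for illustration: `should_select` stands in for a selection policy (which links are worth fetching) and `due_for_revisit` for a re-visit policy (when a page should be fetched again).

```python
from collections import deque


def crawl(seed, fetch_links, should_select, max_pages=100):
    """Breadth-first crawl; should_select is the selection policy."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and should_select(link):
                seen.add(link)
                frontier.append(link)
    return visited


def due_for_revisit(last_visit, now, min_interval):
    """Simple re-visit policy: fetch again once min_interval has elapsed."""
    return now - last_visit >= min_interval


# Hypothetical toy web: page name -> links found on that page.
toy_web = {"a": ["b", "c", "x.pdf"], "b": ["c", "d"], "c": ["a"], "d": []}

order = crawl("a",
              fetch_links=lambda url: toy_web.get(url, []),
              should_select=lambda url: not url.endswith(".pdf"))
print(order)
```

A production spider would also need politeness delays, robots.txt handling and URL normalization, none of which are shown here.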