Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majectic12.co.uk) at Birmingham Perl Mongers.


Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky at Birmingham Perl Mongers User Group. V1.0, 27/07/05

Contents
1. History
2. Goals
3. Architecture
4. Implementation
5. Why not Perl?
6. Conclusions
7. Credits
8. Recommended reading

History (of my work in the area of information retrieval)
1. First primitive, pathetic, stone-age search engine: 1,000 documents in the “index” (1997, Perl)
2. Second engine, using proper inverted indexing, for Jungle.com: 500,000 products indexed (Perl + Java, 2002)
3. Current: 50,000,000 pages indexed with a lot more to go (to be revealed, 2005)

Goals
1. Build a distributed WWW search engine capable of dealing with at least 1 bln web pages based on principles of and
2. See to it that the language chosen for the implementation (more on this later) is fit for purpose or, more likely, learn how to make it work
3. Eventually make some money out of it

Architecture
1. Data collection (crawling)
2. Indexing: turning text into numbers
3. Merging: turning indexed barrels into a single searchable index
4. Searching: locating documents for given keywords

Data collection (crawling)
Base: issues URLs to crawl and receives compressed pages.
Distributed crawlers: receive lists of URLs to crawl, crawl them and send back compressed data. In the future they will also do distributed indexing.
Note: this stage is optional if you already have data to index, e.g. a list of products with their descriptions.
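
The base/crawler split described on this slide can be sketched roughly as follows. This is a minimal single-process illustration, not the project's actual protocol: the names `Base`, `crawl_batch` and `fake_fetch` are hypothetical, zlib is assumed as the compression scheme, and the network fetch is replaced by a stub so the example runs offline.

```python
import zlib
from collections import deque


def crawl_batch(urls, fetch):
    """Crawler side: fetch every URL in the batch and compress each page body."""
    return {url: zlib.compress(fetch(url)) for url in urls}


class Base:
    """Central node (hypothetical): issues URL batches, receives compressed pages."""

    def __init__(self, seed_urls, batch_size=2):
        self.queue = deque(seed_urls)
        self.batch_size = batch_size
        self.pages = {}  # url -> decompressed page bytes

    def next_batch(self):
        """Hand out up to batch_size URLs that have not been issued yet."""
        batch = []
        while self.queue and len(batch) < self.batch_size:
            batch.append(self.queue.popleft())
        return batch

    def receive(self, compressed_pages):
        """Store the pages a crawler sent back, decompressing them on arrival."""
        for url, blob in compressed_pages.items():
            self.pages[url] = zlib.decompress(blob)


def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption: pages arrive as bytes)."""
    return ("<html>page at %s</html>" % url).encode("utf-8")


base = Base([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])
while True:
    batch = base.next_batch()
    if not batch:
        break
    base.receive(crawl_batch(batch, fake_fetch))
```

In a real deployment each call to `crawl_batch` would run on a volunteer's machine, and only the compressed blobs would travel back to the base.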

Crawler screenshot 1

Crawler screenshot 2

Crawler screenshot 3

Crawler screenshot 4

Crawler screenshot 5

Current Stats
Source: as of 27/07/05 http://

Indexing
Indexing is the process of turning words into numbers and creating an inverted index.
Data barrel:
  Doc #0: Birmingham Perl Mongers
  Doc #1: Birmingham City
  Doc #2: Perl City
Lexicon (maps words to their numeric WordIDs):
  Birmingham – 0
  Perl – 1
  Mongers – 2
  City – 3
Inverted index (each WordID has a list of DocIDs, ideally sorted):
  0 -> 0, 1
  1 -> 0, 2
  2 -> 0
  3 -> 1, 2
Note: if you use a database, it makes sense to have a clustered index on WordID.
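
As a rough illustration of this slide, the lexicon and inverted index for the three example documents can be built in a few lines. This is a sketch under simple assumptions (whitespace tokenisation, case-sensitive words, everything held in memory); the function name `build_index` is hypothetical and not taken from the project.

```python
def build_index(docs):
    """Build a lexicon (word -> WordID) and an inverted index (WordID -> DocIDs)."""
    lexicon = {}
    inverted = {}
    for doc_id, text in enumerate(docs):
        for word in text.split():
            # Assign the next free WordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            postings = inverted.setdefault(word_id, [])
            # DocIDs arrive in increasing order, so each list stays sorted;
            # skip the append if this document is already recorded.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return lexicon, inverted


docs = ["Birmingham Perl Mongers", "Birmingham City", "Perl City"]
lexicon, inverted = build_index(docs)
for word_id in sorted(inverted):
    print(word_id, "->", inverted[word_id])
```

Processing documents in DocID order is what keeps every posting list sorted without an explicit sort step.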

Merging
Individual indexed barrels -> single searchable index
Note: this stage is not necessary if just one barrel is used, as there is then no need to remap IDs from their local values to their global equivalents.
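
The ID remapping mentioned in the note can be sketched like this. The barrel layout (a local lexicon, a local inverted index and a document count) and the name `merge_barrels` are assumptions made for illustration; the slides do not describe the real on-disk format.

```python
def merge_barrels(barrels):
    """Merge barrels that use local WordIDs/DocIDs into one global index.

    Each barrel is a tuple (lexicon, inverted, doc_count), where lexicon maps
    word -> local WordID and inverted maps local WordID -> sorted local DocIDs.
    """
    global_lexicon = {}
    global_index = {}
    doc_offset = 0  # global DocID of the current barrel's local document 0
    for lexicon, inverted, doc_count in barrels:
        for word, local_word_id in lexicon.items():
            # Remap the local WordID to a global one shared by all barrels.
            global_word_id = global_lexicon.setdefault(word, len(global_lexicon))
            postings = global_index.setdefault(global_word_id, [])
            # Remap local DocIDs by shifting them past earlier barrels.
            postings.extend(doc_offset + d for d in inverted.get(local_word_id, []))
        doc_offset += doc_count
    return global_lexicon, global_index


# Barrel A holds "Birmingham Perl Mongers"; barrel B holds
# "Birmingham City" and "Perl City", each with its own local IDs.
barrel_a = ({"Birmingham": 0, "Perl": 1, "Mongers": 2}, {0: [0], 1: [0], 2: [0]}, 1)
barrel_b = ({"Birmingham": 0, "City": 1, "Perl": 2}, {0: [0], 1: [0, 1], 2: [1]}, 2)
merged_lexicon, merged_index = merge_barrels([barrel_a, barrel_b])
```

Because barrels are processed in order and each one's DocIDs are shifted past all earlier ones, the merged posting lists remain sorted.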

Searching
Searching is the process of finding documents that contain the words from a search query.
  Doc #0: Birmingham Perl Mongers
  Doc #1: Birmingham City
  Doc #2: Perl City
Lexicon (maps words to their numeric WordIDs):
  Birmingham – 0
  Perl – 1
  Mongers – 2
  City – 3
Inverted index (lists DocIDs for each WordID):
  0 -> 0, 1
  1 -> 0, 2
  2 -> 0
  3 -> 1, 2
Note: if you use a database, it makes sense to cluster on WordID.
Search query: “Birmingham Perl”
WordIDs: 0, 1
Intersection of DocIDs present in both lists (implementation of boolean AND logic):
  0 (Brum) | 1 (Perl) | Result
  0        | 0        | Matched!
  1        | n/a      | Not matched!
  n/a      | 2        | Not matched!
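
The boolean AND step can be sketched as a linear merge over two sorted DocID lists. The lexicon and index below are copied from the slide's example; the function names `intersect` and `search` are hypothetical, and real engines add ranking on top of this.

```python
def intersect(a, b):
    """Return the DocIDs present in both sorted lists (boolean AND)."""
    i = j = 0
    matched = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matched.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return matched


def search(query, lexicon, inverted):
    """Find documents containing every word of the query."""
    result = None
    for word in query.split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # an unknown word means no document can match
        postings = inverted.get(word_id, [])
        result = postings if result is None else intersect(result, postings)
    return result or []


lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
inverted = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
print(search("Birmingham Perl", lexicon, inverted))
```

Keeping the posting lists sorted is what lets the intersection run in a single pass over both lists.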

Search engine screenshot 1

Search engine screenshot 2

Implementation
1. Microsoft .NET C# ported to Linux using Mono ( project.com)
2. ~90k lines of code (minimal copy/paste) written from scratch
3. Low level of dependencies (SharpZipLib/SQLite/NPlot)

Why not Perl? (using C# instead)
1. Not strong in the GUI department
2. Hard to deal with multi-threading and asynchronous sockets
3. OOP is more of a hack
4. Lax compile-time checks due to not being strictly typed
5. Fear of performance bottlenecks forcing a move to C++
6. Hard to profile for performance analysis
7. Managed memory lacks support for pointers (?)
8. Poor exception handling
9. I wanted something new :)

Conclusions
Still a work in progress, but some conclusions can be made already:
1. The inverted indexing approach helps to achieve fast searches
2. It's tough to build one: don’t try if you ain’t going to see it through!
3. The crawler is one tough piece of code: 6 months vs 2 months on searching
4. .NET C# is a decent language suitable for heavy-duty tasks like this

Credits
1. R&D: Alex Chudnovsky
2. Pioneers*: FiddleAbout, dazza12, lazytom, Mordac, linuxbren, Cyber911, Vari, ASB, SEOBy.org, arni, japonicus, webstek.info | Pimpel, DimPrawn, Zyron, partys-bei-uns.de, jake, bull at webmasterworld, nada, dodgy4, sri-heinz, www.vanginkel.info
* Volunteers who ran the crawler and had crawled at least 1 mln URLs as of 27/07/05

Recommended reading
1. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page of Google ( db.stanford.edu/~backrub/google.html)
2. “Managing Gigabytes”, Ian H. Witten et al., ISBN

Join! Join the project (unmetered broadband required!): majestic12.co.uk Your name could be here!