CrawlBuddy The web’s best friend.

Slides:



Advertisements
Similar presentations
MICHAEL MARINO CSC 101 Whats New in Office Office Live Workspace 3 new things about Office Live Workspace are: Anywhere Access Store Microsoft.
Advertisements

Distributed Processing, Client/Server and Clusters
Parasol Architecture A mild case of scary asynchronous system stuff.
Edoclite and Managing Client Engagements What is Edoclite? How is it used at IU? Development Process?
Web Categorization Crawler – Part I Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Final Presentation Sep Web Categorization.
Task Scheduling and Distribution System Saeed Mahameed, Hani Ayoub Electrical Engineering Department, Technion – Israel Institute of Technology
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of.
Computer Science Lecture 6, page 1 CS677: Distributed OS Processes and Threads Processes and their scheduling Multiprocessor scheduling Threads Distributed.
490dp Synchronous vs. Asynchronous Invocation Robert Grimm.
A. Frank - P. Weisberg Operating Systems Introduction to Tasks/Threads.
Proxy Cache Leonid Romanovsky Olga Fomenko Winter 2003 Instructor: Konstantin Sinyuk.
Bar|Scan ® Asset Inventory System The leader in asset and inventory management.
CS364 CH08 Operating System Support TECH Computer Science Operating System Overview Scheduling Memory Management Pentium II and PowerPC Memory Management.
Event-Driven Architecture Team 4 – Idris Callins, Jestin Keaton, Bill Pegg, Steven Ng.
Christopher Jeffers August 2012
1 I-Logix Professional Services Specialist Rhapsody IDF (Interrupt Driven Framework) CPU External Code RTOS OXF Framework Rhapsody Generated.
1 Lecture 4: Threads Operating System Fall Contents Overview: Processes & Threads Benefits of Threads Thread State and Operations User Thread.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
London April 2005 London April 2005 Creating Eyeblaster Ads The Rich Media Platform The Rich Media Platform Eyeblaster.
Murach’s ASP.NET 4.0/VB, C1© 2006, Mike Murach & Associates, Inc.Slide 1.
المحاضرة الاولى Operating Systems. The general objectives of this decision explain the concepts and the importance of operating systems and development.
Putting it all together Dynamic Data Base Access Norman White Stern School of Business.
Edukey Education Ltd
Upcoming Presentations ILM Professional Service – Proprietary and Confidential ( DateTimeTopicPresenter March PM Distributed.
Lecture 5: Threads process as a unit of scheduling and a unit of resource allocation processes vs. threads what to program with threads why use threads.
8 Chapter Eight Server-side Scripts. 8 Chapter Objectives Create dynamic Web pages that retrieve and display database data using Active Server Pages Process.
Our MP3 Search Engine Crawler –Searching for Artist Name –Searching for Song Title Website Difficulties Looking Back.
SPI NIGHTLIES Alex Hodgkins. SPI nightlies  Build and test various software projects each night  Provide a nightlies summary page that displays all.
1 Why Threads are a Bad Idea (for most purposes) based on a presentation by John Ousterhout Sun Microsystems Laboratories Threads!
Navigation Framework using CF Architecture for a Client-Server Application using the open standards of the Web presented by Kedar Desai Differential Technologies,
Procedural programming Procedural programming is where you specify the steps required. You do this by making the program in steps. Procedural programming.
Page 1 2P13 Week 1. Page 2 Page 3 Page 4 Page 5.
Jython Environment For Students (JES) Final Presentation Team 3 David Raines Claire Bailey Jason Ergle Josh Sklare July 16,
Multithreading vs. Event Driven in Code Development of High Performance Servers.
Pitfalls: Time Dependent Behaviors CS433 Spring 2001 Laxmikant Kale.
Introduction to Operating Systems Concepts
Test Lab Management and
Applied Operating System Concepts
Advanced Operating Systems CIS 720
Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.
Process Management Process Concept Why only the global variables?
Processes and Threads Processes and their scheduling
Behavioral Design Patterns
MICROSOFT OUTLOOK and Outlook service Provider
CS399 New Beginnings Jonathan Walpole.
William Stallings Computer Organization and Architecture
Intro to Processes CSSE 332 Operating Systems
Lecture 7 Processes and Threads.
Threads and Cooperation
Chapter 1: Introduction
Introduction to Operating Systems
Concepts From Alice Switching to Java Copyright © Curt Hill.
Lecture 1: Multi-tier Architecture Overview
Jython Environment For Students (JES) Final Presentation
Half-Sync/Half-Async (HSHA) and Leader/Followers (LF) Patterns
Implementing an OpenFlow Switch on the NetFPGA platform
Background and Motivation
Multiprocessor and Real-Time Scheduling
Chapter 2: Operating-System Structures
Outline Chapter 2 (cont) OS Design OS structure
Chapter 10 Multiprocessor and Real-Time Scheduling
Why Threads Are A Bad Idea (for most purposes)
LO2 – Understand Computer Software
CS703 - Advanced Operating Systems
CS510 Operating System Foundations
System calls….. C-program->POSIX call
Why Threads Are A Bad Idea (for most purposes)
Chapter 2: Operating-System Structures
Why Threads Are A Bad Idea (for most purposes)
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

CrawlBuddy The web’s best friend

Design Decisions Two paradigms of design to choose from: multi-threaded or event-driven

Multi-Threaded Programming Advantages Programming is easier because threads are linear and we (usually) think linearly Threads can take advantage of multiprocessors easily Threads are synchronous i.e. it is okay for a thread to block because there are many of them running at once Debugging a threaded program is considerably easier than an event based program Disadvantages Threads are limited by the underlying operating system (operating systems can only efficiently handle so many threads)

Event-Driven Programming Advantages Handles well under heavy load, the queues act as a buffer to soften the load Simple to add new functionality and process in parallel Easy to split up and run on multiple machines Modular Disadvantages Not as intuitive as Thread programming Harder to debug system level errors (but easier to debug individual pieces)

What CrawlBuddy Does We took from the best of both worlds Event-driven multi-threaded design

Functional Units Are Our Friends Each Functional Unit has a … Queue – holds events to be processed Thread Pool – takes events off the queue and processes them Event Dispatcher – sends events to other Functional Units

Design of a Functional Unit Arrows represent flow of a task

CrawlBuddy Design Basically, events are passed between Functional Units The arrows (on the next slide) represent event flow

CrawlBuddy Design Flow

Wrapper Design Our wrapper crawler targets specific sites and uses site-specific format to find mp3s and record information about them (song name, artist name, etc) The wrapper Functional Units can be run in parallel and the each use the same database The Document Downloader passes each event to each of the wrappers. If the event does not apply to the wrapper (i.e. the document comes from a different site), the wrapper will simply drop the event

Wrapper Design Flow

Design Advantages Code re-use (Functional Units shared across CrawlBuddy and the wrapper) Expandable Checkpointing is simple (save the queues) Easy to run on multiple machines Queues buffer the load on threads Functional Units Replicable (see next slide)

Meta Queue How to replicate Functional Units

CrawlBuddy Features GUI

CrawlBuddy Features (cont) Checkpointing

CrawlBuddy Features (cont) Dynamic control of Functional Unit priority

CrawlBuddy Features (cont) Real-time stats Total downloads Total mp3 Downloads / sec Etc.

CrawlBuddy Features (cont) Thread status monitor

Mp3Monkey Search for all ‘e’ artists

Mp3Monkey Features Self-maintaining database – if a user attempts to download non-existent mp3, that url is marked for deletion Statistics are kept of how many searches and what has been downloaded