Building a Vertical Search Site (using lots of Apache software, of course)

Just the Facts, Ma’am
Ken Krugler - CTO/co-founder of Krugle
We use lots of Apache S/W at krugle.org
  – Httpd, Lucene, Nutch, Solr, Xerces, etc.
I’ll describe our architecture
And the sometimes painful lessons learned

Three Faces of Krugle
Free public site
Partner sites
Enterprise appliance

Krugle.org free public site
Search code, projects, & technical web pages
150,000 projects
2.5 billion lines of code
40 million web pages

Krugle.org Architecture (web)
Web tier runs Apache
Also mod_perl
  – “glue” for JavaScript to backend RESTful API
  – Partner APIs
“Dirty” side of system

Krugle.org Architecture (API)
API server uses Resin
Webapps provide RESTful API services
Filer is big disk array
  – LightTPD, NFS
Searchers run Hadoop, Lucene

Krugle.org Architecture (CPI)
Page crawl uses Nutch
Code crawl uses bits of Nutch, custom stuff
Fuzzy parsers created using ANTLR
Project data in MySQL, pushed to Solr
Code index is Lucene
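The "project data in MySQL, pushed to Solr" step can be pictured with a small sketch. Solr accepts XML update messages of the form `<add><doc><field name="...">`; the sketch below only builds such a message from rows that are assumed to have already been read from the database. The column names (`id`, `name`, `description`, `license`) and the sample rows are hypothetical, and the talk does not say how Krugle actually did the push.

```python
import xml.etree.ElementTree as ET


def rows_to_solr_add(rows):
    """Serialize row dicts as a Solr XML <add> update message."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if value is None:
                continue  # skip NULL columns rather than emit empty fields
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="unicode")


# Hypothetical rows, standing in for the result of a MySQL query.
rows = [
    {"id": 1, "name": "httpd", "description": "Web server", "license": None},
    {"id": 2, "name": "lucene", "description": "Search <library>", "license": "Apache-2.0"},
]
payload = rows_to_solr_add(rows)
```

The resulting string would then be sent to the Solr update handler with an HTTP POST; that step is left out so the sketch runs without a server.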

Krugle partner sites
IBM developerWorks
Sourceforge.net
Amazon Web Services
Yahoo! Dev Network
Collabnet

Krugle Architecture (partners)
Higher-level API
Wraps RESTful API
Handled in web tier
Big chunks of Perl
LightTPD cache
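As a rough illustration of a higher-level API that wraps a lower-level one behind a cache, here is a minimal sketch. It is not Krugle's code: the talk describes Perl in the web tier with LightTPD doing the caching, whereas this uses Python and a simple in-process, time-based cache as a stand-in. The class name, the `fetch` callable, and the partner names are all hypothetical.

```python
import time


class CachedPartnerAPI:
    """Wrap a lower-level search call and cache its responses for a short time."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable(query) -> response text
        self._ttl = ttl_seconds  # how long a cached response stays valid
        self._clock = clock
        self._cache = {}         # (partner, query) -> (timestamp, response)

    def search(self, partner, query):
        key = (partner, query)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # still fresh: skip the backend call
        result = self._fetch(query)
        self._cache[key] = (now, result)
        return result


# Hypothetical stand-in for the RESTful backend; it records each call it receives.
calls = []


def fake_fetch(query):
    calls.append(query)
    return f"<results q='{query}'/>"


api = CachedPartnerAPI(fake_fetch, ttl_seconds=60.0)
first = api.search("partner-a", "ant")
second = api.search("partner-a", "ant")  # served from the cache
```

Keying the cache on the partner as well as the query keeps one partner's responses from being served to another, which matters if partners get differently shaped results.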

Krugle enterprise server
Krugle inside firewall
Talks to major SCMs
SCM comment search
Includes public site info

Krugle Architecture (enterprise)
Collapses web tier, API server, code searchers, filer, and DB server
Separate admin system (DB, GUI, code crawler, configuration) as Jetty-hosted webapp

RESTful API
HTTP requests, XML responses
Works well with Perl middleware
Some load/memory issues
Solr integration challenges
Integration test challenges
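The "HTTP requests, XML responses" style can be sketched from the client side. The example below builds a GET URL and parses a sample XML reply; it does not contact any server. The base URL, the parameter names (`q`, `start`, `rows`), and the response schema are invented for illustration, since the talk does not show the actual API.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; not the real Krugle API address.
API_BASE = "http://api.example.invalid/search"


def build_search_url(query, start=0, rows=10):
    """Build the GET URL for one page of search results."""
    return API_BASE + "?" + urlencode({"q": query, "start": start, "rows": rows})


# Hypothetical response body, standing in for what the server would return.
SAMPLE_RESPONSE = """<results total="2">
  <hit project="httpd" path="server/main.c" line="42"/>
  <hit project="lucene" path="src/IndexWriter.java" line="7"/>
</results>"""


def parse_hits(xml_text):
    """Turn the XML response into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "project": hit.get("project"),
            "path": hit.get("path"),
            "line": int(hit.get("line")),
        }
        for hit in root.findall("hit")
    ]


url = build_search_url("hash map")
hits = parse_hits(SAMPLE_RESPONSE)
```

Because requests are plain URLs and responses are plain XML, middleware in any language (Perl in the talk) can sit between the browser and the API server.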

Key Lessons
If it isn’t broke, don’t upgrade
  – There’s always a newer version
  – That includes the build system
Be prepared to pay for free software
  – Motivating project contributors to do things
Moderation in architectural abstraction
  – There’s always a higher and lower option