Introduction to YouSeer


Introduction to YouSeer Partha Mukherjee pom5109@ist.psu.edu

Outline: Overview; YouSeer components; Heritrix; Solr; Demo

Overview and Requirements. YouSeer is a complete open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene. It is Java-based and has been run successfully on Windows. Requirements: 512 MB RAM, 6.5 GB of hard disk space, and Java 1.6 (Java 1.5 also works).

Search Engine: Basic Workflow Courtesy of Saurabh Kataria

Advantages of YouSeer. Built on top of scalable components: tested on 23M documents, while Solr and Heritrix can scale to billions. Very flexible and easy to extend: modifying the index and the ingestion module is easy, and the crawler supports complicated crawling policies.

YouSeer Components. Heritrix: the Internet Archive's crawler; reported to scale up to 1B documents; written in Java, with a web interface. Apache Solr: an open source enterprise search server based on Lucene; has a REST-like API; supports caching, distributed search, and index replication.

YouSeer Architecture [Diagram. Component labels: WWW, Heritrix, File System, Storage, Middleware, Apache Solr, Apache Tomcat, DB, Cache, Request]

Heritrix Workflow. 1) Choose a URI from among all those scheduled. 2) Fetch that URI. 3) Analyze or archive the results. 4) Select discovered URIs of interest, and add them to those scheduled. 5) Note that the URI is done, and repeat. Source: “An Introduction to Heritrix: An open source archival quality web crawler”, Gordon Mohr et al.
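The five steps above can be sketched as a simple loop. This is only an illustration of the workflow, not Heritrix's actual implementation: the function names (crawl, fetch, extract_links, in_scope) and the max_uris limit are hypothetical, and the scheduling policy is plain first-in, first-out.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, in_scope, max_uris=100):
    """Toy crawl loop following the five workflow steps (all names are hypothetical)."""
    scheduled = deque(seeds)   # URIs waiting to be fetched
    seen = set(seeds)          # every URI ever scheduled, to avoid duplicates
    done = []                  # URIs that have been fully processed

    while scheduled and len(done) < max_uris:
        uri = scheduled.popleft()              # 1) choose a scheduled URI
        content = fetch(uri)                   # 2) fetch that URI
        links = extract_links(content)         # 3) analyze the result
        for link in links:                     # 4) schedule discovered URIs of interest
            if in_scope(link) and link not in seen:
                seen.add(link)
                scheduled.append(link)
        done.append(uri)                       # 5) note that the URI is done, and repeat
    return done
```

Here fetch, extract_links and in_scope are passed in as callables so the loop can be tried against an in-memory "web" (a dictionary of page to links) without any network access.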

Heritrix Crawl Result. By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files. By default, the ARC files are written compressed, using gzip. Each record (which contains a document) is gzipped separately, and all gzipped records are concatenated to make up a file of multiple gzipped members.
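To illustrate what "a file of multiple gzipped members" means, here is a minimal sketch that splits such a byte string back into its members using Python's standard zlib module. It covers only the gzip layer; it does not parse ARC record headers, and the function name iter_gzip_members is hypothetical.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed payload of each gzip member in a concatenated stream."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        payload = decompressor.decompress(data) + decompressor.flush()
        yield payload
        # Whatever follows the end of this member belongs to the next one.
        data = decompressor.unused_data


# Build a small multi-member stream the same way the slide describes:
# compress each record on its own, then concatenate the results.
records = [b"first document", b"second document"]
blob = b"".join(gzip.compress(r) for r in records)
members = list(iter_gzip_members(blob))
```

Because every record is an independent gzip member, a reader can decompress one record at a time rather than the whole file at once.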

Apache Solr. A very popular search server built on Lucene. Easy to configure and optimize: all modifications are made in XML files, with no need to touch the code. The index has a schema, similar to a database schema: think of the index as a table in a database whose columns you have to define.

Solr Schema Example
<field name="url" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="creationDate" type="date" indexed="true" stored="true"/>
<field name="rating" type="sint" indexed="true" stored="true"/>
<field name="published" type="boolean" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="all" type="text" indexed="true" stored="true" multiValued="true"/>

Solr Documents. Solr accepts well-formed XML documents:
<add>
  <doc>
    <field name="URL">www.cnn.com</field>
    <field name="title">CNN Breaking News – Obama wins</field>
    <field name="content">Barack Obama is the 44th president of the USA</field>
    <field name="pubDate">2008-11-06T23:59:59.999Z</field>
  </doc>
</add>
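A message in this format can be generated programmatically rather than written by hand. The sketch below uses Python's standard xml.etree.ElementTree, which takes care of escaping characters such as & and <. The function name build_add_xml is hypothetical, and the field names are simply those from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


message = build_add_xml([
    {
        "URL": "www.cnn.com",
        "title": "News & Weather",
        "pubDate": "2008-11-06T23:59:59.999Z",
    }
])
```

The resulting string is what would be sent in the body of an HTTP POST to the Solr update URL.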

YouSeer Workflow. YouSeer waits for the crawled documents to be written, then iterates over the compressed files and processes the documents. For each document it extracts the textual content and parses the metadata, generating an XML file as output. Each custom extractor appends its result to this file. The XML file is then submitted to the index.
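The idea of several extractors each appending their result to one XML document can be sketched as follows. This is not YouSeer's code (YouSeer is written in Java, and its extractor interface is not shown in these slides); the extractor signature (url, html, doc) and all function names here are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TitleAndText(HTMLParser):
    """Collects the <title> text and all other visible text chunks."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def _parse(html):
    parser = _TitleAndText()
    parser.feed(html)
    parser.close()
    return parser


# Each hypothetical extractor appends one <field> to the shared <doc> element.
def url_extractor(url, html, doc):
    ET.SubElement(doc, "field", name="url").text = url


def title_extractor(url, html, doc):
    ET.SubElement(doc, "field", name="title").text = _parse(html).title.strip()


def content_extractor(url, html, doc):
    ET.SubElement(doc, "field", name="content").text = " ".join(_parse(html).chunks)


def build_document(url, html, extractors):
    """Run every extractor in turn; each one appends its result to the same <doc>."""
    doc = ET.Element("doc")
    for extractor in extractors:
        extractor(url, html, doc)
    return doc
```

Adding support for a new metadata field then means writing one more extractor function and adding it to the list, without touching the others.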

Demo: Configuration. The Solr schema is already configured in your installation, and Solr is installed on Tomcat. The Heritrix web interface listens on port 8080 by default, the same port as the Apache Tomcat server, so change it to another port number, e.g.
./heritrix -p 9000
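Before picking a port for the Heritrix web interface, it can help to check whether the port is already taken (for example by Tomcat on 8080). A minimal sketch using Python's standard socket module follows; the function name port_is_free is hypothetical, and the check only covers the local loopback address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Binding failed, so another process (e.g. Tomcat) holds the port.
            return False
        return True
```

For example, if port_is_free(8080) returns False while Tomcat is running, choose a different port such as 9000 for Heritrix.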

Demo. Download the virtual machine image from http://sourceforge.net/projects/youseer/files/VM/youseer.0.1/fedora-11-i386.zip/download and unzip fedora-11-i386.zip. The image is a Linux VMware image. To run the VM, download and install VMware Player from http://www.vmware.com/products/player/ and then double-click the VMware virtual machine configuration icon.

Demo

Demo. Log in to YouSeer with the password “heritrixsolr”. You are now in a virtual Linux environment running inside Windows. When leaving the VM environment, log out from youseer (“youseer -> quit”) and shut down the VM (“shutdown”). Press Ctrl + Alt to return to your local machine.

Demo

Demo. About to start Heritrix (the crawler)! In the VM, open a terminal and go to the apps directory (cd apps). There you will find the solr, tomcat, heritrix-1.14.3, etc. applications. Don’t forget to start the Solr server before running Heritrix: go to apache-solr…/example/, locate the jar file “start.jar”, and run it. Solr should keep running the whole time.

Demo

Demo

Demo. Now open another terminal, or another tab in the same terminal. Go to heritrix-1.14.3 under /home/apps. Run Heritrix with the following command line arguments:
./heritrix -p XXXX --admin=nameX:passwordX
Then open the browser in the VM and go to http://localhost:XXXX to reach the Heritrix UI (username = nameX, password = passwordX).

Demo: Heritrix Heritrix log in screen

Demo: Heritrix

Demo: Heritrix Enter the Seed URLs

Demo: Heritrix. Configure the first job. The most important parameter is the user agent, under configurations. Enter a valid URL and email address: enter http://www.psu.edu and your OWN email address. Do not run more than 5 threads, to avoid overloading the machine and crashing the system.

Demo: Heritrix Change the Agent URL

Demo: Features of Heritrix

Demo: More features

Demo : Heritrix

Demo. ARC files are written to: ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs. To start Tomcat, enter start-tomcat; Solr will start automatically. The YouSeer ingestion module (middleware) is located under ~/youseer/release. Add a folder entry to the Apache web server configuration file so that cached copies of documents can be retrieved from the ARC files. Use the URL of Solr to post the documents. Specify the number of working threads used to process the documents.
java -jar YouSeer.jar [IndexURL] [Path_ARCfiles] [Cached_virtual_Folder] [Number_of_Threads] [wait_Time]

Demo. To index documents crawled by Heritrix, navigate to ~/youseer/release and run:
java -jar YouSeer.jar http://localhost:8983/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0
The arguments are, in order: the Solr URL; the full path to the ARC files; the virtual directory which maps to the cached files; the number of threads (please keep it below 5); and the waiting time between retries.
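The argument order is easy to get wrong, so a small wrapper that assembles the command and enforces the thread limit may be useful. The sketch below only builds the argument list in the order given on the slide; the function name build_ingest_command and its checks are assumptions, not part of YouSeer.

```python
def build_ingest_command(index_url, arc_dir, cache_dir, threads=1, wait_time=0,
                         jar="YouSeer.jar"):
    """Assemble the YouSeer ingestion command line in the order shown on the slide."""
    if not 1 <= threads < 5:
        # The slides ask for fewer than 5 threads.
        raise ValueError("threads must be between 1 and 4")
    if wait_time < 0:
        raise ValueError("wait_time must not be negative")
    return [
        "java", "-jar", jar,
        index_url,        # Solr update URL
        arc_dir,          # full path to the ARC files
        cache_dir,        # virtual directory mapped to the cached files
        str(threads),     # number of working threads
        str(wait_time),   # waiting time between retries
    ]


command = build_ingest_command(
    "http://localhost:8983/solr/update",
    "/absolute/path/to/arc/files",
    "/cachingDirectory",
)
```

The resulting list can be passed to subprocess.run, with the working directory set to the folder that contains YouSeer.jar.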

Demo

Comments. YouSeer tracks which ARC files have been processed in a database; the default name is submitted.db. If you want to re-ingest the documents, update the submitted.db file. Map the virtual directory within the Tomcat configuration:
<Context path="/cached" docBase="/heritrix-1.14.3/jobs/JOB_NAME/arcs" crossContext="false" debug="0" reloadable="true"/>
The search interface: http://localhost:8080/youseerui

Shots

Test case (http://pike.psu.edu)

Test Case(:pike)

References. http://youseer.sourceforge.net/doc/Tutorial.pdf http://crawler.archive.org/articles/user_manual/ http://lucene.apache.org/solr/tutorial.html Want to download separately? https://sourceforge.net/projects/youseer/ https://sourceforge.net/projects/archive-crawler/files/archive-crawler%20(heritrix%201.x)/ http://www.apache.org/dyn/closer.cgi/lucene/solr

THANK YOU