Introduction to YouSeer
Partha Mukherjee
pom5109@ist.psu.edu
Outline
- Overview
- YouSeer components
- Heritrix
- Solr
- Demo
Overview
- YouSeer is a complete and powerful open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene
- It is Java-based and runs successfully on Windows
Requirements
- 512 MB RAM and 6.5 GB of hard disk space
- Java 1.6 (Java 1.5 also works)
Search Engine: Basic Workflow
[Workflow diagram, courtesy of Saurabh Kataria]
Advantages of YouSeer
- Built on top of scalable components
  - Tested on 23M documents, while Solr and Heritrix can scale to billions
- Very flexible and easy to extend
  - Modifying the index and the ingestion module is easy
  - The crawler supports complicated crawling policies
YouSeer Components
Heritrix:
- The Internet Archive's crawler
- Reported to scale up to 1B documents
- Written in Java, and has a web interface
Apache Solr:
- Open source enterprise search server based on Lucene
- Has a REST-like API
- Supports caching, distributed search, and index replication
YouSeer Architecture
[Architecture diagram. Labeled components: WWW, Storage, Apache Tomcat, DB, Cache, Request, Heritrix, File System, Middleware, Apache Solr]
Heritrix Workflow
1) Choose a URI from among all those scheduled
2) Fetch that URI
3) Analyze or archive the results
4) Select discovered URIs of interest, and add them to those scheduled
5) Note that the URI is done, and repeat
Source: Gordon Mohr et al., "An Introduction to Heritrix: An open source archival quality web crawler"
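The five steps above can be sketched as a small loop. This is an illustrative toy in Python, not Heritrix code: the names `crawl`, `toy_fetch`, `same_site`, and the `TOY_WEB` link table are all hypothetical, and "fetching" is simulated with an in-memory dictionary instead of real HTTP requests.

```python
from collections import deque

def crawl(seeds, fetch, is_of_interest, max_uris=100):
    """Toy crawl loop mirroring the five workflow steps (not Heritrix's API)."""
    scheduled = deque(seeds)   # URIs waiting to be crawled
    seen = set(seeds)          # URIs that were ever scheduled
    done = []                  # URIs that have been processed
    while scheduled and len(done) < max_uris:
        uri = scheduled.popleft()            # 1) choose a scheduled URI
        links = fetch(uri)                   # 2) fetch it; 3) "analyze" = extract links
        for link in links:                   # 4) select discovered URIs of interest
            if link not in seen and is_of_interest(link):
                seen.add(link)
                scheduled.append(link)
        done.append(uri)                     # 5) note that the URI is done, repeat
    return done

# Hypothetical in-memory "web": page -> outgoing links.
TOY_WEB = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/"],
    "http://b.example/": [],
}

def toy_fetch(uri):
    return TOY_WEB.get(uri, [])

def same_site(uri):
    # A very simple crawl policy: stay on one host.
    return uri.startswith("http://a.example/")

visited = crawl(["http://a.example/"], toy_fetch, same_site)
```

Here the policy function plays the role of a crawl scope: the link to `b.example` is discovered but never scheduled.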
Heritrix Crawl Result
- By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files
- By default, the ARC files are written compressed
  - The compression is done with gzip
  - Each record (which contains one document) is gzipped separately
  - All gzipped records are concatenated to form a single file of multiple gzip members
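The "file of multiple gzipped members" layout can be illustrated with Python's standard library. This is a minimal sketch under stated assumptions: the byte strings below are made-up stand-ins, not real ARC records (the ARC header format is not modelled), and `write_members` and `iter_members` are hypothetical helper names.

```python
import gzip
import zlib

def write_members(records):
    """Gzip each record on its own and concatenate the results."""
    return b"".join(gzip.compress(record) for record in records)

def iter_members(data):
    """Yield the decompressed content of each gzip member, one at a time."""
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        content = decompressor.decompress(data) + decompressor.flush()
        yield content
        # Bytes after the end of this member belong to the next member.
        data = decompressor.unused_data

records = [b"record one", b"record two", b"record three"]
blob = write_members(records)
recovered = list(iter_members(blob))
```

Because every record is an independent gzip member, a reader can decompress one record without touching the others, while a plain `gzip.decompress(blob)` still returns all records joined together.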
Apache Solr
- A very popular search server built on Lucene
- Easy to configure and optimize
  - All modifications are made in XML files
  - No need to touch the code
- The index has a schema, similar to a database schema
  - Think of the index as a table in a database for which you have to define the columns
Solr Schema Example
  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
  <field name="creationDate" type="date" indexed="true" stored="true"/>
  <field name="rating" type="sint" indexed="true" stored="true"/>
  <field name="published" type="boolean" indexed="true" stored="true"/>
  <field name="content" type="text" indexed="true" stored="true"/>
  <field name="all" type="text" indexed="true" stored="true" multiValued="true"/>
Solr Documents
Solr accepts well-formed XML documents:
  <add>
    <doc>
      <field name="URL">www.cnn.com</field>
      <field name="title">CNN Breaking News – Obama wins</field>
      <field name="content">Barack Obama is the 44th president of the USA</field>
      <field name="pubDate">2008-11-06T23:59:59.999Z</field>
    </doc>
  </add>
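An update message in the `<add><doc>` shape shown above can be generated programmatically rather than written by hand. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `build_add_xml` is hypothetical, and this is not how YouSeer itself builds its XML. Generating the XML through a library keeps it well-formed even when a field value contains characters such as `&` or `<`.

```python
import xml.etree.ElementTree as ET

def build_add_xml(fields):
    """Build a Solr-style <add><doc>...</doc></add> message.

    `fields` is a list of (name, value) pairs; repeating a name is allowed,
    which is how a multi-valued field would be expressed.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value   # special characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")

message = build_add_xml([
    ("URL", "www.cnn.com"),
    ("title", "News & Weather"),
])
```

The resulting string is what would be sent in the body of an HTTP POST to the Solr update URL used later in the demo (http://localhost:8983/solr/update).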
YouSeer Workflow
- Waits for the crawled documents to be written
- Iterates over the compressed files and processes the documents
  - Extracts the textual content of each document and parses its metadata
- Generates an XML file as output
  - Each custom extractor appends its result to this file
- Submits this XML file to the index
Demo: Configuration
- The Solr schema is already configured in your installation
- Solr is installed on Tomcat
- By default, the Heritrix web interface listens on port 8080, the same port as the Apache Tomcat server, so start Heritrix on a different port, e.g. ./heritrix -p 9000
Demo
- Download the virtual machine image from http://sourceforge.net/projects/youseer/files/VM/youseer.0.1/fedora-11-i386.zip/download
- Unzip fedora-11-i386.zip
  - The virtual image is a Linux VMware image
- To run the VM, download and install VMware Player from http://www.vmware.com/products/player/
- Double-click the VMware virtual machine configuration icon
Demo
- Log in to YouSeer with the password "heritrixsolr"
- You are now in a virtual Linux environment running inside Windows
- When leaving the VM environment:
  - Log out from youseer ("youseer -> quit")
  - Shut down the VM ("shutdown")
- Press Ctrl + Alt to return to your local machine
Demo
About to start Heritrix (the crawler)
- In the VM, open a terminal
- Go to the apps directory (cd apps)
  - It contains the solr, tomcat, heritrix-1.14.3, etc. applications
- Don't forget to start the Solr server before running Heritrix
  - Go to apache-solr…/example/
  - Locate the jar file "start.jar" and run it (java -jar start.jar)
  - Keep Solr running the whole time
Demo
- Open another terminal, or another tab in the same terminal
- Go to heritrix-1.14.3 under /home/apps
- Run Heritrix with the following command line arguments:
  ./heritrix -p XXXX --admin=nameX:passwordX
- Open the browser in the VM and go to http://localhost:XXXX
- Log in to the Heritrix UI (username = nameX, password = passwordX)
Demo: Heritrix
Heritrix login screen
Demo: Heritrix
Enter the seed URLs
Demo: Heritrix
Configure the first job
- The most important parameter is the user agent, under the job configuration
  - Enter a valid URL and email address
  - Enter http://www.psu.edu and your OWN email address
- Do not run more than 5 threads
  - This avoids overloading the machine and crashing the system
Demo: Heritrix
Change the agent URL
Demo: Features of Heritrix
Demo: More features
Demo
- ARC files are written to: ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs
- To start Tomcat, enter start-tomcat
  - Solr will start automatically
- The YouSeer ingestion module (middleware) is located under: ~/youseer/release
  - Add a folder entry to the Apache web server configuration file, so that cached copies of documents can be retrieved from the ARC files
  - Use the URL of Solr to post the documents
  - Specify the number of worker threads that process the documents
- Usage: java -jar YouSeer.jar [IndexURL] [Path_ARCfiles] [Cached_virtual_Folder] [Number_of_Threads] [wait_Time]
Demo
To index documents crawled by Heritrix:
- Navigate to ~/youseer/release
- Run: java -jar YouSeer.jar http://localhost:8983/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0
The arguments, in order:
- http://localhost:8983/solr/update: the Solr URL
- /absolute/path/to/arc/files: the full path to the ARC files
- /cachingDirectory: the virtual directory that maps to the cached files
- 1: the number of threads; please keep it below 5
- 0: the waiting time between retries
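When the ingestion step is scripted, the five positional arguments are easy to mix up. The following sketch assembles the command line from named parameters and enforces the advice to keep the thread count below 5. The function `youseer_command` and its parameter names are hypothetical; only the argument order is taken from the command above.

```python
def youseer_command(index_url, arc_dir, cache_dir, threads=1, wait_time=0):
    """Return the YouSeer ingestion command as an argument list."""
    if not 1 <= threads < 5:
        raise ValueError("keep the number of threads between 1 and 4")
    if wait_time < 0:
        raise ValueError("wait_time must not be negative")
    return [
        "java", "-jar", "YouSeer.jar",
        index_url,        # Solr update URL
        arc_dir,          # full path to the ARC files
        cache_dir,        # virtual directory mapped to the cached files
        str(threads),     # number of worker threads
        str(wait_time),   # waiting time between retries
    ]

command = youseer_command(
    "http://localhost:8983/solr/update",
    "/absolute/path/to/arc/files",
    "/cachingDirectory",
)
```

The list could then be passed to `subprocess.run(command, cwd=...)` with the working directory set to the YouSeer release folder, so that YouSeer.jar is found.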
Comments
- YouSeer tracks which ARC files have been processed in a database; its default name is submitted.db
  - If you want to re-ingest the documents, update the submitted.db file
- Map the virtual directory within the Tomcat directory with a Context entry:
  <Context path="/cached" docBase="/heritrix-1.14.3/jobs/JOB_NAME/arcs" crossContext="false" debug="0" reloadable="true"/>
- The search interface: http://localhost:8080/youseerui
Shots
Test case (http://pike.psu.edu)
References
Documentation:
- http://youseer.sourceforge.net/doc/Tutorial.pdf
- http://crawler.archive.org/articles/user_manual/
- http://lucene.apache.org/solr/tutorial.html
Want to download the components separately?
- https://sourceforge.net/projects/youseer/
- https://sourceforge.net/projects/archive-crawler/files/archive-crawler%20(heritrix%201.x)/
- http://www.apache.org/dyn/closer.cgi/lucene/solr
THANK YOU