Crawling with Heritrix


1 Crawling with Heritrix
Agnese Chiatti and Lee Giles. Thanks to Sagnik Ray Choudhury.

2 Accessing the IST441 server
About PuTTY: an open-source terminal emulator that, among other protocols, supports SSH (Secure Shell). It is normally used on Microsoft Windows to log in securely to a remote server. Each team is assigned a dedicated folder, with read/write and execute permissions only under that folder.

3 Instructions to access
Access VLABS from your browser. Download PuTTY, double-click the downloaded file, and ignore the warning to proceed.

4 Instructions to access (2)
A window like the one shown on the slide should open. The host name is ist441giles.ist.psu.edu. Check that the default port shown is 22 and that the SSH connection type is selected. Once a terminal window pops up, log in using your PSU credentials.

5 2. Survival command line
Navigate to your team's folder:
cd /data/ist441/team<team-number>   e.g., cd /data/ist441/team1
Example: creating and then removing a file:
touch test.txt
rm test.txt
Use CAUTION with the rm command.
[Optional] Survival Linux commands (plus warnings about using rm). Resources for vim, for editing files directly in the team folder.

6 3. How does a crawler work? A web crawler follows a simple algorithm:
1. Start with a queue containing a set of seed web URIs.
2. If the queue is empty, exit; otherwise, take a URI from the queue.
3. Go to the URI, with a caveat: we might not want to crawl certain content.
4. Fetch the content at the URI just taken from the queue.
5. Store the content on disk. (Content might be crawled but not stored. Why?)
6. Extract links (i.e., new URIs) from the fetched content.
7. Add the extracted links to the queue.
8. Go to Step 2.
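The steps above can be sketched as a short Python loop (a minimal illustration of the algorithm, not Heritrix's actual implementation; the fetch/store/filter functions are supplied by the caller):

```python
from collections import deque

def crawl(seeds, fetch, extract_links, store, should_visit, max_pages=100):
    """Breadth-first crawl following the steps above.

    fetch(uri) returns page content, extract_links(uri, content) yields new
    URIs, store(uri, content) saves content to disk, and should_visit(uri)
    encodes the caveat that some content should not be crawled at all
    (robots.txt exclusions, unwanted file types, etc.).
    """
    queue = deque(seeds)        # Step 1: queue seeded with start URIs
    seen = set(seeds)           # avoid re-enqueuing a URI twice
    pages = 0
    while queue and pages < max_pages:   # Step 2: exit when queue is empty
        uri = queue.popleft()
        if not should_visit(uri):        # the caveat: skip some URIs
            continue
        content = fetch(uri)             # Step 4: fetch content
        store(uri, content)              # Step 5: store on disk (or not)
        for link in extract_links(uri, content):   # Step 6: extract links
            if link not in seen:
                seen.add(link)
                queue.append(link)       # Step 7: add to queue
        pages += 1                       # Step 8: back to Step 2
    return pages
```

A real crawler adds politeness delays, retry logic, and URL canonicalization on top of this loop; Heritrix handles all of those for you.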

7 4. Heritrix
Open-source crawler developed by the Internet Archive.
PROS:
- Highly scalable
- Easily configurable
CONS:
- Difficult to crawl dynamic pages
- Configuration files might not be trivial to read

8 5. Configuring your first crawling job
Open a browser. The Heritrix GUI is available at a team-specific address (e.g., for team 1:)

9 Click on Advanced to proceed

Authentication: you will need to log in with the username and password provided in class.

11

12 Which configurations do you need to change?
Contact info (line 40)
Seed URI, e.g., Dr. Giles' homepage (line 57)
All the seed URIs you selected for your own project will have to be added in that section.
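For orientation, in Heritrix 3 these settings live in the job's crawler-beans.cxml configuration file; the relevant fragments look roughly like the following (the contact URL and seed below are placeholders, and exact line numbers vary by version):

```xml
<!-- Contact info: set a reachable operator contact URL -->
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
metadata.operatorContactUrl=http://example.edu/your-contact-page
    </value>
  </property>
</bean>

<!-- Seeds: one URI per line; list every seed URI for your project here -->
<bean id="longerOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <props>
      <prop key="seeds.textSource.value">
http://example.edu/seed-page/
      </prop>
    </props>
  </property>
</bean>
```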

13 Making use of filters: storing only HTMLs
In this example we store only HTML files (lines ). The syntax is based on regular expressions (regex). Full documentation covers filters for other file types:
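Heritrix evaluates these filters internally as regex-based ACCEPT/REJECT decide rules. As an illustration of the kind of pattern involved, a Python sketch (the pattern shown is an assumption for illustration, not the exact one from the configuration):

```python
import re

# Illustrative rule: accept only URIs ending in .html or .htm
# (case-insensitive), mirroring a regex-based store filter.
ACCEPT_HTML = re.compile(r".*\.html?$", re.IGNORECASE)

def keep(uri):
    """Return True if the URI's content should be stored."""
    return bool(ACCEPT_HTML.match(uri))
```

The same approach extends to other file types by changing the extension list in the pattern, e.g. `\.(pdf|docx?)$` for documents.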

14 Launch and Run a Heritrix job
After saving changes to the configuration file:
Click on Build, then Launch.
Refresh the page.
Unpause the job.
Verify that the job is running.

