Crawling with Heritrix


1 Crawling with Heritrix
Agnese Chiatti and Lee Giles. Thanks to Sagnik Ray Choudhury.

2 Accessing the IST441 server
About PuTTY: an open-source terminal emulator that, among other protocols, supports SSH (Secure Shell). It is normally used on Microsoft Windows to log in securely to a remote server. Each team is assigned a dedicated folder, with read/write and execute permissions only under that folder.

3 Instructions to access
Access VLABS from your browser. Download PuTTY, double-click the downloaded file, and ignore the warning to proceed.

4 Instructions to access (2)
A window like the one shown on the slide should open. The host name is ist441giles.ist.psu.edu. Check that the default port shown is 22 and that the SSH connection type is selected. Once a terminal window pops up, log in using your PSU credentials.

5 2. Survival command line
Navigate to your team's folder:
cd /data/ist441/team<team-number>   e.g., cd /data/ist441/team1
Example: creating and then removing a file:
touch test.txt
rm test.txt
Use CAUTION with the rm command.
[Optional] Survival Linux commands (plus warnings about using rm). Resources for vim, for editing files directly in the team folder.

6 3. How does a crawler work? A web crawler follows a simple algorithm:
1. Start with a queue containing a set of seed web URIs.
2. If the queue is empty, exit; otherwise, take a URI from the queue.
3. Go to the URI, with a caveat: we might not want to crawl certain content.
4. Fetch the content at the URI just taken from the queue.
5. Store the content on disk. (Content might be crawled but not stored. Why?)
6. Extract links (i.e., new URIs) from the fetched content.
7. Add the extracted links to the queue.
8. Go to Step 2.
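The steps above can be sketched as a short Python loop (a minimal illustration of the algorithm, not Heritrix's actual implementation; the fetch/store/filter functions are supplied by the caller):

```python
from collections import deque

def crawl(seeds, fetch, extract_links, store, should_visit, max_pages=100):
    """Breadth-first crawl following the steps above.

    fetch(uri) returns page content, extract_links(uri, content) yields new
    URIs, store(uri, content) saves content to disk, and should_visit(uri)
    encodes the caveat that some content should not be crawled at all
    (robots.txt exclusions, unwanted file types, etc.).
    """
    queue = deque(seeds)        # Step 1: queue seeded with start URIs
    seen = set(seeds)           # avoid re-enqueuing a URI twice
    pages = 0
    while queue and pages < max_pages:   # Step 2: exit when queue is empty
        uri = queue.popleft()
        if not should_visit(uri):        # the caveat: skip some URIs
            continue
        content = fetch(uri)             # Step 4: fetch content
        store(uri, content)              # Step 5: store on disk (or not)
        for link in extract_links(uri, content):   # Step 6: extract links
            if link not in seen:
                seen.add(link)
                queue.append(link)       # Step 7: add to queue
        pages += 1                       # Step 8: back to Step 2
    return pages
```

A real crawler adds politeness delays, retry logic, and URL canonicalization on top of this loop; Heritrix handles all of those for you.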

7 4. Heritrix
Open-source crawler developed by the Internet Archive.
PROS:
- Highly scalable
- Easily configurable
CONS:
- Difficult to crawl dynamic pages
- Configuration files might not be trivial to read

8 5. Configuring your first crawling job
Open a browser. The Heritrix GUI is available at a team-specific address (e.g., for team 1:)

9 Click on Advanced to proceed

Authentication: you will need to log in with the username and password provided in class.

11

12 Which configurations do you need to change?
Contact info (line 40)
Seed URI, e.g., Dr. Giles' homepage (line 57)
All the seed URIs you selected for your own project will have to be added in that section.
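For orientation, in Heritrix 3 these settings live in the job's crawler-beans.cxml configuration file; the relevant fragments look roughly like the following (the contact URL and seed below are placeholders, and exact line numbers vary by version):

```xml
<!-- Contact info: set a reachable operator contact URL -->
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
metadata.operatorContactUrl=http://example.edu/your-contact-page
    </value>
  </property>
</bean>

<!-- Seeds: one URI per line; list every seed URI for your project here -->
<bean id="longerOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <props>
      <prop key="seeds.textSource.value">
http://example.edu/seed-page/
      </prop>
    </props>
  </property>
</bean>
```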

13 Making use of filters: storing only HTMLs
In this example we store only HTML files (lines ). The syntax is based on regular expressions (regex). Full documentation covers filters for other file types:
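Heritrix evaluates these filters internally as regex-based ACCEPT/REJECT decide rules. As an illustration of the kind of pattern involved, a Python sketch (the pattern shown is an assumption for illustration, not the exact one from the configuration):

```python
import re

# Illustrative rule: accept only URIs ending in .html or .htm
# (case-insensitive), mirroring a regex-based store filter.
ACCEPT_HTML = re.compile(r".*\.html?$", re.IGNORECASE)

def keep(uri):
    """Return True if the URI's content should be stored."""
    return bool(ACCEPT_HTML.match(uri))
```

The same approach extends to other file types by changing the extension list in the pattern, e.g. `\.(pdf|docx?)$` for documents.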

14 Launch and Run a Heritrix job
After saving changes to the configuration file:
Click on Build, then Launch.
Refresh the page.
Unpause the job.
Verify that the job is running.

