1
INF 141: Information Retrieval
Discussion Session Week 3 – Winter 2010 TA: Sara Javanmardi
2
How to submit Answers For Assignment3
Create a PDF file containing your answers to the general and extra credit questions. For the programming question, create a txt file containing the answers and a jar file containing the code. Put all files in a folder. Name the folder <StudentID>-<StudentID>-<StudentID>-Assignment03, zip it, and submit it to EEE (only one of the team members needs to submit the .zip file).
3
Grading Assignment 1 & 2
Come to my office at ICS1:408E (all team members) during one of the following time slots:
Wednesday, Jan 19, 3:30 pm to 6 pm
Monday, Jan 24, 9 am to 12 pm
I will ask you to explain your algorithm and to run your code on a test input file. I might also ask some other general questions related to these two assignments.
4
Quiz 1
Next week, Jan 26, in the discussion class. Closed book.
All material covered in weeks 1, 2, 3.
5
Assignment 3 You can get it from Deadline Jan 30
6
Crawler4j http://code.google.com/p/crawler4j/ Read the sample usage
Download crawler4j-2.2.zip. Unzip it and put the ‘lib’ folder in your Java project.
Download crawler4j-dependencies-lib.zip.
* Unzip it and add the .jar files to the ‘lib’ folder
* Create a folder called ‘resources’ and put the .properties file in it
7
The ‘lib’ & ‘resources’ folder
8
Main Classes Create two classes Controller MyCrawler
9
Controller Class At this time you should see some errors
Oops forgot to import the .jar files!
10
Add External Jars Select All and press open and then OK
11
Add Sources To The Classpath
12
Controller:Setting The Parameters
13
MyCrawler: Main Methods
shouldVisit(WebURL url): Should I put this URL in the frontier or not?
visit(Page page): How should I process this page coming from the head of the frontier?
Useful accessors on Page: page.getWebURL().getURL(); page.getHTML(); page.getText(); page.getURLs()
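The frontier decision can be sketched without the library on the classpath. The class below is a standalone stand-in, not crawler4j code: it takes a plain String instead of a WebURL, and the class name VisitFilter, the extension list, and the Wikipedia-only scope are all assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Standalone sketch: uses a plain String URL instead of crawler4j's WebURL
// so that it runs without the library.
class VisitFilter {
    // Assumed filter: skip common style, script, media, and archive files.
    private static final Pattern SKIP = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|bmp|mp3|zip|gz|pdf)$");

    // Assumed crawl scope: English Wikipedia pages only.
    private static final String SCOPE = "http://en.wikipedia.org/wiki/";

    // Returns true if the URL should be added to the frontier.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        return href.startsWith(SCOPE) && !SKIP.matcher(href).matches();
    }
}
```

In a real crawler the same test would sit inside shouldVisit(WebURL url), applied to the URL string of the argument.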
14
An Example
15
Content Articles http://en.wikipedia.org/wiki/Bing_search_engine
public static boolean isArticle(String titlePartOfUrl) {
    // Titles that start with a namespace prefix are not content articles.
    if (titlePartOfUrl.startsWith("Image:") || titlePartOfUrl.startsWith("Wikipedia:")
            || titlePartOfUrl.startsWith("Category:") || titlePartOfUrl.startsWith("Special:")
            || titlePartOfUrl.startsWith("Image_talk:") || titlePartOfUrl.startsWith("Portal:")
            || titlePartOfUrl.startsWith("Wikipedia_talk:") || titlePartOfUrl.startsWith("User:")
            || titlePartOfUrl.startsWith("Template:") || titlePartOfUrl.startsWith("Template_talk:")
            || titlePartOfUrl.startsWith("Help:") || titlePartOfUrl.startsWith("Talk:")
            || titlePartOfUrl.startsWith("User_talk:") || titlePartOfUrl.startsWith("Category_talk:")
            || titlePartOfUrl.startsWith("Media:") || titlePartOfUrl.startsWith("MediaWiki:")
            || titlePartOfUrl.startsWith("File:") || titlePartOfUrl.startsWith("MediaWiki_Talk:")) {
        return false;
    }
    return true;
}
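isArticle expects only the title part of the URL, so the caller has to cut it out first. One possible way to do that is sketched below; the class name TitleExtractor and the method titlePartOf are hypothetical, and the sketch assumes Wikipedia-style URLs where the title follows "/wiki/".

```java
// Hypothetical helper: extracts the title part that isArticle expects.
class TitleExtractor {
    private static final String MARKER = "/wiki/";

    // Returns the text after "/wiki/" without any "#fragment",
    // or null if the URL does not contain "/wiki/".
    static String titlePartOf(String url) {
        int start = url.indexOf(MARKER);
        if (start < 0) {
            return null;
        }
        String title = url.substring(start + MARKER.length());
        int fragment = title.indexOf('#');
        if (fragment >= 0) {
            title = title.substring(0, fragment);
        }
        return title;
    }
}
```

For http://en.wikipedia.org/wiki/Bing_search_engine this yields Bing_search_engine, which can then be passed to isArticle after a null check.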
16
Main Questions To Answer
How to count unique terms, and with what data structure?
How to write the results to file(s)?
How to solve concurrency problems that might happen?
Options: static, synchronized, AtomicInteger
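One possible answer to the counting question is a single static map shared by all crawler threads, with an atomic counter per term. This is a minimal sketch, not the required solution: the class name TermCounter is hypothetical, and splitting on non-word characters is a deliberately crude tokenizer chosen as an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shared counter: term -> number of occurrences seen so far.
class TermCounter {
    // Static, so every crawler instance (thread) updates the same map.
    private static final ConcurrentHashMap<String, AtomicInteger> COUNTS =
            new ConcurrentHashMap<>();

    // Tokenizes the text (assumed: lowercase, split on non-word characters)
    // and increments the counter of each term atomically.
    static void addText(String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            COUNTS.computeIfAbsent(term, k -> new AtomicInteger()).incrementAndGet();
        }
    }

    // Number of distinct terms seen so far.
    static int uniqueTerms() {
        return COUNTS.size();
    }

    // Occurrences of one term, or 0 if it was never seen.
    static int countOf(String term) {
        AtomicInteger count = COUNTS.get(term);
        return count == null ? 0 : count.get();
    }
}
```

Each crawler could call TermCounter.addText(page.getText()) from visit. A plain HashMap<String, Integer> in the same role would lose updates when two threads increment the same term at once.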
17
Example: IO
Option 1: I have only one file and all threads (crawlers) write to it.
Option 2: Each thread has its own file and I merge all files when the threads (crawlers) are done.
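For the one-file-per-thread approach, the merge step at the end can be a plain byte copy. The sketch below assumes each crawler has already written and closed its own part file; the class name PartFileMerger and its method are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical merge step, run once after all crawler threads have finished.
class PartFileMerger {
    // Concatenates the part files into target, in list order.
    // An existing target file is overwritten.
    static void merge(List<Path> parts, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams the bytes, no full read into memory
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because no two threads ever touch the same file, the crawlers need no locking while they run; the cost is the extra pass at the end.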
18
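Sample Code Snippet
// Needs: import java.io.FileNotFoundException; import java.io.PrintStream;
// One output stream shared by all crawler instances, opened once when the class is loaded.
private static PrintStream out;
static {
    try {
        out = new PrintStream("/home/sara/Wikipedia2/Train-Test-Features/testUserPageStatus.txt");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}
public MyCrawler() { }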
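When all crawlers share one static stream like this, a record that is written with several print calls can be interleaved with output from another thread. A minimal sketch of one way to prevent that is a static synchronized write method; the class name SharedLog, the method names, and the tab-separated record layout are assumptions for illustration.

```java
import java.io.PrintStream;

// Hypothetical wrapper around a stream shared by all crawler threads.
class SharedLog {
    // Assumed default; in the crawler this would be the file-backed stream.
    private static PrintStream out = System.out;

    static synchronized void setOut(PrintStream stream) {
        out = stream;
    }

    // static synchronized: at most one thread at a time runs this method,
    // so the three print calls of one record always stay together.
    static synchronized void writeRecord(String url, int termCount) {
        out.print(url);
        out.print('\t');
        out.println(termCount);
    }
}
```

The lock is on the class, so it is shared by every crawler instance no matter how many threads the controller starts.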