INF 141: Information Retrieval

1 INF 141: Information Retrieval
Discussion Session Week 3 – Winter 2010 TA: Sara Javanmardi

2 How to Submit Answers for Assignment 3
Create a PDF file containing your answers to the general and extra credit questions. For the programming question, create a txt file containing the answers and a jar file containing the code. Put all files in a folder. Name the folder <StudentID>-<StudentID>-<StudentID>-Assignment03, zip it, and submit it to EEE (only one of the team members needs to submit the .zip file).

3 Grading Assignment 1 & 2
Come to my office at ICS1:408E (all team members) in one of the following time slots:
Wednesday, Jan 19, 3:30 pm to 6 pm
Monday, Jan 24, 9 am to 12 pm
I will ask you to explain your algorithm and to run your code on a test input file. I might also ask some general questions related to these two assignments.

4 Quiz 1
Next week, Jan 26, in the discussion class. Closed book.
Covers all material from weeks 1, 2, and 3.

5 Assignment 3
You can get it from [link not preserved in this transcript].
Deadline: Jan 30

6 Crawler4j
http://code.google.com/p/crawler4j/
Read the sample usage.
Download crawler4j-2.2.zip, unzip it, and put the ‘lib’ folder in your Java project.
Download crawler4j-dependencies-lib.zip, unzip it, and add the .jar files to the ‘lib’ folder.
Create a folder called ‘resources’ and put the .properties file in it.

7 The ‘lib’ & ‘resources’ folder

8 Main Classes
Create two classes: Controller and MyCrawler.

9 Controller Class
At this point you should see some errors.
Oops, forgot to import the .jar files!

10 Add External Jars
Select all the .jar files, press Open, and then OK.

11 Add Sources To The Classpath

12 Controller: Setting The Parameters

13 MyCrawler: Main Methods
shouldVisit(WebURL url): Should I put this URL in the frontier or not?
visit(Page page): How should I process this page coming from the head of the frontier?
Useful calls on the page: page.getWebURL().getURL(); page.getHTML(); page.getText(); page.getURLs()
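The decision made in shouldVisit can be sketched without the crawler library. The following is a minimal standalone illustration, not the crawler4j class itself: it takes a plain String instead of a WebURL so it runs on its own, and the skip pattern and the crawl scope are assumptions chosen for the example.

```java
import java.util.regex.Pattern;

// Standalone sketch of the "frontier or not?" decision.
// In a real crawler this logic would sit inside MyCrawler.shouldVisit.
class ShouldVisitSketch {
    // Hypothetical filter: skip common static and binary resources.
    private static final Pattern SKIP = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|bmp|pdf|zip|gz|mp3|mp4)$");

    // Hypothetical crawl scope: English Wikipedia pages only.
    private static final String SCOPE = "http://en.wikipedia.org/wiki/";

    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        return !SKIP.matcher(href).matches() && href.startsWith(SCOPE);
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://en.wikipedia.org/wiki/Bing_search_engine"));
        System.out.println(shouldVisit("http://en.wikipedia.org/static/logo.png"));
    }
}
```

Keeping the filter cheap matters because it runs once for every link found on every page.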

14 An Example

15 Content Articles http://en.wikipedia.org/wiki/Bing_search_engine
// Returns false for Wikipedia namespace pages (talk pages, templates, files, ...)
// and true for ordinary content articles.
public static boolean isArticle(String titlePartOfUrl) {
    if (titlePartOfUrl.startsWith("Image:")
            || titlePartOfUrl.startsWith("Wikipedia:")
            || titlePartOfUrl.startsWith("Category:")
            || titlePartOfUrl.startsWith("Special:")
            || titlePartOfUrl.startsWith("Image_talk:")
            || titlePartOfUrl.startsWith("Portal:")
            || titlePartOfUrl.startsWith("Wikipedia_talk:")
            || titlePartOfUrl.startsWith("User:")
            || titlePartOfUrl.startsWith("Template:")
            || titlePartOfUrl.startsWith("Template_talk:")
            || titlePartOfUrl.startsWith("Help:")
            || titlePartOfUrl.startsWith("Talk:")
            || titlePartOfUrl.startsWith("User_talk:")
            || titlePartOfUrl.startsWith("Category_talk:")
            || titlePartOfUrl.startsWith("Media:")
            || titlePartOfUrl.startsWith("MediaWiki:")
            || titlePartOfUrl.startsWith("File:")
            || titlePartOfUrl.startsWith("MediaWiki_Talk:")) {
        return false;
    }
    return true;
}
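isArticle expects only the title part of the URL, so the caller has to strip the site prefix first. Below is a small sketch of one way to do that. The helper names titlePartOf and isArticleUrl are hypothetical, and the prefix list is the one from the slide, stored in a list instead of a chain of conditions.

```java
import java.util.Arrays;
import java.util.List;

class WikiArticleSketch {
    // Assumed URL layout, taken from the example URL on the slide.
    private static final String WIKI_PREFIX = "http://en.wikipedia.org/wiki/";

    // Same namespace prefixes as the isArticle method on the slide.
    private static final List<String> NON_ARTICLE_PREFIXES = Arrays.asList(
            "Image:", "Wikipedia:", "Category:", "Special:", "Image_talk:",
            "Portal:", "Wikipedia_talk:", "User:", "Template:", "Template_talk:",
            "Help:", "Talk:", "User_talk:", "Category_talk:", "Media:",
            "MediaWiki:", "File:", "MediaWiki_Talk:");

    // Hypothetical helper: returns the part after /wiki/, or null if out of scope.
    static String titlePartOf(String url) {
        return url.startsWith(WIKI_PREFIX) ? url.substring(WIKI_PREFIX.length()) : null;
    }

    static boolean isArticle(String titlePartOfUrl) {
        for (String prefix : NON_ARTICLE_PREFIXES) {
            if (titlePartOfUrl.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical convenience wrapper combining both steps.
    static boolean isArticleUrl(String url) {
        String title = titlePartOf(url);
        return title != null && isArticle(title);
    }

    public static void main(String[] args) {
        System.out.println(isArticleUrl("http://en.wikipedia.org/wiki/Bing_search_engine"));
        System.out.println(isArticleUrl("http://en.wikipedia.org/wiki/Talk:Bing_search_engine"));
    }
}
```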

16 Main Questions To Answer
How to count unique terms, and with what data structure?
How to write the result to file(s)?
How to solve the concurrency problems that might happen?
Java tools to consider: static fields, synchronized, AtomicInteger
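One possible answer to the first and third questions, given as a sketch and not as the required solution: keep a single static map shared by all crawler threads and make each update atomic. The class name, the tokenization rule (split on non-word characters, lowercase), and the use of Java 8 features are assumptions made for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class TermCounterSketch {
    // Static, so every crawler instance (one per thread) sees the same data.
    private static final ConcurrentHashMap<String, AtomicInteger> COUNTS =
            new ConcurrentHashMap<>();
    private static final AtomicInteger PAGES = new AtomicInteger();

    // Called by each crawler thread with the text of a visited page.
    static void addPage(String text) {
        PAGES.incrementAndGet();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split can yield an empty first token
            }
            // Atomic insert-if-missing, followed by an atomic increment.
            COUNTS.computeIfAbsent(term, k -> new AtomicInteger()).incrementAndGet();
        }
    }

    static int uniqueTerms() {
        return COUNTS.size();
    }

    static int pagesSeen() {
        return PAGES.get();
    }

    static int count(String term) {
        AtomicInteger c = COUNTS.get(term);
        return c == null ? 0 : c.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] crawlers = new Thread[4];
        for (int i = 0; i < crawlers.length; i++) {
            crawlers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    addPage("search engine search");
                }
            });
            crawlers[i].start();
        }
        for (Thread t : crawlers) {
            t.join();
        }
        System.out.println(uniqueTerms() + " unique terms in " + pagesSeen() + " pages");
    }
}
```

A plain HashMap with an int counter would lose updates under concurrent writes; a synchronized method around a HashMap is a simpler alternative that trades throughput for clarity.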

17 Example: IO
Option 1: I have only one file, and all threads (crawlers) write to it.
Option 2: Each thread has its own file, and I merge all files when the threads (crawlers) are done.
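The second approach, one file per thread followed by a merge, avoids locking during the crawl. A minimal sketch follows, assuming plain worker threads that stand in for crawlers and a temporary directory for output; the file names and line contents are made up for the illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class PerThreadOutputSketch {
    // Each worker writes its own file; the files are merged after all workers finish.
    static List<String> crawlAndMerge(int threads, int linesPerThread)
            throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("crawl-out");
        List<Path> parts = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();

        for (int i = 0; i < threads; i++) {
            final int id = i;
            final Path part = dir.resolve("crawler-" + id + ".txt");
            parts.add(part);
            Thread t = new Thread(() -> {
                // No synchronization needed: this file belongs to one thread only.
                try (PrintWriter out = new PrintWriter(
                        Files.newBufferedWriter(part, StandardCharsets.UTF_8))) {
                    for (int j = 0; j < linesPerThread; j++) {
                        out.println("crawler " + id + " page " + j);
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            t.join(); // wait until every crawler is done before merging
        }

        Path merged = dir.resolve("merged.txt");
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(merged, StandardCharsets.UTF_8))) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    out.println(line);
                }
            }
        }
        return Files.readAllLines(merged, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(crawlAndMerge(3, 2).size() + " lines merged");
    }
}
```

With the first approach (one shared file), every write must go through a synchronized block or a thread-safe stream, which is simpler to set up but makes the threads wait on each other.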

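18 Sample Code Snippet
import java.io.FileNotFoundException;
import java.io.PrintStream;

// Fields and constructor of the MyCrawler class.
// One static PrintStream is shared by every crawler instance (thread).
private static PrintStream out;

static {
    try {
        out = new PrintStream("/home/sara/Wikipedia2/Train-Test-Features/testUserPageStatus.txt");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}

public MyCrawler() {
}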

