Slide 1: Automatic Classification of Bookmarked Web Pages
Chris Staff
First Talk, February 2007

Slide 2: Overview
– General Principles
– Reading List
– Tasks involved
– Schedule

Slide 3: General Principles
– Email: cstaff@cs.um.edu.mt
– Web site: http://www.cs.um.edu.mt/~cstaff
– Plagiarism
– Referencing
– ACM Digital Library: membership for students from Malta

Slide 4: Reading List
– Abrams, D., Baecker, R.: How people use WWW bookmarks. In: CHI '97 Extended Abstracts on Human Factors in Computing Systems, New York, NY, USA, ACM Press (1997) 341-342
– Bugeja, I.: Managing WWW Browser's Bookmarks and History (a Firefox extension). Final year project report, Department of Computer Science & AI, University of Malta (2006). http://hyper.iannet.org/hyperBkreport.pdf
– Cockburn, A., McKenzie, B.: What do web users do? An empirical analysis of web use. Int. J. Hum.-Comput. Stud. 54(6) (2001) 903-922
– Staff, C.: Automatic Classification of Web Pages into Bookmark Categories. Submitted to UM'07 (2007)
– Staff, C.: CSA3200 User Adaptive Systems Lecture Notes (2006). Follow link from http://www.cs.um.edu.mt/~cstaff/
– Mozilla Development Center: Building an Extension (2006). http://developer.mozilla.org/en/docs/Building_an_Extension

Slide 5: Classifying Bookmarks
When a user bookmarks a page (or adds a page to Favorites), we want to recommend the best existing category:
– An improvement over simply recommending the last category saved to
– An improvement over simply offering the 'category root'

Slide 6: Tasks
1. Representation of bookmark categories
2. Two clustering/similarity algorithms
3. Extra utility
4. User interface
5. Evaluation
6. Write up report

Slide 7: Tasks Overview
We are going to implement a number of algorithms to help with the overall task:
– Some will be used while the user is browsing
– Others will be used to classify pages 'off-line' (especially for the existing bookmark files)
We will have a 'standard test bed' for conducting the evaluation.

Slide 8: Tasks Overview
Represent bookmark categories:
– We are starting with populated bookmark files, so use the 'How Did I Find That?' approach
– Plus another, individual approach
When a page is to be bookmarked:
– If the referrer page is available, identify the topic of the page from it
– Otherwise, identify the page topic using the 'How Did I Find That?' approach
Then compare the current topic to the bookmark category representations (a sketch of the topic step follows below).
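A minimal sketch of the topic-identification step, assuming pages and referrers are reduced to plain term-frequency vectors. This stand-in does not reproduce 'How Did I Find That?', and every name in it is illustrative:

```javascript
// Illustrative stand-in for topic identification: reduce text to a
// term -> count map, preferring the referrer's text when available.
const STOP = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'for', 'on']);

function termVector(text) {
  const vec = {};
  for (const tok of (text.toLowerCase().match(/[a-z]+/g) || [])) {
    if (!STOP.has(tok) && tok.length > 2) vec[tok] = (vec[tok] || 0) + 1;
  }
  return vec;
}

function pageTopic(pageText, referrerText) {
  // Prefer referrer context when it is available, as on the slide;
  // otherwise fall back to the page's own text.
  return termVector(referrerText || pageText);
}
```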

Slide 9: Tasks Overview
User Interface:
– Replace the built-in 'Bookmark this Page' menu item and keyboard command
– Display a new dialog box that offers the recommended category and the last category used, and that allows the user to select some other category or create a new one

Slide 10: Tasks Overview
Evaluation:
– Will be standard and automated
– For testing purposes, download test_eval.zip from the home page
– It contains 2x8 bookmark files (.html) and one URL file (.txt)
– The bookmark files are 'real' files collected one year ago
– Each line of the URL file has the following format: bookmark file ID, URL of the bookmarked page, home category, exact entry from the bookmark file (with date created, etc.); a parsing sketch follows below
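A minimal parsing sketch for those URL-file lines. The slide lists the fields but not the delimiter, so comma separation is an assumption here, and the trailing bookmark-file entry is kept whole because it may itself contain commas:

```javascript
// Hypothetical parser for one URL-file line (delimiter assumed).
function parseUrlLine(line) {
  const [fileId, url, homeCategory, ...rest] = line.split(',');
  return {
    fileId: fileId.trim(),
    url: url.trim(),
    homeCategory: homeCategory.trim(),
    bookmarkEntry: rest.join(',').trim(), // exact entry, date created, etc.
  };
}
```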

Slide 11: Tasks Overview
Evaluation (continued):
– There is also the challenge of 're-creating' each bookmark file in the order in which its user created it (a sketch follows below)
– Close to the end of the APT, the evaluation test data sets will be made available: about 20 unseen bookmark files and one URL file, in the same format as before
– You will get the bookmark files early to prepare representations, but the classification run will be part of a demo session
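One way to approach the re-creation challenge: Firefox bookmark files use the Netscape bookmark HTML format, whose anchors carry an ADD_DATE attribute (a Unix timestamp), so sorting on it approximates creation order. A sketch (the function name is illustrative):

```javascript
// Recover an approximate creation order from a Netscape-format
// bookmark file by sorting anchors on their ADD_DATE timestamps.
function creationOrder(bookmarkHtml) {
  const entries = [];
  const re = /<A\s+HREF="([^"]*)"[^>]*\bADD_DATE="(\d+)"/gi;
  let m;
  while ((m = re.exec(bookmarkHtml)) !== null) {
    entries.push({ url: m[1], added: Number(m[2]) });
  }
  return entries.sort((a, b) => a.added - b.added);
}
```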

Slide 12: Tasks Overview
Write up report:
– We will spend some time looking at the structure of a scientific report, how to write a literature review, how to present evaluation results, etc.

Slide 13: Task: Representing Bookmark Categories
We need to identify what a category (a collection of bookmarks) is about, so that we can check whether a new page could belong to that category.
Ideally, we find out what is similar between the different documents in the category (especially if we know which link a user followed to reach a child page!).
In the absence of this information:
– One algorithm will be based on 'How Did I Find That?'
– A second algorithm is up to you (one generic possibility is sketched below)
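For the second, individual algorithm, one generic information-retrieval baseline (not the 'How Did I Find That?' method) is to represent a category as the centroid of its member pages' term vectors:

```javascript
// Generic baseline: a category representation as the average
// (centroid) of its member pages' term -> weight vectors.
function categoryCentroid(pageVectors) {
  const centroid = {};
  for (const vec of pageVectors) {
    for (const [term, w] of Object.entries(vec)) {
      centroid[term] = (centroid[term] || 0) + w / pageVectors.length;
    }
  }
  return centroid;
}
```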

Slide 14: Task: Two Clustering/Similarity Algorithms
Once we have represented the categories, we can 'send' the page to be bookmarked to the best category:
– This is similar to 'information filtering' or 'clustering'
– Which similarity measure or clustering algorithm should be used? (one common choice is sketched below)
One way of representing the page to be classified will be based on 'How Did I Find That?'; the other way is researched/developed by you.
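The slides leave the similarity measure open; cosine similarity is one standard choice, sketched here over the term-vector representations from the earlier sketches:

```javascript
// Cosine similarity between two term -> weight maps.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [term, w] of Object.entries(a)) {
    na += w * w;
    if (term in b) dot += w * b[term];
  }
  for (const w of Object.values(b)) nb += w * w;
  return (na && nb) ? dot / Math.sqrt(na * nb) : 0;
}

// Recommend the existing category whose representation is most
// similar to the topic of the page being bookmarked.
function recommendCategory(pageTopic, categories) {
  let best = null; // categories: { name: representationVector, ... }
  for (const [name, rep] of Object.entries(categories)) {
    const score = cosine(pageTopic, rep);
    if (best === null || score > best.score) best = { name, score };
  }
  return best; // e.g. { name: 'Research', score: 0.42 }
}
```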

Slide 15: Task: Extra Utility
How can the classification of web pages to be bookmarked be improved?
– What particular interests do you have, and how can they be used to improve classification? E.g., synonym detection, automatic reorganisation of bookmarks, ...

Slide 16: Task: User Interface
– You can use XUL to 'extend' Mozilla Firefox: http://www.xulplanet.com/tutorials/xultu/
– Use Ian Bugeja's HyperBK as a framework (with due referencing and acknowledgement, of course): https://addons.mozilla.org/firefox/2539/
– Programs are likely to be JavaScript
– Your extension will then be portable
(A sketch of a replacement command handler follows below.)
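A rough sketch of what the replacement 'Bookmark this Page' command handler might look like inside such an extension, reusing pageTopic and recommendCategory from the earlier sketches. All the chrome URLs, ids, and names here are illustrative assumptions, not taken from HyperBK:

```javascript
// Illustrative command handler, assumed to be wired in (via a XUL
// overlay) in place of the standard 'Bookmark This Page' command.
var BkClassify = {
  categories: {}, // name -> representation vector, built off-line

  onBookmarkCommand: function () {
    const doc = window.content.document;
    const topic = pageTopic(doc.body.textContent, doc.referrer);
    const rec = recommendCategory(topic, this.categories);
    // Dialog offering the recommendation, the last-used category,
    // and the option to pick or create another category.
    window.openDialog('chrome://bkclassify/content/choose.xul',
                      'bkclassify-choose', 'modal,centerscreen',
                      { url: doc.location.href, recommended: rec });
  }
};
```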

Slide 17: Task: User Interface
You can use Ian's interface, but it may need some work to tweak it:
– To support some of the new functionality that you are adding (e.g. a choice of algorithms)
– And to fix some of the usability problems with the dialog box

Slide 18: Task: Evaluation
ACofBWP will be evaluated! You must build a version of the program that:
– can be called in batch mode;
– accepts a directory containing bookmark files and a URL file;
– runs in two modes (classify and reconstruct);
– reports faithfully on its performance.
(A sketch of such a batch entry point follows below.)
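A minimal sketch of that batch entry point, written in Node.js style purely for illustration (the slides do not prescribe a runtime for the off-line tool); parseUrlLine is the parser sketched earlier, and classifyEntry is a stub standing in for the real classifier:

```javascript
// Hypothetical batch driver: reads a directory, parses the URL file,
// and reports precision and average classification time per page.
const fs = require('fs');
const path = require('path');

function classifyEntry(entry, mode) {
  // Placeholder: plug in the real classify/reconstruct logic here.
  return entry.homeCategory; // trivially 'correct' in this sketch
}

function main(dir, mode) {
  const files = fs.readdirSync(dir);
  const urlFile = files.find((f) => f.endsWith('.txt'));
  const lines = fs.readFileSync(path.join(dir, urlFile), 'utf8')
    .split('\n').filter(Boolean).map(parseUrlLine);

  let correct = 0;
  const start = Date.now();
  for (const entry of lines) {
    if (classifyEntry(entry, mode) === entry.homeCategory) correct += 1;
  }
  // Faithful performance report, as the slide requires.
  console.log('precision:', (correct / lines.length).toFixed(3));
  console.log('avg s/page:', ((Date.now() - start) / 1000 / lines.length).toFixed(3));
}

main(process.argv[2], process.argv[3] || 'classify');
```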

Slide 19: Task: Write Up Report
At least one tutorial will be dedicated to good report-writing practice: how to write a literature review, how to build and write references, and how to present evaluation results.

Slide 20: Grading Structure
– 10% for obtaining an average precision of at least 0.8 in the evaluation (for random bookmark classification, using either implemented approach)
– 10% for incurring at most a 2-second overhead on average to classify a page (the time overhead must be reported faithfully)
– Max. 10% for the extra utility
– 40% Report
– 15% Presentation
– 15% Artifact design/implementation

Slide 21: Future Opportunities
– FYP supervision
– Opportunity to co-author a research paper to be submitted to a leading IR/AH/UM conference (irrespective of the FYP)

Slide 22: Pitfalls
Utilities must be lightweight:
– Especially those that are interactive, or that are invoked while the user is browsing
Should all of a document be used to contribute to a category representation, or be used in a similarity measure?

Slide 23: Schedule
– Until w.c. 6th March inclusive: discussion, talks once a week
– w.c. 19th March: submit TOC/chapter overview for feedback (optional)
– w.c. 23rd April: Demo 1 (optional)
– 23rd April to 7th May: submit one chapter of your choice for feedback (optional)
– w.c. 7th May: Demo 2 (optional)
– 14th May: evaluation collection will be made available
– 25th May: submit APT report
– June: demo and evaluation under exam conditions

