Artface (Automated reorganization to fit approximate client expectations) Mike Venzke 9/19/2018
Artface Goals
- Provide a method for determining the approximate expectation of a web client
- Examine feasibility of using this information in an automated manner
Description
- Using Open Directory categories, create a model for classifying web pages.
- Fetch, parse, and classify the referring page of local web hits.
- The result: the approximate expectations people have when they arrive at different parts of your website.
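The first step of the pipeline above, pulling referring pages out of the local web logs, could be sketched as follows. This is a minimal sketch, not the author's code: it assumes the logs are in the common "combined" log format (the slides do not say which format was used), and the names `LOG_LINE` and `external_referrers` are hypothetical.

```python
import re
from urllib.parse import urlparse

# ASSUMPTION: combined log format, i.e.
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)"'
)


def external_referrers(lines, local_host):
    """Yield (referrer_url, local_path) for hits that arrived from another site."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        referrer = match.group("referrer")
        if referrer in ("", "-"):
            continue  # no referrer recorded for this hit
        ref_host = urlparse(referrer).netloc.lower()
        if not ref_host or ref_host == local_host.lower():
            continue  # internal navigation says nothing about outside expectations
        yield referrer, match.group("path")
```

Each yielded referrer URL would then be fetched, parsed, and classified, and the resulting category recorded against the local path that was requested.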
Classification Categories
- Used DMOZ categories
  - Already classified web pages; provides good training data.
- Went 3 levels deep in directory
  - Wanted to get approximate expectation, not so specific that very similar items are considered different.
  - Time and constraints
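Cutting a directory path down to three levels might look like the sketch below. It assumes categories are slash-separated paths and that a leading `Top` root does not count as one of the three levels; both the function name `truncate_category` and that counting rule are assumptions, not something the slides specify.

```python
def truncate_category(path, depth=3, root="Top"):
    """Return the first `depth` levels of a slash-separated category path.

    ASSUMPTION: a leading root component (default "Top") is dropped and
    is not counted as a level.
    """
    parts = [part for part in path.split("/") if part]
    if parts and parts[0] == root:
        parts = parts[1:]
    return "/".join(parts[:depth])
```

With this rule, two pages filed under different fourth-level subcategories of the same third-level category end up with the same label, which is the intended coarsening.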
Page Fetching
- Used Python's SGMLParser (from the sgmllib module)
  - Good at parsing out irrelevant data
  - Fast enough
  - Easy to use
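The parsing step can be illustrated with a small text extractor. Note the swap: `sgmllib` was removed in Python 3, so this sketch uses the standard library's `html.parser.HTMLParser` instead of SGMLParser. The class name `VisibleTextParser` and the choice to drop only `script` and `style` content are assumptions about what counts as "irrelevant data".

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect page text while skipping script and style content."""

    SKIP_TAGS = {"script", "style"}  # ASSUMPTION: what is treated as irrelevant

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the classifier sees clean tokens.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Fetching is left out on purpose; the string passed to `extract_text` would come from whatever HTTP client retrieves the referring page.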
Classification
- Rainbow – LGPL’d Naïve Bayesian text classifier
- Used ~9,000 documents as training data, with expanded category as classification.
- ~7,000 test pages taken from web logs of www.cs.rpi.edu and www.linenplace.com
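To show the kind of model involved, here is a minimal multinomial naive Bayes text classifier with add-one smoothing. It is an illustration of the general technique only; it is not Rainbow and makes no claim about how Rainbow tokenizes, smooths, or scores. The class and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts, one class per category."""

    def __init__(self):
        self.doc_counts = Counter()              # category -> number of documents
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.vocabulary = set()

    def train(self, category, text):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

In the setup described above, `train` would be called once per DMOZ page with its expanded category as the label, and `classify` once per fetched referring page.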
Data Results
- Fairly accurate results
- http://webgraph.canbelearned.com
Automation Possibilities
- Determine ‘good’ categories by self-site classification or user input
  - Track traffic from ‘good’ categories and provide higher-level links to local pages.
- Set of bad categories is small and generally universal.
  - Take action against local sites based on how they’re being used, not what they have.
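One way the traffic tracking could work is to tally which local pages visitors from each 'good' category land on, then surface the most common ones as higher-level links. The sketch below assumes each hit has already been reduced to a (referrer category, local path) pair; the function name, the `limit` parameter, and the category labels in any example are hypothetical.

```python
from collections import Counter, defaultdict


def top_pages_by_category(hits, good_categories, limit=3):
    """Map each good referrer category to its most-visited local paths.

    `hits` is an iterable of (category, local_path) pairs; hits from
    categories outside `good_categories` are ignored.
    """
    tallies = defaultdict(Counter)
    for category, local_path in hits:
        if category in good_categories:
            tallies[category][local_path] += 1
    return {
        category: [path for path, _ in counts.most_common(limit)]
        for category, counts in tallies.items()
    }
```

The same tally, run over the small set of 'bad' categories instead, would show which local pages are attracting unwanted traffic.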
Automation Possibilities (cont'd)
- Provide custom pages based on what the user expected, rather than what the page contains.
  - May not have found what they wanted.
  - May be interested in a broader topic.
Process Enhancement Ideas
- More training data
  - Use all levels of DMOZ data, but push classification up to threshold level.
- Handle more page errors
  - Scripting, authentication errors provide false data.
- Remove or special-parse ‘classless’ information pages
  - Search engines
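One possible reading of "push classification up to threshold level" is to classify against full-depth categories and then fold each deep category's score into its ancestor at the threshold depth. The sketch below takes that reading as an assumption; it also assumes the classifier output is a mapping from slash-separated category paths (without a leading root) to probabilities, and the name `roll_up` is hypothetical.

```python
from collections import defaultdict


def roll_up(probabilities, depth=3):
    """Sum per-category probabilities into their ancestors at `depth` levels.

    ASSUMPTION: categories are slash-separated paths with no root prefix.
    """
    rolled = defaultdict(float)
    for category, probability in probabilities.items():
        ancestor = "/".join(category.split("/")[:depth])
        rolled[ancestor] += probability
    return dict(rolled)
```

Training on every level gives the classifier more documents to learn from, while the roll-up keeps the reported label at the same coarse granularity as before.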